Subsetting a dataframe based on another dataframe in R

Question

df:

  y x
  F T
  F F
  T T
  T F

df1:

 y z probs.x x probs.y   new
 F F    0.08 T     0.4 0.032
 F F    0.24 F     0.4 0.096
 F T    0.12 T     0.6 0.072
 F T    0.36 F     0.6 0.216
 T F    0.40 T     0.5 0.200
 T F    0.20 F     0.5 0.100
 T T    0.40 T     0.5 0.200
 T T    0.20 F     0.5 0.100

df and df1 are two data frames. For each row of df, I want to select the matching rows in df1, sum the values in the column “new”, and store the output in a new data frame like this:

df_res:

  y x new
  F T .104
  F F .312
  T T .4
  T F .2

Kindly help me out! I have been toiling over this for a long time now. The table headers will change according to the variables, so please do not hard-code the table headers.

Thanks.

What are the criteria for matching df and df1? Both y and x matching, and then summing them together? So, no matter what, the final result will have only 4 rows?
– Ananta, Commented Feb 22, 2014 at 3:39

Jd Baba · Accepted Answer · 2014-02-22 03:48:23Z

2

I don't know how large your data is, but here is one approach.

df<- read.table(text="y x
F T
F F
T T
T F",header=T,sep="")

df1 <- read.table(text="y z probs.x x probs.y   new
 F F    0.08 T     0.4 0.032
 F F    0.24 F     0.4 0.096
 F T    0.12 T     0.6 0.072
 F T    0.36 F     0.6 0.216
 T F    0.40 T     0.5 0.200
 T F    0.20 F     0.5 0.100
 T T    0.40 T     0.5 0.200
 T T    0.20 F     0.5 0.100", header=T, sep="")

df$yx <- paste0(df$y,df$x)
df1$yx <- paste0(df1$y, df1$x)

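# Fill in the sums with a for loop, one per row of df

new <- numeric(nrow(df))  # initialise the result vector before indexing into it
for (i in seq_len(nrow(df))) {
  new[i] <- sum(df1$new[df1$yx == df$yx[i]])
}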

df$new <- new
df
      y     x         yx   new
1 FALSE  TRUE  FALSETRUE 0.104
2 FALSE FALSE FALSEFALSE 0.312
3  TRUE  TRUE   TRUETRUE 0.400
4  TRUE FALSE  TRUEFALSE 0.200

Using sapply:

new <- sapply(seq_len(nrow(df)), function(i) sum(df1$new[df1$yx == df$yx[i]]))
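Since the question asks not to hard-code the headers, the same key-matching idea can be written against column names rather than positions. This is only a sketch: it assumes the matching columns are exactly the ones df and df1 have in common, and that the column to sum is called new. The names key_cols and make_key are made up for the illustration.

```r
df <- read.table(text = "y x
F T
F F
T T
T F", header = TRUE)

df1 <- read.table(text = "y z probs.x x probs.y   new
 F F    0.08 T     0.4 0.032
 F F    0.24 F     0.4 0.096
 F T    0.12 T     0.6 0.072
 F T    0.36 F     0.6 0.216
 T F    0.40 T     0.5 0.200
 T F    0.20 F     0.5 0.100
 T T    0.40 T     0.5 0.200
 T T    0.20 F     0.5 0.100", header = TRUE)

# Assumption: the matching columns are the ones both data frames share
key_cols <- intersect(names(df), names(df1))

# Build one key string per row from the shared columns (hypothetical helper)
make_key <- function(d) do.call(paste, c(unname(as.list(d[key_cols])), sep = "|"))

key_df  <- make_key(df)
key_df1 <- make_key(df1)

# For each row of df, sum `new` over the rows of df1 that have the same key
df_res <- df
df_res$new <- vapply(key_df,
                     function(k) sum(df1$new[key_df1 == k]),
                     numeric(1), USE.NAMES = FALSE)
df_res
```

Pasting with a separator such as "|" makes it less likely that two different combinations collapse into the same key string, which can happen with paste0 when the values are arbitrary text.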

edited Feb 22, 2014 at 3:48

answered Feb 22, 2014 at 3:43


Ananta · Accepted Answer · 2014-02-22 03:52:46Z

1

If all you want is the sum of new for each combination of y and x, this works. Otherwise, please describe the matching criteria more clearly.

text=" y z probs.x x probs.y   new
 F F    0.08 T     0.4 0.032
 F F    0.24 F     0.4 0.096
 F T    0.12 T     0.6 0.072
 F T    0.36 F     0.6 0.216
 T F    0.40 T     0.5 0.200
 T F    0.20 F     0.5 0.100
 T T    0.40 T     0.5 0.200
 T T    0.20 F     0.5 0.100"

df <- read.table(text = text, header = TRUE)
df_res <- aggregate(new ~ interaction(y, x), data = df, FUN = sum)
df_res

  interaction(y, x)   new
1       FALSE.FALSE 0.312
2        TRUE.FALSE 0.200
3        FALSE.TRUE 0.104
4         TRUE.TRUE 0.400
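If separate y and x columns are preferred over a single interaction column, the grouping formula can be built from a vector of column names, which also avoids typing the headers into the formula. A minimal sketch, assuming the grouping columns are known as a character vector (key_cols is a made-up name; in practice it could come from names() of the other data frame) and the summed column is called new:

```r
df1 <- read.table(text = "y z probs.x x probs.y   new
 F F    0.08 T     0.4 0.032
 F F    0.24 F     0.4 0.096
 F T    0.12 T     0.6 0.072
 F T    0.36 F     0.6 0.216
 T F    0.40 T     0.5 0.200
 T F    0.20 F     0.5 0.100
 T T    0.40 T     0.5 0.200
 T T    0.20 F     0.5 0.100", header = TRUE)

# Hypothetical: the grouping columns, supplied as strings
key_cols <- c("y", "x")

# reformulate() builds the formula new ~ y + x from the strings
f <- reformulate(key_cols, response = "new")

# One row per observed combination of the grouping columns
df_res <- aggregate(f, data = df1, FUN = sum)
df_res
```

The rows come back in aggregate's own order, not in the order of the first data frame, so a merge or match step is still needed if that order matters.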

answered Feb 22, 2014 at 3:52


cbare · Accepted Answer · 2014-02-22 06:02:32Z

Here's an answer using merge and plyr.

Read in your example data.frame:

df1 <- read.table(text="y z probs.x x probs.y   new
 F F    0.08 T     0.4 0.032
 F F    0.24 F     0.4 0.096
 F T    0.12 T     0.6 0.072
 F T    0.36 F     0.6 0.216
 T F    0.40 T     0.5 0.200
 T F    0.20 F     0.5 0.100
 T T    0.40 T     0.5 0.200
 T T    0.20 F     0.5 0.100", header=T, sep="")

If I understand correctly, there are two steps to what you're asking. The first is to select the rows in df1 that match patterns in df. That can be done with merge. The df you gave has all combinations of TRUE and FALSE for x and y. Let's leave one out so we can see the effect:

df <- read.table(text="y x
F T
T T
T F",header=T,sep="")

df_merged <- merge(df, df1, all.y=F)

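The result is a new data.frame that omits the rows where both x and y are FALSE. Because all.x and all.y both default to FALSE, merge keeps only rows with a match in both data frames, which is equivalent to an inner join in a SQL database.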

      y     x     z probs.x probs.y   new
1 FALSE  TRUE FALSE    0.08     0.4 0.032
2 FALSE  TRUE  TRUE    0.12     0.6 0.072
3  TRUE FALSE FALSE    0.20     0.5 0.100
4  TRUE FALSE  TRUE    0.20     0.5 0.100
5  TRUE  TRUE FALSE    0.40     0.5 0.200
6  TRUE  TRUE  TRUE    0.40     0.5 0.200
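To see what the all.y argument changes, the merge can be run both ways with the join columns passed explicitly. This is a sketch under the assumption that the join columns are the ones the two data frames share. The names key_cols, inner and keep_all_df1 are made up for the illustration.

```r
df1 <- read.table(text = "y z probs.x x probs.y   new
 F F    0.08 T     0.4 0.032
 F F    0.24 F     0.4 0.096
 F T    0.12 T     0.6 0.072
 F T    0.36 F     0.6 0.216
 T F    0.40 T     0.5 0.200
 T F    0.20 F     0.5 0.100
 T T    0.40 T     0.5 0.200
 T T    0.20 F     0.5 0.100", header = TRUE)

df <- read.table(text = "y x
F T
T T
T F", header = TRUE)

# Assumption: join on the columns present in both data frames
key_cols <- intersect(names(df), names(df1))

# Inner join: only df1 rows whose (y, x) combination appears in df
inner <- merge(df, df1, by = key_cols)
nrow(inner)         # 6

# all.y = TRUE also keeps df1 rows with no match in df (the FALSE/FALSE rows)
keep_all_df1 <- merge(df, df1, by = key_cols, all.y = TRUE)
nrow(keep_all_df1)  # 8
```

Passing by explicitly gives the same result here as leaving it out, since merge joins on the shared column names by default, but it makes the matching columns visible in the code.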

The second part of the question is to group the data and apply a sum to each group. The plyr package is a great tool for this kind of data manipulation:

library(plyr)
ddply(df_merged, .(y,x), function(df) c(new=sum(df$new)))

The dd in ddply means we are giving a data.frame and want a data.frame as a result. The next argument, .(y,x), is a quoted expression that names the variables we're grouping by. The result is this:

      y     x   new
1 FALSE  TRUE 0.104
2  TRUE FALSE 0.200
3  TRUE  TRUE 0.400
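The two steps can also be wrapped in one base R function that takes the column names from the data instead of hard-coding them, and that returns the rows in the order of the first data frame. This is a sketch, not a tested general solution: sum_matching, patterns, values and value_col are made-up names, and it assumes the matching columns are the ones the two data frames share.

```r
sum_matching <- function(patterns, values, value_col = "new") {
  # Assumption: match on every column the two data frames have in common
  key_cols <- setdiff(intersect(names(patterns), names(values)), value_col)

  # Remember the original row order of `patterns`
  patterns$.row <- seq_len(nrow(patterns))

  # Step 1: inner join keeps only rows of `values` matching a row of `patterns`
  merged <- merge(patterns, values[c(key_cols, value_col)], by = key_cols)

  # Step 2: sum the value column within each pattern row
  sums <- aggregate(merged[value_col],
                    by = merged[c(".row", key_cols)],
                    FUN = sum)

  # Restore the order of `patterns` and drop the helper column
  sums <- sums[order(sums$.row), c(key_cols, value_col)]
  rownames(sums) <- NULL
  sums
}

df <- read.table(text = "y x
F T
F F
T T
T F", header = TRUE)

df1 <- read.table(text = "y z probs.x x probs.y   new
 F F    0.08 T     0.4 0.032
 F F    0.24 F     0.4 0.096
 F T    0.12 T     0.6 0.072
 F T    0.36 F     0.6 0.216
 T F    0.40 T     0.5 0.200
 T F    0.20 F     0.5 0.100
 T T    0.40 T     0.5 0.200
 T T    0.20 F     0.5 0.100", header = TRUE)

df_res <- sum_matching(df, df1)
df_res
```

Rows of the first data frame that have no match in the second are dropped, as in the inner join above. Passing all.x = TRUE to merge would keep them, with NA in the summed column.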
