How to subset a dataframe based on columns from another dataframe?

Question

I have two data frames (df1 and df2) and I want to subset df2 based on the first two columns contained in df1. For example,

df1 = data.frame(x=c(1,1,1,1,1),y=c(1,2,3,4,5),value=c(3,4,5,6,7))
df2 = data.frame(x=c(1,1,1,1,1,2), y=c(5,3,4,2,1,6), value=c(8,9,10,11,12,13))

Row 6 of df2 (x = 2, y = 6) has no match in df1, so only rows 1 to 5 of df2 should be kept.

Also, I want to rearrange df2 based on df1. The final result should be like this:

Thanks for any help.

One possible solution is df1 %>% select(x,y) %>% inner_join(df2, by=c("x","y"))
– AntoniosK, commented May 25, 2018 at 23:44
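
Spelled out, the suggestion in this comment might look like the sketch below. It assumes the dplyr package is installed and reuses the df1 and df2 from the question; `result` is just an illustrative name. As far as I know, inner_join() keeps the row order of its first argument, so the output follows df1.

```r
library(dplyr)

df1 <- data.frame(x = c(1, 1, 1, 1, 1), y = c(1, 2, 3, 4, 5), value = c(3, 4, 5, 6, 7))
df2 <- data.frame(x = c(1, 1, 1, 1, 1, 2), y = c(5, 3, 4, 2, 1, 6), value = c(8, 9, 10, 11, 12, 13))

# Keep only the key columns of df1, then pull in the matching rows of df2.
# inner_join() drops df2 rows with no match in df1 (here: x = 2, y = 6).
result <- df1 %>%
  select(x, y) %>%
  inner_join(df2, by = c("x", "y"))

result
```

Dropping df1's value column before the join avoids ending up with value.x / value.y columns in the output.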

IceCreamToucan · Accepted Answer · 2018-05-26 19:15:39Z

When using merge, by default the data frames are joined by the variables they have in common, and the results are sorted. So you can do:

merge(df2, df1[c('x', 'y')])

#   x y value
# 1 1 1    12
# 2 1 2    11
# 3 1 3     9
# 4 1 4    10
# 5 1 5     8

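To keep the order of df1 instead, use @Mankind_008's method and disable sorting (F is shorthand for FALSE). Note that ?merge describes the row order under sort = FALSE as unspecified, so this is typical behaviour rather than a documented guarantee: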

merge(df1[c('x','y')], df2 , sort = F)

Example:

set.seed(0)
df1 <- df1[sample(seq_len(nrow(df1))),]
df2 <- df2[sample(seq_len(nrow(df2))),]
df1
#   x y value
# 5 1 5     7
# 2 1 2     4
# 4 1 4     6
# 3 1 3     5
# 1 1 1     3
merge(df1[c('x','y')], df2 , sort = F)
#   x y value
# 1 1 5     8
# 2 1 2    11
# 3 1 4    10
# 4 1 3     9
# 5 1 1    12
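
If df1's order has to be guaranteed rather than relied on, one option is to record each row's position before merging and sort on it afterwards. This is only a sketch in base R: the helper column `.pos` and the names `keys` and `out` are made up for illustration, and df1 is typed in directly in the scrambled order shown above.

```r
# df1 in the scrambled order from the example above
df1 <- data.frame(x = c(1, 1, 1, 1, 1), y = c(5, 2, 4, 3, 1), value = c(7, 4, 6, 5, 3))
df2 <- data.frame(x = c(1, 1, 1, 1, 1, 2), y = c(5, 3, 4, 2, 1, 6), value = c(8, 9, 10, 11, 12, 13))

keys <- df1[c("x", "y")]
keys$.pos <- seq_len(nrow(keys))                    # remember df1's row order (hypothetical helper column)

out <- merge(keys, df2, by = c("x", "y"))           # rows of df2 that have a match in df1
out <- out[order(out$.pos), c("x", "y", "value")]   # restore df1's order, drop the helper
rownames(out) <- NULL
out
```

This costs one extra column and one sort, but it does not depend on how merge happens to order its output.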

edited May 26, 2018 at 19:15

answered May 26, 2018 at 0:05


4 Comments

Mankind_2000 Over a year ago

To keep the order of df1, should be: merge(df1[c('x','y')], df2 , sort = False)

Mankind_2000 Over a year ago

Yes, the values will be the same, but the order will not. That is because df1 is already sorted in the given dummy case; in the general case, where df1 is scrambled, the result will not keep df1's order if sort = TRUE.

Yang Yang Over a year ago

Thanks for the help. Just one thing to mention: sort=FALSE is the right spelling (not False).

Mankind_2000 Over a year ago

@Yang Yang You are welcome, and you are right. Ryan used the shorthand 'F' in the answer, so no worries.

Carlos Eduardo Lagosta · Accepted Answer · 2018-05-26 00:57:47Z

Use data tables:

library(data.table)

Create your data as data.table:

df1 <- data.table( x = c(1,1,1,1,1), y = c(1,2,3,4,5), value = c(3,4,5,6,7) )
df2 <- data.table( x = c(1,1,1,1,1,2), y = c(5,3,4,2,1,6), value = c(8,9,10,11,12,13) )

Or convert your existing data.frames:

df1 <- as.data.table( df1 )
df2 <- as.data.table( df2 )

Then:

df2[ df1, on = .(x,y) ]

Any column in df1 that has the same name as a column in df2 (other than the join columns) will be renamed to i.columnname:

   x y value i.value
1: 1 1    12       3
2: 1 2    11       4
3: 1 3     9       5
4: 1 4    10       6
5: 1 5     8       7

Note that the result follows the row order of df1, which here happens to be sorted by x and y. If you want to order by the column 'value' (or any other):

df2[ df1, on = .(x,y) ][ order(value) ]

The advantage of using data.table (or dplyr, as in the solution proposed by AntoniosK) is that you can keep the two data sets separate.
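
If only df2's own columns are wanted (no i.value), one possible variation is to join on just the key columns of df1. A minimal sketch, assuming a data.table version that accepts nomatch = NULL (older versions use nomatch = 0 instead); `res` is an illustrative name:

```r
library(data.table)

df1 <- data.table(x = c(1, 1, 1, 1, 1), y = c(1, 2, 3, 4, 5), value = c(3, 4, 5, 6, 7))
df2 <- data.table(x = c(1, 1, 1, 1, 1, 2), y = c(5, 3, 4, 2, 1, 6), value = c(8, 9, 10, 11, 12, 13))

# Join on the key columns of df1 only, so no i.value column is created.
# nomatch = NULL drops df1 rows without a partner in df2 (an inner join).
res <- df2[df1[, .(x, y)], on = .(x, y), nomatch = NULL]
res
```

The rows come back in df1's order, with the value column taken from df2.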
