R Populate column based on matching rows values in two different data frames

Question

I have two data frames, 'df1' and 'df2', with six matching column names. I want to check each row of df1 against df2: if an identical row exists in df2, enter a 1 in the 'detect' column of df1, and if not, enter a 0. Currently all values of 'detect' in df1 are 0, but I want them to change to 1 wherever there is an exact match between the two data frames. It would look like this:

df1

site	ddate	ssegment	spp	vtype	tperiod
BMA	6/1/2021	1	AMRO	Song	1
BMC	6/15/2021	1	WISN	Drum	1
BMA	6/15/2021	1	NOFL	Song	2
BMC	6/29/2021	2	AMRO	Call	1
BMA	6/29/2021	2	WISN	Call	2

df2

site	ddate	ssegment	spp	vtype	tperiod
BMA	6/1/2021	1	AMRO	Call	1
BMC	6/15/2021	1	WISN	Drum	1
BMA	6/15/2021	1	NOFL	Song	2
BMC	6/29/2021	2	AMRO	Drum	1
BMA	6/29/2021	2	WISN	Call	2

After scanning these, df1 would now look like:

df1

site	ddate	ssegment	spp	vtype	tperiod	detect
BMA	6/1/2021	1	AMRO	Song	1	0
BMC	6/15/2021	1	WISN	Drum	1	1
BMA	6/15/2021	1	NOFL	Song	2	1
BMC	6/29/2021	2	AMRO	Call	1	0
BMA	6/29/2021	2	WISN	Call	2	1

I was thinking that the base R function 'merge' might be useful, but I can't quite figure it out. Thank you for your help!

Gregor Thomas · Accepted Answer · 2021-12-08 16:45:23Z


Start with the detect column only in df2, then merge:

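# Drop the all-zero detect column from df1 so it is not used as a join key
df1$detect = NULL
# Flag every row of df2 as a detection
df2$detect = 1
# Left join on the six shared columns; unique() stops duplicate df2 rows from adding rows
result = merge(df1, unique(df2), all.x = TRUE)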

This will create the detect column as 1s when there are exact matches and NAs when there are not. If you want, you can change the NAs to 0s.
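As a minimal sketch of that last step, the NAs can be replaced with logical indexing. The data frames below are rebuilt from the tables in the question so the example runs on its own; `result` is the name used in the answer's code.

```r
# Example data copied from the question's tables
df1 <- data.frame(
  site     = c("BMA", "BMC", "BMA", "BMC", "BMA"),
  ddate    = c("6/1/2021", "6/15/2021", "6/15/2021", "6/29/2021", "6/29/2021"),
  ssegment = c(1L, 1L, 1L, 2L, 2L),
  spp      = c("AMRO", "WISN", "NOFL", "AMRO", "WISN"),
  vtype    = c("Song", "Drum", "Song", "Call", "Call"),
  tperiod  = c(1L, 1L, 2L, 1L, 2L),
  detect   = 0
)
df2 <- df1[, c("site", "ddate", "ssegment", "spp", "vtype", "tperiod")]
df2$vtype <- c("Call", "Drum", "Song", "Drum", "Call")

# Same steps as in the answer
df1$detect <- NULL
df2$detect <- 1
result <- merge(df1, unique(df2), all.x = TRUE)

# Rows of df1 without a match in df2 come back as NA; set them to 0
result$detect[is.na(result$detect)] <- 0
result
```

Note that merge() sorts the result by the join columns by default, so the rows may not come back in df1's original order.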

The same method can work with dplyr:

library(dplyr)
df1 %>%
  select(-detect) %>%
  left_join(
    df2 %>% mutate(detect = 1) %>% unique()
  )

edited Dec 8, 2021 at 16:45

answered Dec 8, 2021 at 16:02


3 Comments

Jacob Over a year ago

This answer seems to work, however the math doesn't add up. Essentially, my actual df1 has 38880 rows and df2 has 5854 rows. 'result' should have 38880 rows (same as df1) because all I want is for the 'detect' column data to change to a 1 for the 5854 rows of df1 that match df2 exactly. I know there is a matching row in df1 for each row in df2. Your result leaves me with 42702 rows in 'result'. Any ideas what might be going on?

Gregor Thomas Over a year ago

That means you have some rows with multiple matches. Deduplicating df2 first should fix that. I'll edit to use unique(df2) in the merge().

Jacob Over a year ago

This seems to be correct. Thank you for your efforts on this!
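The row-count growth discussed in these comments can be reproduced on a small scale. The sketch below uses made-up two-column data frames (not the asker's data) in which df2 contains one repeated row, and compares the merge with and without unique().

```r
# Hypothetical toy data: df2 contains the same row twice
df1 <- data.frame(site = c("BMA", "BMC"), spp = c("AMRO", "WISN"))
df2 <- data.frame(site = c("BMA", "BMA"), spp = c("AMRO", "AMRO"), detect = 1)

# Number of rows in df2 that repeat an earlier row
n_dups <- sum(duplicated(df2))

# Without deduplication, the matching df1 row appears once per duplicate
with_dups <- merge(df1, df2, all.x = TRUE)

# With unique(), each df1 row appears exactly once
deduped <- merge(df1, unique(df2), all.x = TRUE)

c(n_dups = n_dups, with_dups = nrow(with_dups), deduped = nrow(deduped))
```

Running sum(duplicated(df2)) on the real data is a quick way to check whether duplicates explain a larger-than-expected result.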

danlooo · Accepted Answer · 2021-12-08 16:07:01Z

dplyr provides anti_join and semi_join, two filtering joins that keep the rows of one table without, or with, a match in the other:

library(tidyverse)

df1 <- tribble(
  ~site,      ~ddate, ~ssegment,   ~spp, ~vtype, ~tperiod, ~detect,
  "BMA",  "6/1/2021",        1L, "AMRO", "Song",       1L,      0L,
  "BMC", "6/15/2021",        1L, "WISN", "Drum",       1L,      0L,
  "BMA", "6/15/2021",        1L, "NOFL", "Song",       2L,      0L,
  "BMC", "6/29/2021",        2L, "AMRO", "Call",       1L,      0L,
  "BMA", "6/29/2021",        2L, "WISN", "Call",       2L,      0L
  )

df2 <- tibble::tribble(
  ~site,      ~ddate, ~ssegment,   ~spp, ~vtype, ~tperiod,
  "BMA",  "6/1/2021",        1L, "AMRO", "Call",       1L,
  "BMC", "6/15/2021",        1L, "WISN", "Drum",       1L,
  "BMA", "6/15/2021",        1L, "NOFL", "Song",       2L,
  "BMC", "6/29/2021",        2L, "AMRO", "Drum",       1L,
  "BMA", "6/29/2021",        2L, "WISN", "Call",       2L
  )


bind_rows(
  df1 %>% select(-detect) %>% anti_join(df2) %>% mutate(detect = 0),
  df1 %>% select(-detect) %>% semi_join(df2) %>% mutate(detect = 1)
)
#> Joining, by = c("site", "ddate", "ssegment", "spp", "vtype", "tperiod")
#> Joining, by = c("site", "ddate", "ssegment", "spp", "vtype", "tperiod")
#> # A tibble: 5 x 7
#>   site  ddate     ssegment spp   vtype tperiod detect
#>   <chr> <chr>        <int> <chr> <chr>   <int>  <dbl>
#> 1 BMA   6/1/2021         1 AMRO  Song        1      0
#> 2 BMC   6/29/2021        2 AMRO  Call        1      0
#> 3 BMC   6/15/2021        1 WISN  Drum        1      1
#> 4 BMA   6/15/2021        1 NOFL  Song        2      1
#> 5 BMA   6/29/2021        2 WISN  Call        2      1

Created on 2021-12-08 by the reprex package (v2.0.1)
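In the output above, the non-matching rows are listed first, so df1's original row order is not kept. If the order matters, one possible alternative (a swapped-in base R technique, not part of this answer) is to build a composite key per row and test membership with %in%. The helper name `make_key` and the separator character are assumptions made for this sketch.

```r
# Example data copied from the question's tables
df1 <- data.frame(
  site     = c("BMA", "BMC", "BMA", "BMC", "BMA"),
  ddate    = c("6/1/2021", "6/15/2021", "6/15/2021", "6/29/2021", "6/29/2021"),
  ssegment = c(1L, 1L, 1L, 2L, 2L),
  spp      = c("AMRO", "WISN", "NOFL", "AMRO", "WISN"),
  vtype    = c("Song", "Drum", "Song", "Call", "Call"),
  tperiod  = c(1L, 1L, 2L, 1L, 2L)
)
df2 <- df1
df2$vtype <- c("Call", "Drum", "Song", "Drum", "Call")

keys <- c("site", "ddate", "ssegment", "spp", "vtype", "tperiod")

# Hypothetical helper: paste the key columns of each row into one string.
# The separator is assumed not to occur in the data.
make_key <- function(df) {
  do.call(paste, c(unname(as.list(df[keys])), sep = "\u001f"))
}

# 1 where the row's key also occurs in df2, 0 otherwise; row order is unchanged
df1$detect <- as.integer(make_key(df1) %in% make_key(df2))
df1
```

Because %in% only tests membership, duplicate rows in df2 do not add rows to df1 here.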

This answer seems to work just fine, just like the one below. I don't know enough about R to suggest one over the other. Thank you for your time!

lovalery · Accepted Answer · 2021-12-08 17:23:16Z


Here is one possible and very simple solution using the data.table package:

Reprex

Code

library(data.table)

setDT(df1)
setDT(df2)

df1[df2, on = .(site, ddate, ssegment, spp, vtype, tperiod), detect := TRUE][]

Output


#>    site     ddate ssegment  spp vtype tperiod detect
#> 1:  BMA  6/1/2021        1 AMRO  Song       1      0
#> 2:  BMC 6/15/2021        1 WISN  Drum       1      1
#> 3:  BMA 6/15/2021        1 NOFL  Song       2      1
#> 4:  BMC 6/29/2021        2 AMRO  Call       1      0
#> 5:  BMA 6/29/2021        2 WISN  Call       2      1

Created on 2021-12-08 by the reprex package (v2.0.1)

edited Dec 8, 2021 at 17:23

answered Dec 8, 2021 at 17:16


5 Comments

Jacob Over a year ago

You forgot to add 'spp' to your code, so I added it after 'ssegment' and ran it, but received this error: Error in [.data.frame(df1, df2, on = .(site, ddate, ssegment, : unused argument (on = .(site, ddate, ssegment, spp, vtype, tperiod))

lovalery Over a year ago

Sorry for the mistake. I added the missing variable to my code and it still works. Please let me know.

lovalery Over a year ago

I guess your problem is that df1 and df2 are data frames, so you need to convert them to data.tables first with setDT(df1) and setDT(df2). I will add that to my answer. Please let me know.

Jacob Over a year ago

This solution works as well. Thank you for the effort!

lovalery Over a year ago

Thanks for your feedback. I wish you the best in your work. Cheers
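The "unused argument (on = ...)" error in this thread appears when the `on =` syntax is used on a plain data.frame. A minimal sketch of the fix follows, on made-up two-column data (not the asker's), assuming the data.table package is installed. Unlike the answer, it creates the detect column explicitly before the update join so that unmatched rows are 0 rather than NA.

```r
library(data.table)

# Hypothetical toy data; df2 deliberately contains a duplicate row
df1 <- data.frame(site = c("BMA", "BMC", "BMA"), spp = c("AMRO", "WISN", "NOFL"))
df2 <- data.frame(site = c("BMC", "BMA", "BMA"), spp = c("WISN", "NOFL", "NOFL"))

# Convert both data frames to data.tables by reference (no copy is made)
setDT(df1)
setDT(df2)

# Start with 0 everywhere, then set 1 for rows of df1 that have a match in df2
df1[, detect := 0L]
df1[df2, on = .(site, spp), detect := 1L]

df1[]
```

Because the join only updates df1 in place, duplicate rows in df2 do not change the number of rows in df1.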
