I have two data frames, 'df1' and 'df2', that share six column names. For each row of df1, I want to check whether an identical row (across those six columns) exists in df2. If it does, the 'detect' column of df1 should be set to 1; if not, it should stay 0. Currently all values of 'detect' in df1 are 0. The data look like this:

df1

site ddate ssegment spp vtype tperiod detect
BMA 6/1/2021 1 AMRO Song 1 0
BMC 6/15/2021 1 WISN Drum 1 0
BMA 6/15/2021 1 NOFL Song 2 0
BMC 6/29/2021 2 AMRO Call 1 0
BMA 6/29/2021 2 WISN Call 2 0

df2

site ddate ssegment spp vtype tperiod
BMA 6/1/2021 1 AMRO Call 1
BMC 6/15/2021 1 WISN Drum 1
BMA 6/15/2021 1 NOFL Song 2
BMC 6/29/2021 2 AMRO Drum 1
BMA 6/29/2021 2 WISN Call 2

After scanning these, df1 would now look like:

df1

site ddate ssegment spp vtype tperiod detect
BMA 6/1/2021 1 AMRO Song 1 0
BMC 6/15/2021 1 WISN Drum 1 1
BMA 6/15/2021 1 NOFL Song 2 1
BMC 6/29/2021 2 AMRO Call 1 0
BMA 6/29/2021 2 WISN Call 2 1

I was thinking that the base R function 'merge' might be useful, but I can't quite figure it out. Thank you for your help!

3 Answers

Start with the detect column only in df2, then merge:

df1$detect = NULL
df2$detect = 1
result = merge(df1, unique(df2), all.x = TRUE)

This will create the detect column as 1s when there are exact matches and NAs when there are not. If you want, you can change the NAs to 0s.
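
As a minimal sketch of that last step, the NAs can be overwritten with base R indexing. The data below are retyped from the question (with 'detect' left out of df1, since this approach drops it anyway); the name 'result' follows the code above. Note that merge() sorts on the key columns by default, so the row order may differ from df1.

```r
# Example data retyped from the question; detect is omitted from df1
# because the merge approach removes it before joining.
df1 <- data.frame(
  site     = c("BMA", "BMC", "BMA", "BMC", "BMA"),
  ddate    = c("6/1/2021", "6/15/2021", "6/15/2021", "6/29/2021", "6/29/2021"),
  ssegment = c(1, 1, 1, 2, 2),
  spp      = c("AMRO", "WISN", "NOFL", "AMRO", "WISN"),
  vtype    = c("Song", "Drum", "Song", "Call", "Call"),
  tperiod  = c(1, 1, 2, 1, 2)
)
df2 <- df1
df2$vtype[c(1, 4)] <- c("Call", "Drum")  # rows 1 and 4 differ, as in the question
df2$detect <- 1

# Left join on the six shared columns; unmatched df1 rows get detect = NA
result <- merge(df1, unique(df2), all.x = TRUE)

# Replace the NAs left by non-matching rows with 0
result$detect[is.na(result$detect)] <- 0
```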

The same method can work with dplyr:

library(dplyr)
df1 %>%
  select(-detect) %>%
  left_join(
    df2 %>% mutate(detect = 1) %>% unique()
  )

3 Comments

This answer seems to work, however the math doesn't add up. Essentially, my actual df1 has 38880 rows and df2 has 5854 rows. 'result' should have 38880 rows (same as df1) because all I want is for the 'detect' column data to change to a 1 for the 5854 rows of df1 that match df2 exactly. I know there is a matching row in df1 for each row in df2. Your result leaves me with 42702 rows in 'result'. Any ideas what might be going on?
That means you have some rows with multiple matches. Deduplicating df2 first should fix that. I'll edit to use unique(df2) in the merge().
This seems to be correct. Thank you for your efforts on this!
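
The row inflation discussed in these comments can be reproduced on a toy example. This is only a sketch with made-up two-column data (not the asker's actual data frames); it shows that a duplicated row in df2 produces an extra row in the merge result, and that unique() removes the surplus.

```r
# Hypothetical toy data: df2 lists the BMC/WISN row twice
df1 <- data.frame(site = c("BMA", "BMC"), spp = c("AMRO", "WISN"))
df2 <- data.frame(site = c("BMC", "BMC"), spp = c("WISN", "WISN"), detect = 1)

# Count exact duplicate rows in df2
n_dupes <- sum(duplicated(df2))

# Without deduplication, the matching df1 row appears once per df2 copy
inflated <- merge(df1, df2, all.x = TRUE)

# With unique(), the result has the same number of rows as df1
deduped <- merge(df1, unique(df2), all.x = TRUE)

c(duplicates = n_dupes, inflated = nrow(inflated), deduped = nrow(deduped))
```

Checking sum(duplicated(df2)) on the real data before merging is a quick way to see whether duplicates explain a row count larger than nrow(df1).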

dplyr's filtering joins, anti_join and semi_join, can split df1 into its non-matching and matching rows:

library(tidyverse)

df1 <- tribble(
  ~site,      ~ddate, ~ssegment,   ~spp, ~vtype, ~tperiod, ~detect,
  "BMA",  "6/1/2021",        1L, "AMRO", "Song",       1L,      0L,
  "BMC", "6/15/2021",        1L, "WISN", "Drum",       1L,      0L,
  "BMA", "6/15/2021",        1L, "NOFL", "Song",       2L,      0L,
  "BMC", "6/29/2021",        2L, "AMRO", "Call",       1L,      0L,
  "BMA", "6/29/2021",        2L, "WISN", "Call",       2L,      0L
  )

df2 <- tribble(
  ~site,      ~ddate, ~ssegment,   ~spp, ~vtype, ~tperiod,
  "BMA",  "6/1/2021",        1L, "AMRO", "Call",       1L,
  "BMC", "6/15/2021",        1L, "WISN", "Drum",       1L,
  "BMA", "6/15/2021",        1L, "NOFL", "Song",       2L,
  "BMC", "6/29/2021",        2L, "AMRO", "Drum",       1L,
  "BMA", "6/29/2021",        2L, "WISN", "Call",       2L
  )


bind_rows(
  df1 %>% select(-detect) %>% anti_join(df2) %>% mutate(detect = 0),
  df1 %>% select(-detect) %>% semi_join(df2) %>% mutate(detect = 1)
)
#> Joining, by = c("site", "ddate", "ssegment", "spp", "vtype", "tperiod")
#> Joining, by = c("site", "ddate", "ssegment", "spp", "vtype", "tperiod")
#> # A tibble: 5 x 7
#>   site  ddate     ssegment spp   vtype tperiod detect
#>   <chr> <chr>        <int> <chr> <chr>   <int>  <dbl>
#> 1 BMA   6/1/2021         1 AMRO  Song        1      0
#> 2 BMC   6/29/2021        2 AMRO  Call        1      0
#> 3 BMC   6/15/2021        1 WISN  Drum        1      1
#> 4 BMA   6/15/2021        1 NOFL  Song        2      1
#> 5 BMA   6/29/2021        2 WISN  Call        2      1

Created on 2021-12-08 by the reprex package (v2.0.1)

1 Comment

This answer seems to work just fine, just like the one below. I don't know enough about R to suggest one over the other. Thank you for your time!

Here is one possible, very simple solution using the data.table package:

Reprex

  • Code
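library(data.table)

# Convert both data frames to data.tables by reference
setDT(df1)
setDT(df2)

# Update join: set detect to 1 in the df1 rows that match a df2 row on all six columns
df1[df2, on = .(site, ddate, ssegment, spp, vtype, tperiod), detect := 1L][]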
  • Output

#>    site     ddate ssegment  spp vtype tperiod detect
#> 1:  BMA  6/1/2021        1 AMRO  Song       1      0
#> 2:  BMC 6/15/2021        1 WISN  Drum       1      1
#> 3:  BMA 6/15/2021        1 NOFL  Song       2      1
#> 4:  BMC 6/29/2021        2 AMRO  Call       1      0
#> 5:  BMA 6/29/2021        2 WISN  Call       2      1

Created on 2021-12-08 by the reprex package (v2.0.1)

5 Comments

You forgot to add 'spp' to your code, so I added it after 'ssegment' and ran it, but received this error: Error in [.data.frame(df1, df2, on = .(site, ddate, ssegment, : unused argument (on = .(site, ddate, ssegment, spp, vtype, tperiod))
Sorry for the mistake... So, I added the missing variable in my code and it still works. Please, let me know.
I guess the problem is that df1 and df2 are data frames, so you need to convert them to data.tables first with setDT(df1) and setDT(df2). I will add that to my answer. Please let me know.
This solution works as well. Thank you for the effort!
Thanks for your feedback. I wish you the best in your work. Cheers
