I have a dataframe that looks like this:

Year   Name    Place   Job
2010   Jim     USA     CEO
2010   Jim     Canada  Advisor
2010   Jim     Canada  Board Member
2011   Jim     USA     CEO

2017   Peter   Mexico  COO
2019   Peter   Korea   CEO
2019   Peter   China   Advisor

2013   Harry   USA     Chairman
2014   Harry   Canada  CEO
2015   Harry   Canada  CEO
2015   Harry   Canada  Advisor

I want to remove certain rows from the above dataframe based on the "Year" and "Name" columns. Basically, every row whose Year/Name combination occurs in the list below (also a dataframe) should be removed:

Year  Name
2010  Jim
2019  Peter
2013  Harry
2014  Harry

Thus, the final output looks like:

Year   Name    Place   Job
2011   Jim     USA     CEO
2017   Peter   Mexico  COO
2015   Harry   Canada  CEO
2015   Harry   Canada  Advisor

5 Answers

4

base R

While dplyr (below) has anti_join, base R has no direct equivalent: one needs to merge the two frames with a marker column and then keep only the rows that found no match.

# using the `rem` frame, augmenting a little
rem$keep <- FALSE
tmp <- merge(dat, rem, by = c("Year", "Name"), all.x = TRUE)
tmp
#    Year  Name  Place          Job  keep
# 1  2010   Jim    USA          CEO FALSE
# 2  2010   Jim Canada      Advisor FALSE
# 3  2010   Jim Canada Board Member FALSE
# 4  2011   Jim    USA          CEO    NA
# 5  2013 Harry    USA     Chairman FALSE
# 6  2014 Harry Canada          CEO FALSE
# 7  2015 Harry Canada          CEO    NA
# 8  2015 Harry Canada      Advisor    NA
# 9  2017 Peter Mexico          COO    NA
# 10 2019 Peter  Korea          CEO FALSE
# 11 2019 Peter  China      Advisor FALSE
tmp[ is.na(tmp$keep), ]
#   Year  Name  Place     Job keep
# 4 2011   Jim    USA     CEO   NA
# 7 2015 Harry Canada     CEO   NA
# 8 2015 Harry Canada Advisor   NA
# 9 2017 Peter Mexico     COO   NA
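
Note that merge() sorts its result by the key columns, so the rows above come back in a different order from the expected output in the question. If the original row order matters, one possible base R alternative is to build a composite key per row and filter with %in%. This is only a sketch: the helper name `make_key` and the separator are assumptions made for illustration, and the approach assumes the separator never appears inside the key values.

```r
# Sample data with the same values as `dat` and `rem` in the Data section
dat <- data.frame(
  Year  = c(2010L, 2010L, 2010L, 2011L, 2017L, 2019L, 2019L, 2013L, 2014L, 2015L, 2015L),
  Name  = c("Jim", "Jim", "Jim", "Jim", "Peter", "Peter", "Peter",
            "Harry", "Harry", "Harry", "Harry"),
  Place = c("USA", "Canada", "Canada", "USA", "Mexico", "Korea", "China",
            "USA", "Canada", "Canada", "Canada"),
  Job   = c("CEO", "Advisor", "Board Member", "CEO", "COO", "CEO", "Advisor",
            "Chairman", "CEO", "CEO", "Advisor")
)
rem <- data.frame(
  Year = c(2010L, 2019L, 2013L, 2014L),
  Name = c("Jim", "Peter", "Harry", "Harry")
)

# Hypothetical helper: one string key per row, built from the chosen columns.
# Assumption: `sep` never occurs inside any of the key values.
make_key <- function(df, cols, sep = "\r") {
  do.call(paste, c(unname(as.list(df[cols])), sep = sep))
}

by <- c("Year", "Name")

# Keep the rows of `dat` whose key is absent from `rem`; row order is untouched
kept <- dat[!make_key(dat, by) %in% make_key(rem, by), ]
kept
```

Because this only subsets `dat`, the result keeps the original row names and column types, and no helper column has to be dropped afterwards.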

dplyr

dplyr::anti_join(dat, rem, by = c("Year", "Name"))
#   Year  Name  Place     Job
# 1 2011   Jim    USA     CEO
# 2 2017 Peter Mexico     COO
# 3 2015 Harry Canada     CEO
# 4 2015 Harry Canada Advisor

Data

dat <- structure(list(Year = c(2010L, 2010L, 2010L, 2011L, 2017L, 2019L, 2019L, 2013L, 2014L, 2015L, 2015L), Name = c("Jim", "Jim", "Jim", "Jim", "Peter", "Peter", "Peter", "Harry", "Harry", "Harry", "Harry"), Place = c("USA", "Canada", "Canada", "USA", "Mexico", "Korea", "China", "USA", "Canada", "Canada", "Canada"), Job = c("CEO", "Advisor", "Board Member", "CEO", "COO", "CEO", "Advisor", "Chairman", "CEO", "CEO", "Advisor")), row.names = c(NA, -11L), class = "data.frame")
rem <- structure(list(Year = c(2010L, 2019L, 2013L, 2014L), Name = c("Jim", "Peter", "Harry", "Harry")), class = "data.frame", row.names = c(NA, -4L))

4 Comments

Oh no... i posted the anti_join version just a few seconds too late.
26 seconds ... it's a race ;-)
Kudos for the base-r version.
Awesome! I didn't know dplyr has this amazing anti_join function! Thank you both!
2

Another approach:

library(dplyr)
library(stringr)

dat %>% 
  mutate(x = str_c(Year, Name)) %>% 
  filter(str_detect(x, str_c(str_c(rem$Year, rem$Name), collapse = '|'), negate = TRUE)) %>% 
  select(-x)

  Year  Name  Place     Job
1 2011   Jim    USA     CEO
2 2017 Peter Mexico     COO
3 2015 Harry Canada     CEO
4 2015 Harry Canada Advisor

1 Comment

I think it's likely very safe here (very low likelihood of overlap between Year and Name, so it works great here) ... but I'd be cautious using this with any combination of fields that might pose ambiguity (where the collapse=/sep= choice is critical). Is there any reason you chose str_c over paste? Also, I think you can reduce it to a single call using str_c(rem$Year, rem$Name, collapse="|").
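
To make the ambiguity concern in this comment concrete, here is a small base R sketch with made-up values (none of them come from the question's data). It shows two ways a concatenated key can go wrong: distinct pairs collapsing to the same string when no separator is used, and a regex pattern matching a longer key as a substring.

```r
# Without a separator, two different (id, code) pairs produce the same key
key_a <- paste0(1, 23)    # "123"
key_b <- paste0(12, 3)    # "123"
collide <- key_a == key_b

# With a separator that cannot occur in the values, the keys stay distinct
distinct <- paste(1, 23, sep = "_") != paste(12, 3, sep = "_")

# A regex test is a substring match, so "2010Jim" also hits "2010Jimmy"
partial <- grepl("2010Jim", "2010Jimmy")

# Anchoring the pattern (or using exact matching via %in%) avoids that
anchored <- grepl("^2010Jim$", "2010Jimmy")
exact <- "2010Jimmy" %in% c("2010Jim")

c(collide = collide, distinct = distinct, partial = partial,
  anchored = anchored, exact = exact)
```

For the Year/Name data in the question neither problem arises, which is why the answer works there; the sketch only illustrates why a join on the actual columns is the safer default.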
2

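We could use str_c to build a combined Year/Name key:

library(dplyr)
library(stringr)

# Compare on both columns: build a "Year_Name" key for each side,
# then drop the rows of dat whose key appears in rem
dat %>% 
  filter(!str_c(Year, Name, sep = "_") %in% str_c(rem$Year, rem$Name, sep = "_"))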

Output:

  Year  Name  Place     Job
1 2011   Jim    USA     CEO
2 2017 Peter Mexico     COO
3 2015 Harry Canada     CEO
4 2015 Harry Canada Advisor

2

Using data.table

library(data.table)
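# dat and rem are the data frames from the Data section of the first answer;
# note that setDT() converts dat to a data.table by reference
setDT(dat)[!rem, on = .(Year, Name)]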

-output

#   Year  Name  Place     Job
#1: 2011   Jim    USA     CEO
#2: 2017 Peter Mexico     COO
#3: 2015 Harry Canada     CEO
#4: 2015 Harry Canada Advisor

1

A base R option using merge + subset

subset(
  merge(
    dat,
    cbind(rem, Removal = 1),
    all = TRUE
  ), 
  is.na(Removal),
  select = -Removal
)

gives

  Year  Name  Place     Job
4 2011   Jim    USA     CEO
7 2015 Harry Canada     CEO
8 2015 Harry Canada Advisor
9 2017 Peter Mexico     COO
