Find a subset of dataframe based on partial string matching from another column of values in R

Question

I have a column of values with unique identifiers that look like this:

df$1 <- c("identifier:ab134:4sfh", "identifier:gh164:9sgh", "identifier:3h1v4:kk9gh")

Some of them also appear in a column of a separate data frame (which has 71 columns), but in that data frame they are often clustered like this:

df2$1 <- c("identifier:ab134:4sfh|identifier:gh164:9sgh", "identifier:sfghskg8:kk9gh|identifier:fj893n:9sgh|identifier:gh164:9sgh",...)

I need to find all rows which have any of the identifiers in them in the second dataframe. I would strsplit the column but I want to keep the rest of the second dataset as it is.

I have tried using this code both ways (i.e. df1 %in% df2 and df2 %in% df1) but obviously it's not giving me all the matches because it's trying to match whole strings rather than substrings:

new_subset <- subset(df$1, trimws(1) %in% trimws(df2$1))

Any suggestions? Thanks in advance for your help!

I’m really not sure what I can add. I need a match for every row and I’ve used the code above (which doesn’t work).
– OliverL, Commented Dec 4, 2019 at 14:06
If you can provide the expected output for the vectors you showed, it would help. lapply(v1, function(x) unlist(lapply(strsplit(v2, "|", fixed = TRUE), function(y) match(x, y)))) Also try grep(df2$1, df$1)
– akrun, Commented Dec 4, 2019 at 14:07
So I tried this and I got a very long list that looks like this:
List of 8806
 $ : int [1:14037] NA NA NA NA NA NA NA NA NA NA ...
 $ : int [1:14037] NA NA NA NA NA NA NA NA NA NA ...
 $ : int [1:14037] NA NA NA NA NA NA NA NA NA NA ...
 $ : int [1:14037] NA NA NA NA NA NA NA NA NA NA ...
– OliverL, Commented Dec 4, 2019 at 16:38
I want an output that looks like this: df$2 <- c("identifier:ab134:4sfh", "identifier:gh164:9sgh") but only includes matches from df$1
– OliverL, Commented Dec 4, 2019 at 16:39

GKi · Accepted Answer · 2019-12-04 15:40:43Z

Maybe you can use grep to find matching strings.

new_subset <- df[grep(paste0("^(",paste(df2$z, collapse = "|"),")$"), df$z),]
new_subset
#[1] identifier:ab134:4sfh identifier:gh164:9sgh

Data:

df <- data.frame(z=c("identifier:ab134:4sfh", "identifier:gh164:9sgh", "identifier:3h1v4:kk9gh"))
df2 <- data.frame(z=c("identifier:ab134:4sfh|identifier:gh164:9sgh", "identifier:sfghskg8:kk9gh|identifier:fj893n:9sgh|identifier:gh164:9sgh"))
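The question also asks for the rows of the second data frame that contain any of the identifiers, with the other columns left untouched. A minimal sketch of one way to do that without building a regular expression is shown below. It reuses the example data from this answer (with stringsAsFactors = FALSE so that z stays character) and splits only a temporary copy of the clustered column. The names clusters, matched_ids, keep and df2_subset are illustrative and do not come from the thread.

```r
# Example data copied from the answer; stringsAsFactors = FALSE keeps z as character
df <- data.frame(z = c("identifier:ab134:4sfh", "identifier:gh164:9sgh",
                       "identifier:3h1v4:kk9gh"),
                 stringsAsFactors = FALSE)
df2 <- data.frame(z = c("identifier:ab134:4sfh|identifier:gh164:9sgh",
                        "identifier:sfghskg8:kk9gh|identifier:fj893n:9sgh|identifier:gh164:9sgh"),
                  stringsAsFactors = FALSE)

# Split each clustered cell on a literal "|" (fixed = TRUE disables regex);
# df2 itself is not modified, only this temporary list is created
clusters <- strsplit(df2$z, "|", fixed = TRUE)

# Identifiers from df that occur in at least one cluster
matched_ids <- df$z[df$z %in% unlist(clusters)]

# Rows of df2 whose cluster contains at least one identifier from df;
# drop = FALSE keeps the result a data frame with every column intact
keep <- vapply(clusters, function(ids) any(ids %in% df$z), logical(1))
df2_subset <- df2[keep, , drop = FALSE]
```

Because %in% compares whole strings after the split, identifiers that contain regex metacharacters need no escaping here. With the example data both rows of df2 are kept, since each cluster contains at least one identifier from df.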

edited Dec 4, 2019 at 15:40

answered Dec 4, 2019 at 14:29

12 Comments

OliverL Over a year ago

I get an error message in response: paste(df2, , collapse = "|") : argument is missing, with no default

GKi Over a year ago

@OliverL I wrote paste(df2, collapse = "|") and not paste(df2, , collapse = "|")

GKi Over a year ago

@OliverL I have updated the data-set in the question from vector to data.frame. Hope this solves the problem.

OliverL Over a year ago

Thanks so much for your help. So now it's saying

"Error in grep(paste0("^(", paste(df1, collapse = "|",  : invalid regular expression '^("identifier:ab134:4sfh|identifier:gh164:9sgh"..)

OliverL Over a year ago

Could it be something to do with collapse = "|" - do I need to change it somehow to make it comply with regex rules?
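On the last comment: "|" is the regex alternation operator, so collapse = "|" is itself valid. The thread does not establish what caused the error on the real data. One possibility, assumed here rather than confirmed, is that some identifiers contain characters that are special in a regular expression. The sketch below splits the clustered strings on a literal "|", escapes each identifier, and only then joins them into a pattern. escape_regex is a hypothetical helper and the sample values are made up for illustration.

```r
# Hypothetical helper (not from the thread): backslash-escape the regex
# metacharacters . | ( ) ^ { } + $ * ? [ ] (a literal backslash is not handled)
escape_regex <- function(x) {
  gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)
}

# Made-up values for illustration; one identifier contains metacharacters
clustered <- c("identifier:ab134:4sfh|identifier:gh164:9sgh",
               "identifier:a+b(1):x.y")
candidates <- c("identifier:ab134:4sfh", "identifier:a+b(1):x.y",
                "identifier:3h1v4:kk9gh")

# Split on a literal "|" first, so that only the separators become alternation
ids <- unique(unlist(strsplit(clustered, "|", fixed = TRUE)))

# Escape each identifier, join with "|", and anchor for whole-string matches
pattern <- paste0("^(", paste(escape_regex(ids), collapse = "|"), ")$")

# TRUE for candidates that equal one of the identifiers exactly
hits <- grepl(pattern, candidates)
candidates[hits]
```

If escaping feels fragile, splitting on "|" and comparing with %in% avoids regular expressions altogether.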
