I have a column of values with unique identifiers that look like this:

df$1 <- c("identifier:ab134:4sfh", "identifier:gh164:9sgh", "identifier:3h1v4:kk9gh")

Some of them are in another column in a separate data frame with 71 columns but in that data frame, they are often clustered like this:

df2$1 <- c("identifier:ab134:4sfh|identifier:gh164:9sgh", "identifier:sfghskg8:kk9gh|identifier:fj893n:9sgh|identifier:gh164:9sgh",...)

I need to find all rows which have any of the identifiers in them in the second dataframe. I would strsplit the column but I want to keep the rest of the second dataset as it is.

I have tried using this code both ways (i.e. df1 %in% df2 and df2 %in% df1) but obviously it's not giving me all the matches because it's trying to match whole strings rather than substrings:

new_subset <- subset(df$1, trimws(1) %in% trimws(df2$1))
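To illustrate the whole-string problem described above, here is a minimal sketch using two of the identifiers from the question. The variable names (`ids`, `clustered`, `whole_string_match`, `split_match`) are made up for this illustration. `%in%` compares complete strings, so a clustered cell never equals a single identifier until it is split on the literal `|`:

```r
# Two identifiers and one clustered cell, taken from the example data above
ids <- c("identifier:ab134:4sfh", "identifier:gh164:9sgh")
clustered <- "identifier:ab134:4sfh|identifier:gh164:9sgh"

# %in% tests for equality with the whole string, so nothing matches
whole_string_match <- ids %in% clustered

# Splitting on the literal "|" first gives individual identifiers to compare
split_match <- ids %in% strsplit(clustered, "|", fixed = TRUE)[[1]]

print(whole_string_match)
print(split_match)
```

The split happens on a temporary copy only, so the original column is left untouched.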

Any suggestions? Thanks in advance for your help!

  • I’m really not sure what I can add. I need a match for every row and I’ve used the code above (which doesn’t work). Commented Dec 4, 2019 at 14:06
  • If you can provide expected output for the vectors you showed, it would help. lapply(v1, function(x) unlist(lapply(strsplit(v2, "|", fixed = TRUE), function(y) match(x, y)))) Also try grep(df2$1, df$1) Commented Dec 4, 2019 at 14:07
  • So I tried this and I got a very long list that looks like this: List of 8806 $ : int [1:14037] NA NA NA NA NA NA NA NA NA NA ... $ : int [1:14037] NA NA NA NA NA NA NA NA NA NA ... $ : int [1:14037] NA NA NA NA NA NA NA NA NA NA ... $ : int [1:14037] NA NA NA NA NA NA NA NA NA NA ... Commented Dec 4, 2019 at 16:38
  • I want an output that looks like this: df$2 <- c("identifier:ab134:4sfh", "identifier:gh164:9sgh") but only includes matches from df$1 Commented Dec 4, 2019 at 16:39
  • You've got mismatched quotation marks in your code. Commented Dec 4, 2019 at 18:29

1 Answer

Maybe you can use grep to find matching strings.

new_subset <- df[grep(paste0("^(",paste(df2$z, collapse = "|"),")$"), df$z),]
new_subset
#[1] identifier:ab134:4sfh identifier:gh164:9sgh

Data:

df <- data.frame(z=c("identifier:ab134:4sfh", "identifier:gh164:9sgh", "identifier:3h1v4:kk9gh"))
df2 <- data.frame(z=c("identifier:ab134:4sfh|identifier:gh164:9sgh", "identifier:sfghskg8:kk9gh|identifier:fj893n:9sgh|identifier:gh164:9sgh"))
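As an alternative to building one large regular expression, here is a rough sketch that splits the clustered column on the literal `|` and compares the pieces with `%in%`. It assumes character columns (hence `stringsAsFactors = FALSE`). The `other` column and the third row of `df2` are hypothetical additions, there only to show that the remaining columns are kept and that non-matching rows are dropped:

```r
df <- data.frame(
  z = c("identifier:ab134:4sfh", "identifier:gh164:9sgh", "identifier:3h1v4:kk9gh"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  z = c("identifier:ab134:4sfh|identifier:gh164:9sgh",
        "identifier:sfghskg8:kk9gh|identifier:fj893n:9sgh|identifier:gh164:9sgh",
        "identifier:zz000:none"),            # hypothetical row with no match
  other = c("row 1", "row 2", "row 3"),      # hypothetical extra column
  stringsAsFactors = FALSE
)

# Split each clustered cell into its individual identifiers (df2 itself is not modified)
parts <- strsplit(df2$z, "|", fixed = TRUE)

# TRUE for every row of df2 that contains at least one identifier from df
keep <- vapply(parts, function(p) any(p %in% df$z), logical(1))

# Rows of df2 with a match, all columns preserved
df2_matched <- df2[keep, , drop = FALSE]

# Identifiers from df that occur anywhere in df2
matched_ids <- df$z[df$z %in% unlist(parts)]

print(df2_matched)
print(matched_ids)
```

Because the comparison is on exact pieces rather than patterns, no regex special characters need escaping, and the pattern does not grow with the number of identifiers.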

12 Comments

I get an error message in response: paste(df2, , collapse = "|") : argument is missing, with no default
@OliverL I wrote paste(df2, collapse = "|") and not paste(df2, , collapse = "|")
@OliverL I have updated the data-set in the question from vector to data.frame. Hope this solves the problem.
Thanks so much for your help. So now it's saying "Error in grep(paste0("^(", paste(df1, collapse = "|", : invalid regular expression '^("identifier:ab134:4sfh|identifier:gh164:9sgh"..)
Could it be something to do with collapse = "|" - do I need to change it somehow to make it comply with regex rules?
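On the regex question: with `collapse = "|"` the pipe acts as regex alternation, which is what the `grep` call relies on. One way to sidestep regex rules entirely is to search for each identifier as a literal substring with `fixed = TRUE`. This is only a sketch, not a diagnosis of the error above. The names `ids`, `clustered`, `hits`, `row_has_match` and `ids_found` are invented for the example, and the third clustered value is a hypothetical non-matching row:

```r
ids <- c("identifier:ab134:4sfh", "identifier:gh164:9sgh", "identifier:3h1v4:kk9gh")
clustered <- c("identifier:ab134:4sfh|identifier:gh164:9sgh",
               "identifier:sfghskg8:kk9gh|identifier:fj893n:9sgh|identifier:gh164:9sgh",
               "identifier:zz000:none")  # hypothetical row with no match

# One logical vector per identifier; fixed = TRUE treats the identifier as plain text
hits <- lapply(ids, function(id) grepl(id, clustered, fixed = TRUE))

# Rows of the clustered column that contain at least one identifier
row_has_match <- Reduce(`|`, hits)

# Identifiers that were found in at least one row
ids_found <- ids[vapply(hits, any, logical(1))]

print(row_has_match)
print(ids_found)
```

One caveat: literal substring matching would also report a hit if one identifier happened to be a prefix of a longer one, so splitting on `|` and comparing whole pieces is the safer choice when that can occur.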