dat1 <- data.frame(id1 = c(1, 1, 2),
          pattern = c("apple", "applejack", "bananas, sweet"))
dat2 <- data.frame(id2 = c(1174, 1231),
          description = c("apple is sweet", "bananass are not"),
          description2 = c("melon", "bananas, sweet yes"))

> dat1
  id1        pattern
1   1          apple
2   1      applejack
3   2 bananas, sweet
> dat2
   id2      description       description2
1 1174   apple is sweet              melon
2 1231 bananass are not bananas, sweet yes

I have two data.frames, dat1 and dat2. I would like to take each pattern in dat1 and search for it in dat2's description and description2 columns using the regular expression \\b[pattern]\\b, i.e. the pattern wrapped in word boundaries.
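To make the intended matching rule concrete, here is a minimal check of what the word-boundary anchors do. The strings in `texts` are made up for illustration (only "apple is sweet" comes from dat2).

```r
# \\b matches at the edge between a word character and a non-word character
texts <- c("apple is sweet", "applesauce", "an apple, please")
hits <- grepl("\\bapple\\b", texts)
hits
```

"applesauce" should not match, because the "s" after "apple" is a word character and so there is no boundary there.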

Here is my attempt and the desired final output:

description_match <- description2_match <- vector()
for(i in 1:nrow(dat1)){
  for(j in 1:nrow(dat2)){
    search_pattern <- paste0("\\b", dat1$pattern[i], "\\b")
    description_match <- c(description_match, ifelse(grepl(search_pattern, dat2[j, "description"]), 1, 0))
    description2_match <- c(description2_match, ifelse(grepl(search_pattern, dat2[j, "description2"]), 1, 0))
  }
}
final_output <- data.frame(id1 = rep(dat1$id1, each = nrow(dat2)),
                           pattern = rep(dat1$pattern, each = nrow(dat2)),
                           id2 = rep(dat2$id2, length = nrow(dat1) * nrow(dat2)),
                           description_match = description_match,
                           description2_match = description2_match)

> final_output
  id1        pattern  id2 description_match description2_match
1   1          apple 1174                 1                  0
2   1          apple 1231                 0                  0
3   1      applejack 1174                 0                  0
4   1      applejack 1231                 0                  0
5   2 bananas, sweet 1174                 0                  0
6   2 bananas, sweet 1231                 0                  1

This approach is slow and inefficient when dat1 and dat2 have many rows. Is there a faster way to do this that avoids the for loops?

3 Answers

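Using outer with a vectorized grepl (via Vectorize):

# outer() returns a patterns-by-descriptions matrix; t() makes it flatten in the
# same order as the rows built below (every id2 for the first pattern, then the next)
r <- sapply(dat2[-1], \(x) +t(outer(dat1$pattern, x, Vectorize(grepl))))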
cbind(dat1[rep(seq_len(nrow(dat1)), each=nrow(dat2)), ], id2=dat2$id2, r)
#     id1        pattern  id2 description description2
# 1     1          apple 1174           1            0
# 1.1   1          apple 1231           0            0
# 2     1      applejack 1174           0            0
# 2.1   1      applejack 1231           0            0
# 3     2 bananas, sweet 1174           0            0
# 3.1   2 bananas, sweet 1231           0            1
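
If whole-word matching is needed, one possible variant wraps each pattern in \\b before calling outer. This is only a sketch: `bounded`, `hit_matrix` and `res` are names made up here, function(x) is used so it also runs on R versions older than 4.1, and the t() call is an assumption about the row order wanted (each pattern paired with every id2 in turn).

```r
dat1 <- data.frame(id1 = c(1, 1, 2),
                   pattern = c("apple", "applejack", "bananas, sweet"))
dat2 <- data.frame(id2 = c(1174, 1231),
                   description = c("apple is sweet", "bananass are not"),
                   description2 = c("melon", "bananas, sweet yes"))

# Hypothetical name: each pattern wrapped in word boundaries
bounded <- paste0("\\b", dat1$pattern, "\\b")

# outer() gives a patterns-by-descriptions matrix; t() makes the flattened
# order run through every id2 for the first pattern, then the second, ...
hit_matrix <- sapply(dat2[-1],
                     function(x) +t(outer(bounded, x, Vectorize(grepl))))

res <- cbind(dat1[rep(seq_len(nrow(dat1)), each = nrow(dat2)), ],
             id2 = dat2$id2, hit_matrix)
res
```

Note that the patterns are still interpreted as regular expressions, so any pattern containing regex metacharacters would need escaping first.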

4 Comments

I got this error: Error: unexpected input in "r <- sapply(dat2[-1], \"
Use function(x) instead of \(x), or update R: the \(x) shorthand needs R 4.1.0 or later. Cheers!
Thanks. Is using grepl above equivalent to using grepl("\\bapple\\b", "applesauce") (which would return FALSE)?
You could also try paste0("\\b", dat1$pattern, "\\b") instead of dat1$pattern if you need the boundaries.

A tidyverse solution using:

  • tidyr::crossing to produce all combinations of the rows of dat1 and dat2
  • stringr::str_detect to test, pairwise, whether each pattern occurs in each string.

library(tidyverse)

crossing(dat1, dat2) %>%
  mutate(across(contains('description'), ~ +str_detect(.x, sprintf('\\b%s\\b', pattern))))

# A tibble: 6 × 5
    id1 pattern          id2 description description2
  <dbl> <chr>          <dbl>       <int>        <int>
1     1 apple           1174           1            0
2     1 apple           1231           0            0
3     1 applejack       1174           0            0
4     1 applejack       1231           0            0
5     2 bananas, sweet  1174           0            0
6     2 bananas, sweet  1231           0            1

2 Comments

str_detect("applesauce", pattern = "apple") returns TRUE, but what I want to implement is grepl("\\bapple\\b", "applesauce") which returns FALSE. Is there an alternative to str_detect?
@Adrian you could also pass regex to str_detect. See my edit.

Another option, though it may be slower than @jay.sf's:

Your data frames:

dat1 <- data.frame(id1 = c(1, 1, 2),
                   pattern = c("apple", "applejack", "bananas, sweet"))
dat2 <- data.frame(id2 = c(1174, 1231),
                   description = c("apple is sweet", "bananass are not"),
                   description2 = c("melon", "bananas, sweet yes"))

Add a column with the pattern you'd like to use for matching:

dat1$pattern_grep = paste0("\\b", dat1$pattern, "\\b")

Perform a Cartesian join (i.e. join every row of dat2 to each row of dat1):

cj <- merge(dat1, dat2, by = NULL)
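
As a quick sanity check on this step, a merge with no join columns should yield one row per pair of input rows. The sketch below uses small stand-in data frames `a` and `b`, which are not from the question.

```r
a <- data.frame(id1 = 1:3)
b <- data.frame(id2 = c(10, 20))

# With by = NULL there are no key columns, so merge() returns the Cartesian product
cj_demo <- merge(a, b, by = NULL)
dim(cj_demo)
```

The result should have nrow(a) * nrow(b) rows, which is also why this approach can get memory-hungry for large inputs.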

Perform your grepl now:

cj$description_match <- mapply(grepl, cj$pattern_grep, cj$description)*1
cj$description2_match <- mapply(grepl, cj$pattern_grep, cj$description2)*1
  • Think of mapply as applying grepl row by row, pairing each pattern with the description on the same row
  • Multiplying by 1 converts the logical result to 1/0
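
The pairwise behaviour of mapply can be seen on a tiny example. This is a sketch with made-up vectors `patterns` and `strings`, not the columns of cj:

```r
patterns <- c("\\bapple\\b", "\\bapple\\b", "\\bmelon\\b")
strings  <- c("apple is sweet", "applesauce", "melon")

# mapply() calls grepl(patterns[i], strings[i]) for each i;
# USE.NAMES = FALSE drops the names mapply would otherwise take from `patterns`
pairwise <- mapply(grepl, patterns, strings, USE.NAMES = FALSE)
as.integer(pairwise)
```

Each element is tested only against the string in the same position, which is what makes it suitable for the joined data frame.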

Keep relevant columns:

cj = cj[, c("id1", "pattern", "id2", "description_match", "description2_match")]

  id1        pattern  id2 description_match description2_match
1   1          apple 1174                 1                  0
2   1      applejack 1174                 0                  0
3   2 bananas, sweet 1174                 0                  0
4   1          apple 1231                 0                  0
5   1      applejack 1231                 0                  0
6   2 bananas, sweet 1231                 0                  1
