Running a string against multiple match dataframes

Question

I have a dataset of text strings that look something like this:

strings <- structure(list(string = c("Jennifer Rae Hancock Brown", "Lisa Smith Houston Blogger", 
"Tina Fay Las Cruces", "\t\nJamie Tucker Style Expert", "Jessica Wright Htx Satx", 
"Julie Green Lifestyle Blogger", "Mike S Thomas Football Player", 
"Tiny Fitness Houston Studio")), class = "data.frame", row.names = c(NA, 
-8L))

I am trying to evaluate matches in those strings against two different datasets called firstname and lastname that look as such:

firstname <- structure(list(firstnames = c("Jennifer", "Lisa", "Tina", "Jamie", 
"Jessica", "Julie", "Mike", "George")), class = "data.frame", row.names = c(NA, 
-8L))

lastname <- structure(list(lastnames = c("Hancock", "Smith", "Houston", "Fay", 
"Tucker", "Wright", "Green", "Thomas")), class = "data.frame", row.names = c(NA, 
-8L))

First thing I would like to do is remove everything after the first three words in each string, so "Jennifer Rae Hancock Brown" would just become "Jessica Rae Hancock" and "Lisa Smith Houston Blogger" would become "Lisa Smith Houston"

After that, I then want to evaluate the first word of each string to see if it matches to anything in the firstname dataframe. If it does match, it creates a new column called in the final table called firstname with the result. If it doesn't match, the result is simply "N/A".

After that, I'd like to then evaluate the remaining words against the lastname dataframe. There can be multiple matches (As seen in the "Lisa Smith Houston" example) and if that's the case, both results will be stored in the final dataframe.

The final dataframe should look like this:

final <- structure(list(string = c("Jennifer Rae Hancock Brown", "Lisa Smith Houston Blogger", 
"Lisa Smith Houston Blogger", "Tina Fay Las Cruces", "\t\nJamie Tucker Style Expert", 
"Jessica Wright Htx Satx", "Julie Green Lifestyle Blogger", "Mike S Thomas Football Player", 
"Tiny George Fitness Houston Studio"), firstname = c("Jennifer", 
"Lisa", "Lisa", "Tina", "Jamie", "Jessica", "Julie", "Mike", 
"N/A"), lastname = c("Hancock", "Smith", "Houston", "Fay", "Tucker", 
"Wright", "Green", "Thomas", "N/A")), class = "data.frame", row.names = c(NA, 
-9L))

What would be the most effective way to go about doing this?

akrun · Accepted Answer · 2022-12-20 17:36:15Z

1

We may use str_extract_all on the substring of 'string2' with pattern as the firstnames, lastnames vector converted to a single string with | (OR as delimiter) and return a list of vectors, then use unnest to convert the list to vector

library(dplyr)
library(stringr)
library(tidyr)
strings %>%
   mutate(string2 = str_extract(trimws(string), "^\\S+\\s+\\S+\\s+\\S+"),
   firstname = str_extract_all(string2, 
    str_c(firstname$firstnames, collapse = "|")), 
   lastname =str_extract_all(string2, 
     str_c(lastname$lastnames, collapse = "|")) ) %>% 
   unnest(where(is.list), keep_empty = TRUE) %>% 
   select(-string2)%>% 
   mutate(lastname = case_when(complete.cases(firstname) ~ lastname))

-output

# A tibble: 9 × 3
  string                          firstname lastname
  <chr>                           <chr>     <chr>   
1 "Jennifer Rae Hancock Brown"    Jennifer  Hancock 
2 "Lisa Smith Houston Blogger"    Lisa      Smith   
3 "Lisa Smith Houston Blogger"    Lisa      Houston 
4 "Tina Fay Las Cruces"           Tina      Fay     
5 "\t\nJamie Tucker Style Expert" Jamie     Tucker  
6 "Jessica Wright Htx Satx"       Jessica   Wright  
7 "Julie Green Lifestyle Blogger" Julie     Green   
8 "Mike S Thomas Football Player" Mike      Thomas  
9 "Tiny Fitness Houston Studio"   <NA>      <NA>

OP's expected

> final
                              string firstname lastname
1         Jennifer Rae Hancock Brown  Jennifer  Hancock
2         Lisa Smith Houston Blogger      Lisa    Smith
3         Lisa Smith Houston Blogger      Lisa  Houston
4                Tina Fay Las Cruces      Tina      Fay
5      \t\nJamie Tucker Style Expert     Jamie   Tucker
6            Jessica Wright Htx Satx   Jessica   Wright
7      Julie Green Lifestyle Blogger     Julie    Green
8      Mike S Thomas Football Player      Mike   Thomas
9 Tiny George Fitness Houston Studio       N/A      N/A

edited Dec 20, 2022 at 17:36

answered Dec 20, 2022 at 17:04

akrun

891k38 gold badges590 silver badges700 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

wizkids121 Over a year ago

Thank you! This is very close to what I need with one exception. What about the example of "Lisa Smith Houston" which would have two matches? Is there a way we can account for that?

wizkids121 Over a year ago

Proper legend. Thank you!

Collectives™ on Stack Overflow

Running a string against multiple match dataframes

1 Answer 1

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related