1

I have a dataset of text strings that look something like this:

strings <- structure(list(string = c("Jennifer Rae Hancock Brown", "Lisa Smith Houston Blogger", 
"Tina Fay Las Cruces", "\t\nJamie Tucker Style Expert", "Jessica Wright Htx Satx", 
"Julie Green Lifestyle Blogger", "Mike S Thomas Football Player", 
"Tiny Fitness Houston Studio")), class = "data.frame", row.names = c(NA, 
-8L))

I am trying to evaluate matches in those strings against two different datasets called firstname and lastname that look as such:

firstname <- structure(list(firstnames = c("Jennifer", "Lisa", "Tina", "Jamie", 
"Jessica", "Julie", "Mike", "George")), class = "data.frame", row.names = c(NA, 
-8L))

lastname <- structure(list(lastnames = c("Hancock", "Smith", "Houston", "Fay", 
"Tucker", "Wright", "Green", "Thomas")), class = "data.frame", row.names = c(NA, 
-8L))

First thing I would like to do is remove everything after the first three words in each string, so "Jennifer Rae Hancock Brown" would just become "Jessica Rae Hancock" and "Lisa Smith Houston Blogger" would become "Lisa Smith Houston"

After that, I then want to evaluate the first word of each string to see if it matches to anything in the firstname dataframe. If it does match, it creates a new column called in the final table called firstname with the result. If it doesn't match, the result is simply "N/A".

After that, I'd like to then evaluate the remaining words against the lastname dataframe. There can be multiple matches (As seen in the "Lisa Smith Houston" example) and if that's the case, both results will be stored in the final dataframe.

The final dataframe should look like this:

final <- structure(list(string = c("Jennifer Rae Hancock Brown", "Lisa Smith Houston Blogger", 
"Lisa Smith Houston Blogger", "Tina Fay Las Cruces", "\t\nJamie Tucker Style Expert", 
"Jessica Wright Htx Satx", "Julie Green Lifestyle Blogger", "Mike S Thomas Football Player", 
"Tiny George Fitness Houston Studio"), firstname = c("Jennifer", 
"Lisa", "Lisa", "Tina", "Jamie", "Jessica", "Julie", "Mike", 
"N/A"), lastname = c("Hancock", "Smith", "Houston", "Fay", "Tucker", 
"Wright", "Green", "Thomas", "N/A")), class = "data.frame", row.names = c(NA, 
-9L)) 

What would be the most effective way to go about doing this?

0

1 Answer 1

1

We may use str_extract_all on the substring of 'string2' with pattern as the firstnames, lastnames vector converted to a single string with | (OR as delimiter) and return a list of vectors, then use unnest to convert the list to vector

library(dplyr)
library(stringr)
library(tidyr)
strings %>%
   mutate(string2 = str_extract(trimws(string), "^\\S+\\s+\\S+\\s+\\S+"),
   firstname = str_extract_all(string2, 
    str_c(firstname$firstnames, collapse = "|")), 
   lastname =str_extract_all(string2, 
     str_c(lastname$lastnames, collapse = "|")) ) %>% 
   unnest(where(is.list), keep_empty = TRUE) %>% 
   select(-string2)%>% 
   mutate(lastname = case_when(complete.cases(firstname) ~ lastname))

-output

# A tibble: 9 × 3
  string                          firstname lastname
  <chr>                           <chr>     <chr>   
1 "Jennifer Rae Hancock Brown"    Jennifer  Hancock 
2 "Lisa Smith Houston Blogger"    Lisa      Smith   
3 "Lisa Smith Houston Blogger"    Lisa      Houston 
4 "Tina Fay Las Cruces"           Tina      Fay     
5 "\t\nJamie Tucker Style Expert" Jamie     Tucker  
6 "Jessica Wright Htx Satx"       Jessica   Wright  
7 "Julie Green Lifestyle Blogger" Julie     Green   
8 "Mike S Thomas Football Player" Mike      Thomas  
9 "Tiny Fitness Houston Studio"   <NA>      <NA>    

OP's expected

> final
                              string firstname lastname
1         Jennifer Rae Hancock Brown  Jennifer  Hancock
2         Lisa Smith Houston Blogger      Lisa    Smith
3         Lisa Smith Houston Blogger      Lisa  Houston
4                Tina Fay Las Cruces      Tina      Fay
5      \t\nJamie Tucker Style Expert     Jamie   Tucker
6            Jessica Wright Htx Satx   Jessica   Wright
7      Julie Green Lifestyle Blogger     Julie    Green
8      Mike S Thomas Football Player      Mike   Thomas
9 Tiny George Fitness Houston Studio       N/A      N/A
Sign up to request clarification or add additional context in comments.

2 Comments

Thank you! This is very close to what I need with one exception. What about the example of "Lisa Smith Houston" which would have two matches? Is there a way we can account for that?
Proper legend. Thank you!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.