I have a dataset of text strings that look something like this:
strings <- structure(list(string = c("Jennifer Rae Hancock Brown", "Lisa Smith Houston Blogger",
"Tina Fay Las Cruces", "\t\nJamie Tucker Style Expert", "Jessica Wright Htx Satx",
"Julie Green Lifestyle Blogger", "Mike S Thomas Football Player",
"Tiny George Fitness Houston Studio")), class = "data.frame", row.names = c(NA,
-8L))
I am trying to match the words in those strings against two other data frames, firstname and lastname, which look like this:
firstname <- structure(list(firstnames = c("Jennifer", "Lisa", "Tina", "Jamie",
"Jessica", "Julie", "Mike", "George")), class = "data.frame", row.names = c(NA,
-8L))
lastname <- structure(list(lastnames = c("Hancock", "Smith", "Houston", "Fay",
"Tucker", "Wright", "Green", "Thomas")), class = "data.frame", row.names = c(NA,
-8L))
The first thing I would like to do is remove everything after the first three words in each string, so "Jennifer Rae Hancock Brown" would become "Jennifer Rae Hancock" and "Lisa Smith Houston Blogger" would become "Lisa Smith Houston".
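To make the truncation step concrete, here is a rough base R sketch of what I have in mind. The helper name first_n_words is made up for illustration, and I am assuming that leading whitespace (like the "\t\n" in the Jamie row) should be dropped before counting words.

```r
# Hypothetical helper (not from any package): keep the first n
# whitespace-separated words of each string.
first_n_words <- function(x, n = 3) {
  # Assumption: leading/trailing whitespace is not part of any word
  words <- strsplit(trimws(x), "\\s+")
  vapply(words, function(w) paste(head(w, n), collapse = " "), character(1))
}

first_n_words(c("Jennifer Rae Hancock Brown", "\t\nJamie Tucker Style Expert"))
```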
After that, I want to check whether the first word of each string matches anything in the firstname data frame. If it does, the matched name should go into a new column called firstname in the final table. If it doesn't, the value should simply be "N/A".
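Roughly, the first-name check I am picturing looks like the sketch below. It uses a plain character vector and two sample strings for illustration, and assumes the match is exact and case-sensitive.

```r
# Illustrative inputs (a plain vector standing in for the firstname data frame)
first_names <- c("Jennifer", "Lisa", "Tina", "Jamie",
                 "Jessica", "Julie", "Mike", "George")
x <- c("Jennifer Rae Hancock", "Tiny George Fitness")

# Take the first whitespace-separated word of each string
first_word <- vapply(strsplit(trimws(x), "\\s+"), `[`, character(1), 1)

# Keep the word if it is a known first name, otherwise use "N/A"
matched_first <- ifelse(first_word %in% first_names, first_word, "N/A")
matched_first
```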
Finally, I'd like to check the remaining words against the lastname data frame. There can be multiple matches (as in the "Lisa Smith Houston" example), and in that case each match should get its own row in the final data frame.
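For the last-name step, this is the kind of one-row-per-match expansion I mean. It is only a sketch in base R on two already-truncated sample strings, with a plain vector standing in for the lastname data frame.

```r
# Illustrative inputs (a plain vector standing in for the lastname data frame)
last_names <- c("Hancock", "Smith", "Houston", "Fay",
                "Tucker", "Wright", "Green", "Thomas")
truncated <- c("Lisa Smith Houston", "Tiny George Fitness")

rows <- lapply(truncated, function(s) {
  # Every word after the first is a last-name candidate
  rest <- strsplit(s, "\\s+")[[1]][-1]
  hits <- rest[rest %in% last_names]
  if (length(hits) == 0) hits <- "N/A"
  # A string with several hits is recycled into one row per hit
  data.frame(string = s, lastname = hits, stringsAsFactors = FALSE)
})
last_matches <- do.call(rbind, rows)
last_matches
```

Here "Lisa Smith Houston" should produce two rows (Smith and Houston), while "Tiny George Fitness" produces a single "N/A" row.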
The final dataframe should look like this:
final <- structure(list(string = c("Jennifer Rae Hancock Brown", "Lisa Smith Houston Blogger",
"Lisa Smith Houston Blogger", "Tina Fay Las Cruces", "\t\nJamie Tucker Style Expert",
"Jessica Wright Htx Satx", "Julie Green Lifestyle Blogger", "Mike S Thomas Football Player",
"Tiny George Fitness Houston Studio"), firstname = c("Jennifer",
"Lisa", "Lisa", "Tina", "Jamie", "Jessica", "Julie", "Mike",
"N/A"), lastname = c("Hancock", "Smith", "Houston", "Fay", "Tucker",
"Wright", "Green", "Thomas", "N/A")), class = "data.frame", row.names = c(NA,
-9L))
What would be the most effective way to go about doing this?