
Please help me figure out an efficient way to merge these two data frames without using a for loop. There are many more columns and rows, but I simplified the data for this example.

I am looking to:

  • Left join: keep the df rows intact and bring over the D column from the lookup.
  • Join based on two columns.
    • First check column x with a fuzzy match. I want to take the x from the df and see if ANY x in the lookup is a partial string match (the lookup x string is inside the df x string). If there is no match, then I want it to use the "All Else" x variable.
    • Then after picking the x variable, I want to check the y variable for an exact match and return the D variable.

Here are the two tables I start with:

df = structure(list(x = c("San Francisco", "Work at Home", "Arlington VA", 
"Work at Home", "Arlington"), y = c(1, 5, 1, 6, 2)), row.names = c(NA, 
-5L), class = c("tbl_df", "tbl", "data.frame"))

lookup = structure(list(x = c("Arlington", "Arlington", "Arlington", "Arlington", 
"Arlington", "Arlington", "Arlington", "Arlington", "Arlington", 
"Arlington", "Arlington", "Arlington", "Arlington", "Chicago", 
"Chicago", "Chicago", "Chicago", "Chicago", "Chicago", "Chicago", 
"Chicago", "Chicago", "Chicago", "Chicago", "Chicago", "Chicago", 
"San Diego", "San Diego", "San Diego", "San Diego", "San Diego", 
"San Diego", "San Diego", "San Diego", "San Diego", "San Diego", 
"San Diego", "San Diego", "San Diego", "Lisle", "Lisle", "Lisle", 
"Lisle", "Lisle", "Lisle", "Lisle", "Lisle", "Lisle", "Lisle", 
"Lisle", "Lisle", "Lisle", "Brea", "Brea", "Brea", "Brea", "Brea", 
"Brea", "Brea", "Brea", "Brea", "Brea", "Brea", "Brea", "Brea", 
"Boston", "Boston", "Boston", "Boston", "Boston", "Boston", "Boston", 
"Boston", "Boston", "Boston", "Boston", "Boston", "Boston", "Austin", 
"Austin", "Austin", "Austin", "Austin", "Austin", "Austin", "Austin", 
"Austin", "Austin", "Austin", "Austin", "Austin", "Dallas", "Dallas", 
"Dallas", "Dallas", "Dallas", "Dallas", "Dallas", "Dallas", "Dallas", 
"Dallas", "Dallas", "Dallas", "Dallas", "Miami", "Miami", "Miami", 
"Miami", "Miami", "Miami", "Miami", "Miami", "Miami", "Miami", 
"Miami", "Miami", "Miami", "Bedford", "Bedford", "Bedford", "Bedford", 
"Bedford", "Bedford", "Bedford", "Bedford", "Bedford", "Bedford", 
"Bedford", "Bedford", "Bedford", "All Else", "All Else", "All Else", 
"All Else", "All Else", "All Else", "All Else", "All Else", "All Else", 
"All Else", "All Else", "All Else", "All Else"), y = c(1, 2, 
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4, 5, 6, 7, 8, 
9, 10, 11, 12, 13, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4, 5, 6, 
7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
13, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4, 5, 
6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 
12, 13, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4, 
5, 6, 7, 8, 9, 10, 11, 12, 13), D = c(0.88, 0.7, 0.19, 0.12, 
0.26, 0.68, 0.1, 1, 0.68, 0.96, 0.75, 0.08, 0.25, 0.3, 0.64, 
0.35, 0.94, 0.21, 0.15, 0.19, 0.84, 0.94, 0.03, 0.39, 0.42, 0.76, 
0.48, 0.71, 0.75, 0.87, 0.18, 0.53, 0.45, 0.1, 0.66, 0.01, 0.22, 
0.11, 0.79, 0.82, 0.11, 0.66, 0.91, 0.59, 0.55, 0.66, 0.29, 0.58, 
0.26, 0.36, 0.07, 0.47, 0.47, 0.45, 0.15, 0.07, 0.49, 0.67, 0.8, 
0.82, 0.89, 0.36, 0.3, 0.57, 0.44, 0.09, 0.59, 0.65, 0.12, 0.05, 
0.87, 0.47, 0.24, 0.17, 0.56, 0.13, 0.84, 0.17, 0.61, 0.73, 0.31, 
0.79, 0.64, 0.6, 0.63, 0.36, 0.41, 0.15, 0.79, 0.59, 0.2, 0.59, 
0.89, 0.46, 0.77, 0.79, 0.5, 0.99, 0.22, 0.77, 0.9, 0.86, 0.6, 
0.41, 0.95, 0.38, 0.86, 0.82, 0.68, 0.3, 0.75, 0.29, 0.16, 0.88, 
0.3, 0.53, 0.14, 0.23, 0.16, 0.88, 0.93, 0.63, 0.41, 0.72, 0.58, 
0.58, 0.63, 0.66, 0.98, 0.25, 0.68, 0.92, 0.67, 0.67, 0.11, 0.16, 
0.3, 0.36, 0.32, 0.66, 0.34, 0.89, 0.33)), row.names = c(NA, 
-143L), class = c("tbl_df", "tbl", "data.frame"))

Here is my desired output:

output = structure(list(x = c("San Francisco", "Work at Home", "Arlington VA", 
"Work at Home", "Arlington"), y = c(1, 5, 1, 6, 2), D = c(0.68, 
0.11, 0.88, 0.16, 0.7)), row.names = c(NA, -5L), class = c("tbl_df", 
"tbl", "data.frame"))

2 Answers


You can use the dplyr and stringr packages for this problem.

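First, you can build a single regular expression that matches any of the lookup's x values, using distinct, pull and paste0. Each value is wrapped in \b word boundaries so that only whole words match, and the alternatives are joined with |.

library(dplyr)
library(stringr)

# One alternation pattern covering every distinct x in the lookup
xvec <- lookup %>%
  distinct(x) %>%
  pull() %>%
  paste0("\\b", ., "\\b", collapse = "|")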

>xvec
[1] "\\bArlington\\b|\\bChicago\\b|\\bSan Diego\\b|\\bLisle\\b|\\bBrea\\b|\\bBoston\\b|\\bAustin\\b|\\bDallas\\b|\\bMiami\\b|\\bBedford\\b|\\bAll Else\\b"
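
To make the effect of the \b word boundaries concrete, here is a small sketch. It uses base R's grepl so that it runs without extra packages, and the test strings in candidates are made up for illustration; they are not part of the question's data.

```r
# Shortened version of the pattern above (two alternatives only)
pattern <- "\\bArlington\\b|\\bSan Diego\\b"

# Hypothetical inputs, chosen to show what does and does not count as a match
candidates <- c("Arlington VA", "Arlingtonville", "North Arlington", "San Diegan")

# perl = TRUE selects PCRE, where \b is a word boundary
hits <- grepl(pattern, candidates, perl = TRUE)
hits
```

"Arlington VA" and "North Arlington" match because the lookup value appears as a whole word. "Arlingtonville" does not, since the boundary after "Arlington" is missing. If a looser "appears anywhere" match is wanted, the \b wrappers could be dropped.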

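Now you can use the str_extract function of the stringr package, which returns the first part of x that matches the pattern, or NA if nothing matches. case_when is used here to set the new column xnew to "All Else" in case there is no match. The result is in the dfnew table.

dfnew <- df %>%
         # matched lookup value, or NA when no lookup x occurs in x
         mutate(xnew=str_extract(x, xvec)) %>%
         # fall back to "All Else" for unmatched rows
         mutate(xnew=case_when(!is.na(xnew) ~ xnew, TRUE ~ "All Else"))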

>dfnew
  x                 y xnew     
  <chr>         <dbl> <chr>    
1 San Francisco     1 All Else 
2 Work at Home      5 All Else 
3 Arlington VA      1 Arlington
4 Work at Home      6 All Else 
5 Arlington         2 Arlington

Finally, you can join the tables, matching xnew in dfnew to x in lookup and y to y. left_join keeps every row of dfnew, so no grouping is needed. After dropping the helper column xnew you get the desired output.

output <- dfnew %>%
          left_join(lookup, by=c("xnew"="x","y"="y")) %>%
          select(-xnew)

>output
  x                 y     D
  <chr>         <dbl> <dbl>
1 San Francisco     1  0.68
2 Work at Home      5  0.11
3 Arlington VA      1  0.88
4 Work at Home      6  0.16
5 Arlington         2  0.7 
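
For readers without dplyr and stringr, the same three steps (build the pattern, extract with a fallback, join on both keys) can be sketched in base R. This is only an illustration on a cut-down copy of the data: df_small, lookup_small, key, and idx are names made up here, and the D values are copied from the lookup in the question.

```r
# Cut-down, hypothetical copies of the question's tables
df_small <- data.frame(x = c("San Francisco", "Work at Home", "Arlington VA"),
                       y = c(1, 5, 1))
lookup_small <- data.frame(x = c("Arlington", "Arlington", "All Else", "All Else"),
                           y = c(1, 5, 1, 5),
                           D = c(0.88, 0.26, 0.68, 0.11))

# Step 1: one alternation pattern with word boundaries
pattern <- paste0("\\b", unique(lookup_small$x), "\\b", collapse = "|")

# Step 2: extract the matching lookup value, defaulting to "All Else"
m <- regexpr(pattern, df_small$x, perl = TRUE)
key <- rep("All Else", nrow(df_small))
key[as.vector(m) > 0] <- regmatches(df_small$x, m)  # regmatches skips non-matches

# Step 3: exact match on the (x, y) pair, then pull D across
idx <- match(paste(key, df_small$y, sep = "|"),
             paste(lookup_small$x, lookup_small$y, sep = "|"))
df_small$D <- lookup_small$D[idx]
df_small
```

Pasting the two keys together is a shortcut that assumes the separator never occurs in x; merge() on both columns would avoid that assumption at the cost of reordering rows.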

1 Comment

Thank you for the explanation and this is a great solution!

Here's a working solution using tidyverse and fuzzyjoin. There might be a more performant solution, but I hope this one is fast enough for your needs.

One caveat though: I suggest reading the documentation of stringdist_join to understand the matching algorithms and the other fuzzy-matching parameters. I picked max_dist = 3 only because that's what worked with your example, and I can't guarantee it's optimal for the rest of your data.

library(tidyverse)
library(fuzzyjoin)


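# Rows of df whose x is within string distance 3 of some lookup x;
# keep only the lookup row with the same y
fuzzy_matched_df <- df %>%
  fuzzyjoin::stringdist_inner_join(lookup, by = "x", max_dist = 3) %>%
  filter(y.x == y.y) %>%
  select(x = x.x, y = y.x, D)

# Rows of df with no fuzzy match fall back to the "All Else" lookup rows
unmatched_df <- df %>%
  fuzzyjoin::stringdist_anti_join(lookup, by = "x", max_dist = 3) %>%
  mutate(fallback = "All Else") %>%
  inner_join(lookup, by = c(fallback = "x", y = "y")) %>%
  select(x, y, D)

# Stack both parts back together
out <- fuzzy_matched_df %>%
  bind_rows(unmatched_df)

# Compare with the desired output (row order differs, so sort both first)
identical(out %>% arrange(x), output %>% arrange(x))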
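
To illustrate the max_dist caveat above, here is a small sketch using base R's adist, which computes a Levenshtein edit distance. That is an assumption of convenience: stringdist_join has its own method argument and its default metric may differ slightly, although for plain deletions like these the counts should agree. The string "Arlington Virginia" is made up for illustration.

```r
# Edit distance = number of single-character edits needed to turn one string into the other
d_short <- adist("Arlington VA", "Arlington")[1, 1]        # remove " VA"
d_long  <- adist("Arlington Virginia", "Arlington")[1, 1]  # remove " Virginia"
d_other <- adist("San Francisco", "San Diego")[1, 1]

max_dist <- 3

# Which pairs would count as a match under this threshold?
c(short = d_short <= max_dist,
  long  = d_long  <= max_dist,
  other = d_other <= max_dist)
```

"Arlington VA" is exactly 3 edits away from "Arlington", which is why max_dist = 3 works on the sample. A longer suffix such as " Virginia" pushes the distance to 9, so that row would fall through to "All Else" even though it contains the lookup string. If containment is the real requirement, a distance threshold may need to be replaced or supplemented by a substring test.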

1 Comment

Thank you! This is also a nice solution and even a little more than I needed. Really appreciate your time.
