Conditional merge with multiple variables in R

Question

Please help me figure out an efficient way to merge these two data frames without using a for loop. There are many more columns and rows, but I simplified the data for this example.

I am looking to:

Left join: keep the df rows intact and bring over the D column from the lookup.
Join based on two columns.
- First check column x with a fuzzy match. I want to take the x from the df and see if ANY x in the lookup is a partial string match (the lookup x string is inside the df x string). If there is no match, then I want it to use the "All Else" x variable.
- Then after picking the x variable, I want to check the y variable for an exact match and return the D variable.

Here are the two tables I start with:

df = structure(list(x = c("San Francisco", "Work at Home", "Arlington VA", 
"Work at Home", "Arlington"), y = c(1, 5, 1, 6, 2)), row.names = c(NA, 
-5L), class = c("tbl_df", "tbl", "data.frame"))

lookup = structure(list(x = c("Arlington", "Arlington", "Arlington", "Arlington", 
"Arlington", "Arlington", "Arlington", "Arlington", "Arlington", 
"Arlington", "Arlington", "Arlington", "Arlington", "Chicago", 
"Chicago", "Chicago", "Chicago", "Chicago", "Chicago", "Chicago", 
"Chicago", "Chicago", "Chicago", "Chicago", "Chicago", "Chicago", 
"San Diego", "San Diego", "San Diego", "San Diego", "San Diego", 
"San Diego", "San Diego", "San Diego", "San Diego", "San Diego", 
"San Diego", "San Diego", "San Diego", "Lisle", "Lisle", "Lisle", 
"Lisle", "Lisle", "Lisle", "Lisle", "Lisle", "Lisle", "Lisle", 
"Lisle", "Lisle", "Lisle", "Brea", "Brea", "Brea", "Brea", "Brea", 
"Brea", "Brea", "Brea", "Brea", "Brea", "Brea", "Brea", "Brea", 
"Boston", "Boston", "Boston", "Boston", "Boston", "Boston", "Boston", 
"Boston", "Boston", "Boston", "Boston", "Boston", "Boston", "Austin", 
"Austin", "Austin", "Austin", "Austin", "Austin", "Austin", "Austin", 
"Austin", "Austin", "Austin", "Austin", "Austin", "Dallas", "Dallas", 
"Dallas", "Dallas", "Dallas", "Dallas", "Dallas", "Dallas", "Dallas", 
"Dallas", "Dallas", "Dallas", "Dallas", "Miami", "Miami", "Miami", 
"Miami", "Miami", "Miami", "Miami", "Miami", "Miami", "Miami", 
"Miami", "Miami", "Miami", "Bedford", "Bedford", "Bedford", "Bedford", 
"Bedford", "Bedford", "Bedford", "Bedford", "Bedford", "Bedford", 
"Bedford", "Bedford", "Bedford", "All Else", "All Else", "All Else", 
"All Else", "All Else", "All Else", "All Else", "All Else", "All Else", 
"All Else", "All Else", "All Else", "All Else"), y = c(1, 2, 
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4, 5, 6, 7, 8, 
9, 10, 11, 12, 13, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4, 5, 6, 
7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
13, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4, 5, 
6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 
12, 13, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4, 
5, 6, 7, 8, 9, 10, 11, 12, 13), D = c(0.88, 0.7, 0.19, 0.12, 
0.26, 0.68, 0.1, 1, 0.68, 0.96, 0.75, 0.08, 0.25, 0.3, 0.64, 
0.35, 0.94, 0.21, 0.15, 0.19, 0.84, 0.94, 0.03, 0.39, 0.42, 0.76, 
0.48, 0.71, 0.75, 0.87, 0.18, 0.53, 0.45, 0.1, 0.66, 0.01, 0.22, 
0.11, 0.79, 0.82, 0.11, 0.66, 0.91, 0.59, 0.55, 0.66, 0.29, 0.58, 
0.26, 0.36, 0.07, 0.47, 0.47, 0.45, 0.15, 0.07, 0.49, 0.67, 0.8, 
0.82, 0.89, 0.36, 0.3, 0.57, 0.44, 0.09, 0.59, 0.65, 0.12, 0.05, 
0.87, 0.47, 0.24, 0.17, 0.56, 0.13, 0.84, 0.17, 0.61, 0.73, 0.31, 
0.79, 0.64, 0.6, 0.63, 0.36, 0.41, 0.15, 0.79, 0.59, 0.2, 0.59, 
0.89, 0.46, 0.77, 0.79, 0.5, 0.99, 0.22, 0.77, 0.9, 0.86, 0.6, 
0.41, 0.95, 0.38, 0.86, 0.82, 0.68, 0.3, 0.75, 0.29, 0.16, 0.88, 
0.3, 0.53, 0.14, 0.23, 0.16, 0.88, 0.93, 0.63, 0.41, 0.72, 0.58, 
0.58, 0.63, 0.66, 0.98, 0.25, 0.68, 0.92, 0.67, 0.67, 0.11, 0.16, 
0.3, 0.36, 0.32, 0.66, 0.34, 0.89, 0.33)), row.names = c(NA, 
-143L), class = c("tbl_df", "tbl", "data.frame"))

Here is my desired output:

output = structure(list(x = c("San Francisco", "Work at Home", "Arlington VA", 
"Work at Home", "Arlington"), y = c(1, 5, 1, 6, 2), D = c(0.68, 
0.11, 0.88, 0.16, 0.7)), row.names = c(NA, -5L), class = c("tbl_df", 
"tbl", "data.frame"))

Gerardo Flores · Accepted Answer · 2020-09-21 22:36:55Z

You can use the dplyr and stringr packages for this problem.

First, build a single regular expression that matches any of the lookup names, using distinct, pull and paste0. The \\b on each side of a name is a word boundary, so a name only matches as a whole word.

library(dplyr)
library(stringr)

xvec <- paste0("\\b", lookup %>% distinct(x) %>% pull(x), "\\b", collapse = "|")

> xvec
[1] "\\bArlington\\b|\\bChicago\\b|\\bSan Diego\\b|\\bLisle\\b|\\bBrea\\b|\\bBoston\\b|\\bAustin\\b|\\bDallas\\b|\\bMiami\\b|\\bBedford\\b|\\bAll Else\\b"
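
To see what the word boundaries buy you, here is a minimal sketch on a couple of made-up names (the cities and candidates vectors are hypothetical and not part of the question's data). It assumes stringr is installed and uses str_extract, which returns the matched text or NA.

```r
library(stringr)

# Hypothetical lookup names, used only for illustration
cities  <- c("Arlington", "Brea")
pattern <- paste0("\\b", cities, "\\b", collapse = "|")

# Hypothetical strings to test against the pattern
candidates <- c("Arlington VA", "Arlingtonville", "Breakwater", "Brea")

# A name is extracted only where it appears as a whole word;
# otherwise the result is NA
matched <- str_extract(candidates, pattern)
matched
```

Only "Arlington VA" and "Brea" produce a match here; "Arlingtonville" and "Breakwater" come back as NA because the lookup name is followed directly by another letter. If your lookup names could contain regex metacharacters, they would need escaping before being pasted into the pattern.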

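Next, use the str_extract function from stringr to pull the matching lookup name out of each x. (str_match also finds the match, but it returns a matrix rather than a plain character vector, which is awkward as a join key.) case_when then sets the new column xnew to "All Else" wherever nothing matched. The result is in the dfnew table.

dfnew <- df %>%
  mutate(xnew = str_extract(x, xvec)) %>%
  mutate(xnew = case_when(!is.na(xnew) ~ xnew, TRUE ~ "All Else"))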

> dfnew
  x                 y xnew     
  <chr>         <dbl> <chr>    
1 San Francisco     1 All Else 
2 Work at Home      5 All Else 
3 Arlington VA      1 Arlington
4 Work at Home      6 All Else 
5 Arlington         2 Arlington

Finally, join the tables on both keys: xnew in dfnew against x in lookup, and y against y. No grouping is needed, since left_join matches on the by columns directly. Dropping the helper column xnew then gives the desired output.

output <- dfnew %>%
  left_join(lookup, by = c("xnew" = "x", "y" = "y")) %>%
  select(-xnew)

> output
  x                 y     D
  <chr>         <dbl> <dbl>
1 San Francisco     1  0.68
2 Work at Home      5  0.11
3 Arlington VA      1  0.88
4 Work at Home      6  0.16
5 Arlington         2  0.7
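
The steps above can also be collapsed into a single pipeline. The following is only a sketch of one possible variation, not the code from the answer: it assumes coalesce() as the fallback mechanism, and it runs on two small stand-in tables (df_small and lookup_small are hypothetical names with values chosen for illustration) so that it is self-contained.

```r
library(dplyr)
library(stringr)

# Small stand-in tables (hypothetical, for illustration only)
df_small <- tibble(
  x = c("San Francisco", "Arlington VA", "Arlington"),
  y = c(1, 1, 2)
)
lookup_small <- tibble(
  x = rep(c("Arlington", "All Else"), each = 2),
  y = c(1, 2, 1, 2),
  D = c(0.88, 0.70, 0.68, 0.92)
)

# One alternation pattern over the distinct lookup names
pattern <- paste0("\\b", unique(lookup_small$x), "\\b", collapse = "|")

result <- df_small %>%
  # Use the matched lookup name, or "All Else" when nothing matched
  mutate(xnew = coalesce(str_extract(x, pattern), "All Else")) %>%
  # Exact match on the resolved name and on y
  left_join(lookup_small, by = c("xnew" = "x", "y" = "y")) %>%
  select(-xnew)

result
```

Because this is a left join, every row of df_small is kept; a row whose y has no counterpart in the lookup would simply get NA for D.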

RiskyMaor · Answer · 2020-09-21 23:22:32Z

Here's a working solution using tidyverse and fuzzyjoin. There might be a more performant solution, but I hope this one is fast enough for your needs.

One caveat, though: I suggest reading the documentation for stringdist_join to understand the matching algorithms and the other fuzzy-matching parameters. I picked max_dist = 3 only because it worked with your example; I can't guarantee it's optimal for the rest of your data.
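
To get a feel for why a distance threshold behaves differently from a substring match, here is a rough sketch using base R's adist(), which computes a generalized Levenshtein distance. That may not be exactly the metric stringdist_inner_join applies by default, so treat the numbers as indicative only. "Arlington Heights IL" is a hypothetical name that does not appear in the question.

```r
# Edit distance between a df-style name and a lookup-style name (base R)

# Dropping " VA" takes three deletions, so this pair sits right at max_dist = 3
d_arlington <- adist("Arlington VA", "Arlington")[1, 1]

# These differ by more than three edits, so they would not be matched
d_sf <- adist("San Francisco", "San Diego")[1, 1]

# Hypothetical name: the lookup name is a substring, but the distance is large
d_long <- adist("Arlington Heights IL", "Arlington")[1, 1]

c(d_arlington, d_sf, d_long)
```

The last case is the one to watch: a long suffix pushes the distance well past 3, so a row like that would fall through to "All Else" even though the lookup name is contained in it.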

library(tidyverse)
library(fuzzyjoin)


fuzzy_matched_df <- df %>%
  fuzzyjoin::stringdist_inner_join(lookup, by = "x", max_dist = 3) %>%
  filter(y.x == y.y) %>%
  select(x = x.x, y = y.x, D)

unmatched_df <- df %>%
  fuzzyjoin::stringdist_anti_join(lookup, by = "x", max_dist = 3) %>%
  mutate(fallback = "All Else") %>%
  inner_join(lookup, by = c(fallback = "x", y = "y")) %>%
  select(x, y, D)

out <- fuzzy_matched_df %>%
  bind_rows(unmatched_df)

identical(out %>% arrange(x), output %>% arrange(x))

edited Sep 21, 2020 at 23:22

answered Sep 21, 2020 at 22:15


1 Comment

Gabriella Over a year ago

Thank you! This is also a nice solution and even a little more than I needed. Really appreciate your time.
