Match multiple strings with set values across data frames in R

Question

I have one dataset with words transcribed in ARPABET such as below:

dict <- data.frame(
              word=c("HH EH L P", "W IH TH", "S AH M . TH IY NG"))

I have another dataset that has possible ARPABET segments with a specific, corresponding (but ultimately arbitrary) value, something like below:

ref <- data.frame(
              letter=c("HH", "EH", "L", "P", "W", "IH", "TH", "S", "AH", "M", "IY", "NG", "AA", "B"),
              value=c(1.34, 1.91, 2.45, 4.12, 2.12, .69, 5.1, 1.47, 1.98, 3.12, 1.35, 4.11, 1.23, 3.45))

For each word in my data frame dict, I am trying to calculate the sum of the matching letter values. For instance, "HH EH L P" would equal 1.34 + 1.91 + 2.45 + 4.12 = 9.82. Ideally, I would like a data frame that looks like this:

dict_goal <- data.frame(
  word=c("HH EH L P", "W IH TH", "S AH M . TH IY NG"),
  sum=c(9.82, 7.91, 17.1))

My approach thus far has been to split each word at its spaces, move the split segments temporarily into a data frame, join the corresponding values, sum those values, and then append that sum back to my original data (dict) by row. I tried the code below, but it is both cumbersome and ineffective, since it does not actually join the letter values (it just returns NA values). The summing and appending are simple, but I can't seem to get to that point. Note: this code relies on the packages dplyr and stringr.

ref$value <- as.factor(ref$value)
test <- data.frame()

for(i in 1:3){
  test <- str_split(dict[i,], " ")
  test <- as.data.frame(test)  
  colnames(test) <- c("letter")
  test <- left_join(test, ref,
                     by = c("letter" = "value"))
  }

Any help would be greatly appreciated!

That was my mistake! 17.1 is the correct output. Thanks!

– Ian, Commented Apr 29, 2020 at 3:00

Ronak Shah · Accepted Answer · 2020-04-29 02:33:39Z

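You can reshape the data into long format with tidyr::separate_rows(), so that each segment gets its own row, and then join it with ref to get the corresponding value. The values can then be summed for each original word. Segments with no match in ref, such as ".", get NA from the join and are ignored by na.rm = TRUE.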

library(dplyr)

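dict %>%
  mutate(row = row_number()) %>%                   # remember which word each segment came from
  tidyr::separate_rows(word, sep = "\\s+") %>%     # one row per segment
  left_join(ref, by = c('word' = 'letter'))  %>%   # look up each segment's value
  group_by(row) %>%
  summarise(word = paste(word, collapse = " "),    # rebuild the original word
            value = sum(value, na.rm = TRUE)) %>%  # unmatched segments (NA) are skipped
  select(-row)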


#  word              value
#  <chr>             <dbl>
#1 HH EH L P          9.82
#2 W IH TH            7.91
#3 S AH M . TH IY NG 17.1
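
As for why the original loop returned only NA: by = c("letter" = "value") compares the segment strings with the numeric value column, so nothing matches, and test is overwritten on every iteration. A minimal sketch of that loop with the join key corrected might look like the following. It assumes the dict and ref data from the question; the sums vector and the stringsAsFactors = FALSE arguments are additions for illustration. The as.factor() conversion is left out because sum() is not defined for factors.

```r
library(dplyr)
library(stringr)

# Data from the question (stringsAsFactors = FALSE keeps the columns as
# character on R versions before 4.0)
dict <- data.frame(
  word = c("HH EH L P", "W IH TH", "S AH M . TH IY NG"),
  stringsAsFactors = FALSE
)
ref <- data.frame(
  letter = c("HH", "EH", "L", "P", "W", "IH", "TH", "S", "AH", "M", "IY", "NG", "AA", "B"),
  value = c(1.34, 1.91, 2.45, 4.12, 2.12, .69, 5.1, 1.47, 1.98, 3.12, 1.35, 4.11, 1.23, 3.45),
  stringsAsFactors = FALSE
)

# One slot per word, so results are kept instead of being overwritten
sums <- numeric(nrow(dict))

for (i in seq_len(nrow(dict))) {
  # Split the i-th word into its segments, one per row
  segments <- data.frame(
    letter = str_split(dict$word[i], " ")[[1]],
    stringsAsFactors = FALSE
  )
  # Join on the shared "letter" column, not on "value"
  joined <- left_join(segments, ref, by = "letter")
  # Segments without a match (such as ".") have NA and are dropped
  sums[i] <- sum(joined$value, na.rm = TRUE)
}

dict$sum <- sums
dict
```

This keeps the split, join, and sum structure of the original attempt; the separate_rows() pipeline above does the same work without an explicit loop.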

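If loading dplyr and tidyr is not an option, the same split, look up, and sum idea can be sketched in base R with a named vector as the lookup table. This is an alternative to the pipeline above, not part of it; lookup and segment_sum are hypothetical names chosen for this example.

```r
# Data from the question
dict <- data.frame(
  word = c("HH EH L P", "W IH TH", "S AH M . TH IY NG"),
  stringsAsFactors = FALSE
)
ref <- data.frame(
  letter = c("HH", "EH", "L", "P", "W", "IH", "TH", "S", "AH", "M", "IY", "NG", "AA", "B"),
  value = c(1.34, 1.91, 2.45, 4.12, 2.12, .69, 5.1, 1.47, 1.98, 3.12, 1.35, 4.11, 1.23, 3.45),
  stringsAsFactors = FALSE
)

# Named numeric vector used as a lookup table: names are the segments
lookup <- setNames(ref$value, ref$letter)

# Split one word on whitespace, look up each segment, and sum the values.
# Segments absent from ref (such as ".") index to NA and are dropped.
segment_sum <- function(w) {
  segments <- strsplit(w, "\\s+")[[1]]
  sum(lookup[segments], na.rm = TRUE)
}

# Apply to every word; vapply guarantees one numeric result per word
dict$sum <- vapply(dict$word, segment_sum, numeric(1), USE.NAMES = FALSE)
dict
```

Indexing a named vector by character returns NA for unknown names, which is what lets na.rm = TRUE skip the "." separator here.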