I have one dataset with words transcribed in ARPABET such as below:

dict <- data.frame(
              word=c("HH EH L P", "W IH TH", "S AH M . TH IY NG"))

I have another dataset that has possible ARPABET segments with a specific, corresponding (but ultimately arbitrary) value, something like below:

ref <- data.frame(
              letter=c("HH", "EH", "L", "P", "W", "IH", "TH", "S", "AH", "M", "IY", "NG", "AA", "B"),
              value=c(1.34, 1.91, 2.45, 4.12, 2.12, .69, 5.1, 1.47, 1.98, 3.12, 1.35, 4.11, 1.23, 3.45))

For each word in my data frame dict, I am trying to calculate the sum of the matching letter values. For instance, "HH EH L P" would equal 1.34 + 1.91 + 2.45 + 4.12 = 9.82. Ideally, I would like a data frame that looks like this:

dict_goal <- data.frame(
  word=c("HH EH L P", "W IH TH", "S AH M . TH IY NG"),
  sum=c(9.82, 7.91, 17.1))

My approach thus far has been to split each word on its spaces, move the resulting segments temporarily into a data frame, join the corresponding values, sum those values, and then append that sum back to my original data (dict) by row. I tried the code below, but it is both cumbersome and ineffective: it does not actually join the letter values (it just returns NA values). The summing and appending are simple, but I can't seem to get to that point. Note: this code relies on the packages dplyr and stringr.

ref$value <- as.factor(ref$value)
test <- data.frame()

for(i in 1:3){
  test <- str_split(dict[i,], " ")
  test <- as.data.frame(test)  
  colnames(test) <- c("letter")
  test <- left_join(test, ref,
                     by = c("letter" = "value"))
  }

Any help would be greatly appreciated!

    That was my mistake! 17.1 is the correct output. Thanks! Commented Apr 29, 2020 at 3:00

1 Answer

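You can get the data into long format (one segment per row) using separate_rows, then join it with ref on the letter column to get the corresponding value. For each original word, we can then sum the values together. Segments with no match in ref, such as ".", come back as NA from the join and are dropped by na.rm = TRUE.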

library(dplyr)

dict %>%
  mutate(row = row_number()) %>%
  tidyr::separate_rows(word, sep = "\\s+") %>%
  left_join(ref, by = c('word' = 'letter'))  %>%
  group_by(row) %>%
  summarise(word = paste(word, collapse = " "), 
            value = sum(value, na.rm = TRUE)) %>%
  select(-row)


#  word              value
#  <chr>             <dbl>
#1 HH EH L P          9.82
#2 W IH TH            7.91
#3 S AH M . TH IY NG 17.1 
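
As for why the original loop returned only NA: it joined test$letter against ref$value (by = c("letter" = "value")), so no segment could ever match; the key needs to be ref$letter. If you would rather avoid the join altogether, here is a minimal base R sketch of the same idea, assuming the dict and ref data frames from the question. The name lookup is just an illustrative choice, not something from the original post.

```r
# Sample data copied from the question
dict <- data.frame(
  word = c("HH EH L P", "W IH TH", "S AH M . TH IY NG"),
  stringsAsFactors = FALSE)
ref <- data.frame(
  letter = c("HH", "EH", "L", "P", "W", "IH", "TH", "S", "AH", "M", "IY", "NG", "AA", "B"),
  value = c(1.34, 1.91, 2.45, 4.12, 2.12, .69, 5.1, 1.47, 1.98, 3.12, 1.35, 4.11, 1.23, 3.45),
  stringsAsFactors = FALSE)

# Named vector used as a lookup table (hypothetical name):
# names are the ARPABET segments, elements are their values
lookup <- setNames(ref$value, ref$letter)

# Split each word on whitespace, look up every segment, and sum.
# Segments missing from ref (such as ".") index to NA and are
# dropped by na.rm = TRUE.
dict$sum <- vapply(
  strsplit(dict$word, "\\s+"),
  function(segs) sum(lookup[segs], na.rm = TRUE),
  numeric(1))

dict
```

This keeps dict in its original shape and adds the sum column in place, so there is no need to paste the segments back together afterwards.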