I have one dataset with words transcribed in ARPABET such as below:
dict <- data.frame(
word=c("HH EH L P", "W IH TH", "S AH M . TH IY NG"))
I have another dataset that has possible ARPABET segments with a specific, corresponding (but ultimately arbitrary) value, something like below:
ref <- data.frame(
letter=c("HH", "EH", "L", "P", "W", "IH", "TH", "S", "AH", "M", "IY", "NG", "AA", "B"),
value=c(1.34, 1.91, 2.45, 4.12, 2.12, .69, 5.1, 1.47, 1.98, 3.12, 1.35, 4.11, 1.23, 3.45))
For each word in my data frame dict, I am trying to calculate the sum of the matching letter values. For instance, "HH EH L P" would equal 1.34 + 1.91 + 2.45 + 4.12 = 9.82. Ideally, I would like a data frame that looks like this:
dict_goal <- data.frame(
word=c("HH EH L P", "W IH TH", "S AH M . TH IY NG"),
sum=c(9.82, 7.91, 17.13))
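To make the target arithmetic concrete, here is a small check of the expected sum for a single word. It is only a sketch of the calculation, not a general solution; the names `lookup`, `segs`, and `word_sum` are made up for illustration, and the four values are copied by hand from ref.

```r
# Segment values copied from ref for the four segments in "HH EH L P"
lookup <- c(HH = 1.34, EH = 1.91, L = 2.45, P = 4.12)

# Split the transcription on spaces and index the lookup by segment name
segs <- strsplit("HH EH L P", " ", fixed = TRUE)[[1]]
word_sum <- sum(lookup[segs])
word_sum  # 9.82, up to floating-point rounding
```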
My approach thus far has been to split each word on its spaces, move each split word temporarily into a data frame, join the corresponding values, sum those values, and then append that sum back to my original data (dict) by row. I tried using the code below, but it is both cumbersome and ineffective, since it does not actually join the letter values (it just returns NA values). The summing and appending are simple, but I can't seem to get to that point. Note: this code relies on the packages dplyr and stringr.
library(dplyr)
library(stringr)
ref$value <- as.factor(ref$value)
test <- data.frame()
for(i in 1:3){
  test <- str_split(dict[i,], " ")
  test <- as.data.frame(test)
  colnames(test) <- c("letter")
  test <- left_join(test, ref,
                    by = c("letter" = "value"))
}
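For comparison, the split, join, sum, and append steps described above could be sketched roughly as follows. This is one possible reading of the intent, not a definitive fix, and it rests on a few assumptions: value stays numeric rather than being converted to a factor, the join key is letter on both sides, and segments with no entry in ref (such as ".") are simply skipped. The names `sums` and `segs` are introduced here for illustration.

```r
library(dplyr)
library(stringr)

dict <- data.frame(
  word = c("HH EH L P", "W IH TH", "S AH M . TH IY NG"),
  stringsAsFactors = FALSE)
ref <- data.frame(
  letter = c("HH", "EH", "L", "P", "W", "IH", "TH", "S", "AH", "M", "IY", "NG", "AA", "B"),
  value = c(1.34, 1.91, 2.45, 4.12, 2.12, .69, 5.1, 1.47, 1.98, 3.12, 1.35, 4.11, 1.23, 3.45),
  stringsAsFactors = FALSE)

sums <- numeric(nrow(dict))
for (i in seq_len(nrow(dict))) {
  # str_split() returns a list; [[1]] extracts the character vector of segments
  segs <- data.frame(letter = str_split(dict$word[i], " ")[[1]],
                     stringsAsFactors = FALSE)
  # Join on the segment name, so each segment picks up its numeric value
  segs <- left_join(segs, ref, by = "letter")
  # Segments with no match in ref come back as NA and are dropped from the sum
  sums[i] <- sum(segs$value, na.rm = TRUE)
}
dict$sum <- sums
```

Joining on `letter` (rather than matching `letter` against `value`) and keeping `value` numeric are the two places where this sketch departs from the loop above.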
Any help would be greatly appreciated!