
I am struggling to extract substrings from a column of a data frame. I am working with language data. The first column of my data frame holds a verb stem, and the second holds a full sentence of several words, one of which is the conjugated verb. I would like to create a third column containing only the conjugated verb, that is, the word in the sentence that contains the verb stem from column 1 of the same row. I cannot simply match against a list of all verb stems, because some sentences contain two verbs and I only want the one whose stem appears in column 1 of that row.

This is what my data looks like now:

   Verb_stem       Full_sentence 
1. copt            to coptu to 
2. puns            punse kanchina 
3. khag            basana na lo khagunse nan

And this is the output that I would like:

   Verb_stem       Full_sentence              Conjugated_verb
1. copt            to coptu to                coptu
2. puns            punse kanchina             punse
3. khag            basana na lo khagunse nan  khagunse

After doing some research, I tried the following formula:

Df$Conjugated_verb <- lapply(strsplit(Df$Full_sentence, " "), grep, pattern = Df$Verb_stem, value = TRUE)

The problem I am facing is that this code seems to look only for the verb stem from the 1st row in every sentence, instead of switching to a new verb stem for each row. Here is the output that I get:

   Verb_stem       Full_sentence               Conjugated_verb 
1. copt            to coptu to                 coptu
2. puns            punse kanchina              character(0)
3. khag            basana na lo khagunse nan   character(0)

I tried many things, and I have been looking for a solution for days, but I really cannot figure out how to do it. If someone had an idea, I would be super grateful! Thanks in advance!

2 Answers


You can use mapply() to process Verb_stem and Full_sentence pairwise, one row at a time.

within(df, {
  # For each row: split the sentence on whitespace, keep the words containing the stem
  Conjugated_verb <- mapply(\(x, y) { z <- strsplit(y, "\\s+")[[1]] ; z[grepl(x, z)] },
                            Verb_stem, Full_sentence)
})

or

within(df, {
  Conjugated_verb <- mapply(\(x, y) sub(sprintf(".*(\\w*%s\\w*).*", x), "\\1", y),
                            Verb_stem, Full_sentence)
})

Output:

#   Verb_stem             Full_sentence Conjugated_verb
# 1      copt               to coptu to           coptu
# 2      puns            punse kanchina           punse
# 3      khag basana na lo khagunse nan        khagunse
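
For context on why the attempt in the question fails: grep() accepts a single pattern, so when pattern = Df$Verb_stem is a vector, only its first element is used (R emits a warning about this). The following is a minimal sketch that keeps the question's strsplit() approach but pairs each word list with its own stem via Map(). It assumes character columns named as in the question, treats the stems as literal text (fixed = TRUE), and keeps only the first match per row; those choices are assumptions, not part of the original answer.

```r
# Sample data mirroring the question (column names taken from the post)
Df <- data.frame(
  Verb_stem = c("copt", "puns", "khag"),
  Full_sentence = c("to coptu to", "punse kanchina", "basana na lo khagunse nan"),
  stringsAsFactors = FALSE
)

# Split each sentence into words, then search each word list with its own stem
words <- strsplit(Df$Full_sentence, " ", fixed = TRUE)
matches <- Map(function(w, stem) grep(stem, w, value = TRUE, fixed = TRUE),
               words, Df$Verb_stem)

# Keep the first match per row; NA when the stem does not occur in the sentence
Df$Conjugated_verb <- vapply(matches,
                             function(m) if (length(m)) m[1] else NA_character_,
                             character(1))
```

Reducing each row to a single string keeps the new column an ordinary character vector rather than a list, which can happen with the first mapply() variant when a row has zero or several matching words.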

We can use str_extract() from stringr, which is vectorized over both the string and the pattern:

library(dplyr)
library(stringr)
df1 %>%
    mutate(Conjugated = str_extract(Full_sentence, str_c(Verb_stem, "\\S*")))

Output:

   Verb_stem             Full_sentence Conjugated
1.      copt               to coptu to      coptu
2.      puns            punse kanchina      punse
3.      khag basana na lo khagunse nan   khagunse

Data:

df1 <- structure(list(Verb_stem = c("copt", "puns", "khag"), 
Full_sentence = c("to coptu to", 
"punse kanchina", "basana na lo khagunse nan")), 
class = "data.frame", row.names = c("1.", 
"2.", "3."))
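
One thing to be aware of: a pattern built as stem followed by "\\S*" starts the match at the stem, so any characters attached before the stem in the same word would be dropped. The sample data has no such case, so the sketch below uses made-up sentences purely for illustration. It is a base R variant in which extract_word is a hypothetical helper name, and it assumes the stems contain no regex metacharacters.

```r
# Hypothetical helper: return the whole whitespace-delimited word containing `stem`
extract_word <- function(stem, sentence) {
  # Non-space runs on both sides of the stem widen the match to the full word
  pattern <- paste0("[^[:space:]]*", stem, "[^[:space:]]*")
  m <- regexpr(pattern, sentence)
  # regexpr() returns -1 when there is no match
  if (as.integer(m) == -1L) NA_character_ else regmatches(sentence, m)
}

# Made-up inputs: a plain match, a stem preceded by extra characters, and no match
stems <- c("copt", "khag", "puns")
sentences <- c("to coptu to", "na makhagunse nan", "basana na lo")

result <- mapply(extract_word, stems, sentences, USE.NAMES = FALSE)
```

Returning NA for rows without a match makes missing verbs easy to spot afterwards, for example with is.na().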
