I have a dataset with 20,000 rows that in its purest form looks like this:

    v1                   v2
1   Case 1 (A v. B)      A v. B 
2   Case 2 (A v. C)      A v. B 
3   Case 2 (A v. C)      C v. B 
4   Case 4 (X v. Z)      X v. Z 
5   Case 5 (B v. A)      A v. B 
6   Case 6 (X v. A)      X v. A 
7   Case 6 (X v. A)      A v. X 
...

...except there are many more variations of v1 and v2 (about 150 in practice, too many to list).

I want to return a third column v3 containing a logical indicator of whether any substring of v1 matches the string in v2.

    v1                   v2           v3
1   Case 1 (A v. B)      A v. B       TRUE
2   Case 2 (A v. C)      A v. B       FALSE
3   Case 2 (A v. C)      C v. B       FALSE
4   Case 4 (X v. Z)      X v. Z       TRUE
5   Case 5 (B v. A)      A v. B       FALSE
6   Case 6 (X v. A)      X v. A       TRUE
7   Case 6 (X v. A)      A v. X       FALSE

I've been playing around with something like this, which I think is on the right track:

library(stringr)
x$v3 <- with(x, str_detect(v1, v2))

I'd be very grateful if someone could point me in the right direction to a solution/workaround.

MWE shows that my str_detect() technique does not work:

x <- structure(list(v1 = c("Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation"
), v2 = c("Georgia v Russian Federation", " Ethiopia v South Africa Liberia v South Africa", 
             " Cameroon v United Kingdom", " New Zealand v France", " Australia v France", 
             " Nicaragua v United States of America", " Nicaragua v Honduras", 
             " Nauru v Anustralia", " Nnew Zealand v France", " Islamic Republic of Iran v United States of America", 
             " Bosnia and Herzegovina v Serbia and Montenegro", " Spain v Cananda", 
             " Libyan Arab Jamahiriya v United States of America", " Libyan Arab Jamahiriya v United Kingdom", 
             " Democratic Republic of the Congo v Burundi", " Germany v United States of America", 
             " Democratic Republic of the Congo v Belgium", " Liechtenstein v Germany", 
             " Democratic Republic of the Congo v Ugandan", " Democratic Republic of the Congo v Rwandan", 
             " Nicaragua v Colombia", " Djibouti v France", " Georgia v Russian Federation", 
             " Croatia v Serbia", " Mexico v United States of American", " Democratic Republic of the Congo v Rwanda", 
             " Spain v  Canada", " Australia v  France", " New Zealand v France", 
             " New Zealand v France")), .Names = c("v1", "v2"
             ), row.names = c(NA, 30L), class = "data.frame")

1 Answer

grepl() can be used to test whether a single value from v2 occurs as a substring of the corresponding value in v1.

You need to apply it to each row separately, because grepl() uses only the first element of its pattern argument. With the data frame x from the question, a quick solution is: apply(x, MARGIN=1, FUN=function(r) {grepl(r[2], r[1])})
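As a rough sketch of the row-by-row pairing described above, the snippet below rebuilds the simplified seven-row example from the question and pairs the two columns with mapply() instead of apply(). The use of mapply() and of fixed = TRUE is an assumption made here, not part of the original answer: fixed = TRUE treats v2 as a literal string, so the "." in "A v. B" is not read as a regex wildcard.

```r
# Toy rows copied from the question's simplified example
x <- data.frame(
  v1 = c("Case 1 (A v. B)", "Case 2 (A v. C)", "Case 2 (A v. C)",
         "Case 4 (X v. Z)", "Case 5 (B v. A)", "Case 6 (X v. A)",
         "Case 6 (X v. A)"),
  v2 = c("A v. B", "A v. B", "C v. B", "X v. Z", "A v. B", "X v. A", "A v. X"),
  stringsAsFactors = FALSE
)

# mapply() calls grepl(pattern = v2[i], x = v1[i], fixed = TRUE) for each row i;
# unname() drops the names mapply() takes from its first character argument
x$v3 <- unname(mapply(grepl, x$v2, x$v1, MoreArgs = list(fixed = TRUE)))

x$v3
# [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE
```

The result agrees with the v3 column the question asks for.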

If you want to ignore differences in the number of spaces (as in row 1), you can turn the value in x[2] into a suitable regex with gsub: each " " is replaced with " *", which matches any run of spaces, including none.

In that case this apply will work:

apply(x,MARGIN=1, FUN=function(x) {grepl(gsub(" "," *",x[2]),x[1])})
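A possible alternative to rewriting v2 as a regex is to normalise whitespace on both sides and then match literally. The sketch below is only an illustration under that assumption: squish() is a hypothetical helper defined here (not a base R function), and the three rows are taken from rows 1, 4 and 23 of the question's data.

```r
# Three rows taken from the question's data (rows 1, 4 and 23)
v1_text <- "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation"
x <- data.frame(
  v1 = rep(v1_text, 3),
  v2 = c("Georgia v Russian Federation",
         " New Zealand v France",
         " Georgia v Russian Federation"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: trim the ends and collapse runs of whitespace to one space
squish <- function(s) gsub("\\s+", " ", trimws(s))

# Compare the normalised strings row by row; fixed = TRUE means v2 is matched
# literally, so regex metacharacters in party names cannot change the result
x$v3 <- vapply(
  seq_len(nrow(x)),
  function(i) grepl(squish(x$v2[i]), squish(x$v1[i]), fixed = TRUE),
  logical(1)
)

x$v3
# [1]  TRUE FALSE  TRUE
```

Unlike the " *" pattern, this version still requires at least one space between words, so "Georgiav" would not match "Georgia v".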

3 Comments

I don't think you are right. In both the 1st and the 23rd row, v1 contains two spaces after "Georgia" and after the "v", whereas v2 does not contain the double spaces. I will add an explanation to the answer about the spaces and how to handle them.
Can you post the function you used here? And maybe recheck the data you posted? I created the data frame you posted in the question, applied the same function, and got TRUE on rows 1 and 23; everything else is FALSE.
I reset my memory and it works - thanks! You saved me a ton of time. Glad the answer was so straightforward. I was also able to implement fuzzy string matching using the agrepl() function: apply(x, MARGIN=1, FUN=function(x) { agrepl(gsub(" "," *", x[2]), x[1], max.distance=.25)})
