
I have quite a big dataset with two text variables, A and B, where length(A) <= length(B). B is either A with some extra characters (in no particular order) or something entirely different from A. I need to create a new variable in my data table under this condition: if B contains A, then C = TRUE. I believe partial string matching suits this better than exact string comparison.

My dataframe example:

Home      Pick  
Barc      Barcelona 0  
F Munch   FC munchen   
Lakers    Portland

I need to add a new variable, Side, like this:

Home     Pick         Side    
Barc     Barcelona 0  True  
F Munch  FC munchen   True  
Lakers   Portland     False  

I am trying to solve it with this:

data_n$Side <- stringMatch(data_n$Home, data_n$Pick, normalize = "YES")

but it gives all negative results. However,

stringMatch('barcel', 'Barcelona 0', normalize='YES')    

gives the needed answer. Any hints as to where I am making a mistake?

  • You should include any extra packages that you are using. agrep is a useful base solution for partial matching. Commented May 24, 2014 at 14:55
  • Doesn't stringMatch give a value within the interval 0 to 1? By the way, using your dataset, I got a value of 0.444 for the dataframe and 0.545 for the single example. Commented May 24, 2014 at 15:01
  • Hi, thanks for the answer. It doesn't actually make a big difference which function I use (agrep, stringMatch, or any other). The problem is that I cannot make it work on my 17000-row data file. How should I do that? :( Commented May 25, 2014 at 12:17

1 Answer

I'm not sure of its reliability, but agrepl, the approximate (fuzzy) pattern matching function, seems to work on your data. Assuming dat is your original data:

## read in the original data (tab-separated; keep the columns as character)
txt <- "Home\tPick
Barc\tBarcelona 0
F Munch\tFC munchen
Lakers\tPortland"
dat <- read.table(text = txt, sep = "\t", header = TRUE, stringsAsFactors = FALSE)
dat
##      Home        Pick
## 1    Barc Barcelona 0
## 2 F Munch  FC munchen
## 3  Lakers    Portland

Then, using agrepl:

d1 <- dat[, 1]
d2 <- dat[, 2]
## agrepl() takes a single pattern, so compare the two columns row by row
dat$Side <- sapply(seq_len(nrow(dat)), function(i) {
  agrepl(d1[i], d2[i], ignore.case = TRUE)
})
dat
##      Home        Pick  Side
## 1    Barc Barcelona 0  TRUE
## 2 F Munch  FC munchen  TRUE
## 3  Lakers    Portland FALSE
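
For a larger table, such as the 17000-row file mentioned in the comments, the same row-wise comparison could also be written with mapply, which walks both columns in parallel. This is only a sketch: the toy data frame is rebuilt here from the example in the question so the snippet runs on its own, and the column names Home and Pick are assumed to match your real data.

```r
# Toy data copied from the question; stringsAsFactors = FALSE keeps text as character
dat <- data.frame(
  Home = c("Barc", "F Munch", "Lakers"),
  Pick = c("Barcelona 0", "FC munchen", "Portland"),
  stringsAsFactors = FALSE
)

# agrepl() accepts only one pattern at a time, so pair the columns row by row:
# each Home value is the pattern, the Pick value on the same row is the target
dat$Side <- mapply(
  function(home, pick) agrepl(home, pick, ignore.case = TRUE),
  dat$Home, dat$Pick,
  USE.NAMES = FALSE
)

dat$Side
```

Passing the whole Home column as the pattern in a single agrepl call would not do this, because agrepl matches one pattern against a vector of targets rather than pairing the two vectors element by element.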

2 Comments

Thanks a lot, it seems to be working at first sight. As I understand it, agrepl is a base function and no additional packages are needed? I get an error: Error in FUN(1:17076[[1L]], ...) : could not find function "agrepl"
Just updated R and it works fine now. Not 100% accurate, but it still helped a lot! Thanks.
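
On the accuracy point: agrepl is approximate by design, so it can accept pairs that a strict "B contains A" test would reject. A rough way to see the difference on one pair from the question is sketched below. The variable names are made up for illustration, and max.distance = 0.1 is simply the documented default written out explicitly; the right tolerance for real data is something you would have to tune.

```r
home <- "F Munch"
pick <- "FC munchen"

# Strict containment: case-insensitive literal substring test, no edits allowed
strict <- grepl(tolower(home), tolower(pick), fixed = TRUE)

# Approximate containment: tolerates a small number of edits
# (max.distance = 0.1 is a fraction of the pattern length)
loose <- agrepl(home, pick, max.distance = 0.1, ignore.case = TRUE)

c(strict = strict, loose = loose)
```

Here the strict test fails because of the extra "C" in "FC", while the approximate test accepts it. Raising max.distance makes matching more forgiving but also lets more unrelated pairs through.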
