I have two vectors of type character in R.

I want to compare a reference list to a raw character list using jarowinkler and assign a % similarity score. For example, if I have 10 reference items and 20 raw data items, I want to get the best score for each comparison and what the algorithm matched it to (so 2 vectors of 10). If I have 8 raw data items and 10 reference items, I should end up with 2 result vectors of only 8 items each, holding the best match and its score per item:

item, match, matched_to
ice, 78, ice-cream

Below is my code which isn't much to look at.

NumItems.Raw = length(words)
NumItems.Ref = length(Ref.Desc)

for (item in words) 
{
  for (refitem in Ref.Desc)
  {
    jarowinkler(refitem,item)

    # Find Best match Score
    # Find Best Item in reference table
    # Add both items to vectors
    # decrement NumItems.Raw
    # Loop
  }
} 
  • Perhaps the RecordLinkage package, and a function that builds from this? compareJW <- function(string, vec, cutoff) { require(RecordLinkage); jarowinkler(string, vec) > cutoff } Commented Mar 17, 2015 at 15:36
  • What are your criteria for matching if there are multiple best fits with the same jarowinkler score? Do you pick the first match, or use a random selection of the best matches? Commented Mar 17, 2015 at 15:39

2 Answers

Using a toy example:

library(RecordLinkage)
library(dplyr)

ref <- c('cat', 'dog', 'turtle', 'cow', 'horse', 'pig', 'sheep', 'koala','bear','fish')
words <- c('dog', 'kiwi', 'emu', 'pig', 'sheep', 'cow','cat','horse')

wordlist <- expand.grid(words = words, ref = ref, stringsAsFactors = FALSE)
wordlist %>% group_by(words) %>% mutate(match_score = jarowinkler(words, ref)) %>%
summarise(match = match_score[which.max(match_score)], matched_to = ref[which.max(match_score)])

gives

 words     match matched_to
1   cat 1.0000000        cat
2   cow 1.0000000        cow
3   dog 1.0000000        dog
4   emu 0.5277778       bear
5 horse 1.0000000      horse
6  kiwi 0.5350000      koala
7   pig 1.0000000        pig
8 sheep 1.0000000      sheep

Edit: In response to the OP's comment: the last command uses the pipeline approach from dplyr. It takes every combination of raw word and reference (built by expand.grid), groups the rows by raw word, and adds a match_score column holding the jarowinkler score. It then summarises each group down to its highest match score (indexed by which.max(match_score)) and the reference at that same index.
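For anyone who finds the pipeline hard to follow, here is a minimal sketch of the same idea written as an explicit loop, closer to the pseudocode in the question. It reuses the toy `ref` and `words` vectors; the names `match`, `matched_to` and `result` are just illustrative choices, and the score is scaled to a rounded percentage on the assumption that this is what the question's "78" means.

```r
library(RecordLinkage)

ref <- c('cat', 'dog', 'turtle', 'cow', 'horse', 'pig', 'sheep', 'koala', 'bear', 'fish')
words <- c('dog', 'kiwi', 'emu', 'pig', 'sheep', 'cow', 'cat', 'horse')

# Pre-allocate one slot per raw word
match <- numeric(length(words))
matched_to <- character(length(words))

for (i in seq_along(words)) {
  # Repeat the raw word so both arguments have equal length,
  # mirroring the rows that expand.grid builds for one word
  scores <- jarowinkler(rep(words[i], length(ref)), ref)

  # which.max returns the first maximum, so ties go to the
  # reference item that appears earliest in `ref`
  best <- which.max(scores)

  match[i] <- scores[best]
  matched_to[i] <- ref[best]
}

# Assumption: report the score as a rounded percentage
result <- data.frame(item = words,
                     match = round(100 * match),
                     matched_to = matched_to,
                     stringsAsFactors = FALSE)
print(result)
```

Note that this keeps the rows in the original order of `words`, whereas the grouped summary sorts them alphabetically.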

4 Comments

Hi Jim M. Thank you for your answer; I never knew about the expand.grid function. Can you explain to me how the last command works?
Thanks @Jim M, I got it working with your code with no problems at all, but came up against the problem outlined here, which is basically down to NSL being used in the Group and the jarowinkler function. I will still play around with it. Your answer is excellent and opened up new areas of R for me. Thank you very much.
@Jim M, can I achieve this for huge data frames as well?
@KRU: It would depend on what you mean by huge. I believe the limit would be a data.frame of 2^31 - 1 rows that could be accessed at once, otherwise the data may have to be subdivided into chunks for analysis.
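To illustrate the chunking idea, here is a rough sketch that processes the raw words in blocks, so the full words-by-references grid never has to exist at once. `best_match_chunked` and `chunk_size` are hypothetical names made up for this example, not part of RecordLinkage, and a sensible chunk size would depend on available memory.

```r
library(RecordLinkage)

# Hypothetical helper: best reference match per raw word, one chunk at a time
best_match_chunked <- function(words, ref, chunk_size = 1000) {
  # Split the raw words into consecutive blocks of at most chunk_size items
  chunks <- split(words, ceiling(seq_along(words) / chunk_size))

  pieces <- lapply(chunks, function(chunk) {
    # Score matrix for this chunk only: rows = raw words, columns = references
    scores <- outer(chunk, ref, jarowinkler)

    # Column index of the best score in each row (first one wins on ties)
    idx <- max.col(scores, ties.method = "first")

    data.frame(item = chunk,
               match = scores[cbind(seq_along(chunk), idx)],
               matched_to = ref[idx],
               stringsAsFactors = FALSE)
  })

  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

ref <- c('cat', 'dog', 'turtle', 'cow', 'horse', 'pig', 'sheep', 'koala', 'bear', 'fish')
words <- c('dog', 'kiwi', 'emu', 'pig', 'sheep', 'cow', 'cat', 'horse')

# A tiny chunk size, just to exercise the chunking on the toy data
result <- best_match_chunked(words, ref, chunk_size = 3)
print(result)
```

Memory use then scales with chunk_size times the number of references rather than with the whole grid.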
There is a package which already implements the Jaro-Winkler distance.

> install.packages("stringdist")
> library(stringdist)
> 1-stringdist('ice','ice-cream',method='jw')
[1] 0.7777778

2 Comments

Hi @Ken Yeoh, thank you for your reply. In essence this gives me the same thing as jarowinkler(refitem,item), but my problem is that when I have many things to reference against, I only want it to return the top match and the percentage match. So if I have 1 item I wish to check and 8 items to check it against, I want to get back just one answer, i.e. the highest match and which item in the reference table it matched.
Ah, sorry I misunderstood your question. You can do this with stringdistmatrix, but I think @Jim M's answer is easier to implement.
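For completeness, a sketch of what the stringdistmatrix route might look like, using the toy vectors from the other answer. One assumption to flag: as far as I know, method = 'jw' in stringdist defaults to p = 0, which is the plain Jaro distance, so p = 0.1 (the usual Winkler prefix weight) is passed explicitly here. The `sim`, `best_idx` and `result` names are only illustrative.

```r
library(stringdist)

ref <- c('cat', 'dog', 'turtle', 'cow', 'horse', 'pig', 'sheep', 'koala', 'bear', 'fish')
words <- c('dog', 'kiwi', 'emu', 'pig', 'sheep', 'cow', 'cat', 'horse')

# Distance matrix with one row per raw word and one column per reference;
# 1 - distance turns it into a similarity
sim <- 1 - stringdistmatrix(words, ref, method = "jw", p = 0.1)

# Column of the highest similarity in each row (first one wins on ties)
best_idx <- apply(sim, 1, which.max)

result <- data.frame(item = words,
                     match = round(100 * sim[cbind(seq_along(words), best_idx)]),
                     matched_to = ref[best_idx],
                     stringsAsFactors = FALSE)
print(result)
```

Scores for inexact matches may differ slightly from RecordLinkage's jarowinkler, since the two packages need not share the same defaults.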
