I have a DataFrame, df, that currently looks something like this:

Car Name      Number
Adam Leaf     9
Adamm Leaf    9
Adam Lea      NaN
Adam-Leaf     NaN
Adam/Leaf     9
Claire-Green  NaN
Cliare Green  3
Claire Green  3
Claire Gren   NaN
Claire/Green  3

I am trying to collapse the misspelled variants of each name so that the result looks like this:

Car Name      Number
Adam Leaf     9
Claire Green  3
  • Why did you mark r and python? Try to be more specific. Also you have to be more precise and explain what you mean by similar names. Commented Oct 17, 2019 at 14:20
  • By similar names I mean incorrect variants of a name (e.g. an extra letter, extra symbols, missing letters). Commented Oct 17, 2019 at 14:23
  • Do fuzzy matching on your entries. Documentation here. Review the results checking which threshold you'll use to tag as "incorrect variants" and retain 1 to create your desired dataframe output. Commented Oct 17, 2019 at 15:54
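The fuzzy-matching suggestion in the last comment can be sketched with the standard library alone. This is only an illustration under stated assumptions: it uses difflib.SequenceMatcher rather than any particular fuzzy-matching package, the helper names (normalize, similarity, group_variants) are hypothetical, and the 0.8 threshold is a guess that would need to be reviewed against real data.

```python
from difflib import SequenceMatcher

def normalize(name):
    # Treat separators such as "-" and "/" as spaces, collapse whitespace, lowercase.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name)
    return " ".join(cleaned.lower().split())

def similarity(a, b):
    # Ratio in [0, 1]; 1.0 means the normalized strings are identical.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def group_variants(names, threshold=0.8):
    # Greedy grouping: each name joins the first group whose first member
    # is similar enough, otherwise it starts a new group.
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

names = [
    "Adam Leaf", "Adamm Leaf", "Adam Lea", "Adam-Leaf", "Adam/Leaf",
    "Claire-Green", "Cliare Green", "Claire Green", "Claire Gren", "Claire/Green",
]
groups = group_variants(names)
for group in groups:
    print(group[0], len(group))
```

With the sample names from the question, the threshold of 0.8 separates the Adam variants from the Claire variants. A lower threshold merges more aggressively, so it is worth inspecting the groups before dropping any rows.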

1 Answer

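Here is one way, using the Soundex encoding from the jellyfish package as the grouping key:

import jellyfish

# Group rows by the Soundex code of the name; first() keeps the first
# non-null value of each column within every group.
s = df.groupby(df['Car Name'].apply(jellyfish.soundex)).first()
print(s)
              Car Name  Number
Car Name                      
A354         Adam Leaf     9.0
C462      Claire-Green     3.0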
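Note that first() keeps whichever spelling happens to come first in each group, which is why the second row reads Claire-Green rather than Claire Green. One possible refinement, sketched below in plain Python, is to pick the most frequent spelling in each group after replacing separators with spaces. The helpers clean and canonical_name are hypothetical names, not part of jellyfish or pandas, and the sketch assumes the variants have already been grouped (for example by their Soundex code).

```python
from collections import Counter

def clean(name):
    # Replace non-alphanumeric separators ("-", "/") with spaces and collapse whitespace.
    return " ".join("".join(ch if ch.isalnum() else " " for ch in name).split())

def canonical_name(variants):
    # Choose the spelling that occurs most often once separators are unified.
    counts = Counter(clean(v) for v in variants)
    return counts.most_common(1)[0][0]

adam = ["Adam Leaf", "Adamm Leaf", "Adam Lea", "Adam-Leaf", "Adam/Leaf"]
claire = ["Claire-Green", "Cliare Green", "Claire Green", "Claire Gren", "Claire/Green"]

print(canonical_name(adam))    # Adam Leaf
print(canonical_name(claire))  # Claire Green
```

This relies on the correct spelling being the most common one in its group, which holds for the sample data but is an assumption for any real dataset.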
4 Comments

Cool package name!
This gives the correct car number; however, it seems to select only the first variant of the name.
TIL python jellyfish
Avoid using groupby functions as that would result in full shuffling of the data before applying group by function. Try reduceByKey or similar instead.
