Remove similar character string duplicates from a dataframe

Question

I have a dataframe df that currently looks something like this:

Car Name      Number
Adam Leaf     9
Adamm Leaf    9
Adam Lea      NaN
Adam-Leaf     NaN
Adam/Leaf     9
Claire-Green  NaN
Cliare Green  3
Claire Green  3
Claire Gren   NaN
Claire/Green  3

I am trying to remove the variations to get something like this:

Car Name      Number
Adam Leaf     9
Claire Green  3

Why did you tag both r and python? Try to be more specific. You also need to be more precise and explain what you mean by similar names.
– Cettt, Commented Oct 17, 2019 at 14:20
Similar names meaning names which are incorrect variants, i.e. an extra letter, extra symbols, missing letters, etc.
– user11555536, Commented Oct 17, 2019 at 14:23
Do fuzzy matching on your entries. Documentation here. Review the results to decide which threshold you'll use to tag entries as "incorrect variants", and retain one per group to create your desired dataframe output.
– Joe, Commented Oct 17, 2019 at 15:54
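
The fuzzy-matching suggestion in the last comment can be sketched with the standard library alone. This is a minimal illustration, not the approach the commenter linked to: the helper names (normalize, similarity, group_similar) and the 0.8 threshold are assumptions made for this example, and the threshold should be tuned by reviewing the resulting groups on real data.

```python
import re
from difflib import SequenceMatcher

names = [
    "Adam Leaf", "Adamm Leaf", "Adam Lea", "Adam-Leaf", "Adam/Leaf",
    "Claire-Green", "Cliare Green", "Claire Green", "Claire Gren", "Claire/Green",
]

def normalize(name):
    # Lowercase and turn punctuation such as "-" or "/" into a single space.
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()

def similarity(a, b):
    # Ratio in [0, 1]; 1.0 means the normalized strings are identical.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def group_similar(names, threshold=0.8):
    # Greedy grouping: compare each name with the first member of every
    # existing group and join the first group that is similar enough.
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

groups = group_similar(names)
for group in groups:
    print(group)
```

The greedy pass depends on row order and compares every name against every group, so it suits small tables; for the sample data it yields one group per person.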

BENY · Accepted Answer · 2019-10-17 14:23:22Z

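Here is one way, using the jellyfish package: group the rows by the Soundex code of each name and keep the first non-null value of each column.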

import jellyfish

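# Rows whose names share a Soundex code fall into the same group;
# first() keeps the first non-null value of each column per group.
s = df.groupby(df['Car Name'].apply(jellyfish.soundex)).first()
print(s)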
              Car Name  Number
Car Name                      
A354         Adam Leaf     9.0
C462      Claire-Green     3.0
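
To see why every variant lands under the same key, here is a simplified pure-Python Soundex. It is a stand-in written for illustration, not jellyfish's implementation, and it assumes non-letters are simply skipped, which jellyfish may handle differently.

```python
# Map each consonant to its Soundex digit; vowels, h, w and y get no digit.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    # Assumption for this sketch: spaces, "-" and "/" are ignored.
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = code
    # Pad with zeros and keep the first letter plus three digits.
    return (result + "000")[:4]

for name in ["Adam Leaf", "Adamm Leaf", "Adam Lea", "Claire-Green", "Cliare Green"]:
    print(name, soundex(name))
```

Because the code keeps only the first letter and three consonant digits, doubled letters, swapped vowels and a dropped trailing letter often leave the key unchanged. The same coarseness means unrelated names can collide, so the groups are worth reviewing.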


4 Comments

Yaakov Bressler Over a year ago

Cool package name!

user11555536 Over a year ago

This gives the correct car number; however, it seems to select only the first variant of the name.

piRSquared Over a year ago

TIL python jellyfish

skjagini Over a year ago

Avoid using groupby functions as that would result in full shuffling of the data before applying group by function. Try reduceByKey or similar instead.
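
On the comment above about only the first variant being selected: once rows are grouped, a representative spelling can be chosen by a rule other than "first row". The sketch below is one possible heuristic, assuming the most common spelling after normalizing punctuation is the correct one; pick_representative and normalize are hypothetical helpers for this example, not part of pandas or jellyfish.

```python
import math
import re
from collections import Counter

def normalize(name):
    # Lowercase and turn punctuation such as "-" or "/" into a single space.
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()

def pick_representative(rows):
    # rows: (name, number) pairs already judged to be variants of one name.
    counts = Counter(normalize(name) for name, _ in rows)
    top_form, _ = counts.most_common(1)[0]
    candidates = [name for name, _ in rows if normalize(name) == top_form]
    # Prefer a spelling made only of letters and spaces, if there is one.
    clean = [name for name in candidates if re.fullmatch(r"[A-Za-z ]+", name)]
    name = (clean or candidates)[0]
    # Keep the first number that is not NaN, mirroring groupby(...).first().
    number = next((num for _, num in rows if not math.isnan(num)), math.nan)
    return name, number

adam = [("Adam Leaf", 9), ("Adamm Leaf", 9), ("Adam Lea", math.nan),
        ("Adam-Leaf", math.nan), ("Adam/Leaf", 9)]
claire = [("Claire-Green", math.nan), ("Cliare Green", 3), ("Claire Green", 3),
          ("Claire Gren", math.nan), ("Claire/Green", 3)]

print(pick_representative(adam))
print(pick_representative(claire))
```

The heuristic relies on the correct spelling outnumbering each individual misspelling, which holds for the sample data but is not guaranteed in general.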
