How to replace string in python from a list of possible strings

Question

I have a column of data that looks like this:

import pandas as pd

df = pd.DataFrame({'Ex1': ['apple', 'apple1', 'Peear', 'peAr', 'b$nana', 'Bananas'],
                   'Ex2': ['Applet', 'banan', 'apples', 'PAIR', 'banana', 'apple'],
                   'Ex3': ['Pears', 'Banaa', 'Apple', 'apple1', 'pear', 'abanana']})
df

And then I have three arrays that identify misspellings of fruit types as the canonical fruit type:

apple = ['apple1','Applet','apples','Apple']
pear = ['Peear','peAr','PAIR','Pears','p3ar']
banana = ['b$nana','Bananas','banan','Banaa','abanana']

How can I iterate over each of the columns to change the misspelled fruits into the correct ones? That is, the final data frame should look like this:

    Ex1     Ex2     Ex3
0   apple   apple   pear
1   apple   banana  banana
2   pear    apple   apple
3   pear    pear    apple
4   banana  banana  pear
5   banana  apple   banana

I know I could achieve this outcome with the following code:

replacements = {
    "apple":'apple1',
    "apple":'Applet',
...}

df['Ex1'].replace(replacements, inplace=True)

But I have a list of 1000+ rows, and I don't want to go through and type each replacement into replacements by hand because that would take a lot of time.

Any suggestions for doing this in a way that I can use my apple, pear, and banana variables as-is?

Is your example dict replacements backwards? Are you just asking how to construct it programmatically?
– Davis Herring, Commented Apr 20, 2019 at 22:23
I'm not sure what you mean by the first question, but I would like to program the outcome dataframe given what I've already coded up with the apple, banana and pear variables.
– JAG2024, Commented Apr 20, 2019 at 22:33
Your “I could achieve this outcome” example has the same key twice in a dictionary. Are you trying to avoid using such a dictionary, or just trying to make one from the separate list variables above?
– Davis Herring, Commented Apr 20, 2019 at 22:39
Ah, right. I know I could assign each wrong spelling to the right fruit type using that replacements dictionary. But it would take a long time to type all those out. So I'm not trying to avoid a dictionary like that, but it would be good to utilize the list variables above.
– JAG2024, Commented Apr 20, 2019 at 22:45

amanb · Accepted Answer · 2019-04-20 23:48:51Z

4

A more accurate solution would be to compute the similarity ratio between the misspelled word and each correctly spelled word. Among the few libraries available in Python, I used the Levenshtein library, whose ratio function returns that similarity ratio. Getting the ratio is simple, for example:

from Levenshtein import ratio
ratio('banana', 'Banaa')
#0.7272727272727273

Now, given the following list of correct words, correct_words, the ratio is computed between each word in the data frame and each word in correct_words.

correct_words = ['apple', 'pear', 'banana']

This means each element has three ratio values. However, we are only concerned with the maximum ratio and the correct word associated with it. The similarity function below builds an intermediate dictionary with the correct words as keys and their ratios as values, and returns the key with the maximum value. Finally, we map the function over each element of the dataframe.

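from Levenshtein import ratio
import operator

def similarity(x):
    # Score x against every correct word, then return the best-scoring word.
    scores = {}
    for word in correct_words:
        scores[word] = ratio(x, word)
    return max(scores.items(), key=operator.itemgetter(1))[0]


# On pandas >= 2.1, applymap is deprecated in favor of df.map(similarity).
df.applymap(similarity)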
    Ex1     Ex2     Ex3
0   apple   apple   pear
1   apple   banana  banana
2   pear    apple   apple
3   pear    apple   apple
4   banana  banana  pear
5   banana  apple   banana
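
Note that the output above maps 'PAIR' to apple rather than pear. The likely reason is that the ratio is case-sensitive: 'PAIR' shares no characters with any of the lowercase candidates, so all three ratios are zero and the first word wins the tie. One possible refinement is to lowercase the input before scoring. The sketch below is only an illustration of that idea: it swaps in the standard library's difflib.SequenceMatcher for the Levenshtein package (so the scores differ from the ones shown above), and the name similarity_ci is hypothetical.

```python
from difflib import SequenceMatcher

correct_words = ['apple', 'pear', 'banana']

def similarity_ci(x):
    """Return the correct word most similar to x, ignoring case."""
    lowered = x.lower()
    # SequenceMatcher.ratio() is a stand-in for Levenshtein.ratio here.
    return max(correct_words,
               key=lambda word: SequenceMatcher(None, lowered, word).ratio())

samples = ['PAIR', 'Banaa', 'Applet']
normalized = [similarity_ci(s) for s in samples]
print(normalized)
```

The same function could be passed to df.applymap in place of similarity. Like the original, it always returns one of the correct words, even for input that resembles none of them, so a minimum-ratio cutoff may be worth adding for real data.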

edited Apr 20, 2019 at 23:48

answered Apr 20, 2019 at 23:32


1 Comment

JAG2024 Over a year ago

Thanks for this answer, but I was hoping to use the existing lists of fruit types for 100% accuracy. This answer is good, though, if in the future I don't have access to set lists of incorrect spellings.

Davis Herring · Accepted Answer · 2019-04-20 23:41:41Z

2

The simple (perhaps even simplistic) approach involving the handwritten lists of misspellings can be automated merely by constructing the dictionary from the lists:

repl = {wrong: right
        for right, wrongs in [("apple", apple), ("pear", pear), ("banana", banana)]
        for wrong in wrongs}

The list of correct names and misspellings for each can itself be constructed automatically if they reside in some data structure like a containing dictionary. (It’s possible to use globals() or locals() as that dictionary, but then you have to filter out the extraneous entries.)
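
As a minimal sketch of the containing-dictionary idea, assuming the lists from the question are stored under their correct names (the names misspellings, data and cleaned are hypothetical), the replacement dictionary can be built by inverting that container. To keep the example free of dependencies, the lookup is applied to plain lists holding the same values as the question's columns.

```python
# Hypothetical container: correct name -> known misspellings (from the question).
misspellings = {
    "apple": ["apple1", "Applet", "apples", "Apple"],
    "pear": ["Peear", "peAr", "PAIR", "Pears", "p3ar"],
    "banana": ["b$nana", "Bananas", "banan", "Banaa", "abanana"],
}

# Invert it: every misspelling becomes a key that points at its correct name.
repl = {wrong: right
        for right, wrongs in misspellings.items()
        for wrong in wrongs}

# Same values as the question's DataFrame columns, kept as plain lists here.
data = {
    "Ex1": ["apple", "apple1", "Peear", "peAr", "b$nana", "Bananas"],
    "Ex2": ["Applet", "banan", "apples", "PAIR", "banana", "apple"],
    "Ex3": ["Pears", "Banaa", "Apple", "apple1", "pear", "abanana"],
}

# Values already spelled correctly are not in repl, so they pass through unchanged.
cleaned = {col: [repl.get(value, value) for value in values]
           for col, values in data.items()}
print(cleaned)
```

With the DataFrame from the question, the equivalent step would be df = df.replace(repl), which applies the mapping to every column. If the same misspelling were listed under two fruits, the later entry would silently win, so the lists should not overlap.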

answered Apr 20, 2019 at 23:41


1 Comment

JAG2024 Over a year ago

This is exactly what I was hoping for: to use the existing lists in a dictionary as replacements. Thanks very much.
