Python: Replace df1.col values with df2.col2 values

Question

I have two data frames df1 and df2. In df1 I have 50 columns and in df2 I have 50+ columns. What I want to achieve is that In df1 I have 13000 rows and a column name subject where names of all subjects are given. In df2 I have 250 rows and along 50+ I have two columns named subject code and subject_name.

        Here is an example of my datasets:

        df1 = 
        index     subjects
        0         Biology
        1         Physicss
        2         Chemistry
        3         Biology
        4         Physics
        5         Physics
        6         Biolgy

    df2 = 
        index     subject_name    subject_code
        0         Biology         BIO
        1         Physics         PHY
        2         Chemistry       CHE
        3         Medical         MED
        4         Programming     PRO
        5         Maths           MAT
        6         Literature      LIT 

My desired output in df1 (after replacing subject_name and fixing the spelling errors) is:
            index     subjects        subject_code
            0         Biology         BIO
            1         Physics         PHY
            2         Chemistry       CHE
            3         Biology         BIO
            4         Physics         PHY
            5         Physics         PHY
            6         Biology         BIO

What happens at my end is that I want to merge all subject values in df1 with values in df2 subject name values. In df1 there are around 500 rows where I get NAN after I merge both the columns in one as in these 500 rows there is some difference in the spellings of the subject. I have tried solution given at the following links but didn't work for me: replace df index values with values from a list but ignore empty strings

Python pandas: replace values multiple columns matching multiple columns from another dataframe

            Here is my code:

            df_merged = pd.merge(df1_subject,df2_subjectname, left_on='subjects', right_on='subject_name')
        df_merged.head()

Can anyone tell me how I can fix this issue as I have already spend 8 hours on this issue but am unable tor resolve it.

Cheers

Did you inspect those rows in df1, is there something different in the spelling? ...df1 column name is subjects but you used left_on = 'subject' in the merge argument. — wwii
– wwii, Commented Sep 28, 2016 at 16:41
That was a typo. I have fixed the spelling error and correct one is left_on = 'subjects' — Baig
– Baig, Commented Sep 28, 2016 at 16:45
Have you tried to merge based on the first three characters being the sane? — wwii
– wwii, Commented Sep 30, 2016 at 5:02
HI wwii, actually that will not work as in my dataframe I have few subjects like web design, web development, etc and if I match first three characters then it will take wrong values. — Baig
– Baig, Commented Sep 30, 2016 at 13:29
Have you characterized the misspellings? How many unique misspellings for each term are there? - something like df1.subjects not in df2.subject_names then a groupby. Maybe you can automate correcting the misspellings then do a merge. — wwii
– wwii, Commented Oct 7, 2016 at 0:18

conor · Accepted Answer · 2016-09-28 18:06:42Z

0

One of the issues you have is the incorrect spellings. You can try to harmonise the spelling of the subject across both dataframes using the difflib module and its get_close_matches method.

Using this code will return the closest matching subject for each match in df1 and and df2. df1's columns will be updated to reflect this. Therefore, even if the subject name is spelled incorrectly, it will now have the same spelling in both dataframes.

import pandas as pd
import difflib

df2['subject_name'] = df2.subject_name.map(lambda x: difflib.get_close_matches(x, df1.subject)[0])

After this you can try to merge. It may resolve your issue, but it would be easier to fix if you provide a reproducible example.

edited Sep 28, 2016 at 18:06

answered Sep 28, 2016 at 17:48

conor

1,28710 silver badges7 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

Baig Over a year ago

Hi Conor, If you don't mind can you please tell me what this x is inside get_close_matches(x,...) as I am getting error list index out of range

conor Over a year ago

try this instead df2.subject_name.map(lambda x: difflib.get_close_matches(x, df1.subject[0])

conor Over a year ago

The x is simply is a place holder for the variable subject_name in df2.

Baig Over a year ago

Sorry Conor with this one I am getting unexpected EOF while parsing error

conor Over a year ago

That type of error relates to syntax, also there was a small mistake in my code. This line df2.subject_name.map(lambda x: difflib.get_close_matches(x, df1.subject)[0]) works in a sample dataset that I set up. It is worth one more try :)

wwii · Accepted Answer · 2016-10-08 05:53:38Z

Correct the spelling then merge...

import pandas as pd
import operator, collections

df1 = pd.DataFrame.from_items([("subjects",
                                ["Biology","Physicss","Phsicss","Chemistry",
                                 "Biology","Physics","Physics","Biolgy","navelgazing"])])
df2 = pd.DataFrame.from_items([("subject_name",
                                ["Biology","Physics","Chemistry","Medical",
                                 "Programming","Maths","Literature"]),
                               ("subject_code",
                                ["BIO","PHY","CHE","MED","PRO","MAT","LIT"])])

Find the misspellings:

misspelled = set(df1.subjects) - set(df2.subject_name)

Find the subject that matches the misspelling best and create a dictionary -> {mis_sp : subject_name}

difference = operator.itemgetter(1)
subject = operator.itemgetter(0)
def foo1(word, candidates):
    '''Returns the most likely match for a misspelled word
    '''
    temp = []
    for candidate in candidates:
        count1 = collections.Counter(word)
        count2 = collections.Counter(candidate)
        diff1 = count1 - count2
        diff2 = count2 - count1
        diff = sum(diff1.values())
        diff += sum(diff2.values())
        temp.append((candidate, diff))
    return subject(min(temp, key = difference))

def foo2(words):
    '''Yields (misspelled-word, corrected-word) tuples from misspelled words'''
    for word in words:
        name = foo1(word, df2.subject_name)
        if name:
            yield (word, name)

d = dict(foo2(misspelled))

Correct all the misspellings in df1

def foo3(thing):
    return d.get(thing, thing)

df3 = df1.applymap(foo3)

Merge

df2 = df2.set_index("subject_name")
df3 = df3.merge(df2, left_on = "subjects", right_index = True, how = 'left')

foo1 might possibly be sufficient for this purpose, but there are better, more sophisticated, algorithms to correct spelling. maybe, http://norvig.com/spell-correct.html

Just read @conner's solution. I didn't know difflib was there so a better foo1 would be,

def foo1(word, candidates):
    try:
        return difflib.get_close_matches(word, candidates, 1)[0]
    except IndexError as e:
        # there isn't a close match
        return None

Collectives™ on Stack Overflow

Python: Replace df1.col values with df2.col2 values

2 Answers 2

5 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

5 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related