0

I have two data frames df1 and df2. In df1 I have 50 columns and in df2 I have 50+ columns. What I want to achieve is that In df1 I have 13000 rows and a column name subject where names of all subjects are given. In df2 I have 250 rows and along 50+ I have two columns named subject code and subject_name.

        Here is an example of my datasets:

        df1 = 
        index     subjects
        0         Biology
        1         Physicss
        2         Chemistry
        3         Biology
        4         Physics
        5         Physics
        6         Biolgy

    df2 = 
        index     subject_name    subject_code
        0         Biology         BIO
        1         Physics         PHY
        2         Chemistry       CHE
        3         Medical         MED
        4         Programming     PRO
        5         Maths           MAT
        6         Literature      LIT 

My desired output in df1 (after replacing subject_name and fixing the spelling errors) is:
            index     subjects        subject_code
            0         Biology         BIO
            1         Physics         PHY
            2         Chemistry       CHE
            3         Biology         BIO
            4         Physics         PHY
            5         Physics         PHY
            6         Biology         BIO

What happens at my end is that I want to merge all subject values in df1 with values in df2 subject name values. In df1 there are around 500 rows where I get NAN after I merge both the columns in one as in these 500 rows there is some difference in the spellings of the subject. I have tried solution given at the following links but didn't work for me: replace df index values with values from a list but ignore empty strings

Python pandas: replace values multiple columns matching multiple columns from another dataframe

            Here is my code:

            df_merged = pd.merge(df1_subject,df2_subjectname, left_on='subjects', right_on='subject_name')
        df_merged.head()

Can anyone tell me how I can fix this issue as I have already spend 8 hours on this issue but am unable tor resolve it.

Cheers

5
  • Did you inspect those rows in df1, is there something different in the spelling? ...df1 column name is subjects but you used left_on = 'subject' in the merge argument. Commented Sep 28, 2016 at 16:41
  • That was a typo. I have fixed the spelling error and correct one is left_on = 'subjects' Commented Sep 28, 2016 at 16:45
  • Have you tried to merge based on the first three characters being the sane? Commented Sep 30, 2016 at 5:02
  • HI wwii, actually that will not work as in my dataframe I have few subjects like web design, web development, etc and if I match first three characters then it will take wrong values. Commented Sep 30, 2016 at 13:29
  • Have you characterized the misspellings? How many unique misspellings for each term are there? - something like df1.subjects not in df2.subject_names then a groupby. Maybe you can automate correcting the misspellings then do a merge. Commented Oct 7, 2016 at 0:18

2 Answers 2

0

One of the issues you have is the incorrect spellings. You can try to harmonise the spelling of the subject across both dataframes using the difflib module and its get_close_matches method.

Using this code will return the closest matching subject for each match in df1 and and df2. df1's columns will be updated to reflect this. Therefore, even if the subject name is spelled incorrectly, it will now have the same spelling in both dataframes.

import pandas as pd
import difflib

df2['subject_name'] = df2.subject_name.map(lambda x: difflib.get_close_matches(x, df1.subject)[0])

After this you can try to merge. It may resolve your issue, but it would be easier to fix if you provide a reproducible example.

Sign up to request clarification or add additional context in comments.

5 Comments

Hi Conor, If you don't mind can you please tell me what this x is inside get_close_matches(x,...) as I am getting error list index out of range
try this instead df2.subject_name.map(lambda x: difflib.get_close_matches(x, df1.subject[0])
The x is simply is a place holder for the variable subject_name in df2.
Sorry Conor with this one I am getting unexpected EOF while parsing error
That type of error relates to syntax, also there was a small mistake in my code. This line df2.subject_name.map(lambda x: difflib.get_close_matches(x, df1.subject)[0]) works in a sample dataset that I set up. It is worth one more try :)
0

Correct the spelling then merge...

import pandas as pd
import operator, collections

df1 = pd.DataFrame.from_items([("subjects",
                                ["Biology","Physicss","Phsicss","Chemistry",
                                 "Biology","Physics","Physics","Biolgy","navelgazing"])])
df2 = pd.DataFrame.from_items([("subject_name",
                                ["Biology","Physics","Chemistry","Medical",
                                 "Programming","Maths","Literature"]),
                               ("subject_code",
                                ["BIO","PHY","CHE","MED","PRO","MAT","LIT"])])

Find the misspellings:

misspelled = set(df1.subjects) - set(df2.subject_name)

Find the subject that matches the misspelling best and create a dictionary -> {mis_sp : subject_name}

difference = operator.itemgetter(1)
subject = operator.itemgetter(0)
def foo1(word, candidates):
    '''Returns the most likely match for a misspelled word
    '''
    temp = []
    for candidate in candidates:
        count1 = collections.Counter(word)
        count2 = collections.Counter(candidate)
        diff1 = count1 - count2
        diff2 = count2 - count1
        diff = sum(diff1.values())
        diff += sum(diff2.values())
        temp.append((candidate, diff))
    return subject(min(temp, key = difference))

def foo2(words):
    '''Yields (misspelled-word, corrected-word) tuples from misspelled words'''
    for word in words:
        name = foo1(word, df2.subject_name)
        if name:
            yield (word, name)

d = dict(foo2(misspelled))               

Correct all the misspellings in df1

def foo3(thing):
    return d.get(thing, thing)

df3 = df1.applymap(foo3)

Merge

df2 = df2.set_index("subject_name")
df3 = df3.merge(df2, left_on = "subjects", right_index = True, how = 'left')

foo1 might possibly be sufficient for this purpose, but there are better, more sophisticated, algorithms to correct spelling. maybe, http://norvig.com/spell-correct.html

Just read @conner's solution. I didn't know difflib was there so a better foo1 would be,

def foo1(word, candidates):
    try:
        return difflib.get_close_matches(word, candidates, 1)[0]
    except IndexError as e:
        # there isn't a close match
        return None

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.