Difflib error when applying onto two columns in pandas dataframe

Question

I have a DataFrame that looks like this:

Cities           Cities_Dict
"San Francisco"  ["San Francisco", "New York", "Boston"]
"Los Angeles"    ["Los Angeles"]
"berlin"         ["Munich", "Berlin"]
"Dubai"          ["Dubai"]

I want to create a new column that compares the city from the first column to the list of cities from the second column and finds the closest match. I use difflib for that:

df["new_col"]=difflib.get_close_matches(df["Cities"],df["Cities_Dict"])

However, I get an error:

TypeError: object of type 'float' has no len()

jezrael · Accepted Answer · 2019-09-04 05:45:21Z

Use DataFrame.apply with a lambda function and axis=1 to process the data row by row:

import ast
import difflib

# If necessary, convert string values to lists first
# df['Cities_Dict'] = df['Cities_Dict'].apply(ast.literal_eval)

# Compare each row's city with that row's own list of candidates
f = lambda x: difflib.get_close_matches(x["Cities"], x["Cities_Dict"])
df["new_col"] = df.apply(f, axis=1)
print(df)
          Cities                        Cities_Dict          new_col
0  San Francisco  [San Francisco, New York, Boston]  [San Francisco]
1    Los Angeles                      [Los Angeles]    [Los Angeles]
2         berlin                   [Munich, Berlin]         [Berlin]
3          Dubai                            [Dubai]          [Dubai]
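
The commented-out ast.literal_eval line matters when Cities_Dict holds strings that only look like lists, for example after reading the data from a CSV file. Below is a minimal sketch with plain Python values; the sample string raw is an assumption, not data from the question:

```python
import ast
import difflib

# Assumed sample: a list stored as text, as it might arrive from a CSV file
raw = '["Munich", "Berlin"]'

# Passing the raw string makes difflib compare against single characters,
# so nothing is close enough to "berlin"
raw_matches = difflib.get_close_matches("berlin", raw)

# literal_eval parses the text into a real Python list of strings
cities = ast.literal_eval(raw)
matches = difflib.get_close_matches("berlin", cities)

print(raw_matches)  # []
print(cities)       # ['Munich', 'Berlin']
print(matches)      # ['Berlin']
```

If the column already contains real lists, the conversion is not needed and can stay commented out.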

EDIT:

To get the first match as a string, or an empty string when no match is found, use:

f = lambda x: next(iter(difflib.get_close_matches(x["Cities"],x["Cities_Dict"])), '')
df["new_col"] = df.apply(f, axis=1)
print (df)
          Cities                        Cities_Dict        new_col
0  San Francisco  [San Francisco, New York, Boston]  San Francisco
1    Los Angeles                      [Los Angeles]    Los Angeles
2         berlin                   [Munich, Berlin]         Berlin
3          Dubai                            [Dubai]          Dubai
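
As a side note, get_close_matches also takes the optional parameters n (maximum number of matches, default 3) and cutoff (minimum similarity, default 0.6), which can limit the result before the first value is picked. A small sketch, where the candidate list is made up for illustration:

```python
import difflib

# Hypothetical candidates, chosen so that two of them pass the default cutoff
candidates = ["Munich", "Berlin", "Dublin"]

# Default settings: every candidate scoring at least 0.6, best first
default_matches = difflib.get_close_matches("berlin", candidates)

# n=1 keeps only the best-scoring candidate (still returned as a list)
best_only = difflib.get_close_matches("berlin", candidates, n=1)

# A stricter cutoff drops weaker candidates such as "Dublin"
strict = difflib.get_close_matches("berlin", candidates, cutoff=0.8)

print(default_matches)  # ['Berlin', 'Dublin']
print(best_only)        # ['Berlin']
print(strict)           # ['Berlin']
```

The result is still a list in every case, so next with iter (or a guarded [0]) is needed to get a plain string.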

EDIT1: If the data may contain problematic values, use try-except:

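def f(x):
    try:
        # [0] picks the best match; it fails if the list of matches is empty
        return difflib.get_close_matches(x["Cities"], x["Cities_Dict"])[0]
    except Exception:
        # Non-string values (NaN, numbers) or no match: return an empty string
        return ''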

df["new_col"] = df.apply(f, axis=1)
print (df)
        Cities                        Cities_Dict new_col
0          NaN  [San Francisco, New York, Boston]        
1  Los Angeles                               [10]        
2       berlin                   [Munich, Berlin]  Berlin
3        Dubai                            [Dubai]   Dubai
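
To see which errors the try-except absorbs, here is a sketch with plain values instead of a DataFrame. The helper name first_match and the sample inputs are assumptions for illustration; catching only TypeError and IndexError is a narrower variant, not the code from the answer above:

```python
import difflib

def first_match(word, candidates, default=""):
    """Return the closest match, or default when the inputs are unusable."""
    try:
        return difflib.get_close_matches(word, candidates)[0]
    except (TypeError, IndexError):
        # TypeError: word or a candidate is not a string (NaN, int, ...)
        # IndexError: nothing was close enough, so the match list is empty
        return default

ok = first_match("berlin", ["Munich", "Berlin"])
nan_word = first_match(float("nan"), ["San Francisco", "New York"])
int_candidate = first_match("Los Angeles", [10])
empty_list = first_match("Dubai", [])

print(repr(ok))             # 'Berlin'
print(repr(nan_word))       # ''
print(repr(int_candidate))  # ''
print(repr(empty_list))     # ''
```

With a DataFrame, the same helper could be applied row by row, for example df.apply(lambda x: first_match(x["Cities"], x["Cities_Dict"]), axis=1).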

edited Sep 4, 2019 at 5:45

answered Sep 3, 2019 at 7:46

jezrael


5 Comments

Alex T Over a year ago

Is it possible to get the result as a string rather than as a list?

jezrael Over a year ago

@AlexT - the answer was edited; it now always returns the first value of the list, or an empty string.

Alex T Over a year ago

I found out that some values in Cities_Dict ended up as floats or ints. Is it possible to use try/except in the lambda function to skip those rows and produce an empty string for them?

Alex T Over a year ago

And a second question: why do you use next and iter?

jezrael Over a year ago

@AlexT - For the first question, the answer was edited. For the second, it is a trick: selecting the first value of the list with [0] raises an error when the list is empty. For example, with L = ['Dubai'], L[0] works, but with L = [], L[0] fails. To prevent that failure, next is used with iter: it returns the first value of the list if one exists (the list is not empty), otherwise the default value, here an empty string.
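
The difference between [0] and next with iter can be checked without pandas. A minimal sketch; the word "Tokyo" is an assumed example that shares no characters with "Dubai", so it produces an empty match list:

```python
import difflib

found = difflib.get_close_matches("Dubai", ["Dubai"])   # ['Dubai']
empty = difflib.get_close_matches("Tokyo", ["Dubai"])   # []

# next(iter(...), default) returns the first item, or the default if empty
first_found = next(iter(found), "")
first_empty = next(iter(empty), "")

# Plain indexing raises IndexError on the empty list
try:
    empty[0]
    index_failed = False
except IndexError:
    index_failed = True

print(repr(first_found))  # 'Dubai'
print(repr(first_empty))  # ''
print(index_failed)       # True
```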
