how to string compare two columns in pandas dataframe?

Question

I have a dataframe df that looks like this:

        a      b
  0     Jon     Jon
  1     Jon     John
  2     Jon     Johnny

And I'd like to compare these two string columns and make a new column, like so:

  df['compare'] = df2['a'] == df2['b']


        a      b          compare
  0     Jon     Jon         True
  1     Jon     John        False
  2     Jon     Johnny      False

I'd also like to be able to pass columns a and b through this levenshtein function:

def levenshtein_distance(a, b):
    """Return the Levenshtein edit distance between two strings *a* and *b*."""
    if a == b:
        return 0
    if len(a) < len(b):
        a, b = b, a
    if not a:
        return len(b)
    previous_row = range(len(b) + 1)
    for i, column1 in enumerate(a):
        current_row = [i + 1]
        for j, column2 in enumerate(b):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (column1 != column2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]

and add a column like such:

  df['compare'] = levenshtein_distance(df2['a'], df2['b'])

        a      b          compare
   0    Jon     Jon         100
   1    Jon     John        .95
   2    Jon     Johnny      .87

However I am getting this error when I try:

  ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

How can I format my data/dataframe so that I can compare the two columns and add that comparison as a third column?

Have you looked at apply? That's how I normally apply functions like these.
– oppressionslayer, Commented Nov 27, 2019 at 17:52
I'm a newb so still learning a lot. I'm just not sure why I can pass two strings by themselves to levenshtein, but not loop through the rows and do it?
– user3486773, Commented Nov 27, 2019 at 17:57
@user3486773 When you call levenshtein_distance(df2['a'], df2['b']), the arguments you pass are Series, not strings, so a comparison like a == b compares the entire Series. In pandas this performs an element-wise comparison, but you then use the result in an if statement. What single truth value characterizes [True, False, True, True, False, …]? Your function is written to work on two individual strings, which is appropriate for levenshtein_distance, but you need to call it with the inputs it expects: two strings, not two Series.
– ALollz, Commented Nov 27, 2019 at 18:02
@user3486773 I modified my answer on this one to calculate the percentage. Can I use it to answer your other question at stackoverflow.com/questions/59076746/…, or do you want me to post it here?
– oppressionslayer, Commented Nov 27, 2019 at 18:57
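
The explanation in the comments above can be reproduced in a few lines. The following is a minimal sketch, not the asker's actual code: same_or_not is a hypothetical stand-in for any function that branches on a == b, and the sample frame is rebuilt from the question. It first triggers the ValueError by passing whole columns, then applies the function row by row with DataFrame.apply and axis=1, as the first comment suggests.

```python
import pandas as pd

df = pd.DataFrame({"a": ["Jon", "Jon", "Jon"], "b": ["Jon", "John", "Johnny"]})


def same_or_not(a, b):
    # Hypothetical scalar function: it branches on a == b,
    # just like levenshtein_distance does in the question.
    if a == b:
        return 0
    return abs(len(a) - len(b))


# Passing whole columns makes `a == b` a boolean Series,
# and `if` cannot reduce a Series to a single True/False.
try:
    same_or_not(df["a"], df["b"])
    raised = False
except ValueError:
    raised = True

# Row-wise application hands the function one pair of strings at a time.
row_wise = df.apply(lambda row: same_or_not(row["a"], row["b"]), axis=1)
```

With axis=1 each row is passed to the lambda as a Series, so row["a"] and row["b"] are plain strings by the time they reach the function.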

Dani Mesejo · Accepted Answer · 2019-11-27 17:51:20Z

7

Just do:

df['compare'] = [levenshtein_distance(a, b) for a, b in zip(df2['a'], df2['b'])]

Or, if you want equality comparison:

df['compare'] = (df['a'] == df['b'])
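
For reference, here is a self-contained sketch of both lines above. The sample frame is rebuilt from the question, the distance function is adapted from the asker's code, and the column names distance and equal are chosen here only for illustration.

```python
import pandas as pd


def levenshtein_distance(a, b):
    """Levenshtein edit distance between two strings (adapted from the question)."""
    if a == b:
        return 0
    if len(a) < len(b):
        a, b = b, a  # make `a` the longer string
    if not b:
        return len(a)
    previous_row = list(range(len(b) + 1))
    for i, ch_a in enumerate(a):
        current_row = [i + 1]
        for j, ch_b in enumerate(b):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (ch_a != ch_b)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]


df = pd.DataFrame({"a": ["Jon", "Jon", "Jon"], "b": ["Jon", "John", "Johnny"]})

# zip yields one (a, b) pair of plain strings per row,
# so the scalar function never sees a Series.
df["distance"] = [levenshtein_distance(a, b) for a, b in zip(df["a"], df["b"])]

# Equality is already element-wise on Series, so no loop is needed.
df["equal"] = df["a"] == df["b"]
```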

edited Nov 27, 2019 at 17:51

answered Nov 27, 2019 at 17:50

Dani Mesejo


10 Comments

Dani Mesejo Over a year ago

Aren't the elements of a and b iterables (strings)?

mcsoini Over a year ago

yes, but if it's meant to be applied to pd.Series, you can't just throw strings at it

Dani Mesejo Over a year ago

@mcsoini I don't understand what you mean. The Levenshtein function is made for strings; what do you mean by pd.Series?

mcsoini Over a year ago

The OP's df['compare'] = levenshtein_distance(df2['a'], df2['b']) means a and b are pandas Series

Erfan Over a year ago

This is kinda reinventing the wheel. Are you aware that there are quite mature modules for this already: difflib, python-Levenshtein, fuzzywuzzy
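
As a rough illustration of the standard-library option mentioned in this comment, the sketch below uses difflib.SequenceMatcher to get a similarity ratio between 0 and 1 for each row. Note that this ratio is based on matching blocks, not on Levenshtein edit distance, so it is an alternative measure and not a drop-in replacement. The column name ratio is chosen here for illustration.

```python
from difflib import SequenceMatcher

import pandas as pd

df = pd.DataFrame({"a": ["Jon", "Jon", "Jon"], "b": ["Jon", "John", "Johnny"]})

# SequenceMatcher(None, a, b).ratio() returns 2 * M / T, where M is the number
# of matched characters and T is the combined length of both strings.
df["ratio"] = [
    SequenceMatcher(None, a, b).ratio() for a, b in zip(df["a"], df["b"])
]
```

Identical strings score 1.0, and the score drops as the strings diverge, which is closer to the kind of output the question asks for than a raw edit distance.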


oppressionslayer · Accepted Answer · 2019-11-27 18:50:53Z

I think your comparisons are wrong. Change:

if a == b

and not a

to

if a[0] == b[0]

and 

not a[0]

and you'll see that your function works; it just needs to iterate through the columns you pass in and return a list with one result per row.

Here's a working version:

def levenshtein_distance(a, b):
    """Return a list of Levenshtein edit distances, one per row of columns *a* and *b*."""
    thelist = []
    # zip pairs the values by position, so this also works with a non-default index
    for c, d in zip(a, b):
        if c == d:
            thelist.append(0)
            continue
        if len(c) < len(d):
            c, d = d, c
        if not c:
            thelist.append(len(d))
            continue
        previous_row = range(len(d) + 1)
        for i, column1 in enumerate(c):
            current_row = [i + 1]
            for j, column2 in enumerate(d):
                insertions = previous_row[j + 1] + 1
                deletions = current_row[j] + 1
                substitutions = previous_row[j] + (column1 != column2)
                current_row.append(min(insertions, deletions, substitutions))
            previous_row = current_row
        thelist.append(previous_row[-1])
    return thelist

df['compare'] = levenshtein_distance(df.a, df.b)

df

#     a       b  compare
#0  Jon     Jon        0
#1  Jon    John        1
#2  Jon  Johnny        3

This doesn't calculate the percentages; it just uses your code, which, according to Levenshtein Calc, gives the right answers.
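
If a score between 0 and 1 is wanted instead of a raw distance, one possible normalization is 1 - distance / length of the longer string. This is an assumption: the question does not say how its percentages are defined, and this formula does not reproduce the exact figures shown there. A minimal sketch, starting from the distances in the output above (the names longest and similarity are illustrative):

```python
import pandas as pd

# Distances copied from the output shown above (0, 1, 3).
df = pd.DataFrame({
    "a": ["Jon", "Jon", "Jon"],
    "b": ["Jon", "John", "Johnny"],
    "compare": [0, 1, 3],
})

# Length of the longer string in each row.
longest = df[["a", "b"]].apply(lambda col: col.str.len()).max(axis=1)

# Assumed normalization: 1.0 means identical strings; the score falls toward
# 0.0 as the edit distance approaches the length of the longer string.
# Rows where both strings are empty would divide 0 by 0 and yield NaN.
df["similarity"] = 1 - df["compare"] / longest
```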
