3

I have a dataframe df that looks like this:

        a      b
  0     Jon     Jon
  1     Jon     John
  2     Jon     Johnny

And I'd like to compare these two string columns and make a new column like so:

  df['compare'] = df2['a'] == df2['b']


        a      b          compare
  0     Jon     Jon         True
  1     Jon     John        False
  2     Jon     Johnny      False

I'd also like to be able to pass columns a and b through this levenshtein function:

def levenshtein_distance(a, b):
    """Return the Levenshtein edit distance between two strings *a* and *b*."""
    if a == b:
        return 0
    if len(a) < len(b):
        a, b = b, a
    if not a:
        return len(b)
    previous_row = range(len(b) + 1)
    for i, column1 in enumerate(a):
        current_row = [i + 1]
        for j, column2 in enumerate(b):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (column1 != column2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1] 

and add a column like such:

  df['compare'] = levenshtein_distance(df2['a'], df2['b'])      

        a      b          compare
   0    Jon     Jon         100
   1    Jon     John        .95
   2    Jon     Johnny      .87

However I am getting this error when I try:

  ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

How can I format my data/dataframe so that I can compare the two columns and add that comparison as a third column?

4
  • Have you looked at apply? That's how I normally apply functions like these. Commented Nov 27, 2019 at 17:52
  • I'm a newb so still learning a lot. I'm just not sure why I can pass two strings by themselves to levenshtein, but not loop through the rows and do it? Commented Nov 27, 2019 at 17:57
  • @user3486773 when you call levenshtein_distance(df2['a'], df2['b']), the arguments you pass are now Series, not strings. Thus a comparison like a==b compares the entire Series. In pandas this performs an element-wise comparison, but you use that in an if statement. So what single truth value characterizes [True, False, True, True, False, …]? Your function is implemented to work on two individual strings, which is appropriate for levenshtein_distance. However you need to call it with the proper inputs, two strings, not Series, since that's what your function expects. Commented Nov 27, 2019 at 18:02
  • @user3486773 I modified my answer on this one to calculate the percentage. Can I use it to answer your other question here stackoverflow.com/questions/59076746/… or do you want me to post it here? Commented Nov 27, 2019 at 18:57
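
To make the point from the comments concrete, here is a minimal sketch of why the `if` statement fails. It rebuilds the example frame from the question (the variable names `mask`, `error_message` and `first_pair_equal` are just illustrative) and assumes pandas is installed.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({"a": ["Jon", "Jon", "Jon"], "b": ["Jon", "John", "Johnny"]})

# Comparing two Series is element-wise and yields another Series of booleans.
mask = df["a"] == df["b"]
print(mask.tolist())

# Using that Series as an `if` condition is what raises the ValueError.
error_message = None
try:
    if mask:
        pass
except ValueError as exc:
    error_message = str(exc)
    print(error_message)

# Comparing two individual strings gives a single bool, which `if` accepts.
first_pair_equal = df["a"].iloc[0] == df["b"].iloc[0]
print(first_pair_equal)
```

So the function itself is fine for two strings; it is the whole-column call that hands it something `if` cannot evaluate.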

2 Answers

7

Just do:

df['compare'] = [levenshtein_distance(a, b) for a, b in zip(df2['a'], df2['b'])]

Or, if you want equality comparison:

df['compare'] = (df['a'] == df['b'])
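
For completeness, a self-contained sketch of both lines above, run against the example data from the question. The column names `equal` and `distance` and the variable `via_apply` are only for illustration, and the `apply` variant is the alternative suggested in the comments, not something this answer requires. The function is the one from the question, lightly trimmed.

```python
import pandas as pd


def levenshtein_distance(a, b):
    """Return the Levenshtein edit distance between two strings a and b."""
    if a == b:
        return 0
    if len(a) < len(b):
        a, b = b, a
    previous_row = range(len(b) + 1)
    for i, char_a in enumerate(a):
        current_row = [i + 1]
        for j, char_b in enumerate(b):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (char_a != char_b)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]


df = pd.DataFrame({"a": ["Jon", "Jon", "Jon"], "b": ["Jon", "John", "Johnny"]})

# Element-wise equality works directly on the columns.
df["equal"] = df["a"] == df["b"]

# The distance function expects two strings, so feed it one pair per row.
df["distance"] = [levenshtein_distance(a, b) for a, b in zip(df["a"], df["b"])]

# Row-wise apply gives the same result, usually more slowly on large frames.
via_apply = df.apply(lambda row: levenshtein_distance(row["a"], row["b"]), axis=1)

print(df)
```

The list comprehension over `zip` avoids building a Series per row, which is why it tends to be preferred over `apply(axis=1)` for plain Python functions like this one.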

10 Comments

Aren't the elements of a and b iterables (strings)?
Yes, but if it's meant to be applied to pd.Series, you can't just throw strings at it.
@mcsoini I don't understand what you mean. The Levenshtein function is made for strings; what do you mean by pd.Series?
In the OP, df['compare'] = levenshtein_distance(df2['a'], df2['b']) means a and b are pandas Series.
This is kind of reinventing the wheel. Are you aware that there are quite mature modules for this already: difflib, python-Levenshtein, fuzzywuzzy?
1

I think your comparisons are the problem. Change:

if a == b

and

not a

to

if a[0] == b[0]

and

not a[0]

and you'll see that your function runs, because it then compares individual strings instead of whole Series. It just needs to iterate through the columns you pass and return a list with one result per row.

Here's a working version:

def levenshtein_distance(a, b):
    """Return a list of Levenshtein edit distances, one per pair of strings in *a* and *b*."""
    distances = []
    # Walk both columns in parallel so every comparison is string vs. string.
    for c, d in zip(a, b):
        if c == d:
            distances.append(0)
            continue
        # Keep the longer string in c so the rows have len(d) + 1 entries.
        if len(c) < len(d):
            c, d = d, c
        if not c:
            distances.append(len(d))
            continue
        previous_row = range(len(d) + 1)
        for i, column1 in enumerate(c):
            current_row = [i + 1]
            for j, column2 in enumerate(d):
                insertions = previous_row[j + 1] + 1
                deletions = current_row[j] + 1
                substitutions = previous_row[j] + (column1 != column2)
                current_row.append(min(insertions, deletions, substitutions))
            previous_row = current_row
        distances.append(previous_row[-1])
    return distances

df['compare'] = levenshtein_distance(df.a, df.b)

df

#     a       b  compare
#0  Jon     Jon        0
#1  Jon    John        1
#2  Jon  Johnny        3

It doesn't calculate the percentages; it just reuses your code, which, according to Levenshtein Calc, gives the right answers.
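
If you want a score between 0 and 1 rather than a raw edit count, one common normalization (my assumption, not necessarily the scale you have in mind, and it will not reproduce the .95 and .87 from your example) is 1 - distance / length of the longer string. A rough sketch that starts from the distances in the output above, with `similarity` as a made-up column name:

```python
import pandas as pd

# Frame as printed above, with the raw edit distances already in `compare`.
df = pd.DataFrame(
    {
        "a": ["Jon", "Jon", "Jon"],
        "b": ["Jon", "John", "Johnny"],
        "compare": [0, 1, 3],
    }
)

# Assumed normalization: 1 - distance / length of the longer string.
# The extra 1 in max() guards against dividing by zero for two empty strings.
df["similarity"] = [
    1 - dist / max(len(a), len(b), 1)
    for dist, a, b in zip(df["compare"], df["a"], df["b"])
]

print(df)
```

Identical strings score 1.0 and the score drops toward 0 as more edits are needed relative to the longer string.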
