I have a dataframe df that looks like this:
     a       b
0  Jon     Jon
1  Jon    John
2  Jon  Johnny
And I'd like to compare these two strings and make a new column, like so:
df['compare'] = df2['a'] == df2['b']
     a       b  compare
0  Jon     Jon     True
1  Jon    John    False
2  Jon  Johnny    False
I'd also like to be able to pass columns a and b through this Levenshtein function:
def levenshtein_distance(a, b):
    """Return the Levenshtein edit distance between two strings *a* and *b*."""
    if a == b:
        return 0
    if len(a) < len(b):
        a, b = b, a
    if not a:
        return len(b)
    previous_row = range(len(b) + 1)
    for i, column1 in enumerate(a):
        current_row = [i + 1]
        for j, column2 in enumerate(b):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (column1 != column2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]
and add a column like such:
df['compare'] = levenshtein_distance(df2['a'], df2['b'])
a b compare
0 Jon Jon 100
1 Jon John .95
2 Jon Johnny .87
However I am getting this error when I try:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
How can I format my data/dataframe so that I can compare the two columns and add that comparison as a third column?
When you call levenshtein_distance(df2['a'], df2['b']), the arguments you pass are Series, not strings. A comparison like a == b therefore compares the two Series as a whole. In pandas this performs an element-wise comparison and returns a Series of booleans, but you then use that result in an if statement. What single truth value should characterize [True, False, True, True, False, …]? pandas refuses to guess, which is why it raises the ValueError.

Your function is implemented to work on two individual strings, which is appropriate for levenshtein_distance. However, you need to call it with the inputs it expects: two strings, not two Series. That means calling it once per row, with that row's values of a and b.
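One way to do that, as a minimal sketch: keep the function from the question unchanged and call it once per pair of values by zipping the two columns together. The sample frame is rebuilt from the question's data, and the column names equal and distance are made up here for illustration (the question uses compare for both).

```python
import pandas as pd


def levenshtein_distance(a, b):
    """Return the Levenshtein edit distance between two strings a and b."""
    if a == b:
        return 0
    if len(a) < len(b):
        a, b = b, a
    if not a:
        return len(b)
    previous_row = range(len(b) + 1)
    for i, column1 in enumerate(a):
        current_row = [i + 1]
        for j, column2 in enumerate(b):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (column1 != column2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]


# Sample data rebuilt from the question.
df = pd.DataFrame({"a": ["Jon", "Jon", "Jon"], "b": ["Jon", "John", "Johnny"]})

# Element-wise equality works directly on two Series; no function call needed.
# "equal" is a hypothetical column name.
df["equal"] = df["a"] == df["b"]

# The scalar function has to be called once per row, with two plain strings.
# "distance" is a hypothetical column name.
df["distance"] = [levenshtein_distance(x, y) for x, y in zip(df["a"], df["b"])]

print(df)
```

df.apply(lambda row: levenshtein_distance(row['a'], row['b']), axis=1) should give the same column. Note that the function returns an integer edit count (0, 1 and 3 for these rows), not a similarity score like the 100 / .95 / .87 shown in the question.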