How to add a pandas column based on partial string match?

Question

I have a pandas dataframe of credit card expenses of various yet-to-be-defined categories (gas, groceries, fast food, etc.).

df1: 

Category   Date         Description                 Cost 
nan        7.1.20       Chipotle Downtown West      $8.23
nan        7.1.20       Break Time - Springfield    $23.57
nan        7.3.20       State Farm - Agent          $94.23
nan        7.3.20       T-Mobile                    $132.42
nan        7.4.20       Venmo -xj8382dzavvd         $8.00
nan        7.6.20       Broadway McDonald's         $11.73
nan        7.8.20       Break Time - Townsville     $44.23

I would like to maintain a second dataframe which searches for keywords in the description and populates the "Category" column. Something as follows:

df2:

item           category
mcdonald       fast food
state farm     insurance
break time     gas
chipotle       fast food
mobile         cell phone

The idea here is that I would write lines of code to search for partial strings in df1['Description'] and populate df1['Category'] with the corresponding value in df2['category'].

I'm sure there is a clean and pythonic way to handle this code, but below is the closest I can get. The erroneous result of the code below is that all rows of df1['Category'] containing a match are set to the last loop in df2 (e.g. in this case, all rows would be set to "cell phone").

    for x in df2['item']:
        for y in df2['category']:
            df1['Category'] = np.where(
                        df1['Description'].str.lower().str.contains(x),
                        y,
                        df1['Category'])

Thanks for your help!

If my solution worked for you, I'd appreciate it if you'd mark it as the accepted answer. If it didn't work for you, let me know in the comments and I'll help you get to where you need to be.
– Matthew Borish, Commented Jul 12, 2020 at 0:38

Matthew Borish · Accepted Answer · 2020-07-12 01:43:02Z

You can do this with map, the get_close_matches function from Python's built-in difflib module, and a lambda expression. get_close_matches returns a list of matching strings, best match first, and you can adjust the cutoff parameter (a similarity ratio between 0 and 1, where higher is stricter) for more or less sensitivity as needed.

    import difflib

    # The cutoff appears in both get_close_matches calls; change both together.
    # The length check must be "> 0": a list with a single match still has a valid [0].
    df1['Category'] = df1['Description'].map(
        lambda x: difflib.get_close_matches(x, df2['item'], cutoff=0.3)[0]
        if len(difflib.get_close_matches(x, df2['item'], cutoff=0.3)) > 0
        else 'no match')

    print(df1)


    Category    Date    Description                 Cost
0   chipotle    7.1.20  Chipotle Downtown West      $8.23
1   break time  7.1.20  Break Time - Springfield    $23.57
2   state farm  7.3.20  State Farm - Agent          $94.23
3   mobile      7.3.20  T-Mobile                    $132.42
4   no match    7.4.20  Venmo -xj8382dzavvd         $8.00
5   mcdonald    7.6.20  Broadway McDonald's         $11.73
6   break time  7.8.20  Break Time - Townsville     $44.23
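
The output above shows the matched keyword (the df2 item) in Category rather than the category label the question asks for. As a minimal sketch, assuming the keywords have already been matched as in that output, a dictionary built from df2 can translate each keyword into its category. The names matched, item_to_category, and categories are illustrative and not part of the original answer.

```python
import pandas as pd

# Lookup table from the question.
df2 = pd.DataFrame({
    "item": ["mcdonald", "state farm", "break time", "chipotle", "mobile"],
    "category": ["fast food", "insurance", "gas", "fast food", "cell phone"],
})

# Matched keywords, copied from the Category column of the output above.
matched = pd.Series([
    "chipotle", "break time", "state farm", "mobile",
    "no match", "mcdonald", "break time",
])

# Build a keyword -> category dictionary and translate each keyword.
item_to_category = dict(zip(df2["item"], df2["category"]))

# Keywords missing from the dictionary become NaN, so restore the placeholder.
categories = matched.map(item_to_category).fillna("no match")
print(categories.tolist())
```

Assigning categories back to df1['Category'] would then give the labels the question describes.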

edited Jul 12, 2020 at 1:43

answered Jul 11, 2020 at 21:56


4 Comments

Dylan Moore Over a year ago

Thanks Matthew, I appreciate the help--I've been stuck on this all day. I can't import new libraries, as my work computer is strict with downloads. So I'm trying to work along the lines of a for loop. And I'm so confused why my loop doesn't work. I've even failed when I tried to remove the nested for loop (with the y), and simply create a mask column (i.e. for x in...: df1['mask'] = np.where(...contains(x), x, np.nan). But again, it seems that np.where flushes all other iterations and acts as if the only item in the loop was the last instance. Any help is appreciated!

Matthew Borish Over a year ago

Hey Dylan, difflib is built into the Python standard library, so you don't need to install any additional libs/software. Your for loop is not working primarily because it's operating on each column in its entirety, since pandas has no way to keep track of the rows/indices with your method, which is why map/lambda is a better approach here. FWIW, you generally want to avoid for loops with pandas, but if you have to use one, you want to use iterrows or itertuples (i.e. for idx, row in df.iterrows():), which allow you to keep track of the index.
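
For readers who want to stay with a loop, here is a hedged sketch of one way to do it with itertuples and plain substring matching. It assumes the problem in the question's loop is the nesting: every item is paired with every category in turn, so matching rows end up with the last category. Iterating over df2 one row at a time keeps each item with its own category. The variable names descriptions and mask are illustrative.

```python
import pandas as pd

# Data from the question; Category starts empty (object dtype so it can hold strings).
df1 = pd.DataFrame({
    "Category": [None] * 7,
    "Description": [
        "Chipotle Downtown West", "Break Time - Springfield",
        "State Farm - Agent", "T-Mobile", "Venmo -xj8382dzavvd",
        "Broadway McDonald's", "Break Time - Townsville",
    ],
})
df2 = pd.DataFrame({
    "item": ["mcdonald", "state farm", "break time", "chipotle", "mobile"],
    "category": ["fast food", "insurance", "gas", "fast food", "cell phone"],
})

# Lowercase once so the lowercase keywords in df2 can match.
descriptions = df1["Description"].str.lower()

# One pass over df2: each keyword travels with its own category.
for item, category in df2.itertuples(index=False):
    # regex=False treats the keyword as a literal substring.
    mask = descriptions.str.contains(item, regex=False)
    df1.loc[mask, "Category"] = category

print(df1)
```

Rows with no keyword in the description, such as the Venmo row, keep their empty Category. If a description contains more than one keyword, the keyword listed later in df2 wins.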

Dylan Moore Over a year ago

I tried your solution and got the following error: IndexError: list index out of range

Matthew Borish Over a year ago

Ah, so that's happening because you're getting rows with no matches, which means difflib returns an empty list. I added a conditional to the lambda so you will get "no match" in the category column for that case.
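
The empty-list case described in this comment can be sketched with a small helper that calls get_close_matches once and falls back to a placeholder. The helper name first_match, the keywords list, and the default cutoff are assumptions for illustration, not part of the original answer. The input is lowercased because difflib compares characters case-sensitively.

```python
import difflib

# Keywords from df2['item'] in the question.
keywords = ["mcdonald", "state farm", "break time", "chipotle", "mobile"]

def first_match(text, candidates, cutoff=0.6):
    """Return the best close match for text, or 'no match' if there is none."""
    # n=1 asks for only the single best candidate.
    matches = difflib.get_close_matches(text.lower(), candidates, n=1, cutoff=cutoff)
    # An empty list is falsy, so indexing only happens when a match exists.
    return matches[0] if matches else "no match"

print(first_match("Chipotle", keywords))
print(first_match("zzzzqqqq", keywords))
```

Used with map, this would look like df1['Description'].map(lambda x: first_match(x, keywords, cutoff=0.3)), with the cutoff tuned to the data.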
