I have a pandas dataframe of credit card expenses of various yet-to-be-defined categories (gas, groceries, fast food, etc.).

df1: 

Category   Date         Description                 Cost 
nan        7.1.20       Chipotle Downtown West      $8.23
nan        7.1.20       Break Time - Springfield    $23.57
nan        7.3.20       State Farm - Agent          $94.23
nan        7.3.20       T-Mobile                    $132.42
nan        7.4.20       Venmo -xj8382dzavvd         $8.00
nan        7.6.20       Broadway McDonald's         $11.73
nan        7.8.20       Break Time - Townsville     $44.23

I would like to maintain a second dataframe which searches for keywords in the description and populates the "Category" column. Something as follows:

df2:

item           category
mcdonald       fast food
state farm     insurance
break time     gas
chipotle       fast food
mobile         cell phone 

The idea is to search df1['Description'] for each partial string in df2['item'] and, where one matches, populate df1['Category'] with the corresponding value from df2['category'].

I'm sure there is a clean and pythonic way to handle this, but the code below is the closest I can get. The problem is that every row of df1['Category'] with a match is set to the last category in df2 (in this case, all matching rows would be set to "cell phone").

    for x in df2['item']:
        for y in df2['category']:
            df1['Category'] = np.where(
                        df1['Description'].str.lower().str.contains(x),
                        y,
                        df1['Category'])

Thanks for your help!

  • If my solution worked for you, I'd appreciate if you'd mark it as the accepted answer. If it didn't work for you, lmk in the comments and I'll help you get to where you need to be. Commented Jul 12, 2020 at 0:38

1 Answer

You can do this with map, the get_close_matches function from Python's built-in difflib module, and a lambda expression. get_close_matches returns a list of the closest string matches, best first; raise the cutoff parameter to make matching stricter, or lower it to make matching looser.

    import difflib

    # If you change the cutoff, change it in both calls so the lambda works correctly.
    # An empty list means no keyword scored at or above the cutoff.
    df1['Category'] = df1['Description'].map(lambda x: difflib.get_close_matches(x, df2['item'], cutoff=0.3)[0] if len(difflib.get_close_matches(x, df2['item'], cutoff=0.3)) > 0 else 'no match')

    print(df1)


    Category    Date    Description                 Cost
0   chipotle    7.1.20  Chipotle Downtown West      $8.23
1   break time  7.1.20  Break Time - Springfield    $23.57
2   state farm  7.3.20  State Farm - Agent          $94.23
3   mobile      7.3.20  T-Mobile                    $132.42
4   no match    7.4.20  Venmo -xj8382dzavvd         $8.00
5   mcdonald    7.6.20  Broadway McDonald's         $11.73
6   break time  7.8.20  Break Time - Townsville     $44.23
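
The output above fills Category with the matched keyword rather than with its category. As a minimal sketch (not part of the original answer), the matched keyword can be looked up in a dict built from df2. The names CATEGORY_BY_ITEM and categorize are hypothetical, and lowercasing the description is an added assumption, made because difflib compares strings case-sensitively.

```python
import difflib

# Keyword -> category lookup, copied by hand from df2 in the question.
# With the real dataframe it could be built as dict(zip(df2['item'], df2['category'])).
CATEGORY_BY_ITEM = {
    "mcdonald": "fast food",
    "state farm": "insurance",
    "break time": "gas",
    "chipotle": "fast food",
    "mobile": "cell phone",
}


def categorize(description, cutoff=0.3):
    """Return the category of the closest keyword, or 'no match'."""
    # Lowercase first: the keywords are lowercase and difflib is case-sensitive.
    matches = difflib.get_close_matches(
        description.lower(), list(CATEGORY_BY_ITEM), n=1, cutoff=cutoff
    )
    # An empty list means no keyword scored at or above the cutoff.
    return CATEGORY_BY_ITEM[matches[0]] if matches else "no match"


for text in ["State Farm - Agent", "T-Mobile", "Venmo -xj8382dzavvd"]:
    print(text, "->", categorize(text))
```

With the dataframes from the question, this could be applied as df1['Category'] = df1['Description'].map(categorize). Calling get_close_matches once per description also removes the need to keep two cutoff values in sync.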

4 Comments

Thanks Matthew, I appreciate the help. I've been stuck on this all day. I can't import new libraries, as my work computer is strict with downloads, so I'm trying to work along the lines of a for loop, and I'm confused about why my loop doesn't work. It fails even when I remove the nested for loop (the one over y) and simply create a mask column (i.e. for x in...: df1['mask'] = np.where(...contains(x), x, np.nan)). Again, np.where seems to flush all other iterations and act as if the only item in the loop was the last one. Any help is appreciated!
Hey Dylan, difflib is part of the Python standard library, so you don't need to install any additional libraries or software. Your for loop fails mainly because the nested loops pair every item with every category, and each np.where call rewrites the whole column, so every matching row ends up with the last category. map with a lambda sidesteps this by handling one description at a time. FWIW, you generally want to avoid for loops with pandas, but if you have to use one, use iterrows or itertuples (e.g. for idx, row in df.iterrows():), which let you keep track of the index.
I tried your solution and got the following error: IndexError: list index out of range
Ah, so that's happening because you're getting rows with no matches, which means difflib returns an empty list. I added a conditional to the lambda so you will get "no match" in the category column for that case.
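
For the loop-based route discussed in these comments, here is a minimal sketch of one way to make it work. It is not taken from the thread. It assumes the data shown in the question and that pandas and numpy are already installed. zip pairs each keyword with its own category, so there is no inner loop to overwrite earlier results, and assigning through a boolean mask only touches the rows that matched. The mask-column variant described above loses earlier matches for a related reason: passing np.nan as the fallback to np.where resets every non-matching row on each pass.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; Category starts out empty.
df1 = pd.DataFrame({
    "Category": pd.Series([np.nan] * 7, dtype=object),
    "Description": [
        "Chipotle Downtown West",
        "Break Time - Springfield",
        "State Farm - Agent",
        "T-Mobile",
        "Venmo -xj8382dzavvd",
        "Broadway McDonald's",
        "Break Time - Townsville",
    ],
})
df2 = pd.DataFrame({
    "item": ["mcdonald", "state farm", "break time", "chipotle", "mobile"],
    "category": ["fast food", "insurance", "gas", "fast food", "cell phone"],
})

# Lowercase once, since the keywords in df2 are lowercase.
descriptions = df1["Description"].str.lower()

# One pass per (keyword, category) pair instead of every keyword x every category.
for item, category in zip(df2["item"], df2["category"]):
    mask = descriptions.str.contains(item, regex=False)
    # Only rows that contain the keyword are written; the rest keep their value.
    df1.loc[mask, "Category"] = category

print(df1)
```

Rows with no keyword match, such as the Venmo entry, are left as NaN. If a description contains more than one keyword, the keyword that appears later in df2 wins.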
