Add column to dataframe using regex and dictionary

Question

I have data as such:

import pandas as pd

foo = pd.DataFrame({'id': ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10'],
                    'amount': [10, 30, 40, 15, 20, 12, 55, 45, 60, 75], 
                    'description': [u'LYFT SAN FRANCISCO CA', u'XYZ STARBUCKS MINNEAPOLIS MN', u'HOLIDAY BEMIDJI MN', 
                                    u'MCDONALDS MADISON WI', u'ABC SUPERAMERICA MI', u'SUBWAY ROCHESTER MN', 
                                    u'NNT BURGER KING WI', u'UBER TRIP CA', u'superamerica CA', u'AMAZON NY']})

foo:

    id       amount description
    A1        10    LYFT SAN FRANCISCO CA
    A2        30    XYZ STARBUCKS MINNEAPOLIS MN
    A3        40    HOLIDAY BEMIDJI MN
    A4        15    MCDONALDS MADISON WI
    A5        20    ABC SUPERAMERICA MI
    A6        12    SUBWAY ROCHESTER MN
    A7        55    NNT BURGER KING WI
    A8        45    UBER TRIP CA
    A9        60    superamerica CA
    A10       75    AMAZON NY

I want to create a new column which categorizes each record based on a keyword match from the description column.

With help from this answer, I did it the following way:

import re    
dict1 = {
    "LYFT" : "cab_ride",
    "UBER" : "cab_ride",
    "STARBUCKS" : "Food",
    "MCDONALDS" : "Food",
    "SUBWAY" : "Food",
    "BURGER KING" : "Food",
    "HOLIDAY" : "Gas",
    "SUPERAMERICA": "Gas"
        }

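def get_category_from_desc(x):
    # Return the category of the first keyword found in x, or "Other" if none match.
    try:
        return next(dict1[k] for k in dict1 if re.search(k, x, re.IGNORECASE))
    except (StopIteration, TypeError):
        # StopIteration: no keyword matched; TypeError: x is not a string (e.g. NaN).
        return "Other"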

foo['category'] = foo.description.map(get_category_from_desc)

This works, but I want to ask whether it is the best approach to this problem. Because I have a much larger set of keywords that can indicate a category, I would have to create a huge dictionary:

dict1 = {
        "STARBUCKS" : "Food",
        "MCDONALDS" : "Food",
        "SUBWAY" : "Food",
        "BURGER KING" : "Food",
             .
             .
             .
        # ~50 more keys for "Food"

        "HOLIDAY" : "Gas",
        "SUPERAMERICA": "Gas",
             .
             .
             .
        # ~20 more keys for "Gas"

        "WALMART" : "grocery",
        "COSTCO": "grocery",
             .
             .
        # ..... ~30 more keys for "grocery"
             .
             .
        # ~ Many more categories with a large number of keys for each
}

Edit: I would also like to know whether there is an approach that does not require a huge dictionary like the one shown above. Can I achieve this with a smaller data structure, something like:

dict2 = {
    "cab_ride" : ["LYFT", "UBER"], #....
    "food" : ["STARBUCKS", "MCDONALDS", "SUBWAY", "BURGER KING"], #....
    "gas" : ["HOLIDAY", "SUPERAMERICA"] #....
        }

As for your edit: probably not using a dict that looks like that...
– cs95, Commented Apr 20, 2019 at 18:31
It needs to be flat, unfortunately... dict3 = {v: k for k, V in dict2.items() for v in V}
– cs95, Commented Apr 20, 2019 at 20:13
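
The flattening suggested in the comment above can be sketched as follows. The grouped dictionary is the dict2 from the question; the longer loop variable names are used only for readability.

```python
# Grouped mapping from the question: category -> list of keywords.
dict2 = {
    "cab_ride": ["LYFT", "UBER"],
    "food": ["STARBUCKS", "MCDONALDS", "SUBWAY", "BURGER KING"],
    "gas": ["HOLIDAY", "SUPERAMERICA"],
}

# Invert it into the flat keyword -> category mapping that the lookups need.
dict3 = {
    keyword: category
    for category, keywords in dict2.items()
    for keyword in keywords
}

print(dict3["UBER"])          # cab_ride
print(dict3["SUPERAMERICA"])  # gas
```

This keeps the compact grouped form as the thing you maintain by hand, while the flat dict3 is derived from it and used for matching.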

cs95 · Accepted Answer · 2019-04-18 19:49:04Z


I think this can be achieved pretty easily using df.replace with regex-based replacement. You can then use df.where to handle "Other" cases.

dict2 = {rf'.*{k}.*': v for k, v in dict1.items()}

cats = foo['description'].replace(dict2, regex=True)
cats.where(cats != foo['description'], 'Other')

0    cab_ride
1        Food
2         Gas
3        Food
4         Gas
5        Food
6        Food
7    cab_ride
8       Other
9       Other
Name: description, dtype: object

Another option is using str.extract with map:

from collections import defaultdict

dict2 = defaultdict(lambda: 'Other')
dict2.update(dict1)

foo['description'].str.extract(rf"({'|'.join(dict1)})", expand=False).map(dict2)

0    cab_ride
1        Food
2         Gas
3        Food
4         Gas
5        Food
6        Food
7    cab_ride
8       Other
9       Other
Name: description, dtype: object
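
Both snippets above match case-sensitively, which is why 'superamerica CA' ends up as 'Other', whereas the function in the question passed re.IGNORECASE. Below is a minimal sketch of a case-insensitive variant. It assumes every key in dict1 is upper case, so the extracted text can be upper-cased before the lookup. Only a shortened dict1 and four of the sample rows are reproduced here.

```python
import re

import pandas as pd

# Shortened copies of the question's data, for illustration only.
foo = pd.DataFrame({'description': ['LYFT SAN FRANCISCO CA', 'NNT BURGER KING WI',
                                    'superamerica CA', 'AMAZON NY']})
dict1 = {"LYFT": "cab_ride", "BURGER KING": "Food", "SUPERAMERICA": "Gas"}

# Extract the first keyword regardless of case; rows without a match become NaN.
matched = foo['description'].str.extract(
    rf"({'|'.join(dict1)})", flags=re.IGNORECASE, expand=False)

# Upper-case the match so it lines up with the (assumed upper-case) dict keys,
# then fall back to 'Other' for rows where nothing matched.
foo['category'] = matched.str.upper().map(dict1).fillna('Other')
print(foo)
```

With this variant 'superamerica CA' is categorized as 'Gas', matching the behavior of the original function.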


2 Comments

Scott Boston Over a year ago

rf... nice regular expression with f-string formatting. Thanks for teaching me something new! +1

cs95 Over a year ago

@ScottBoston I only hope OP has python3.6... taking a gamble here :P

Scott Boston · Accepted Answer · 2019-04-18 19:58:12Z

You can use the .str accessor with extract and a regular expression built by joining the dictionary keys.

foo = pd.DataFrame({'id': ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10'], 
                    'amount': [10, 30, 40, 15, 20, 12, 55, 45, 60, 75], 
                    'description': [u'LYFT SAN FRANCISCO CA', u'XYZ STARBUCKS MINNEAPOLIS MN', u'HOLIDAY BEMIDJI MN', 
                                    u'MCDONALDS MADISON WI', u'ABC SUPERAMERICA MI', u'SUBWAY ROCHESTER MN', 
                                    u'NNT BURGER KING WI', u'UBER TRIP CA', u'superamerica CA', u'AMAZON NY']})


dict1 = {
    "LYFT" : "cab_ride",
    "UBER" : "cab_ride",
    "STARBUCKS" : "Food",
    "MCDONALDS" : "Food",
    "SUBWAY" : "Food",
    "BURGER KING" : "Food",
    "HOLIDAY" : "Gas",
    "SUPERAMERICA": "Gas"
        }

regstr = '(' + '|'.join(dict1.keys()) + ')'
foo['category'] = foo['description'].str.extract(regstr).squeeze().map(dict1).fillna('Other')
print(foo)

Output:

    id  amount                   description  category
0   A1      10         LYFT SAN FRANCISCO CA  cab_ride
1   A2      30  XYZ STARBUCKS MINNEAPOLIS MN      Food
2   A3      40            HOLIDAY BEMIDJI MN       Gas
3   A4      15          MCDONALDS MADISON WI      Food
4   A5      20           ABC SUPERAMERICA MI       Gas
5   A6      12           SUBWAY ROCHESTER MN      Food
6   A7      55            NNT BURGER KING WI      Food
7   A8      45                  UBER TRIP CA  cab_ride
8   A9      60               superamerica CA     Other
9  A10      75                     AMAZON NY     Other
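
One caveat when the regex is built directly from the dictionary keys: a key containing regex metacharacters is interpreted as a pattern, and in an alternation the first listed alternative that matches wins. Here is a hedged sketch that uses re.escape and longest-first ordering. The keys "A+ MART" and "BURGER" and the "Fast food" category are made up for illustration and are not part of the original data.

```python
import re

# Hypothetical keyword map, not from the original post.
keywords = {
    "BURGER": "Fast food",
    "BURGER KING": "Food",
    "A+ MART": "grocery",
}

# Longest keys first so "BURGER KING" is tried before "BURGER";
# re.escape makes the "+" in "A+ MART" match literally.
ordered = sorted(keywords, key=len, reverse=True)
regstr = "(" + "|".join(re.escape(k) for k in ordered) + ")"

def lookup(description):
    # Return the category of the first keyword found, or "Other" if none match.
    m = re.search(regstr, description)
    return keywords[m.group(1)] if m else "Other"

print(lookup("NNT BURGER KING WI"))  # Food
print(lookup("A+ MART NY"))          # grocery
```

Because regstr has a single capture group, it can be passed to foo['description'].str.extract(regstr) in the same way as the pattern above.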
