I have data like this:
import pandas as pd

foo = pd.DataFrame({'id': ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10'],
                    'amount': [10, 30, 40, 15, 20, 12, 55, 45, 60, 75],
                    'description': [u'LYFT SAN FRANCISCO CA', u'XYZ STARBUCKS MINNEAPOLIS MN', u'HOLIDAY BEMIDJI MN',
                                    u'MCDONALDS MADISON WI', u'ABC SUPERAMERICA MI', u'SUBWAY ROCHESTER MN',
                                    u'NNT BURGER KING WI', u'UBER TRIP CA', u'superamerica CA', u'AMAZON NY']})
foo:
id   amount  description
A1   10      LYFT SAN FRANCISCO CA
A2   30      XYZ STARBUCKS MINNEAPOLIS MN
A3   40      HOLIDAY BEMIDJI MN
A4   15      MCDONALDS MADISON WI
A5   20      ABC SUPERAMERICA MI
A6   12      SUBWAY ROCHESTER MN
A7   55      NNT BURGER KING WI
A8   45      UBER TRIP CA
A9   60      superamerica CA
A10  75      AMAZON NY
I want to create a new column which categorizes each record based on a keyword match from the description column.
With help from this answer, I did it the following way:
import re

# Map each keyword to its category.
dict1 = {
    "LYFT": "cab_ride",
    "UBER": "cab_ride",
    "STARBUCKS": "Food",
    "MCDONALDS": "Food",
    "SUBWAY": "Food",
    "BURGER KING": "Food",
    "HOLIDAY": "Gas",
    "SUPERAMERICA": "Gas"
}

def get_category_from_desc(x):
    # Return the category of the first key in dict1 found in x (case-insensitive).
    try:
        return next(dict1[k] for k in dict1 if re.search(k, x, re.IGNORECASE))
    except StopIteration:
        # No keyword matched this description.
        return "Other"

foo['category'] = foo.description.map(get_category_from_desc)
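One variation I have considered, shown here only as a rough sketch, is to compile all the keys into a single case-insensitive alternation so that each description is scanned once instead of once per key. The names `keyword_to_category`, `pattern` and `get_category_compiled` are made up for this illustration, and the lookup assumes every key is written in upper case. Note that it is not strictly equivalent to the function above: it returns the category of the leftmost keyword in the description, not of the first matching key in dictionary order.

```python
import re

# Same keyword -> category mapping as dict1, repeated so the snippet runs on its own.
keyword_to_category = {
    "LYFT": "cab_ride",
    "UBER": "cab_ride",
    "STARBUCKS": "Food",
    "MCDONALDS": "Food",
    "SUBWAY": "Food",
    "BURGER KING": "Food",
    "HOLIDAY": "Gas",
    "SUPERAMERICA": "Gas",
}

# One alternation of all keywords. re.escape guards against regex metacharacters
# in a key; longer keys go first so they win over a shorter key that is their prefix.
pattern = re.compile(
    "|".join(re.escape(k) for k in sorted(keyword_to_category, key=len, reverse=True)),
    re.IGNORECASE,
)

def get_category_compiled(description):
    match = pattern.search(description)
    if match is None:
        return "Other"
    # Assumption: all keys are upper case, so normalise the matched text before the lookup.
    return keyword_to_category[match.group(0).upper()]
```

It would be applied the same way, with `foo.description.map(get_category_compiled)`.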
This works, but I want to ask whether it is the best way to solve this problem. Since I have a much larger set of keywords that can indicate a category, I would have to create a huge dictionary:
dict1 = {
    "STARBUCKS": "Food",
    "MCDONALDS": "Food",
    "SUBWAY": "Food",
    "BURGER KING": "Food",
    # ... ~50 more keys for "Food"
    "HOLIDAY": "Gas",
    "SUPERAMERICA": "Gas",
    # ... ~20 more keys for "Gas"
    "WALMART": "grocery",
    "COSTCO": "grocery",
    # ... ~30 more keys for "grocery"
    # ... many more categories, each with a large number of keys
}
Edit: I also want to know whether there is an approach that does not require me to create a huge dictionary like the one shown above. Can I achieve this with a smaller data structure, something like:
dict2 = {
    "cab_ride": ["LYFT", "UBER"],  # ...
    "food": ["STARBUCKS", "MCDONALDS", "SUBWAY", "BURGER KING"],  # ...
    "gas": ["HOLIDAY", "SUPERAMERICA"]  # ...
}

# Invert the grouped structure into a flat keyword -> category mapping.
dict3 = {v: k for k, V in dict2.items() for v in V}
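To make the idea concrete, here is a minimal sketch of how I imagine the inverted mapping being used with pandas string methods instead of a Python-level function. This is an assumption on my part, not something I have tested on the full data: `keyword_regex`, `descriptions`, `matched` and `categories` are illustrative names, the three sample descriptions are taken from `foo`, and the lookup assumes every keyword is written in upper case.

```python
import re
import pandas as pd

# Grouped structure: category -> list of keywords (same shape as dict2).
dict2 = {
    "cab_ride": ["LYFT", "UBER"],
    "food": ["STARBUCKS", "MCDONALDS", "SUBWAY", "BURGER KING"],
    "gas": ["HOLIDAY", "SUPERAMERICA"],
}

# Invert to keyword -> category (same result as dict3).
dict3 = {keyword: category for category, keywords in dict2.items() for keyword in keywords}

# A single capturing group containing every keyword, escaped for safety.
keyword_regex = "(" + "|".join(re.escape(k) for k in dict3) + ")"

# Three of the sample descriptions, so the snippet runs on its own.
descriptions = pd.Series(["LYFT SAN FRANCISCO CA", "superamerica CA", "AMAZON NY"])

# Extract the first keyword found in each description (NaN when nothing matches).
matched = descriptions.str.extract(keyword_regex, flags=re.IGNORECASE, expand=False)

# Assumption: keywords are upper case, so normalise before mapping to a category.
categories = matched.str.upper().map(dict3).fillna("Other")
```

With the real data the last two lines would operate on `foo['description']` and the result would be assigned to `foo['category']`. The category labels come out as written in `dict2` (for example "gas", not "Gas").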