Python how to add column to dataframe by using a partial substring match?

Question

My data is a list of products in a dataframe with other sales and ordering information.

product_cat_dict = {"T-Shirt": "T-Shirt",
                   "Top": "T-Shirt",
                   "Vest": "T-Shirt",
                   "Sweater": "Sweater"}

products = pd.DataFrame({"Product Name": ["T-Shirt White", "T-Shirt Black", "Top Orange", "Navy Vest", "Red Top", "Sweater Black"],
                        "Sales": [100, 200, 250, 50, 150, 300]})

I'm trying to add a new column to the dataframe that contains just the product category, derived from the "Product Name" column. I'd also like some products to be grouped into the same category, as defined in the dictionary above.

My desired result is the following table:

I used a dictionary so that it's easy to update if new products with undefined categories are added to the data. From reading other SO posts, it looks like I need to use str.contains for partial substring matching, but I can't get it to return the actual matched value (rather than the original data). The best I could get was a Series of Boolean values, using the code below.

products["Product Name"].str.contains("|".join(product_cat_dict.keys()))

Any help on how I can get to my desired result would be much appreciated.

norie · Accepted Answer · 2021-03-08 08:17:16Z

2

We can use a list comprehension to find which key of the category dictionary occurs in the product name, and return the value stored under that key.

import pandas as pd 

product_cat_dict = {"T-Shirt": "T-Shirt",
                   "Top": "T-Shirt",
                   "Vest": "T-Shirt",
                   "Sweater": "Sweater"}

products = pd.DataFrame({"Product Name": ["T-Shirt White", "T-Shirt Black", "Top Orange", "Navy Vest", "Red Top", "Sweater Black"],
                        "Sales": [100, 200, 250, 50, 150, 300]}) 

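# Collect the categories of all keys found in the name and keep the first one;
# "or [None]" gives None instead of an IndexError when no key matches.
products['category'] = products['Product Name'].apply(lambda name: ([v for k, v in product_cat_dict.items() if k in name] or [None])[0])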

print(products)
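
As a possible alternative to the list comprehension, here is a minimal sketch that stays within pandas string methods. It assumes the keys can be joined into a single regular expression: str.extract returns the matched substring (unlike str.contains, which only returns True/False), and map then translates the matched key into its category. The names pattern and matched_key are illustrative only.

```python
import re

import pandas as pd

product_cat_dict = {"T-Shirt": "T-Shirt",
                    "Top": "T-Shirt",
                    "Vest": "T-Shirt",
                    "Sweater": "Sweater"}

products = pd.DataFrame({"Product Name": ["T-Shirt White", "T-Shirt Black", "Top Orange", "Navy Vest", "Red Top", "Sweater Black"],
                         "Sales": [100, 200, 250, 50, 150, 300]})

# Build one alternation pattern from the dictionary keys; re.escape guards
# against keys that contain regex metacharacters.
pattern = "(" + "|".join(re.escape(key) for key in product_cat_dict) + ")"

# str.extract returns the matched key itself (NaN when no key is found).
matched_key = products["Product Name"].str.extract(pattern, expand=False)

# map translates each matched key into its category; NaN stays NaN.
products["category"] = matched_key.map(product_cat_dict)

print(products)
```

Two differences from the list comprehension are worth keeping in mind. A product name that matches no key ends up as NaN rather than raising an error. Also, when several keys occur in one name, the regex picks the match that starts earliest in the string, whereas the comprehension picks the first key in dictionary order. Both approaches are case-sensitive and match substrings, so "Top" would also match inside "Laptop".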

edited Mar 8, 2021 at 8:17

answered Mar 7, 2021 at 12:17

norie


2 Comments

FluffySheep1990 Over a year ago

Spot on! Thanks a lot, that's exactly what I was looking for. Although I think the first k in the list comprehension should be a v, because I wanted to return the value rather than the key from the dictionary.

norie Over a year ago

I've changed the code to return the value rather than the key.

Ezer K · Accepted Answer · 2021-03-07 12:02:18Z

0

Assuming that the key is always either the first or the second word of the product name, you could split the name and look up each of those words in the dictionary. This is not efficient, but it will probably work fine for a reasonably sized dataset.

products['pn1'] = [x.split()[0] for x in products['Product Name']]
products['pn2'] = [x.split()[1] for x in products['Product Name']]
# Look up the first word; fall back to the second word if the first is not a key.
products['Product Category'] = [
    product_cat_dict.get(first, product_cat_dict.get(second))
    for first, second in zip(products['pn1'], products['pn2'])
]
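
If the key is not guaranteed to be among the first two words, the same word-lookup idea can be extended to every word in the name. The following is only a sketch under that assumption; first_known_category is a hypothetical helper introduced here, not part of pandas or of the code above.

```python
import pandas as pd

product_cat_dict = {"T-Shirt": "T-Shirt",
                    "Top": "T-Shirt",
                    "Vest": "T-Shirt",
                    "Sweater": "Sweater"}

products = pd.DataFrame({"Product Name": ["T-Shirt White", "T-Shirt Black", "Top Orange", "Navy Vest", "Red Top", "Sweater Black"],
                         "Sales": [100, 200, 250, 50, 150, 300]})


def first_known_category(name, mapping):
    """Return the category of the first whitespace-separated word of `name`
    that is a key of `mapping`, or None if no word is a key."""
    for word in name.split():
        if word in mapping:
            return mapping[word]
    return None


# Works for names of any length and does not need the helper columns pn1/pn2.
products["Product Category"] = [
    first_known_category(name, product_cat_dict)
    for name in products["Product Name"]
]

print(products)
```

Because this compares whole words against the keys, "Top" will not match inside a longer word such as "Laptop", but a key that itself contains a space would never match.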

edited Mar 7, 2021 at 12:02

answered Mar 7, 2021 at 11:57

Ezer K


1 Comment

FluffySheep1990 Over a year ago

Thanks for your response. The product name column in my data sometimes has descriptions longer than two words, and in different orders; I probably should have put that in the OP. norie has provided a solution that was able to extract the info I needed.
