Adding values to new Pandas dataframe column based on partial string contents of existing column

Question

I have data stored as a dataframe using Python Pandas. Among the columns, I have a "Product" column which contains the brand name and model (e.g. Nike Air Jordan, Adidas Gazelle). I want to create a new column that just contains the brand (e.g. Nike, Adidas), which I will later use in groupby to summarize the data. From my research, I believe contains and regex can be used to do this. However, the implementation has not worked. I've also seen different approaches, some using "for i in range" while others do it as a replace in a single line of code.

import pandas as pd
import numpy as np

shoes_df = pd.DataFrame({
    'Product': ['Nike vaporfly', 'Nike Jordans', 'Adidas supernova', 'Asics Kayano',
                'Asics GT2010', 'Adidas gazelle', 'Nike air max', 'Nike Lebron'],
    'Unit sales': [1500, 1600, 2341, 1345, 4523, 2345, 1634, 3129],
})

shoes_df['Brand'] = np.where(shoes_df['Product'].str.contains('Nike.*|Adidas.*').any(), 'Nike|Adidas', np.nan)

print(shoes_df)

Here is my attempt at the "for i in range" approach, which did not work either. It raised "TypeError: 'Series' objects are mutable, thus they cannot be hashed".

shoes_df = pd.DataFrame({
    'Product': ['Nike vaporfly', 'Nike Jordans', 'Adidas supernova', 'Asics Kayano',
                'Asics GT2010', 'Adidas gazelle', 'Nike air max', 'Nike Lebron'],
    'Unit sales': [1500, 1600, 2341, 1345, 4523, 2345, 1634, 3129],
})

for i in shoes_df.iterrows():
    if shoes_df['Product'].str.contains('Nike').any():
        shoes_df.set_value(i, 'Brand', 'Nike')
    elif shoes_df['Product'].str.contains('Adidas').any():
        shoes_df.set_value(i, 'Brand', 'Adidas')
    elif shoes_df['Product'].str.contains('Asics').any():
        shoes_df.set_value(i, 'Brand', 'Asics')
    else:
        shoes_df.set_value(i, 'Brand', np.nan)
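
For reference, both attempts above test the whole column rather than one row at a time. `.any()` collapses the boolean Series into a single True/False, so every row ends up with the same value. In the loop, `iterrows()` yields `(index, row)` tuples, so `i` is a tuple containing a Series, which is the likely source of the hashing error (`set_value` has also since been removed from pandas). A minimal sketch of a per-row `str.contains` version, assuming a fixed brand list and a made-up 'Puma suede' row to show the no-match case:

```python
import numpy as np
import pandas as pd

shoes_df = pd.DataFrame({
    'Product': ['Nike vaporfly', 'Adidas supernova', 'Asics Kayano', 'Puma suede'],
})

# Start with an all-NaN object column, then fill it in brand by brand.
brand = pd.Series(np.nan, index=shoes_df.index, dtype=object)
for name in ('Nike', 'Adidas', 'Asics'):
    # No .any() here: the mask keeps one True/False per row.
    mask = shoes_df['Product'].str.contains(name, regex=False)
    brand[mask] = name

shoes_df['Brand'] = brand
print(shoes_df)
```

The loop runs once per brand rather than once per row, so it stays vectorized over the data.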

Scott Boston · Accepted Answer · 2017-11-14 17:12:42Z

4

IIUC:

shoes_df['brand'] = shoes_df.Product.str.extract(pat='(Nike|Adidas|Asics)',expand=False)

Output:

            Product  Unit sales   brand
0     Nike vaporfly        1500    Nike
1      Nike Jordans        1600    Nike
2  Adidas supernova        2341  Adidas
3      Asics Kayano        1345   Asics
4      Asics GT2010        4523   Asics
5    Adidas gazelle        2345  Adidas
6      Nike air max        1634    Nike
7       Nike Lebron        3129    Nike
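
Since the question mentions using the new column in a groupby, here is a self-contained sketch of the extract step followed by a summary. The `sales_by_brand` name is only illustrative.

```python
import pandas as pd

shoes_df = pd.DataFrame({
    'Product': ['Nike vaporfly', 'Nike Jordans', 'Adidas supernova', 'Asics Kayano',
                'Asics GT2010', 'Adidas gazelle', 'Nike air max', 'Nike Lebron'],
    'Unit sales': [1500, 1600, 2341, 1345, 4523, 2345, 1634, 3129],
})

# expand=False returns a Series for a single capture group,
# so it can be assigned straight to a new column.
shoes_df['brand'] = shoes_df['Product'].str.extract(r'(Nike|Adidas|Asics)', expand=False)

# Products that match none of the brands get NaN, and groupby drops NaN keys by default.
sales_by_brand = shoes_df.groupby('brand')['Unit sales'].sum()
print(sales_by_brand)
```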


1 Comment

skibbereen Over a year ago

This did the trick and easy to follow. I had missed using extract. Thank you. Also tested when brand wasn't the first word and it still worked.

cs95 · Accepted Answer · 2017-11-14 17:11:32Z

4

Option 1 (the hard way)
str.extract

brands = ['Nike', 'Adidas', 'Asics']
df['Brand'] = df.Product.str.extract('({})'.format('|'.join(brands)), expand=False)

df

            Product  Unit sales   Brand
0     Nike vaporfly        1500    Nike
1      Nike Jordans        1600    Nike
2  Adidas supernova        2341  Adidas
3      Asics Kayano        1345   Asics
4      Asics GT2010        4523   Asics
5    Adidas gazelle        2345  Adidas
6      Nike air max        1634    Nike
7       Nike Lebron        3129    Nike
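
If the brand list comes from somewhere less tidy, the pattern can be built a bit more defensively. This is only a sketch: the sample rows ('Retro Adidas gazelle', 'Puma suede') are made up, and using `str.title()` to normalise the casing is an assumption that only suits brands written in title case.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'Product': ['nike vaporfly', 'Retro Adidas gazelle', 'Asics Kayano', 'Puma suede'],
})
brands = ['Nike', 'Adidas', 'Asics']

# re.escape guards against brand names that contain regex metacharacters.
pattern = '({})'.format('|'.join(re.escape(b) for b in brands))

# IGNORECASE lets 'nike' match 'Nike'; the match can sit anywhere in the string.
df['Brand'] = (
    df['Product']
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)
    .str.title()
)
print(df)
```

Rows with no known brand (here 'Puma suede') come back as NaN.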

Option 2 (somewhat simpler)
str.split

df['Brand'] = df.Product.str.split().str[0]
df

            Product  Unit sales   Brand
0     Nike vaporfly        1500    Nike
1      Nike Jordans        1600    Nike
2  Adidas supernova        2341  Adidas
3      Asics Kayano        1345   Asics
4      Asics GT2010        4523   Asics
5    Adidas gazelle        2345  Adidas
6      Nike air max        1634    Nike
7       Nike Lebron        3129    Nike

You can extend this a bit to replace anything that isn't in brands with NaN:

df['Brand'] = np.where(df.Brand.isin(brands), df.Brand, np.nan)
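
A pandas-only way to do the same masking is `Series.where`, which keeps values where the condition holds and fills NaN elsewhere. A small sketch, with a made-up 'Puma suede' row to show the NaN case:

```python
import pandas as pd

df = pd.DataFrame({'Product': ['Nike vaporfly', 'Puma suede', 'Asics Kayano']})
brands = ['Nike', 'Adidas', 'Asics']

# Take the first whitespace-separated word of each product name.
first_word = df['Product'].str.split().str[0]

# Keep the word only if it is a known brand; everything else becomes NaN.
df['Brand'] = first_word.where(first_word.isin(brands))
print(df)
```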


3 Comments

skibbereen Over a year ago

Thanks. First one worked across different iterations. Option 2 worked when the brand was the first word, but if the brand came later in the string, it returned another word. Option 1 has worked regardless of where the brand was.

cs95 Over a year ago

@skibbereen Which is why I provided option 1 before option 2 ;/

jezrael Over a year ago

@cᴏʟᴅsᴘᴇᴇᴅ - bad dupe, stackoverflow.com/q/47292599/2901002, please find matched or open question.

ags29 · Accepted Answer · 2017-11-14 18:01:09Z

0

If you can assume that the brand is always the first word, then the solution below gives you the flexibility to capture brands beyond a known list, so I'm just adding it for interest:

shoes_df['Product'].str.extract(r'^([^\s]+)\s', expand=False)
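
One caveat, sketched with made-up product names: because the pattern requires whitespace after the first word, a product consisting of a single word does not match and comes back as NaN. Dropping the trailing `\s` avoids that.

```python
import pandas as pd

products = pd.Series(['Nike vaporfly', 'Puma suede', 'Reebok'])

# Requires whitespace after the first word, so 'Reebok' yields NaN.
strict = products.str.extract(r'^(\S+)\s', expand=False)

# Without the trailing \s, a one-word product is captured as well.
relaxed = products.str.extract(r'^(\S+)', expand=False)

print(strict.tolist())
print(relaxed.tolist())
```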
