Background

I have a dataset that looks like this:

product_title   price
Women's Pant    20.00
Men's Shirt     30.00
Women's Dress   40.00
Blue 4" Shorts  30.00
Blue Shorts     35.00
Green 2" Shorts 30.00

I created a new column called gender which contains the value women, men, or unisex, depending on which strings appear in product_title.

The output looks like this:

product_title   price   gender
Women's Pant    20.00   women
Men's Shirt     30.00   men
Women's Dress   40.00   women
Blue 4" Shorts  30.00   women
Blue Shorts     35.00   unisex
Green 2" Shorts 30.00   women

Approach

I created the new column using if/else expressions inside a list comprehension:

df['gender'] = ['women' if 'women' in word or 'blue 4"' in word or 'green 2"' in word
                else "men" if "men" in word
                else "unisex"
                for word in df.product_title.str.lower()]

Although this approach works, it becomes very long when I have a lot of conditions for labeling women vs men vs unisex. Is there a cleaner way to do this? Is there a way I can pass a list of strings instead of having a long chain of or conditions?

I would really appreciate any help, as I am new to Python and the pandas library.

3 Answers

IIUC,

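import numpy as np
s = df['product_title'].str.lower()
# np.select uses the first condition that matches, so test for women first:
# 'men' is a substring of 'women' and would otherwise catch those rows too.
df['gender'] = np.select([s.str.contains('women|blue 4"|green 2"'),
                          s.str.contains('men')],
                         ['women', 'men'],
                         default='unisex')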
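
To address the "list of strings" part of the question, here is a minimal, self-contained sketch of the same np.select idea on the sample data. The names women_keys, men_keys and to_pattern are hypothetical, and re.escape is my own addition (an assumption, so that each keyword is matched literally rather than as a regex). The women condition is listed first because np.select takes the first condition that matches.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'product_title': ["Women's Pant", "Men's Shirt", "Women's Dress",
                      'Blue 4" Shorts', 'Blue Shorts', 'Green 2" Shorts'],
    'price': [20.0, 30.0, 40.0, 30.0, 35.0, 30.0],
})

# Hypothetical keyword lists; extend these instead of chaining `or` conditions.
women_keys = ['women', 'blue 4"', 'green 2"']
men_keys = ['men']


def to_pattern(keys):
    # Escape each keyword so it is matched literally, then join with regex OR.
    return '|'.join(re.escape(k) for k in keys)


s = df['product_title'].str.lower()
df['gender'] = np.select(
    [s.str.contains(to_pattern(women_keys)),   # checked first
     s.str.contains(to_pattern(men_keys))],    # only used if the first fails
    ['women', 'men'],
    default='unisex',
)
print(df)
```

Adding a new rule then only means appending a string to one of the lists.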

2 Comments

Thanks for your help! In my case, though, I have certain titles like "Blue 4" Shorts" or "Green 2" Shorts" which should be labeled women even if they do not have the string 'women' in the product_title.
You can include those; please check the answer now. You could also include only "shorts".

Here is another idea with str.extract and Series.map:

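import re

# Map each label to the keywords that imply it.
d = {'women': ['women', 'blue 4"', 'green 2"'], 'men': ['men']}
# Invert it so that each keyword points to its label.
d1 = {val: k for k, v in d.items() for val in v}
pat = '|'.join(d1.keys())
# Extract the first matching keyword (case-insensitive), then map it to its label.
df['gender'] = (df['product_title'].str.extract('(' + pat + ')', flags=re.I, expand=False)
                .str.lower().map(d1).fillna('unisex'))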

print(df)
           product_title  price  gender
0           Women's Pant   20.0   women
1            Men's Shirt   30.0     men
2          Women's Dress   40.0   women
3         Blue 4" Shorts   30.0   women
4            Blue Shorts   35.0  unisex
5        Green 2" Shorts   30.0   women
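
To make the intermediate steps easier to follow, here is a small self-contained sketch on a few of the sample titles. The variable names titles, matched and gender are hypothetical, and re.escape is an addition of mine (an assumption, in case a keyword ever contains regex metacharacters). Note that 'men' inside "Women" is not a problem here: the regex returns the leftmost match, and 'women' starts earlier in the string.

```python
import re

import pandas as pd

titles = pd.Series(["Women's Pant", "Men's Shirt", 'Blue 4" Shorts', 'Blue Shorts'])

d = {'women': ['women', 'blue 4"', 'green 2"'], 'men': ['men']}

# Invert the mapping: each keyword becomes a key that points to its label.
d1 = {val: k for k, v in d.items() for val in v}
# d1 == {'women': 'women', 'blue 4"': 'women', 'green 2"': 'women', 'men': 'men'}

# One alternation pattern built from all keywords, escaped to match literally.
pat = '|'.join(re.escape(key) for key in d1)

# Step 1: pull out the first keyword found in each title (NaN if none).
matched = titles.str.extract('(' + pat + ')', flags=re.I, expand=False)

# Step 2: lowercase the match so it lines up with the keys of d1, map it
# to a label, and fall back to 'unisex' where nothing matched.
gender = matched.str.lower().map(d1).fillna('unisex')
print(gender.tolist())  # ['women', 'men', 'women', 'unisex']
```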

You can try defining your own function and running it with apply and a lambda expression:

Create the function which you can change as you need:

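def sex(title):
    '''
    Look for specific keywords and return the matching label.
    '''
    title = title.lower()
    # Check the women keywords first, because 'men' is a substring of 'women'.
    # The keywords are lowercase because the title has been lowercased.
    if any(word in title for word in ['women', 'blue 4"', 'green 2"']):
        return 'women'
    if 'men' in title:
        return 'men'
    return 'unisex'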

Then apply it to the column you need to check:

df['gender'] = df['product_title'].apply(sex)

Cheers!

EDIT 3: After looking into the NumPy approach from @ansev, following @anky's comment, I found that the apply approach may be faster up to a certain point. Tested with 5000 rows it was still faster, but the NumPy approach had started to catch up, so it really depends on how big your dataset is. I will remove any comments on speed, since I was initially testing only on this small frame. Still a learning process, as you can see from my level.
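
If you want to check this on your own machine, a rough timing sketch could look like the following. The helper names label_apply and label_select, the row counts and the number of repetitions are all hypothetical choices of mine, and timings vary by machine, so I am not showing any results.

```python
import timeit

import numpy as np
import pandas as pd

base = ["Women's Pant", "Men's Shirt", "Women's Dress",
        'Blue 4" Shorts', 'Blue Shorts', 'Green 2" Shorts']


def sex(title):
    # Plain Python labeling function, called once per row by apply.
    title = title.lower()
    if any(word in title for word in ['women', 'blue 4"', 'green 2"']):
        return 'women'
    if 'men' in title:
        return 'men'
    return 'unisex'


def label_apply(titles):
    return titles.apply(sex)


def label_select(titles):
    # Vectorized labeling: the women condition comes first on purpose.
    s = titles.str.lower()
    return np.select([s.str.contains('women|blue 4"|green 2"'),
                      s.str.contains('men')],
                     ['women', 'men'], default='unisex')


for n_rows in (6, 600, 6000):
    # Repeat the six sample titles to build a larger Series.
    titles = pd.Series(base * (n_rows // len(base)))
    t_apply = timeit.timeit(lambda: label_apply(titles), number=20)
    t_select = timeit.timeit(lambda: label_select(titles), number=20)
    print(f'{n_rows:>5} rows  apply: {t_apply:.4f}s  np.select: {t_select:.4f}s')
```

Increase the row counts until you see where the two approaches cross over on your data.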

2 Comments

How big does the dataset need to be before the under-the-hood for loop starts to show? Really curious to push it up until the running times switch.
Thanks, indeed. Just by going up to 5k rows I could see things starting to change: the NumPy approach's running time increases really slowly, while apply's doubles. I edited the answer to reflect my better understanding now. Thanks for pointing out how superficial I was; I learn something new every day in this field.
