3

Say I have a list BRANDS that contains brand names:

BRANDS = ['Samsung', 'Apple', 'Nike', .....]

Dataframe A has following structure

row     item_title      brand_name

1    |  Apple 6S      |  Apple
2    |  Nike BB Shoes |  na  <-- need to fill with Nike
3    |  Samsung TV    |  na  <--need fill with Samsung
4    | Used bike      |  na  <--No need to do anything because there is no brand_name in the title 
    ....

I want to fill the column brand_name of Row 2 with Nike, Row 3 with Samsung, because they null and the item_title contains keywords that can be found in list BRANDS. How can I do it?

3 Answers 3

3

Vectorized solution:

In [168]: x = df.item_title.str.split(expand=True)

In [169]: df['brand_name'] = \
              df['brand_name'].fillna(x[x.isin(BRANDS)]
                                         .ffill(axis=1)
                                         .bfill(axis=1)
                                         .iloc[:, 0])

In [170]: df
Out[170]:
   row     item_title brand_name
0    1       Apple 6S      Apple
1    2  Nike BB Shoes       Nike
2    3     Samsung TV    Samsung
3    4      Used bike        NaN
Sign up to request clarification or add additional context in comments.

Comments

1

One approach is to use apply():

import pandas as pd
BRANDS = ['Samsung', 'Apple', 'Nike']

def get_brand_name(row):
    if ~pd.isnull(row['brand_name']):
        # don't do anything if brand_name is not null
        return row['brand_name']

    item_title = row['item_title']
    title_words = map(str.title, item_title.split())
    for tw in title_words:
        if tw in BRANDS:
            # return first 'match'
            return tw
    # default return None
    return None

df['brand_name'] = df.apply(lambda x: get_brand_name(x), axis=1)
print(df)
#   row     item_title brand_name
#0    1       Apple 6S      Apple
#1    2  Nike BB Shoes       Nike
#2    3     Samsung TV    Samsung
#3    4      Used bike       None

Notes

  • I converted the tokenized title to title-case using str.title() because that's how you defined BRANDS.
  • If you have a lot of brands, it's recommended to use a set instead of a list because lookups will be faster. However, this won't work if you care about order.

Comments

0

You can achieve the result you are after by writing a simple function. You can then use .apply() with a lambda function to generate your desired column.

def contains_any(s, arr):
    for item in arr:
        if item in s: return item
    return np.nan
df['brand_name'] = df['product'].apply(lambda x: match_substring(x, product_map))

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.