1

I have a column in a dataframe with column 'url_product' that contains a list of dictionaries as below (showing first 4 rows as an example). Each dictionary contains url and product associated with that url.

df.url_product[0] 
[{'url': 'https://www.abcstore.com/product/11-abc-gift-card/',
  'product': 'giftcard, abcstore'},
 {'url': 'https://www.abcstore.com/product/10-skin-lotion/',
  'product': 'lotion'},
 {'url': 'https://www.abcstore.com/product/10414-moisturising-cream',
  'product': 'cream'},
 {'url': 'https://www.abcstore.com/blog/best-skincare-lotions/',
  'product': 'lotion'},
 {'url': 'https://www.abcstore.com/article/140-best-anti-aging-serum',
  'product': 'serum'}]

df.url_product[1] 
[{'url': 'https://www.abcstore.com/product/7-night-cream',
  'product': 'nightcream'},
 {'url': 'http://www.abcstore.com/product/149-smoothing-serum/',
  'product': 'serum'},
 {'url': 'https://www.abcstore.com/blog/rapid-reveal-face-peel',
  'product': 'facepeel'}]

df.url_product[2] 
[{'url': 'https://www.abcstore.com/product/25-night-infusion-cream',
  'product': 'infusioncream'},
 {'url': 'https://www.abcstore.com/product/144-bio-cellulose-mask',
  'product': 'cellulosemask, mask'},
 {'url': 'https://www.abcstore.com/',
  'product': 'bestseller, homepage'},
 {'url': 'https://www.abcstore.com/blog/essential-skincare-products/',
  'product': 'essential, blog'}]

df.url_product[3] 
[{'url': 'https://www.abcstore.com/blog/top-skincare-products-2020',
  'product': 'skincare, 2020'},
 {'url': 'http://www.abcstore.com/article/smoothing-serum/',
  'product': 'serum'}]

For each of these rows, I am looking to do the following

  1. Filter for only the dictionaries where URL contains '/product/' and parse out the number following the 'product/' (will call this product_id as easy reference). expected product_id of dictionary below = 11
    {'url': 'https://www.abcstore.com/product/11-abc-gift-card/', 'product': 'giftcard, abcstore'}
  2. For each of the dictionaries where URL contains '/product/' also count the number of 'products'. For the example below, that count would be 2 (giftcard, abcstore)
    {'url': 'https://www.abcstore.com/product/11-abc-gift-card/', 'product': 'giftcard, abcstore'}
  3. For each row return the product_id that has the highest count and create a new column('top_product_id') in the dataframe to show this. If no single product_id has the highest count, leave as blank

Expected outcome for the first 3 three rows after the steps above

df.top_product_id

[0] '11'
[1] (blank)
[2] '114'
[3] (blank)

a few points to explain the expected outcome

Row[0]- expect 11 as product_id 11 has a count of 2 (giftcard, abcstore) while product_id 10 and 10414 only has 1 each. both the blog and article urls will be skipped as they do not contain '/product/' in the url
Row[1]- expect the outcome to be blank as the two product URLs are attached to 1 product each and since there is no single url with the highest count, row would be blank
Row[2]- expect 114 as product_id 114 has highest count of 2 (cellulosemask, mask)
Row[3]- expect the outcome to be blank as there are no product URLs

How would I create the new column ('top_product_id') in the dataframe with the expected outcome?

1
  • how did the data get into a dataframe? I believe it would be easier to work with the data outside a dataframe Commented Jun 19, 2020 at 5:24

1 Answer 1

1

Here is one possible approach:

def findID(data):
    df1 = pd.DataFrame(data)
    df1 = df1.assign(
        count=df1['product'].str.split(', ').str.len(),
        product_id=df1['url'].str.extract(r'.*/product/(\d+)', expand=False)
    ).dropna().drop_duplicates(subset=['count'], keep=False)

    if df1.empty:
        return '(Blank)'

    return df1.loc[df1['count'].idxmax(), 'product_id']


df['top_product_id'] = df['url_product'].apply(findID)

# print(df)
                                         url_product top_product_id
0  [{'url': 'https://www.abcstore.com/product/11-...             11
1  [{'url': 'https://www.abcstore.com/product/7-n...        (Blank)
2  [{'url': 'https://www.abcstore.com/product/25-...            144
3  [{'url': 'https://www.abcstore.com/blog/top-sk...        (Blank)
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.