1

I have a dataframe with None values in one column. I would like to replace this None values with the maximum value of the "category" for the same combination of other columns.

Example: pandas dataframe

import pandas as pd
d = {'company': ['Company1', 'Company1', 'Company1', 'Company1', 'Company2', 'Company2'], 'product': ['Product A', 'Product A', 'Product F', 'Product A', 'Product F', 'Product F'], 'category': ['1', None, '3', '2', None, '5']}

df = pd.DataFrame(d)

   company   product       category
0  Company1  Product A        1
1  Company1  Product A     None
2  Company1  Product F        3
3  Company1  Product A        2
4  Company2  Product F     None
5  Company2  Product F        5

I want to replace the None values in 3. column with the max() value of this column for the unique combination (group of 1. + 2. column). The expected result would look like this:

   company   product       category
0  Company1  Product A        1
1  Company1  Product A        **2**
2  Company1  Product F        3
3  Company1  Product A        2
4  Company2  Product F        **5**
5  Company2  Product F        5

What I've tried: I've grouped the 1. + 2. column ("company + product") + get the max() of the 3. column + build a dictionary 'category'. (based on the idea of GroupBy results to dictionary of lists)

df_dict = df[~df['category'].isna()].groupby(['company','product'])['category'].max().apply(list).to_dict()

and I get this dict df_dict (shows the maximum category value for each combination):

{('Company1', 'Product A'): ['2'], ('Company1', 'Product F'): ['1'], ('Company2', 'Product F'): ['5']}

Now I want to replace None with the max values for each combination from my dcitionary

df[df['category'].isna()]

   ompany    product       category
1  Company1  Product A     None
4  Company2  Product F     None

The question, how can I do it? I tried with set_index()

df[df['category'].isna()].set_index(['company', 'product']).index

results in

MultiIndex([('Company1', 'Product A'),
        ('Company2', 'Product F')],
       names=['company', 'product'])

which would fit the entry in the dictionary and use it together map()

    df['category'] = df[df['category'].isna()].set_index(['company', 'product']).in
dex.map(df_dict)

... but I get an ValueError

ValueError: Length of values (2) does not match length of index (6)
1
  • very good question btw with a great reproducible example. Commented Aug 6, 2020 at 9:47

1 Answer 1

2

Two steps:

First, let's turn those None values to NaNs so we can use numeric operations.

df['category'] = pd.to_numeric(df['category']) # add arg ,errors='coerce' if needed.

Second, let's use groupby transform and max to fill only the NaN values:

df["category"] = df["category"].fillna(
    df.groupby(["company", "product"])["category"].transform("max")
)
print(df)

    company    product  category
0  Company1  Product A       1.0
1  Company1  Product A       2.0
2  Company1  Product F       3.0
3  Company1  Product A       2.0
4  Company2  Product F       5.0
5  Company2  Product F       5.0
Sign up to request clarification or add additional context in comments.

2 Comments

Thank you for your answer and your help. It solved the problem. +1
I have another table almost same problem. But there is an additional colum with strings. I will add this case to the question on the top.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.