
I have a dataframe with reviews of restaurants. A column "description_sentiment" has been assigned a score ranging from -1.00 to +1.00. Some of the rows are 0 (due to some other factors in the dataframe), so I'd like to interpolate those rows based on other information, like the Award score.

interpolate_cols = ["description_sentiment", "Award_ordinal"]

As I understand it, best practice for interpolation is to set a mask for each possible value and combine them, as in the last block below.

nan_mask = michelin["description_sentiment"].isna()  # note: == np.nan is always False; use isna()

award1_mask = (michelin["Award_ordinal"]==1)
award2_mask = (michelin["Award_ordinal"]==2)
award3_mask = (michelin["Award_ordinal"]==3)
award4_mask = (michelin["Award_ordinal"]==4)
award5_mask = (michelin["Award_ordinal"]==5)

michelin.loc[nan_mask & award1_mask, "description_sentiment"] = michelin.loc[award1_mask, "description_sentiment"].mean()
...

and then list all possible values.
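For a single grouping column, the five hand-written masks can be generated in a loop over `unique()` instead. A minimal sketch on toy data (the `michelin` frame here is a small stand-in with the column names from the question):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the michelin frame
michelin = pd.DataFrame({
    "Award_ordinal": [1, 1, 2, 2, 3, 3],
    "description_sentiment": [0.5, np.nan, 0.2, 0.4, np.nan, 0.3],
})

nan_mask = michelin["description_sentiment"].isna()

# One iteration per award level instead of one hand-written mask each
for award in michelin["Award_ordinal"].unique():
    award_mask = michelin["Award_ordinal"] == award
    michelin.loc[nan_mask & award_mask, "description_sentiment"] = (
        michelin.loc[award_mask, "description_sentiment"].mean()  # mean skips NaN
    )
```

This fills each missing sentiment with the mean of its award group, without enumerating the levels by hand.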

My question is: what happens when the complexity of the data increases, i.e. more features? Is the best way really to list all values individually, or is there a simpler programmatic way, like:

interpolate() np.nan for col.unique(), col.unique(), col.unique() 

Many thanks.


Edit: MRE

The first three columns should be considered ordinal, the next three nominal, the next few one-hot encoded (boolean from categorical nominal), and the final one continuous.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    
    "A" : np.random.choice([1, 12], 1000),
    "B" : np.random.choice([1, 20], 1000),
    "C" : np.random.choice([1, 25], 1000),

    "D" : np.random.choice(["Japan", "China", "Indonesia", "Thailand", "Laos", "Cambodia", "Philippines"], 1000),
    "E" : np.random.choice(["Japan", "China", "Indonesia", "Thailand", "Laos", "Cambodia", "Philippines"], 1000),
    "F" : np.random.choice(["Japan", "China", "Indonesia", "Thailand", "Laos", "Cambodia", "Philippines"], 1000),
        
    "G" : np.random.choice([0, 1], 1000),
    "H" : np.random.choice([0, 1], 1000),
    "I" : np.random.choice([0, 1], 1000),
    "J" : np.random.choice([0, 1], 1000),
    "K" : np.random.choice([0, 1], 1000),
    "L" : np.random.choice([0, 1], 1000),
    "M" : np.random.choice([0, 1], 1000),
    
    "missing_values" : np.random.choice(np.arange(0,1000), 1000) /100
})

df.loc[df["missing_values"].sample(frac=0.1).index, "missing_values"] = np.nan

As I understand it, interpolate() could not be used because it implies a linear relationship between missing values, and this isn't a timeseries. A random forest or decision tree could work, but I was hoping to find a solution using a mask.

Is there a programmatic way to achieve this?
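Building on the groupby suggestion in the comments, a hedged sketch of a fully programmatic version against a trimmed-down copy of the MRE: group by any mix of categorical columns and fill NaNs with the group mean (column names follow the MRE; the subset of grouping columns is an arbitrary choice for illustration):

```python
import numpy as np
import pandas as pd

# Trimmed-down version of the MRE
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": rng.choice([1, 12], 100),                     # ordinal
    "D": rng.choice(["Japan", "China", "Laos"], 100),  # nominal
    "G": rng.choice([0, 1], 100),                      # one-hot
    "missing_values": rng.choice(np.arange(0, 1000), 100) / 100,
})
df.loc[df.sample(frac=0.1, random_state=1).index, "missing_values"] = np.nan
n_before = int(df["missing_values"].isna().sum())  # 10 rows blanked out

group_cols = ["A", "D", "G"]  # any number of grouping columns

# Fill each NaN with the mean of the rows sharing the same group key
df["missing_values"] = df.groupby(group_cols)["missing_values"].transform(
    lambda s: s.fillna(s.mean())
)
```

A NaN survives only if its entire group is NaN, so very sparse group combinations may still need a fallback (e.g. the overall column mean).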

  • Your question is missing a minimal reproducible example of the data and the matching expected output. Commented Oct 6, 2024 at 16:03
  • If you need to interpolate based on multiple columns at the same time (for example, both Award_ordinal and other columns), you can extend the `groupby` method to work with multiple columns, like michelin['description_sentiment'] = michelin.groupby(['Award_ordinal', 'Another_feature'])['description_sentiment'].transform(lambda x: x.fillna(x.mean())). Commented Oct 6, 2024 at 16:14
  • It would be helpful if you provided a minimal reproducible example to help us answer you more precisely. Commented Oct 6, 2024 at 16:20
  • The groupby sounds like the right answer, actually (!!!). I'm somewhat surprised. Does it work with continuous datatypes, or does one need to create buckets with pd.cut? Commented Oct 7, 2024 at 13:42
  • Added an MRE to play around with. All variables are random. Commented Oct 7, 2024 at 13:43
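Regarding the pd.cut question in the comments: a continuous grouping key does need binning first, since groupby would otherwise put nearly every row in its own group. A minimal sketch with made-up column names and an arbitrary bin count:

```python
import numpy as np
import pandas as pd

# Hypothetical data: a continuous feature and a sentiment column with gaps
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "price": rng.uniform(10, 200, 50),
    "sentiment": rng.uniform(-1, 1, 50),
})
df.loc[df.sample(frac=0.2, random_state=0).index, "sentiment"] = np.nan

# Bin the continuous key, then fill within each bin
df["price_bin"] = pd.cut(df["price"], bins=4)
df["sentiment"] = df.groupby("price_bin", observed=True)["sentiment"].transform(
    lambda s: s.fillna(s.mean())
)
```

The bin edges (equal-width here; pd.qcut gives equal-count bins instead) determine how coarse the imputation is, so they are worth tuning to the data.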
