I have a dataframe with reviews of restaurants. A column, "description_sentiment", has been assigned a score ranging from -1.00 to +1.00. Some of the rows are 0 (due to other factors in the dataframe), so I'd like to interpolate those rows based on other information, like the Award score.
interpolate_cols = ["description_sentiment", "Award_ordinal"]
As I understand it, best practice for interpolation is to set a mask for each possible value, as in the block below.
nan_mask = michelin["description_sentiment"].isna()  # == np.nan is always False; use .isna()
award1_mask = (michelin["Award_ordinal"]==1)
award2_mask = (michelin["Award_ordinal"]==2)
award3_mask = (michelin["Award_ordinal"]==3)
award4_mask = (michelin["Award_ordinal"]==4)
award5_mask = (michelin["Award_ordinal"]==5)
michelin.loc[nan_mask & award1_mask, "description_sentiment"] = michelin.loc[award1_mask, "description_sentiment"].mean()
...
and then list all possible values
My question is: what happens when the complexity of the data increases, i.e., with more features? Is the best approach really to list every value individually, or is there a simpler programmatic way, like:
interpolate() np.nan for col.unique(), col.unique(), col.unique()
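For concreteness, here is a sketch of the kind of programmatic version I mean — the function name `fill_by_group_means` and the tiny frame are my own invention, not part of the real data:

```python
import numpy as np
import pandas as pd

def fill_by_group_means(df, target, group_cols):
    """Fill NaNs in `target` with the mean of `target` within each
    combination of values in `group_cols` (loop form of the mask idea)."""
    out = df.copy()
    nan_mask = out[target].isna()
    # groupby(...).groups gives one index per value combination,
    # so we never have to list the combinations by hand
    for _, idx in out.groupby(group_cols).groups.items():
        group_mask = out.index.isin(idx)
        out.loc[nan_mask & group_mask, target] = out.loc[group_mask, target].mean()
    return out
```

This replaces the hand-written `award1_mask` … `award5_mask` block with one loop over whatever value combinations actually occur.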
many thanks.
Edit: MRE
The first three columns should be considered ordinal, the next three nominal, the next few one-hot encoded (boolean from categorical nominal), and the final column continuous.
import numpy as np
import pandas as pd

df = pd.DataFrame({
"A" : np.random.choice([1, 12], 1000),
"B" : np.random.choice([1, 20], 1000),
"C" : np.random.choice([1, 25], 1000),
"D" : np.random.choice(["Japan", "China", "Indonesia", "Thailand", "Laos", "Cambodia", "Philippines"], 1000),
"E" : np.random.choice(["Japan", "China", "Indonesia", "Thailand", "Laos", "Cambodia", "Philippines"], 1000),
"F" : np.random.choice(["Japan", "China", "Indonesia", "Thailand", "Laos", "Cambodia", "Philippines"], 1000),
"G" : np.random.choice([0, 1], 1000),
"H" : np.random.choice([0, 1], 1000),
"I" : np.random.choice([0, 1], 1000),
"J" : np.random.choice([0, 1], 1000),
"K" : np.random.choice([0, 1], 1000),
"L" : np.random.choice([0, 1], 1000),
"M" : np.random.choice([0, 1], 1000),
"missing_values" : np.random.choice(np.arange(0,1000), 1000) /100
})
df.loc[df["missing_values"].sample(frac=0.1).index, "missing_values"] = np.nan
As I understand it, interpolate() can't be used here because it assumes a linear relationship between neighbouring values, and this isn't a time series. A random forest or decision tree could work, but I was hoping to find a solution using a mask.
Is there a programmatic way to achieve this?
If you want to group by multiple features (Award_ordinal and other columns), you can extend the `groupby` method to work with multiple columns, like `michelin['description_sentiment'] = michelin.groupby(['Award_ordinal', 'Another_feature'])['description_sentiment'].transform(lambda x: x.fillna(x.mean()))`.
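Run against the MRE above, that approach might look like the following sketch (the seed and the reduced set of grouping columns are assumptions for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": rng.choice([1, 12], 100),                       # ordinal
    "G": rng.choice([0, 1], 100),                        # one-hot boolean
    "missing_values": rng.choice(np.arange(0, 1000), 100) / 100,
})
# knock out ~10% of the continuous column
df.loc[df["missing_values"].sample(frac=0.1, random_state=0).index,
       "missing_values"] = np.nan

# fill each NaN with the mean of its (A, G) group
df["missing_values"] = (
    df.groupby(["A", "G"])["missing_values"]
      .transform(lambda x: x.fillna(x.mean()))
)
```

Note that a group whose values are all NaN stays NaN, since its mean is undefined; with enough rows per group that case rarely arises.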