
I have a dataframe with reviews of restaurants. A column "description_sentiment" has been assigned a score ranging from -1.00 to +1.00. Some of the rows are 0 (due to some other factors in the dataframe), so I'd like to interpolate those rows based on other information, like the Award score.

interpolate_cols = ["description_sentiment", "Award_ordinal"]

As I understand it, best practice for interpolation is to set a mask for each possible value and combine them, as in the last block below.

nan_mask = michelin["description_sentiment"].isna()  # note: == np.nan is always False; use isna()

award1_mask = (michelin["Award_ordinal"]==1)
award2_mask = (michelin["Award_ordinal"]==2)
award3_mask = (michelin["Award_ordinal"]==3)
award4_mask = (michelin["Award_ordinal"]==4)
award5_mask = (michelin["Award_ordinal"]==5)

michelin.loc[nan_mask & award1_mask, "description_sentiment"] = michelin.loc[award1_mask, "description_sentiment"].mean()
...

and then list all possible values.
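For a single grouping column, the five hand-written masks can be generated in a loop over `unique()` instead. A minimal sketch on toy data (the `michelin` frame here is a small stand-in with the column names from the question):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the michelin frame
michelin = pd.DataFrame({
    "Award_ordinal": [1, 1, 2, 2, 3, 3],
    "description_sentiment": [0.5, np.nan, 0.2, 0.4, np.nan, 0.3],
})

nan_mask = michelin["description_sentiment"].isna()

# One iteration per award level instead of one hand-written mask each
for award in michelin["Award_ordinal"].unique():
    award_mask = michelin["Award_ordinal"] == award
    michelin.loc[nan_mask & award_mask, "description_sentiment"] = (
        michelin.loc[award_mask, "description_sentiment"].mean()  # mean skips NaN
    )
```

This fills each missing sentiment with the mean of its award group, without enumerating the levels by hand.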

My question is: what happens when the complexity of the data increases, i.e. more features? Is the best way really to list all values individually, or is there a simpler programmatic way, like:

interpolate() np.nan for col.unique(), col.unique(), col.unique() 

Many thanks.


Edit: MRE

The first three columns should be considered ordinal, the next three nominal, the next few one-hot encoded (boolean from categorical nominal), and the final one continuous.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    
    "A" : np.random.choice([1, 12], 1000),
    "B" : np.random.choice([1, 20], 1000),
    "C" : np.random.choice([1, 25], 1000),

    "D" : np.random.choice(["Japan", "China", "Indonesia", "Thailand", "Laos", "Cambodia", "Philippines"], 1000),
    "E" : np.random.choice(["Japan", "China", "Indonesia", "Thailand", "Laos", "Cambodia", "Philippines"], 1000),
    "F" : np.random.choice(["Japan", "China", "Indonesia", "Thailand", "Laos", "Cambodia", "Philippines"], 1000),
        
    "G" : np.random.choice([0, 1], 1000),
    "H" : np.random.choice([0, 1], 1000),
    "I" : np.random.choice([0, 1], 1000),
    "J" : np.random.choice([0, 1], 1000),
    "K" : np.random.choice([0, 1], 1000),
    "L" : np.random.choice([0, 1], 1000),
    "M" : np.random.choice([0, 1], 1000),
    
    "missing_values" : np.random.choice(np.arange(0,1000), 1000) /100
})

df.loc[df["missing_values"].sample(frac=0.1).index, "missing_values"] = np.nan

As I understand it, interpolate() could not be used because it implies a linear relationship between missing values, and this isn't a timeseries. A random forest or decision tree could work, but I was hoping to find a solution using a mask.

Is there a programmatic way to achieve this?
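Building on the groupby suggestion in the comments, a hedged sketch of a fully programmatic version against a trimmed-down copy of the MRE: group by any mix of categorical columns and fill NaNs with the group mean (column names follow the MRE; the subset of grouping columns is an arbitrary choice for illustration):

```python
import numpy as np
import pandas as pd

# Trimmed-down version of the MRE
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": rng.choice([1, 12], 100),                     # ordinal
    "D": rng.choice(["Japan", "China", "Laos"], 100),  # nominal
    "G": rng.choice([0, 1], 100),                      # one-hot
    "missing_values": rng.choice(np.arange(0, 1000), 100) / 100,
})
df.loc[df.sample(frac=0.1, random_state=1).index, "missing_values"] = np.nan
n_before = int(df["missing_values"].isna().sum())  # 10 rows blanked out

group_cols = ["A", "D", "G"]  # any number of grouping columns

# Fill each NaN with the mean of the rows sharing the same group key
df["missing_values"] = df.groupby(group_cols)["missing_values"].transform(
    lambda s: s.fillna(s.mean())
)
```

A NaN survives only if its entire group is NaN, so very sparse group combinations may still need a fallback (e.g. the overall column mean).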

  • Your question is missing a minimal reproducible example of the data and the matching expected output. Commented Oct 6, 2024 at 16:03
  • If you need to interpolate based on multiple columns at the same time (for example, both Award_ordinal and other columns), you can extend the `groupby` method to work with multiple columns, like michelin['description_sentiment'] = michelin.groupby(['Award_ordinal', 'Another_feature'])['description_sentiment'].transform(lambda x: x.fillna(x.mean())). Commented Oct 6, 2024 at 16:14
  • It would be helpful if you provided a minimal reproducible example to help us answer you more precisely. Commented Oct 6, 2024 at 16:20
  • The groupby sounds like the right answer, actually (!!!). I'm somewhat surprised. Does it work with continuous datatypes, or does one need to create buckets with pd.cut? Commented Oct 7, 2024 at 13:42
  • Added an MRE to play around with. All variables are random. Commented Oct 7, 2024 at 13:43
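Regarding the pd.cut question in the comments: a continuous grouping key does need binning first, since groupby would otherwise put nearly every row in its own group. A minimal sketch with made-up column names and an arbitrary bin count:

```python
import numpy as np
import pandas as pd

# Hypothetical data: a continuous feature and a sentiment column with gaps
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "price": rng.uniform(10, 200, 50),
    "sentiment": rng.uniform(-1, 1, 50),
})
df.loc[df.sample(frac=0.2, random_state=0).index, "sentiment"] = np.nan

# Bin the continuous key, then fill within each bin
df["price_bin"] = pd.cut(df["price"], bins=4)
df["sentiment"] = df.groupby("price_bin", observed=True)["sentiment"].transform(
    lambda s: s.fillna(s.mean())
)
```

The bin edges (equal-width here; pd.qcut gives equal-count bins instead) determine how coarse the imputation is, so they are worth tuning to the data.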
