I have four pandas DataFrames. The first two hold the values for the categorical and the numeric variables:

import pandas as pd

Cat_data = [
        ['Color', 'red', 0.2543], 
        ['Color', 'orange',0.1894], 
        ['Color', 'yellow',-0.2836],
        ['Fruit', 'orange', -1.3647], 
        ['Fruit','banana',0.3648]
        ] 

Cat_df = pd.DataFrame(Cat_data, columns = ['Variable', 'Cats', 'Value']) 

Num_data = [
        ['Quantity', '-inf', '5', 0.2145], 
        ['Quantity', '5', '10', 0.0268], 
        ['Quantity', '10', 'inf', -0.5421], 
        ['Rating', '-inf', '0.5', 0.6521], 
        ['Rating','0.5', 'inf', -0.4378], 
        ] 

Num_df = pd.DataFrame(Num_data, columns = ['Variable', 'Inclusive', 'Exclusive', 'Value']) 

In Num_data, 'Inclusive' and 'Exclusive' are the bounds a value is checked against: the first record applies to values >= -inf and < 5, the second record to values >= 5 and < 10, and so on. The values to check come from Actual_df.

The third DataFrame holds the actual values:

Actual_data = [
        ['yellow', 'banana', '4', '0.5'] 
        ] 

Actual_df = pd.DataFrame(Actual_data, columns = ['Color', 'Fruit', 'Quantity', 'Rating']) 

The fourth is the Value DataFrame, with the same column names as Actual_df:

import numpy as np

Value_df = pd.DataFrame(np.zeros((1, 4)),
                        columns = ['Color', 'Fruit', 'Quantity', 'Rating'])

I need to fill Value_df with the entries of the 'Value' columns of Cat_data and Num_data that correspond to the data in Actual_data. I am not sure how to merge the four DataFrames and, at the same time, check the values against the Inclusive and Exclusive columns.

Actual_data contains 'yellow', 'banana', '4', '0.5'. The corresponding values are:

yellow is in Cat_df as -0.2836

banana is in Cat_df as 0.3648

Quantity is in Num_df as 0.2145

Rating is in Num_df as -0.4378

The resulting Value_df should be:

Color    Fruit   Quantity   Rating
-0.2836  0.3648  0.2145     -0.4378

For Cat_data, I tried:

Value_df['Color'] = Actual_df['Color'].map(Cat_df.set_index('Cats')['Value'])

The problem is that 'orange' appears under both Color and Fruit, so it is ambiguous which value should be taken. I have to match on the variable as well. As written, the line raises:

InvalidIndexError: Reindexing only valid with uniquely valued Index objects
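
For reference, the error arises because 'orange' appears twice in the 'Cats' index once set_index is applied to the whole of Cat_df. A minimal sketch of restricting the lookup to one Variable at a time before mapping, assuming the data above (the name cat_values is only illustrative):

```python
import pandas as pd

Cat_df = pd.DataFrame(
    [
        ['Color', 'red', 0.2543],
        ['Color', 'orange', 0.1894],
        ['Color', 'yellow', -0.2836],
        ['Fruit', 'orange', -1.3647],
        ['Fruit', 'banana', 0.3648],
    ],
    columns=['Variable', 'Cats', 'Value'],
)
Actual_df = pd.DataFrame(
    [['yellow', 'banana', '4', '0.5']],
    columns=['Color', 'Fruit', 'Quantity', 'Rating'],
)

cat_values = {}
for variable in ['Color', 'Fruit']:
    # keep only the rows of this variable, so every category label is unique
    lookup = Cat_df.loc[Cat_df['Variable'] == variable].set_index('Cats')['Value']
    cat_values[variable] = Actual_df[variable].map(lookup)

# hypothetical helper frame holding just the categorical lookups
cat_values = pd.DataFrame(cat_values)
```

This only covers the categorical columns; the numeric ranges still need a separate range lookup.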

1 Answer

If you can rely on the ranges in Num_df not overlapping, you can do this as follows. Note that I define some helper functions. You could also do without them, but I think they make the code a bit easier to read.

# convert the datatypes (guess your real data does not store numeric values in strings)
Num_df[['Inclusive', 'Exclusive']] = Num_df[['Inclusive', 'Exclusive']].astype('float32')
Actual_df[['Quantity', 'Rating']] = Actual_df[['Quantity', 'Rating']].astype('float32')

# define two helper functions (or just store the categories / variables in different dataframes)
def get_variable_data(df, variable):
    # category -> value lookup for one variable, indexed by category
    df = df.loc[df['Variable'] == variable, ['Cats', 'Value']].copy()
    df.set_index(['Cats'], inplace=True)
    df.columns = [variable + '_value']
    return df

def get_num_data(df, variable):
    # lower bound and value of each range of one variable, sorted for merge_asof
    df = df.loc[df['Variable'] == variable, ['Inclusive', 'Value']].copy()
    df.sort_values(['Inclusive'], inplace=True)
    df.columns = [variable + '_inclusive', variable + '_value']
    return df

# join the first part by a regular join
Joined_df = Actual_df
for cat in ['Color', 'Fruit']:
    Joined_df = Joined_df.merge(get_variable_data(Cat_df, cat), left_on=[cat], right_index=True, how='left')

# now join according to the ranges using merge_asof
# (the left frame must be sorted by the key column; with more than one row, sort by cat first)
for cat in ['Quantity', 'Rating']:
    Joined_df = pd.merge_asof(Joined_df, get_num_data(Num_df, cat), left_on=cat, right_on=cat + '_inclusive', direction='backward')

# drop the excess columns
Joined_df.drop([col for col in Joined_df if col.endswith('_inclusive')], axis='columns', inplace=True)

# the result of this is
    Color   Fruit  Quantity  Rating  Color_value  Fruit_value  Quantity_value  Rating_value
0  yellow  banana       4.0     0.5      -0.2836       0.3648          0.2145       -0.4378
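
To get from this result to the Value_df layout asked for in the question, one option is to select the '_value' columns and rename them. A minimal sketch, in which Joined_df is rebuilt by hand from the output shown above so that the snippet runs on its own:

```python
import pandas as pd

# Joined_df as printed above (reconstructed here only to keep the example self-contained)
Joined_df = pd.DataFrame({
    'Color': ['yellow'], 'Fruit': ['banana'], 'Quantity': [4.0], 'Rating': [0.5],
    'Color_value': [-0.2836], 'Fruit_value': [0.3648],
    'Quantity_value': [0.2145], 'Rating_value': [-0.4378],
})

variables = ['Color', 'Fruit', 'Quantity', 'Rating']

# keep only the looked-up values and name the columns after the variables
Value_df = Joined_df[[v + '_value' for v in variables]].copy()
Value_df.columns = variables
```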

As written above, the last step with merge_asof assumes that your ranges contain no gaps (stretches for which you have no value) and that they span the whole value range. Because of this, you don't need to check the end of the range. However, if that assumption does not hold, you just have to change the code a bit:

  1. use merge_asof as it is, just alter get_num_data so that it also returns the Exclusive column.

  2. use Joined_df.loc[Joined_df[cat] >= Joined_df[cat + '_exclusive'], cat + '_value'] = defaultvalue to reset the looked-up values of rows whose value lies at or beyond the exclusive bound.
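
The two steps could look roughly like the sketch below. The data is hypothetical and deliberately has a gap (no range covers 5 <= Quantity < 10), and using NaN as the default value is an assumption:

```python
import numpy as np
import pandas as pd

# hypothetical ranges with a gap between 5 and 10
Num_df = pd.DataFrame(
    [['Quantity', -np.inf, 5.0, 0.2145],
     ['Quantity', 10.0, np.inf, -0.5421]],
    columns=['Variable', 'Inclusive', 'Exclusive', 'Value'],
)
Actual_df = pd.DataFrame({'Quantity': [4.0, 7.0, 12.0]})

def get_num_data(df, variable):
    # step 1: same helper as before, but the upper bound is kept as well
    df = df.loc[df['Variable'] == variable, ['Inclusive', 'Exclusive', 'Value']].copy()
    df = df.sort_values('Inclusive')
    df.columns = [variable + '_inclusive', variable + '_exclusive', variable + '_value']
    return df

cat = 'Quantity'
Joined_df = pd.merge_asof(
    Actual_df.sort_values(cat),          # merge_asof needs the left key sorted
    get_num_data(Num_df, cat),
    left_on=cat, right_on=cat + '_inclusive', direction='backward',
)

# step 2: reset the looked-up value where the actual value is not below the exclusive bound
default_value = np.nan                   # assumption: NaN marks "no matching range"
outside = Joined_df[cat] >= Joined_df[cat + '_exclusive']
Joined_df.loc[outside, cat + '_value'] = default_value
```

Here 7.0 is first matched to the range starting at -inf, because that is the largest lower bound not exceeding it, and is then reset because it is not below that range's exclusive bound of 5.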

By the way, this approach is safe: if there is a row whose range contains the value of the cat column, merge_asof will select it, because it searches for the largest available Inclusive value that is less than or equal to the value in cat (at least as long as the ranges don't overlap, and overlapping ranges seem unlikely for a setup like the one in your example).

2 Comments

Will converting the Dataframes to nested dictionary help?
Help with what? Pandas is pretty good for bigger datasets, and the only point above where it gets a bit tricky is the one with the ranges. You couldn't use dictionaries for something like this efficiently; you would start implementing loops.
