
Hi, I have a pandas DataFrame which contains dates and amounts.

       Date  Amount
0  10/02/22    1600
1  10/02/22     150
2  11/02/22     100
3  11/02/22     800
4  11/02/22     125
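
For reference, the frame can be reproduced with (assuming the dates are day-first strings, matching the answers below):

import pandas as pd

df = pd.DataFrame({
    'Date': ['10/02/22', '10/02/22', '11/02/22', '11/02/22', '11/02/22'],
    'Amount': [1600, 150, 100, 800, 125],
})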

If an entry is dated one day later than another entry and its amount is less than 10% of that entry's amount, I would like to add it to the larger amount and keep the earlier date.

So the df would look like:

       Date  Amount
0  10/02/22    1825
1  10/02/22     150
2  11/02/22     800

I've tried creating a threshold and then building groups based on these conditions, but this does not yield the expected results.

threshold_selector = (amount_difference < 0.1) & (date_difference == day)

where day is a timedelta of one day.

groups = threshold_selector.cumsum()
dates = dates.groupby(groups).agg({'Amount': 'sum', 'Date': 'min'})

The result is all rows joined into one.
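
I suspect the issue is how cumsum builds the groups: the counter only increments at each True in the selector, so every row before the first True (or every row, if the selector is all False) ends up in the same group and gets aggregated together. A minimal sketch with a hypothetical selector:

import pandas as pd

s = pd.Series([False, False, True, False, True])  # hypothetical selector
print(s.cumsum().tolist())  # [0, 0, 1, 1, 2] -> rows 0 and 1 share group 0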

  • This can become quite complex when you have a lot of data per day. What should happen if you have, for example, 5 different amounts per day, 2 or 3 of which are within 10% of one or several values of the previous day? Can you provide a second, more complex example? Commented Sep 29, 2022 at 8:56
  • I am going on the assumption that there is only ever one large value per day, and any number of small values would be added, but there would not be enough to make the previous day's value 10x larger than the next day's large value. Commented Sep 29, 2022 at 9:00

2 Answers


Here is an alternative using a groupby approach:

# ensure datetime
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

# group Amounts by Date
g = df.groupby('Date')['Amount']

# get max amount per date
date_max = g.max()
# shift the index forward one day, so each date is aligned with the previous day's max
prev_date_max = date_max.shift(1, freq='D').reindex(date_max.index, fill_value=0)

# flag rows whose amount is at most 10% of the previous day's max (merged, then dropped)
mask = df['Amount'].div(df['Date'].map(prev_date_max)).le(0.1)

# sum the flagged amounts per day, then shift back one day
# so each sum lands on the date it should be added to
val_to_add = (df['Amount'][mask]
                 .groupby(df['Date']).sum()
                 .shift(-1, freq='D')
              )

# add the sums to each date's maximum row only
df['Amount'] += df['Date'].map(val_to_add).where(df.index.isin(g.idxmax())).fillna(0)

# drop the merged rows
df = df.loc[~mask]

output:

        Date  Amount
0 2022-02-10  1825.0
1 2022-02-10   150.0
3 2022-02-11   800.0
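
For the sample frame, the two shifted intermediates work out as follows:

print(prev_date_max)
# Date
# 2022-02-10       0    <- no previous day, filled with 0
# 2022-02-11    1600    <- max of 2022-02-10, moved forward one day
print(val_to_add)
# Date
# 2022-02-10    225     <- 100 + 125 from 2022-02-11, moved back one day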


I would approach this using a pivot.

Sort the values by descending amount and pivot so that each day's largest value sits in the first column. Then find the values lower than or equal to 10% of the previous day's maximum, add them to that maximum, and mask them. Finally, reshape back to the original format:

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df = df.sort_values(by=['Date', 'Amount'], ascending=[True, False])

# pivot to have col0 with the largest value per day
df2 = (df
 .assign(col=df.groupby('Date').cumcount())
 .pivot(index='Date', columns='col', values='Amount')
)

# flag values lower than or equal to 10% of the previous day's max
mask = df2.div(df2[0].shift(1, freq='D'), axis=0).le(0.1).reindex_like(df2)

# add the flagged values' sum to the previous day's max (column 0)
df2[0] += df2.where(mask).sum(axis=1).shift(-1, 'D').reindex(mask.index, fill_value=0)

# mask them
df2 = df2.mask(mask)

# reshape back dropping the NaNs
df2 = df2.stack().droplevel('col').reset_index(name='Amount')

output:

        Date  Amount
0 2022-02-10  1825.0
1 2022-02-10   150.0
2 2022-02-11   800.0
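
For intuition, df2 right after the pivot looks like this for the sample data:

print(df2)
# col            0      1      2
# Date
# 2022-02-10  1600.0  150.0    NaN
# 2022-02-11   800.0  125.0  100.0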
