
Hi, I have a pandas DataFrame which contains dates and amounts.

       Date  Amount
0  10/02/22    1600
1  10/02/22     150
2  11/02/22     100
3  11/02/22     800
4  11/02/22     125
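
For reference, the frame can be reproduced with (assuming the dates are day-first strings, matching the answers below):

import pandas as pd

df = pd.DataFrame({
    'Date': ['10/02/22', '10/02/22', '11/02/22', '11/02/22', '11/02/22'],
    'Amount': [1600, 150, 100, 800, 125],
})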

If an entry is dated one day later than another entry and its amount is less than 10% of that entry's amount, I would like to add it to the larger amount and keep the earlier date.

So the df would look like:

       Date  Amount
0  10/02/22    1825
1  10/02/22     150
2  11/02/22     800

I've tried creating a threshold and then building groups based on these conditions, but this does not yield the expected results.

threshold_selector = (amount_difference < 0.1) & (date_difference == day)

where day is a timedelta of one day.

groups = threshold_selector.cumsum()
dates = dates.groupby(groups).agg({'Amount': 'sum', 'Date': 'min'})

The result is all rows joined into one.
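
I suspect the issue is how cumsum builds the groups: the counter only increments at each True in the selector, so every row before the first True (or every row, if the selector is all False) ends up in the same group and gets aggregated together. A minimal sketch with a hypothetical selector:

import pandas as pd

s = pd.Series([False, False, True, False, True])  # hypothetical selector
print(s.cumsum().tolist())  # [0, 0, 1, 1, 2] -> rows 0 and 1 share group 0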

  • This can become quite complex when you have a lot of data per day. What should happen if you have, for example, 5 different amounts per day, 2 or 3 of which are within 10% of one or several values of the previous day? Can you provide a second, more complex example? Commented Sep 29, 2022 at 8:56
  • I am going on the assumption that there is only ever one large value per day, and any number of small values would be added, but there would not be enough to make the previous day's value 10x larger than the next day's large value. Commented Sep 29, 2022 at 9:00

2 Answers


Here is an alternative using a groupby approach:

# ensure datetime
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

# group Amounts by Date
g = df.groupby('Date')['Amount']

# get max amount per date
date_max = g.max()
# shift the index forward one day, so each date is aligned with the previous day's max
prev_date_max = date_max.shift(1, freq='D').reindex(date_max.index, fill_value=0)

# flag rows whose amount is at most 10% of the previous day's max (merged, then dropped)
mask = df['Amount'].div(df['Date'].map(prev_date_max)).le(0.1)

# sum the flagged amounts per day, then shift back one day
# so each sum lands on the date it should be added to
val_to_add = (df['Amount'][mask]
                 .groupby(df['Date']).sum()
                 .shift(-1, freq='D')
              )

# add the sums to each date's maximum row only
df['Amount'] += df['Date'].map(val_to_add).where(df.index.isin(g.idxmax())).fillna(0)

# drop the merged rows
df = df.loc[~mask]

output:

        Date  Amount
0 2022-02-10  1825.0
1 2022-02-10   150.0
3 2022-02-11   800.0
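
For the sample frame, the two shifted intermediates work out as follows:

print(prev_date_max)
# Date
# 2022-02-10       0    <- no previous day, filled with 0
# 2022-02-11    1600    <- max of 2022-02-10, moved forward one day
print(val_to_add)
# Date
# 2022-02-10    225     <- 100 + 125 from 2022-02-11, moved back one day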


I would approach this using a pivot.

Sort the values by descending amount and pivot so that each day's largest value sits in the first column. Then find the values lower than or equal to 10% of the previous day's maximum, add them to that maximum, and mask them. Finally, reshape back to the original format:

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df = df.sort_values(by=['Date', 'Amount'], ascending=[True, False])

# pivot to have col0 with the largest value per day
df2 = (df
 .assign(col=df.groupby('Date').cumcount())
 .pivot(index='Date', columns='col', values='Amount')
)

# flag values lower than or equal to 10% of the previous day's max
mask = df2.div(df2[0].shift(1, freq='D'), axis=0).le(0.1).reindex_like(df2)

# add the flagged values' sum to the previous day's max (column 0)
df2[0] += df2.where(mask).sum(axis=1).shift(-1, 'D').reindex(mask.index, fill_value=0)

# mask them
df2 = df2.mask(mask)

# reshape back dropping the NaNs
df2 = df2.stack().droplevel('col').reset_index(name='Amount')

output:

        Date  Amount
0 2022-02-10  1825.0
1 2022-02-10   150.0
2 2022-02-11   800.0
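
For intuition, df2 right after the pivot looks like this for the sample data:

print(df2)
# col            0      1      2
# Date
# 2022-02-10  1600.0  150.0    NaN
# 2022-02-11   800.0  125.0  100.0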
