1

I have this dataframe:

df[['payout_date','total_value']].head(10)

    payout_date         total_value
0   2017-02-14T11:00:06  177.313
1   2017-02-14T11:00:06  0.000
2   2017-02-01T00:00:00  0.000
3   2017-02-14T11:00:06  47.392
4   2017-02-14T11:00:06  16.254
5   2017-02-14T11:00:06  125.818
6   2017-02-14T11:00:06  0.000
7   2017-02-14T11:00:06  0.000
8   2017-02-14T11:00:06  0.000
9   2017-02-14T11:00:06  0.000

I am using the code below to plot the sum of total_value per day (and per month) within a specific date range, but it plots one bar for each total_value row instead of aggregating total_value by day.

(df.set_index('payout_date')
                    .loc['2018-02-01':'2018-02-02']
                    .groupby('payout_date')
                    .agg(['sum'])
                    .reset_index()
                    .plot(x='payout_date', y='total_value',kind="bar"))
plt.show()

The data is not aggregated; I get a bar for each value in df:

[image: bar chart with one bar per row of df]

How can I aggregate total_value by day and by month?

I tried the answers from this and a couple of other similar questions, but none of them worked for the date format used here.

I also tried adding .dt.to_period('M') to the code, but I get the error TypeError: Empty 'DataFrame': no numeric data to plot.

4
  • What happens if you remove .loc['2018-02-01':'2018-02-02']? Commented May 30, 2018 at 20:11
  • @Joe I tried that. It runs for an extremely long time (more than 30 minutes, at which point I stop the script) because the dataframe is very big, so I have to choose a specific interval. Commented May 30, 2018 at 20:14
  • With a few rows, it plots the aggregate correctly without .loc. Try that as well; if it works, make the selection before the code you posted. Commented May 30, 2018 at 20:15
  • @Joe I did new_df=df.set_index('payout_date').loc['2018-02-01':'2018-02-02'] and then tried to use the initial code without .loc, but I am getting a KeyError: 'payout_date' error. So there is no 'payout_date' column in new_df? Why? How do I make the selection properly before the original code? Thanks. Commented May 30, 2018 at 20:22
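
A minimal sketch of what is probably going on in the last comment, assuming the initial code (which itself calls set_index('payout_date')) was run on new_df: set_index moves the column into the index, so a second set_index on the same name has no column left to find. The three-row frame below is made up for illustration.

```python
import pandas as pd

# Made-up three-row frame, for illustration only
df = pd.DataFrame({
    'payout_date': ['2017-02-01T00:00:00', '2017-02-14T11:00:06', '2017-02-14T11:00:06'],
    'total_value': [0.0, 47.392, 16.254],
})

# set_index moves payout_date out of the columns and into the index
new_df = df.set_index('payout_date')
print(list(new_df.columns))    # only total_value is left as a column
print(new_df.index.name)       # the index now carries the name payout_date

# Calling set_index again looks for a column that is no longer there
try:
    new_df.set_index('payout_date')
    raised = False
except KeyError:
    raised = True
print(raised)

# reset_index turns the index back into an ordinary column
restored = new_df.reset_index()
print(list(restored.columns))
```

Under that assumption, either drop the second set_index call, or filter the rows first and keep payout_date as an ordinary column.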

2 Answers

3

Setup

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'payout_date': [
        '2017-02-01T11:00:06', '2017-02-01T11:00:06', '2017-02-02T00:00:00',
        '2017-02-14T11:00:06', '2017-02-14T11:00:06', '2017-02-15T11:00:06',
        '2017-02-15T11:00:06', '2017-02-16T11:00:06', '2017-02-16T11:00:06',
        '2017-02-16T11:00:06',
    ],
    'total_value': [177.313, 22.0, 25.0, 47.392, 16.254,
                    125.818, 85.0, 42.0, 22.0, 19.0],
})

Use normalize to group by day only, ignoring the time of day:

df.groupby(pd.DatetimeIndex(df.payout_date).normalize())['total_value'].sum().reset_index()

  payout_date  total_value
0  2017-02-01      199.313
1  2017-02-02       25.000
2  2017-02-14       63.646
3  2017-02-15      210.818
4  2017-02-16       83.000
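
To see why this groups by day: normalize sets the time of day of every timestamp to midnight, so all payouts on the same date end up with the same key. A small sketch with made-up timestamps (the names stamps and days are only for illustration):

```python
import pandas as pd

# Made-up timestamps: two on 14 February, one at midnight on 15 February
stamps = pd.DatetimeIndex([
    '2017-02-14T11:00:06',
    '2017-02-14T23:59:59',
    '2017-02-15T00:00:00',
])

# normalize() keeps the date and resets the time component to 00:00:00
days = stamps.normalize()
print(days)

# The two timestamps from 14 February now collapse into a single key
print(days.nunique())
```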

Extend the previous command to plot:

# Select total_value before summing so the string column is not aggregated;
# the formatted dates become the bar labels.
(df.groupby(
    pd.DatetimeIndex(df.payout_date)
    .normalize().strftime('%Y-%m-%d'))['total_value']
    .sum()
    .plot(kind='bar'))

plt.tight_layout()
plt.show()

Output for my sample data:

[image: bar chart of total_value summed per day for the sample data]

If you want to apply this on a subset, you can do something like the following:

tmp = df.loc[(df.payout_date > '2017-02-01') & (df.payout_date < '2017-02-15')]

tmp.groupby(
    pd.DatetimeIndex(tmp.payout_date)                     \
    .normalize().strftime('%Y-%m-%d'))['total_value']     \
    .agg(['sum'])

# Result
                sum
2017-02-01  199.313
2017-02-02   25.000
2017-02-14   63.646

This sums only the rows in your desired range.
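
The question also asks for a per-month sum, which the steps above do not cover. One possible sketch, assuming payout_date holds ISO-formatted strings as in the setup, is to parse the column with pd.to_datetime and group on dt.to_period('M'), selecting total_value before summing so that only the numeric column is aggregated. The four-row frame and the names dates and monthly are made up for illustration.

```python
import pandas as pd

# Small stand-in frame spanning two months (made up for illustration)
df = pd.DataFrame({
    'payout_date': ['2017-01-30T09:00:00', '2017-02-01T11:00:06',
                    '2017-02-14T11:00:06', '2017-02-16T11:00:06'],
    'total_value': [10.0, 20.0, 30.0, 40.0],
})

# Parse the ISO strings once so the .dt accessor is available
dates = pd.to_datetime(df['payout_date'])

# Group on the calendar month and sum only the numeric column
monthly = df.groupby(dates.dt.to_period('M'))['total_value'].sum()
print(monthly)
```

Calling monthly.plot(kind='bar') followed by plt.show() should then draw one bar per month. On a very large frame it may help to filter to the wanted date range first, as in the subset example.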

0

Try it this way:

df = df.iloc[1:7]
(df.set_index('payout_date')
                .groupby('payout_date')
                .agg(['sum'])
                .reset_index()
                .plot(x='payout_date', y='total_value',kind="bar"))
plt.show()

Where the index are selected before

2 Comments

I got a very strange result. What does 'index are selected before' mean? How do I select the index? Also, what does df.iloc[1:7] stand for? What is its purpose?
@user40 With df = df.iloc[1:7] you select rows by position, in this case positions 1 through 6 (the end of the slice is excluded). If, for example, you want the first 1000 rows, change it to df.iloc[0:1000].
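
As a small illustration of the iloc slicing discussed in these comments: iloc selects rows by integer position, and the end of the slice is excluded. The ten-row frame below reuses the total_value column from the question's sample; the names subset and first_rows are only for illustration.

```python
import pandas as pd

# Ten rows, mirroring the total_value column shown in the question
df = pd.DataFrame({'total_value': [177.313, 0.0, 0.0, 47.392, 16.254,
                                   125.818, 0.0, 0.0, 0.0, 0.0]})

# Positions 1 up to (but not including) 7, which gives six rows
subset = df.iloc[1:7]
print(subset.index.tolist())

# Asking for more rows than exist simply returns what is available
first_rows = df.iloc[0:1000]
print(len(first_rows))
```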
