
Given the following df:

datetimeindex        store  sale   category  weekday
2018-10-13 09:27:01  gbn01  59.99  sporting  1
2018-10-13 09:27:01  gbn02  19.99  sporting  1
2018-10-13 09:27:02  gbn03  15.99  hygine    1
2018-10-13 09:27:03  gbn05  39.99  camping   1
....
2018-10-16 11:59:01  gbn01  19.99  other     0
2018-10-16 11:59:01  gbn02  49.99  sporting  0
2018-10-16 11:59:02  gbn03  10.00  food      0
2018-10-16 11:59:03  gbn05  89.99  electro   0
2018-10-16 12:30:03  gbn01  52.99
....
2018-10-16 21:05:03  gbn03  25.00  alcohol   0
2018-10-16 22:43:03  gbn01  10.05  health    0

Update

After re-reading the requirements, it looks like mean_sales should be calculated per timestamp, per store, within the relevant period (08:00 to 18:00, or 12:00 to 13:30). My current thinking is to implement the pseudocode below, but it would currently only work if the data were ordered by datetimeindex, store:

# Lunch_Time_Mean: running mean of sales inside the lunch window,
# reset whenever a row falls outside the window
count = 0
Lunch_Sum_Previous = 0
for r in df:
    if LunchHours and WeekDay:
        count += 1
        if count == 1:
            r.Lunch_Mean = r.sale
            Lunch_Sum_Previous = r.sale
        else:
            r.Lunch_Mean = (Lunch_Sum_Previous + r.sale) / count
            Lunch_Sum_Previous += r.sale
    else:
        r.Lunch_Mean = 1
        count = 0
        Lunch_Sum_Previous = 0
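The pseudocode above can be sketched in pandas without an explicit loop (a sketch only, using column names assumed from the sample df; the `(mask != mask.shift()).cumsum()` trick labels each contiguous in-window run so the mean resets when a row falls outside it):

```python
import numpy as np
import pandas as pd
from datetime import time

# Hypothetical mini-frame mirroring the 16/10/2018 rows of the table below
df = pd.DataFrame({
    "datetimeindex": pd.to_datetime([
        "2018-10-16 07:27", "2018-10-16 08:27", "2018-10-16 09:27",
        "2018-10-16 10:27", "2018-10-16 18:27"]),
    "store": "gbn01",
    "sale": [13.34, 15.84, 19.14, 11.64, 10.86],
})

tod = df["datetimeindex"].dt.time
working = ((tod >= time(8, 0)) & (tod <= time(18, 0))
           & (df["datetimeindex"].dt.weekday < 5))       # Mon-Fri

# label contiguous in/out-of-window runs, so the mean resets on leaving the window
run = (working != working.shift()).cumsum()
expanding_mean = df.groupby(["store", run])["sale"].transform(
    lambda s: s.expanding().mean())
df["working_hour_mean_sales"] = np.where(working, expanding_mean, 1)
```

The same shape works for the lunch window by swapping the time bounds.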

Above Logic mapped to a table:

datetimeindex       store    IsWorkingHour    count    sales    working_hour_sum    working_hour_cumsum    working_hour_mean_sales
13/10/2018 07:27    gbn01    0                0        39.18    0                   0                      1
13/10/2018 08:27    gbn01    1                1        31.69    31.69               31.69                  1
13/10/2018 09:27    gbn01    1                2        99.19    99.19               130.88                 1
13/10/2018 10:27    gbn01    1                3        25.89    25.89               156.77                 1
13/10/2018 11:27    gbn01    1                4        19.10    19.10               175.87                 1
13/10/2018 12:27    gbn01    1                5        82.51    82.51               258.38                 1
13/10/2018 13:27    gbn01    1                6        10.82    10.82               269.2                  1
13/10/2018 14:27    gbn01    1                7        10.43    10.43               279.63                 1
13/10/2018 15:27    gbn01    1                8        15.83    15.83               295.46                 1
13/10/2018 16:27    gbn01    1                9        12.53    12.53               307.99                 1
13/10/2018 17:27    gbn01    1                10       10.03    10.03               318.02                 1
13/10/2018 18:27    gbn01    0                0        54.14    0                   0                      1
13/10/2018 19:27    gbn01    0                0        20.04    0                   0                      1
#The entries above have working_hour_mean_sales of 1 because 13/10/2018 falls on a weekend.
16/10/2018 07:27    gbn01    0                0        13.34    0                   0                      1
16/10/2018 08:27    gbn01    1                1        15.84    15.84               15.84                  15.84
16/10/2018 09:27    gbn01    1                2        19.14    19.14               34.98                  17.49
16/10/2018 10:27    gbn01    1                3        11.64    11.64               46.62                  15.54
16/10/2018 11:27    gbn01    1                4        17.54    17.54               64.16                  16.04
16/10/2018 12:27    gbn01    1                5        20.84    20.84               85                     17
16/10/2018 13:27    gbn01    1                6        50.05    50.05               135.05                 22.51
16/10/2018 14:27    gbn01    1                7        10.05    10.05               145.1                  20.73
16/10/2018 15:27    gbn01    1                8        13.35    13.35               158.45                 19.81
16/10/2018 16:27    gbn01    1                9        32.55    32.55               191                    21.22
16/10/2018 17:27    gbn01    1                10       13.36    13.36               204.36                 20.44
16/10/2018 18:27    gbn01    0                0        10.86    0                   0                      1
16/10/2018 19:27    gbn01    0                0        20.06    0                   0                      1

Desired Output

I'm attempting to use the above to generate a new df that looks like the below:

#I've simplified it to a single condition and store
datetimeindex       store    working_hour_mean_sales
13/10/2018 07:27    gbn01    1
13/10/2018 08:27    gbn01    1
13/10/2018 09:27    gbn01    1
13/10/2018 10:27    gbn01    1
13/10/2018 11:27    gbn01    1
13/10/2018 12:27    gbn01    1
13/10/2018 13:27    gbn01    1
13/10/2018 14:27    gbn01    1
13/10/2018 15:27    gbn01    1
13/10/2018 16:27    gbn01    1
13/10/2018 17:27    gbn01    1
13/10/2018 18:27    gbn01    1
13/10/2018 19:27    gbn01    1
#Above working_hour_mean_sales=1 because 13/10/2018 was a weekend
16/10/2018 07:27    gbn01    1
16/10/2018 08:27    gbn01    15.84
16/10/2018 09:27    gbn01    17.49
16/10/2018 10:27    gbn01    15.54
16/10/2018 11:27    gbn01    16.04
16/10/2018 12:27    gbn01    17
16/10/2018 13:27    gbn01    22.51
16/10/2018 14:27    gbn01    20.73
16/10/2018 15:27    gbn01    19.81
16/10/2018 16:27    gbn01    21.22
16/10/2018 17:27    gbn01    20.44
16/10/2018 18:27    gbn01    1
16/10/2018 19:27    gbn01    1

Where "working hours" are 08:00-18:00 Mon-Fri and "weekday lunch peak" is 12:00-13:30.

(N.B. the counter-intuitive convention (at least to me) that weekday=0 means Mon-Fri was not my decision.)
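For reference, the two windows translate into boolean masks like this (a sketch using `datetime.time` comparisons; `weekday` follows the 0 = Mon-Fri convention just mentioned):

```python
from datetime import time
import pandas as pd

# Hypothetical sample timestamps: two on Tue 16/10, one on Sat 13/10
idx = pd.Series(pd.to_datetime(
    ["2018-10-16 12:30", "2018-10-16 14:00", "2018-10-13 12:30"]))
weekday = pd.Series([0, 0, 1])      # 0 = Mon-Fri under the convention above

tod = idx.dt.time
is_working_hours = (tod >= time(8, 0)) & (tod <= time(18, 0)) & (weekday == 0)
is_lunch_peak = (tod >= time(12, 0)) & (tod <= time(13, 30)) & (weekday == 0)
```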

Any suggestions how to implement this into pandas would be greatly appreciated!

  • Could you also include your desired output in your post? Commented Oct 17, 2018 at 0:29
  • The desired output is the second df, I've changed the wording to make that clearer. Commented Oct 17, 2018 at 9:05
  • Please check your dataframes, the output does not make sense. Commented Oct 17, 2018 at 15:48
  • I've simplified the example to a single condition and store. I've mapped the logic to a table, I'm now trying to figure out how to get the desired output using python. Hopefully it makes sense now. Commented Oct 19, 2018 at 8:02

3 Answers


You can use groupby(), agg() and between().

This will aggregate the results for week day lunch peaks Mon-Fri:

df[(df['datetimeindex'].dt.strftime('%H:%M:%S').between('12:00:00', '13:30:00'))
   & (df['weekday'] == 0)].groupby(['store', 'category']).agg({'sale': 'mean'})

And this will aggregate the results for working hours Mon-Fri:

df[(df['datetimeindex'].dt.strftime('%H:%M:%S').between('08:00:00', '18:00:00'))
   & (df['weekday'] == 0)].groupby(['store', 'category']).agg({'sale': 'mean'})
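The string comparison works because zero-padded `HH:MM:SS` strings sort lexicographically; comparing `datetime.time` objects directly avoids the formatting step. A sketch of the lunch-peak filter under that approach, assuming `datetimeindex` is a datetime64 column:

```python
from datetime import time
import pandas as pd

# Hypothetical two-row frame: one sale inside the lunch window, one outside
df = pd.DataFrame({
    "datetimeindex": pd.to_datetime(["2018-10-16 12:15:00", "2018-10-16 14:00:00"]),
    "store": ["gbn01", "gbn01"],
    "category": ["food", "food"],
    "sale": [20.0, 30.0],
    "weekday": [0, 0],
})

# Series.between is inclusive on both ends by default
lunch = df["datetimeindex"].dt.time.between(time(12, 0), time(13, 30))
out = df[lunch & (df["weekday"] == 0)].groupby(["store", "category"]).agg({"sale": "mean"})
```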

1 Comment

Your comment helped me figure out the between times. I've re-read the reqs and think I misunderstood. I think I'll have to define a function. I've updated the question with my draft pseudo for looping the df per store.

Try separating your data into batches, then sum everything you need for each batch. At the end, join the results, divide by the number of entries, and put the results in the columns you need.

You can batch the data in a number of ways, but given your example I suggest grouping it by category, calculating everything for each category, and then joining the results into the final table.
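A minimal sketch of that batching idea (hypothetical column names; `groupby` yields one batch per category, and the per-batch sum divided by the count reproduces the mean):

```python
import pandas as pd

df = pd.DataFrame({"category": ["food", "food", "sporting"],
                   "sale": [10.0, 20.0, 19.99]})

sums, counts = {}, {}
for cat, batch in df.groupby("category"):   # one batch per category
    sums[cat] = batch["sale"].sum()
    counts[cat] = len(batch)
means = pd.Series({c: sums[c] / counts[c] for c in sums})
# equivalent one-liner: df.groupby("category")["sale"].mean()
```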

I hope this helps you :)

1 Comment

Is there something dodgy about the operation I'm attempting? I've checked df.info(memory_usage='deep') and it's only 57MB, I'm not running this on a potato either. I'll look into the batch processing, thanks.

This should guide you through the logic you need. Basically, you define two new columns, workinghours and weekdaylunchpeak, and use SQL code to aggregate (there are other methods).

import pandasql as ps
import pandas as pd
import numpy as np
from datetime import time

mydata = pd.DataFrame(data={'datetimeindex': ['13/10/2018 09:27:01','13/10/2018 09:27:02','13/10/2018 09:27:03','13/10/2018 09:27:04','16/10/2018 11:59:01','16/10/2018 11:59:02','16/10/2018 11:59:03','16/10/2018 11:59:04','16/10/2018 21:05:01','16/10/2018 22:43:01'],
                       'store': ['gbn01','gbn02','gbn03','gbn05','gbn01','gbn02','gbn03','gbn05','gbn03','gbn01'],                        
                       'sale': [59.99,19.99,15.99,39.99,19.99,49.99,10,89.99,25,10.05],
                       'category': ['sporting','sporting','hygine','camping','other','sporting','food','electro','alcohol','health'],
                       'weekday': [1,1,1,1,0,0,0,0,0,0] 
                       })

mydata['datetimeindex'] = pd.to_datetime(mydata['datetimeindex'], dayfirst=True)  # dates are dd/mm/yyyy
mydata['workinghours'] = (
    np.where((mydata.datetimeindex.dt.time >= time(8, 0))
             & (mydata.datetimeindex.dt.time <= time(18, 0))
             & (mydata.weekday == 0),
             1, 0))
mydata['weekdaylunchpeak'] = (
    np.where((mydata.datetimeindex.dt.time >= time(12, 0))
             & (mydata.datetimeindex.dt.time <= time(13, 30))
             & (mydata.weekday == 0),
             1, 0))

sqlcode = '''
SELECT
    store,
    category,
    -- omitting ELSE yields NULL, which AVG ignores, so rows outside
    -- the window do not drag the mean down
    AVG(CASE WHEN workinghours = 1 THEN sale END) AS working_hours_mean_sales,
    AVG(CASE WHEN weekdaylunchpeak = 1 THEN sale END) AS weekday_lunch_peak_mean_sales
FROM mydata
GROUP BY
    store,
    category
;
'''
newdf = ps.sqldf(sqlcode,locals()) 
newdf

1 Comment

I've updated the question a few times since this, my fall back position was going to attempt it using SQL as I have more experience with that. Thanks
