
Given the following df:

datetimeindex        store  sale   category  weekday
2018-10-13 09:27:01  gbn01  59.99  sporting  1
2018-10-13 09:27:01  gbn02  19.99  sporting  1
2018-10-13 09:27:02  gbn03  15.99  hygine    1
2018-10-13 09:27:03  gbn05  39.99  camping   1
....
2018-10-16 11:59:01  gbn01  19.99  other     0
2018-10-16 11:59:01  gbn02  49.99  sporting  0
2018-10-16 11:59:02  gbn03  10.00  food      0
2018-10-16 11:59:03  gbn05  89.99  electro   0
2018-10-16 12:30:03  gbn01  52.99
....
2018-10-16 21:05:03  gbn03  25.00  alcohol   0
2018-10-16 22:43:03  gbn01  10.05  health    0

Update

After re-reading the requirements, it looks like mean_sales should be calculated per timestamp, per store, within the relevant period (08:00 to 18:00, or 12:00 to 13:30). My current thinking is to implement the pseudocode below, but it would currently only work if the data were ordered by datetimeindex, store:

# Lunch_Time_Mean: running mean of sales inside the lunch window,
# reset whenever a row falls outside the window
count = 0
Lunch_Sum_Previous = 0
for r in df:
    if LunchHours and WeekDay:
        count += 1
        if count == 1:
            r.Lunch_Mean = r.sale
            Lunch_Sum_Previous = r.sale
        else:
            r.Lunch_Mean = (Lunch_Sum_Previous + r.sale) / count
            Lunch_Sum_Previous += r.sale
    else:
        r.Lunch_Mean = 1
        count = 0
        Lunch_Sum_Previous = 0
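The pseudocode above can be sketched in pandas without an explicit loop (a sketch only, using column names assumed from the sample df; the `(mask != mask.shift()).cumsum()` trick labels each contiguous in-window run so the mean resets when a row falls outside it):

```python
import numpy as np
import pandas as pd
from datetime import time

# Hypothetical mini-frame mirroring the 16/10/2018 rows of the table below
df = pd.DataFrame({
    "datetimeindex": pd.to_datetime([
        "2018-10-16 07:27", "2018-10-16 08:27", "2018-10-16 09:27",
        "2018-10-16 10:27", "2018-10-16 18:27"]),
    "store": "gbn01",
    "sale": [13.34, 15.84, 19.14, 11.64, 10.86],
})

tod = df["datetimeindex"].dt.time
working = ((tod >= time(8, 0)) & (tod <= time(18, 0))
           & (df["datetimeindex"].dt.weekday < 5))       # Mon-Fri

# label contiguous in/out-of-window runs, so the mean resets on leaving the window
run = (working != working.shift()).cumsum()
expanding_mean = df.groupby(["store", run])["sale"].transform(
    lambda s: s.expanding().mean())
df["working_hour_mean_sales"] = np.where(working, expanding_mean, 1)
```

The same shape works for the lunch window by swapping the time bounds.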

Above Logic mapped to a table:

datetimeindex       store    IsWorkingHour    count    sales    working_hour_sum    working_hour_cumsum    working_hour_mean_sales
13/10/2018 07:27    gbn01    0                0        39.18    0                   0                      1
13/10/2018 08:27    gbn01    1                1        31.69    31.69               31.69                  1
13/10/2018 09:27    gbn01    1                2        99.19    99.19               130.88                 1
13/10/2018 10:27    gbn01    1                3        25.89    25.89               156.77                 1
13/10/2018 11:27    gbn01    1                4        19.10    19.10               175.87                 1
13/10/2018 12:27    gbn01    1                5        82.51    82.51               258.38                 1
13/10/2018 13:27    gbn01    1                6        10.82    10.82               269.2                  1
13/10/2018 14:27    gbn01    1                7        10.43    10.43               279.63                 1
13/10/2018 15:27    gbn01    1                8        15.83    15.83               295.46                 1
13/10/2018 16:27    gbn01    1                9        12.53    12.53               307.99                 1
13/10/2018 17:27    gbn01    1                10       10.03    10.03               318.02                 1
13/10/2018 18:27    gbn01    0                0        54.14    0                   0                      1
13/10/2018 19:27    gbn01    0                0        20.04    0                   0                      1
#The entries above have working_hour_mean_sales of 1 because 13/10/2018 falls on a weekend.
16/10/2018 07:27    gbn01    0                0        13.34    0                   0                      1
16/10/2018 08:27    gbn01    1                1        15.84    15.84               15.84                  15.84
16/10/2018 09:27    gbn01    1                2        19.14    19.14               34.98                  17.49
16/10/2018 10:27    gbn01    1                3        11.64    11.64               46.62                  15.54
16/10/2018 11:27    gbn01    1                4        17.54    17.54               64.16                  16.04
16/10/2018 12:27    gbn01    1                5        20.84    20.84               85                     17
16/10/2018 13:27    gbn01    1                6        50.05    50.05               135.05                 22.51
16/10/2018 14:27    gbn01    1                7        10.05    10.05               145.1                  20.73
16/10/2018 15:27    gbn01    1                8        13.35    13.35               158.45                 19.81
16/10/2018 16:27    gbn01    1                9        32.55    32.55               191                    21.22
16/10/2018 17:27    gbn01    1                10       13.36    13.36               204.36                 20.44
16/10/2018 18:27    gbn01    0                0        10.86    0                   0                      1
16/10/2018 19:27    gbn01    0                0        20.06    0                   0                      1

Desired Output

I'm attempting to use the above to generate a new df that looks like the below:

#I've simplified it to a single condition and store
datetimeindex       store    working_hour_mean_sales
13/10/2018 07:27    gbn01    1
13/10/2018 08:27    gbn01    1
13/10/2018 09:27    gbn01    1
13/10/2018 10:27    gbn01    1
13/10/2018 11:27    gbn01    1
13/10/2018 12:27    gbn01    1
13/10/2018 13:27    gbn01    1
13/10/2018 14:27    gbn01    1
13/10/2018 15:27    gbn01    1
13/10/2018 16:27    gbn01    1
13/10/2018 17:27    gbn01    1
13/10/2018 18:27    gbn01    1
13/10/2018 19:27    gbn01    1
#Above working_hour_mean_sales=1 because 13/10/2018 was a weekend
16/10/2018 07:27    gbn01    1
16/10/2018 08:27    gbn01    15.84
16/10/2018 09:27    gbn01    17.49
16/10/2018 10:27    gbn01    15.54
16/10/2018 11:27    gbn01    16.04
16/10/2018 12:27    gbn01    17
16/10/2018 13:27    gbn01    22.51
16/10/2018 14:27    gbn01    20.73
16/10/2018 15:27    gbn01    19.81
16/10/2018 16:27    gbn01    21.22
16/10/2018 17:27    gbn01    20.44
16/10/2018 18:27    gbn01    1
16/10/2018 19:27    gbn01    1

Where "working hours" are 08:00-18:00 Mon-Fri and "weekday lunch peak" is 12:00-13:30.

(N.B. the counter-intuitive convention (at least to me) that weekday=0 means Mon-Fri was not my decision.)
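For reference, the two windows translate into boolean masks like this (a sketch using `datetime.time` comparisons; `weekday` follows the 0 = Mon-Fri convention just mentioned):

```python
from datetime import time
import pandas as pd

# Hypothetical sample timestamps: two on Tue 16/10, one on Sat 13/10
idx = pd.Series(pd.to_datetime(
    ["2018-10-16 12:30", "2018-10-16 14:00", "2018-10-13 12:30"]))
weekday = pd.Series([0, 0, 1])      # 0 = Mon-Fri under the convention above

tod = idx.dt.time
is_working_hours = (tod >= time(8, 0)) & (tod <= time(18, 0)) & (weekday == 0)
is_lunch_peak = (tod >= time(12, 0)) & (tod <= time(13, 30)) & (weekday == 0)
```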

Any suggestions how to implement this into pandas would be greatly appreciated!

  • Could you also include your desired output in your post? Commented Oct 17, 2018 at 0:29
  • The desired output is the second df, I've changed the wording to make that clearer. Commented Oct 17, 2018 at 9:05
  • Please check your dataframes, the output does not make sense. Commented Oct 17, 2018 at 15:48
  • I've simplified the example to a single condition and store. I've mapped the logic to a table, I'm now trying to figure out how to get the desired output using python. Hopefully it makes sense now. Commented Oct 19, 2018 at 8:02

3 Answers


You can use groupby(), agg() and between().

This will aggregate the results for week day lunch peaks Mon-Fri:

df[(df['datetimeindex'].dt.strftime('%H:%M:%S').between('12:00:00', '13:30:00'))
   & (df['weekday'] == 0)].groupby(['store', 'category']).agg({'sale': 'mean'})

And this will aggregate the results for working hours Mon-Fri:

df[(df['datetimeindex'].dt.strftime('%H:%M:%S').between('08:00:00', '18:00:00'))
   & (df['weekday'] == 0)].groupby(['store', 'category']).agg({'sale': 'mean'})
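The string comparison works because zero-padded `HH:MM:SS` strings sort lexicographically; comparing `datetime.time` objects directly avoids the formatting step. A sketch of the lunch-peak filter under that approach, assuming `datetimeindex` is a datetime64 column:

```python
from datetime import time
import pandas as pd

# Hypothetical two-row frame: one sale inside the lunch window, one outside
df = pd.DataFrame({
    "datetimeindex": pd.to_datetime(["2018-10-16 12:15:00", "2018-10-16 14:00:00"]),
    "store": ["gbn01", "gbn01"],
    "category": ["food", "food"],
    "sale": [20.0, 30.0],
    "weekday": [0, 0],
})

# Series.between is inclusive on both ends by default
lunch = df["datetimeindex"].dt.time.between(time(12, 0), time(13, 30))
out = df[lunch & (df["weekday"] == 0)].groupby(["store", "category"]).agg({"sale": "mean"})
```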

1 Comment

Your comment helped me figure out the between times. I've re-read the reqs and think I misunderstood. I think I'll have to define a function. I've updated the question with my draft pseudo for looping the df per store.

Try separating your data into batches, then sum everything you need for each batch. At the end, join the results, divide by the number of entries, and put the results in the columns you need.

You can batch the data in a number of ways, but given your example I suggest grouping it by category, calculating everything for each category, and then joining the results into the final table.
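A minimal sketch of that batching idea (hypothetical column names; `groupby` yields one batch per category, and the per-batch sum divided by the count reproduces the mean):

```python
import pandas as pd

df = pd.DataFrame({"category": ["food", "food", "sporting"],
                   "sale": [10.0, 20.0, 19.99]})

sums, counts = {}, {}
for cat, batch in df.groupby("category"):   # one batch per category
    sums[cat] = batch["sale"].sum()
    counts[cat] = len(batch)
means = pd.Series({c: sums[c] / counts[c] for c in sums})
# equivalent one-liner: df.groupby("category")["sale"].mean()
```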

I hope this helps you :)

1 Comment

Is there something dodgy about the operation I'm attempting? I've checked df.info(memory_usage='deep') and it's only 57MB, I'm not running this on a potato either. I'll look into the batch processing, thanks.

This should guide you through the logic you need. Basically, you define two new columns, workinghours and weekdaylunchpeak, and use SQL code to aggregate (there are other methods).

import pandasql as ps
import pandas as pd
import numpy as np
from datetime import time

mydata = pd.DataFrame(data={'datetimeindex': ['13/10/2018 09:27:01','13/10/2018 09:27:02','13/10/2018 09:27:03','13/10/2018 09:27:04','16/10/2018 11:59:01','16/10/2018 11:59:02','16/10/2018 11:59:03','16/10/2018 11:59:04','16/10/2018 21:05:01','16/10/2018 22:43:01'],
                       'store': ['gbn01','gbn02','gbn03','gbn05','gbn01','gbn02','gbn03','gbn05','gbn03','gbn01'],                        
                       'sale': [59.99,19.99,15.99,39.99,19.99,49.99,10,89.99,25,10.05],
                       'category': ['sporting','sporting','hygine','camping','other','sporting','food','electro','alcohol','health'],
                       'weekday': [1,1,1,1,0,0,0,0,0,0] 
                       })

mydata['datetimeindex'] = pd.to_datetime(mydata['datetimeindex'], dayfirst=True)  # dates are dd/mm/yyyy
mydata['workinghours'] = (
    np.where((mydata.datetimeindex.dt.time >= time(8, 0))
             & (mydata.datetimeindex.dt.time <= time(18, 0))
             & (mydata.weekday == 0),
             1, 0))
mydata['weekdaylunchpeak'] = (
    np.where((mydata.datetimeindex.dt.time >= time(12, 0))
             & (mydata.datetimeindex.dt.time <= time(13, 30))
             & (mydata.weekday == 0),
             1, 0))

sqlcode = '''
SELECT
    store,
    category,
    -- omitting ELSE yields NULL, which AVG ignores, so rows outside
    -- the window do not drag the mean down
    AVG(CASE WHEN workinghours = 1 THEN sale END) AS working_hours_mean_sales,
    AVG(CASE WHEN weekdaylunchpeak = 1 THEN sale END) AS weekday_lunch_peak_mean_sales
FROM mydata
GROUP BY
    store,
    category
;
'''
newdf = ps.sqldf(sqlcode,locals()) 
newdf

1 Comment

I've updated the question a few times since this, my fall back position was going to attempt it using SQL as I have more experience with that. Thanks
