I have purchasing data and want to label each record with a new column that indicates the time of day of the purchase. For that I'm using the hour from the timestamp column of each purchase.

Labels should work like this:

 hour 4 - 7 => 'morning'
 hour 8 - 11 => 'before midday'
 ...

I've already extracted the hour from each timestamp. Now I have a DataFrame with 50 million records which looks as follows.

    user_id  timestamp              hour
0   11       2015-08-21 06:42:44    6
1   11       2015-08-20 13:38:58    13
2   11       2015-08-20 13:37:47    13
3   11       2015-08-21 06:59:05    6
4   11       2015-08-20 13:15:21    13
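
For reference, an hour column like the one above can be derived from the timestamps with the `.dt.hour` accessor. This is only a minimal sketch on three of the sample rows, assuming the timestamp column can be parsed with `pd.to_datetime`; it is not necessarily how the column was originally built.

```python
import pandas as pd

# Three sample rows copied from the table above
basket_times = pd.DataFrame({
    "user_id": [11, 11, 11],
    "timestamp": [
        "2015-08-21 06:42:44",
        "2015-08-20 13:38:58",
        "2015-08-20 13:37:47",
    ],
})

# Parse the strings into datetime values, then take the hour component
basket_times["timestamp"] = pd.to_datetime(basket_times["timestamp"])
basket_times["hour"] = basket_times["timestamp"].dt.hour
```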

At the moment my approach is to use .iterrows() six times, each with a different condition:

for index, row in basket_times[(basket_times['hour']  >= 4) & (basket_times['hour'] < 8)].iterrows():
    basket_times['periode'] = 'morning'

then:

for index, row in basket_times[(basket_times['hour']  >= 8) & (basket_times['hour'] < 12)].iterrows():
    basket_times['periode'] = 'before midday'

and so on.

However, just one of those six loops over 50 million records already takes about an hour. Is there a better way to do this?

2 Answers

2

You can use loc with boolean masks, which assigns the labels column-wise instead of looping over rows. I changed the sample data slightly for testing:

print(basket_times)
   user_id           timestamp  hour
0       11 2015-08-21 06:42:44     6
1       11 2015-08-20 13:38:58    13
2       11 2015-08-20 09:37:47     9
3       11 2015-08-21 06:59:05     6
4       11 2015-08-20 13:15:21    13

# create boolean masks (upper bounds are exclusive, so hour 11 is still 'before midday')
morning = (basket_times['hour'] >= 4) & (basket_times['hour'] < 8)
beforemidday = (basket_times['hour'] >= 8) & (basket_times['hour'] < 12)
aftermidday = (basket_times['hour'] >= 12) & (basket_times['hour'] < 15)
print(morning)
0     True
1    False
2    False
3     True
4    False
Name: hour, dtype: bool

print(beforemidday)
0    False
1    False
2     True
3    False
4    False
Name: hour, dtype: bool
print(aftermidday)
0    False
1     True
2    False
3    False
4     True
Name: hour, dtype: bool
basket_times.loc[morning, 'periode'] = 'morning'
basket_times.loc[beforemidday, 'periode'] = 'before midday'
basket_times.loc[aftermidday, 'periode'] = 'after midday'
print(basket_times)
   user_id           timestamp  hour        periode
0       11 2015-08-21 06:42:44     6        morning
1       11 2015-08-20 13:38:58    13   after midday
2       11 2015-08-20 09:37:47     9  before midday
3       11 2015-08-21 06:59:05     6        morning
4       11 2015-08-20 13:15:21    13   after midday

Timings with len(df) = 500k, where a is the loc approach and b is the map approach (both defined in the testing code):

In [87]: %timeit a(df)
10 loops, best of 3: 34 ms per loop

In [88]: %timeit b(df1)
1 loops, best of 3: 490 ms per loop

Code for testing:

import pandas as pd
import io

temp=u"""user_id;timestamp;hour
11;2015-08-21 06:42:44;6
11;2015-08-20 10:38:58;10
11;2015-08-20 09:37:47;9
11;2015-08-21 06:59:05;6
11;2015-08-20 10:15:21;10"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", index_col=None, parse_dates=[1])
# repeat the 5 sample rows 100000 times to get a 500k-row frame
df = pd.concat([df]*100000).reset_index(drop=True)
print(df.shape)
# (500000, 3)
df1 = df.copy()

def a(basket_times):
    # boolean masks, one per period (upper bounds are exclusive)
    morning = (basket_times['hour'] >= 4) & (basket_times['hour'] < 8)
    beforemidday = (basket_times['hour'] >= 8) & (basket_times['hour'] < 12)
    # label only the rows selected by each mask
    basket_times.loc[morning, 'periode'] = 'morning'
    basket_times.loc[beforemidday, 'periode'] = 'before midday'
    return basket_times

def b(basket_times):
    def get_periode(hour):
        if 4 <= hour <= 7:
            return 'morning'
        elif 8 <= hour <= 11:
            return 'before midday'

    basket_times['periode'] = basket_times['hour'].map(get_periode)
    return basket_times

print(a(df))
print(b(df1))
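
As a possible alternative to writing one mask per period, `pd.cut` can bin the hours in a single call. This is a swapped-in technique, not part of the answer above, and the sketch only covers the two periods spelled out in the question; the five-row frame is made up for illustration, and the remaining bin edges and labels would have to be filled in by whoever knows them.

```python
import pandas as pd

# Small illustrative frame (hypothetical data, not the real 50 million rows)
basket_times = pd.DataFrame({"hour": [6, 13, 9, 11, 2]})

# Bin edges for the two periods given in the question. right=False makes
# each bin half-open: [4, 8) -> 'morning', [8, 12) -> 'before midday'.
bins = [4, 8, 12]
labels = ["morning", "before midday"]

# Hours that fall outside every bin come back as missing values
basket_times["periode"] = pd.cut(
    basket_times["hour"], bins=bins, labels=labels, right=False
)
```

The resulting column has a categorical dtype, which tends to use less memory than plain strings on a frame of this size.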

1

You can define a function that maps an hour to the label you want, and then apply it to the hour column with map.

def get_periode(hour):
    if 4 <= hour <= 7:
        return 'morning'
    elif 8 <= hour <= 11:
        return 'before midday'

basket_times['periode'] = basket_times['hour'].map(get_periode)
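
One thing to watch for with this approach: any hour that matches none of the branches gets no label. A minimal sketch on a made-up three-row frame, assuming only the two ranges above are defined:

```python
import pandas as pd

def get_periode(hour):
    # Only the two ranges spelled out in the question are covered here
    if 4 <= hour <= 7:
        return 'morning'
    elif 8 <= hour <= 11:
        return 'before midday'
    # Any other hour falls through and returns None

# Hypothetical sample data for illustration
basket_times = pd.DataFrame({'hour': [6, 13, 9]})
basket_times['periode'] = basket_times['hour'].map(get_periode)

# Rows whose hour matched no range end up with a missing label
unlabeled = basket_times[basket_times['periode'].isna()]
```

Checking `unlabeled` after labeling is a cheap way to confirm that the ranges cover every hour in the data.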

1 Comment

Works perfectly! I also figured out that my approach wasn't working at all.
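
A likely reason the original loop did not work, sketched on a made-up three-row frame: inside the loop, `basket_times['periode'] = 'morning'` assigns to the whole column rather than to the current row, so every record receives the same label.

```python
import pandas as pd

# Hypothetical sample data for illustration
basket_times = pd.DataFrame({"hour": [6, 13, 9]})

# Same shape as the loop in the question: iterate over the filtered rows,
# but assign a scalar to the entire column on each pass.
morning_rows = basket_times[(basket_times["hour"] >= 4) & (basket_times["hour"] < 8)]
for index, row in morning_rows.iterrows():
    basket_times["periode"] = "morning"

# Every row is now labelled 'morning', including hours 13 and 9
labels_after_loop = basket_times["periode"].tolist()
```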
