I have a large dataframe containing logs of users on a website, and I need to find the duration of each visit for each user.
I have 3.5 million rows and 450k unique users.
This is my code:
import dateutil.parser
import numpy as np

temp = df["server.REMOTE_ADDR"]            # IP address column of the main df (df also holds the timestamps)
user_db = df["server.REMOTE_ADDR"]         # Series of all IP addresses
user_db = user_db.drop_duplicates()        # drop duplicate IPs
time_thresh = 15 * 60                      # if a user is inactive for 15 minutes, it's a new visit
temp_moyen = []                            # list of mean visit durations
temp_min = []                              # list of minimal visit durations
temp_max = []                              # list of maximal visit durations
nb_visites = []                            # list of numbers of visits

for k, user in enumerate(user_db.values):  # for each user
    print("User {}/{}".format(k + 1, len(user_db.values)))
    t0 = []                                # start times of visits
    tf = []                                # end times of visits
    times_db = df[temp == user]["server.date"].values  # retrieve all timestamps for the current user
    times_db = [dateutil.parser.parse(times) for times in times_db]  # parse to datetime
    i = 1
    last_t = times_db[0]
    delta = 0
    while i < len(times_db):               # while there are still timestamps in the list
        t0.append(times_db[i - 1])         # begin a new visit
        delta = 0
        while delta < time_thresh and i < len(times_db):  # while not inactive for 15 minutes
            delta = (times_db[i] - last_t).total_seconds()
            last_t = times_db[i]
            i += 1
        if i != len(times_db):             # if not the last run
            tf.append(times_db[i - 2])
        else:                              # if no more timestamps, record the last one as the end of the last visit
            tf.append(times_db[-1])
    if len(times_db) <= 1:                 # if there is only one timestamp, the visit starts and ends on it
        t0.append(times_db[0])
        tf.append(times_db[-1])
    diff = [(final - first).total_seconds() for first, final in zip(t0, tf)]  # duration of each visit
    temp_moyen.append(np.mean(diff))       # add the statistics to the lists
    temp_min.append(np.min(diff))
    temp_max.append(np.max(diff))
    nb_visites.append(len(diff))

user_db = user_db.to_frame()               # convert to a dataframe
user_db["temp_moyen"] = temp_moyen         # add a column for each statistic (mean, min, max, number of visits)
user_db["temp_min"] = temp_min
user_db["temp_max"] = temp_max
user_db["nb_visites"] = nb_visites
This code works, but it is very slow: 200 users/minute on my computer. What can I do to:
identify the bottleneck?
speed it up?
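For the first point, one thing I could try is profiling a small sample of the loop with the standard library's cProfile (a sketch; process_users is a hypothetical wrapper around the for-loop above, not part of my code):

import cProfile
import pstats

# Profile one pass over a small sample of users; process_users is a
# hypothetical function wrapping the for-loop above.
cProfile.run("process_users(user_db.values[:50])", "loop.prof")
pstats.Stats("loop.prof").sort_stats("cumulative").print_stats(10)

This prints the ten most expensive calls by cumulative time, which should be enough to spot a single dominating line.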
EDIT:
As requested, my data look like this:
For each user, I have a list of timestamps: [100, 101, 104, 106, 109, 200, 209, 211, 213]
I need to find how many visits a single user made. In this case, it would be two visits, 100-109 and 200-213. The first visit lasted 9, the second lasted 13, so I can get the mean, min and max of the visit durations.
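To make the rule concrete, here is a small numpy sketch of that example (thresh = 15 is illustrative; the real code uses 15*60 seconds):

import numpy as np

times = np.array([100, 101, 104, 106, 109, 200, 209, 211, 213])
thresh = 15                                        # illustrative threshold

gaps = np.diff(times)                              # deltas between consecutive timestamps
is_gap = gaps > thresh                             # True where a new visit starts
starts = times[np.concatenate(([True], is_gap))]   # first timestamp of each visit -> [100, 200]
ends = times[np.concatenate((is_gap, [True]))]     # last timestamp of each visit  -> [109, 213]
durations = ends - starts                          # -> [9, 13]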
EDIT 2: The bottleneck is here (277 ms out of 300 ms per loop):
times_db = df[temp == user]["server.date"].values  # retrieve all timestamps for the current user
I moved it into a list comprehension before the for loop, but it is still slow:
times_db_all = [df[temp == user]["server.date"].values for user in user_db.values]
%timeit times_db_all = [df[temp == user]["server.date"].values for user in user_db.values[0:3]]
1 loops, best of 3: 848 ms per loop  # 848 ms for 3 users!
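I suspect this is because df[temp == user] rescans the whole 3.5M-row column once per user, i.e., 450k full passes. One thing I could try (a sketch, reusing the column names from my code above) is grouping the timestamps by IP in a single pass:

# Build every per-user timestamp array in one pass instead of 450k boolean masks.
times_db_all = {ip: ts.values
                for ip, ts in df.groupby("server.REMOTE_ADDR")["server.date"]}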
My df looks like this:
user_ip | server.date
1.1.1.1   datetime.datetime(2017, 1, 3, 0, 0, 3, tzinfo=tzutc())
1.1.1.1   datetime.datetime(2017, 1, 4, 1, 7, 30, tzinfo=tzutc())
3.3.3.3   datetime.datetime(2017, 1, 4, 5, 58, 52, tzinfo=tzutc())
1.1.1.1   datetime.datetime(2017, 1, 10, 16, 22, 56, tzinfo=tzutc())
4.4.4.4   datetime.datetime(2017, 1, 10, 16, 23, 1, tzinfo=tzutc())
...
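For reference, a fully vectorized rewrite I could try instead of the loop (a sketch, assuming server.date can be converted to a tz-aware datetime64 column and reusing the column names from my code above):

import pandas as pd

df["server.date"] = pd.to_datetime(df["server.date"], utc=True)
df = df.sort_values(["server.REMOTE_ADDR", "server.date"])

# Seconds since the same user's previous hit; NaN on each user's first row.
gap = df.groupby("server.REMOTE_ADDR")["server.date"].diff().dt.total_seconds()
new_visit = gap.isna() | (gap > 15 * 60)           # first hit, or >15 min of inactivity
df["visit_id"] = new_visit.cumsum()                # globally unique id per (user, visit)

# Start/end of every visit, then per-user statistics.
visits = df.groupby(["server.REMOTE_ADDR", "visit_id"])["server.date"].agg(["min", "max"])
visits["duration"] = (visits["max"] - visits["min"]).dt.total_seconds()
stats = visits.groupby(level="server.REMOTE_ADDR")["duration"].agg(["mean", "min", "max", "count"])

Here count would play the role of nb_visites, and mean/min/max those of temp_moyen/temp_min/temp_max.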