
I have a large dataframe containing logs of users on a website, and I need to find the duration of each visit for each user.

I have 3.5 million rows and 450k unique users.

This is my code:

temp=df["server.REMOTE_ADDR"]# main df with timestamps and ip adresses
user_db = df["server.REMOTE_ADDR"]# df with all IP adresses

user_db = user_db.drop_duplicates() # drop duplicate IP
time_thresh = 15*60 # if user inactive for 15 minutes, it's a new visit
temp_moyen=[] # array for mean times
temp_min=[] # array for minimal time
temp_max=[] # array for max time
nb_visites=[] # array for number of visit

for k,user in enumerate(user_db.values): # for each user
    print("User {}/{}").format(k+1,len(user_db.values))
    t0=[] # time of beginning of visit
    tf=[] # time of end of visit
    times_db = df[temp == user]["server.date"].values # retrieve all timestamps for current user
    times_db = [dateutil.parser.parse(times) for times in times_db] # parse to datetime
    i=1
    last_t = times_db[0]
    delta = 0
    while i<len(times_db): # while there is still a timestamp in the list
        t0.append(times_db[i-1]) # begin the first visit
        delta=0
        while (delta < time_thresh and i<len(times_db)): # while not inactive for 15 minutes
            delta = (times_db[i]-last_t).total_seconds()
            last_t = times_db[i]
            i+=1
        if i!=len(times_db): #if not last run
            tf.append(times_db[i-2])
        else: # if no more timestamp, record the last one as end of last visit
            tf.append(times_db[-1])
    if len(times_db)<=1: # if only one timestamp, t0 = tf = that timestamp
        t0.append(times_db[-1])
        tf.append(times_db[-1])

    diff=[(final-first).total_seconds() for first,final in zip(t0,tf)] # evaluate diff between each t0 and tf
    temp_moyen.append(np.mean(diff)) # add to the lists
    temp_min.append(np.min(diff))
    temp_max.append(np.max(diff))
    nb_visites.append(len(diff))

user_db=user_db.to_frame() # convert to dataframe
user_db["temp_moyen"]=temp_moyen # add columns for each information (mean,min,max,number of visits)
user_db["temp_min"]=temp_min
user_db["temp_max"]=temp_max
user_db["nb_visites"]=nb_visites

This code works, but it is very slow: about 200 users/minute on my computer. What can I do to:

  • identify the bottleneck?

  • speed it up?

EDIT: As requested, my data look like this: for each user, I have a list of timestamps: [100, 101, 104, 106, 109, 200, 209, 211, 213]

I need to find how many visits each user made. In this example there are two visits, 100-109 and 200-213. The first visit lasted 9, the second lasted 13, so I can compute the mean, min and max of the visit durations.

EDIT 2: The bottleneck is here (277 ms out of 300 ms per loop):

times_db = df[temp == user]["server.date"].values # retrieve all timestamps for current user

I put it in a list comprehension before the for loop, but it is still slow:

times_db_all = [df[temp == user]["server.date"].values for user in user_db.values]

%timeit times_db_all = [df[temp == user]["server.date"].values for user in user_db.values[0:3]]
1 loops, best of 3: 848 ms per loop # 848 ms for only 3 users!

My data looks like this:

user_ip  | server.date
1.1.1.1    datetime.datetime(2017, 1, 3, 0, 0, 3, tzinfo=tzutc()),
1.1.1.1    datetime.datetime(2017, 1, 4, 1, 7, 30, tzinfo=tzutc()),
3.3.3.3    datetime.datetime(2017, 1, 4, 5, 58, 52, tzinfo=tzutc()),
1.1.1.1    datetime.datetime(2017, 1, 10, 16, 22, 56, tzinfo=tzutc())
4.4.4.4    datetime.datetime(2017, 1, 10, 16, 23, 01, tzinfo=tzutc())
....
  • The first question obviously needs to be answered first; for this you can try profiling, docs.python.org/2/library/profile.html. This will identify which routines take the most time, how often they are called, etc. And a preemptive answer to the second question is that loops are usually slower than vector operations. Commented Jan 11, 2017 at 10:32
  • I agree with your point on vector operations, I just don't see how to apply it in my case. I will try profiling. Commented Jan 11, 2017 at 10:35
  • Better would be to provide sample data explaining what you are trying to do and what you expect the result to look like. Commented Jan 11, 2017 at 10:35

2 Answers


To continue from my comment about removing the loop: as I see it, you have a bunch of activity timestamps, and you are assuming that timestamps close together belong to a single visit while larger gaps separate different visits. As an example, [100, 101, 104, 106, 109, 200, 209, 211, 213] would represent two visits, 100-109 and 200-213. To speed this up, you could do the following using scipy:

import scipy

cutoff = 15

times = scipy.array([100, 101, 104, 106, 109, 200, 209, 211, 213, 300, 310, 325])
delta = times[1:] - times[:-1]
which = delta > cutoff # identifies which gaps represent a new visit
N_visits = which.sum() + 1 # note the +1 for 'fence post'
L_boundaries = scipy.zeros((N_visits,)) # generating these arrays might be unnecessary and relatively slow
R_boundaries = scipy.zeros((N_visits,))
L_boundaries[0] = times[0]            # the first visit starts at the very first timestamp
L_boundaries[1:] = times[1:][which]   # later visits start just after a large gap
R_boundaries[:-1] = times[:-1][which] # earlier visits end just before a large gap
R_boundaries[-1] = times[-1]          # the last visit ends at the very last timestamp
visit_lengths = R_boundaries - L_boundaries

This can probably be made even faster, but it should already be a lot faster than your current loop.

The following is probably a little faster, at the expense of some clarity in the code:

import scipy

cutoff = 15

times = scipy.array([100, 101, 104, 106, 109, 200, 209, 211, 213, 300, 310, 325])
which = times[1:] - times[:-1] > cutoff
N_visits = which.sum() + 1 # fence post
visit_lengths = scipy.zeros((N_visits,)) # it is probably inevitable to have to generate this new array
visit_lengths[0]    = times[:-1][which][0] - times[0]
visit_lengths[1:-1] = times[:-1][which][1:] - times[1:][which][:-1]
visit_lengths[-1]   = times[-1] - times[1:][which][-1]

I also think that if you don't care too much about the first and last visits, it might be worth simply ignoring them; a sketch of that idea follows.
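A minimal sketch of that idea (my own addition, not part of the answer above; it uses numpy instead of the scipy alias and reuses the same example data and cutoff): the interior visits are exactly those whose start and end both sit next to a large gap.

import numpy as np

cutoff = 15
times = np.array([100, 101, 104, 106, 109, 200, 209, 211, 213, 300, 310, 325])

which = np.diff(times) > cutoff        # True where a gap starts a new visit
starts = times[1:][which]              # start of every visit except the first
ends = times[:-1][which]               # end of every visit except the last
inner_lengths = ends[1:] - starts[:-1] # durations of the interior visits only
# for this example: array([13]) -- the 200-213 visit; the edge visits 100-109 and 300-325 are dropped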

EDIT based on OP EDIT

You should maybe look at http://pandas.pydata.org/pandas-docs/stable/indexing.html. I think what is slow is that you make a copy of part of your dataframe for every user: df[temp == user] evaluates a boolean mask over all rows and builds a new dataframe, once per user. It might be faster to put the resulting values into a numpy array. You could also parse the dates to datetime once, for the whole dataframe, rather than per user.
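To make that advice concrete, here is a rough sketch of how a single-pass version could look (my own illustration, not part of the original answer; it assumes the server.REMOTE_ADDR and server.date columns from the question and replaces the per-user boolean mask with a groupby):

import numpy as np
import pandas as pd

TIME_THRESH = 15 * 60  # seconds of inactivity that start a new visit

# parse the dates once, for the whole dataframe, instead of once per user
df["server.date"] = pd.to_datetime(df["server.date"], utc=True)

def visit_stats(dates):
    """Mean/min/max visit duration (seconds) and visit count for one user's timestamps."""
    t = dates.sort_values().values                      # numpy datetime64 array
    gaps = np.diff(t) / np.timedelta64(1, "s")          # gaps between consecutive hits, in seconds
    new_visit = gaps > TIME_THRESH                      # True where a gap starts a new visit
    starts = np.concatenate(([t[0]], t[1:][new_visit]))
    ends = np.concatenate((t[:-1][new_visit], [t[-1]]))
    lengths = (ends - starts) / np.timedelta64(1, "s")  # one duration per visit (0 for a lone hit)
    return pd.Series({"temp_moyen": lengths.mean(),
                      "temp_min": lengths.min(),
                      "temp_max": lengths.max(),
                      "nb_visites": len(lengths)})

# one pass over the 3.5M rows; no per-user filtering of the full dataframe
stats = {ip: visit_stats(dates)
         for ip, dates in df.groupby("server.REMOTE_ADDR")["server.date"]}
user_db = pd.DataFrame(stats).T

Even if you keep your original per-user loop, iterating over the groupby groups instead of rebuilding df[temp == user] each time should remove the 277 ms you measured per user.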


1 Comment

You are right, this should be faster. I'll try, thanks.

I can't see the sample data, so here is my general advice:

  • Before you try to optimize your code, I suggest you use a profiler to get statistics about it:

    import cProfile
    cProfile.run('foo()')  # profile a call to your top-level function

    or run python -m cProfile foo.py. This gives you statistics describing how often and for how long the various parts of your program executed, which is an essential prerequisite of optimisation.

  • If your data is multi-dimensional arrays or matrices, try pandas or numpy; they will speed up your code.

  • Sometimes programs are slow because of too much disk I/O or too many database accesses, so make sure these do not show up in your code.

  • Try to eliminate common subexpressions in tight loops (a small sketch follows below).
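A tiny, self-contained illustration of that last point (hypothetical numbers, unrelated to the question's dataframe): anything that does not change between iterations should be computed once, before the loop.

import math

values = [0.1 * i for i in range(1_000_000)]

# slow: math.sqrt(2.0) and len(values) are recomputed on every iteration
total = 0.0
for v in values:
    total += v * math.sqrt(2.0) / len(values)

# faster: hoist the loop-invariant subexpression out of the loop
scale = math.sqrt(2.0) / len(values)
total = sum(v * scale for v in values)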

Hope this helps.

