I asked another question here and identified the bottleneck in my script, so I am asking again with more clarity. My code looks like this:
temp = df["IPs"]
times_db_all = [df[temp == user]["time"].values for user in user_db.values]
%timeit times_db_all = [df[temp == user]["time"].values for user in user_db.values[0:3]]
1 loops, best of 3: 848 ms per loop  # 848 ms for just 3 users!
My df looks like this:
IPs time
1.1.1.1 datetime.datetime(2017, 1, 3, 0, 0, 3, tzinfo=tzutc()),
1.1.1.1 datetime.datetime(2017, 1, 4, 1, 7, 30, tzinfo=tzutc()),
3.3.3.3 datetime.datetime(2017, 1, 4, 5, 58, 52, tzinfo=tzutc()),
1.1.1.1 datetime.datetime(2017, 1, 10, 16, 22, 56, tzinfo=tzutc())
4.4.4.4 datetime.datetime(2017, 1, 10, 16, 23, 1, tzinfo=tzutc())
....
with user_db.values = ["1.1.1.1", "3.3.3.3", "4.4.4.4", ...]
The goal is to get, for each user, the list of all timestamps in the "time" column of df. I then use this list to check how long each user stayed on the website and how many times they visited:
IP time
1.1.1.1 [datetime.datetime(2017, 1, 3, 0, 0, 3, tzinfo=tzutc()),
datetime.datetime(2017, 1, 4, 1, 7, 30, tzinfo=tzutc()),
datetime.datetime(2017, 1, 10, 16, 22, 56, tzinfo=tzutc())]
3.3.3.3 [datetime.datetime(2017, 1, 4, 5, 58, 52, tzinfo=tzutc())]
4.4.4.4 [datetime.datetime(2017, 1, 10, 16, 23, 1, tzinfo=tzutc())]
My issue is that I have 3.5 million rows, which slows the execution of this line considerably.
What could be a faster way to do the same thing?
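For anyone who wants to experiment, here is a minimal runnable reproduction of the setup and the desired output (the sample data is made up, and tzinfo is dropped for brevity). One candidate for a faster approach is a single `groupby`, which collects the timestamps for every user in one pass over the frame instead of re-scanning all rows once per user:

```python
import datetime
import pandas as pd

# Made-up sample data mirroring the df in the question
df = pd.DataFrame({
    "IPs": ["1.1.1.1", "1.1.1.1", "3.3.3.3", "1.1.1.1", "4.4.4.4"],
    "time": [
        datetime.datetime(2017, 1, 3, 0, 0, 3),
        datetime.datetime(2017, 1, 4, 1, 7, 30),
        datetime.datetime(2017, 1, 4, 5, 58, 52),
        datetime.datetime(2017, 1, 10, 16, 22, 56),
        datetime.datetime(2017, 1, 10, 16, 23, 1),
    ],
})

# One pass over the frame: a Series indexed by IP,
# each value being the list of that user's timestamps
times_per_user = df.groupby("IPs")["time"].apply(list)
print(times_per_user)
```

Whether this beats the list comprehension on 3.5 million rows would need to be timed, but it at least avoids building a boolean mask over the whole column for every user.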