Reduce execution time when locating element in pandas dataframe

Question

I asked another question here, and I identified the bottleneck in my script, so I ask my question with more clarity. My code looks like this :

temp=df["IPs"]
times_db_all = [df[temp == user]["time"].values for user in user_db.values]

%timeit times_db_all = [df_temp[temp == user]["time"].values for user in user_db.values[0:3]]
1 loops, best of 3: 848 ms per loop #848ms for 3 users !!

my df looks like this :

IPs        time
1.1.1.1    datetime.datetime(2017, 1, 3, 0, 0, 3, tzinfo=tzutc()),
1.1.1.1    datetime.datetime(2017, 1, 4, 1, 7, 30, tzinfo=tzutc()),
3.3.3.3    datetime.datetime(2017, 1, 4, 5, 58, 52, tzinfo=tzutc()),
1.1.1.1    datetime.datetime(2017, 1, 10, 16, 22, 56, tzinfo=tzutc())
4.4.4.4    datetime.datetime(2017, 1, 10, 16, 23, 01, tzinfo=tzutc())
....

with user_db.values = ["1.1.1.1","3.3.3.3","4.4.4.4",...]

The goal is to have the list of all the timestamp in the "time" column of df, for each user. I then use this list to check how long the user stayed on the website and how many times he visited :

       IP       time
    1.1.1.1    [datetime.datetime(2017, 1, 3, 0, 0, 3, tzinfo=tzutc()), 
               datetime.datetime(2017, 1, 4, 1, 7, 30, tzinfo=tzutc()),   
               datetime.datetime(2017, 1, 10, 16, 22, 56, tzinfo=tzutc())]
    3.3.3.3    [datetime.datetime(2017, 1, 4, 5, 58, 52, tzinfo=tzutc())]
    4.4.4.4    [datetime.datetime(2017, 1, 10, 16, 23, 01, tzinfo=tzutc())]

My issue is that I have 3.5 millions row, and it slows the execution of this line a lot.

What could be a faster way to do the same thing?

Any reason why you're not just grouping by IP?

swenzel
– swenzel

2017-01-11 13:38:21 +00:00
Commented Jan 11, 2017 at 13:38 — swenzel
– swenzel, Commented Jan 11, 2017 at 13:38
No reason, can you provide an answer ?

CoMartel
– CoMartel

2017-01-11 13:53:11 +00:00
Commented Jan 11, 2017 at 13:53 — CoMartel
– CoMartel, Commented Jan 11, 2017 at 13:53

Ted Petrou · Accepted Answer · 2017-01-11 13:23:26Z

3

You should not be utilizing a for loop to do individual boolean selection as you are doing. The isin method is designed exactly for this purpose and will select rows that match any of the values in user_db. Try it instead

df.loc[df['IPs'].isin(user_db.values), "time"]

answered Jan 11, 2017 at 13:23

Ted Petrou

62.4k19 gold badges139 silver badges139 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

CoMartel Over a year ago

Unless I'm mistaken, this gives me a single dataframe with all the timestamp for each users. I need an independant list of timestamp for each users.

Ted Petrou Over a year ago

The above solution only selects the time column. If you want all the columns of the original dataframe back then just do df.loc[df['IPs'].isin(user_db.values)]

ashishsingal · Accepted Answer · 2017-01-11 14:09:38Z

Try groupby as below ..

# create a random dataframe with your data
def create_ip(): return '.'.join([str(randint(0,255)) for i in range(4)])
def create_dt(): return datetime.datetime(2017, 1, randint(1,10), randint(0,23), randint(0,59))
df = pd.DataFrame({'ip': [create_ip() for i in range(10)]*10,
                   'time': [create_dt() for i in range(100)]})

# use groupby
df.groupby('ip')['time'].apply(list)

[Out]
ip
127.140.64.48      [2017-01-10 04:23:00, 2017-01-03 16:55:00, 201...
150.206.39.49      [2017-01-02 03:07:00, 2017-01-07 21:59:00, 201...
186.188.130.77     [2017-01-04 13:03:00, 2017-01-05 19:23:00, 201...
190.152.20.150     [2017-01-02 12:47:00, 2017-01-03 23:55:00, 201...
208.235.194.243    [2017-01-10 08:55:00, 2017-01-08 08:07:00, 201...
223.138.217.41     [2017-01-02 22:36:00, 2017-01-10 02:16:00, 201...
226.176.251.244    [2017-01-03 12:08:00, 2017-01-07 06:14:00, 201...
24.21.19.130       [2017-01-07 14:05:00, 2017-01-05 04:25:00, 201...
50.167.31.84       [2017-01-10 03:28:00, 2017-01-03 11:05:00, 201...
83.56.204.14       [2017-01-08 12:46:00, 2017-01-01 03:05:00, 201...
Name: time, dtype: object

# compare times
%timeit df.groupby('ip')['time'].apply(list)
[Out] 100 loops, best of 3: 2.69 ms per loop

%timeit times_db_all = [df[df['ip'] == user]['time'].values for user in df['ip'].unique()]
[Out] 100 loops, best of 3: 10.6 ms per loop

You might be able to make this even faster by setting 'ip' as your index and then grouping on the index.

Collectives™ on Stack Overflow

Reduce execution time when locating element in pandas dataframe

2 Answers 2

2 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related