1

I asked another question here, and I identified the bottleneck in my script, so I ask my question with more clarity. My code looks like this :

temp=df["IPs"]
times_db_all = [df[temp == user]["time"].values for user in user_db.values]

%timeit times_db_all = [df_temp[temp == user]["time"].values for user in user_db.values[0:3]]
1 loops, best of 3: 848 ms per loop #848ms for 3 users !!

my df looks like this :

IPs        time
1.1.1.1    datetime.datetime(2017, 1, 3, 0, 0, 3, tzinfo=tzutc()),
1.1.1.1    datetime.datetime(2017, 1, 4, 1, 7, 30, tzinfo=tzutc()),
3.3.3.3    datetime.datetime(2017, 1, 4, 5, 58, 52, tzinfo=tzutc()),
1.1.1.1    datetime.datetime(2017, 1, 10, 16, 22, 56, tzinfo=tzutc())
4.4.4.4    datetime.datetime(2017, 1, 10, 16, 23, 01, tzinfo=tzutc())
....

with user_db.values = ["1.1.1.1","3.3.3.3","4.4.4.4",...]

The goal is to have the list of all the timestamp in the "time" column of df, for each user. I then use this list to check how long the user stayed on the website and how many times he visited :

       IP       time
    1.1.1.1    [datetime.datetime(2017, 1, 3, 0, 0, 3, tzinfo=tzutc()), 
               datetime.datetime(2017, 1, 4, 1, 7, 30, tzinfo=tzutc()),   
               datetime.datetime(2017, 1, 10, 16, 22, 56, tzinfo=tzutc())]
    3.3.3.3    [datetime.datetime(2017, 1, 4, 5, 58, 52, tzinfo=tzutc())]
    4.4.4.4    [datetime.datetime(2017, 1, 10, 16, 23, 01, tzinfo=tzutc())]

My issue is that I have 3.5 millions row, and it slows the execution of this line a lot.

What could be a faster way to do the same thing?

2
  • 2
    Any reason why you're not just grouping by IP? Commented Jan 11, 2017 at 13:38
  • No reason, can you provide an answer ? Commented Jan 11, 2017 at 13:53

2 Answers 2

3

You should not be utilizing a for loop to do individual boolean selection as you are doing. The isin method is designed exactly for this purpose and will select rows that match any of the values in user_db. Try it instead

df.loc[df['IPs'].isin(user_db.values), "time"]
Sign up to request clarification or add additional context in comments.

2 Comments

Unless I'm mistaken, this gives me a single dataframe with all the timestamp for each users. I need an independant list of timestamp for each users.
The above solution only selects the time column. If you want all the columns of the original dataframe back then just do df.loc[df['IPs'].isin(user_db.values)]
1

Try groupby as below ..

# create a random dataframe with your data
def create_ip(): return '.'.join([str(randint(0,255)) for i in range(4)])
def create_dt(): return datetime.datetime(2017, 1, randint(1,10), randint(0,23), randint(0,59))
df = pd.DataFrame({'ip': [create_ip() for i in range(10)]*10,
                   'time': [create_dt() for i in range(100)]})

# use groupby
df.groupby('ip')['time'].apply(list)

[Out]
ip
127.140.64.48      [2017-01-10 04:23:00, 2017-01-03 16:55:00, 201...
150.206.39.49      [2017-01-02 03:07:00, 2017-01-07 21:59:00, 201...
186.188.130.77     [2017-01-04 13:03:00, 2017-01-05 19:23:00, 201...
190.152.20.150     [2017-01-02 12:47:00, 2017-01-03 23:55:00, 201...
208.235.194.243    [2017-01-10 08:55:00, 2017-01-08 08:07:00, 201...
223.138.217.41     [2017-01-02 22:36:00, 2017-01-10 02:16:00, 201...
226.176.251.244    [2017-01-03 12:08:00, 2017-01-07 06:14:00, 201...
24.21.19.130       [2017-01-07 14:05:00, 2017-01-05 04:25:00, 201...
50.167.31.84       [2017-01-10 03:28:00, 2017-01-03 11:05:00, 201...
83.56.204.14       [2017-01-08 12:46:00, 2017-01-01 03:05:00, 201...
Name: time, dtype: object

# compare times
%timeit df.groupby('ip')['time'].apply(list)
[Out] 100 loops, best of 3: 2.69 ms per loop

%timeit times_db_all = [df[df['ip'] == user]['time'].values for user in df['ip'].unique()]
[Out] 100 loops, best of 3: 10.6 ms per loop

You might be able to make this even faster by setting 'ip' as your index and then grouping on the index.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.