I have two dataframes, group_user_log and group_train.

group_user_log

user_id  server_time  session_id
1        2018-01-01   435
1        2018-01-01   435
1        2018-01-04   675
1        2018-01-05   454
1        2018-01-05   454
1        2018-01-06   920


group_train 

user_id  impression_time  totalcount  distinct_count
1         2018-01-03      0            0
1         2018-01-05      0            0

For each row of group_train, I want the total count and the distinct count of session_id values in group_user_log (for the same user_id) whose server_time is earlier than impression_time, written into the totalcount and distinct_count columns. The expected output for group_train is:

user_id  impression_time  totalcount  distinct_count
1         2018-01-03      2               1
1         2018-01-05      3               2       

I tried doing it row by row, but that is slow and very inefficient for larger dataframes. The data above is only the subset for one user_id taken from two large dataframes, and the same calculation has to be done for a large number of user_id values, so I am looking for a more efficient approach.
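For reference, a row-by-row version might look roughly like the sketch below. It is only an illustration built on the sample data above, not necessarily the exact code that was tried, and it assumes both time columns are parsed as datetimes.

```python
import pandas as pd

# Sample frames from the question; times parsed as datetimes (an assumption).
group_user_log = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 1, 1],
    "server_time": pd.to_datetime([
        "2018-01-01", "2018-01-01", "2018-01-04",
        "2018-01-05", "2018-01-05", "2018-01-06",
    ]),
    "session_id": [435, 435, 675, 454, 454, 920],
})
group_train = pd.DataFrame({
    "user_id": [1, 1],
    "impression_time": pd.to_datetime(["2018-01-03", "2018-01-05"]),
    "totalcount": [0, 0],
    "distinct_count": [0, 0],
})

totals, distincts = [], []
for row in group_train.itertuples(index=False):
    # Sessions of the same user logged strictly before this impression.
    earlier = group_user_log.loc[
        (group_user_log["user_id"] == row.user_id)
        & (group_user_log["server_time"] < row.impression_time),
        "session_id",
    ]
    totals.append(len(earlier))
    distincts.append(earlier.nunique())

group_train["totalcount"] = totals
group_train["distinct_count"] = distincts
print(group_train)
```

Each impression row rescans the whole log, which is why this gets slow as either frame grows.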

Thanks for your help!!

  • Do you want distinct dates or distinct session ids? Commented Aug 29, 2019 at 16:22
  • Hi, distinct session_id is what I am aiming to get. Commented Aug 29, 2019 at 16:23
  • Possible duplicate of Pandas aggregate count distinct Commented Aug 29, 2019 at 16:27

1 Answer

With merge, query and groupby:

#merge on user_id and keep only rows where server_time<impression_time
m=group_user_log.merge(group_train,on='user_id').query('server_time<impression_time')
#groupby on user_id and impression_time, then agg with size and nunique;
#the result is assigned back to m so it can be reused below
m=(m.groupby(['user_id','impression_time'])['session_id'].agg(['size','nunique'])
   .rename(columns={'size':'totalcount','nunique':'distinct_count'}))
print(m)

                         totalcount  distinct_count
user_id impression_time                            
1       2018-01-03                2               1
        2018-01-05                3               2

You can then use these counts to update group_train by setting user_id and impression_time as its index, so that update aligns the rows on those two keys:

group_train=group_train.set_index(['user_id','impression_time'])
group_train.update(m)
print(group_train) #add .reset_index() to turn the index back into columns

                         totalcount  distinct_count
user_id impression_time                            
1       2018-01-03                2               1
        2018-01-05                3               2
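A self-contained sketch of the same idea, for anyone who wants to run it end to end. It differs from the snippets above in a few labeled ways: the times are parsed as datetimes, group_train starts without the placeholder count columns, the counts are attached with a left merge instead of update, and a hypothetical user_id 2 (not in the question) is added to show that impressions with no earlier log rows end up as 0.

```python
import pandas as pd

# Sample log from the question.
group_user_log = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 1, 1],
    "server_time": pd.to_datetime([
        "2018-01-01", "2018-01-01", "2018-01-04",
        "2018-01-05", "2018-01-05", "2018-01-06",
    ]),
    "session_id": [435, 435, 675, 454, 454, 920],
})
# user_id 2 is hypothetical: it has no log rows at all.
group_train = pd.DataFrame({
    "user_id": [1, 1, 2],
    "impression_time": pd.to_datetime(["2018-01-03", "2018-01-05", "2018-01-02"]),
})

# Pair every log row with every impression of the same user,
# then keep only pairs where the log row is strictly earlier.
pairs = group_user_log.merge(group_train, on="user_id")
pairs = pairs[pairs["server_time"] < pairs["impression_time"]]

# One row per (user_id, impression_time) with both counts.
counts = (
    pairs.groupby(["user_id", "impression_time"])["session_id"]
    .agg(["size", "nunique"])
    .rename(columns={"size": "totalcount", "nunique": "distinct_count"})
    .reset_index()
)

# Left merge keeps impressions without any earlier log rows; fill those with 0.
result = group_train.merge(counts, on=["user_id", "impression_time"], how="left")
count_cols = ["totalcount", "distinct_count"]
result[count_cols] = result[count_cols].fillna(0).astype(int)
print(result)
```

Note that the first merge materializes every log/impression pair per user before filtering, so memory use can grow quickly for users with many rows in both frames.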