Efficient Way of Updating Dataframe Columns

Question

I have two dataframes: let's call them group_user_log and group_train.

group_user_log

user_id  server_time  session_id
1        2018-01-01   435
1        2018-01-01   435
1        2018-01-04   675
1        2018-01-05   454
1        2018-01-05   454
1        2018-01-06   920


group_train 

user_id  impression_time  totalcount  distinct_count
1         2018-01-03      0            0
1         2018-01-05      0            0

For each row of group_train, I want the total count and the distinct count of session_id values in group_user_log for the same user_id where server_time is earlier than impression_time, written into the totalcount and distinct_count columns. The expected output for group_train is:

user_id  impression_time  totalcount  distinct_count
1         2018-01-03      2               1
1         2018-01-05      3               2

I tried doing this row by row, but that is slow and very inefficient for larger dataframes. The data above is a subset for a single user_id taken from two large dataframes, and the same calculation has to be done for a large number of user_ids, so I am looking for a more efficient approach.

Thanks for your help!!
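For reference, the row-by-row approach described in the question might look roughly like the sketch below. It rebuilds the sample data from the question and assumes both time columns are datetimes; the helper name count_sessions_before is hypothetical and not taken from the asker's code.

```python
import pandas as pd

# Example data copied from the question.
group_user_log = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 1, 1],
    "server_time": pd.to_datetime([
        "2018-01-01", "2018-01-01", "2018-01-04",
        "2018-01-05", "2018-01-05", "2018-01-06",
    ]),
    "session_id": [435, 435, 675, 454, 454, 920],
})
group_train = pd.DataFrame({
    "user_id": [1, 1],
    "impression_time": pd.to_datetime(["2018-01-03", "2018-01-05"]),
    "totalcount": [0, 0],
    "distinct_count": [0, 0],
})


def count_sessions_before(row, log):
    """Hypothetical helper: count the log rows that precede one impression."""
    earlier = log[(log["user_id"] == row["user_id"])
                  & (log["server_time"] < row["impression_time"])]
    return pd.Series({
        "totalcount": len(earlier),
        "distinct_count": earlier["session_id"].nunique(),
    })


# One filtered scan of the whole log per impression row: simple, but slow at scale.
baseline = group_train.copy()
baseline[["totalcount", "distinct_count"]] = baseline.apply(
    count_sessions_before, axis=1, log=group_user_log
)
print(baseline)
```

Each impression row triggers a full scan of the log, so the cost grows with the product of the two frame sizes, which is why this does not scale.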

Do you want distinct dates or distinct session ids?

– user3483203, Commented Aug 29, 2019 at 16:22
Hi, distinct session_id is what I am aiming to get.

– ChandanJha, Commented Aug 29, 2019 at 16:23
Possible duplicate of Pandas aggregate count distinct

– G. Anderson, Commented Aug 29, 2019 at 16:27

anky · Accepted Answer · 2019-08-30 06:42:05Z

With groupby, merge and query:

# both time columns should be datetime64 (or ISO-formatted strings) so that < compares chronologically
# merge on user_id, then keep only the rows where server_time < impression_time
m = group_user_log.merge(group_train, on='user_id').query('server_time < impression_time')
# group by user_id and impression_time; size gives the total count, nunique the distinct count
m = (m.groupby(['user_id', 'impression_time'])['session_id'].agg(['size', 'nunique'])
      .rename(columns={'size': 'totalcount', 'nunique': 'distinct_count'}))
print(m)

                         totalcount  distinct_count
user_id impression_time                            
1       2018-01-03                2               1
        2018-01-05                3               2

You can then use this result to update group_train, after setting user_id and impression_time as its index:

group_train=group_train.set_index(['user_id','impression_time'])
group_train.update(m)
print(group_train)  # append .reset_index() to turn the index back into columns

                         totalcount  distinct_count
user_id impression_time                            
1       2018-01-03                2               1
        2018-01-05                3               2

edited Aug 30, 2019 at 6:42

answered Aug 29, 2019 at 16:38



1 Comment

ChandanJha Over a year ago

Let us continue this discussion in chat.
