156

Let's say I have a log of user activity and I want to generate a report of the total duration and the number of unique users per day.

import numpy as np
import pandas as pd
df = pd.DataFrame({'date': ['2013-04-01','2013-04-01','2013-04-01','2013-04-02', '2013-04-02'],
    'user_id': ['0001', '0001', '0002', '0002', '0002'],
    'duration': [30, 15, 20, 15, 30]})

Aggregating duration is pretty straightforward:

group = df.groupby('date')
agg = group.aggregate({'duration': np.sum})
agg
            duration
date
2013-04-01        65
2013-04-02        45

What I'd like to do is sum the duration and count distincts at the same time, but I can't seem to find an equivalent for count_distinct:

agg = group.aggregate({ 'duration': np.sum, 'user_id': count_distinct})

This works, but surely there's a better way, no?

group = df.groupby('date')
agg = group.aggregate({'duration': np.sum})
agg['uv'] = df.groupby('date').user_id.nunique()
agg
            duration  uv
date
2013-04-01        65   2
2013-04-02        45   1

I'm thinking I just need to provide a function that returns the count of distinct items of a Series object to the aggregate function, but I don't have a lot of exposure to the various libraries at my disposal. Also, it seems that the groupby object already knows this information, so wouldn't I just be duplicating the effort?

5 Answers 5

231

How about either of:

>>> df
         date  duration user_id
0  2013-04-01        30    0001
1  2013-04-01        15    0001
2  2013-04-01        20    0002
3  2013-04-02        15    0002
4  2013-04-02        30    0002
>>> df.groupby("date").agg({"duration": np.sum, "user_id": pd.Series.nunique})
            duration  user_id
date                         
2013-04-01        65        2
2013-04-02        45        1
>>> df.groupby("date").agg({"duration": np.sum, "user_id": lambda x: x.nunique()})
            duration  user_id
date                         
2013-04-01        65        2
2013-04-02        45        1
Sign up to request clarification or add additional context in comments.

3 Comments

That's it. pd.Series.nunique is what I couldn't find, well, couldn't get to work correctly. Pretty obvious in hindsight. Thanks!
This answer is outdated. You can now use nunique directly. See @Blodwyn Pig's solution below
Hey do you know how to get non-duplicate count?
130

'nunique' is an option for .agg() since pandas 0.20.0, so:

df.groupby('date').agg({'duration': 'sum', 'user_id': 'nunique'})

3 Comments

Is it possible to agg and get the unique values? something like duration: np.unique
@guy Try df.groupby('date').agg({'user_id': lambda s: s.unique().reset_index(drop=True)})
How do we get the output?
38

Just adding to the answers already given, the solution using the string "nunique" seems much faster, tested here on ~21M rows dataframe, then grouped to ~2M

%time _=g.agg({"id": lambda x: x.nunique()})
CPU times: user 3min 3s, sys: 2.94 s, total: 3min 6s
Wall time: 3min 20s

%time _=g.agg({"id": pd.Series.nunique})
CPU times: user 3min 2s, sys: 2.44 s, total: 3min 4s
Wall time: 3min 18s

%time _=g.agg({"id": "nunique"})
CPU times: user 14 s, sys: 4.76 s, total: 18.8 s
Wall time: 24.4 s

4 Comments

Nice catch! I guess it's b/c in a "lambda"/"other function" case it is applied sequentially, while "known" functions are applied to the whole column in a vectorized fashion.
which solution is from @Blodwyn Pig?
@Chogg, the fastest one!
@Chogg - sorry I changed my username. It was me.
0

If you want to get only a number of distinct values per group you can use the method nunique directly with the DataFrameGroupBy object:

df.groupby('date')['user_id'].nunique()

Comments

0

You can find it for all columns at once with the aggregate method,

df.aggregate(func=pd.Series.nunique, axis=0)
# or
df.aggregate(func='nunique', axis=0)

See aggregate | Pandas Docs

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.