
We have a DataFrame that looks like this:

> df.iloc[:3, :11]
    0   1   2   3   4   5   6   7   8   9   10
0   NaN NaN NaN NaN  6   5  NaN NaN  4  NaN  5
1   NaN NaN NaN NaN  8  NaN NaN  7  NaN NaN  5
2   NaN NaN NaN NaN NaN  1  NaN NaN NaN NaN NaN

We simply want the counts of all unique values in the DataFrame. A simple solution is:

df.stack().value_counts() 

However:

1. It looks like stack returns a copy, not a view, which is memory-prohibitive in this case. Is this correct?
2. I want to group the DataFrame by rows, and then get the different histograms for each grouping. If we ignore the memory issues with stack and use it for now, how does one do the grouping correctly?

import numpy as np
import pandas as pd
from numpy import nan

d = pd.DataFrame([[nan, 1, nan, 2, 3],
                  [nan, 1,   1, 1, 3],
                  [nan, 1, nan, 2, 3],
                  [nan, 2,   2, 2, 3]])

len(d.stack())  # 14
d.stack().groupby(np.arange(4))
AssertionError: Grouper and axis must be same length

The stacked DataFrame has a MultiIndex whose length is some number less than n_rows * n_columns, because the NaNs are removed.

0  1    1
   3    2
   4    3
1  0    1
   1    1
   2    1
   3    1
   4    3
    ....

This means we don't easily know how to build our grouping. It would be much better to just operate on the first level, but then I'm stuck on how to then apply the grouping I actually want.
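One way around this is to derive the grouper from the stacked index itself rather than from the original shape, so it automatically has the right length however many NaNs were dropped (a sketch, using the illustrative 'aabb' grouping from below):

```python
import numpy as np
import pandas as pd

nan = np.nan
d = pd.DataFrame([[nan, 1, nan, 2, 3],
                  [nan, 1, 1, 1, 3],
                  [nan, 1, nan, 2, 3],
                  [nan, 2, 2, 2, 3]])

s = d.stack()
# Map each row label to its group; reading the labels off level 0 of the
# stacked MultiIndex guarantees the grouper matches the stacked length
# (14 here) no matter how many NaNs were dropped.
row_to_group = dict(zip(d.index, 'aabb'))
grouper = s.index.get_level_values(0).map(row_to_group)
counts = s.groupby(grouper).value_counts()
```

The trade-off is that the stack still makes a copy; this only solves the alignment problem, not the memory one.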

d.stack().groupby(level=0).groupby(list('aabb'))
KeyError: 'a'

Edit: A solution, which doesn't use stacking:

f = lambda x: pd.value_counts(x.values.ravel())
d.groupby(list('aabb')).apply(f)
a  1    4
   3    2
   2    1
b  2    4
   3    2
   1    1
dtype: int64

Looks clunky, though. If there's a better option I'm happy to hear it.
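A slightly tidier variant of the same idea (a sketch; note the asker's comment below that groupby-value_counts was disabled around 0.13, though it works in recent pandas): put the group labels on the row index first, so level 0 of the stacked MultiIndex already is the grouping:

```python
import numpy as np
import pandas as pd

nan = np.nan
d = pd.DataFrame([[nan, 1, nan, 2, 3],
                  [nan, 1, 1, 1, 3],
                  [nan, 1, nan, 2, 3],
                  [nan, 2, 2, 2, 3]])

# Relabel the rows with their group, then stack: level 0 of the stacked
# index now carries the grouping, so no separate grouper is needed.
g = d.set_axis(list('aabb'), axis=0)
counts = g.stack().groupby(level=0).value_counts()
```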

Edit: Dan's comment revealed I had a typo, though correcting that still doesn't get us to the finish line.

  • df.stack().groupby(list('aaaabbbb')).value_counts() works for me. Commented Feb 7, 2014 at 17:16
  • @DanAllan Hmmm the value_counts shortcut has been turned off in 0.13+ (thought it was whitelisted, but guess not...) Commented Feb 7, 2014 at 20:19

2 Answers


I think you are doing a row-wise operation, so you can use apply:

In [11]: d.apply(pd.Series.value_counts, axis=1).fillna(0)
Out[11]: 
   1  2  3
0  1  1  1
1  3  0  1
2  1  1  1
3  0  3  1

Note: There is a value_counts DataFrame method in the works for 0.14... which will make this more efficient and more concise.

It's worth noting that the pandas value_counts function also works on a numpy array, so you can pass it the values of the DataFrame (as a 1-d array view using np.ravel):

In [21]: pd.value_counts(d.values.ravel())
Out[21]: 
1    5
2    5
3    4
dtype: int64

Also, you were pretty close to getting this correct, but you'd need to stack and unstack:

In [22]: d.stack().groupby(level=0).apply(pd.Series.value_counts).unstack().fillna(0)
Out[22]: 
   1  2  3
0  1  1  1
1  3  0  1
2  1  1  1
3  0  3  1

This error seems somewhat self-explanatory (4 != 14):

len(d.stack())  # 14
d.stack().groupby(np.arange(4))
AssertionError: Grouper and axis must be same length

The grouper needs one label per stacked value, and since stack drops the NaNs that length depends on the data. If you keep them with stack(dropna=False), the length is a predictable n_rows * n_columns (20 here), and you could pass:

In [23]: np.repeat(np.arange(4), 5)
Out[23]: array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3])
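For completeness, a sketch that puts this together: keeping the NaNs via stack(dropna=False) (accepted by pandas historically, deprecated in favor of future_stack in recent versions) makes the stacked length exactly n_rows * n_cols, so a repeated positional grouper lines up, and value_counts then ignores the NaN entries:

```python
import numpy as np
import pandas as pd

nan = np.nan
d = pd.DataFrame([[nan, 1, nan, 2, 3],
                  [nan, 1, 1, 1, 3],
                  [nan, 1, nan, 2, 3],
                  [nan, 2, 2, 2, 3]])

# dropna=False keeps the NaN cells, so the stacked length is always
# 4 * 5 = 20 and a repeated grouper of that length aligns.
s = d.stack(dropna=False)
counts = s.groupby(np.repeat(list('aabb'), 5)).value_counts()
```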

4 Comments

  • Thanks, Andy. This gets close, but the grouping still isn't quite right. See the example I'm about to put into the original question. Also, does stack return a copy instead of a view?
  • The issue is the NaNs make the number of rows variable, so we can't readily know ahead of time how to construct an appropriate grouper for d.stack(). It would be better to operate without the stack, as you've done; I just don't see a way to get the appropriate grouping in there.
  • @jeffalstott you groupby the level; stack is a copy. I'm confused about what final result you actually want....
  • Got a solution. Happy to hear of a better way of doing it. I will accept your answer in the meantime.

Not enough rep to comment, but Andy's answer:

pd.value_counts(d.values.ravel()) 

is what I have used personally, and seems to me to be by far the most versatile and easily-readable solution. Another advantage is that it is easy to use a subset of the columns:

pd.value_counts(d[[1,3,4,6,7]].values.ravel()) 

or

pd.value_counts(d[["col_title1","col_title2"]].values.ravel()) 

Is there any disadvantage to this approach, or any particular reason you want to use stack and groupby?
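One possible disadvantage worth noting (a sketch, not a definitive benchmark): .values.ravel() can materialize a copy of the whole frame when the underlying blocks are not contiguous. If that matters, np.unique on the non-NaN values gives the same counts with plain numpy:

```python
import numpy as np
import pandas as pd

nan = np.nan
d = pd.DataFrame([[nan, 1, nan, 2, 3],
                  [nan, 1, 1, 1, 3],
                  [nan, 1, nan, 2, 3],
                  [nan, 2, 2, 2, 3]])

vals = d.values.ravel()
vals = vals[~np.isnan(vals)]                 # drop the NaNs before counting
uniq, counts = np.unique(vals, return_counts=True)
```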

