Counting non-zero values in each column of a DataFrame in Python

Question

I have a pandas DataFrame in which the first column is "user_id" and the rest of the columns are tags ("Tag_0" to "Tag_122").

I have the data in the following format:

UserId   Tag_0  Tag_1
7867688  0      5
7867688  0      3
7867688  3      0
7867688  3.5    3.5
7867688  4      4
7867688  3.5    0

My aim is to compute Sum(Tag)/Count(NonZero(Tags)) for each user_id.

df.groupby('user_id').sum() gives me sum(tag), but I don't know how to count the non-zero values.

Is it possible to achieve Sum(Tag)/Count(NonZero(Tags)) in one command?

The Unfun Cat · Accepted Answer · 2018-05-09 17:08:44Z

185

My favorite way of getting the number of non-zeros in each column is

df.astype(bool).sum(axis=0)

For the number of non-zeros in each row use

df.astype(bool).sum(axis=1)

(Thanks to Skulas)

If you have NaNs in your df, you should replace them with zero first; otherwise each NaN will be counted as a non-zero value.

df.fillna(0).astype(bool).sum(axis=1)

(Thanks to SirC)
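As a minimal sketch of the points above, the snippet below builds a small made-up frame (the values and the missing entry are assumptions for illustration, not the asker's data) and compares the counts with and without fillna(0).

```python
import numpy as np
import pandas as pd

# Toy frame, made up for illustration, with one missing value in Tag_1.
df = pd.DataFrame({
    "Tag_0": [0, 3, 3.5, 4],
    "Tag_1": [5, 0, np.nan, 4],
})

# NaN converts to True, so without fillna it is counted as a non-zero.
per_column_naive = df.astype(bool).sum(axis=0)      # Tag_0 -> 3, Tag_1 -> 3

# Replacing NaN with 0 first gives the intended counts.
per_column = df.fillna(0).astype(bool).sum(axis=0)  # Tag_0 -> 3, Tag_1 -> 2
per_row = df.fillna(0).astype(bool).sum(axis=1)     # 1, 1, 1, 2
```

Both results are pandas Series, so the column names (axis=0) or row labels (axis=1) stay attached to the counts.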

edited May 9, 2018 at 17:08

answered Dec 8, 2015 at 12:39

The Unfun Cat


6 Comments

Skulas Over a year ago

I think you meant axis=0. If you use axis=1, you'd be counting the non-zero values in each row.

The Unfun Cat Over a year ago

@skulas Good catch! I guess most people come here for rows and that is why no-one has complained before :)

Chandra Kanth Over a year ago

That's a great one-liner! To get all the column values which are not null

The Unfun Cat Over a year ago

@Amir would datetypes ever be zero though?

SirC Over a year ago

It is dangerous if you have nan in your dataframe, they would contribute to the sum.


Sarah · Accepted Answer · 2019-10-11 18:04:22Z

38

Why not use np.count_nonzero?

To count the non-zeros in an entire DataFrame: np.count_nonzero(df)
To count the non-zeros down the rows, giving one count per column: np.count_nonzero(df, axis=0)
To count the non-zeros across the columns, giving one count per row: np.count_nonzero(df, axis=1)

It works with dates too.
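A short sketch of how the axis argument behaves, using only the tag columns from the question's sample data. The variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Tag columns from the question's sample data.
df = pd.DataFrame({
    "Tag_0": [0, 0, 3, 3.5, 4, 3.5],
    "Tag_1": [5, 3, 0, 3.5, 4, 0],
})

total = np.count_nonzero(df)               # one number for the whole frame: 8
per_column = np.count_nonzero(df, axis=0)  # collapses the rows: [4, 4]
per_row = np.count_nonzero(df, axis=1)     # collapses the columns: [1, 1, 1, 2, 2, 1]

# count_nonzero returns plain NumPy values, so reattach labels if you need them.
per_column_labeled = pd.Series(per_column, index=df.columns)
```

The axis you pass is the one that gets collapsed, which is why axis=0 yields one count per column and axis=1 yields one count per row.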

answered Oct 11, 2019 at 18:04

Sarah


1 Comment

marcu1000s Over a year ago

Thanks for this answer! I ended up with this solution as I think it is very human-readable. I only modified two things: for my understanding of "getting the number of non-zero values for all rows" (your case 2), I needed axis=1 instead of axis=0. And I preferred to get the output as a pandas.Series, so I used result = pd.Series(index=df.index, data=np.count_nonzero(df, axis=1))

BrenBarn · Accepted Answer · 2014-09-26 07:06:56Z

14

To count nonzero values, just do (column!=0).sum(), where column is the data you want to do it for. column != 0 returns a boolean array, and True is 1 and False is 0, so summing this gives you the number of elements that match the condition.

So to get your desired result, do

df.groupby('user_id').apply(lambda column: column.sum()/(column != 0).sum())
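The same idea can also be written without apply. This is a sketch under assumptions rather than the answer's own code: it uses the "UserId" header from the sample data and selects the tag columns explicitly so the id column is not summed along with them.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "UserId": [7867688] * 6,
    "Tag_0": [0, 0, 3, 3.5, 4, 3.5],
    "Tag_1": [5, 3, 0, 3.5, 4, 0],
})

# Pick the tag columns by name prefix (assumes every tag column starts with "Tag_").
tag_cols = [c for c in df.columns if c.startswith("Tag_")]

# Per-user sum of each tag.
sums = df.groupby("UserId")[tag_cols].sum()

# Per-user count of non-zero entries: the boolean mask sums to a count.
nonzero_counts = df[tag_cols].ne(0).groupby(df["UserId"]).sum()

# Element-wise ratio, aligned on UserId and column name.
result = sums / nonzero_counts  # Tag_0 -> 3.5, Tag_1 -> 3.875
```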

answered Sep 26, 2014 at 7:06

BrenBarn


4 Comments

Harsh Singal Over a year ago

@BrenBarn What should the approach be if we have negative values in some of the cells?

BrenBarn Over a year ago

@HarshSingal: column != 0 will find all values that are not zero, regardless of whether they're positive or negative.

Harsh Singal Over a year ago

Sorry for not stating the problem precisely. When I implemented above method the user_id's for which the SUM(Tags) was negative returned -inf in the output while positive SUM(Tags) performed perfectly. I have been unable to figure out why!

BrenBarn Over a year ago

@HarshSingal: You could get inf if there were no nonzero tags (so that the count of nonzero tags was zero). Your original formulation is not well-defined for that case, so you'll need to think about what you want the result to be.
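One possible way to make the zero-count case explicit is to turn zero counts into NaN before dividing, so the affected cells come out as NaN. This is only a sketch: the data below is made up, and NaN is just one choice for what the result should be.

```python
import numpy as np
import pandas as pd

# Made-up data: user 2 has no non-zero Tag_0 values, user 1 has negative Tag_1 values.
df = pd.DataFrame({
    "UserId": [1, 1, 2, 2],
    "Tag_0": [2.0, 0.0, 0.0, 0.0],
    "Tag_1": [-1.0, -3.0, 4.0, 0.0],
})
tag_cols = ["Tag_0", "Tag_1"]

sums = df.groupby("UserId")[tag_cols].sum()
counts = df[tag_cols].ne(0).groupby(df["UserId"]).sum()

# A zero count becomes NaN, so the division yields NaN for that cell
# instead of dividing by zero.
result = sums / counts.replace(0, np.nan)
```

Here negative values are handled like any other non-zero value (user 1's Tag_1 comes out as -2.0), and only the cell with no non-zero entries is NaN.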

user7864386 · Accepted Answer · 2022-02-19 07:00:33Z

0

I know this question is old but it seems OP's aim is different from the question title:

My aim is to achieve Sum(Tag)/Count(NonZero(Tags)) for each user_id...

For OP's aim, we could replace 0 with NaN and use groupby + mean (this works because mean skips NaN by default):

out = df.replace(0, np.nan).groupby('UserId', as_index=False).mean()

Output:

    UserId  Tag_0  Tag_1
0  7867688    3.5  3.875
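As a quick check of why this works, the sketch below compares the replace + mean result with an explicit sum divided by non-zero count on the sample data. The variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "UserId": [7867688] * 6,
    "Tag_0": [0, 0, 3, 3.5, 4, 3.5],
    "Tag_1": [5, 3, 0, 3.5, 4, 0],
})
tag_cols = ["Tag_0", "Tag_1"]

# Zeros become NaN, and mean() skips NaN by default.
via_mean = df.replace(0, np.nan).groupby("UserId")[tag_cols].mean()

# Explicit version: per-user sum divided by per-user non-zero count.
sums = df.groupby("UserId")[tag_cols].sum()
counts = df[tag_cols].ne(0).groupby(df["UserId"]).sum()
via_ratio = sums / counts

same = np.allclose(via_mean, via_ratio)
```

Note that if the data already contains NaN, the mean approach skips those entries as well, which may or may not be what you want.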

answered Feb 19, 2022 at 7:00

user7864386


datariel · Accepted Answer · 2023-05-19 18:24:54Z

0

A simple list comprehension to get the count of non-zero values in each column of df:

[np.count_nonzero(df[x]) for x in df.columns]
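The list above loses the column names. A small variation, sketched here with the question's tag columns, keeps each count next to its column by using a dict comprehension instead.

```python
import numpy as np
import pandas as pd

# Tag columns from the question's sample data.
df = pd.DataFrame({
    "Tag_0": [0, 0, 3, 3.5, 4, 3.5],
    "Tag_1": [5, 3, 0, 3.5, 4, 0],
})

# Map each column name to its non-zero count; int() converts the NumPy integer.
counts = {col: int(np.count_nonzero(df[col])) for col in df.columns}
```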

answered May 19, 2023 at 18:24

datariel

