98

I have a pandas DataFrame in which first column is "user_id" and rest of the columns are tags("Tag_0" to "Tag_122").

I have the data in the following format:

UserId  Tag_0   Tag_1
7867688 0   5
7867688 0   3
7867688 3   0
7867688 3.5 3.5
7867688 4   4
7867688 3.5 0

My aim is to achieve Sum(Tag)/Count(NonZero(Tags)) for each user_id

df.groupby('user_id').sum(), gives me sum(tag), however I am clueless about counting non zero values.

Is it possible to achieve Sum(Tag)/Count(NonZero(Tags)) in one command?

5 Answers 5

185

My favorite way of getting number of nonzeros in each column is

df.astype(bool).sum(axis=0)

For the number of non-zeros in each row use

df.astype(bool).sum(axis=1)

(Thanks to Skulas)

If you have nans in your df you should make these zero first, otherwise they will be counted as 1.

df.fillna(0).astype(bool).sum(axis=1)

(Thanks to SirC)

Sign up to request clarification or add additional context in comments.

6 Comments

I think you meant to axis=0. If you do axis=1 you'd be counting non zero rows
@skulas Good catch! I guess most people come here for rows and that is why no-one has complained before :)
Thats a great one liner! To get all the column values which are not null
@Amir would datetypes ever be zero though?
It is dangerous if you have nan in your dataframe, they would contribute to the sum.
|
38

Why not use np.count_nonzero?

  1. To count the number of non-zeros of an entire dataframe, np.count_nonzero(df)
  2. To count the number of non-zeros of all rows np.count_nonzero(df, axis=0)
  3. To count the number of non-zeros of all columns np.count_nonzero(df, axis=1)

It works with dates too.

1 Comment

Thanksfor this answer! I ended up with this solution as I think it is very human-readable. I only modified two things: For my understanding of "getting the number of non-zero values for all rows" (your case 2) I needed axis=1 instead of axis=0. And I preferred to get the output as pandas.Series, so I used result = pd.Series(index=df.index, data=np.count_nonzero(df, axis=1))
14

To count nonzero values, just do (column!=0).sum(), where column is the data you want to do it for. column != 0 returns a boolean array, and True is 1 and False is 0, so summing this gives you the number of elements that match the condition.

So to get your desired result, do

df.groupby('user_id').apply(lambda column: column.sum()/(column != 0).sum())

4 Comments

@BrenBram What shall be the approach if we have negative values in some of the cells?
@HarshSingal: column != 0 will find all values that are not zero, regardless of whether they're positive or negative.
Sorry for not stating the problem precisely. When I implemented above method the user_id's for which the SUM(Tags) was negative returned -inf in the output while positive SUM(Tags) performed perfectly. I have been unable to figure out why!
@HarshSingal: You could get inf if there were no nonzero tags (so that the count of nonzero tags was zero). Your original formulation is not well-defined for that case, so you'll need to think about what you want the result to be.
0

I know this question is old but it seems OP's aim is different from the question title:

My aim is to achieve Sum(Tag)/Count(NonZero(Tags)) for each user_id...


For OP's aim, we could replace 0 with NaN and use groupby + mean (this works because mean skips NaN by default):

out = df.replace(0, np.nan).groupby('UserId', as_index=False).mean()

Output:

    UserId  Tag_0  Tag_1
0  7867688    3.5  3.875

Comments

0

A simple list comprehension to get the count of non-zero values in each column of df:

[np.count_nonzero(df[x]) for x in df.columns]

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.