33

I have a dataframe:

    site1   time1   site2   time2   site3   time3   site4   time4   site5   time5   ... time6   site7   time7   site8   time8   site9   time9   site10  time10  target
 session_id                                                                                 

21669   56  2013-01-12 08:05:57 55.0    2013-01-12 08:05:57 NaN NaT NaN NaT NaN NaT ... NaT NaN NaT NaN NaT NaN NaT NaN NaT 0
54843   56  2013-01-12 08:37:23 55.0    2013-01-12 08:37:23 56.0    2013-01-12 09:07:07 55.0    2013-01-12 09:07:09 NaN NaT ... NaT NaN NaT NaN NaT NaN NaT NaN NaT 0
77292   946 2013-01-12 08:50:13 946.0   2013-01-12 08:50:14 951.0   2013-01-12 08:50:15 946.0   2013-01-12 08:50:15 946.0   2013-01-12 08:50:16 ... 2013-01-12 08:50:16 948.0   2013-01-12 08:50:16 784.0   2013-01-12 08:50:16 949.0   2013-01-12 08:50:17 946.0   2013-01-12 08:50:17 0
114021  945 2013-01-12 08:50:17 948.0   2013-01-12 08:50:17 949.0   2013-01-12 08:50:18 948.0   2013-01-12 08:50:18 945.0   2013-01-12 08:50:18 ... 2013-01-12 08:50:18 947.0   2013-01-12 08:50:19 945.0   2013-01-12 08:50:19 946.0   2013-01-12 08:50:19 946.0   2013-01-12 08:50:20 0

I need to count, for each row, the number of site columns that are not NaN. I tried

df[['site%s' % i for i in range(1, 11)]].count(axis=1)

but it returns 10 for every id.

Also I have tried

train_df[sites].notnull().count(axis=1)

and it also didn't help.

Desired output

21669    2
54843    4
77292    10
114021   10

train_df[sites].notnull().sum(axis=1)? You only want to sum the True elements in your columns. Alternatively, use train_df[sites].count(axis=1). Commented Oct 31, 2017 at 20:40

3 Answers

49

I'd do this with just count:

train_df[sites].count(axis=1)

count counts only non-null values. The issue with your current implementation is that notnull returns a DataFrame of booleans, and a boolean is never null, so count includes every cell and returns the number of columns for each row.


df

        one       two     three four   five
a -0.166778  0.501113 -0.355322  bar  False
b       NaN       NaN       NaN  NaN    NaN
c -0.337890  0.580967  0.983801  bar  False
d       NaN       NaN       NaN  NaN    NaN
e  0.057802  0.761948 -0.712964  bar   True
f -0.443160 -0.974602  1.047704  bar  False
g       NaN       NaN       NaN  NaN    NaN
h -0.717852 -1.053898 -0.019369  bar  False

df.count(axis=1)

a    5
b    0
c    5
d    0
e    5
f    5
g    0
h    5
dtype: int64

And...

df.notnull().count(axis=1)


a    5
b    5
c    5
d    5
e    5
f    5
g    5
h    5
dtype: int64
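
The same contrast can be reproduced on a small frame shaped like the one in the question. This is only a sketch: the three-column frame and its values below are a hypothetical miniature, not the asker's real data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's frame: three site columns only.
train_df = pd.DataFrame(
    {
        "site1": [56, 56, 946],
        "site2": [55.0, 55.0, 946.0],
        "site3": [np.nan, 56.0, 951.0],
    },
    index=pd.Index([21669, 54843, 77292], name="session_id"),
)
sites = ["site%s" % i for i in range(1, 4)]

# count skips NaN, so this is the number of visited sites per session.
non_null_per_row = train_df[sites].count(axis=1)

# notnull() turns every cell into True/False, none of which is null,
# so count sees a full row every time.
always_full = train_df[sites].notnull().count(axis=1)

print(non_null_per_row.tolist())  # [2, 3, 3]
print(always_full.tolist())       # [3, 3, 3]
```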

2 Comments

It returns 10 for every id.
@PetrPetrov Try saving your file... See my edit, it works nicely.
11

Swapping count(axis=1) for sum(axis=1) should also do the trick, since summing the boolean mask counts the True values in each row:

train_df[sites].notnull().sum(axis=1)
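
Note that the axis matters here: a per-row count needs axis=1, while sum() with the default axis adds up each column instead. A minimal sketch on a made-up two-row frame (the data is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical two-session frame; values are illustrative only.
df = pd.DataFrame(
    {
        "site1": [56.0, 946.0],
        "site2": [55.0, 946.0],
        "site3": [np.nan, 951.0],
    },
    index=[21669, 77292],
)

mask = df.notnull()          # True where a site was recorded

per_row = mask.sum(axis=1)   # non-null sites per session
per_column = mask.sum()      # non-null entries per column (default axis=0)

print(per_row.tolist())      # [2, 3]
print(per_column.tolist())   # [2, 2, 1]
```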

1 Comment

train_df[sites].isnull().sum() and train_df[sites].isnull().any() are two more useful idioms (first counts number of null values, and second shows if there are any nulls)
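
A short sketch of the two idioms from this comment, again on a made-up frame (the column names follow the question, the values are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame(
    {
        "site1": [56.0, 946.0, 945.0],
        "site2": [55.0, np.nan, 948.0],
        "site3": [np.nan, np.nan, 949.0],
    }
)

nulls_per_column = df.isnull().sum()  # how many NaN values each column holds
has_null = df.isnull().any()          # whether each column holds any NaN

print(nulls_per_column.to_dict())  # {'site1': 0, 'site2': 1, 'site3': 2}
print(has_null.tolist())           # [False, True, True]
```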
5

A simple way to find the number of missing values in each row is:

df.isnull().sum(axis=1)

To select the rows that have 3 or more null values:

df[df.isnull().sum(axis=1) >= 3]

If you need to drop the rows that have 3 or more null values, keep only the rows with fewer than 3:

df = df[df.isnull().sum(axis=1) < 3]
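
The row filters above can be checked on a small example. The frame below is hypothetical, and the dropna(thresh=...) line is an alternative not mentioned in the answer: thresh is the minimum number of non-null values a row must have to be kept, so it is expressed relative to the column count.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Hypothetical frame whose rows hold 0, 2, 3 and 4 nulls respectively.
df = pd.DataFrame(
    {
        "a": [1, 1, nan, nan],
        "b": [2, nan, nan, nan],
        "c": [3, nan, nan, nan],
        "d": [4, 4, 4, nan],
    }
)

null_counts = df.isnull().sum(axis=1)   # nulls per row

many_nulls = df[null_counts >= 3]       # rows with 3 or more nulls
kept = df[null_counts < 3]              # rows with fewer than 3 nulls

# Assumed-equivalent alternative: fewer than 3 nulls means at least
# (number of columns - 2) non-null values.
kept_alt = df.dropna(thresh=df.shape[1] - 2)

print(null_counts.tolist())        # [0, 2, 3, 4]
print(many_nulls.index.tolist())   # [2, 3]
print(kept.index.tolist())         # [0, 1]
```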
