Pandas Dataframe : Using the count function to filter data

Question

I have a pandas dataframe from which I want to create a new dataframe that keeps only those columns whose count (the number of non-NaN values) equals a specified number.

For example in the dataframe below:

month_end   Col A       Col B       Col C       Col D
200703      NaN          NaN         NaN         NaN
200704      0.084       0.152       0.142      0.0766
200705      0.124       0.123       0.020       NaN 
200706      NaN         0.191       0.091       0.149   
200707      -0.136      0.047       0.135      -0.127

If my_variable = 4, then df1 should only contain Col B and Col C (the two columns with exactly four non-NaN values) along with the index month_end.

How do I do this?

So the first question I have is: did you try anything first? Dataframes have a count method that will give you a series where the index is the names of the columns and the values are the number of non-null results in that column.
– Paul H, Commented Sep 7, 2020 at 15:21
Can you add a running example? That will show us where you are and save us the hassle of building the dataframe ourselves. BTW, clarify "count" - I think you want the count of non-NaN values, but be specific.
– tdelaney, Commented Sep 7, 2020 at 15:24
Yes. I have some understanding of applying column-based filters, but I am not too sure how to apply a count function on all the columns to create a new dataframe. Thank you
– Jamil, Commented Sep 7, 2020 at 15:25
@tdelaney, Yes I want to filter on the basis of non-NaN values. I have added an image to my question. I hope it makes my question easier to understand. Thank you very much for responding.
– Jamil, Commented Sep 7, 2020 at 15:27
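
For anyone who wants to reproduce the question, the sample data can be rebuilt roughly as follows. This is only a sketch: it assumes month_end is the index, and it uses ColA to ColD as column names (the question writes them with a space). The count method then gives the per-column number of non-NaN values that the comments above refer to.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; the column names without spaces
# and month_end as the index are assumptions made for this sketch.
df = pd.DataFrame(
    {
        "ColA": [np.nan, 0.084, 0.124, np.nan, -0.136],
        "ColB": [np.nan, 0.152, 0.123, 0.191, 0.047],
        "ColC": [np.nan, 0.142, 0.020, 0.091, 0.135],
        "ColD": [np.nan, 0.0766, np.nan, 0.149, -0.127],
    },
    index=pd.Index([200703, 200704, 200705, 200706, 200707], name="month_end"),
)

# Number of non-NaN values in each column, indexed by column name.
counts = df.count()
print(counts)
```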

yatu · Accepted Answer · 2020-09-07 15:31:38Z

5

You could do something along the lines of:

df.loc[:,df.notna().sum(0).eq(4)]

    ColB   ColC
0    NaN    NaN
1  0.152  0.142
2  0.123  0.020
3  0.191  0.091
4  0.047  0.135

Or there's also count, which already excludes NaN values when counting:

df.loc[:,df.count().eq(4)]
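
As a quick check that the two variants select the same columns, here is a small self-contained sketch. The frame is the question's sample data with assumed column names (ColA to ColD), and my_variable is the name used in the question for the target count.

```python
import numpy as np
import pandas as pd

# Sample data from the question; column names are assumed for illustration.
df = pd.DataFrame(
    {
        "ColA": [np.nan, 0.084, 0.124, np.nan, -0.136],
        "ColB": [np.nan, 0.152, 0.123, 0.191, 0.047],
        "ColC": [np.nan, 0.142, 0.020, 0.091, 0.135],
        "ColD": [np.nan, 0.0766, np.nan, 0.149, -0.127],
    }
)

my_variable = 4  # required number of non-NaN values per column

# Both expressions build a boolean mask over the columns.
by_count = df.loc[:, df.count().eq(my_variable)]
by_notna = df.loc[:, df.notna().sum(0).eq(my_variable)]

print(by_count.columns.tolist())
```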

If you want to include the date column, and it isn't the index:

ix = df.notna().sum(0).eq(4)
df.loc[:,ix.index[ix].union(['month_end'])]

    ColB   ColC  month_end
0    NaN    NaN     200703
1  0.152  0.142     200704
2  0.123  0.020     200705
3  0.191  0.091     200706
4  0.047  0.135     200707

edited Sep 7, 2020 at 15:31

answered Sep 7, 2020 at 15:23

yatu

5 Comments

Chris Over a year ago

He might also want to return the date column.

Paul H Over a year ago

@Chris This answer is on the right track. If the month_end "column" is actually the index, it'll be included when the OP runs this on the actual dataframe.

Paul H Over a year ago

instead of .notna().sum(), i would use .count()

yatu Over a year ago

Yeah, it's not one I tend to use, so it wasn't at the top of my mind. Nonetheless, I've fitted it into the answer :) @PaulH

Jamil Over a year ago

Thank you everyone for your quick help. It looks like the solution works perfectly well. Eyeballing the dataframe (df1), I can see that the columns with a lower row count have been dropped from the dataframe. I will do a more detailed reconciliation of my data, but this looks more or less sorted. Thank you once again.

Terry · Accepted Answer · 2020-09-07 16:35:43Z

1

Another solution without a loop:

s = df.notna().sum(0) == 4     
df = df.loc[:, s]

edited Sep 7, 2020 at 16:35

Terry

answered Sep 7, 2020 at 15:36

gtomer

5 Comments

gtomer Over a year ago

Emmmm. Just saw it is similar to @yatu above :-(

Jamil Over a year ago

Thank you very much for your help. I have learnt some really elegant solutions today. Thank you once again.

Jamil Over a year ago

Sure. I am new here (not very active). I will keep this in mind henceforth. Thank you very much.

yatu Over a year ago

As a future reference @gtomer , users tend to remove an answer when they realise an identical one has been posted. It avoids the confusion about whether or not the answer was copied (which I'm sure is not the case) and from the latter one being accepted. Another obvious reason is that there is not much use in having two identical answers.

gtomer Over a year ago

@yatu - I tried to delete it, but could not because it was "accepted". Sorry.

gtomer · Accepted Answer · 2020-09-07 15:24:21Z

0

A solution with a for loop:

for col in df.columns:
    if (df[col].count() != 4):
        df.drop(col, axis=1)
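
As written, the loop discards the frame returned by drop, so df itself is left unchanged. A minimal sketch of a loop-style version that keeps the result is shown below. It assumes the question's sample data with column names ColA to ColD, and df1 is the name the question uses for the filtered frame.

```python
import numpy as np
import pandas as pd

# Sample data from the question; column names are assumed for illustration.
df = pd.DataFrame(
    {
        "ColA": [np.nan, 0.084, 0.124, np.nan, -0.136],
        "ColB": [np.nan, 0.152, 0.123, 0.191, 0.047],
        "ColC": [np.nan, 0.142, 0.020, 0.091, 0.135],
        "ColD": [np.nan, 0.0766, np.nan, 0.149, -0.127],
    }
)

# Collect the columns whose non-NaN count differs from the target.
to_drop = [col for col in df.columns if df[col].count() != 4]

# drop returns a new frame, so the result has to be assigned.
df1 = df.drop(columns=to_drop)
print(df1.columns.tolist())
```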

answered Sep 7, 2020 at 15:24

gtomer

4 Comments

Paul H Over a year ago

there's no need to write a loop here.

gtomer Over a year ago

It is not a must, but an option. It can be a good one, depending on what @Jamil plans to do next.

Paul H Over a year ago

But it's not. Since you're not saving the output of drop, this won't act on the OP's dataframe at all. Loops are very rarely a good option with pandas.

Jamil Over a year ago

Thank you for the response everyone but I am hoping someone could help me with a more efficient solution. Thank you.
