4

I have a pandas dataframe and want to create a new dataframe from it that keeps only the columns whose count (the number of non-NaN values) equals a specified number.

For example in the dataframe below:

month_end   Col A    Col B    Col C    Col D
200703      NaN      NaN      NaN      NaN
200704      0.084    0.152    0.142    0.0766
200705      0.124    0.123    0.020    NaN
200706      NaN      0.191    0.091    0.149
200707      -0.136   0.047    0.135    -0.127

If my_variable = 4, then df1 should contain only Col B and Col C (the two columns with exactly 4 non-NaN values), along with the index month_end.

How do I do this?

5 Comments
  • So the first question I have is: did you try anything first? Dataframes have a count method that will give you a series where the index is the names of the columns and the values are the number of non-null results in that column. Commented Sep 7, 2020 at 15:21
  • Can you add a running example? That will show us where you are and save us the hassle of building the dataframe ourselves. BTW, clarify "count" - I think you want the count of non NaN values, but be specific. Commented Sep 7, 2020 at 15:24
  • Yes. I have some understanding of applying column-based filters, but I am not sure how to apply a count function to all the columns to create a new dataframe. Thank you Commented Sep 7, 2020 at 15:25
  • @tdelaney, Yes I want to filter on the basis of non NaN values. I have added an image to my question. I hope it makes my question easier to understand. Thank you very much for responding. Commented Sep 7, 2020 at 15:27
  • @Jamil -- it's dataframe.count() Commented Sep 7, 2020 at 15:29
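For reference, here is a minimal sketch of the count method suggested in the comments, run on the question's example data. The column names without spaces (ColA to ColD) and keeping month_end as a regular column are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Example data from the question; month_end is a regular column here (an assumption).
df = pd.DataFrame({
    "month_end": [200703, 200704, 200705, 200706, 200707],
    "ColA": [np.nan, 0.084, 0.124, np.nan, -0.136],
    "ColB": [np.nan, 0.152, 0.123, 0.191, 0.047],
    "ColC": [np.nan, 0.142, 0.020, 0.091, 0.135],
    "ColD": [np.nan, 0.0766, np.nan, 0.149, -0.127],
})

# Series indexed by column name; each value is that column's number of non-NaN entries.
counts = df.count()
print(counts)
# ColB and ColC each have 4 non-NaN values; ColA and ColD have 3; month_end has 5.
```

Comparing this series against the target number gives a boolean mask over the columns, which is what the answers below build on.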

3 Answers 3

5

You could do something along the lines of:

df.loc[:,df.notna().sum(0).eq(4)]

    ColB   ColC
0    NaN    NaN
1  0.152  0.142
2  0.123  0.020
3  0.191  0.091
4  0.047  0.135

Or there's also count, which already counts only the non-NA values in each column:

df.loc[:,df.count().eq(4)]

If you want to include the date column, and it isn't the index:

ix = df.notna().sum(0).eq(4)
df.loc[:,ix.index[ix].union(['month_end'])]

    ColB   ColC  month_end
0    NaN    NaN     200703
1  0.152  0.142     200704
2  0.123  0.020     200705
3  0.191  0.091     200706
4  0.047  0.135     200707
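If month_end is the index rather than a column, no union step should be needed, because the index is carried along by the column selection. A self-contained sketch under that assumption (column names without spaces are also assumed):

```python
import numpy as np
import pandas as pd

# Example data from the question, with month_end set as the index (an assumption).
df = pd.DataFrame({
    "month_end": [200703, 200704, 200705, 200706, 200707],
    "ColA": [np.nan, 0.084, 0.124, np.nan, -0.136],
    "ColB": [np.nan, 0.152, 0.123, 0.191, 0.047],
    "ColC": [np.nan, 0.142, 0.020, 0.091, 0.135],
    "ColD": [np.nan, 0.0766, np.nan, 0.149, -0.127],
}).set_index("month_end")

my_variable = 4  # required number of non-NaN values, named as in the question

# Boolean mask over columns: True where the non-NaN count equals my_variable.
df1 = df.loc[:, df.count().eq(my_variable)]
print(df1)
```

Here df1 keeps ColB and ColC, and month_end remains as the index.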

5 Comments

He might also want to return the date column.
@Chris This answer is on the right track. If the month_end "column" is actually the index, it'll be included when the OP runs this on the actual dataframe.
Instead of .notna().sum(), I would use .count()
Yeah, it's not one I tend to use, so it wasn't at the top of my mind. Nonetheless, I've added it to the answer :) @PaulH
Thank you everyone for your quick help. It looks like the solution works perfectly well. Eyeballing the dataframe (df1), I can see that the columns with lesser row count have been chopped off the data frame. Will do a more detailed reconciliation of my data, but this looks sorted more or less. Thank you once again.
1

Another solution without a loop:

s = df.notna().sum(0) == 4     
df = df.loc[:, s]
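The same idea can be wrapped in a small function that takes the target count as a parameter. This is only a sketch: select_columns_by_count and its always_keep argument are hypothetical names introduced here, not part of pandas, and the example data uses column names without spaces.

```python
import numpy as np
import pandas as pd


def select_columns_by_count(df, n, always_keep=()):
    """Hypothetical helper: keep the columns that have exactly n non-NaN values.

    Columns named in always_keep (e.g. a date column) are kept regardless of count.
    """
    mask = df.count().eq(n)
    # Force-keep the requested columns; the original column order is preserved.
    mask = mask | mask.index.isin(list(always_keep))
    return df.loc[:, mask]


# Example data from the question; month_end is a regular column here (an assumption).
df = pd.DataFrame({
    "month_end": [200703, 200704, 200705, 200706, 200707],
    "ColA": [np.nan, 0.084, 0.124, np.nan, -0.136],
    "ColB": [np.nan, 0.152, 0.123, 0.191, 0.047],
    "ColC": [np.nan, 0.142, 0.020, 0.091, 0.135],
    "ColD": [np.nan, 0.0766, np.nan, 0.149, -0.127],
})

result = select_columns_by_count(df, 4, always_keep=["month_end"])
print(result)
```

Unlike the union approach, a boolean mask over df.columns leaves the columns in their original order, so month_end stays first.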

5 Comments

Emmmm. Just saw it is similar to @yatu above :-(
Thank you very much for your help. I have learnt some really elegant solutions today. Thank you once again.
Sure. I am new here (not very active). I will keep this in mind henceforth. Thank you very much.
As a future reference @gtomer , users tend to remove an answer when they realise an identical one has been posted. It avoids the confusion about whether or not the answer was copied (which I'm sure is not the case) and from the latter one being accepted. Another obvious reason is that there is not much use in having two identical answers.
@yatu - I tried to delete it, but could not because it was "accepted". Sorry.
0

A solution with a for loop:

for col in df.columns:
    if (df[col].count() != 4):
        df.drop(col, axis=1)

4 Comments

there's no need to write a loop here.
It is not a must, but an option. It can be a good one, depending on what @Jamil plans to do next
But it's not. Since you're not saving the output of drop, this won't act on the OP's dataframe at all. Loops are very rarely a good option with pandas.
Thank you for the response everyone but I am hoping someone could help me with a more efficient solution. Thank you.
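As the comments point out, the loop above discards the result of drop, so df is left unchanged. A corrected sketch of the loop, assuming month_end is the index and the column names have no spaces, assigns the result back on each iteration. The vectorized df.loc[:, df.count().eq(4)] from the other answers is generally the more efficient choice.

```python
import numpy as np
import pandas as pd

# Example data from the question, with month_end set as the index (an assumption).
df = pd.DataFrame({
    "month_end": [200703, 200704, 200705, 200706, 200707],
    "ColA": [np.nan, 0.084, 0.124, np.nan, -0.136],
    "ColB": [np.nan, 0.152, 0.123, 0.191, 0.047],
    "ColC": [np.nan, 0.142, 0.020, 0.091, 0.135],
    "ColD": [np.nan, 0.0766, np.nan, 0.149, -0.127],
}).set_index("month_end")

# df.columns is evaluated once, so reassigning df inside the loop is safe.
for col in df.columns:
    if df[col].count() != 4:
        # drop returns a new dataframe; assign it back so the change sticks.
        df = df.drop(col, axis=1)

print(df)
```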
