I have a Python pandas DataFrame that looks like this:

                   A      B      C    ...     ZZ
2008-01-01 00    NaN    NaN    NaN    ...      1
2008-01-02 00    NaN    NaN    NaN    ...    NaN
2008-01-03 00    NaN    NaN      1    ...    NaN
...              ...    ...    ...    ...    ...
2012-12-31 00    NaN      1    NaN    ...    NaN

and I can't figure out how to get the subset of the DataFrame that keeps only the rows and columns containing at least one '1', so that the final df looks something like this:

                   B      C    ...     ZZ
2008-01-01 00    NaN    NaN    ...      1
2008-01-03 00    NaN      1    ...    NaN
...              ...    ...    ...    ...
2012-12-31 00      1    NaN    ...    NaN

That is, I want to remove all rows and columns that do not contain a 1.

I tried this, which seems to remove the rows with no 1:

df_filtered = df[df.sum(1)>0]

And then I tried to remove the columns with:

df_filtered = df_filtered[df.sum(0)>0]

but get this error after the second line:

IndexingError('Unalignable boolean Series key provided')

2 Answers

5

Do it with loc:

In [90]: df
Out[90]:
    0   1   2   3   4   5
0   1 NaN NaN   1   1 NaN
1 NaN NaN NaN NaN NaN NaN
2   1   1 NaN NaN   1 NaN
3   1 NaN   1   1 NaN NaN
4 NaN NaN NaN NaN NaN NaN

In [91]: df.loc[df.sum(1) > 0, df.sum(0) > 0]
Out[91]:
   0   1   2   3   4
0  1 NaN NaN   1   1
2  1   1 NaN NaN   1
3  1 NaN   1   1 NaN
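
For anyone who wants to run this outside an interactive session, here is a minimal self-contained sketch of the same idea. The frame is rebuilt by hand from the output above, and the names row_mask, col_mask and df_filtered are only illustrative. It assumes, as in the question, that the entries are either 1 or NaN, so a sum greater than 0 means "contains at least one 1".

```python
import numpy as np
import pandas as pd

# Rebuild the small example frame shown above (1 marks a hit, NaN means nothing there).
df = pd.DataFrame(
    [
        [1, np.nan, np.nan, 1, 1, np.nan],
        [np.nan] * 6,
        [1, 1, np.nan, np.nan, 1, np.nan],
        [1, np.nan, 1, 1, np.nan, np.nan],
        [np.nan] * 6,
    ]
)

# sum() skips NaN by default, so an all-NaN row or column sums to 0.
row_mask = df.sum(axis=1) > 0  # one boolean per row, indexed by row labels
col_mask = df.sum(axis=0) > 0  # one boolean per column, indexed by column labels

# .loc takes the row selector first and the column selector second.
df_filtered = df.loc[row_mask, col_mask]
print(df_filtered)
```

Keeping the two masks in separate variables makes it easier to see which axis each one belongs to.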

Here's why you get that error:

Let's say I have the following frame, df (similar to yours):

In [112]: df
Out[112]:
    a   b   c   d   e
0   0   1   1 NaN   1
1 NaN NaN NaN NaN NaN
2   0   0   0 NaN   0
3   0   0   1 NaN   1
4   1   1   1 NaN   1
5   0   0   0 NaN   0
6   1   0   1 NaN   0

When I sum down each column (the default, axis=0) and threshold at 0, I get one boolean per column:

In [113]: row_sum = df.sum()

In [114]: row_sum > 0
Out[114]:
a     True
b     True
c     True
d    False
e     True
dtype: bool

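A boolean Series passed to df[...] selects rows, so it has to share the row index of df. The index of row_sum, however, holds the column labels of df (a through e), so row_sum > 0 cannot be aligned with the row labels (0 through 6), and pandas raises the IndexingError. Passing the column mask as the second argument to loc, as above, applies it to the columns instead.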
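
The mismatch can be reproduced with a short script. This is only a sketch: the frame is retyped from the output above, the names col_mask, row_mask, raised, by_columns and both are illustrative, and the exact exception message may differ between pandas versions, so the code just records the exception type instead of matching on the text.

```python
import warnings

import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    {
        "a": [0, nan, 0, 0, 1, 0, 1],
        "b": [1, nan, 0, 0, 1, 0, 0],
        "c": [1, nan, 0, 1, 1, 0, 1],
        "d": [nan] * 7,
        "e": [1, nan, 0, 1, 1, 0, 0],
    }
)

col_mask = df.sum(axis=0) > 0  # indexed by column labels a..e
row_mask = df.sum(axis=1) > 0  # indexed by row labels 0..6

# df[mask] treats a boolean Series as a row selector, so the column mask
# cannot be aligned with the row index and the lookup fails.
raised = None
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas may also warn about reindexing the key
    try:
        df[col_mask]
    except Exception as exc:  # expected to be the IndexingError from the question
        raised = type(exc).__name__
print(raised)

# Telling .loc which axis each mask belongs to avoids the problem.
by_columns = df.loc[:, col_mask]
both = df.loc[row_mask, col_mask]
print(both)
```

Note that this frame contains zeros as well as NaN, so rows 2 and 5 are dropped along with the all-NaN row 1, because their sums are also 0.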

0

Alternatively, to remove the rows and columns that are entirely NaN, you can use .any() too.

In [1680]: df
Out[1680]:
     0    1    2    3    4   5
0  1.0  NaN  NaN  1.0  1.0 NaN
1  NaN  NaN  NaN  NaN  NaN NaN
2  1.0  1.0  NaN  NaN  1.0 NaN
3  1.0  NaN  1.0  1.0  NaN NaN
4  NaN  NaN  NaN  NaN  NaN NaN

In [1681]: df.loc[df.any(axis=1), df.any(axis=0)]
Out[1681]:
     0    1    2    3    4
0  1.0  NaN  NaN  1.0  1.0
2  1.0  1.0  NaN  NaN  1.0
3  1.0  NaN  1.0  1.0  NaN
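
As a rough cross-check, the sketch below compares the .any() masks with dropna(how="all") applied to each axis in turn. dropna is not part of the answer above; it is shown only as a comparison, and the two approaches agree here only because the frame holds nothing but 1 and NaN. With zeros in the data they can differ, since .any() treats 0 as False while dropna keeps it. The names via_any and via_dropna are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        [1, np.nan, np.nan, 1, 1, np.nan],
        [np.nan] * 6,
        [1, 1, np.nan, np.nan, 1, np.nan],
        [1, np.nan, 1, 1, np.nan, np.nan],
        [np.nan] * 6,
    ]
)

# any() skips NaN by default, so an all-NaN row or column yields False.
via_any = df.loc[df.any(axis=1), df.any(axis=0)]

# Drop rows that are entirely NaN, then columns that are entirely NaN.
via_dropna = df.dropna(axis=0, how="all").dropna(axis=1, how="all")

print(via_any)
print(via_dropna)
```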
