
I have a dataframe called df.

The columns in the dataframe can be logically grouped. Hence, I grouped the column names into the lists A, B, and C, where:

A = ['column_1', 'column_2', 'column_3']
B = ['column_4', 'column_5', 'column_6']
C = ['column_7', 'column_8', 'column_9']

In addition to the columns column_1 to column_9, df has one more column called "filename_ID", which is used as the index and thus is not grouped. The columns column_1 to column_9 contain only 0 and 1 values.

Now I want to filter the dataframe so that it only includes rows where there is at least one non-zero value for each group (A, B, C). In other words, I only want to keep the rows (and their respective filename_ID) that fulfill this condition.

I have managed to create a separate dataframe for each group:

df_A = df.loc[(df[A]!=0).any(axis=1)]
df_B = df.loc[(df[B]!=0).any(axis=1)]
df_C = df.loc[(df[C]!=0).any(axis=1)]

However, I don't know how to apply all conditions simultaneously - i.e. how to create one new dataframe where every row fulfills the condition that each logical column group has at least one non-zero value.

1 Comment

pandas.DataFrame.merge??? – Commented Aug 8, 2018 at 20:56

3 Answers


Setup

import numpy as np
import pandas as pd

np.random.seed([3, 1415])

df = pd.DataFrame(
    np.random.randint(2, size=(10, 9)),
    columns=[f"col{i + 1}" for i in range(9)]
)

df

   col1  col2  col3  col4  col5  col6  col7  col8  col9
0     0     1     0     1     0     0     1     0     1
1     1     1     1     0     1     1     0     1     0
2     0     0     0     0     0     0     0     0     0
3     1     0     1     1     1     1     0     0     0
4     0     0     1     1     1     1     1     0     1
5     1     1     0     1     1     1     1     1     1
6     1     0     1     0     0     0     1     1     0
7     0     0     0     0     0     1     0     1     0
8     1     0     1     0     1     0     0     1     1
9     1     0     1     0     0     1     0     1     0

Solution

Create a dictionary that maps each column to its group

m = {
    **dict.fromkeys(['col1', 'col2', 'col3'], 'A'),
    **dict.fromkeys(['col4', 'col5', 'col6'], 'B'),
    **dict.fromkeys(['col7', 'col8', 'col9'], 'C'),
}

Then groupby along axis=1 using that mapping

df[df.groupby(m, axis=1).any().all(1)]

   col1  col2  col3  col4  col5  col6  col7  col8  col9
0     0     1     0     1     0     0     1     0     1
1     1     1     1     0     1     1     0     1     0
4     0     0     1     1     1     1     1     0     1
5     1     1     0     1     1     1     1     1     1
8     1     0     1     0     1     0     0     1     1
9     1     0     1     0     0     1     0     1     0

Notice the ones that didn't make it

   col1  col2  col3  col4  col5  col6  col7  col8  col9
2     0     0     0     0     0     0     0     0     0
3     1     0     1     1     1     1     0     0     0
6     1     0     1     0     0     0     1     1     0
7     0     0     0     0     0     1     0     1     0

You could also have built the mapping from lists of column groups:

cols = [['col1', 'col2', 'col3'], ['col4', 'col5', 'col6'], ['col7', 'col8', 'col9']]
m = {k: v for v, c in enumerate(cols) for k in c}

And performed the same groupby
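
Note that in newer pandas versions (2.1+), groupby(..., axis=1) is deprecated. A minimal sketch of an equivalent, assuming the same df and mapping m as above, is to transpose so the grouping happens along the index:

# Transpose, group the (now row) labels by m, check each group per original row,
# then keep only the rows where all three groups have a non-zero value
df[df.T.groupby(m).any().all()]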


1 Comment

Wow, this is super helpful! Thank you so much!

Try the following:

import numpy as np

column_groups = [A, B, C]
masks = [(df[cols] != 0).any(axis=1) for cols in column_groups]
full_mask = np.logical_and.reduce(masks)
full_df = df[full_mask]
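
If you'd rather avoid the numpy dependency, a small sketch of a pandas-only way to combine the same masks:

# Stack the boolean masks side by side and require all of them per row
full_mask = pd.concat(masks, axis=1).all(axis=1)
full_df = df[full_mask]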



I created a CSV file (check.csv) with sample data.

Sample Input:

ID  a1  a2  a3  a4  a5  a6  a7  a8  a9
1   1   1   1   1   1   1   1   1   1
2   0   0   0   1   0   0   0   1   0
3   0   1   0   0   0   0   1   0   0
4   0   0   0   0   1   0   1   0   1
5   1   1   0   1   1   1   1   0   1
6   0   0   0   0   1   0   0   1   0
7   1   0   1   1   1   0   1   1   1
8   1   1   1   0   1   1   1   0   1
9   0   0   0   1   0   1   0   0   0
10  0   0   1   0   0   0   0   0   0
11  1   0   1   0   1   1   0   1   1
12  1   1   0   1   0   1   1   0   1

import pandas as pd

df = pd.read_csv('check.csv')

# Sum each logical group of columns per row
df['sumA'] = df.a1 + df.a2 + df.a3
df['sumB'] = df.a4 + df.a5 + df.a6
df['sumC'] = df.a7 + df.a8 + df.a9

# Keep rows where every group has at least one non-zero value
new_df = df[(df.sumA > 0) & (df.sumB > 0) & (df.sumC > 0)]
new_df = new_df.drop(['sumA', 'sumB', 'sumC'], axis=1)

Output:

    ID  a1  a2  a3  a4  a5  a6  a7  a8  a9
0   1   1   1   1   1   1   1   1   1   1
4   5   1   1   0   1   1   1   1   0   1
6   7   1   0   1   1   1   0   1   1   1
7   8   1   1   1   0   1   1   1   0   1
10  11  1   0   1   0   1   1   0   1   1
11  12  1   1   0   1   0   1   1   0   1
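
The same sum-based idea also works without adding helper columns to df; a sketch assuming the same a1 to a9 column names:

# Filter directly on the group sums instead of storing them in the dataframe
new_df = df[
    (df[['a1', 'a2', 'a3']].sum(axis=1) > 0)
    & (df[['a4', 'a5', 'a6']].sum(axis=1) > 0)
    & (df[['a7', 'a8', 'a9']].sum(axis=1) > 0)
]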

