
I have a dataframe called df.

The columns in the dataframe can be logically grouped. Hence, I grouped the column names into the lists A, B, and C, where:

A = ['column_1', 'column_2', 'column_3']
B = ['column_4', 'column_5', 'column_6']
C = ['column_7', 'column_8', 'column_9']

In addition to the columns column_1 to column_9, df has one more column called "filename_ID", which is used as the index and thus is not grouped. The columns column_1 to column_9 contain only 0 and 1 values.

Now I want to filter the dataframe so that it only includes rows where there is at least one non-zero value for each group (A, B, C). In other words, I only want to keep the rows (and their respective filename_ID) that fulfill this condition.

I have managed to create a separate dataframe for each group:

df_A = df.loc[(df[A]!=0).any(axis=1)]
df_B = df.loc[(df[B]!=0).any(axis=1)]
df_C = df.loc[(df[C]!=0).any(axis=1)]

However, I don't know how to apply all conditions simultaneously - i.e. how to create one new dataframe where every row fulfills the condition that each logical column group has at least one non-zero value.

1 Comment

pandas.DataFrame.merge??? – Commented Aug 8, 2018 at 20:56

3 Answers


Setup

import numpy as np
import pandas as pd

np.random.seed([3, 1415])

df = pd.DataFrame(
    np.random.randint(2, size=(10, 9)),
    columns=[f"col{i + 1}" for i in range(9)]
)

df

   col1  col2  col3  col4  col5  col6  col7  col8  col9
0     0     1     0     1     0     0     1     0     1
1     1     1     1     0     1     1     0     1     0
2     0     0     0     0     0     0     0     0     0
3     1     0     1     1     1     1     0     0     0
4     0     0     1     1     1     1     1     0     1
5     1     1     0     1     1     1     1     1     1
6     1     0     1     0     0     0     1     1     0
7     0     0     0     0     0     1     0     1     0
8     1     0     1     0     1     0     0     1     1
9     1     0     1     0     0     1     0     1     0

Solution

Create a dictionary that maps each column to its group

m = {
    **dict.fromkeys(['col1', 'col2', 'col3'], 'A'),
    **dict.fromkeys(['col4', 'col5', 'col6'], 'B'),
    **dict.fromkeys(['col7', 'col8', 'col9'], 'C'),
}

Then groupby along axis=1 using that mapping

df[df.groupby(m, axis=1).any().all(1)]

   col1  col2  col3  col4  col5  col6  col7  col8  col9
0     0     1     0     1     0     0     1     0     1
1     1     1     1     0     1     1     0     1     0
4     0     0     1     1     1     1     1     0     1
5     1     1     0     1     1     1     1     1     1
8     1     0     1     0     1     0     0     1     1
9     1     0     1     0     0     1     0     1     0

Notice the ones that didn't make it

   col1  col2  col3  col4  col5  col6  col7  col8  col9
2     0     0     0     0     0     0     0     0     0
3     1     0     1     1     1     1     0     0     0
6     1     0     1     0     0     0     1     1     0
7     0     0     0     0     0     1     0     1     0

You could also have built the mapping from lists of column groups:

cols = [['col1', 'col2', 'col3'], ['col4', 'col5', 'col6'], ['col7', 'col8', 'col9']]
m = {k: v for v, c in enumerate(cols) for k in c}

And performed the same groupby
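
Note that in newer pandas versions (2.1+), groupby(..., axis=1) is deprecated. A minimal sketch of an equivalent, assuming the same df and mapping m as above, is to transpose so the grouping happens along the index:

# Transpose, group the (now row) labels by m, check each group per original row,
# then keep only the rows where all three groups have a non-zero value
df[df.T.groupby(m).any().all()]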


1 Comment

Wow, this is super helpful! Thank you so much!

Try the following:

import numpy as np

column_groups = [A, B, C]
masks = [(df[cols] != 0).any(axis=1) for cols in column_groups]
full_mask = np.logical_and.reduce(masks)
full_df = df[full_mask]
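
If you'd rather avoid the numpy dependency, a small sketch of a pandas-only way to combine the same masks:

# Stack the boolean masks side by side and require all of them per row
full_mask = pd.concat(masks, axis=1).all(axis=1)
full_df = df[full_mask]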



I created a CSV file (check.csv) with sample data.

Sample Input:

ID  a1  a2  a3  a4  a5  a6  a7  a8  a9
1   1   1   1   1   1   1   1   1   1
2   0   0   0   1   0   0   0   1   0
3   0   1   0   0   0   0   1   0   0
4   0   0   0   0   1   0   1   0   1
5   1   1   0   1   1   1   1   0   1
6   0   0   0   0   1   0   0   1   0
7   1   0   1   1   1   0   1   1   1
8   1   1   1   0   1   1   1   0   1
9   0   0   0   1   0   1   0   0   0
10  0   0   1   0   0   0   0   0   0
11  1   0   1   0   1   1   0   1   1
12  1   1   0   1   0   1   1   0   1

import pandas as pd

df = pd.read_csv('check.csv')

# Sum each logical group of columns per row
df['sumA'] = df.a1 + df.a2 + df.a3
df['sumB'] = df.a4 + df.a5 + df.a6
df['sumC'] = df.a7 + df.a8 + df.a9

# Keep rows where every group has at least one non-zero value
new_df = df[(df.sumA > 0) & (df.sumB > 0) & (df.sumC > 0)]
new_df = new_df.drop(['sumA', 'sumB', 'sumC'], axis=1)

Output:

    ID  a1  a2  a3  a4  a5  a6  a7  a8  a9
0   1   1   1   1   1   1   1   1   1   1
4   5   1   1   0   1   1   1   1   0   1
6   7   1   0   1   1   1   0   1   1   1
7   8   1   1   1   0   1   1   1   0   1
10  11  1   0   1   0   1   1   0   1   1
11  12  1   1   0   1   0   1   1   0   1
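
The same sum-based idea also works without adding helper columns to df; a sketch assuming the same a1 to a9 column names:

# Filter directly on the group sums instead of storing them in the dataframe
new_df = df[
    (df[['a1', 'a2', 'a3']].sum(axis=1) > 0)
    & (df[['a4', 'a5', 'a6']].sum(axis=1) > 0)
    & (df[['a7', 'a8', 'a9']].sum(axis=1) > 0)
]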

