
I have the following data:

attr1_A    attr1_B    attr1_C    attr1_D    attr2_A    attr2_B   attr2_C
      1          0          0          1          1          0         0
      0          1          1          0          0          0         1
      0          0          0          0          0          1         0
      1          1          1          0          1          1         0

I want to retain attr1_A, attr1_B and combine attr1_C and attr1_D into attr1_others. As long as attr1_C and/or attr1_D is 1, then attr1_others will be 1. Similarly, I want to keep attr2_A but combine the remaining attr2_* into attr2_others. Like this:

attr1_A    attr1_B    attr1_others    attr2_A    attr2_others
      1          0          1               1               0     
      0          1          1               0               1  
      0          0          0               0               1 
      1          1          1               1               1 

In other words, for any group of attr, I want to retain a few known columns but combine the remaining ones (I don't know in advance how many other attr columns each group has).

I am thinking of doing each group separately: processing all attr1_*, and then attr2_* because there are a limited number of groups in my dataset, but many attr under each group.

What I can think of right now is to retrieve the "others" columns like this:

# for group 1: every attr1_* column except A and B
df[[x for x in df.columns if "A" not in x and "B" not in x and "attr1_" in x]]

# for group 2: every attr2_* column except A
df[[x for x in df.columns if "A" not in x and "attr2_" in x]]

To combine them, I am thinking of using the any() function, but I can't come up with the syntax. Could you help?

Updated attempt:

I tried this

# for group 1
df['attr1_others'] = df[df[[x for x in list(df.columns) 
                            if "attr1_" in x
                            and "A" not in x 
                            and "B" not in x]].any(axis = 'column')]

but got the below error:

ValueError: No axis named column for object type <class 'pandas.core.frame.DataFrame'>

3 Answers


pandas DataFrames let you operate on whole columns at once, without having to write complex Python logic.

To create your attr1_others and attr2_others columns, you can combine the columns with an element-wise or (the | operator):

df['attr1_others'] = df['attr1_C'] | df['attr1_D']
df['attr2_others'] = df['attr2_B'] | df['attr2_C']

If you wanted an and condition instead, you could use the & operator:

df['attr1_others'] = df['attr1_C'] & df['attr1_D']
df['attr2_others'] = df['attr2_B'] & df['attr2_C']

You can then delete the lingering original values using del:

del df['attr1_C']
del df['attr1_D']
del df['attr2_B']
del df['attr2_C']
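
When a group has more than two columns to combine, the same | operator can be folded over all of them instead of being written out by hand. The following is only a minimal sketch, assuming the sample data from the question (attr1 group only); the names keep and other_cols are illustrative and do not come from the original post.

```python
from functools import reduce
import operator

import pandas as pd

# sample data from the question (attr1 group only)
df = pd.DataFrame({
    'attr1_A': [1, 0, 0, 1],
    'attr1_B': [0, 1, 0, 1],
    'attr1_C': [0, 1, 0, 1],
    'attr1_D': [1, 0, 0, 0],
})

keep = ['attr1_A', 'attr1_B']  # columns to retain as-is (assumed to be known)

# every other attr1_* column, however many there are
other_cols = [c for c in df.columns if c.startswith('attr1_') and c not in keep]

# fold the element-wise OR across all remaining columns;
# note that reduce raises TypeError if other_cols is empty
df['attr1_others'] = reduce(operator.or_, (df[c] for c in other_cols))

# drop the originals in one call instead of one del per column
df = df.drop(columns=other_cols)
```

This works on 0/1 integer columns because | is applied bitwise element by element; for other value types, converting to bool first would be safer.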

1 Comment

Hi @StevenMoseley, there are many attr1_* columns to be combined, not just C and D. Is there a way to combine all attr1_* columns that are not A or B?

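Create a list of the columns to keep. Drop those columns and assign the leftover columns to a new dataframe, df1. Group the columns of df1 by the prefix of their names (the part before the underscore), call any on each group, add the suffix '_others' with add_suffix, and assign the result to df2. Finally, join df2 to the kept columns and sort the columns with sort_index.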

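keep_cols = ['attr1_A', 'attr1_B', 'attr2_A']
# everything that is not explicitly kept
df1 = df.drop(columns=keep_cols)
# group the leftover columns by their prefix ('attr1', 'attr2'); transpose
# first because grouping along axis=1 is deprecated in recent pandas
df2 = (df1.T.groupby(df1.columns.str.split('_').str[0]).any()
          .T.add_suffix('_others').astype(int))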

Out[512]:
   attr1_others  attr2_others
0             1             0
1             1             1
2             0             1
3             1             1

df_final = df[keep_cols].join(df2).sort_index(axis=1)

Out[514]:
   attr1_A  attr1_B  attr1_others  attr2_A  attr2_others
0        1        0             1        1             0
1        0        1             1        0             1
2        0        0             0        0             1
3        1        1             1        1             1


You can use a custom list to select the columns, and then call .any() with the axis=1 parameter. To convert the result to integer, use .astype(int).

For example:

import pandas as pd

df = pd.DataFrame({
        'attr1_A': [1, 0, 0, 1],
        'attr1_B': [0, 1, 0, 1],
        'attr1_C': [0, 1, 0, 1],
        'attr1_D': [1, 0, 0, 0],
        'attr2_A': [1, 0, 0, 1],
        'attr2_B': [0, 0, 1, 1],
        'attr2_C': [0, 1, 0, 0]})

cols = [col for col in df.columns.values if col.startswith('attr1') and col.split('_')[1] not in ('A', 'B')]
df['attr1_others'] = df[cols].any(axis=1).astype(int)
df.drop(cols, axis=1, inplace=True)

cols = [col for col in df.columns.values if col.startswith('attr2') and col.split('_')[1] not in ('A', )]
df['attr2_others'] = df[cols].any(axis=1).astype(int)
df.drop(cols, axis=1, inplace=True)

print(df)

Prints:

   attr1_A  attr1_B  attr2_A  attr1_others  attr2_others
0        1        0        1             1             0
1        0        1        0             1             1
2        0        0        0             0             1
3        1        1        1             1             1
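
Since the two blocks above differ only in the group prefix and the kept suffixes, they could be wrapped in a small helper. This is a sketch under stated assumptions, not part of the original answer: the function name combine_others and the keep_by_group mapping are hypothetical, and column names are assumed to follow the group_suffix pattern. It also spells the axis as 'columns' (equivalent to axis=1); the singular 'column' is what triggers the ValueError shown in the question.

```python
import pandas as pd


def combine_others(df, keep_by_group):
    """Collapse every non-kept column of each group into '<group>_others'.

    keep_by_group maps a group prefix (e.g. 'attr1') to the suffixes
    that should be kept as separate columns (e.g. ('A', 'B')).
    """
    out = df.copy()
    for group, keep_suffixes in keep_by_group.items():
        prefix = group + '_'
        # all columns of this group whose suffix is not in the keep list
        others = [c for c in out.columns
                  if c.startswith(prefix) and c[len(prefix):] not in keep_suffixes]
        if not others:
            continue  # nothing to combine for this group
        # 1 if any of the remaining columns is non-zero in that row
        out[group + '_others'] = out[others].any(axis='columns').astype(int)
        out = out.drop(columns=others)
    # sort columns so each group's columns sit together
    return out.sort_index(axis=1)


df = pd.DataFrame({
    'attr1_A': [1, 0, 0, 1],
    'attr1_B': [0, 1, 0, 1],
    'attr1_C': [0, 1, 0, 1],
    'attr1_D': [1, 0, 0, 0],
    'attr2_A': [1, 0, 0, 1],
    'attr2_B': [0, 0, 1, 1],
    'attr2_C': [0, 1, 0, 0]})

result = combine_others(df, {'attr1': ('A', 'B'), 'attr2': ('A',)})
print(result)
```

Working on a copy leaves the original frame untouched, and the final sort_index(axis=1) puts the columns in the order requested in the question.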
