1

I have two dataframes (pandas.DataFrame), each looking as follows. Let's call the first one df_A.

    code1   code2   code3   code4   code5   
0   1       4       2       0       0 
1   3       2       1       5       0   
2   2       3       0       0       0   

    has1    has2    has3    has4    has5
0   1       1       0       1       0              
1   1       1       1       0       1
2   0       1       1       0       0
    

The objects (rows) are each given up to 5 codes, shown by the five columns in the first df. A 0 means that no code is given in that position.

I instead want a binary representation of which codes each object has, as shown in the second df.

The functions in pandas or scikit-learn for dummy values take into account which position the code is written in; this is unimportant here.

The attempts I have with my own code have not worked due to my inexperience in python and pandas.

This case is different from others I have seen on Stack Overflow, as all the columns represent the same thing.

Thank you!

Edit:

for colname in df_bin.columns:
    for row in range(len(df_codes)):
        if int(colname) in df_codes.iloc[[row]]:
            df_bin[colname][row]=1

This is one of the attempts I made so far.

1
  • Please post what you have so far. Commented Jun 29, 2020 at 14:56

3 Answers

3

You can try stack then str.get_dummies

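# Stack to long form, drop the 0 placeholders, one-hot encode, then sum back per original row.
s = df.stack().loc[lambda x: x != 0].astype(str).str.get_dummies().groupby(level=0).sum().add_prefix('Has')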
   Has1  Has2  Has3  Has4  Has5
0     1     1     0     1     0
1     1     1     1     0     1
2     0     1     1     0     0

1 Comment

This worked! I have seen lambda used before but never understood it. I will look into it more. Thank you very much!
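On the lambda in the answer above: a callable passed to .loc is called with the object being indexed, and whatever it returns is used as the indexer. The sketch below rebuilds the question's sample data (the name df is an assumption carried over from the answer) and checks that the lambda form matches an explicit boolean mask.

```python
import pandas as pd

# Sample data copied from the question; the variable name df is assumed.
df = pd.DataFrame({
    "code1": [1, 3, 2],
    "code2": [4, 2, 3],
    "code3": [2, 1, 0],
    "code4": [0, 5, 0],
    "code5": [0, 0, 0],
})

# One long Series indexed by (row label, column label).
stacked = df.stack()

# .loc calls the lambda with `stacked` itself and uses the returned
# boolean Series as a mask, so no intermediate variable is needed
# in the middle of a method chain.
via_lambda = stacked.loc[lambda x: x != 0]

# Equivalent form with an explicit mask.
via_mask = stacked[stacked != 0]

print(via_lambda.equals(via_mask))
```

The lambda is mainly a convenience for chaining: inside a long chain there is no name for the intermediate result, so the callable stands in for it.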
1

Let's try:

(df.stack().groupby(level=0)
   .value_counts()
   .unstack(fill_value=0)
   [range(1,6)]
   .add_prefix('has')
)

Output:

   has1  has2  has3  has4  has5
0     1     1     0     1     0
1     1     1     1     0     1
2     0     1     1     0     0
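One caveat with selecting columns by label list: if some code never occurs in the data, that column is missing after unstack and the selection raises a KeyError. A possible variant, sketched here on hypothetical data in which code 5 is absent, uses reindex to add missing codes as all-zero columns.

```python
import pandas as pd

# Hypothetical data (not from the question): code 5 never occurs.
df = pd.DataFrame({
    "code1": [1, 3, 2],
    "code2": [4, 2, 3],
    "code3": [2, 1, 0],
})

# Count how often each value occurs per row, values become columns.
counts = df.stack().groupby(level=0).value_counts().unstack(fill_value=0)

# reindex keeps codes 1..5, drops the 0 placeholder column,
# and fills codes that never occurred with 0.
has = counts.reindex(columns=list(range(1, 6)), fill_value=0).add_prefix("has")

print(has)
```

Note that value_counts returns counts, so a code repeated within one row would give a value above 1; this sketch assumes each code appears at most once per row, as in the question.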


0

Here's another way using pd.crosstab:

df_out = df.reset_index().melt('index')
df_out = pd.crosstab(df_out['index'], df_out['value']).drop(0, axis=1).add_prefix('has')

Output:

value  has1  has2  has3  has4  has5
index                              
0         1     1     0     1     0
1         1     1     1     0     1
2         0     1     1     0     0

2 Comments

Thank you. I am unfamiliar with crosstab. The code seems to work, and the output df looks as expected. The dataframe appears to have the right dimensions, but df_out.shape is totally different, and I cannot access the columns the way that I am used to. How is this new table structured, and how would I get the same result as I usually would with df_out["has1"]?
@Professional_n00b The output looks different because of the column header name ('value'). You can still access the dataframe as usual. However, we need to assign the output of pd.crosstab back to df_out, which I didn't do in this solution. I have now changed the answer to include the re-assignment back to df_out.
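To illustrate the structure discussed in these comments: in the crosstab result, 'value' and 'index' are the names of the column axis and the row axis, not extra columns. The sketch below uses the question's sample data (the names df, long_form and flat are assumptions for illustration) and shows one way to drop those axis names.

```python
import pandas as pd

# Sample data copied from the question; the variable name df is assumed.
df = pd.DataFrame({
    "code1": [1, 3, 2],
    "code2": [4, 2, 3],
    "code3": [2, 1, 0],
    "code4": [0, 5, 0],
    "code5": [0, 0, 0],
})

# Long form: one row per (original row, code position) pair.
long_form = df.reset_index().melt("index")

df_out = (
    pd.crosstab(long_form["index"], long_form["value"])
    .drop(0, axis=1)          # 0 is the "no code" placeholder
    .add_prefix("has")
)

# 'value' and 'index' are axis names, so ordinary column access still works.
print(df_out.columns.name, df_out.index.name)
has1 = df_out["has1"]

# Optionally clear both axis names to get a plain-looking frame.
flat = df_out.rename_axis(index=None, columns=None)
print(flat)
```

After rename_axis, the printed header row contains only has1 through has5, which matches the layout requested in the question.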
