I am trying to create a new dataframe with binary (0 or 1) values from an existing dataframe. For every row in the given dataframe, the program should take the value from each cell and set 1 in the corresponding column of the same-numbered row in the new dataframe.

Structure of the existing dataframe

Structure of the new dataframe

I have tried executing the following code snippet.

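for col in products:
    # walk down one column; enumerate gives the row position of each cell
    # (assumes both frames use the default 0..n-1 integer index)
    for index, item in enumerate(products[col]):
        # .loc replaces .ix, which was removed in pandas 1.0
        products_coded.loc[index, 'prod_' + str(item)] = 1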

It works for a small number of rows, but it takes a long time on a large dataset. What is the best way to get the desired outcome?

1 Answer

I think you need:

  • first get_dummies, casting the values to strings
  • aggregate the duplicated column names with max
  • convert the column names to int for correct ordering
  • reindex to order the columns and add the missing ones, replacing NaNs with 0 via the parameter fill_value=0; starting the range at 1 drops the 0 column
  • add_prefix to rename the columns

import pandas as pd

df = pd.DataFrame({'B':[3,1,12,12,8],
                   'C':[0,6,0,14,0],
                   'D':[0,14,0,0,0]})

print (df)
    B   C   D
0   3   0   0
1   1   6  14
2  12   0   0
3  12  14   0
4   8   0   0

df1 = (pd.get_dummies(df.astype(str), prefix='', prefix_sep='')
         .max(level=0, axis=1)
         .rename(columns=lambda x: int(x))
         .reindex(columns=range(1, df.values.max() + 1), fill_value=0)
         .add_prefix('prod_'))
print (df1)

   prod_1  prod_2  prod_3  prod_4  prod_5  prod_6  prod_7  prod_8  prod_9  \
0       0       0       1       0       0       0       0       0       0   
1       1       0       0       0       0       1       0       0       0   
2       0       0       0       0       0       0       0       0       0   
3       0       0       0       0       0       0       0       0       0   
4       0       0       0       0       0       0       0       1       0   

   prod_10  prod_11  prod_12  prod_13  prod_14  
0        0        0        0        0        0  
1        0        0        0        0        1  
2        0        0        1        0        0  
3        0        0        1        0        1  
4        0        0        0        0        0 

Another similar solution:

df1 = (pd.get_dummies(df.astype(str), prefix='', prefix_sep='')
         .max(level=0, axis=1))

df1.columns = df1.columns.astype(int)    

df1 = (df1.reindex(columns=range(1, df1.columns.max() + 1), fill_value=0)
          .add_prefix('prod_'))
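
On newer pandas releases the snippets above may need adjusting: the level argument of max was removed in pandas 2.0, and get_dummies returns booleans by default there. The following is a minimal sketch of the same steps under that assumption, not the code from the original answer. It groups the duplicated column names on the transposed frame and passes dtype=int to keep 0/1 values; the intermediate names dummies and combined are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'B': [3, 1, 12, 12, 8],
                   'C': [0, 6, 0, 14, 0],
                   'D': [0, 14, 0, 0, 0]})

# One indicator column per (source column, value) pair. With an empty prefix
# the column names are just the values, so some names are duplicated.
dummies = pd.get_dummies(df.astype(str), prefix='', prefix_sep='', dtype=int)

# Collapse duplicated column names: transpose, group rows by label, take max,
# then transpose back.
combined = dummies.T.groupby(level=0).max().T

# Column names are strings here; convert to int so they order numerically.
combined.columns = combined.columns.astype(int)

# Keep columns 1..max only (this drops the 0 column) and fill the gaps with 0.
df1 = (combined.reindex(columns=range(1, int(df.to_numpy().max()) + 1),
                        fill_value=0)
               .add_prefix('prod_'))
print(df1)
```

Transposing twice copies the data, so on very wide frames this may be slower than the original one-liner was on old pandas versions.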

2 Comments

I get the following error: TypeError: reindex() got an unexpected keyword argument "axis"
That is an older version of pandas; you only need .reindex(columns=range(1, df1.columns.max() + 1), fill_value=0) instead of .reindex(range(1, df1.columns.max() + 1), axis=1, fill_value=0)
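
As a small illustration of the comment above: on pandas versions whose reindex accepts axis, the two spellings should give the same result. This is a sketch on a made-up two-column frame (small and cols are hypothetical names, not from the question).

```python
import pandas as pd

# Toy frame with integer column labels 1 and 3; label 2 is missing.
small = pd.DataFrame({1: [1, 0], 3: [0, 1]})
cols = range(1, 4)

# Keyword form, which also works on older pandas versions.
by_keyword = small.reindex(columns=cols, fill_value=0)

# labels + axis form, which older versions reject with the TypeError above.
by_axis = small.reindex(cols, axis=1, fill_value=0)

print(by_keyword.equals(by_axis))
```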
