I have a dataframe df like this:

  X1   X2   X3
0  a    c    a
1  b    e    c
2  c  nan    e
3  d  nan  nan

I would like to create a new dataframe newdf which has one column (uentries) containing the unique entries of df, plus the three columns of df filled with 0 and 1 depending on whether the entry of uentries exists in the respective column of df.

My desired output would therefore look as follows (uentries does not need to be ordered):

  uentries  X1  X2  X3
0        a   1   0   1
1        b   1   0   0
2        c   1   1   1
3        d   1   0   0
4        e   0   1   1

Currently, I do it like this:

import pandas as pd
import numpy as np

df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', 'nan', 'nan'],
                   'X3': ['a', 'c', 'e', 'nan']})

uniqueEntries = set([x for x in np.ravel(df.values) if str(x) != 'nan'])

newdf = pd.DataFrame()
newdf['uentries'] = list(uniqueEntries)

for coli in df.columns:
    newdf[coli] = newdf['uentries'].isin(df[coli])

newdf.loc[:, 'X1':'X3'] = newdf.loc[:, 'X1':'X3'].astype(int)

which gives me the desired output.

Is it possible to fill newdf in a more efficient manner?

2 Answers

This is a simple way to approach this problem using pd.value_counts.

newdf = df.apply(pd.value_counts).fillna(0)
newdf['uentries'] = newdf.index
newdf = newdf[['uentries', 'X1','X2','X3']]
newdf

uentries X1 X2 X3
a   a   1   0   1
b   b   1   0   0
c   c   1   1   1
d   d   1   0   0
e   e   0   1   1
nan nan 0   2   1

Then you can just drop the row with the nan values:

newdf.drop(['nan'])

uentries X1 X2 X3
a   a   1   0   1
b   b   1   0   0
c   c   1   1   1
d   d   1   0   0
e   e   0   1   1
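The same idea can also be chained into a single expression that drops the 'nan' row and promotes the index to a uentries column in one go. This is only a sketch of that variant; it uses `lambda col: col.value_counts()` in place of the top-level `pd.value_counts`, which newer pandas versions deprecate:

```python
import pandas as pd

df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', 'nan', 'nan'],
                   'X3': ['a', 'c', 'e', 'nan']})

# Count occurrences per column, fill the gaps with 0, drop the 'nan'
# row, name the index 'uentries', and turn it into a regular column.
newdf = (df.apply(lambda col: col.value_counts())
           .fillna(0)
           .astype(int)
           .drop('nan')
           .rename_axis('uentries')
           .reset_index())
print(newdf)
```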
4 Comments

Thanks for the answer. What would be the easiest way to get rid of the nan row and to use a:e as a column rather than as index?
@Cleb I edited my answer. I got rid of the row with the nan values and created a uentries column. The index remained a:e.
Thanks, works fine and seems more efficient than the solution I have. I upvote it for now and might accept it later on depending on the other answers' quality.
Great. I am glad I could help!
You can use get_dummies, sum, and finally concat with fillna:

import pandas as pd

df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', 'nan', 'nan'],
                   'X3': ['a', 'c', 'e', 'nan']})
print(df)
  X1   X2   X3
0  a    c    a
1  b    e    c
2  c  nan    e
3  d  nan  nan

a = pd.get_dummies(df['X1']).sum()
b = pd.get_dummies(df['X2']).sum()
c = pd.get_dummies(df['X3']).sum()

print(pd.concat([a,b,c], axis=1, keys=['X1','X2','X3']).fillna(0))
     X1  X2  X3
a     1   0   1
b     1   0   0
c     1   1   1
d     1   0   0
e     0   1   1
nan   0   2   1

If you use np.nan in test data:

import pandas as pd
import numpy as np

df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', np.nan, np.nan],
                   'X3': ['a', 'c', 'e', np.nan]})
print(df)

a = pd.get_dummies(df['X1']).sum()
b = pd.get_dummies(df['X2']).sum()
c = pd.get_dummies(df['X3']).sum()

print(pd.concat([a,b,c], axis=1, keys=['X1','X2','X3']).fillna(0))
   X1  X2  X3
a   1   0   1
b   1   0   0
c   1   1   1
d   1   0   0
e   0   1   1

2 Comments

Thanks for the answer. How would you then avoid the single calls of get_dummies when you have a large number of columns?
pd.concat([pd.get_dummies(df[c]).any().astype(int) for c in df.columns], axis=1, keys=df.columns).fillna(0)
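Spelled out, that comment's one-liner looks as follows. This sketch uses np.nan in the test data so get_dummies drops missing values automatically; the trailing `.astype(int)` is a small addition to keep the columns integer after fillna reintroduces floats:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', np.nan, np.nan],
                   'X3': ['a', 'c', 'e', np.nan]})

# One dummy frame per column; .any() collapses it to a presence flag
# per unique value, and concat aligns everything on the union of values.
newdf = (pd.concat([pd.get_dummies(df[c]).any().astype(int)
                    for c in df.columns],
                   axis=1, keys=df.columns)
           .fillna(0)
           .astype(int))
print(newdf)
```

Because it loops over df.columns, this scales to any number of columns without naming each one.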
