I have a dataframe df like this:
X1 X2 X3
0 a c a
1 b e c
2 c nan e
3 d nan nan
I would like to create a new dataframe newdf which has one column (uentries) that contains the unique entries of df and the three columns of df which are filled with 0 and 1 depending on whether the the entry of uentries exists in the respective column in df.
My desired output would therefore look as follows (uentries does not need to be ordered):
uentries X1 X2 X3
0 a 1 0 1
1 b 1 0 0
2 c 1 1 1
3 d 1 0 0
4 e 0 1 1
Currently, I do it like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
'X2': ['c', 'e', 'nan', 'nan'],
'X3': ['a', 'c', 'e', 'nan']})
uniqueEntries = set([x for x in np.ravel(df.values) if str(x) != 'nan'])
newdf = pd.DataFrame()
newdf['uentries'] = list(uniqueEntries)
for coli in df.columns:
newdf[coli] = newdf['uentries'].isin(df[coli])
newdf.ix[:, 'X1':'X3'] = newdf.ix[:, 'X1':'X3'].astype(int)
which gives me the desired output.
Is it possible to fill newdf in a more efficient manner?