I have a dataframe df like this:

  X1   X2   X3
0  a    c    a
1  b    e    c
2  c  nan    e
3  d  nan  nan

I would like to create a new dataframe newdf which has one column (uentries) containing the unique entries of df, plus the three columns of df filled with 0 and 1 depending on whether the entry of uentries exists in the respective column of df.

My desired output would therefore look as follows (uentries does not need to be ordered):

  uentries  X1  X2  X3
0        a   1   0   1
1        b   1   0   0
2        c   1   1   1
3        d   1   0   0
4        e   0   1   1

Currently, I do it like this:

import pandas as pd
import numpy as np

df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', 'nan', 'nan'],
                   'X3': ['a', 'c', 'e', 'nan']})

uniqueEntries = set([x for x in np.ravel(df.values) if str(x) != 'nan'])

newdf = pd.DataFrame()
newdf['uentries'] = list(uniqueEntries)

for coli in df.columns:
    newdf[coli] = newdf['uentries'].isin(df[coli])

newdf.loc[:, 'X1':'X3'] = newdf.loc[:, 'X1':'X3'].astype(int)

which gives me the desired output.

Is it possible to fill newdf in a more efficient manner?

2 Answers

This is a simple way to approach this problem using pd.value_counts.

newdf = df.apply(pd.value_counts).fillna(0)
newdf['uentries'] = newdf.index
newdf = newdf[['uentries', 'X1','X2','X3']]
newdf

uentries X1 X2 X3
a   a   1   0   1
b   b   1   0   0
c   c   1   1   1
d   d   1   0   0
e   e   0   1   1
nan nan 0   2   1

Then you can just drop the row with the nan values:

newdf.drop(['nan'])

uentries X1 X2 X3
a   a   1   0   1
b   b   1   0   0
c   c   1   1   1
d   d   1   0   0
e   e   0   1   1
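The same idea can also be chained into a single expression that drops the 'nan' row and promotes the index to a uentries column in one go. This is only a sketch of that variant; it uses `lambda col: col.value_counts()` in place of the top-level `pd.value_counts`, which newer pandas versions deprecate:

```python
import pandas as pd

df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', 'nan', 'nan'],
                   'X3': ['a', 'c', 'e', 'nan']})

# Count occurrences per column, fill the gaps with 0, drop the 'nan'
# row, name the index 'uentries', and turn it into a regular column.
newdf = (df.apply(lambda col: col.value_counts())
           .fillna(0)
           .astype(int)
           .drop('nan')
           .rename_axis('uentries')
           .reset_index())
print(newdf)
```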
4 Comments

Thanks for the answer. What would be the easiest way to get rid of the nan row and to use a:e as a column rather than as index?
@Cleb I edited my answer. I got rid of the row with the nan values and created a uentries column. The index remained a:e.
Thanks, works fine and seems more efficient than the solution I have. I upvote it for now and might accept it later on depending on the other answers' quality.
Great. I am glad I could help!
You can use get_dummies, sum, and finally concat with fillna:

import pandas as pd

df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', 'nan', 'nan'],
                   'X3': ['a', 'c', 'e', 'nan']})
print(df)
  X1   X2   X3
0  a    c    a
1  b    e    c
2  c  nan    e
3  d  nan  nan

a = pd.get_dummies(df['X1']).sum()
b = pd.get_dummies(df['X2']).sum()
c = pd.get_dummies(df['X3']).sum()

print(pd.concat([a,b,c], axis=1, keys=['X1','X2','X3']).fillna(0))
     X1  X2  X3
a     1   0   1
b     1   0   0
c     1   1   1
d     1   0   0
e     0   1   1
nan   0   2   1

If you use np.nan in test data:

import pandas as pd
import numpy as np

df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', np.nan, np.nan],
                   'X3': ['a', 'c', 'e', np.nan]})
print(df)

a = pd.get_dummies(df['X1']).sum()
b = pd.get_dummies(df['X2']).sum()
c = pd.get_dummies(df['X3']).sum()

print(pd.concat([a,b,c], axis=1, keys=['X1','X2','X3']).fillna(0))
   X1  X2  X3
a   1   0   1
b   1   0   0
c   1   1   1
d   1   0   0
e   0   1   1

2 Comments

Thanks for the answer. How would you then avoid the single calls of get_dummies when you have a large number of columns?
pd.concat([pd.get_dummies(df[c]).any().astype(int) for c in df.columns], axis=1, keys=df.columns).fillna(0)
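Spelled out, that comment's one-liner looks as follows. This sketch uses np.nan in the test data so get_dummies drops missing values automatically; the trailing `.astype(int)` is a small addition to keep the columns integer after fillna reintroduces floats:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', np.nan, np.nan],
                   'X3': ['a', 'c', 'e', np.nan]})

# One dummy frame per column; .any() collapses it to a presence flag
# per unique value, and concat aligns everything on the union of values.
newdf = (pd.concat([pd.get_dummies(df[c]).any().astype(int)
                    for c in df.columns],
                   axis=1, keys=df.columns)
           .fillna(0)
           .astype(int))
print(newdf)
```

Because it loops over df.columns, this scales to any number of columns without naming each one.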
