Pivot a Pandas dataframe using multiple columns

Question

This is a follow-up question to "Pivot a dataframe with two columns as the index".

My data is in this format:

Record ID  Para  Col2  Col3
1          A     x     a
1          A     x     b
2          B     y     a
2          B     y     b
1          A     z     c
1          C     x     a

I would like to reshape it into:

Record  Para  a  b  c  x  y  z
1       A     1  1  1  1  0  1
1       C     1  1  1  1  0  1
2       B     1  1  0  0  1  0

I tried

    csv3 = csv2.pivot_table(index=['Record ID', 'Para'], columns=csv2.iloc[:,2:], aggfunc='size', fill_value=0).reset_index()

but don't get the columns right. What do I need to do differently?

UPDATE 1:

I have tens of columns.

BENY · Accepted Answer · 2018-06-13 02:12:00Z

1

If I understand correctly, you can use get_dummies:

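# DataFrame.sum(level=...) was removed in pandas 2.0; group on the index levels instead.
pd.get_dummies(df.set_index(['RecordID','Para']), prefix='', prefix_sep='').groupby(level=[0,1], sort=False).sum().gt(0).astype(int)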
Out[272]: 
               x  y  z  a  b  c
RecordID Para                  
1        A     1  0  1  1  1  1
2        B     0  1  0  1  1  0

Update

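# Forward-fill the flags within each RecordID so that later Paras inherit them (requires numpy imported as np).
pd.get_dummies(df.set_index(['RecordID','Para']), prefix='', prefix_sep='').groupby(level=[0,1], sort=False).sum().gt(0).astype(int).replace(0,np.nan).groupby(level=0).ffill().fillna(0)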
Out[292]: 
                 x    y    z  a    b    c
RecordID Para                            
1        A     1.0  0.0  1.0  1  1.0  1.0
2        B     0.0  1.0  0.0  1  1.0  0.0
1        C     1.0  0.0  1.0  1  1.0  1.0
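
For anyone who wants to run this end to end, here is a minimal sketch of the same pipeline split into steps. It assumes the sample data from the question with the ID column named RecordID (as in this answer), and it uses groupby(level=...).sum() because sum(level=...) is no longer available in pandas 2.0+. The names flags and shared are only for illustration, and the final astype(int) is an addition to get integer output.

```python
import numpy as np
import pandas as pd

# Sample data from the question (ID column named RecordID, as in the answer).
df = pd.DataFrame({
    "RecordID": [1, 1, 2, 2, 1, 1],
    "Para": ["A", "A", "B", "B", "A", "C"],
    "Col2": ["x", "x", "y", "y", "z", "x"],
    "Col3": ["a", "b", "a", "b", "c", "a"],
})

# Step 1: one indicator column per distinct value in Col2/Col3,
# collapsed to a single 0/1 row per (RecordID, Para) pair.
# sort=False keeps the pairs in order of first appearance.
flags = (
    pd.get_dummies(df.set_index(["RecordID", "Para"]), prefix="", prefix_sep="")
    .groupby(level=[0, 1], sort=False)
    .sum()
    .gt(0)
    .astype(int)
)

# Step 2: within each RecordID, carry flags forward to later Paras.
# Zeros become NaN so that ffill treats them as "missing".
shared = (
    flags.replace(0, np.nan)
    .groupby(level=0)
    .ffill()
    .fillna(0)
    .astype(int)
)

res = shared.reset_index()
print(res)
```

Note that ffill only carries values downward: the (1, C) row picks up the flags of (1, A) because A appears first, but A would not pick up anything that only C has. If the flags should be shared in both directions, something like flags.groupby(level=0).transform("max") may be closer to what is wanted, though that is a different technique from the one in this answer.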

edited Jun 13, 2018 at 2:12

answered Jun 13, 2018 at 1:37

BENY


11 Comments

kurious Over a year ago

Upon trying your solution, I realized I omitted data where a Record ID can have multiple Paras. Can you please update your solution as appropriate.

BENY Over a year ago

@kurious check the update. I do not think your expected output is reasonable. Can you explain it?

kurious Over a year ago

Sure. My record's Para is my outcome of interest. I'm trying to predict it using the Cols. As a record may have more than 1 para, I want to keep the rest of attributes identical to see if each of the Paras can be correctly identified.

kurious Over a year ago

Also, I don't see a difference between your original solution and the update. Am I missing something here?

BENY Over a year ago

Can you explain why C has the same values as A in your output? @kurious

jpp · Accepted Answer · 2018-06-13 00:50:15Z

1

You can aggregate to set and then use get_dummies.

df2 = df.groupby(['RecordID', 'Para'])[df.columns[2:]].aggregate(set)

values = df2.apply(lambda x: set().union(*x.values), axis=1)
dummies = values.str.join('|').str.get_dummies()

res = dummies.reset_index()

print(res)

   RecordID Para  a  b  c  x  y  z
0         1    A  1  1  1  1  0  1
1         2    B  1  1  0  0  1  0
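
A self-contained sketch of the same idea, assuming the question's sample data (including the extra Para C row) with the ID column named RecordID. The names sets and joined are illustrative, and sorted() is added only to make the joined strings deterministic.

```python
import pandas as pd

# Sample data from the question (ID column named RecordID, as in the answer).
df = pd.DataFrame({
    "RecordID": [1, 1, 2, 2, 1, 1],
    "Para": ["A", "A", "B", "B", "A", "C"],
    "Col2": ["x", "x", "y", "y", "z", "x"],
    "Col3": ["a", "b", "a", "b", "c", "a"],
})

# Collect the distinct values of every non-key column per (RecordID, Para) group.
sets = df.groupby(["RecordID", "Para"])[list(df.columns[2:])].agg(set)

# Merge the per-column sets of each group and join them into "a|b|x"-style strings.
joined = sets.apply(lambda row: "|".join(sorted(set().union(*row))), axis=1)

# str.get_dummies splits on "|" and emits one 0/1 column per distinct token.
res = joined.str.get_dummies().reset_index()
print(res)
```

With the Para C row included, this gives C only its own flags (a and x); it does not copy A's flags onto C the way the question's desired output does.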

edited Jun 13, 2018 at 0:50

answered Jun 13, 2018 at 0:29

jpp


2 Comments

kurious Over a year ago

Edited my comment. This approach seems cumbersome if I have tens of columns.

jpp Over a year ago

@kurious, Then you can use a list comprehension instead of explicitly defining your list. But please update your question specifying all your requirements.
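
One way to read the list-comprehension suggestion, as a rough sketch: build the list of value columns from df.columns by excluding the key columns, so nothing has to be typed out by hand. The names key_cols and value_cols are hypothetical, and the data is the sample from the question.

```python
import pandas as pd

# Sample data from the question (ID column named RecordID, as in the answer).
df = pd.DataFrame({
    "RecordID": [1, 1, 2, 2, 1, 1],
    "Para": ["A", "A", "B", "B", "A", "C"],
    "Col2": ["x", "x", "y", "y", "z", "x"],
    "Col3": ["a", "b", "a", "b", "c", "a"],
})

key_cols = ["RecordID", "Para"]

# Every column that is not a key becomes a value column, however many there are.
value_cols = [c for c in df.columns if c not in key_cols]

# The derived list can then be used wherever the columns were listed explicitly.
sets = df.groupby(key_cols)[value_cols].agg(set)
print(value_cols)
```

This scales to any number of columns as long as the key columns are known by name.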
