
This is a follow-up question to "Pivot a dataframe with two columns as the index".

My data is in this format:

Record ID Para  Col2     Col3
1          A        x      a
1          A        x      b
2          B        y      a
2          B        y      b
1          A        z      c
1          C        x      a

I would like to reshape it into:

Record Para  a     b      c    x   y  z 
1       A    1     1      1    1   0  1
1       C    1     1      1    1   0  1
2       B    1     1      0    0   1  0 

I tried

    csv3 = csv2.pivot_table(index=['Record ID', 'Para'], columns=csv2.iloc[:,2:], aggfunc='size', fill_value=0).reset_index()

but I don't get the columns right. What do I need to do differently?

UPDATE 1:

I have tens of columns.

2 Answers

1

If I understand correctly, you can use get_dummies:

# groupby(level=...).sum() replaces sum(level=...), which was removed in pandas 2.0
pd.get_dummies(df.set_index(['RecordID','Para']), prefix='', prefix_sep='').groupby(level=[0,1]).sum().gt(0).astype(int)
Out[272]: 
               x  y  z  a  b  c
RecordID Para                  
1        A     1  0  1  1  1  1
2        B     0  1  0  1  1  0

Update

pd.get_dummies(df.set_index(['RecordID','Para']), prefix='', prefix_sep='').groupby(level=[0,1], sort=False).sum().gt(0).astype(int).replace(0, np.nan).groupby(level=0).ffill().fillna(0)
Out[292]: 
                 x    y    z  a    b    c
RecordID Para                            
1        A     1.0  0.0  1.0  1  1.0  1.0
2        B     0.0  1.0  0.0  1  1.0  0.0
1        C     1.0  0.0  1.0  1  1.0  1.0

11 Comments

Upon trying your solution, I realized I omitted data where a Record ID can have multiple Paras. Can you please update your solution as appropriate?
@kurious Check the update. I do not think your expected output is reasonable; can you explain it?
Sure. My record's Para is my outcome of interest. I'm trying to predict it using the Cols. As a record may have more than one Para, I want to keep the rest of the attributes identical to see if each of the Paras can be correctly identified.
Also, I don't see a difference between your original solution and the update. Am I missing something here?
Can you explain why C has the same values as A in your output? @kurious
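For readers following this exchange, here is a minimal sketch of one way to get the output described in the question, where every Para of a record shares the indicators computed over the whole record. It is not the answerer's method: it swaps in melt plus crosstab, and it assumes the column names RecordID, Para, Col2, Col3 used in the answers (the names `long`, `per_record` and `pairs` are illustrative).

```python
import pandas as pd

# Sample data from the question (column names as used in the answers).
df = pd.DataFrame({
    "RecordID": [1, 1, 2, 2, 1, 1],
    "Para": ["A", "A", "B", "B", "A", "C"],
    "Col2": ["x", "x", "y", "y", "z", "x"],
    "Col3": ["a", "b", "a", "b", "c", "a"],
})

# Every column after RecordID and Para is treated as an attribute column.
value_cols = list(df.columns[2:])

# Long format: one row per (RecordID, observed value).
long = df.melt(id_vars="RecordID", value_vars=value_cols, value_name="value")

# 0/1 indicator per record: was the value seen anywhere in that RecordID?
per_record = pd.crosstab(long["RecordID"], long["value"]).gt(0).astype(int)

# Attach the record-level indicators to each distinct (RecordID, Para) pair.
pairs = df[["RecordID", "Para"]].drop_duplicates()
res = (
    pairs.merge(per_record, left_on="RecordID", right_index=True)
    .sort_values(["RecordID", "Para"])
    .reset_index(drop=True)
)
print(res)
```

Because the indicators are computed per RecordID only, Para C gets the same row as Para A, which is the behaviour the asker describes wanting.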
1

You can aggregate to set and then use get_dummies.

df2 = df.groupby(['RecordID', 'Para'])[df.columns[2:]].aggregate(set)

values = df2.apply(lambda x: set().union(*x.values), axis=1)
dummies = values.str.join('|').str.get_dummies()

res = dummies.reset_index()

print(res)

   RecordID Para  a  b  c  x  y  z
0         1    A  1  1  1  1  0  1
1         2    B  1  1  0  0  1  0

2 Comments

Edited my comment. This approach seems to be cumbersome if I have tens of columns.
@kurious, Then you can use a list comprehension instead of explicitly defining your list. But please update your question specifying all your requirements.
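On the concern about tens of columns: a hedged sketch of an alternative that never lists the attribute columns by name. It is a different technique from the set aggregation above (melt followed by crosstab) and assumes the same RecordID / Para column names; `id_cols` and `long` are illustrative names.

```python
import pandas as pd

# Sample data from the question (column names as used in the answers).
df = pd.DataFrame({
    "RecordID": [1, 1, 2, 2, 1, 1],
    "Para": ["A", "A", "B", "B", "A", "C"],
    "Col2": ["x", "x", "y", "y", "z", "x"],
    "Col3": ["a", "b", "a", "b", "c", "a"],
})

id_cols = ["RecordID", "Para"]

# melt stacks every non-id column, however many there are, into one "value" column.
long = df.melt(id_vars=id_cols, value_name="value")

# Count each value per (RecordID, Para) pair, then reduce counts to 0/1 flags.
res = (
    pd.crosstab([long["RecordID"], long["Para"]], long["value"])
    .gt(0)
    .astype(int)
    .reset_index()
)
print(res)
```

Here the indicators are computed per (RecordID, Para) pair, so Para C only gets the values seen in its own rows. As with the prefix-free get_dummies calls above, identical values appearing in different source columns collapse into a single output column.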
