How to Create New Columns from Rows in Python?

Question

I have a dataset.

date        a     b     c    d
2021-07-01  1234  1111  123  x
2021-07-02  3456  3000  456  x
2021-07-03  5643  5142  501  x
2021-07-01  500   200   300  y
2021-07-03  1256  756   500  y
2021-07-01  1000  780   220  z
2021-07-02  3200  3000  200  z
2021-07-03  2345  2300  45   z
2021-07-04  2000  1700  300  z

There are 3 different companies in the dataset. a: total_Sales, b: net_sales, c: bonus sales, d: company name

I created a dataframe of company x (df_x). I want to add the a, b, c features of other companies (y and z) as new features to the new dataset (df_x). However, I want to add records that match date column of df_x. If there is a date that does not match, assign zero.

date        a_x   b_x   c_x   a_y  b_y  c_y  a_z   b_z   c_z
2021-07-01  1234  1111  123   500  200  300  1000  780   220
2021-07-02  3456  3000  456   0    0    0    3200  3000  200
2021-07-03  5643  5142  501   1256 756  500  2000  1700  300

I tried to this: I created other companies datasets (df_y, df_z)

df_x = df_x.merge(df_y, on='Date', how='left', suffixes=['_x', '_y']).fillna(0)
df_x = df_x.merge(df_z, on='Date', how='left', suffixes=['', '_z']).fillna(0)

But it doesn't work. This code duplicated the columns. I also have nine companies in my main dataset (real work dataset).

How can I solve this problem?

Corralien · Accepted Answer · 2021-11-19 06:26:32Z

1

You can use pivot or pivot_table:

out = df.pivot('date', 'd', ['a', 'b', 'c']) \
        .sort_index(level=1, axis=1).fillna(0).astype(int)
out.columns = out.columns.to_flat_index().str.join('_')

# OR (to merge duplicate entries)

out = df.pivot_table(['a', 'b', 'c'], 'date', 'd', aggfunc='sum') \
        .sort_index(level=1, axis=1).fillna(0).astype(int)
out.columns = out.columns.to_flat_index().str.join('_')

You can replace 'sum' by 'mean', 'min' (the lowest value), 'max' (the highest value), 'first' (first value found), 'last' (last value found) or whatever custom function.

Output:

>>> out.reset_index()
         date   a_x   b_x  c_x   a_y  b_y  c_y   a_z   b_z  c_z
0  2021-07-01  1234  1111  123   500  200  300  1000   780  220
1  2021-07-02  3456  3000  456     0    0    0  3200  3000  200
2  2021-07-03  5643  5142  501  1256  756  500  2345  2300   45
3  2021-07-04     0     0    0     0    0    0  2000  1700  300

edited Nov 19, 2021 at 6:26

answered Nov 18, 2021 at 7:37

Corralien

121k8 gold badges43 silver badges68 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

Corralien Over a year ago

Your column d. df.pivot(index='date', columns='d', values=['a', 'b', 'c'])

Hale Over a year ago

Thank you for your answer. I take "ValueError: Index contains duplicate entries, cannot reshape" this error when I tried to your code. @corralien

Corralien Over a year ago

You have probably multiple rows with same (date, d). This is why Pandas raised this error. For example, if you have 2 rows with same (date, d), you also have two values for 'a', 'b' and 'c'. How do you want to merge them? sum, mean, highest value, lowest value?

jezrael Over a year ago

@Jade - added solution to another answer.

Hale Over a year ago

@Corralien I merge them with "sum" operator.

jezrael · Accepted Answer · 2021-11-18 08:44:41Z

Use DataFrame.set_index with DataFrame.unstack, then sorting second level of MultiIndex and last flatten it:

df = df.set_index(['date','d']).unstack(fill_value=0).sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df)
             a_x   b_x  c_x   a_y  b_y  c_y   a_z   b_z  c_z
date                                                        
2021-07-01  1234  1111  123   500  200  300  1000   780  220
2021-07-02  3456  3000  456     0    0    0  3200  3000  200
2021-07-03  5643  5142  501  1256  756  500  2345  2300   45
2021-07-04     0     0    0     0    0    0  2000  1700  300

Or DataFrame.pivot solution:

df = df.pivot('date','d').fillna(0).sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df)
               a_x     b_x    c_x     a_y    b_y    c_y     a_z     b_z    c_z
date                                                                          
2021-07-01  1234.0  1111.0  123.0   500.0  200.0  300.0  1000.0   780.0  220.0
2021-07-02  3456.0  3000.0  456.0     0.0    0.0    0.0  3200.0  3000.0  200.0
2021-07-03  5643.0  5142.0  501.0  1256.0  756.0  500.0  2345.0  2300.0   45.0
2021-07-04     0.0     0.0    0.0     0.0    0.0    0.0  2000.0  1700.0  300.0

If get:

"ValueError: Index contains duplicate entries, cannot reshape"

there are duplicates and need specify aggregate function, eg. here mean in DataFrame.pivot_table:

print (df)
         date     a     b    c  d
0  2021-07-01  1234  1111  123  x <- duplicates 2021-07-01, x
1  2021-07-01  1234   222  100  x <- duplicates 2021-07-01, x
2  2021-07-02  3456  3000  456  x
3  2021-07-03  5643  5142  501  x
4  2021-07-01   500   200  300  y
5  2021-07-03  1256   756  500  y
6  2021-07-01  1000   780  220  z
7  2021-07-02  3200  3000  200  z
8  2021-07-03  2345  2300   45  z
9  2021-07-04  2000  1700  300  z


df = df.pivot_table(index='date',columns='d', aggfunc='mean', fill_value=0).sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df)
             a_x     b_x    c_x   a_y  b_y  c_y   a_z   b_z  c_z
date                                                            
2021-07-01  1234   666.5  111.5   500  200  300  1000   780  220 <- 666.5, 111.5 is mean
2021-07-02  3456  3000.0  456.0     0    0    0  3200  3000  200
2021-07-03  5643  5142.0  501.0  1256  756  500  2345  2300   45
2021-07-04     0     0.0    0.0     0    0    0  2000  1700  300

Collectives™ on Stack Overflow

How to Create New Columns from Rows in Python?

2 Answers 2

5 Comments

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

5 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related