
I have a dataset.

date        a     b     c    d
2021-07-01  1234  1111  123  x
2021-07-02  3456  3000  456  x
2021-07-03  5643  5142  501  x
2021-07-01  500   200   300  y
2021-07-03  1256  756   500  y
2021-07-01  1000  780   220  z
2021-07-02  3200  3000  200  z
2021-07-03  2345  2300  45   z
2021-07-04  2000  1700  300  z

The dataset contains three companies. The columns are a: total sales, b: net sales, c: bonus sales, d: company name.

I created a dataframe for company x (df_x). I want to add the a, b and c values of the other companies (y and z) to df_x as new columns, but only for records whose date matches a date in df_x. Where a company has no record for a date, the value should be zero.

date        a_x   b_x   c_x   a_y   b_y  c_y  a_z   b_z   c_z
2021-07-01  1234  1111  123   500   200  300  1000  780   220
2021-07-02  3456  3000  456   0     0    0    3200  3000  200
2021-07-03  5643  5142  501   1256  756  500  2345  2300  45

I tried this, after creating datasets for the other companies (df_y, df_z):

df_x = df_x.merge(df_y, on='Date', how='left', suffixes=['_x', '_y']).fillna(0)
df_x = df_x.merge(df_z, on='Date', how='left', suffixes=['', '_z']).fillna(0)

But it doesn't work: the code duplicates the columns. My real dataset also has nine companies.

How can I solve this problem?

2 Answers


You can use pivot or pivot_table:

# pandas 2.0+ requires keyword arguments for DataFrame.pivot
out = df.pivot(index='date', columns='d', values=['a', 'b', 'c']) \
        .sort_index(level=1, axis=1).fillna(0).astype(int)
out.columns = out.columns.to_flat_index().str.join('_')

# OR (to merge duplicate entries)

out = df.pivot_table(['a', 'b', 'c'], 'date', 'd', aggfunc='sum') \
        .sort_index(level=1, axis=1).fillna(0).astype(int)
out.columns = out.columns.to_flat_index().str.join('_')

You can replace 'sum' with 'mean', 'min' (the lowest value), 'max' (the highest value), 'first' (first value found), 'last' (last value found) or any custom function.
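To make the effect of the aggregation choice concrete, here is a minimal sketch. The small frame, the helper name `wide` and the lambda are illustrative assumptions, not part of the original data; the frame contains one duplicated (date, d) pair so that the aggregations differ.

```python
import pandas as pd

# Illustrative data (assumed): company x appears twice on 2021-07-01.
df = pd.DataFrame({
    "date": ["2021-07-01", "2021-07-01", "2021-07-02", "2021-07-01"],
    "a": [1234, 100, 3456, 500],
    "b": [1111, 50, 3000, 200],
    "c": [123, 50, 456, 300],
    "d": ["x", "x", "x", "y"],
})

def wide(frame, aggfunc):
    """Hypothetical helper: pivot to one row per date with <value>_<company> columns."""
    out = (frame.pivot_table(values=["a", "b", "c"], index="date",
                             columns="d", aggfunc=aggfunc)
                .sort_index(level=1, axis=1)   # group columns by company
                .fillna(0)                     # missing (date, company) pairs become 0
                .astype(int))
    out.columns = [f"{value}_{company}" for value, company in out.columns]
    return out

summed = wide(df, "sum")                         # duplicates are added together
largest = wide(df, "max")                        # only the highest value is kept
spread = wide(df, lambda s: s.max() - s.min())   # a custom function also works

print(summed)
```

With this data, a_x for 2021-07-01 is 1334 under 'sum', 1234 under 'max' and 1134 under the custom spread function.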

Output:

>>> out.reset_index()
         date   a_x   b_x  c_x   a_y  b_y  c_y   a_z   b_z  c_z
0  2021-07-01  1234  1111  123   500  200  300  1000   780  220
1  2021-07-02  3456  3000  456     0    0    0  3200  3000  200
2  2021-07-03  5643  5142  501  1256  756  500  2345  2300   45
3  2021-07-04     0     0    0     0    0    0  2000  1700  300
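This output keeps 2021-07-04, a date on which company x has no record, whereas the question asks only for dates present in df_x. One possible way to drop such rows, sketched here under the assumption that company x is identified by d == 'x' (the names `x_dates` and `out_x` are made up for the illustration):

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "date": ["2021-07-01", "2021-07-02", "2021-07-03",
             "2021-07-01", "2021-07-03",
             "2021-07-01", "2021-07-02", "2021-07-03", "2021-07-04"],
    "a": [1234, 3456, 5643, 500, 1256, 1000, 3200, 2345, 2000],
    "b": [1111, 3000, 5142, 200, 756, 780, 3000, 2300, 1700],
    "c": [123, 456, 501, 300, 500, 220, 200, 45, 300],
    "d": ["x", "x", "x", "y", "y", "z", "z", "z", "z"],
})

out = (df.pivot_table(values=["a", "b", "c"], index="date",
                      columns="d", aggfunc="sum")
         .sort_index(level=1, axis=1)
         .fillna(0)
         .astype(int))
out.columns = [f"{value}_{company}" for value, company in out.columns]

# Keep only the dates on which company x has at least one record.
x_dates = df.loc[df["d"] == "x", "date"].unique()
out_x = out[out.index.isin(x_dates)]

print(out_x.reset_index())
```

Because the filter is applied after pivoting, it works unchanged for any number of companies.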

5 Comments

It is your column d: df.pivot(index='date', columns='d', values=['a', 'b', 'c'])
Thank you for your answer. I get the error "ValueError: Index contains duplicate entries, cannot reshape" when I try your code. @corralien
You probably have multiple rows with the same (date, d), which is why pandas raises this error. For example, if you have two rows with the same (date, d), you also have two values each for 'a', 'b' and 'c'. How do you want to merge them: sum, mean, highest value, lowest value?
@Jade - added solution to another answer.
@Corralien I want to merge them with the "sum" operator.

Use DataFrame.set_index with DataFrame.unstack, then sort by the second level of the column MultiIndex and finally flatten it:

df = df.set_index(['date','d']).unstack(fill_value=0).sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df)
             a_x   b_x  c_x   a_y  b_y  c_y   a_z   b_z  c_z
date                                                        
2021-07-01  1234  1111  123   500  200  300  1000   780  220
2021-07-02  3456  3000  456     0    0    0  3200  3000  200
2021-07-03  5643  5142  501  1256  756  500  2345  2300   45
2021-07-04     0     0    0     0    0    0  2000  1700  300

Or a DataFrame.pivot solution (pandas 2.0+ requires keyword arguments here):

df = df.pivot(index='date', columns='d').fillna(0).sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df)
               a_x     b_x    c_x     a_y    b_y    c_y     a_z     b_z    c_z
date                                                                          
2021-07-01  1234.0  1111.0  123.0   500.0  200.0  300.0  1000.0   780.0  220.0
2021-07-02  3456.0  3000.0  456.0     0.0    0.0    0.0  3200.0  3000.0  200.0
2021-07-03  5643.0  5142.0  501.0  1256.0  756.0  500.0  2345.0  2300.0   45.0
2021-07-04     0.0     0.0    0.0     0.0    0.0    0.0  2000.0  1700.0  300.0

If you get:

"ValueError: Index contains duplicate entries, cannot reshape"

then the data contains duplicate (date, d) pairs and you need to specify an aggregate function, e.g. mean here, in DataFrame.pivot_table:

print (df)
         date     a     b    c  d
0  2021-07-01  1234  1111  123  x <- duplicates 2021-07-01, x
1  2021-07-01  1234   222  100  x <- duplicates 2021-07-01, x
2  2021-07-02  3456  3000  456  x
3  2021-07-03  5643  5142  501  x
4  2021-07-01   500   200  300  y
5  2021-07-03  1256   756  500  y
6  2021-07-01  1000   780  220  z
7  2021-07-02  3200  3000  200  z
8  2021-07-03  2345  2300   45  z
9  2021-07-04  2000  1700  300  z


df = df.pivot_table(index='date',columns='d', aggfunc='mean', fill_value=0).sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df)
             a_x     b_x    c_x   a_y  b_y  c_y   a_z   b_z  c_z
date                                                            
2021-07-01  1234   666.5  111.5   500  200  300  1000   780  220 <- 666.5, 111.5 is mean
2021-07-02  3456  3000.0  456.0     0    0    0  3200  3000  200
2021-07-03  5643  5142.0  501.0  1256  756  500  2345  2300   45
2021-07-04     0     0.0    0.0     0    0    0  2000  1700  300
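Before choosing an aggregate function it can help to look at the offending rows. The following is a minimal sketch, assuming duplicates are defined by the (date, d) pair; the variable names and the shortened data are illustrative. It lists the duplicated rows and then collapses them with a sum, the aggregation the asker mentions in the comments, after which a plain pivot no longer raises.

```python
import pandas as pd

# Illustrative data (assumed): (2021-07-01, x) appears twice.
df = pd.DataFrame({
    "date": ["2021-07-01", "2021-07-01", "2021-07-02", "2021-07-01"],
    "a": [1234, 1234, 3456, 500],
    "b": [1111, 222, 3000, 200],
    "c": [123, 100, 456, 300],
    "d": ["x", "x", "x", "y"],
})

# keep=False marks every member of a duplicated group, not only the repeats.
dupes = df[df.duplicated(subset=["date", "d"], keep=False)]
print(dupes)

# Collapse duplicates explicitly, so each (date, d) pair occurs exactly once.
collapsed = df.groupby(["date", "d"], as_index=False)[["a", "b", "c"]].sum()

# With unique pairs, DataFrame.pivot works without an aggregate function.
wide = collapsed.pivot(index="date", columns="d").fillna(0)
print(wide)
```

Inspecting `dupes` first shows whether the repeated rows are genuine separate records, where a sum or mean makes sense, or accidental copies that should be dropped instead.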

1 Comment

@Jade - it is column d; my code pivots with 'date' as the index and 'd' as the columns.
