1

I have csv like this,

Market_id   Metric  New_Mar_19  New_Feb_19  tot_Mar_19  tot_Feb_19  
7528            1       9           7           6.5         8           
7528            2       6           3.2         2.2         3           

I want to convert the above csv to like this below,

  Month         Nrx_count       Nrx_quantity    Trx_count       Trx_quantity

    Mar_19      9                   6               6.5             2.2
    Feb_19      7                   3.2             8               3

I have tried using pivot and pd.melt but I am not able to like above.

7
  • I am thinking about Nrx and Trx, what is logic for assign to output columns? Commented May 21, 2019 at 13:49
  • @jezrael, Basically my metric, market_id and other column will be having repeating values, its throwing error 'Index contains duplicate entries, cannot reshape' Could you help ? Commented May 21, 2019 at 14:46
  • @jezrael, Is it possible to do this whole transformation is SQL ? Commented May 22, 2019 at 1:06
  • I think is it possible, but it is something similar like pivot, what is really complicated in SQL. Commented May 22, 2019 at 5:00
  • @jezrael can u help on that ? Commented May 22, 2019 at 5:04

1 Answer 1

3

Idea is convert columns without _ to MultiIndex, then split column by _ and reshape by DataFrame.stack, then by DataFrame.unstack:

df1 = df.set_index(['Market_id','Metric'])
df1.columns = df1.columns.str.split('_', n=1, expand=True)
df1 = df1.stack()
print (df1)
                         New  tot
Market_id Metric                 
7528      1      Feb_19  7.0  8.0
                 Mar_19  9.0  6.5
          2      Feb_19  3.2  3.0
                 Mar_19  6.0  2.2

#second level, here Metric is converted to columns
df1 = df1.unstack(1)
df1.columns = [f'{a}_{b}' for a, b in df1.columns]
df1 = df1.reset_index().rename(columns={'level_1':'y'})
print (df1)
   Market_id       y  New_1  New_2  tot_1  tot_2
0       7528  Feb_19    7.0    3.2    8.0    3.0
1       7528  Mar_19    9.0    6.0    6.5    2.2

#first level, here Market_id is converted to columns
df1 = df1.unstack(0)
df1.columns = [f'{a}_{b}' for a, b in df1.columns]
df1 = df1.reset_index().rename(columns={'level_1':'y'})
print (df1)
   Metric       y  New_7528  tot_7528
0       1  Feb_19       7.0       8.0
1       1  Mar_19       9.0       6.5
2       2  Feb_19       3.2       3.0
3       2  Mar_19       6.0       2.2

EDIT:

If get error:

ValueError: Index contains duplicate entries, cannot reshape

it means there are duplicates. Solution is same, only is necessary add aggregation - e.g. by mean, sum...

#change data for duplicates in column for Multiindex - Market_id, Metric
print (df)
   Market_id  Metric  New_Mar_19  New_Feb_19  tot_Mar_19  tot_Feb_19
0       7528       1           2         5.0         5.0           8
1       7528       1          10        20.0         2.5          16
2       7528       2           6         3.2         2.2           3  

df1 = df.set_index(['Market_id','Metric'])
df1.columns = df1.columns.str.split('_', n=1, expand=True)
df1 = df1.stack()
print (df1)
                          New   tot
Market_id Metric                   
7528      1      Feb_19   5.0   8.0
                 Mar_19   2.0   5.0
                 Feb_19  20.0  16.0
                 Mar_19  10.0   2.5
          2      Feb_19   3.2   3.0
                 Mar_19   6.0   2.2

#number of levels passes to groupby - here 3 levels -> 0,1,2
print (df1.index.nlevels)
3

#grouping by all levels and aggregate mean, sum,max...
df1 = df1.groupby(level=[0,1,2]).mean().unstack(1)
print (df1)
                   New         tot     
Metric               1    2      1    2
Market_id                              
7528      Feb_19  12.5  3.2  12.00  3.0
          Mar_19   6.0  6.0   3.75  2.2

df1.columns = [f'{a}_{b}' for a, b in df1.columns]
df1 = df1.reset_index().rename(columns={'level_1':'y'})
print (df1)
   Market_id       y  New_1  New_2  tot_1  tot_2
0       7528  Feb_19   12.5    3.2  12.00    3.0
1       7528  Mar_19    6.0    6.0   3.75    2.2
Sign up to request clarification or add additional context in comments.

7 Comments

sorry I want the market_id column also in output Could you also help me with that ?
Will this kind of approach works if i have few other columns like market_id ?
@venkat - Change df1 = df.set_index(['Market_id','Metric']) to df1 = df.set_index(['Market_id','Metric', 'col without _', 'another col without_'])
Basically my metric, market_id and other column will be having repeating values, its throwing error Could you help ?
@venkat: it looks like you have a new question. If you have done sufficient research on the problem, then ask a new question. You can invite someone to take a look by providing the link to them, but tread carefully - if someone had the time to help you on this occasion, it does not mean they will always have the time to help you.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.