python3: pandas group by several columns and convert rows value into multiple columns

Question

I have a data frame like the following:

id   date          t_s     t_p    t_prob
1    '2020-01-01'   1       1      0.5
1    '2020-01-01'   2       1      0.55
1    '2020-01-01'   3       1      0.56
1    '2020-01-01'   4       0      0.4
1    '2020-01-01'   5       1      0.6
1    '2020-01-01'   6       1      0.7
2    '2020-01-01'   1       1      0.77
2    '2020-01-01'   2       0      0.3
2    '2020-01-01'   3       0      0.2 
2    '2020-01-01'   4       0      0.33
2    '2020-01-01'   5       1      0.66
2    '2020-01-01'   6       1      0.56
....

Each id covers the same dates (for example 2020-01-01 to 2020-01-09). For each date, each id has six t_s values (1, 2, 3, 4, 5, 6); t_p is the label for each t_s, and t_prob is the value of the label for each t_s. I want to transform the t_prob values for each t_s on the same date into columns named t_s_1, t_s_2, t_s_3, t_s_4, t_s_5, t_s_6, and finally get the t_s with the highest t_prob. For example, for id 1 on '2020-01-01', t_s_6 has the highest value. The expected output:

id   date          t_s_1   t_s_2   t_s_3   t_s_4   t_s_5   t_s_6   t_prob_max_s
1    '2020-01-01'   0.5     0.55    0.56    0.4     0.6     0.7     6
2    '2020-01-01'   0.77    0.3     0.2     0.33    0.66    0.56    1
....

Thanks!

Maybe groupby. I've done this before, but I can't do it now.
– Johnny, Commented May 18, 2021 at 7:39
Are the t_s values for each date per unique id present in sequential order, i.e. from 1 to 6?
– Shubham Sharma, Commented May 18, 2021 at 7:45

jo9k · Accepted Answer · 2021-05-18 08:06:19Z

First, group by the index columns together with the column that is meant to be unstacked, then unstack it. You can choose an aggregation other than "max", depending on the context. If each (id, date, t_s) combination occurs only once, the choice doesn't matter.

unstacked = df.groupby(['id', 'date', 't_s'])['t_prob'].aggregate('max').unstack()

Or alternatively:

df.pivot_table(index=['id', 'date'], columns='t_s', values='t_prob', aggfunc='max')

This is less flexible but perhaps slightly clearer in this context.
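As a quick check of the two calls above, here is a minimal sketch that builds the sample frame from the question and compares both results. The variable names via_groupby and via_pivot are only illustrative, and the frame is limited to the twelve rows shown in the question.

```python
import pandas as pd

# Sample data copied from the question: two ids, one date, six t_s values each.
df = pd.DataFrame({
    "id": [1] * 6 + [2] * 6,
    "date": ["2020-01-01"] * 12,
    "t_s": list(range(1, 7)) * 2,
    "t_p": [1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1],
    "t_prob": [0.5, 0.55, 0.56, 0.4, 0.6, 0.7, 0.77, 0.3, 0.2, 0.33, 0.66, 0.56],
})

# Variant 1: group by the index columns plus t_s, aggregate, then unstack t_s.
via_groupby = df.groupby(["id", "date", "t_s"])["t_prob"].max().unstack()

# Variant 2: let pivot_table do the grouping and reshaping in one call.
via_pivot = df.pivot_table(index=["id", "date"], columns="t_s",
                           values="t_prob", aggfunc="max")

# Compare the reshaped values cell by cell.
matches = (via_groupby.to_numpy() == via_pivot.to_numpy()).all()
print(via_groupby)
print("same values:", matches)
```

On this data both variants should give one row per (id, date) and one column per t_s, so which one to use is mostly a matter of readability.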

Rename the columns axis so that it no longer carries the leftover "t_s" name. Then rename the columns so that they enumerate t_s:

unstacked_renamed = unstacked.rename_axis(columns = None).rename(columns={val:f't_s_{val}' for val in unstacked.columns.values})

Get the label of the column with the highest value in each row, then split it to extract the t_s number for that column:

unstacked_renamed['t_prob_max_s'] = unstacked_renamed.idxmax(axis=1).str.split('_').str[-1]

Reset the index so it is flat again:

unstacked_reindexed = unstacked_renamed.reset_index()

Inspect for correctness:

>>> unstacked_reindexed
    id          date    t_s_1   t_s_2   t_s_3   t_s_4   t_s_5   t_s_6   t_prob_max_s
0   1   '2020-01-01'    0.50    0.55    0.56    0.40    0.60    0.70    6
1   2   '2020-01-01'    0.77    0.30    0.20    0.33    0.66    0.56    1
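
The steps above can be put together into one function. This is only a sketch under a few assumptions: the name widen_t_prob is hypothetical, it uses pivot_table instead of groupby/unstack, and it calls idxmax before the columns are renamed so that t_prob_max_s comes out as an integer. With the .str.split('_').str[-1] approach the values are strings such as '6', which may or may not matter for later processing.

```python
import pandas as pd

def widen_t_prob(df):
    """Pivot t_prob into t_s_<n> columns and add the t_s with the highest t_prob.

    Hypothetical helper; column names follow the question's data frame.
    """
    wide = df.pivot_table(index=["id", "date"], columns="t_s",
                          values="t_prob", aggfunc="max")
    wide = wide.rename_axis(columns=None)    # drop the leftover "t_s" axis name
    best = wide.idxmax(axis=1)               # original t_s label of the row maximum
    wide = wide.add_prefix("t_s_")           # 1 -> "t_s_1", 2 -> "t_s_2", ...
    wide["t_prob_max_s"] = best.astype(int)  # keep the label numeric
    return wide.reset_index()

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1] * 6 + [2] * 6,
    "date": ["2020-01-01"] * 12,
    "t_s": list(range(1, 7)) * 2,
    "t_p": [1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1],
    "t_prob": [0.5, 0.55, 0.56, 0.4, 0.6, 0.7, 0.77, 0.3, 0.2, 0.33, 0.66, 0.56],
})

result = widen_t_prob(df)
print(result)
```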

This approach works even if the initial data is not sorted by the index columns, if a given t_s value occurs multiple times (though then the choice of aggregation matters), or if some t_s values are missing or skipped (e.g. t_s values 1, 2, 3, 4, 5, 7). In general it is a pretty robust solution.
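
To illustrate those three cases, here is a small sketch on made-up data (the frame messy is hypothetical, not from the question): the rows are out of order, t_s 6 is skipped in favour of 7, and t_s 2 appears twice with different probabilities.

```python
import pandas as pd

# Hypothetical messy input: unsorted rows, t_s 6 missing (7 present), t_s 2 duplicated.
messy = pd.DataFrame({
    "id": [1] * 7,
    "date": ["2020-01-01"] * 7,
    "t_s": [7, 3, 1, 2, 5, 4, 2],
    "t_prob": [0.7, 0.56, 0.5, 0.55, 0.6, 0.4, 0.9],
})

keys = ["id", "date", "t_s"]
by_max = messy.groupby(keys)["t_prob"].max().unstack()
by_min = messy.groupby(keys)["t_prob"].min().unstack()

# groupby sorts the keys, so the columns come out ordered and 6 is simply absent.
print(list(by_max.columns))            # [1, 2, 3, 4, 5, 7]

# The duplicated t_s 2 makes the aggregation choice change the winner.
print(by_max.idxmax(axis=1).iloc[0])   # 2 (0.9 is kept for t_s 2)
print(by_min.idxmax(axis=1).iloc[0])   # 7 (0.55 is kept for t_s 2, so 0.7 wins)
```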

edited May 18, 2021 at 8:06

answered May 18, 2021 at 7:48

jo9k


5 Comments

Johnny Over a year ago

Perhaps unstack can do it all in one go?

jo9k Over a year ago

What do you mean by "all" in this context? It just pivots one level from one axis to another. Maybe pd.pivot_table() would be more efficient; I will investigate.

tktktk0711 Over a year ago

Thanks for your answer. I get an error in my code: TypeError: rename_axis() got an unexpected keyword argument 'columns'

jo9k Over a year ago

That is surprising, because pd.DataFrame.rename_axis() takes the keyword argument "columns". pandas.pydata.org/docs/reference/api/… I suggest checking whether the unstacked DataFrame looks as expected. The code may behave differently if the initial DataFrame differs significantly from the one in the original post.

jo9k Over a year ago

I've checked. You have an old pandas version. This method was changed in pandas 0.24 to work the way my code uses it. What's your pandas version? For older versions of pandas, use the syntax unstacked.rename_axis(None, axis="columns"). Docs link: pandas.pydata.org/pandas-docs/version/0.19.2/generated/…
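
If the code has to run on both old and new pandas, one option is to try the keyword form first and fall back to the positional form. This is a rough sketch: clear_columns_name is a hypothetical helper, and the fallback branch is based on the older rename_axis(mapper, axis=...) signature discussed in this comment thread and is not exercised on a current pandas install.

```python
import pandas as pd

def clear_columns_name(frame):
    """Return a copy of frame whose columns axis has no name (hypothetical helper)."""
    try:
        # Keyword form, available since pandas 0.24 according to the comment above.
        return frame.rename_axis(columns=None)
    except TypeError:
        # Older pandas: pass the new axis name positionally and select the axis.
        return frame.rename_axis(None, axis="columns")

# Tiny frame whose columns axis is named "t_s", as it is after unstack().
unstacked = pd.DataFrame([[0.5, 0.55]], columns=pd.Index([1, 2], name="t_s"))

cleaned = clear_columns_name(unstacked)
print(cleaned.columns.name)
```

Setting the attribute directly on a copy (out = frame.copy(); out.columns.name = None) is another way to avoid depending on the rename_axis signature.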
