The following dataframe has multiple columns whose names follow the format item:district:

   date  price:dc  price:xc  price:cy  ratio:dc  ratio:xc  ratio:cy
0  2017        12        11        14       0.1       0.1       0.3
1  2018        14        12        15       0.2       0.7       0.6
2  2019        13        13        16       0.5      -0.2       0.8

Is it possible to convert it to a new dataframe as follows? Thanks.

   date district  price  ratio
0  2017       dc     12    0.1
1  2018       dc     14    0.2
2  2019       dc     13    0.5
3  2017       xc     11    0.1
4  2018       xc     12    0.7
5  2019       xc     13   -0.2
6  2017       cy     14    0.3
7  2018       cy     15    0.6
8  2019       cy     16    0.8

1 Answer

First move the columns without : into the index with DataFrame.set_index, then split the remaining column names on : with str.split(expand=True) to create a MultiIndex, and finally reshape with DataFrame.stack:

df = df.set_index('date')
df.columns = df.columns.str.split(':', expand=True)
df = df.stack().rename_axis(('date','district')).reset_index()
print (df)
   date district  price  ratio
0  2017       cy     14    0.3
1  2017       dc     12    0.1
2  2017       xc     11    0.1
3  2018       cy     15    0.6
4  2018       dc     14    0.2
5  2018       xc     12    0.7
6  2019       cy     16    0.8
7  2019       dc     13    0.5
8  2019       xc     13   -0.2
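
For reference, here is a self-contained sketch of the same steps, with the question's sample data rebuilt by hand. The names wide and tidy are illustrative only. The row order returned by stack may differ between pandas versions, so no particular order is assumed here:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "date": [2017, 2018, 2019],
    "price:dc": [12, 14, 13],
    "price:xc": [11, 12, 13],
    "price:cy": [14, 15, 16],
    "ratio:dc": [0.1, 0.2, 0.5],
    "ratio:xc": [0.1, 0.7, -0.2],
    "ratio:cy": [0.3, 0.6, 0.8],
})

# Step 1: keep the column without ":" out of the way by making it the index.
wide = df.set_index("date")

# Step 2: "price:dc" becomes the tuple ("price", "dc"), giving a
# two-level MultiIndex on the columns (item, district).
wide.columns = wide.columns.str.split(":", expand=True)
print(wide.columns.tolist())

# Step 3: move the last column level (district) into the row index,
# then name both index levels and turn them back into columns.
tidy = wide.stack().rename_axis(["date", "district"]).reset_index()
print(tidy)
```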

If the ordering is important, one solution is to create ordered categoricals:

df = df.set_index('date')
df.columns = df.columns.str.split(':', expand=True)

lvl = pd.CategoricalIndex(df.columns.levels[1], 
                          ordered=True, 
                          categories=df.columns.get_level_values(1).drop_duplicates())
df.columns = df.columns.set_levels(lvl, level=1)

df = df.stack().sort_index(level=[1,0]).rename_axis(('date','district')).reset_index()
print (df)
   date district  price  ratio
0  2017       dc     12    0.1
1  2018       dc     14    0.2
2  2019       dc     13    0.5
3  2017       xc     11    0.1
4  2018       xc     12    0.7
5  2019       xc     13   -0.2
6  2017       cy     14    0.3
7  2018       cy     15    0.6
8  2019       cy     16    0.8
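
The ordering can also be applied after reshaping instead of on the column levels. This is a minimal sketch, not part of the answer above, and it assumes the desired district order is the order in which districts first appear in the column headers. The names order, wide and tidy are illustrative:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "date": [2017, 2018, 2019],
    "price:dc": [12, 14, 13],
    "price:xc": [11, 12, 13],
    "price:cy": [14, 15, 16],
    "ratio:dc": [0.1, 0.2, 0.5],
    "ratio:xc": [0.1, 0.7, -0.2],
    "ratio:cy": [0.3, 0.6, 0.8],
})

# Districts in order of first appearance in the headers: dc, xc, cy.
order = list(dict.fromkeys(c.split(":")[1] for c in df.columns if ":" in c))

wide = df.set_index("date")
wide.columns = wide.columns.str.split(":", expand=True)
tidy = wide.stack().rename_axis(["date", "district"]).reset_index()

# Make district an ordered categorical so sorting follows `order`
# rather than alphabetical order, then sort by district and date.
tidy["district"] = pd.Categorical(tidy["district"], categories=order, ordered=True)
tidy = tidy.sort_values(["district", "date"]).reset_index(drop=True)
print(tidy)
```

After this step the district column has a categorical dtype. If plain strings are needed afterwards, tidy["district"].astype(str) converts it back.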

7 Comments

Many thanks. If the headers are split by multiple :s, such as a:price:xc:b, a:price:cy:b:c, a:price:xc:b:c:d, and I need the values after the first and second : (in this case price:xc, price:cy, price:dc) as df.columns, how can I do that?
@ahbon - Is there always the same value a before the first : for the price column?
@ahbon - Can you test df.columns = df.columns.str.replace(r'a:price', 'price').str.split(':', n=2, expand=True).droplevel(-1)?
I used df.columns = df.columns.str.replace("a:", ""), then applied df.columns = df.columns.str.split(':', n=2, expand=True).droplevel(-1), and it works perfectly. Many thanks. :) You're a genius.
@ahbon - Because it also creates a 3rd level from the :b:c:d values, so droplevel(-1) removes it.
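
Putting the comments above together, here is a sketch with a hypothetical frame. The headers and values are made up to match the shapes described, and the prefix a is assumed to be the same on every non-date column. The replacement is anchored with ^ so that only the leading a: is removed, a small variation on the str.replace("a:", "") call used above:

```python
import pandas as pd

# Hypothetical data: headers shaped like a:item:district:extra[:extra...].
df = pd.DataFrame({
    "date": [2017, 2018],
    "a:price:dc:b": [12, 14],
    "a:price:xc:b:c": [11, 12],
    "a:ratio:dc:b": [0.1, 0.2],
    "a:ratio:xc:b:c:d": [0.1, 0.7],
})

wide = df.set_index("date")

# Drop the leading "a:" only (assumed common prefix).
cols = wide.columns.str.replace(r"^a:", "", regex=True)

# n=2 splits on the first two ":" only, so everything after the district
# ends up in a single third level, which droplevel(-1) then discards.
wide.columns = cols.str.split(":", n=2, expand=True).droplevel(-1)

tidy = wide.stack().rename_axis(["date", "district"]).reset_index()
print(tidy)
```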