1

I am trying to stack financial values year wise in pandas Dataframe. But not able to get started.

All I have tried is

df1 = df.set_index(['refnum','y1gp','y2gp','y3gp']).stack()\
.reset_index(name='REV').rename(columns={'level_5':'Year'})

Existing:

refnum y1 y1rev y1gp y2 y2rev y2gp y3 y3rev y3gp
10001 2021 300 200 2022 100 600 2023 300 300
10002 2020 300 200 2021 200 500 2022 300 300
10003 2021 300 200 2022 500 500 2023 300 300

Expected:

refnum year REV GP Base Year
10001 2021 300 200 BaseYear
10001 2022 100 600 BaseYear+1
10001 2023 300 300 BaseYear+2
10002 2020 300 200 BaseYear
10002 2021 200 500 BaseYear+1
10002 2022 300 300 BaseYear+2
10003 2021 300 200 BaseYear
10003 2022 500 500 BaseYear+1
10003 2023 300 300 BaseYear+2
4
  • Where does the Base Year column come from? Commented Jul 15, 2021 at 22:04
  • Have to create one on the fly to keep track of calculations Commented Jul 15, 2021 at 22:06
  • But is it just the literal words "BaseYear", "BaseYear+1" etc? not some calculation? Commented Jul 15, 2021 at 22:07
  • int values 0,1,2,3 will work as well 0 being the base year Commented Jul 15, 2021 at 22:09

3 Answers 3

4

Try:

df.columns = [re.sub(r"y(\d+)(.*)", r"\2-\1", c) for c in df.columns]
x = (
    pd.wide_to_long(
        df, stubnames=["", "gp", "rev"], sep="-", i="refnum", j="Base Year"
    )
    .rename(columns={"": "year"})
    .reset_index()
    .sort_values(by="refnum")
)
print(x)

Prints:

   refnum  Base Year  year   gp  rev
0   10001          1  2021  200  300
3   10001          2  2022  600  100
6   10001          3  2023  300  300
1   10002          1  2020  200  300
4   10002          2  2021  500  200
7   10002          3  2022  300  300
2   10003          1  2021  200  300
5   10003          2  2022  500  500
8   10003          3  2023  300  300
Sign up to request clarification or add additional context in comments.

1 Comment

Interesting, I didn't know you could use empty string as a stubname. Way cool! I was going to use wide_to_long but I wanted to avoid the sort step.
3

Let's convert the headers to a useable MultiIndex using str.replace and str.split then stack to go from wide-form to long. Then groupby cumcount to create the BaseYear column.

# Save Columns
df = df.set_index('refnum')
# Create a MultiIndex with Numbers at the end and split into multiple levels
df.columns = (
    df.columns.str.replace(r'^(.*?)(\d+)(.*)$', r'\1\3/\2', regex=True)
        .str.split('/', expand=True)
)
# Wide Format to Long + Rename Columns
df = df.stack().droplevel(-1).reset_index().rename(
    columns={'y': 'Year', 'ygp': 'GP', 'yrev': 'REV'}
)
# Add Base Year Column
df['BaseYear'] = "BaseYear+" + df.groupby('refnum').cumcount().astype(str)
# df['BaseYear'] = df.groupby('refnum').cumcount()  # (int version)

df:

   refnum  Year   GP  REV    BaseYear
0   10001  2021  200  300  BaseYear+0
1   10001  2022  600  100  BaseYear+1
2   10001  2023  300  300  BaseYear+2
3   10002  2020  200  300  BaseYear+0
4   10002  2021  500  200  BaseYear+1
5   10002  2022  300  300  BaseYear+2
6   10003  2021  200  300  BaseYear+0
7   10003  2022  500  500  BaseYear+1
8   10003  2023  300  300  BaseYear+2

Comments

2

You could use pivot_longer from pyjanitor; for this scenario, you pass regexes to names_pattern, and the new column names in names_to:

# pip install pyjanitor
import janitor
import pandas as pd
df.pivot_longer(index='refnum', 
                names_to=['year', 'REV', 'GP'], 
                names_pattern=['^y\d$', '.*rev$', '.*gp$']
               )

   refnum  year  REV   GP
0   10001  2021  300  200
1   10002  2020  300  200
2   10003  2021  300  200
3   10001  2022  100  600
4   10002  2021  200  500
5   10003  2022  500  500
6   10001  2023  300  300
7   10002  2022  300  300
8   10003  2023  300  300

If you want the base year included, you can modify the column labels that end with a number, before using pivot_longer:

(df.rename(columns = lambda col: f"{col}YEAR" 
                                 if col.endswith(('1','2','3')) 
                                 else col)
   .pivot_longer(index='refnum', 
                 names_to= ("Base Year", ".value"), 
                 names_pattern=r".(\d)(.+)", 
                 sort_by_appearance=True)
 )

   refnum Base Year  YEAR  rev   gp
0   10001         1  2021  300  200
1   10001         2  2022  100  600
2   10001         3  2023  300  300
3   10002         1  2020  300  200
4   10002         2  2021  200  500
5   10002         3  2022  300  300
6   10003         1  2021  300  200
7   10003         2  2022  500  500
8   10003         3  2023  300  300

labels associated with .value stay as column headers, while the rest are lumped into a new column (base year)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.