0

I've got an issue with the following, which is a simple code that calculates some means on a sports dataset. If needed I can provide an example of what these data look like.

Right now attempting to append the generated dataframes back-to-back, which I am not able to do. Basically I would like to take all the datasets (there are two here, but more to be used), and to generate a timeline based on the outputs. All the files have the same column and indices. Below is the code I'm using:

import pandas as pd

season_list = ["2017-2018","2018-2019"]
excl_value = 10
header_list = []

for seas in season_list:
    excel_sheet = pd.read_excel(r'Data\NData' + str(seas) + '.xlsx')
    excel_sheet = excel_sheet.fillna(value = 0)
    headers = excel_sheet.columns.values.tolist()
    if headers not in header_list:
        header_list.append(headers)
    for header in header_list:
        data_mean = excel_sheet[header][excel_sheet.games >=                         
excl_value].mean()
        data_mean = round(data_mean,2)
        print(data_mean)

With this I simply get two datasets, which look akin to the below:

age                   25.42
games                 22.42
games_starts          18.19

age                   24.95
games                 27.18
games_starts          22.00

These are the results of two different calculations of the mean, for a few variables. Now ideally I'd like to have these broken down by file, and by the same index concurrently. I'd see it working as below:

                    2017-2018  2018-2019
age                   25.42      24.95
games                 22.42      27.18
games_starts          18.19      22.00

Like this I can then create a timeline of the values together. However struggling to achieve this. How should I go about combining the calculations run, into one set all together? Thanks!

I've already tested a few methods of achieving this. One was just simply appending the excel sheets with .append, but I am seeing some issues with the values not coming together. I also thought about just grabbing the mean calculations and applying them to their own dictionary/list/dataframe, but I haven't managed to actually push them into their own set.

1 Answer 1

1

You can do it nicely by 3 steps. Load:

sports_data = [pd.read_excel(r'Data\NData' + str(season) + '.xlsx') for season in season_list]

Now you'd like to add a column to identify each sheet such that you can group on it in the next step:

for i, season in enumerate(season_list):
    sports_data[i]['season'] = season

Combine to a single DataFrame:

sports_data = pd.concat(sports_data)

Now we are ready to group by season and take the mean:

sports_data.groupby('season').mean()

This will give you

                  age       games   games_starts
season          
2017-2018   26.666667   33.000000   22.333333
2018-2019   25.666667   25.666667   19.000000

What is left now is only to transpose it to get what you want:

sports_data.groupby('season').mean().T

enter image description here

Sign up to request clarification or add additional context in comments.

3 Comments

Wow, I was going through this with far more complexity than necessary! Thanks Alex, that was perfect!
Sorry. I missed that.
sports_data[sports_data.gmaes >= excl_value].groupby('season').mean().T

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.