Generate a pandas dataframe with for-loop

Question

I have generated a dataframe (called 'sectors') that stores information from my brokerage account (sector/industry, sub sector, company name, current value, cost basis, etc).

I want to avoid hard coding a filter for each sector or sub sector to find specific data. I have achieved this with the following code (I know, not very pythonic, but I am new to coding):

for x in set(sectors_df['Sector']):
    x_filt = sectors_df['Sector'] == x
    # value_in_sect is the sum of all current values in a given sector
    value_in_sect = round(sectors_df.loc[x_filt]['Current Value'].sum(), 2)
    # pct_in_sect is the sector's share of the overall portfolio (total is the total value of all sectors)
    pct_in_sect = round((value_in_sect/total)*100, 2)
    print(x, value_in_sect, pct_in_sect)

for sub in set(sectors_df['Sub Sector']):
    sub_filt = sectors_df['Sub Sector'] == sub
    value_of_subs = round(sectors_df.loc[sub_filt]['Current Value'].sum(), 2)
    pct_of_subs = round((value_of_subs/total)*100, 2)
    print(sub, value_of_subs, pct_of_subs)

My print statements produce the majority of the information I want, although I am still working through how to program for the % of a sub sector within its own sector. Anyways, I would now like to put this information (value_in_sect, pct_in_sect, etc) into dataframes of their own. What would be the best way or the smartest way or the most pythonic way to go about this? I am thinking a dictionary, and then creating a dataframe from the dictionary, but not sure.

pandas.pivot_table is your friend. You can group by the Sectors and Sub Sectors to get percentages and totals.
– Paul Brennan, Commented Feb 23, 2021 at 1:57

Reinier · Accepted Answer · 2021-02-23 19:14:30Z

The split-apply-combine process in pandas, specifically aggregation, is the best way to go about this. First I'll explain how this process would work manually, and then I'll show how pandas can do it in one line.

Manual split-apply-combine

Split

First, divide the DataFrame into groups that share the same Sector. This means getting a list of Sectors and working out which rows belong to each one (just like the first two lines of your code). The code below runs through the DataFrame and builds a dictionary whose keys are Sectors and whose values are lists of the index labels of the rows in sectors_df that belong to that Sector.

sectors_index = {}
for ix, row in sectors_df.iterrows():
    if row['Sector'] not in sectors_index:
        sectors_index[row['Sector']] = [ix]
    else:
        sectors_index[row['Sector']].append(ix)

Apply

Run the same function, in this case summing of Current Value and calculating its percentage share, on each group. That is, for each sector, grab the corresponding rows from the DataFrame and run the calculations in the next lines of your code. I'll store the results as a dictionary of dictionaries: {'Sector1': {'value_in_sect': 1234.56, 'pct_in_sect': 11.11}, 'Sector2': ... } for reasons that will become obvious later:

sector_total_value = {}
total_value = sectors_df['Current Value'].sum()
for sector, row_indices in sectors_index.items():
    sector_df = sectors_df.loc[row_indices]
    current_value = sector_df['Current Value'].sum()
    sector_total_value[sector] = {'value_in_sect': round(current_value, 2),
                                  'pct_in_sect': round(current_value/total_value * 100, 2)
                                 }

(see footnote 1 for a note on rounding)

Combine

Finally, collect the function results into a new DataFrame, where the index is the Sector. pandas can easily convert this nested dictionary structure into a DataFrame:

sector_total_value_df = pd.DataFrame.from_dict(sector_total_value, orient='index')
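To see why the nested-dictionary shape is convenient, here is a minimal sketch with made-up sector names and numbers (they are placeholders, not data from your account). The outer keys become the index and the inner keys become the columns:

```python
import pandas as pd

# Hypothetical results, in the same shape as sector_total_value above
sector_total_value = {
    'Energy': {'value_in_sect': 300.0, 'pct_in_sect': 30.0},
    'Utilities': {'value_in_sect': 700.0, 'pct_in_sect': 70.0},
}

# orient='index' turns the outer keys into row labels
# and the inner keys into column names
sector_total_value_df = pd.DataFrame.from_dict(sector_total_value, orient='index')
print(sector_total_value_df)
```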

split-apply-combine using `groupby`

pandas makes this process very simple using the groupby method.

Split

The groupby method splits a DataFrame into groups by a column or multiple columns (or even another Series):

grouped_by_sector = sectors_df.groupby('Sector')

grouped_by_sector is similar to the index we built earlier, but the groups can be manipulated much more easily, as we can see in the following steps.
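If you want to check that claim, the grouped object can be inspected directly. This is a small sketch on a made-up stand-in for sectors_df (the rows and values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical stand-in for sectors_df
sectors_df = pd.DataFrame({
    'Sector': ['Energy', 'Utilities', 'Energy', 'Utilities'],
    'Sub Sector': ['Gas', 'Gas', 'Oil', 'Electric'],
    'Current Value': [100.0, 250.0, 200.0, 450.0],
})

grouped_by_sector = sectors_df.groupby('Sector')

# .groups maps each Sector to the index labels of its rows,
# much like the hand-built sectors_index dictionary
for sector, row_labels in grouped_by_sector.groups.items():
    print(sector, list(row_labels))

# get_group returns the rows of a single group as a DataFrame
energy_rows = grouped_by_sector.get_group('Energy')
print(energy_rows)
```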

Apply

To calculate the total value in each group, select the column or columns to sum, then use the agg (or aggregate) method with the function you want to apply:

sector_total_value_df = grouped_by_sector['Current Value'].agg(value_in_sect='sum')

Combine

It's already done! The apply step already creates a DataFrame where the index is the Sector (the groupby column) and the value in the value_in_sect column is the result of the sum operation.

I've left out the pct_in_sect part because a) it can be more easily done after the fact:

sector_total_value_df['pct_in_sect'] = round(sector_total_value_df['value_in_sect'] / total_value * 100, 2)
sector_total_value_df['value_in_sect'] = round(sector_total_value_df['value_in_sect'], 2)

and b) it's outside the scope of this answer.

Most of this can be done easily in one line (see footnote 2 for including the percentage, and rounding):

sector_total_value_df = sectors_df.groupby('Sector')['Current Value'].agg(value_in_sect='sum')
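As one possible way to keep everything in a single chained expression, the percentage can be added with assign. This is a sketch on made-up data, not the only way to write it; the rows in sectors_df here are placeholders:

```python
import pandas as pd

# Hypothetical stand-in for sectors_df
sectors_df = pd.DataFrame({
    'Sector': ['Energy', 'Utilities', 'Energy', 'Utilities'],
    'Sub Sector': ['Gas', 'Gas', 'Oil', 'Electric'],
    'Current Value': [100.0, 250.0, 200.0, 450.0],
})

sector_total_value_df = (
    sectors_df.groupby('Sector')['Current Value']
    .agg(value_in_sect='sum')
    # share of each sector in the whole portfolio, from the unrounded sums
    .assign(pct_in_sect=lambda d: d['value_in_sect'] / d['value_in_sect'].sum() * 100)
    # round only at the very end
    .round(2)
)
print(sector_total_value_df)
```

Because the percentage is computed from the sector sums themselves, the grand total does not need to be calculated separately.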

For sub sectors, there's one additional consideration: grouping should be done by both Sector and Sub Sector rather than just Sub Sector, so that, for example, rows from Utilities/Gas and Energy/Gas aren't combined.

subsector_total_value_df = sectors_df.groupby(['Sector', 'Sub Sector'])['Current Value'].agg(value_in_sect='sum')

This produces a DataFrame with a MultiIndex with levels 'Sector' and 'Sub Sector', and a column 'value_in_sect'. For a final piece of magic, the percentage in Sector can be calculated quite easily:

subsector_total_value_df['pct_within_sect'] = round(subsector_total_value_df['value_in_sect'] / sector_total_value_df['value_in_sect'] * 100, 2)

which works because pandas aligns the two Series on the shared 'Sector' index level before dividing.
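If you prefer to make that alignment explicit, the same division can be spelled with Series.div and its level argument. This is a minimal sketch on made-up data (the rows and the names sector_totals and subsector_df are placeholders for illustration):

```python
import pandas as pd

# Hypothetical stand-in for sectors_df
sectors_df = pd.DataFrame({
    'Sector': ['Energy', 'Utilities', 'Energy', 'Utilities'],
    'Sub Sector': ['Gas', 'Gas', 'Oil', 'Electric'],
    'Current Value': [100.0, 250.0, 200.0, 450.0],
})

# One total per Sector (single-level index named 'Sector')
sector_totals = sectors_df.groupby('Sector')['Current Value'].sum()

# One total per (Sector, Sub Sector) pair (MultiIndex)
subsector_df = (
    sectors_df.groupby(['Sector', 'Sub Sector'])['Current Value']
    .agg(value_in_sect='sum')
)

# level='Sector' says which MultiIndex level to match the sector totals on
subsector_df['pct_within_sect'] = (
    subsector_df['value_in_sect'].div(sector_totals, level='Sector') * 100
).round(2)
print(subsector_df)
```

Note that Gas appears under both Energy and Utilities here and each is measured against its own sector's total.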

Footnote 1. This deviates from your code slightly: your code divides the already rounded sector value by the total, whereas I calculate the percentage from the unrounded sector value, to minimize the error in the percentage. Ideally, though, rounding is only done at display time.
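A small sketch of the effect, using made-up numbers chosen only to make the difference visible:

```python
# Hypothetical values, not taken from any real portfolio
current_value = 10.004
total_value = 30.0

# Rounding the sector value first, then computing the percentage
pct_from_rounded = round(round(current_value, 2) / total_value * 100, 2)

# Computing the percentage from the unrounded value, rounding once at the end
pct_from_unrounded = round(current_value / total_value * 100, 2)

print(pct_from_rounded, pct_from_unrounded)
```

The two results differ in the second decimal place, which is why the intermediate rounding is best avoided.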

Footnote 2. This one-liner generates the desired result, including percentage and rounding:

sector_total_value_df = sectors_df.groupby('Sector')['Current Value'].agg(
    value_in_sect=lambda c: round(c.sum(), 2),
    pct_in_sect=lambda c: round(c.sum() / sectors_df['Current Value'].sum() * 100, 2),
)

Unfortunately, when I just copy and paste this one-liner into the terminal, all I get is a dataframe with 3 columns (index, value in sect, and pct in sect) and one row. I haven't had time to really look at this response, but I will get back to you when I figure out the issue! Thanks for the answer and the explanations!
