0

I have got 1 huge df (over 5 million rows) and 5 columns in jupyter notebook. 'Name' column has 20 unique values and 'lot' column has 10 unique values. So 200 unique combinations of these two columns. I need to subset the df based on the unique combinations of these two columns, make some calculations and return some parts of the calculations in a final df. Final df will have 200 rows, one for each iteration/subset/combination.

For example (with only 2 names and 3 lots = 6 combinations):

    dfhuge
    index      Name   lot   col3 col4 col5
    123        delta   1    786  10   1
    657        delta   2    787  11   2
    567        delta   2    777  13   4
    456        bravo   3    775  12   3
    789        bravo   3    772  14   5

For 1 of the 6 iterations/combinations, I could use

df1outof6 = dfhuge.loc[(dfhuge["Name"] == "delta") & (dfhuge["lot"] == 2)]
df1outof6
   index       Name   lot   col3 col4 col5
    657        delta   2    787  11   2
    567        delta   2    777  13   4
mean = df1outof6["col4"].mean()
sum = df1outof6["col5"].sum()
... 

I want the above operation repeated for all the 6 subsets using a loop.

Final df should be:

    finaldf
    newcol                     col4mean  col5sum  
    combination1(delta and 2)      12     6   
    combination2(delta and 1)      10     1    
    combination3(delta and 3)      0      0   
    combination4(bravo and 1)      0      0    
    combination5(bravo and 2)      0      0
    combination6(bravo and 3)      13     8  

I need a loop, result of which will be the finaldf. I can't use df.loc to subset each combination because I originally have 200 of them.

1
  • Please provide code how to create your dataframe Commented Oct 18, 2022 at 4:21

2 Answers 2

1

Since you did not provide how to create your test df, I made my own and changed column names. Hope it's ok

import numpy as np
import pandas as pd
from itertools import product
df = pd.DataFrame(np.random.randint(0,10,size=(100, 4)), columns=list('ABCD'))

rows = []
unique_a = list(df["A"].unique())
unique_b = list(df["B"].unique())
unique_a.sort()
unique_b.sort()
key_error=0
for a_index, b_index in product(unique_a, unique_b):
    df_subset =  df.loc[(df["A"] == a_index) & (df["B"] == b_index),:]
    calc0 = 0
    calc1 = 0
    if len(df_subset):
        calc0 = df_subset["C"].sum()
        calc1 = df_subset["D"].mean()
    rows.append(((a_index, b_index), calc0, calc1))
        
    
final_df = pd.DataFrame(
    rows,  
    columns = ["index combination", "calc1", "calc2"]
).set_index("index combination")

final_df

Initial df

    A   B   C   D
0   1   7   3   4
1   6   1   3   9
2   7   9   1   4
3   8   7   4   6
4   0   3   3   3
... ... ... ... ...
95  7   5   5   7
96  0   4   9   0
97  0   3   2   1
98  4   2   9   2
99  6   5   6   6

Final df

    calc1   calc2
index combination       
(0, 0)  0   0.0
(0, 1)  21  5.0
(0, 2)  1   9.0
(0, 3)  20  3.0
(0, 4)  9   0.0
... ... ...
(9, 5)  1   0.0
(9, 6)  5   7.0
(9, 7)  0   0.0
(9, 8)  0   0.0
(9, 9)  0   0.0
Sign up to request clarification or add additional context in comments.

Comments

1

You can create a GroupBy object to calculate mean and sum for groups based on name and lot. These can then be concatenated to create your final dataframe.

grp = df.groupby(['Name', 'lot'])
finaldf = pd.concat({"col4mean":grp["col4"].mean(),
    "col5sum":grp["col5"].sum()}, axis=1).reset_index()
print(finaldf)

Output

    Name  lot  col4mean  col5sum
0  bravo    3      13.0        8
1  delta    1      10.0        1
2  delta    2      12.0        6

If you want to include name,lot combinations not included in the data sample, you could prefill the final dataframe with an index and add the columns from there.

grp = df.groupby(['Name', 'lot'])
finaldf = pd.DataFrame(index=pd.MultiIndex.from_product(
         [df["Name"].unique(), df["lot"].unique()]))         
finaldf["col4mean"] = grp["col4"].mean()
finaldf["col5sum"] = grp["col5"].sum()
finaldf = finaldf.fillna(0)
finaldf["col5sum"] = finaldf["col5sum"].astype(int)

Output

         col4mean  col5sum
delta 1      10.0        1
      2      12.0        6
      3       0.0        0
bravo 1       0.0        0
      2       0.0        0
      3      13.0        8

3 Comments

I think he wanted full list of possible combinations of 'Name' and 'lot' filled with 0 if the combination not present in dataframe
@NikolayZakirov - Added an alternate implementation that I think covers the case.
NikolayZakirov and tdelaney - Both your solutions have been great. Thank you!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.