I have one large df (over 5 million rows, 5 columns) in a Jupyter notebook. The 'Name' column has 20 unique values and the 'lot' column has 10 unique values, so there are 200 unique combinations of these two columns. I need to subset the df on each unique combination, run some calculations on the subset, and return some of the results in a final df. The final df will have 200 rows, one for each combination.
For example (with only 2 names and 3 lots = 6 combinations):
dfhuge
index Name lot col3 col4 col5
123 delta 1 786 10 1
657 delta 2 787 11 2
567 delta 2 777 13 4
456 bravo 3 775 12 3
789 bravo 3 772 14 5
For one of the 6 combinations, I could use:
df1outof6 = dfhuge.loc[(dfhuge["Name"] == "delta") & (dfhuge["lot"] == 2)]
df1outof6
index Name lot col3 col4 col5
657 delta 2 787 11 2
567 delta 2 777 13 4
col4mean = df1outof6["col4"].mean()  # 12.0 for the subset above
col5sum = df1outof6["col5"].sum()    # 6; named col5sum to avoid shadowing the built-in sum()
...
I want the above operation repeated for all 6 subsets using a loop.
Final df should be:
finaldf
newcol col4mean col5sum
combination1(delta and 2) 12 6
combination2(delta and 1) 10 1
combination3(delta and 3) 0 0
combination4(bravo and 1) 0 0
combination5(bravo and 2) 0 0
combination6(bravo and 3) 13 8
I need a loop whose result is finaldf. Writing a separate df.loc filter by hand for each combination is not practical, because the real data has 200 of them.
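For reference, here is a minimal sketch of one way the desired output could be produced. It swaps the explicit loop for a single groupby over both columns, then reindexes against every possible (Name, lot) pair so that combinations with no rows appear with 0, as in the finaldf above. The small dfhuge built here is only a stand-in made from the example rows, and the name all_pairs is introduced just for illustration.

```python
import pandas as pd

# Stand-in for dfhuge, built from the example rows in the question.
dfhuge = pd.DataFrame(
    {
        "Name": ["delta", "delta", "delta", "bravo", "bravo"],
        "lot": [1, 2, 2, 3, 3],
        "col3": [786, 787, 777, 775, 772],
        "col4": [10, 11, 13, 12, 14],
        "col5": [1, 2, 4, 3, 5],
    },
    index=[123, 657, 567, 456, 789],
)

# Aggregate once per (Name, lot) pair that actually occurs in the data.
grouped = dfhuge.groupby(["Name", "lot"]).agg(
    col4mean=("col4", "mean"),
    col5sum=("col5", "sum"),
)

# Every possible pairing of the unique values, including pairs with no rows.
all_pairs = pd.MultiIndex.from_product(
    [dfhuge["Name"].unique(), sorted(dfhuge["lot"].unique())],
    names=["Name", "lot"],
)

# Missing pairs are filled with 0 to match the desired finaldf.
finaldf = grouped.reindex(all_pairs, fill_value=0).reset_index()
print(finaldf)
```

If the per-subset calculations are more involved than a mean and a sum, iterating with `for (name, lot), subset in dfhuge.groupby(["Name", "lot"]):` gives each subset in turn without writing any df.loc filters by hand.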