In the following pandas code, why is df not needed in the arguments?
df.groupby('Category').apply(lambda df,a,b: sum(df[a] * df[b]), 'Weight (oz.)', 'Quantity')
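The first parameter is passed implicitly to the function in the apply call: apply hands each group to the function as its first argument, and the df inside the lambda is just the name of that parameter. Therefore, it does not appear in the args again. You could actually rewrite the anonymous function in the apply to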
df.groupby('Category').apply(lambda x: sum(x["Weight (oz.)"] * x["Quantity"]))
without using args here at all. It then becomes clear that x is the first parameter, which is supplied by apply without you passing it explicitly.
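As a minimal sketch of the point above, the two spellings can be compared on a small frame. The sample data and the helper name weighted_total are hypothetical; only the column names come from the question. The value columns are selected explicitly so each group contains just the data being multiplied.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "Category": ["a", "a", "b"],
    "Weight (oz.)": [1.0, 2.0, 3.0],
    "Quantity": [2, 3, 4],
})

def weighted_total(group, a, b):
    # `group` is supplied by apply (one sub-DataFrame per category);
    # `a` and `b` are the extra positional arguments forwarded by apply.
    return (group[a] * group[b]).sum()

grouped = df.groupby("Category")[["Weight (oz.)", "Quantity"]]

# Extra arguments after the function are passed on to it, after the group.
with_args = grouped.apply(weighted_total, "Weight (oz.)", "Quantity")

# Same computation with the column names written into the lambda.
without_args = grouped.apply(lambda x: (x["Weight (oz.)"] * x["Quantity"]).sum())

print(with_args)
print(without_args)
```

In both calls the group itself is never listed among the arguments; apply provides it.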
More generally, apply is a method, here of the grouped object returned by df.groupby('Category').
This boils down to meaning that apply is passed a self parameter implicitly. Imagine the call to be apply(self, func, *args).
Here self refers to the grouped data built from df, and apply hands each group's DataFrame to func as its first argument; so now it should be clear that passing df again would be redundant (if it were allowed).
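The implicit-argument idea can be sketched with a toy class. This is a hypothetical stand-in written in plain Python, not how pandas is implemented; the names ToyGrouped and weighted_total are made up for illustration.

```python
class ToyGrouped:
    """Hypothetical stand-in for a grouped object; not pandas code."""

    def __init__(self, groups):
        # Mapping of group key -> list of row dicts.
        self.groups = groups

    def apply(self, func, *args):
        # `self` arrives implicitly. Each group is passed to `func` first,
        # followed by whatever extra arguments the caller supplied.
        return {key: func(rows, *args) for key, rows in self.groups.items()}


def weighted_total(rows, a, b):
    # `rows` is filled in by apply; `a` and `b` come from the caller.
    return sum(row[a] * row[b] for row in rows)


toy = ToyGrouped({
    "x": [{"w": 1, "q": 2}, {"w": 2, "q": 3}],
    "y": [{"w": 3, "q": 4}],
})

# The caller names only the function and the two keys, never the data.
result = toy.apply(weighted_total, "w", "q")
print(result)
```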
It is somewhat related and worth mentioning that you don't need apply at all here, and can speed up the operation considerably by first multiplying your two columns of interest and then grouping the product by your 'Category' column, e.g.
(df['Weight (oz.)'] * df['Quantity']).groupby(df.Category).sum()
Example
import numpy as np
import pandas as pd
df = pd.DataFrame(dict(category=[1, 1, 1, 2, 2, 2, 3, 3, 3]*(10**6),
                       a=np.random.randint(1, 10, 9*(10**6)),
                       b=np.random.randint(1, 10, 9*(10**6))))
%timeit (df.a*df.b).groupby(df.category).sum()
1 loop, best of 3: 560 ms per loop
%timeit df.groupby('category').apply(lambda x: sum(x.a*x.b))
1 loop, best of 3: 3.34 s per loop
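Before trusting the timings, it may help to confirm that both expressions compute the same thing. The following is a small check under assumed data: a nine-row frame with the same column names as the timing example, filled with seeded random integers (the seed and the name small are arbitrary choices).

```python
import numpy as np
import pandas as pd

# Small, seeded version of the timing frame (values are arbitrary).
rng = np.random.default_rng(0)
small = pd.DataFrame({
    "category": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "a": rng.integers(1, 10, 9),
    "b": rng.integers(1, 10, 9),
})

# Vectorized: multiply once, then sum the product per category.
fast = (small.a * small.b).groupby(small.category).sum()

# apply: call the lambda once per group.
slow = small.groupby("category")[["a", "b"]].apply(lambda x: (x.a * x.b).sum())

# True when both approaches agree on every group total.
same = fast.tolist() == slow.tolist()
print(same)
```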
On a DataFrame, .apply gets passed either a column or a row, as a Series, depending on whether you used axis=0 or axis=1, respectively.
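A short illustration of that behaviour on a plain DataFrame, using a made-up two-by-two frame (the name frame and its values are assumptions for the example):

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# axis=0: the function is called once per column, each passed as a Series.
col_sums = frame.apply(lambda s: s.sum(), axis=0)

# axis=1: the function is called once per row, each passed as a Series.
row_sums = frame.apply(lambda s: s.sum(), axis=1)

# In both cases the object handed to the function is a Series.
cols_are_series = frame.apply(lambda s: isinstance(s, pd.Series), axis=0)
rows_are_series = frame.apply(lambda s: isinstance(s, pd.Series), axis=1)

print(col_sums)
print(row_sums)
```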