This is what I am trying to explain:
>>> a = pd.Series([7, 20, 22, 22])
>>> a.std()
7.2284161474004804
>>> np.std(a)
6.2599920127744575
I have data about many different restaurants. For simplicity I have extracted just one restaurant with four items:
>>> df
    restaurant_id  price
id
1           10407      7
3           10407     20
6           10407     22
13          10407     22
For each restaurant, I want to get the standard deviation. However, pandas returns the wrong values:
>>> df.groupby('restaurant_id').std()
                  price
restaurant_id
10407          7.228416
We can get the correct value with np.std():
>>> np.std(df['price'])
6.2599920127744575
But obviously, this is not a solution when I have more than one restaurant. How do I do this properly?
Just to make sure, I checked that df['price'].mean() == np.mean(df['price']).
There is a related discussion here, but their suggestions do not work either.
pd.Series([7,20,22,22]).std(ddof=0) would be the same number as np.std.
I was going to suggest .agg(np.std) as a workaround (which wouldn't be an ideal solution in this case, but the pattern is good to know), but actually, that still produces the Bessel output! I had to do .agg(lambda col: np.std(col)) to get the non-Bessel output. I'm not an expert on this, but I think pandas (at least in some versions) recognizes np.std when it is passed to .agg and substitutes its own std, which defaults to ddof=1; np.std itself is a regular function, not a ufunc.
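To tie the comments together, here is a minimal sketch of the per-restaurant version. It assumes the difference is only the ddof default (pandas uses ddof=1, np.std uses ddof=0). The second restaurant (id 20000) and its prices are made up purely so that there is more than one group; the variable names are illustrative too.

```python
import numpy as np
import pandas as pd

# Data from the question, plus a hypothetical second restaurant (id 20000)
# added only to show that the result is computed per group.
df = pd.DataFrame(
    {
        "restaurant_id": [10407, 10407, 10407, 10407, 20000, 20000],
        "price": [7, 20, 22, 22, 10, 14],
    }
)

prices_by_restaurant = df.groupby("restaurant_id")["price"]

# pandas default: ddof=1, i.e. the sample standard deviation (Bessel's correction).
sample_std = prices_by_restaurant.std()

# ddof=0 divides by n instead of n - 1, which matches np.std's default.
population_std = prices_by_restaurant.std(ddof=0)

# The workaround from the comment: wrapping np.std in a lambda so that
# np.std itself (with its ddof=0 default) is applied to each group.
population_std_agg = prices_by_restaurant.agg(lambda col: np.std(col))

print(sample_std)
print(population_std)
print(population_std_agg)
```

For restaurant 10407, std(ddof=0) should reproduce the np.std value from the question (about 6.26), while the default std() gives the 7.228416 shown above. Passing ddof=0 directly is probably the simpler of the two options, since it does not depend on how .agg treats np.std.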