Python - Calculating standard deviation (row level) of dataframe columns

Question

I have created a pandas DataFrame and am able to determine the standard deviation of one or more of its columns (column level). I need to determine the standard deviation for all the rows of a particular column. Below are the commands that I have tried so far:

# Will determine the standard deviation of all the numerical columns by default.
inp_df.std()

salary         8.194421e-01
num_months     3.690081e+05
no_of_hours    2.518869e+02

# Same as above command. Performs the standard deviation at the column level.
inp_df.std(axis = 0)

# Determines the standard deviation over only the salary column of the dataframe.
inp_df[['salary']].std()

salary         8.194421e-01

# Determines Standard Deviation for every row present in the dataframe. But it
# does this for the entire row and it will output values in a single column.
# One std value for each row.
inp_df.std(axis=1)

0       4.374107e+12
1       4.377543e+12
2       4.374026e+12
3       4.374046e+12
4       4.374112e+12
5       4.373926e+12

When I execute the below command I am getting "NaN" for all the records. Is there a way to resolve this?

# Trying to determine standard deviation only for the "salary" column at the
# row level.
inp_df[['salary']].std(axis = 1)

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN

Not sure what "standard deviation for all rows of one column" means. Isn't that just the std of that column, which would be one scalar number instead of a column? Could you post the code that generates your DataFrame and say which columns/rows you want to calculate std on?
– Indominus, Commented Dec 17, 2018 at 5:37
You're calculating the standard deviation of a single number (one column, row by row)... what result would you expect? It's NaN because it divides by N-1, where N is 1.
– filippo, Commented Dec 17, 2018 at 5:45
@filippo Apologies. I was not aware of the reason why it was returning NaN. Now it makes sense. Thanks for your input.
– JKC, Commented Dec 17, 2018 at 9:00
@Indominus That's right. It will return only one scalar if we do std over only one column. I have to combine it with another column to get proper std values, as explained by jezrael below.
– JKC, Commented Dec 17, 2018 at 9:02
@JKC No need to apologize ;-) Maybe I sounded too harsh. What I meant was that it wasn't clear from your question whether your issue was with the NaNs or whether you hadn't noticed you were calculating standard deviation on single samples. Glad it's solved now!
– filippo, Commented Dec 18, 2018 at 0:58

Peter Mortensen · Accepted Answer · 2021-07-17 21:30:56Z

This is expected. The documentation for DataFrame.std says:

Normalized by N-1 by default. This can be changed using the ddof argument

With a single element, N-1 is 0, so you would be dividing by 0 and the sample standard deviation is undefined. That is why selecting a one-column DataFrame and taking the standard deviation across columns (axis=1) returns a missing value (NaN) for every row.
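To make the N-1 point concrete, here is a minimal pure-Python sketch of a standard deviation with a ddof parameter. The function name std and its NaN-on-zero-divisor behaviour are assumptions chosen to mirror what the question observes; this is not pandas' actual implementation.

```python
import math

def std(values, ddof=1):
    """Standard deviation with divisor N - ddof (illustrative helper, not pandas code)."""
    n = len(values)
    if n - ddof <= 0:
        # Assumption: mirror the observed pandas result and return NaN
        # instead of raising when the divisor would be zero.
        return math.nan
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - ddof))

print(std([10]))          # nan: one value, divisor N-1 is 0
print(std([10], ddof=0))  # 0.0: divisor is N, a single value has no spread
print(std([10, 2]))       # about 5.656854: two values give a defined result
```

A single value per row is exactly what a one-column DataFrame supplies when the reduction runs along axis=1.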

Sample:

inp_df = pd.DataFrame({'salary':[10,20,30],
                       'num_months':[1,2,3],
                       'no_of_hours':[2,5,6]})
print (inp_df)
   salary  num_months  no_of_hours
0      10           1            2
1      20           2            5
2      30           3            6

Select one column with single [] to get a Series:

print (inp_df['salary'])
0    10
1    20
2    30
Name: salary, dtype: int64

The std of a Series is a scalar:

print (inp_df['salary'].std())
10.0

Select one column with double [[]] to get a one-column DataFrame:

print (inp_df[['salary']])
   salary
0      10
1      20
2      30

The std of a DataFrame along the index (axis=0, the default) is a one-element Series:

print (inp_df[['salary']].std())
# same as
# print (inp_df[['salary']].std(axis=0))
salary    10.0
dtype: float64

The std of a DataFrame along the columns (axis=1) is all NaNs, because each row holds a single value:

print (inp_df[['salary']].std(axis = 1))
0   NaN
1   NaN
2   NaN
dtype: float64

If you change the default ddof=1 to ddof=0, you get zeros instead:

print (inp_df[['salary']].std(axis = 1, ddof=0))
0    0.0
1    0.0
2    0.0
dtype: float64

If you want the std across two or more columns:

#select 2 columns
print (inp_df[['salary', 'num_months']])
   salary  num_months
0      10           1
1      20           2
2      30           3

#std by index
print (inp_df[['salary','num_months']].std())
salary        10.0
num_months     1.0
dtype: float64

# std across columns (axis=1), here for salary and no_of_hours
print (inp_df[['salary','no_of_hours']].std(axis = 1))
0     5.656854
1    10.606602
2    16.970563
dtype: float64
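If the goal is to keep the row-level result next to the data, one option is to assign it back as a new column. This is a small sketch using the sample frame above; the column name row_std and the variable cols are hypothetical and chosen only for illustration.

```python
import pandas as pd

inp_df = pd.DataFrame({'salary': [10, 20, 30],
                       'num_months': [1, 2, 3],
                       'no_of_hours': [2, 5, 6]})

# Columns to combine per row; at least two are needed for ddof=1.
cols = ['salary', 'no_of_hours']

# One std value per row, computed across the selected columns.
row_std = inp_df[cols].std(axis=1)

# 'row_std' is a hypothetical column name for the result.
inp_df['row_std'] = row_std
print(inp_df)
```

The values in row_std match the output shown above (5.656854, 10.606602, 16.970563). Selecting a single column here would again give NaN for every row.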

No words to express my gratitude to you and your answer. It's simply an awesome explanation :)
