Python: how to get element-wise standard deviation of multiple arrays in a dataframe

Question

I have a rather big dataframe (df) whose cells each contain either an array or NaN. The first 3 rows look like this:

df:
                 A                B                C
X  [4, 8, 1, 1, 9]              NaN  [8, 2, 8, 4, 9]
Y  [4, 3, 4, 1, 5]  [1, 2, 6, 2, 7]  [7, 1, 1, 7, 8]
Z              NaN  [9, 3, 8, 7, 7]  [2, 6, 3, 1, 9]

I already know (thanks to piRSquared) how to take the element-wise mean over rows for each column so that I get this:

element_wise_mean:
A                        [4.0, 5.5, 2.5, 1.0, 7.0]
B                        [5.0, 2.5, 7.0, 4.5, 7.0]
C    [5.66666666667, 3.0, 4.0, 4.0, 8.66666666667]

Now I wonder how to get the corresponding element-wise standard deviation. Any ideas? Also, I don't yet understand what groupby() is doing. Could someone explain it in more detail?

df

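import numpy as np
import pandas as pd

np.random.seed([3,14159])
df = pd.DataFrame(
    np.random.randint(10, size=(3, 3, 5)).tolist(),
    list('XYZ'), list('ABC')
).applymap(np.array)  # on pandas >= 2.1, DataFrame.map replaces the deprecated applymap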

df.loc['X', 'B'] = np.nan
df.loc['Z', 'A'] = np.nan

element_wise_mean

df2               = df.stack().groupby(level=1)
element_wise_mean = df2.apply(np.mean, axis=0)

element_wise_sd

element_wise_sd   = df2.apply(np.std, axis=0)
TypeError: setting an array element with a sequence.

Try on numpy array values - df2.apply(lambda x: np.std(x.values))?
– Zero, Commented Sep 18, 2017 at 11:34
I know somebody who would be very happy to see your seed value.
– cs95, Commented Sep 18, 2017 at 11:38
@cᴏʟᴅsᴘᴇᴇᴅ ah sorry, I'm still new and not aware of conventions here, I'll make a reference :-)
– Svenno Nito, Commented Sep 18, 2017 at 13:01

jezrael · Accepted Answer · 2017-09-18 12:06:48Z

3

Applying np.std through a lambda that first converts each group to a numpy array works for me:

element_wise_std = df2.apply(lambda x: np.std(np.array(x), 0))
# np.std defaults to axis=None, but the converted array is one-dimensional here, so axis=0 can be omitted
#element_wise_std = df2.apply(lambda x: np.std(np.array(x)))
print (element_wise_std)
A                            [0.0, 2.5, 1.5, 0.0, 2.0]
B                            [4.0, 0.5, 1.0, 2.5, 0.0]
C    [2.62466929134, 2.16024689947, 2.94392028878, ...
dtype: object

Or use the solution from the comment:

element_wise_std = df2.apply(lambda x: np.std(x.values, 0))
print (element_wise_std)
A                            [0.0, 2.5, 1.5, 0.0, 2.0]
B                            [4.0, 0.5, 1.0, 2.5, 0.0]
C    [2.62466929134, 2.16024689947, 2.94392028878, ...
dtype: object
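One detail worth keeping in mind when comparing these numbers with other tools: np.std computes the population standard deviation (ddof=0) by default, while the pandas .std() methods default to the sample standard deviation (ddof=1). A minimal sketch of the difference, using only the two arrays of column A from the question (the variable names are just for illustration):

```python
import numpy as np

# Rows X and Y of column A from the question (row Z is NaN and is left out).
col_a = np.array([[4, 8, 1, 1, 9],
                  [4, 3, 4, 1, 5]])

# ddof=0 is NumPy's default: divide by n.
population_std = np.std(col_a, axis=0)

# ddof=1 divides by n - 1, which is what pandas' .std() does by default.
sample_std = np.std(col_a, axis=0, ddof=1)

print(population_std)
print(sample_std)
```

If you need the sample standard deviation, passing ddof=1 inside the lambda should be enough.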

Let me explain in more detail:

First, reshape with stack: the column labels are moved into the index, which creates a MultiIndex.

print (df.stack())
X  A    [4, 8, 1, 1, 9]
   C    [8, 2, 8, 4, 9]
Y  A    [4, 3, 4, 1, 5]
   B    [1, 2, 6, 2, 7]
   C    [7, 1, 1, 7, 8]
Z  B    [9, 3, 8, 7, 7]
   C    [2, 6, 3, 1, 9]
dtype: object

Then groupby(level=1) groups by the second level of the MultiIndex (levels are numbered from 0), i.e. by the values A, B, C, and applies a function to each group. Here that function is np.std.

Pandas does not work well with arrays or lists stored in cells, so the conversion is necessary. (This looks like a bug.)
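If you prefer to make the conversion fully explicit, here is a rough sketch that skips stack and groupby entirely: for each column it collects the array cells, stacks them into a real 2-D array with np.stack, and takes the standard deviation over the rows. The helper name column_std is hypothetical, and the dataframe is rebuilt by hand from the values shown in the question.

```python
import numpy as np
import pandas as pd

# Dataframe from the question, rebuilt by hand (NaN marks the missing cells).
df = pd.DataFrame(
    {
        "A": [np.array([4, 8, 1, 1, 9]), np.array([4, 3, 4, 1, 5]), np.nan],
        "B": [np.nan, np.array([1, 2, 6, 2, 7]), np.array([9, 3, 8, 7, 7])],
        "C": [np.array([8, 2, 8, 4, 9]), np.array([7, 1, 1, 7, 8]),
              np.array([2, 6, 3, 1, 9])],
    },
    index=list("XYZ"),
)


def column_std(cells):
    """Hypothetical helper: element-wise std over the array cells of one column."""
    arrays = [c for c in cells if isinstance(c, np.ndarray)]  # skip NaN cells
    return np.std(np.stack(arrays), axis=0)  # 2-D array, reduce over rows


element_wise_std = {col: column_std(df[col]) for col in df.columns}
for col, values in element_wise_std.items():
    print(col, values)
```

This assumes all arrays in a column have the same length; np.stack raises an error otherwise.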

edited Sep 18, 2017 at 12:06

answered Sep 18, 2017 at 11:33

jezrael

3 Comments

Ken Syme Over a year ago

A pandas column is a sequence, and in this case each sequence is an array. It looks like the pandas implementation is not playing nice with using this sequence of arrays. By doing x.values or np.array(x) the column is explicitly converted to a 2D array and so it works thereafter. Weird it works with mean and not std - would probably raise an issue on the pandas github to see what else could be going on

jezrael Over a year ago

@KenSyme - Nice idea - I post it here.

Svenno Nito Over a year ago

Amazing, thanks! It is counterintuitive to me that np.mean and np.std should behave differently on the same dataset, but it really works this way. Would love to hear from you again once you hear why it is like that.

Tony · Accepted Answer · 2017-09-18 11:51:28Z

2

Jezrael beat me to this:

To answer your question about .groupby(), try .apply(print). It prints each group that groupby produces, which is exactly what gets passed to the function you give to apply:

df2 = df.stack().groupby(level=1) #groups by the second index level of df.stack()
df2.apply(print)
X  A    [4, 8, 1, 1, 9]
Y  A    [4, 3, 4, 1, 5]
Name: A, dtype: object
Y  B    [1, 2, 6, 2, 7]
Z  B    [9, 3, 8, 7, 7]
Name: B, dtype: object
X  C    [8, 2, 8, 4, 9]
Y  C    [7, 1, 1, 7, 8]
Z  C    [2, 6, 3, 1, 9]
Name: C, dtype: object

Conversely, try:

df3 = df.stack().groupby(level=0) #this will group by the first index of df.stack()
df3.apply(print)
X  A    [4, 8, 1, 1, 9]
   C    [8, 2, 8, 4, 9]
Name: X, dtype: object
Y  A    [4, 3, 4, 1, 5]
   B    [1, 2, 6, 2, 7]
   C    [7, 1, 1, 7, 8]
Name: Y, dtype: object
Z  B    [9, 3, 8, 7, 7]
   C    [2, 6, 3, 1, 9]
Name: Z, dtype: object
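Another way to look inside a groupby object is to iterate over it, which yields (name, group) pairs. A small sketch, assuming a stacked Series with the same MultiIndex as above; the integer values are placeholders standing in for the arrays, since only the index matters for the grouping:

```python
import pandas as pd

# Same (row, column) index pairs as df.stack() in the question.
index = pd.MultiIndex.from_tuples(
    [("X", "A"), ("X", "C"), ("Y", "A"), ("Y", "B"),
     ("Y", "C"), ("Z", "B"), ("Z", "C")]
)
stacked = pd.Series(range(7), index=index)  # placeholder values

# level=1 groups by the second index level, i.e. the original column labels.
groups = {name: list(group.index) for name, group in stacked.groupby(level=1)}

for name, members in groups.items():
    print(name, members)
```

Swapping level=1 for level=0 groups by the row labels X, Y, Z instead.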

answered Sep 18, 2017 at 11:51

Tony

1 Comment

Svenno Nito Over a year ago

.apply(print) was exactly what I needed to visualize what is going on, thanks a bunch!
