
I have a rather big dataframe (df) in which each cell contains either an array or NaN. The first 3 rows look like this:

df:
                 A                B                C
X  [4, 8, 1, 1, 9]              NaN  [8, 2, 8, 4, 9]
Y  [4, 3, 4, 1, 5]  [1, 2, 6, 2, 7]  [7, 1, 1, 7, 8]
Z              NaN  [9, 3, 8, 7, 7]  [2, 6, 3, 1, 9]

I already know (thanks to piRSquared) how to take the element-wise mean over rows for each column so that I get this:

element_wise_mean:
A                        [4.0, 5.5, 2.5, 1.0, 7.0]
B                        [5.0, 2.5, 7.0, 4.5, 7.0]
C    [5.66666666667, 3.0, 4.0, 4.0, 8.66666666667]

Now I wonder how to get the respective standard deviation. Any ideas? Also, I don't yet understand what groupby() is doing. Could someone explain its function in more detail?


df

import numpy as np
import pandas as pd

np.random.seed([3,14159])
df = pd.DataFrame(
    np.random.randint(10, size=(3, 3, 5)).tolist(),
    list('XYZ'), list('ABC')
).applymap(np.array)

df.loc['X', 'B'] = np.nan
df.loc['Z', 'A'] = np.nan

element_wise_mean

df2               = df.stack().groupby(level=1)
element_wise_mean = df2.apply(np.mean, axis=0)

element_wise_sd

element_wise_sd   = df2.apply(np.std, axis=0)
TypeError: setting an array element with a sequence.
  • Try on numpy array values - df2.apply(lambda x: np.std(x.values))? Commented Sep 18, 2017 at 11:34
  • I know somebody who would be very happy to see your seed value. Commented Sep 18, 2017 at 11:38
  • @cᴏʟᴅsᴘᴇᴇᴅ is that pirsquared? Commented Sep 18, 2017 at 11:45
  • @cᴏʟᴅsᴘᴇᴇᴅ ah sorry, I'm still new and not aware of conventions here, I'll make a reference :-) Commented Sep 18, 2017 at 13:01

2 Answers

Applying np.std through a lambda that first converts the group to a NumPy array works for me:

element_wise_std = df2.apply(lambda x: np.std(np.array(x), 0))
#axis=0 can be omitted here (np.std defaults to axis=None, which gives the same result for this 1-D input)
#element_wise_std = df2.apply(lambda x: np.std(np.array(x)))
print (element_wise_std)
A                            [0.0, 2.5, 1.5, 0.0, 2.0]
B                            [4.0, 0.5, 1.0, 2.5, 0.0]
C    [2.62466929134, 2.16024689947, 2.94392028878, ...
dtype: object

Or the solution from the comment:

element_wise_std = df2.apply(lambda x: np.std(x.values, 0))
print (element_wise_std)
A                            [0.0, 2.5, 1.5, 0.0, 2.0]
B                            [4.0, 0.5, 1.0, 2.5, 0.0]
C    [2.62466929134, 2.16024689947, 2.94392028878, ...
dtype: object

I'll try to explain more:

First, reshape with stack: the column labels are moved into the index, creating a MultiIndex (the NaN cells are dropped, as the output shows).

print (df.stack())
X  A    [4, 8, 1, 1, 9]
   C    [8, 2, 8, 4, 9]
Y  A    [4, 3, 4, 1, 5]
   B    [1, 2, 6, 2, 7]
   C    [7, 1, 1, 7, 8]
Z  B    [9, 3, 8, 7, 7]
   C    [2, 6, 3, 1, 9]
dtype: object

Then groupby(level=1) groups by the second level of the MultiIndex (the values A, B, C) and applies a function to each group. Here it is np.std.

Pandas does not work very well with arrays or lists stored in cells, so converting to a NumPy array is necessary. (It looks like a bug.)
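The steps above can also be sketched with an explicit 2-D conversion. This is my own illustration, not code from the question: the dataframe is rebuilt by hand from the three example rows, and elementwise_std is a hypothetical helper name. Note that x.values on such a group is still a 1-D object array whose elements are arrays, whereas np.stack builds a real 2-D numeric array, so the axis argument has an unambiguous meaning.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the three example rows from the question (an assumption
# made for illustration; the original uses a random seed instead).
rows = list("XYZ")
df = pd.DataFrame({
    "A": pd.Series([np.array([4, 8, 1, 1, 9]), np.array([4, 3, 4, 1, 5]), np.nan],
                   index=rows, dtype=object),
    "B": pd.Series([np.nan, np.array([1, 2, 6, 2, 7]), np.array([9, 3, 8, 7, 7])],
                   index=rows, dtype=object),
    "C": pd.Series([np.array([8, 2, 8, 4, 9]), np.array([7, 1, 1, 7, 8]),
                    np.array([2, 6, 3, 1, 9])],
                   index=rows, dtype=object),
})

# Columns move into the index; dropna() is explicit so the sketch does not
# depend on stack() dropping the NaN cells itself.
stacked = df.stack().dropna()

def elementwise_std(group):
    """Hypothetical helper: stack the arrays of one group into a 2-D array."""
    matrix = np.stack(group.tolist())   # shape: (arrays in group, 5)
    return matrix.std(axis=0)           # population std (ddof=0), same as np.std

# Group by the second index level (the former column labels A, B, C).
element_wise_std = stacked.groupby(level=1).apply(elementwise_std)
print(element_wise_std)
```

np.std uses ddof=0 by default, so these are population standard deviations. For the sample standard deviation, pass ddof=1 to matrix.std.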

3 Comments

A pandas column is a sequence, and in this case each element of the sequence is an array. It looks like the pandas implementation does not play nicely with this sequence of arrays. By using x.values or np.array(x), the column is explicitly converted to a NumPy array (of arrays), and so it works thereafter. It is weird that it works with mean and not std. I would probably raise an issue on the pandas GitHub to see what else could be going on.
@KenSyme - Nice idea - I posted it here.
Amazing, thanks! It is counterintuitive to me that np.mean and np.std should behave differently on the same dataset, but it really works this way. I would love to hear from you again once you find out why it is like that.

Jezrael beat me to this:

To answer your question about .groupby(), try .apply(print). You'll see each group exactly as it is passed to the function you give to apply:

df2 = df.stack().groupby(level=1) #groups by the second index level of df.stack()
df2.apply(print)
X  A    [4, 8, 1, 1, 9]
Y  A    [4, 3, 4, 1, 5]
Name: A, dtype: object
Y  B    [1, 2, 6, 2, 7]
Z  B    [9, 3, 8, 7, 7]
Name: B, dtype: object
X  C    [8, 2, 8, 4, 9]
Y  C    [7, 1, 1, 7, 8]
Z  C    [2, 6, 3, 1, 9]
Name: C, dtype: object

Conversely, try:

df3 = df.stack().groupby(level=0) #this will group by the first index level of df.stack()
df3.apply(print)
X  A    [4, 8, 1, 1, 9]
   C    [8, 2, 8, 4, 9]
Name: X, dtype: object
Y  A    [4, 3, 4, 1, 5]
   B    [1, 2, 6, 2, 7]
   C    [7, 1, 1, 7, 8]
Name: Y, dtype: object
Z  B    [9, 3, 8, 7, 7]
   C    [2, 6, 3, 1, 9]
Name: Z, dtype: object
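Besides .apply(print), a GroupBy object can be iterated directly, which yields (key, group) pairs. The sketch below is only an illustration: the stacked Series is typed in by hand as a stand-in for df.stack(), and the names stacked, group_rows and sizes are my own.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for df.stack(): a Series of arrays with a two-level
# index (row label, column label). The NaN cells are already left out.
index = pd.MultiIndex.from_tuples([
    ("X", "A"), ("X", "C"),
    ("Y", "A"), ("Y", "B"), ("Y", "C"),
    ("Z", "B"), ("Z", "C"),
])
values = [
    np.array([4, 8, 1, 1, 9]), np.array([8, 2, 8, 4, 9]),
    np.array([4, 3, 4, 1, 5]), np.array([1, 2, 6, 2, 7]), np.array([7, 1, 1, 7, 8]),
    np.array([9, 3, 8, 7, 7]), np.array([2, 6, 3, 1, 9]),
]
stacked = pd.Series(values, index=index, dtype=object)

# Iterating a GroupBy yields one (key, sub-Series) pair per group.
group_rows = {}
for key, group in stacked.groupby(level=1):
    # Record which row labels ended up in the group for this column label.
    group_rows[key] = list(group.index.get_level_values(0))
    print(key, "->", group_rows[key])

# size() reports how many entries each group holds.
sizes = stacked.groupby(level=1).size()
print(sizes)
```

Switching to level=0 groups by the row labels X, Y, Z instead, matching the second printout above.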

1 Comment

.apply(print) was exactly what I needed to visualize what is going on, thanks a bunch!
