
I have a 16x10 pandas DataFrame with a 1x35000 array (or NaN) in each cell. I want to take the element-wise mean over the rows of each column.

      1        2        3        ...      10
1     1x35000  1x35000  1x35000           1x35000
2     1x35000  NaN      1x35000           1x35000
3     1x35000  NaN      1x35000           NaN
...
16    1x35000  1x35000  NaN               1x35000

To avoid misunderstandings: take the first element of each array in the first column and take the mean; then take the second element of each array in the first column and take the mean again, and so on. In the end I want a 1x10 DataFrame with one 1x35000 array per column, where each array is the element-wise mean of my former arrays.

      1       2       3       ...       10
1    1x35000 1x35000 1x35000           1x35000

Do you have an idea how to get there elegantly, preferably without for-loops?
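Edit: a minimal toy version of the operation I have in mind (length-3 arrays instead of 1x35000; NaN cells should simply be skipped):

import numpy as np

col = [np.array([1., 2., 3.]), np.nan, np.array([3., 4., 5.])]
arrays = [a for a in col if isinstance(a, np.ndarray)]  # drop the NaN cell
np.mean(arrays, axis=0)  # -> array([2., 3., 4.])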

  • Given the mixed dtype data (with NaN), I think performing the mean for each cell iteratively might be the way to go here. Commented Sep 15, 2017 at 21:53
  • I hoped to avoid for-loops though :p Commented Sep 16, 2017 at 20:38
  • Check out the two vectorized solutions in this post then. Commented Sep 16, 2017 at 20:39

2 Answers


Setup

import numpy as np
import pandas as pd

np.random.seed([3, 14159])
df = pd.DataFrame(
    np.random.randint(10, size=(3, 3, 5)).tolist(),
    list('XYZ'), list('ABC')
).applymap(np.array)   # each cell becomes a length-5 NumPy array

df.loc['X', 'B'] = np.nan
df.loc['Z', 'A'] = np.nan

df

                 A                B                C
X  [4, 8, 1, 1, 9]              NaN  [8, 2, 8, 4, 9]
Y  [4, 3, 4, 1, 5]  [1, 2, 6, 2, 7]  [7, 1, 1, 7, 8]
Z              NaN  [9, 3, 8, 7, 7]  [2, 6, 3, 1, 9]

Solution

g = df.stack().groupby(level=1)     # stack drops the NaN cells; group by column label
g.apply(np.sum, axis=0) / g.size()  # element-wise sum per column, divided by count

A                        [4.0, 5.5, 2.5, 1.0, 7.0]
B                        [5.0, 2.5, 7.0, 4.5, 7.0]
C    [5.66666666667, 3.0, 4.0, 4.0, 8.66666666667]
dtype: object
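
For intuition (an illustrative aside, not from the original answer): stack() drops the NaN cells and yields a Series indexed by (row, column) pairs, and groupby(level=1) then groups the surviving arrays by their original column label -

s = df.stack()          # NaN cells are dropped automatically
list(s.index)           # [('X','A'), ('X','C'), ('Y','A'), ('Y','B'),
                        #  ('Y','C'), ('Z','B'), ('Z','C')]
g = s.groupby(level=1)  # one group per original column label
g.size()                # A: 2, B: 2, C: 3 -- the divisors used above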

If you insist on the shape you presented

g = df.stack().groupby(level=1)
(g.apply(np.sum, axis=0) / g.size()).to_frame().T  # transpose into a one-row frame

                           A                          B                                              C
0  [4.0, 5.5, 2.5, 1.0, 7.0]  [5.0, 2.5, 7.0, 4.5, 7.0]  [5.66666666667, 3.0, 4.0, 4.0, 8.66666666667]

1 Comment

Amazing, thanks! Very smart thoughts! I have two follow-up questions if I may? First, I don't really understand what groupby(level=1) is doing - if I print g after this step it only shows <pandas.core.groupby.SeriesGroupBy object at 0x000000F86E0A4940>, could you elaborate on that? Second, g.mean(axis=0) should do the same as g.apply(np.sum, axis=0) / g.size(), right?

Approach #1 : Loopy

Given the mixed-dtype input data, looping might actually be the efficient way to go here. So, explicit loops or the under-the-hood loops in .apply/.applymap would be the suggested solutions.

Here's one way looping through columns -

mask = ~df.isnull().values            # True where a cell holds an array
n = df.shape[1]
out = np.empty((1, n), dtype=object)
for i in range(n):
    # element-wise mean over the non-NaN cells of column i
    out[0, i] = df.iloc[mask[:, i], i].mean()
df_out = pd.DataFrame(out)

Sample input, output -

In [326]: df
Out[326]: 
              0             1             2
0  [4, 0, 1, 6]  [4, 2, 2, 2]  [5, 3, 5, 4]
1           NaN  [0, 5, 6, 8]           NaN
2           NaN           NaN           NaN
3           NaN           NaN           NaN

In [327]: df_out
Out[327]: 
                      0                     1                     2
0  [4.0, 0.0, 1.0, 6.0]  [2.0, 3.5, 4.0, 5.0]  [5.0, 3.0, 5.0, 4.0]

Approach #2 : Vectorized

If you have to vectorize, here's one way using matrix multiplication to replace the mean-reductions, which could bring improvements for large data (a toy illustration of the matrix-product trick follows the sample output) -

mask = ~df.isnull().values                 # True where a cell holds an array
v = np.vstack(df.values[mask])             # stack all arrays into one 2D block
r, c = np.where(mask)                      # row/column position of each array
n = df.shape[1]
pos_mask = c == np.arange(n)[:, None]      # row i of the mask selects column i
out = pos_mask.dot(v)/np.bincount(c).astype(float)[:, None]   # sums / counts
df_out1 = pd.DataFrame(out)

Sample output -

In [328]: df_out1
Out[328]: 
     0    1    2    3
0  4.0  0.0  1.0  6.0
1  2.0  3.5  4.0  5.0
2  5.0  3.0  5.0  4.0
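
As an illustrative aside (not from the original answer), here's why the boolean matrix product computes those per-column sums: row i of pos_mask is True exactly where a stacked array came from column i, so the dot product adds up each column's arrays in one shot -

import numpy as np

v = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])                # three stacked "cell arrays"
c = np.array([0, 0, 1])                 # column each one came from
pos_mask = c == np.arange(2)[:, None]   # row i is True where c == i
pos_mask.dot(v)                         # [[4., 6.], [5., 6.]] -- per-column sums
pos_mask.dot(v) / np.bincount(c).astype(float)[:, None]   # -> the means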

Approach #3 : Vectorized one more

Making use of np.add.reduceat to get those mean-reductions -

mask = ~df.T.isnull().values                  # transposed: one row per column of df
v = np.vstack(df.values.T[mask])              # arrays stacked column-by-column
count = mask.sum(1)                           # number of valid arrays per column
out0 = np.add.reduceat(v, np.r_[0, count.cumsum()[:-1]])  # per-column sums
out = out0/count[:, None].astype(float)       # divide by counts to get means
df_out2 = pd.DataFrame(out)
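
For intuition (an illustrative aside, not from the original answer), np.add.reduceat sums contiguous row-slices starting at the given offsets, which is why the arrays are stacked column-by-column first -

import numpy as np

v = np.array([[1, 2],
              [3, 4],
              [5, 6]])
np.add.reduceat(v, [0, 2])   # -> [[4, 6], [5, 6]]: sums rows 0-1, then row 2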

1 Comment

Thanks a lot for your effort Divakar! There seem to be many solutions to the problem. I think I'll use piRSquared's answer; it's a bit shorter and pretty easy to understand.
