Pandas - Perform mean() on a MultiIndexed DataFrame with numpy arrays

Question

Given a MultiIndexed pandas DataFrame whose cells contain NumPy arrays, how can I get the mean of each column for a given index level?

>>> import numpy as np
>>> import pandas as pd
>>> pd.__version__
'1.0.5'
>>> a = np.array(range(20)).reshape(-1,2)
>>> d = pd.concat([pd.DataFrame({(i%len(a)//2,i%2): {'a': np.array(v), 'b': np.array([4,4])}}).T for i, v in enumerate(a)])
>>> d
            a       b
0 0    [0, 1]  [4, 4]
  1    [2, 3]  [4, 4]
1 0    [4, 5]  [4, 4]
  1    [6, 7]  [4, 4]
2 0    [8, 9]  [4, 4]
  1  [10, 11]  [4, 4]
3 0  [12, 13]  [4, 4]
  1  [14, 15]  [4, 4]
4 0  [16, 17]  [4, 4]
  1  [18, 19]  [4, 4]
>>> d['a'].mean()
array([ 9., 10.])
>>> d['b'].mean()
array([4., 4.])

So far so good.

Problem

The problem comes when I want to perform .mean() on all columns or on a given level of the index.

Calling mean() on the DataFrame instead of on a d[<column>] Series only returns the mean of the first element of each NumPy array:

>>> d.mean()
a    9.0
b    4.0
Name: 0, dtype: float64

And specifying an index level raises an error:

>>> d.mean(level=0)
Traceback (most recent call last):
[ ... ]
pandas.core.base.DataError: No numeric types to aggregate
>>> d['a'].mean(level=1)
Traceback (most recent call last):
[ ... ]
pandas.core.base.DataError: No numeric types to aggregate

Expected output

>>> d.mean()
a   [9., 10.]
b    [4., 4.]
>>> d.mean(level=0)
          a       b
0    [1, 2]  [4, 4]
1    [5, 6]  [4, 4]
2   [9, 10]  [4, 4]
3  [13, 14]  [4, 4]
4  [17, 18]  [4, 4]

>>> d['a'].mean(level=1)
0    [8, 9]
1  [10, 11]

I know that pandas doesn't claim to handle NumPy arrays inside cells very well, but this looks like a pandas bug to me. How can I work around it?

nimbous · Accepted Answer · 2020-07-21 13:02:04Z

Here is an alternative way to generate the expected output.

Get multi-index level values:

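# unique() keeps the values in order of first appearance; a set has no
# guaranteed order and is not a reliable indexer for .loc
level_vals_0 = list(d.index.get_level_values(0).unique())
level_vals_1 = list(d.index.get_level_values(1).unique())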

Generate output 1:

output = {
    'a': [d.loc[(level_vals_0, level_vals_1), 'a'].mean()],
    'b': [d.loc[(level_vals_0, level_vals_1), 'b'].mean()]
}

pd.DataFrame(output).T

Output 1:

a   [9.0, 10.0]
b   [4.0, 4.0]

Generate output 2:

output = {
    'a': [d.loc[i, 'a'].mean() for i in level_vals_0],
    'b': [d.loc[i, 'b'].mean() for i in level_vals_0]
}

pd.DataFrame(output)

Output 2:

              a           b
0    [1.0, 2.0]  [4.0, 4.0]
1    [5.0, 6.0]  [4.0, 4.0]
2   [9.0, 10.0]  [4.0, 4.0]
3  [13.0, 14.0]  [4.0, 4.0]
4  [17.0, 18.0]  [4.0, 4.0]

Generate output 3:

output = {
    'a': [d.loc[(level_vals_0, i), 'a'].mean() for i in level_vals_1],
    'b': [d.loc[(level_vals_0, i), 'b'].mean() for i in level_vals_1]
}

pd.DataFrame(output)

Output 3:

              a           b
0    [8.0, 9.0]  [4.0, 4.0]
1  [10.0, 11.0]  [4.0, 4.0]
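
The list comprehensions above label the result rows by position rather than by the actual level values. A minimal sketch of one way to keep the level values as the result index follows. The helper name mean_by_level is hypothetical, the frame is rebuilt with MultiIndex.from_product for brevity, and the sketch assumes that all arrays in a column share the same shape so that np.stack can combine them.

```python
import numpy as np
import pandas as pd

# Rebuild the question's frame: a 5 x 2 MultiIndex, each cell holds a length-2 array.
index = pd.MultiIndex.from_product([range(5), range(2)])
values = np.arange(20).reshape(-1, 2)
d = pd.DataFrame(
    {"a": list(values), "b": [np.array([4, 4]) for _ in range(len(values))]},
    index=index,
)


def mean_by_level(frame, level):
    """Hypothetical helper: per-column array means for each value of one index level."""
    # unique() keeps the order of first appearance.
    keys = frame.index.get_level_values(level).unique()
    columns = {}
    for col in frame.columns:
        means = [
            # Stack the group's arrays into a 2-D block and average down the rows.
            np.stack(frame[col].xs(key, level=level).tolist()).mean(axis=0)
            for key in keys
        ]
        # The Series is indexed by the level values themselves.
        columns[col] = pd.Series(means, index=keys)
    return pd.DataFrame(columns)


by_level_0 = mean_by_level(d, 0)
by_level_1 = mean_by_level(d, 1)
```

by_level_0 and by_level_1 hold the same values as outputs 2 and 3, with the row labels taken from the index instead of from a default range.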

Let's try · Accepted Answer · 2020-07-21 10:30:45Z

There are probably easier ways to achieve this with pandas, but this is what I came up with:

pd.DataFrame([d.iloc[:,i].mean() for i in range(2)], index = ["a","b"])

      0     1
a   9.0  10.0
b   4.0   4.0

pd.DataFrame([[d.iloc[range(2*i,2*i+2),j].mean() for i in range(5)] for j in range(2)], index = ["a","b"]).T

    a               b
0   [1.0, 2.0]      [4.0, 4.0]
1   [5.0, 6.0]      [4.0, 4.0]
2   [9.0, 10.0]     [4.0, 4.0]
3   [13.0, 14.0]    [4.0, 4.0]
4   [17.0, 18.0]    [4.0, 4.0]

pd.DataFrame([d.iloc[range(0,10,2),0].mean(), d.iloc[range(1,10,2),0].mean()])

      0     1
0   8.0   9.0
1  10.0  11.0

3 Comments

AlexLoss Over a year ago

Thanks! However, your reply assumes a fixed number of rows in the DataFrame. Perhaps I should have mentioned that I am looking for a size-agnostic solution. My actual index is a lot messier, and I can't simply regenerate it with range(some_number).

Let's try Over a year ago

I would calculate some_number for each case from the size of your data and, once you have it, apply this method.

AlexLoss Over a year ago

Yeah, I think I can iterate over d.index, but I think that would be a big performance loss.
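
One possible way to stay size-agnostic without iterating over d.index in Python is to unpack the arrays into ordinary numeric columns first and then let groupby do the work. This is only a sketch: the helper name expand_arrays is hypothetical, the frame is rebuilt with MultiIndex.from_product for brevity, and it assumes every array in a given column has the same length.

```python
import numpy as np
import pandas as pd

# Rebuild the question's frame: a 5 x 2 MultiIndex, each cell holds a length-2 array.
index = pd.MultiIndex.from_product([range(5), range(2)])
values = np.arange(20).reshape(-1, 2)
d = pd.DataFrame(
    {"a": list(values), "b": [np.array([4, 4]) for _ in range(len(values))]},
    index=index,
)


def expand_arrays(frame):
    """Hypothetical helper: one numeric column per array element, keyed (column, position)."""
    parts = {
        col: pd.DataFrame(np.stack(frame[col].tolist()), index=frame.index)
        for col in frame.columns
    }
    # Passing a dict to concat makes its keys the outer level of the column MultiIndex.
    return pd.concat(parts, axis=1)


flat = expand_arrays(d)
overall = flat.mean()                      # Series indexed by (column, position)
by_level_0 = flat.groupby(level=0).mean()  # one row per value of index level 0
by_level_1 = flat.groupby(level=1).mean()  # one row per value of index level 1
```

The results are plain float columns rather than arrays in cells; by_level_0["a"].to_numpy() returns the means for column a as a 2-D array with one row per level value.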

AlexLoss · Accepted Answer · 2020-07-21 12:42:45Z

After some more head scratching, I decided to split the work into Series, which behave well.

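def my_mean(df, level=None):
    if level is not None:
        # One Series.mean() per (column, group): {column: {group key: mean array}}
        return pd.DataFrame({
            col: {
                key: series.mean() for key, series in df[col].groupby(level=level)
            } for col in df.columns.values
        })
    else:
        # Transpose so that each row corresponds to one column of df
        return pd.DataFrame({col: df[col].mean() for col in df.columns.values}).T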

The output is close enough for what I need:

>>> my_mean(d)
     0     1
a  9.0  10.0
b  4.0   4.0
>>> my_mean(d, 0)
              a           b
0    [1.0, 2.0]  [4.0, 4.0]
1    [5.0, 6.0]  [4.0, 4.0]
2   [9.0, 10.0]  [4.0, 4.0]
3  [13.0, 14.0]  [4.0, 4.0]
4  [17.0, 18.0]  [4.0, 4.0]
>>> my_mean(d, 1)
              a           b
0    [8.0, 9.0]  [4.0, 4.0]
1  [10.0, 11.0]  [4.0, 4.0]

1 Comment

nimbous Over a year ago

I think this is neat and tidy. Just a note that my_mean(d) produces a slightly different output compared to the expected output in your question.
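
If the expected d.mean() layout from the question is wanted (one array per column rather than one numeric column per array element), a variant along these lines might do. It is a sketch, not the answer's method: column_means is a hypothetical name, the frame is rebuilt with MultiIndex.from_product for brevity, and np.stack is used on the assumption that all arrays within a column share one shape.

```python
import numpy as np
import pandas as pd

# Rebuild the question's frame: a 5 x 2 MultiIndex, each cell holds a length-2 array.
index = pd.MultiIndex.from_product([range(5), range(2)])
values = np.arange(20).reshape(-1, 2)
d = pd.DataFrame(
    {"a": list(values), "b": [np.array([4, 4]) for _ in range(len(values))]},
    index=index,
)


def column_means(frame):
    """Hypothetical helper: a Series holding one mean array per column."""
    means = [np.stack(frame[col].tolist()).mean(axis=0) for col in frame.columns]
    # A list of arrays becomes an object Series with one array per entry.
    return pd.Series(means, index=frame.columns)


result = column_means(d)
```

Here result["a"] and result["b"] are NumPy arrays, so the Series has object dtype like the original columns.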
