Given a multi-indexed pandas DataFrame whose cells contain NumPy arrays, how can I get the mean of each column for a given index level?

>>> import numpy as np
>>> import pandas as pd
>>> pd.__version__
'1.0.5'
>>> a = np.array(range(20)).reshape(-1,2)
>>> d = pd.concat([pd.DataFrame({(i%len(a)//2,i%2): {'a': np.array(v), 'b': np.array([4,4])}}).T for i, v in enumerate(a)])
>>> d
            a       b
0 0    [0, 1]  [4, 4]
  1    [2, 3]  [4, 4]
1 0    [4, 5]  [4, 4]
  1    [6, 7]  [4, 4]
2 0    [8, 9]  [4, 4]
  1  [10, 11]  [4, 4]
3 0  [12, 13]  [4, 4]
  1  [14, 15]  [4, 4]
4 0  [16, 17]  [4, 4]
  1  [18, 19]  [4, 4]
>>> d['a'].mean()
array([ 9., 10.])
>>> d['b'].mean()
array([4., 4.])

So far so good.

Problem

The problem appears when I call .mean() on all columns at once or on a given level of the index.

Calling .mean() on the DataFrame rather than on a d[<column>] Series returns only the mean of the first element of each NumPy array:

>>> d.mean()
a    9.0
b    4.0
Name: 0, dtype: float64

And it raises an error when a specific index level is requested:

>>> d.mean(level=0)
Traceback (most recent call last):
[ ... ]
pandas.core.base.DataError: No numeric types to aggregate
>>> d['a'].mean(level=1)
Traceback (most recent call last):
[ ... ]
pandas.core.base.DataError: No numeric types to aggregate

Expected output

>>> d.mean()
a   [9., 10.]
b    [4., 4.]
>>> d.mean(level=0)
          a       b
0    [1, 2]  [4, 4]
1    [5, 6]  [4, 4]
2   [9, 10]  [4, 4]
3  [13, 14]  [4, 4]
4  [17, 18]  [4, 4]

>>> d['a'].mean(level=1)
0    [8, 9]
1  [10, 11]

I know that pandas doesn't claim to handle NumPy arrays stored in cells very well, and this looks like a pandas bug to me. How can I work around it?

3 Answers

2

Here is an alternative way to generate the expected output:

Get multi-index level values:

# Sets are unordered; sorted lists give the .loc indexers a stable order
level_vals_0 = sorted(set(d.index.get_level_values(0)))
level_vals_1 = sorted(set(d.index.get_level_values(1)))

Generate output 1:

output = {
    'a': [d.loc[(level_vals_0, level_vals_1), 'a'].mean()],
    'b': [d.loc[(level_vals_0, level_vals_1), 'b'].mean()]
}

pd.DataFrame(output).T

Output 1:

a   [9.0, 10.0]
b   [4.0, 4.0]

Generate output 2:

output = {
    'a': [d.loc[i, 'a'].mean() for i in level_vals_0],
    'b': [d.loc[i, 'b'].mean() for i in level_vals_0]
}

pd.DataFrame(output)

Output:

              a           b
0    [1.0, 2.0]  [4.0, 4.0]
1    [5.0, 6.0]  [4.0, 4.0]
2   [9.0, 10.0]  [4.0, 4.0]
3  [13.0, 14.0]  [4.0, 4.0]
4  [17.0, 18.0]  [4.0, 4.0]

Generate output 3:

output = {
    'a': [d.loc[(level_vals_0, i), 'a'].mean() for i in level_vals_1],
    'b': [d.loc[(level_vals_0, i), 'b'].mean() for i in level_vals_1]
}

pd.DataFrame(output)

Output:

              a           b
0    [8.0, 9.0]  [4.0, 4.0]
1  [10.0, 11.0]  [4.0, 4.0]
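The three snippets above name the columns 'a' and 'b' by hand. A minimal sketch of the same idea that loops over whatever columns and level labels the frame has could look like the following. The helper name mean_by_level is hypothetical, the example frame is rebuilt with MultiIndex.from_product rather than the concat from the question, and np.stack plus np.mean is used here in place of Series.mean().

```python
import numpy as np
import pandas as pd

def mean_by_level(df, level):
    """Hypothetical helper: element-wise mean of array cells per label of one index level."""
    labels = df.index.unique(level=level)  # labels present in that level
    data = {col: [] for col in df.columns}
    for label in labels:
        block = df.xs(label, level=level)  # all rows carrying this label
        for col in df.columns:
            stacked = np.stack(block[col].tolist())  # shape (n_rows, array_len)
            data[col].append(stacked.mean(axis=0))
    return pd.DataFrame(data, index=labels)

# Same shape of data as in the question (an assumption: built differently here)
idx = pd.MultiIndex.from_product([range(5), range(2)])
values = np.arange(20).reshape(-1, 2)
d = pd.DataFrame(
    {"a": list(values), "b": [np.array([4, 4]) for _ in range(10)]},
    index=idx,
)

r0 = mean_by_level(d, 0)
r1 = mean_by_level(d, 1)
```

Because the labels come from the index itself, nothing in the helper depends on the number of rows or on the labels being consecutive integers.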

1

There are probably easier ways to do this in pandas, but here is what I came up with:

pd.DataFrame([d.iloc[:,i].mean() for i in range(2)], index = ["a","b"])

     0     1
a  9.0  10.0
b  4.0   4.0

pd.DataFrame([[d.iloc[range(2*i,2*i+2),j].mean() for i in range(5)] for j in range(2)], index = ["a","b"]).T

    a               b
0   [1.0, 2.0]      [4.0, 4.0]
1   [5.0, 6.0]      [4.0, 4.0]
2   [9.0, 10.0]     [4.0, 4.0]
3   [13.0, 14.0]    [4.0, 4.0]
4   [17.0, 18.0]    [4.0, 4.0]

pd.DataFrame([d.iloc[range(0,10,2),0].mean(), d.iloc[range(1,10,2),0].mean()])

      0     1
0   8.0   9.0
1  10.0  11.0
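The row positions above are hard-coded with range(...). One way to derive them from the index instead is sketched below, assuming every cell of the column holds an array of the same length. The function name positional_group_means is hypothetical, and the example frame is rebuilt with MultiIndex.from_product rather than the concat from the question.

```python
import numpy as np
import pandas as pd

def positional_group_means(series, level):
    """Hypothetical helper: group rows by one index level without hard-coded positions."""
    # codes[i] is the group number of row i; labels holds the distinct level values
    codes, labels = pd.factorize(series.index.get_level_values(level))
    stacked = np.stack(series.tolist())  # shape (n_rows, array_len)
    means = [stacked[codes == k].mean(axis=0) for k in range(len(labels))]
    # One row per label, one numeric column per array element
    return pd.DataFrame(means, index=labels)

idx = pd.MultiIndex.from_product([range(5), range(2)])
values = np.arange(20).reshape(-1, 2)
d = pd.DataFrame(
    {"a": list(values), "b": [np.array([4, 4]) for _ in range(10)]},
    index=idx,
)

a_by_level1 = positional_group_means(d["a"], 1)
a_by_level0 = positional_group_means(d["a"], 0)
```

pd.factorize assigns group numbers in order of first appearance, so the boolean mask codes == k plays the role that range(0,10,2) and range(1,10,2) play above.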

3 Comments

Thanks! However, your answer assumes a fixed number of rows in the DataFrame. I should have mentioned that I am looking for a size-agnostic solution. My actual index is much messier, and I can't simply regenerate it with range(some_number).
I would compute some_number for each case from the size of your data and then apply this method.
Yeah, I think I can iterate over d.index, but I expect that to be a big performance loss.
1

After some more head scratching, I decided to split the work into per-column Series, which behave well.

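def my_mean(df, level=None):
    """Mean of array-valued columns, optionally grouped by an index level."""
    if level is not None:
        # Aggregate column by column: Series.mean() averages the arrays element-wise
        return pd.DataFrame({
            col: {
                key: series.mean() for key, series in df[col].groupby(level=level)
            } for col in df.columns
        })
    # Transpose so that each row is a column name and each column an array element
    return pd.DataFrame({col: df[col].mean() for col in df.columns}).T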

The output is close enough to what I need:

>>> my_mean(d)
     0     1
a  9.0  10.0
b  4.0   4.0
>>> my_mean(d, 0)
              a           b
0    [1.0, 2.0]  [4.0, 4.0]
1    [5.0, 6.0]  [4.0, 4.0]
2   [9.0, 10.0]  [4.0, 4.0]
3  [13.0, 14.0]  [4.0, 4.0]
4  [17.0, 18.0]  [4.0, 4.0]
>>> my_mean(d, 1)
              a           b
0    [8.0, 9.0]  [4.0, 4.0]
1  [10.0, 11.0]  [4.0, 4.0]
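If the per-group Python loop becomes a bottleneck, one alternative is to expand each array-valued column into ordinary numeric columns once and then let pandas do the grouping. The following is a sketch under the assumption that all arrays within a column have the same length. The helper name expand_arrays is hypothetical, and the example frame is rebuilt with MultiIndex.from_product rather than the concat from the question.

```python
import numpy as np
import pandas as pd

def expand_arrays(df):
    """Hypothetical helper: turn each array-valued column into numeric sub-columns."""
    parts = {
        col: pd.DataFrame(np.stack(df[col].tolist()), index=df.index)
        for col in df.columns
    }
    # Resulting columns form a MultiIndex of (original column, element position)
    return pd.concat(parts, axis=1)

idx = pd.MultiIndex.from_product([range(5), range(2)])
values = np.arange(20).reshape(-1, 2)
d = pd.DataFrame(
    {"a": list(values), "b": [np.array([4, 4]) for _ in range(10)]},
    index=idx,
)

wide = expand_arrays(d)
overall = wide.mean()                    # mean over all rows
by_level0 = wide.groupby(level=0).mean() # mean per label of index level 0
by_level1 = wide.groupby(level=1).mean() # mean per label of index level 1
```

The result keeps one numeric column per array element, for example by_level1[("a", 0)] and by_level1[("a", 1)], instead of packing the means back into arrays.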

1 Comment

I think this is neat and tidy. Just note that the output of my_mean(d) differs slightly from the expected output in your question.
