Element wise mean of numpy arrays of different sizes

Question

I'm reading a CSV file and focusing on col3, whose rows hold lists of different lengths. The column was initially read as type str, which I fixed using pd.eval.

df = pd.read_csv('datafile.csv', converters={'col3': pd.eval})


An example row: [0, 100, -200, 300, -150...]

There are many rows of different sizes, and I want to calculate the element-wise average, for which I followed this solution. I first ran into the NumPy VisibleDeprecationWarning, which I fixed using this. But at the last step of the solution, which uses np.nanmean, I'm running into a new error:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

My code looks like this so far:

import pandas as pd
import numpy as np
import itertools 

df = pd.read_csv('datafile.csv', converters={'col3': pd.eval})

datafile = df[(df['col1'] == 'Red') & (df['col2'] == Name) & ((df['col4'] == 'EX') | (df['col5'] == 'EX'))]
   
np.warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning) 
ar = np.array(list(itertools.zip_longest(df['col3'], fillvalue=np.nan)))
print(ar)
np.nanmean(ar,axis=1)

the arrays print like this

The error points to the last line.

As far as I can see, the error is caused by the arrays being of type object, but I'm not sure how to fix it.

The warning that you chose to ignore is telling you that you have a 'ragged array', which will be object dtype. It is not a normal multidimensional array. Check the shape; it is probably 1d. np.nanmean works on a float array, replacing the nan with 0s. It can't operate on your array.
– hpaulj, Commented Jan 22, 2023 at 19:36
Despite your use of zip_longest, it looks like your element arrays differ in length. Try [a.shape for a in ar] to see if that's true. Ignoring the warning does not force it to make a numeric dtype array. The warning tells you to explicitly specify dtype=object.
– hpaulj, Commented Jan 22, 2023 at 19:38
I checked using len(a) for a in ar, since shape doesn't work on a tuple, and every length was 1.
– ursula, Commented Jan 22, 2023 at 19:46
How would I create a float array? Do I have to change the way I read my CSV file, or is it something I add afterwards?
– ursula, Commented Jan 22, 2023 at 19:47

hpaulj · Accepted Answer · 2023-01-22 19:50:44Z

Make a ragged array:

In [23]: arr = np.array([np.arange(5), np.ones(5),np.zeros(3)],object)
In [24]: arr
Out[24]: 
array([array([0, 1, 2, 3, 4]), array([1., 1., 1., 1., 1.]),
       array([0., 0., 0.])], dtype=object)

Note the shape and dtype.
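
As a minimal sketch of that check outside the IPython session, the snippet below rebuilds the same ragged array and inspects it. The array is 1d with one slot per inner array, and the inner arrays keep their own differing shapes.

```python
import numpy as np

# Same ragged input as above: three 1-D arrays of lengths 5, 5 and 3.
arr = np.array([np.arange(5), np.ones(5), np.zeros(3)], dtype=object)

print(arr.shape)                # (3,)  -> 1-D, not a 2-D numeric array
print(arr.dtype)                # object
print([a.shape for a in arr])   # shapes of the inner arrays differ
```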

Try to use mean on it:

In [25]: np.mean(arr)
Traceback (most recent call last):
  Input In [25] in <cell line: 1>
    np.mean(arr)
  File <__array_function__ internals>:180 in mean
  File /usr/local/lib/python3.10/dist-packages/numpy/core/fromnumeric.py:3432 in mean
    return _methods._mean(a, axis=axis, dtype=dtype,
  File /usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:180 in _mean
    ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
ValueError: operands could not be broadcast together with shapes (5,) (3,)

Applying mean to each element array works:

In [26]: [np.mean(a) for a in arr]
Out[26]: [2.0, 1.0, 0.0]

Trying to use zip_longest:

In [27]: import itertools
In [28]: list(itertools.zip_longest(arr))
Out[28]: 
[(array([0, 1, 2, 3, 4]),),
 (array([1., 1., 1., 1., 1.]),),
 (array([0., 0., 0.]),)]

No change. We can use it by unpacking the arr - but it has padded the arrays in the wrong way:

In [29]: list(itertools.zip_longest(*arr))
Out[29]: [(0, 1.0, 0.0), (1, 1.0, 0.0), (2, 1.0, 0.0), (3, 1.0, None), (4, 1.0, None)]

zip_longest can be used to pad lists, but it takes more thought than this.

If we make an array from that list:

In [35]: np.array(list(itertools.zip_longest(*arr,fillvalue=np.nan)))
Out[35]: 
array([[ 0.,  1.,  0.],
       [ 1.,  1.,  0.],
       [ 2.,  1.,  0.],
       [ 3.,  1., nan],
       [ 4.,  1., nan]])

and transpose it, we can take the nanmean:

In [39]: np.array(list(itertools.zip_longest(*arr,fillvalue=np.nan))).T
Out[39]: 
array([[ 0.,  1.,  2.,  3.,  4.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 0.,  0.,  0., nan, nan]])
In [40]: np.nanmean(_, axis=1)
Out[40]: array([2., 1., 0.])
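
The padding step can also be written without zip_longest. The following is only a sketch under that assumption: pad_ragged is a hypothetical helper name (not a NumPy function) that preallocates a NaN-filled float array and copies each row into it, which avoids the transpose altogether.

```python
import numpy as np

def pad_ragged(rows, fill=np.nan):
    """Hypothetical helper: stack 1-D sequences into a 2-D float array,
    padding the shorter rows on the right with `fill`."""
    rows = [np.asarray(r, dtype=float) for r in rows]
    width = max((len(r) for r in rows), default=0)
    out = np.full((len(rows), width), fill, dtype=float)
    for i, r in enumerate(rows):
        out[i, :len(r)] = r   # copy the real values, leave the rest as fill
    return out

rows = [np.arange(5), np.ones(5), np.zeros(3)]
padded = pad_ragged(rows)                # shape (3, 5), float dtype
row_means = np.nanmean(padded, axis=1)   # one mean per original array
print(row_means)
```

Because the result is a regular float array, np.nanmean can skip the padding values on whichever axis is requested.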

Thanks for the help and the very thorough explanation. I was confused because the values didn't match what I had in Excel, but that was because I transposed it. If I skip the transposition, I get what I want, since I want the average across the first elements of all arrays, then the second elements, and so forth.
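
The variant described in this comment can be sketched as follows, reusing the sample arrays from the answer rather than the real col3 data. Without the transpose, each row of the padded array holds one element position, so axis=1 averages the first elements of all arrays, then the second elements, and so on.

```python
import itertools
import numpy as np

arr = [np.arange(5), np.ones(5), np.zeros(3)]

# Without .T: one row per element position, one column per input array.
by_position = np.array(list(itertools.zip_longest(*arr, fillvalue=np.nan)))

# Mean over each position, ignoring the NaN padding.
elementwise_mean = np.nanmean(by_position, axis=1)

# Equivalent: keep the transpose but average down the columns instead.
same_result = np.nanmean(by_position.T, axis=0)

print(elementwise_mean)
```

For the DataFrame in the question, the same pattern would presumably take *df['col3'] in place of *arr, assuming every entry of that column is a list or 1-D array of numbers.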
