
I'm reading a CSV file and focusing on col3, whose rows hold lists of values of different lengths. The column was initially read as type str, which I fixed using pd.eval.

df = pd.read_csv('datafile.csv', converters={'col3': pd.eval})


An example row: [0, 100, -200, 300, -150...]

There are many rows of different sizes, and I want to calculate the element-wise average, for which I have followed this solution. I first ran into NumPy's VisibleDeprecationWarning, which I suppressed using this. But at the last step of the solution, the call to np.nanmean, I'm running into a new error:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

My code looks like this so far:

import pandas as pd
import numpy as np
import itertools 

df = pd.read_csv('datafile.csv', converters={'col3': pd.eval})

datafile = df[(df['col1'] == 'Red') & (df['col2'] == Name) & ((df['col4'] == 'EX') | (df['col5'] == 'EX'))]
   
np.warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning) 
ar = np.array(list(itertools.zip_longest(df['col3'], fillvalue=np.nan)))
print(ar)
np.nanmean(ar,axis=1)

The arrays print as shown in a screenshot (image not included).

The error points to the last line (traceback screenshot not included).

As far as I can see, the error points to the arrays being of type object, but I'm not sure how to fix it.

  • The warning that you choose to ignore is telling you that you have a 'ragged array', that will be object dtype. It is not a normal multidimensional array; Check the shape; it is probably 1d. np.nanmean works on a float array, replacing the nan with 0s. It can't operate on your array. Commented Jan 22, 2023 at 19:36
  • Despite your use of zip_longest, it looks like your element arrays differ in length. Try [a.shape for a in ar] to see if that's true. Ignoring the warning does not force it to make a numeric dtype array. The warning tells you to explicitly specify dtype=object. Commented Jan 22, 2023 at 19:38
  • I checked with [len(a) for a in ar], since shape doesn't work on a tuple, and every length was 1. Commented Jan 22, 2023 at 19:46
  • How would I create a float array? Do I have to change the way I read my csv file, or is it something I add afterwards? Commented Jan 22, 2023 at 19:47

1 Answer

Make a ragged array:

In [23]: arr = np.array([np.arange(5), np.ones(5),np.zeros(3)],object)
In [24]: arr
Out[24]: 
array([array([0, 1, 2, 3, 4]), array([1., 1., 1., 1., 1.]),
       array([0., 0., 0.])], dtype=object)

Note the shape and dtype.
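A quick way to confirm this is to inspect the outer array and its elements separately. This is a minimal sketch that rebuilds the same `arr` as above; the outer array is 1d with object dtype, and only the element arrays carry the real lengths:

```python
import numpy as np

# Same ragged array as in the session above
arr = np.array([np.arange(5), np.ones(5), np.zeros(3)], dtype=object)

print(arr.shape)                # (3,)  -> 1d, one slot per element array
print(arr.dtype)                # object
print([a.shape for a in arr])   # [(5,), (5,), (3,)]
```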

Try to use mean on it:

In [25]: np.mean(arr)
Traceback (most recent call last):
  Input In [25] in <cell line: 1>
    np.mean(arr)
  File <__array_function__ internals>:180 in mean
  File /usr/local/lib/python3.10/dist-packages/numpy/core/fromnumeric.py:3432 in mean
    return _methods._mean(a, axis=axis, dtype=dtype,
  File /usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:180 in _mean
    ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
ValueError: operands could not be broadcast together with shapes (5,) (3,) 

Applying mean to each element array works:

In [26]: [np.mean(a) for a in arr]
Out[26]: [2.0, 1.0, 0.0]

Trying to use zip_longest:

In [27]: import itertools
In [28]: list(itertools.zip_longest(arr))
Out[28]: 
[(array([0, 1, 2, 3, 4]),),
 (array([1., 1., 1., 1., 1.]),),
 (array([0., 0., 0.]),)]

No change. We can use it by unpacking the arr - but it has padded the arrays in the wrong way:

In [29]: list(itertools.zip_longest(*arr))
Out[29]: [(0, 1.0, 0.0), (1, 1.0, 0.0), (2, 1.0, 0.0), (3, 1.0, None), (4, 1.0, None)]

zip_longest can be used to pad lists, but it takes more thought than this.

If we make an array from that list:

In [35]: np.array(list(itertools.zip_longest(*arr,fillvalue=np.nan)))
Out[35]: 
array([[ 0.,  1.,  0.],
       [ 1.,  1.,  0.],
       [ 2.,  1.,  0.],
       [ 3.,  1., nan],
       [ 4.,  1., nan]])

and transpose it, we can take the nanmean:

In [39]: np.array(list(itertools.zip_longest(*arr,fillvalue=np.nan))).T
Out[39]: 
array([[ 0.,  1.,  2.,  3.,  4.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 0.,  0.,  0., nan, nan]])
In [40]: np.nanmean(_, axis=1)
Out[40]: array([2., 1., 0.])
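
The steps above can be collected into one small helper. This is only a sketch: `pad_rows` is a hypothetical name, not a NumPy or pandas function, and it assumes every row is a flat sequence of numbers. It pads with nan, forces a float dtype, and transposes so each input row is a row of the result. Which `axis` to pass to `np.nanmean` then depends on whether you want one mean per row or one mean per position across all rows:

```python
import itertools
import numpy as np

def pad_rows(rows, fill=np.nan):
    """Pad ragged rows with `fill` into a 2d float array, one row per input row.

    Hypothetical helper for illustration; assumes each row is a flat
    sequence of numbers.
    """
    # zip_longest(*rows) yields one tuple per position, padded with `fill`
    columns = list(itertools.zip_longest(*rows, fillvalue=fill))
    # Transpose so that the result has shape (n_rows, longest_row_length)
    return np.array(columns, dtype=float).T

rows = [[0, 1, 2, 3, 4], [1, 1, 1, 1, 1], [0, 0, 0]]
padded = pad_rows(rows)

row_means = np.nanmean(padded, axis=1)       # one mean per input row
position_means = np.nanmean(padded, axis=0)  # one mean per position, across rows

print(row_means)
print(position_means)
```

With a DataFrame column of lists, something like `pad_rows(datafile['col3'])` should work the same way, assuming the column really holds lists or arrays rather than strings.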

1 Comment

Thanks for the help and the very thorough explanation. I was confused because the values didn't match what I had in Excel, but that was because I had transposed the array. If I skip the transposition, I get what I want, since I'm after the average taken across the first element of all arrays, then the second, and so forth.
