
I have a dataframe where one of the columns is a numpy array:

 DF

      Name                     Vec
 0  Abenakiite-(Ce) [0.0, 0.0, 0.0, 0.0, 0.0, 0.043, 0.0, 0.478, 0...
 1  Abernathyite    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
 2  Abhurite        [0.176, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.235, 0...
 3  Abswurmbachite  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25, 0.0,...

When I check the data type of each element, the correct data type is returned.

 type(DF['Vec'].iloc[1])
 numpy.ndarray

I save this into a csv file:

DF.to_csv('.\\file.csv',sep='\t')

Now, when I read the file again,

new_DF=pd.read_csv('.\\file.csv',sep='\t')

and check the datatype of Vec at index 1:

type(new_DF['Vec'].iloc[1])   
str

The size of the numpy array is 1x127.

The data type has changed from a numpy array to a string. I can also see some newline characters inside the individual vectors. I think this happens when the vector is written into the csv, but I don't know how to fix it. Can someone please help?
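The behaviour is easy to reproduce in a few lines (a minimal sketch; the name and file name are just placeholders):

```python
import numpy as np
import pandas as pd

# A 1x127 vector column is a numpy array in memory,
# but comes back from CSV as its string representation.
df = pd.DataFrame({'Name': ['Abernathyite'], 'Vec': [np.zeros(127)]})
df.to_csv('file.csv', sep='\t')

new_df = pd.read_csv('file.csv', sep='\t')
print(type(df['Vec'].iloc[0]))      # <class 'numpy.ndarray'>
print(type(new_df['Vec'].iloc[0]))  # <class 'str'>
```

The string even contains newlines, because numpy wraps long array reprs at the line width, which explains the "new line elements" seen in the file.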

Thanks!

  • The information about data types is not saved into a CSV file. There is no way for the Pandas CSV reader to know that what you are reading used to be a NumPy array in a past life. You should either save the array separately as a .npy file or transform the string back into an array yourself. Commented Jun 19, 2018 at 17:54
  • You should use dtype in read_csv. It is mentioned in the documentation. Commented Jun 19, 2018 at 17:57
  • What else do you expect? CSV is a text file. The string format of an array, e.g. '[0 1 2]', is the only way it can write the 2nd column. It can't write some sort of binary representation of the array (except maybe using pickle.dumps). Look at the csv file (with any text viewer). Commented Jun 19, 2018 at 18:11
  • I changed the read_csv command to: new_DF=pd.read_csv('.\\file.csv',sep='\t',dtype={'Vec':np.ndarray}) However, the new error is: dtype <class 'numpy.ndarray'> not understood Commented Jun 19, 2018 at 18:15
  • dtype refers to the elements of an array, not the type of the array as a whole. I don't think read_csv can handle this type of input. It may be possible, though to process those strings after they are in the dataframe. Commented Jun 19, 2018 at 18:38
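The .npy alternative from the first comment can be sketched like this (the file and column names are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

# Store the vectors in a binary .npy file, which preserves dtype and
# shape exactly, and keep only the non-array columns in the CSV.
df = pd.DataFrame({'Name': ['Abernathyite', 'Abhurite'],
                   'Vec': [np.zeros(127), np.ones(127)]})

np.save('vecs.npy', np.stack(df['Vec'].tolist()))       # 2-D float array
df[['Name']].to_csv('names.csv', sep='\t', index=False)

# Reading back: the vectors come back as numpy.ndarray, not str.
vecs = np.load('vecs.npy')
names = pd.read_csv('names.csv', sep='\t')
restored = names.assign(Vec=list(vecs))
print(type(restored['Vec'].iloc[1]))  # <class 'numpy.ndarray'>
```

This sidesteps string parsing entirely, at the cost of keeping two files in sync.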

2 Answers


In the comments I made a mistake and said dtype instead of converters. What you want is to convert them as you read them using a function. With some dummy variables:

import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['name1', 'name2'],
                   'Vec': [np.array([1, 2]), np.array([3, 4])]})
df.to_csv('tmp.csv')

def converter(instr):
    # Strip the surrounding brackets, then parse the space-separated numbers
    return np.fromstring(instr[1:-1], sep=' ')

df1 = pd.read_csv('tmp.csv', converters={'Vec': converter})
df1.iloc[0, 2]
array([1., 2.])

3 Comments

Thank you! This totally worked. What is the last line, df1.iloc[0,2]? It returns 'name1'
It was just to show that the Vec column is converted to an array.
Hi, could you take a look at my very similar problem? I followed your answer but only received empty [] fields. Thanks stackoverflow.com/questions/60960170/…

The answer above works. If you get empty fields, add the string slicing [1:-1]!

This converts the string [-2.0797753, 3.6340227, -1.7011836]

to -2.0797753, 3.6340227, -1.7011836

which is the format np.fromstring expects: https://numpy.org/doc/stable/reference/generated/numpy.fromstring.html
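A quick check of the combined fix (using sep=',' here, since this example string is comma-separated, unlike the space-separated numpy repr in the accepted answer):

```python
import numpy as np

s = '[-2.0797753, 3.6340227, -1.7011836]'
# Without the [1:-1] slice, np.fromstring stops at the leading
# bracket and can return an empty array.
arr = np.fromstring(s[1:-1], sep=',')
print(arr)
```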

1 Comment

This does not provide an answer to the question. Once you have sufficient reputation you will be able to comment on any post; instead, provide answers that don't require clarification from the asker. - From Review
