
I have a pandas dataframe, for which one of the columns holds 2D numpy arrays corresponding to pixel data from grayscale images. These 2D numpy arrays have the shape (480, 640) or (490, 640). The dataframe has other columns containing other information. I then generate a csv file out of it through pandas' to_csv() function. Now my issue is: my 2D numpy arrays all appear as strings in my CSV, so how can I read them back and convert them into 2D numpy arrays again?

I know there are similar questions on StackOverflow, but I couldn't find any that really focuses on 2D numpy arrays. They seem to be mostly about 1D numpy arrays, and the solutions provided don't seem to work.

Any help is greatly appreciated.

UPDATE:

As requested, I am adding some code below to clarify what my problem is.

import sys

import cv2
import numpy as np

# Function to switch images to grayscale format
def grayscale(img):
  return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Iterating through my dataframe (called data), reading all image files, making them grayscale and then adding them to my collection.
grayscale_images = []
for index, row in data.iterrows():
  img_path = row['Image path']
  cv_image = cv2.imread(img_path)
  gray = grayscale(cv_image)
  grayscale_images.append(gray)

# Make numpy array elements show without truncation
np.set_printoptions(threshold=sys.maxsize)

# Adding a new column to the dataframe containing each image's numpy array corresponding to pixels
data['Image data'] = grayscale_images

So when I'm done doing that and other operations on other columns, I export my dataframe to CSV like this:

data.to_csv('new_dataset.csv', index=False)

In a different Jupyter notebook, I try to read my CSV file and then extract my image's numpy arrays to feed them to a convolutional neural network as input, as part of supervised training.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sys
import re

data = pd.read_csv('new_dataset.csv')
# data.head() -- It looks fine here

# Config to make numpy arrays display in their entirety without truncation
np.set_printoptions(threshold=sys.maxsize)

# Checking if I can extract a 2D numpy array for conversion from a cell.
# That's where I notice it's a string, and I'm having trouble turning it back to a 2D numpy array
image_arr = data.iloc[0,0]

But I'm stuck converting the string representation in my CSV file back into a 2D numpy array, especially one with the original shape, e.g. (490, 640), that it had before I exported the dataframe to CSV.

  • What is the reason for storing the dataframe as a CSV file? Will it be read by another program that requires a CSV input? If not, I suggest using pickle. Commented Jan 6, 2020 at 22:26
  • @DYZ I will be reading from the CSV (as a dataset) in a TensorFlow model, because I'm creating a convolutional neural network using Keras to classify the images. Do you still recommend pickle? Commented Jan 6, 2020 at 22:29
  • If your CSV file is simply temporary storage, then I recommend using pickle. Commented Jan 6, 2020 at 22:30
  • @DYZ Actually I wish to share it with other colleagues as well, and it's not really temporary storage. I guess that's where I'm undecided. Commented Jan 6, 2020 at 22:36
  • You can share your pickle files with your colleagues as well. As long as you do not plan to feed your CSV files into third-party software that is capable of recognizing numpy arrays stored as strings, there is no point in using CSV. Commented Jan 7, 2020 at 0:19

3 Answers


Construct a csv with array strings:

In [385]: arr = np.empty(1, object)                                             
In [386]: arr[0]=np.arange(12).reshape(3,4)                                     
In [387]: S = pd.Series(arr,name='x')                                           
In [388]: S                                                                     
Out[388]: 
0    [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
Name: x, dtype: object
In [389]: S.to_csv('series.csv')                                                
/usr/local/bin/ipython3:1: FutureWarning: The signature of `Series.to_csv` was aligned to that of `DataFrame.to_csv`, and argument 'header' will change its default value from False to True: please pass an explicit value to suppress this warning.
  #!/usr/bin/python3
In [390]: cat series.csv                                                        
0,"[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]"

load it:

In [391]: df = pd.read_csv('series.csv',header=None)                            
In [392]: df                                                                    
Out[392]: 
   0                                                1
0  0  [[ 0  1  2  3]\n [ 4  5  6  7]\n [ 8  9 10 11]]

In [394]: astr=df[1][0]                                                         
In [395]: astr                                                                  
Out[395]: '[[ 0  1  2  3]\n [ 4  5  6  7]\n [ 8  9 10 11]]'

parse the string representation of the array:

In [396]: astr.split('\n')                                                      
Out[396]: ['[[ 0  1  2  3]', ' [ 4  5  6  7]', ' [ 8  9 10 11]]']

In [398]: astr.replace('[','').replace(']','').split('\n')                      
Out[398]: [' 0  1  2  3', '  4  5  6  7', '  8  9 10 11']
In [399]: [i.split() for i in _]                                                
Out[399]: [['0', '1', '2', '3'], ['4', '5', '6', '7'], ['8', '9', '10', '11']]
In [400]: np.array(_, int)                                                      
Out[400]: 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

No guarantee that this is the prettiest or cleanest parsing, but it gives an idea of the work you have to do. I'm reinventing the wheel here, but searching for a duplicate was taking too long.
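
Collected into one helper, a rough sketch of the same steps (the 'Image data' column name and the integer dtype are assumptions on my part), which you could then apply to a whole column with Series.apply():

import numpy as np

# Parse numpy's printed form of a 2D array, e.g. "[[ 0  1  2  3]\n [ 4 ...]]",
# back into an ndarray. Assumes the string was written without '...' truncation.
def parse_2d_array(astr, dtype=int):
    rows = astr.replace('[', '').replace(']', '').splitlines()
    return np.array([row.split() for row in rows], dtype=dtype)

# e.g. data['Image data'] = data['Image data'].apply(parse_2d_array)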

If possible, try to avoid saving such a dataframe as csv. The csv format is meant for a clean 2d table: simple, consistent columns separated by a delimiter.

And for the most part avoid dataframes/series like this. A Series can have object dtype. And each object element can be complex, such as a list, dictionary, or array. But I don't think pandas has special functions to handle those cases.

numpy also has object dtypes (as my arr), but a list is often just as good, if not better. Constructing such an array can be tricky. Math on such an array is hit or miss. Iteration on an object array is slower than iteration on a list.
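
For comparison, the pickle route suggested in the comments under the question keeps the embedded arrays intact with no string parsing at all; a minimal sketch:

import numpy as np
import pandas as pd

# Same object-dtype setup as above.
arr = np.empty(1, object)
arr[0] = np.arange(12).reshape(3, 4)
S = pd.Series(arr, name='x')

# pickle preserves the nested arrays exactly.
S.to_pickle('series.pkl')
restored = pd.read_pickle('series.pkl')
print(restored[0].shape)  # (3, 4)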

===

re might work as well. For example replacing whitespace with comma:

In [408]: re.sub('\s+',',',astr)                                                
Out[408]: '[[,0,1,2,3],[,4,5,6,7],[,8,9,10,11]]'

Still not quite right. There are leading commas that will choke eval.
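
One way to make it work (a sketch, not part of the original attempt) is to strip the whitespace that follows each opening bracket first, then substitute commas and parse with ast.literal_eval instead of eval:

import ast
import re

import numpy as np

astr = '[[ 0  1  2  3]\n [ 4  5  6  7]\n [ 8  9 10 11]]'

# Drop whitespace right after '[' so no leading commas appear,
# then turn the remaining whitespace runs into commas.
cleaned = re.sub(r'\s+', ',', re.sub(r'\[\s+', '[', astr))
arr = np.array(ast.literal_eval(cleaned))
print(arr.shape)  # (3, 4)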


1 Comment

Your detailed answer largely contained the solution for my issue. I just made a few tweaks, but thanks a lot for this! I was able to combine everything in a single function after making changes to the code, and run it on the "Image data" column of my dataframe, using the apply() function in Pandas. It's all good now; all image data strings are now converted to 2D numpy arrays.

data = pd.read_csv('new_dataset.csv')

Method 1: data.values

Method 2: data.to_numpy()

If data is a plain 2D DataFrame, then either of the above two methods will give you a 2D numpy array. Have a try!


Here is a demo:

df = pd.DataFrame(data={"A": [np.random.randn(480, 640), np.random.randn(490, 640)], "B": np.arange(5, 7)})

print(type(df.to_numpy()[0, 0]))  # <class 'numpy.ndarray'>
print(df.to_numpy()[0, 0].shape)  # (480, 640)

print(type(df.to_numpy()[1, 0]))  # <class 'numpy.ndarray'>
print(df.to_numpy()[1, 0].shape)  # (490, 640)

I'm going to work in a while; you can try it first and ask again if you have any questions.

10 Comments

@Isaac Asante It happens that I am familiar with the work you are doing; data.values or data.to_numpy() will give what you need.
I’m not sure I understand how this helps OP.
He just hopes to convert the DataFrame read from the previously-stored CSV into numpy, which is often done in academic work related to machine learning.
No, that’s not what he’s trying to do. Have you read the post?
@AyiF Hmm... thanks, but sorry, that does not solve my issue. I do get numpy arrays returned, but they're arrays of strings, and the shape is wrong. They also contain \n characters, and so on.

Add two columns to the data dataframe: the grayscale image converted to bytes using the ndarray tostring() method, and the original shape.

grayscale_images = []
grayscale_shapes = []

for index, row in data.iterrows():
  img_path = row['Image path']
  cv_image = cv2.imread(img_path)
  gray = grayscale(cv_image)
  grayscale_images.append(gray.tostring())
  grayscale_shapes.append(gray.shape)

Read the CSV, then recover the 2D np array using np.fromstring() and set the correct shape.

  imagedata = np.fromstring(df.loc[...], dtype=np.uint8)  # index the image cell; grayscale pixels are uint8
  imagedata.shape = df.loc[...]                           # index the corresponding shape
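
A minimal sketch of that round trip. Since raw bytes don't survive a CSV cell cleanly, this version encodes them as base64 text (an addition to the idea above) and uses tobytes()/frombuffer(), the non-deprecated counterparts of tostring()/fromstring(); the column names are assumptions:

import base64

import numpy as np
import pandas as pd

# Encode a grayscale image as CSV-safe text plus its shape.
def encode_image(gray):
    return base64.b64encode(gray.tobytes()).decode('ascii'), str(gray.shape)

# Rebuild the 2D uint8 array from the stored text and shape string, e.g. "(480, 640)".
def decode_image(b64_text, shape_text):
    shape = tuple(int(n) for n in shape_text.strip('()').split(','))
    return np.frombuffer(base64.b64decode(b64_text), dtype=np.uint8).reshape(shape)

# Round trip with a dummy 480x640 grayscale image.
gray = np.random.randint(0, 256, size=(480, 640), dtype=np.uint8)
pixels, shape = encode_image(gray)
pd.DataFrame({'Image data': [pixels], 'Image shape': [shape]}).to_csv('images.csv', index=False)

df = pd.read_csv('images.csv')
restored = decode_image(df.loc[0, 'Image data'], df.loc[0, 'Image shape'])
assert np.array_equal(restored, gray)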

Comments
