
I have a dataset of about 22,000 images (roughly 900 MB in total) that I want to import into Python to train a CNN.

I use the following code to import the images and save them all in an array called X:

    import numpy as np
    import scipy.misc as sm  # note: scipy.misc.imread is deprecated; imageio.imread replaces it

    for i in range(start, end):
        imageLink = "./dataSet/" + str(dataSet[i, 0]) + "/" + str(dataSet[i, 1])
        image = sm.imread(imageLink)
        X = np.append(X, image, axis=0)

There are a few issues with this:

  1. It's incredibly slow. Importing about 1,000 images takes roughly 30 minutes, and it gets slower as the number of images grows.

  2. It takes up a lot of RAM. Importing about 2,000 images takes about 16 GB of RAM (my machine has only 16 GB, so I end up using swap memory, which I suppose makes it even slower).

The images are all sized 640 × 480.

Am I doing something wrong, or is this normal? Is there a better/faster way to import images?

Thank you.

4 Comments

  • Hmm, appending to an array/list is not fast and can take a memory hit: np.append actually copies the whole array and adds the new element at the end. Since you already know the size of the array (you know how many images you are going to consume), preallocate it (np.zeros() or np.empty()) and use an index to place each image in it. That should speed up your loop and may help with the memory issues too. Commented Dec 24, 2017 at 14:08
  • I see. I did not know that. I will make the changes. Thank you! Commented Dec 24, 2017 at 14:09
  • Let us know how much of an impact the change is... Commented Dec 24, 2017 at 14:24
  • Alright, will do! Commented Dec 24, 2017 at 14:26
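The preallocation suggested in the comments can be sketched as follows. This is a minimal, self-contained illustration: the synthetic frames, the `n_images` count, and the assumed 640 × 480 RGB uint8 shape stand in for the real `sm.imread` calls and dataset.

```python
import numpy as np

# Synthetic stand-ins for the real images (assumption: 640x480 RGB, uint8).
n_images = 10
height, width = 480, 640

# Preallocate the full array once instead of copying X on every append.
X = np.empty((n_images, height, width, 3), dtype=np.uint8)

for i in range(n_images):
    # In the real loop this would be image = sm.imread(imageLink);
    # here we fabricate a frame filled with the value i.
    image = np.full((height, width, 3), i, dtype=np.uint8)
    X[i] = image  # write into the preallocated slot; no copy of X is made

print(X.shape)        # (10, 480, 640, 3)
print(X.nbytes / 1e6) # 9.216 MB for 10 uint8 frames
```

Writing into a preallocated slot is O(frame size) per iteration, whereas `np.append` copies the entire accumulated array each time, giving quadratic total cost.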

1 Answer


Here are some general recommendations for this type of task:

  1. Upgrade to a fast SSD, if you haven't already. No matter what the processing does, fast hardware is crucial.
  2. Don't load the whole data set into memory. Build a batching mechanism that loads e.g. 100 files at a time, processes them, and frees the memory for the next batch.
  3. Use a second thread to build the next batch while the first one is being processed.
  4. Introduce a separate pre-processing step that converts the JPEG images read through imread into NumPy arrays with all required normalization applied, and store those arrays on disk so that your main training process only needs to read them back with numpy.fromfile().
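The batching idea in point 2 can be sketched with a simple generator. The `batch_paths` helper and the fake file list are assumptions for illustration; in practice the paths would come from the real dataset directory.

```python
import numpy as np

def batch_paths(paths, batch_size):
    """Yield successive slices of `paths`, batch_size items at a time."""
    for i in range(0, len(paths), batch_size):
        yield paths[i:i + batch_size]

# Hypothetical file list; the real one would hold ~22,000 image paths.
all_paths = ["img_{}.jpg".format(i) for i in range(7)]

batches = list(batch_paths(all_paths, 3))
print([len(b) for b in batches])  # [3, 3, 1]
```

Each batch can then be loaded, fed to the training step, and discarded, so peak memory is bounded by the batch size rather than the dataset size.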
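Point 4 can be illustrated with a round trip through `ndarray.tofile` / `numpy.fromfile`. The random "image", the temp directory, and the float32 normalization are assumptions; `fromfile` stores no shape or dtype metadata, so both must be known at read time.

```python
import os
import tempfile
import numpy as np

shape = (480, 640, 3)
tmpdir = tempfile.mkdtemp()

# Pre-processing step: decode + normalize once, then dump raw bytes to disk.
# (A real pipeline would do this for every image, replacing the random data.)
fake_image = np.random.randint(0, 256, size=shape).astype(np.float32) / 255.0
path = os.path.join(tmpdir, "img_000.raw")
fake_image.tofile(path)  # raw binary dump; no JPEG decoding needed later

# Training step: read the raw bytes back with numpy.fromfile and restore shape.
restored = np.fromfile(path, dtype=np.float32).reshape(shape)
print(np.allclose(restored, fake_image))  # True
```

This moves the expensive decode/normalize work into a one-time offline pass, so the training loop only does fast sequential reads.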

1 Comment

Aaah I see. This is really helpful. Thank you so much!
