
I have a dataset of about 22,000 images (roughly 900 MB in total) that I want to import into Python to train a CNN.

I use the following code to import the images and save them all in an array called X:

    import numpy as np
    import scipy.misc as sm  # note: scipy.misc.imread is deprecated; imageio.imread replaces it

    for i in range(start, end):
        imageLink = "./dataSet/" + str(dataSet[i, 0]) + "/" + str(dataSet[i, 1])
        image = sm.imread(imageLink)
        X = np.append(X, image, axis=0)

There are a few issues with this:

  1. It's incredibly slow. Importing about 1,000 images takes roughly 30 minutes, and it gets slower as the number of images grows.

  2. It takes up a lot of RAM. Importing about 2,000 images takes about 16 GB of RAM (my machine has only 16 GB, so I end up using swap memory, which I suppose makes it even slower).

The images are all sized 640 × 480.

Am I doing something wrong, or is this normal? Is there a better/faster way to import images?

Thank you.

4 Comments

  • Hmm, appending to an array/list is not fast and can take a memory hit: np.append actually copies the whole array and adds the new element at the end. Since you already know the size of the array (you know how many images you are going to consume), preallocate it (np.zeros() or np.empty()) and use an index to place each image in it. That should speed up your loop and may help with the memory issues too. Commented Dec 24, 2017 at 14:08
  • I see. I did not know that. I will make the changes. Thank you! Commented Dec 24, 2017 at 14:09
  • Let us know how much of an impact the change is... Commented Dec 24, 2017 at 14:24
  • Alright, will do! Commented Dec 24, 2017 at 14:26
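The preallocation suggested in the comments can be sketched as follows. This is a minimal, self-contained illustration: the synthetic frames, the `n_images` count, and the assumed 640 × 480 RGB uint8 shape stand in for the real `sm.imread` calls and dataset.

```python
import numpy as np

# Synthetic stand-ins for the real images (assumption: 640x480 RGB, uint8).
n_images = 10
height, width = 480, 640

# Preallocate the full array once instead of copying X on every append.
X = np.empty((n_images, height, width, 3), dtype=np.uint8)

for i in range(n_images):
    # In the real loop this would be image = sm.imread(imageLink);
    # here we fabricate a frame filled with the value i.
    image = np.full((height, width, 3), i, dtype=np.uint8)
    X[i] = image  # write into the preallocated slot; no copy of X is made

print(X.shape)        # (10, 480, 640, 3)
print(X.nbytes / 1e6) # 9.216 MB for 10 uint8 frames
```

Writing into a preallocated slot is O(frame size) per iteration, whereas `np.append` copies the entire accumulated array each time, giving quadratic total cost.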

1 Answer


Here are some general recommendations for this type of task:

  1. Upgrade to a fast SSD, if you haven't already. No matter what the processing does, fast hardware is crucial.
  2. Don't load the whole data set into memory. Build a batching mechanism that loads e.g. 100 files at a time, processes them, and frees the memory for the next batch.
  3. Use a second thread to build the next batch while the first one is being processed.
  4. Introduce a separate pre-processing step that converts the JPEG images read through imread into NumPy arrays with all required normalization applied, and store those arrays on disk so that your main training process only needs to read them back with numpy.fromfile().
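The batching idea in point 2 can be sketched with a simple generator. The `batch_paths` helper and the fake file list are assumptions for illustration; in practice the paths would come from the real dataset directory.

```python
import numpy as np

def batch_paths(paths, batch_size):
    """Yield successive slices of `paths`, batch_size items at a time."""
    for i in range(0, len(paths), batch_size):
        yield paths[i:i + batch_size]

# Hypothetical file list; the real one would hold ~22,000 image paths.
all_paths = ["img_{}.jpg".format(i) for i in range(7)]

batches = list(batch_paths(all_paths, 3))
print([len(b) for b in batches])  # [3, 3, 1]
```

Each batch can then be loaded, fed to the training step, and discarded, so peak memory is bounded by the batch size rather than the dataset size.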
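Point 4 can be illustrated with a round trip through `ndarray.tofile` / `numpy.fromfile`. The random "image", the temp directory, and the float32 normalization are assumptions; `fromfile` stores no shape or dtype metadata, so both must be known at read time.

```python
import os
import tempfile
import numpy as np

shape = (480, 640, 3)
tmpdir = tempfile.mkdtemp()

# Pre-processing step: decode + normalize once, then dump raw bytes to disk.
# (A real pipeline would do this for every image, replacing the random data.)
fake_image = np.random.randint(0, 256, size=shape).astype(np.float32) / 255.0
path = os.path.join(tmpdir, "img_000.raw")
fake_image.tofile(path)  # raw binary dump; no JPEG decoding needed later

# Training step: read the raw bytes back with numpy.fromfile and restore shape.
restored = np.fromfile(path, dtype=np.float32).reshape(shape)
print(np.allclose(restored, fake_image))  # True
```

This moves the expensive decode/normalize work into a one-time offline pass, so the training loop only does fast sequential reads.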

1 Comment

Aaah I see. This is really helpful. Thank you so much!
