
I have run into an Out of Memory problem while running a Python script. The kernel trace reads:

[490426.070081] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice,task=python3,pid=18456,uid=1003
[490426.070085] Out of memory: Killed process 18456 (python3) total-vm:82439932kB, anon-rss:63127200kB, file-rss:4kB, shmem-rss:0kB
[490427.453131] oom_reaper: reaped process 18456 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

I strongly suspect it is caused by the concatenations in the script: the smaller test sample worked fine, but the script crashes when applied to the full dataset of 105,000 entries.

So, a bit of an overview of how my script looks. I have about 105,000 rows of timestamps and other data.

dataset -
2020-05-24T10:44:37.923792|[0.0, 0.0, -0.246047720313072, 0.0]
2020-05-24T10:44:36.669264|[1.0, 1.0, 0.0, 0.0]
2020-05-24T10:44:37.174584|[1.0, 1.0, 0.0, 0.0]
2020-05-24T10:57:53.345618|[0.0, 0.0, 0.0, 0.0]

Each timestamp has 3 images, so N timestamps correspond to N*3 images; for example, 4 timestamps = 12 images. I would like to concatenate the 3 images for each timestamp along axis=2, giving a 70x320x9 array. Then, going through all the rows in this way, I want to end up with a tensor of dimension Nx70x320x9.

I solved that with help from here -- Python - Numpy 3D array - concatenate issues -- using a dictionary keyed by timestamp and concatenating later.

collected_images[timepoint].append(image)
.
.
.
output = []
for key, val in collected_images.items():
    temp = np.concatenate(val, axis=2)   # 70x320x9 per timestamp
    output.append(temp[np.newaxis, ...])

output = np.concatenate(output, axis=0)  # Nx70x320x9
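One way to reduce the peak memory of this step, assuming the number of timestamps is known up front, is to preallocate the final array once instead of accumulating a list and concatenating at the end; the per-timestamp arrays can then be freed as soon as they are copied in. A minimal runnable sketch with dummy data (the shapes mirror the 70x320x9 layout above; the stand-in arrays are illustrative):

```python
import numpy as np

# Stand-in data: 4 timestamps, each with 3 images of 70x320x3.
collected_images = {
    t: [np.zeros((70, 320, 3), dtype=np.float32) for _ in range(3)]
    for t in range(4)
}

# Preallocate the final array once, then fill it row by row. This avoids
# holding both the list of per-timestamp arrays and the concatenated
# result in memory at the same time.
output = np.empty((len(collected_images), 70, 320, 9), dtype=np.float32)
for i, val in enumerate(collected_images.values()):
    output[i] = np.concatenate(val, axis=2)  # 70x320x9

print(output.shape)  # (4, 70, 320, 9)
```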

However, as you would've guessed, when applied to 105K timestamps (105K * 3 images), the script crashes with an OOM. This is where I seek your help.

  1. I'm looking for ideas to solve this bottleneck. What other strategy can I use to accomplish my requirement?
  2. Is it possible to do some modifications to overcome the kernel OOM behaviour temporarily?
  • Is there a need to hold all those rows in memory? Can you batch process and append output to MD5? Commented Jun 10, 2020 at 14:54
  • I'm sorry did you mean hd5? As far as I'm aware MD5 is used for hash function? If it is hd5, I'm going to write into a hdf5 file after all these tasks. If it is indeed MD5, can you please guide me how I can do that? Commented Jun 10, 2020 at 16:19
  • Yes I meant hd5. It seems to me like you may be holding too much data in memory when you don't really need to. Check out what this guy did, maybe it will give some inspiration stackoverflow.com/a/5559069/503835 Commented Jun 10, 2020 at 16:23
  • Thanks. Yes, I want to split into batches, but I'm running into an error because of some logic mistake on my part. I will consult that link and wait in case others can give me some ideas. Commented Jun 10, 2020 at 16:40
  • I took your suggestion -- didn't load them all into memory by clearing the list for each iteration. Thanks! Commented Jun 11, 2020 at 15:52

2 Answers


If you know the size of your dataset, you can generate a file-mapped array of a predefined size:

import numpy as np
n = 105000
a = np.memmap('array.dat', dtype='int16', mode='w+', shape=(n, 70, 320, 9))

You can use a as a numpy array, but it is stored on disk rather than in memory. Change the data type from int16 to whatever is suitable for your data (int8, float32, etc.).

You probably don't want to use slices like a[:, i, :, :] because those will be very slow.
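To make the idea concrete, here is a minimal runnable sketch of filling the memmap one timestamp at a time, so only 3 images ever live in memory at once. The dimensions and the stand-in image data are illustrative (a small `n` and `np.full` arrays instead of real images):

```python
import numpy as np

n = 4  # stand-in for 105000
a = np.memmap('array.dat', dtype='float32', mode='w+', shape=(n, 70, 320, 9))

for i in range(n):
    # Stand-in for the 3 images loaded for timestamp i.
    images = [np.full((70, 320, 3), i, dtype='float32') for _ in range(3)]
    a[i] = np.concatenate(images, axis=2)  # write one 70x320x9 block to disk

a.flush()  # make sure the writes reach the file
print(a.shape)  # (4, 70, 320, 9)
```

Because each `a[i] = ...` assignment writes straight through to the backing file, memory usage stays flat no matter how large `n` is.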


3 Comments

Thank you. Just to clarify, after initialising the memmap, can I carry on with concatenation as I intended. Or should I make some changes?
No: allocate the right size from the beginning and don't concatenate. This may require two passes: one to read/count the rows without the image data, one to read the image data.
Thanks. I solved the issue by modifying my logic. Your solution is something I will keep in mind and pass on to my team who face similar issues.

I solved the issue!

It took a while to revise my logic. The key change was to empty the list after every iteration while still maintaining the desired dimensions. With a bit of help, I eliminated the dictionary and the double concatenation: I just used a list, appended to it, and concatenated at each iteration, but emptied the 3-image list before the next iteration. Doing this avoided loading everything into memory.

Here is the sample of that code-

collected_images = []  # holds the 3 images for the current timestamp
images_concat = []     # accumulates one 70x320x9 array per timestamp

collected_images.append(image)  # append each of the 3 images

concat_img = np.concatenate(collected_images, axis=2)  # 70x320x9
images_concat.append(concat_img)                       # builds up Nx70x320x9
collected_images = []  # empty the list for the next timestamp
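For completeness, a self-contained runnable version of the sketch above, with dummy data in place of the real image loading (the `dummy_images` helper and the small loop count are illustrative, not part of the original script):

```python
import numpy as np

def dummy_images():
    """Stand-in for loading the 3 images of one timestamp."""
    return [np.ones((70, 320, 3), dtype=np.float32) for _ in range(3)]

images_concat = []
for _ in range(4):                     # loop over timestamps
    collected_images = dummy_images()  # only 3 images held at a time
    images_concat.append(np.concatenate(collected_images, axis=2))  # 70x320x9

result = np.stack(images_concat)       # 4x70x320x9
print(result.shape)  # (4, 70, 320, 9)
```

Note that `images_concat` still grows to N entries; the saving comes from never holding more than 3 raw images at once while the loop runs.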

2 Comments

IMO it's better to avoid array concatenation altogether, especially on large arrays inside loops. I have been heavily using numpy on large arrays for years and I use np.concatenate so rarely that I can't even remember how the axis parameter works.
Got it. I will keep that in mind. So you always do np.memmap for large arrays?
