
I have a list of numpy arrays. The list has around 200,000 elements and each array is of size 3504. This fits in my RAM without problems:

(Pdb) type(x)
<type 'list'>
(Pdb) len(x)
200001
(Pdb) type(x[1])
<type 'numpy.ndarray'>
(Pdb) x[1].shape
(3504L,)

The problem is that converting the list to a numpy array pushes RAM usage to 100% and freezes/crashes my PC. I want to convert it so that I can perform some feature scaling and PCA.

EDIT: I want to convert each sample into the concatenation of the previous 1000 samples plus the sample itself.

import pywt

def take_previous_data(X_train,y):
    temp_train_data=X_train[1000:]
    temp_labels=y[1000:]
    final_train_set=[]
    for index,row in enumerate(temp_train_data):
        actual_index=index+1000
        # concatenate the previous 1000 samples plus the current one
        data=X_train[actual_index-1000:actual_index+1].ravel()
        # keep only the detail coefficients of a single-level Haar DWT
        __,cd_i=pywt.dwt(data,'haar')
        final_train_set.append(cd_i)
    return final_train_set,y


x,y=take_previous_data(X_train,y)
  • Why don't you read your data as a numpy.array in the first place? Commented Aug 25, 2015 at 18:28
  • I am appending numpy.arrays to a list, which is more efficient than appending to a numpy array. Commented Aug 25, 2015 at 18:29
  • Perhaps you could consider single precision or a smaller integer type. Commented Aug 25, 2015 at 18:32
  • Python lists are much less efficient than numpy arrays. By converting x to a numpy array you are duplicating the memory, which is probably why it crashes. There are many ways (much more efficient than using a list) to initialize your data as numpy arrays. Where are you reading your appended numpy arrays from? I mean, the problem is not that numpy crashes; the problem is that your data-reading logic is what needs to be improved. Commented Aug 25, 2015 at 18:38
  • Indeed, appending to a list takes O(1) amortised, but you don't have to append in the first place. You can make a lazy generator and give it to numpy.fromiter while specifying the data type and shape. This way you'll get your array without any intermediate data structures. Commented Aug 25, 2015 at 19:03

1 Answer


You could try rewriting take_previous_data as a generator function that lazily yields rows of your final array, then use np.fromiter, as Eli suggested:

import numpy as np
import pywt
from itertools import chain

def take_previous_data(X_train,y):
    temp_train_data=X_train[1000:]
    temp_labels=y[1000:] 
    for index,row in enumerate(temp_train_data):
        actual_index=index+1000
        data=X_train[actual_index-1000:actual_index+1].ravel()
        __,cd_i=pywt.dwt(data,'haar')
        yield cd_i

gen = take_previous_data(X_train, y)

# I'm assuming that by "int" you meant "int64"
x = np.fromiter(chain.from_iterable(gen), np.int64)

# fromiter gives a 1D output, so we reshape it into a (200001, 3504) array
x.shape = 200001, -1

Another option would be to pre-allocate the output array and fill in the rows as you go along:

def take_previous_data(X_train, y):
    temp_train_data=X_train[1000:]
    temp_labels=y[1000:] 
    out = np.empty((200001, 3504), np.int64)
    for index,row in enumerate(temp_train_data):
        actual_index=index+1000
        data=X_train[actual_index-1000:actual_index+1].ravel()
        __,cd_i=pywt.dwt(data,'haar')
        out[index] = cd_i
    return out

From our chat conversation, it seems that the fundamental issue is that you can't actually fit the output array itself in memory. In that case, you could adapt the second solution to use np.memmap to write the output array to disk:

def take_previous_data(X_train, y):
    temp_train_data=X_train[1000:]
    temp_labels=y[1000:] 
    out = np.memmap('my_array.mmap', mode='w+', shape=(200001, 3504), dtype=np.int64)
    for index,row in enumerate(temp_train_data):
        actual_index=index+1000
        data=X_train[actual_index-1000:actual_index+1].ravel()
        __,cd_i=pywt.dwt(data,'haar')
        out[index] = cd_i
    return out

One other obvious solution would be to reduce the bit depth of your array. I've assumed that by int you meant int64 (the default integer type in numpy). If you could switch to a lower bit depth (e.g. int32, int16 or maybe even int8), you could drastically reduce your memory requirements.
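
To put rough numbers on that, here is a quick back-of-the-envelope sketch (the row/column counts are taken from the question; whether your values actually fit in a narrower integer type is something you would have to check against your data):

import numpy as np

# Sizes from the question: 200001 rows of 3504 values each.
n_rows, n_cols = 200001, 3504

# Memory the full array would need at different bit depths, in GiB.
for dtype in (np.int64, np.int32, np.int16, np.int8):
    n_bytes = n_rows * n_cols * np.dtype(dtype).itemsize
    print(dtype.__name__, round(n_bytes / 2**30, 2))

# int64 needs ~5.2 GiB, int16 ~1.3 GiB, int8 ~0.65 GiB -- a narrower dtype
# can be the difference between fitting in RAM and not.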


18 Comments

Thanks! You didn't reshape though, please check.
Yes I did. You can reshape an array in place by assigning to its .shape attribute. The -1 means to infer the size of the array in that dimension, based on the total number of elements (there is a short sketch of this after the comment thread).
I believe cd_i is a sequence, hence you need to call np.fromiter(itertools.chain(*gen), dtype=np.int64) for np.fromiter to work, because it only accepts 1D data streams. I haven't slept for quite a while, so I may be wrong.
@EliKorvigo Good spot.
Did you see the line from itertools import chain?
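
For readers skimming these comments, here is a small self-contained sketch of the two points above (in-place reshape with -1, and flattening a generator of 1-D arrays so np.fromiter can consume it); the toy sizes are made up for illustration:

import numpy as np
from itertools import chain

# A toy generator standing in for take_previous_data: 5 rows of 4 values each.
gen = (np.arange(4) + i for i in range(5))

# np.fromiter only accepts a flat (1-D) stream, so chain the rows together first.
x = np.fromiter(chain.from_iterable(gen), np.int64)
print(x.shape)   # (20,)

# Assigning to .shape reshapes in place; -1 tells numpy to infer that dimension
# from the total number of elements.
x.shape = 5, -1
print(x.shape)   # (5, 4)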