
If I have a huge list of lists in memory and I wish to convert it into an array, does the naive approach cause Python to make a copy of all the data, taking twice the space in memory? Or should I convert the list of lists vector by vector instead, popping entries as I go?

# for instance
list_of_lists = [[...], ..., [...]]
arr = np.array(list_of_lists)

Edit: Is it better to create an empty array of a known size and then populate it incrementally, thus avoiding the list_of_lists object entirely? Could this be accomplished by something as simple as some_array[i] = some_list_of_float_values?
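
For instance, a rough sketch of that approach (the data here is made up just to illustrate; it assumes all rows have the same length and hold floats):

import numpy as np

# stand-in data for the real list_of_lists
list_of_lists = [[float(j) for j in range(1000)] for _ in range(500)]

n_rows, n_cols = len(list_of_lists), len(list_of_lists[0])
arr = np.empty((n_rows, n_cols), dtype=np.float64)  # allocate the full block once

# fill row by row, popping from the end so each python row can be
# garbage-collected as soon as it has been copied into the array
while list_of_lists:
    arr[len(list_of_lists) - 1] = list_of_lists.pop()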

  • Note that a numpy array and a python list are completely different concepts. If you have some time, I would suggest having a look at this publication: arxiv.org/pdf/1102.1523.pdf. Commented Feb 25, 2015 at 12:35

1 Answer


I'm just putting this here as it's a bit long for a comment.

Have you read the numpy documentation for array?

numpy.array(object, dtype=None, copy=True, order=None, subok=False, ndmin=0)
"""
...
copy : bool, optional
  If true (default), then the object is copied. Otherwise, a copy will
  only be made if __array__ returns a copy, if obj is a nested sequence,
  or if a copy is needed to satisfy any of the other requirements (dtype,
  order, etc.).
...
"""

When you say you don't want to copy the data of the original array when creating the numpy array, what data structure are you hoping to end up with?

A lot of the speed-up you get from using numpy is because the C arrays it creates are contiguous in memory. A list in Python is just an array of pointers to objects, so you have to go and find each object every time; that isn't the case in numpy, where the numeric data is stored directly rather than as Python objects.
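
As a rough illustration of that layout difference (a sketch; exact byte counts depend on the platform and Python version):

import sys
import numpy as np

lst = [[float(j) for j in range(1000)] for _ in range(1000)]
arr = np.array(lst)                        # one contiguous block of C doubles

print(arr.nbytes)                          # 8,000,000 bytes of raw float64 data
print(arr.flags['C_CONTIGUOUS'])           # True
# getsizeof only counts the list objects and their pointer arrays,
# not the float objects they point to, and that alone is already
# comparable in size to the whole numpy array:
print(sys.getsizeof(lst) + sum(sys.getsizeof(r) for r in lst))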

If you just want the numpy array to reference the Python lists inside your 2D structure, then you'll lose those performance gains.

If you do np.array(my_2D_python_array, copy=False), I don't know what it will actually produce, but you could easily test it yourself: look at the shape of the array, and see what kind of objects it houses.
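
A quick check along those lines might look like this (a sketch with toy data; note that on NumPy 2.0 and later, copy=False raises an error when a copy cannot be avoided instead of silently copying):

import numpy as np

my_2D_python_array = [[1.0, 2.0], [3.0, 4.0]]

try:
    arr = np.array(my_2D_python_array, copy=False)
except ValueError:
    # newer numpy refuses: converting nested python lists always needs a copy
    arr = np.asarray(my_2D_python_array)

print(arr.shape)    # (2, 2)
print(arr.dtype)    # float64 - numeric data, not references to the original lists

Either way you end up with a fresh, contiguous float array rather than a view onto the original lists.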

If you want the numpy array to be contiguous, though, at some point you're going to have to allocate all of the memory it needs (and if it's as large as you're suggesting, it might be difficult to find a contiguous section large enough).

Sorry that was pretty rambling, just a comment. How big are the actual arrays you're looking at?

Here's a plot of the CPU and memory usage of a small sample program:

from __future__ import division

#Make a large python 2D array
N, M = 10000, 18750
print "%i x %i = %i doubles = %f GB" % (N, M, N * M, N*M*8/10**9)

# grab pid to monitor memory and cpu usage
import os
pid = os.getpid()

os.system("python moniter.py -p " + str(pid) + " &")


print "building python matrix"
large_2d_array = [[n + m*M for n in range(N)] for m in range(M)]


import numpy
from datetime import datetime

print datetime.now(), "creating numpy array with copy"
np1 = numpy.array(large_2d_array, copy=True)
print datetime.now(), "deleting array"
del(np1)


print datetime.now(), "creating numpy array with copy"
np1 = numpy.array(large_2d_array, copy=False)
print datetime.now(), "deleting array"
del(np1)

[Plot: memory and CPU usage of the sample program over time; markers 1, 2, and 3 show where each matrix finishes being created]

1, 2, and 3 are the points where each of the matrices finishes being created. Note that the native Python structure takes up much more memory than the numpy ones: Python objects each carry their own overhead, and the lists are lists of those objects. That isn't the case for the numpy array, so it is considerably smaller.

Also note that passing copy=False for the Python object has no effect: new data is always created. You could get around this by creating a numpy array of Python objects (using dtype=object), but I wouldn't advise it.
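
For completeness, a minimal sketch of that dtype=object route (hypothetical data; it only stores references to the existing row lists, so it avoids the big copy but also gives up the numeric speed and memory benefits):

import numpy as np

rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

obj_arr = np.empty(len(rows), dtype=object)  # 1-D array of python references
for i, r in enumerate(rows):
    obj_arr[i] = r                           # stores a pointer, no conversion

print(obj_arr[0] is rows[0])                 # True: the original lists are shared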


9 Comments

It's about 1.5 GB, 10k x 150k, but I should be able to build the numpy array vector by vector? Would it then realloc a bunch of times? I am not sure how to pull this off, but I don't need or want the list_of_lists and the numpy array at the same time.
Okay, first off, an array that's 10K by 150K is going to be a lot more than 1.5 GB; 6 GB for 32-bit ints, 12 GB for doubles.
No, I meant in bytes. The float vector is 150kb :) It's only 30+k floats
Ah okay!! that's fine. I'm just looking at the usage now of the different methods. I'll put some plots in shortly.
You can always build the numpy array, and then just del(list_of_lists) afterwards.
