
Over time I have done many things that require using the list's .append() function, and also the numpy.append() function for numpy arrays. I noticed that both grow really slowly once the arrays get large.

I need an array that grows dynamically, to sizes of about 1 million elements. I could implement this myself, the way std::vector is implemented in C++, by keeping a buffer (reserve) length that is not accessible from the outside. But do I have to reinvent the wheel? I imagine it must already be implemented somewhere. So my question is: does such a thing already exist in Python?

What I mean is: is there an array type in Python that can grow dynamically with constant, O(1), time complexity most of the time?

  • Do you mean array (docs.python.org/2/library/array.html) or list? Those are two different things in Python. Arrays do grow slowly in Python. Commented Dec 9, 2016 at 14:43
  • (Just some thoughts:) I do think CPython's list/arraylist behaviour is similar to std::vector, but it does not give you the manual control available in C++. I imagine there is much more that could be done on the algorithmic side in your use case (in particular, np.append should not be a very common operation). If not, you can always build a std::vector wrapper with Cython. I also like blue_note's mention of deque, as I love the two-step process of growing linked-list based data and transforming it to an array at the end. Commented Dec 9, 2016 at 14:44
  • @MooingRawr The reason I brought list into this is that it doesn't have to be contiguous in memory, and yet it's still slow with bigger arrays. So I honestly don't understand what's going on. Commented Dec 9, 2016 at 14:53
  • @TheQuantumPhysicist: Python's list is contiguous; it is not a linked list. It's like a C++ std::vector of pointers. Commented Dec 9, 2016 at 18:01

2 Answers


The memory layout of numpy arrays is well described in the docs, and has been discussed here a lot. The memory layout of lists has also been discussed, though usually just in contrast to numpy.

A numpy array has a fixed-size data buffer. 'Growing' it requires creating a new array and copying the data into it. np.concatenate does that in compiled code. np.append, as well as all the stack functions, uses concatenate.
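A small illustration of that copy-on-append behaviour:

```python
import numpy as np

a = np.arange(5)
b = np.append(a, 6)   # allocates a brand-new array and copies everything
print(a)              # [0 1 2 3 4]  -- the original is untouched
print(b)              # [0 1 2 3 4 6]
print(a is b)         # False: np.append never grows in place
```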

A list has, as I understand it, a contiguous data buffer that contains pointers to objects stored elsewhere in memory. Python maintains some free space in that buffer, so additions with list.append are relatively fast and easy. But when the free space fills up, it has to create a new buffer and copy the pointers. I can see how that could get expensive with large lists.

So a list has to store a pointer for each element, plus the element itself (e.g. a float) somewhere else in memory. In contrast, an array of floats stores the floats themselves as contiguous bytes in its buffer. (Object-dtype arrays behave more like lists.)
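A rough way to see both effects (the exact numbers are CPython implementation details and vary by version):

```python
import sys
import numpy as np

lst = []
buffer_sizes = set()
for i in range(64):
    lst.append(float(i))
    buffer_sizes.add(sys.getsizeof(lst))   # changes only when CPython reallocates the pointer buffer

print(sorted(buffer_sizes))   # a handful of sizes, not 64: the buffer is over-allocated

arr = np.arange(64, dtype=np.float64)
print(arr.nbytes)             # 512: 64 contiguous 8-byte floats, no per-element Python objects
```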

The recommended way to create an array iteratively is to build the list with append, and create the array once at the end. Repeated np.append or np.concatenate is relatively expensive.
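For example, something along these lines (a sketch rather than a benchmark; the list version is the recommended pattern):

```python
import numpy as np

def build_with_list(n):
    # amortized O(1) appends, one array allocation at the very end
    buf = []
    for i in range(n):
        buf.append(i * 0.5)
    return np.array(buf)

def build_with_np_append(n):
    # every iteration allocates a new array and copies all previous data: O(n**2) overall
    arr = np.empty(0)
    for i in range(n):
        arr = np.append(arr, i * 0.5)
    return arr

assert np.array_equal(build_with_list(10_000), build_with_np_append(10_000))
```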

deque was mentioned. I don't know much about how it stores its data. The docs say it can add elements at the start just as easily as at the end, but random access is slower than for a list. That implies it stores its data in some sort of linked list, so that finding the nth element requires traversing the n-1 links before it. So there is a trade-off between ease of growth and access speed.

Adding elements to the start of a list requires moving every existing pointer to make room at the front. So adding and removing elements at the start of a regular list is much more expensive than doing so at the end.
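A quick way to observe the difference (timings are obviously machine-dependent):

```python
from collections import deque
from timeit import timeit

def list_grow_front(n):
    lst = []
    for i in range(n):
        lst.insert(0, i)      # every existing pointer has to shift: O(n) per insert
    return lst

def deque_grow_front(n):
    d = deque()
    for i in range(n):
        d.appendleft(i)       # O(1) at either end
    return d

print(timeit(lambda: list_grow_front(20_000), number=1))
print(timeit(lambda: deque_grow_front(20_000), number=1))
```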

Recommending software is outside of the core SO purpose. Others may make suggestions, but don't be surprised if this gets closed.

There are file formats like HDF5 that are designed for large data sets. They accommodate growth with features like 'chunking'. And there are all kinds of database packages.
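As an illustration, a minimal sketch with h5py (assuming that package; the file name and block sizes are arbitrary):

```python
import numpy as np
import h5py

with h5py.File('growing.h5', 'w') as f:
    # chunked storage plus an unlimited max shape lets the dataset be enlarged later
    dset = f.create_dataset('data', shape=(0,), maxshape=(None,),
                            dtype='float64', chunks=(1024,))
    for block in (np.random.rand(500) for _ in range(4)):
        old = dset.shape[0]
        dset.resize((old + block.size,))   # grows on disk in whole chunks
        dset[old:] = block
```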


7 Comments

I'm sorry, but this is not software recommendation, and if recommending a way to do things counts as software recommendation, then 99% of SO's questions would have to be closed. While I thank you for the information provided, that sentence is unconstructive and irritating, and it is a strawman. Please remove it.
I use HDF5 for the same project, btw, but I'm talking about things in memory, not in files.
How you use the data is just as important as how you grow it. Usually you use it many times, but only grow it once. I'd lean toward using list append to make arrays from small pieces (number/row at a time), and array concatenate to join large arrays into larger ones - until I start to hit memory limits.
Currently my solution is to resize twice, before and after adding the data. First I resize to the maximum expected size, and then I resize back when I'm done adding. This is a quick-and-dirty solution that's the fastest I'm gonna get.
You are talking about using the x.resize method (in-place) and immediately assigning new values? I haven't used that before. It looks like you have to be careful about using the array between resize operations. And it may be hard to use with anything but a 1d array. np.resize makes a copy.
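A minimal sketch of that resize-twice idea as I read it (the 1,000,000 reserve and the block generator are made up for illustration; it assumes the final size never exceeds the reserve):

```python
import numpy as np

# Reserve the maximum expected size up front (in-place resize, zero-filled).
arr = np.zeros(0, dtype=np.float64)
arr.resize(1_000_000, refcheck=False)

count = 0
for block in (np.random.rand(1000) for _ in range(50)):
    arr[count:count + block.size] = block
    count += block.size

# Trim back to the data actually written when finished.
arr.resize(count, refcheck=False)
```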

Both use an underlying array. Instead, you can use collections.deque, which is designed specifically for adding and removing elements at both ends with O(1) complexity.
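For example (a sketch; the final conversion to a numpy array is only needed if you want array semantics afterwards):

```python
from collections import deque
import numpy as np

buf = deque()
for i in range(1_000_000):
    buf.append(i * 0.5)   # O(1) append at either end

# single conversion at the end, once the size is known
arr = np.fromiter(buf, dtype=np.float64, count=len(buf))
print(arr.shape)          # (1000000,)
```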

