6

The code

import array, itertools
a = array.array('B', itertools.repeat(0, 3715948544))

takes almost 7 minutes to run on my machine (6m44s). The computer has 8 GB of RAM and runs Linux with CPython 3.4.3. How can I obtain an array-like object with 1-byte unsigned int entries faster, preferably using the Python standard library? NumPy can allocate it instantly (in less than 1 millisecond).
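For reference, the NumPy allocation I'm comparing against is just a zero-filled uint8 array; a minimal sketch (the exact call is an illustration, not part of the timing above):

import numpy as np

# dtype='uint8' gives 1-byte unsigned entries; zero-filled pages are handed
# out lazily by the OS, so this returns almost immediately.
a = np.zeros(3715948544, dtype='uint8')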

8 Comments
  • Why are you preallocating it? Commented Nov 3, 2015 at 21:30
  • You should give up on the standard library and use numpy instead. Commented Nov 3, 2015 at 21:31
  • Wait, if you know NumPy can solve your problems, why did you ask the question? Commented Nov 3, 2015 at 21:33
  • @user2357112, I want to have as few external dependencies as possible, because I will distribute this code later and I don't want to bother with setting up a lot of libraries. Commented Nov 3, 2015 at 21:37
  • What uses 3 billion values but isn't science-related enough that NumPy would be an extremely low bar? Could you distribute it with Anaconda? Commented Nov 3, 2015 at 21:40

3 Answers

6
a = array.array('B', [0]) * 3715948544

Sequence multiplication, analogous to how you'd create a giant list of zeros. Note that anything you want to do with this giant array is probably going to be as slow as your initial attempt to create it.
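If you want to sanity-check the timing and the result yourself, a rough sketch (numbers will differ per machine, and you need roughly 3.5 GiB of free RAM):

import array
import time

t0 = time.perf_counter()
a = array.array('B', [0]) * 3715948544   # sequence multiplication: repeat a one-element array
print(len(a), a.itemsize)                # 3715948544 entries, 1 byte each
print('created in %.2f s' % (time.perf_counter() - t0))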


1 Comment

This takes about 2 seconds on my computer. Thanks!
4

If you really can't use NumPy, you can see how far you get with the built-in bytearray:

a = bytearray(3715948544)

This should finish in a couple of seconds at most.
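A rough usage sketch (the index and threshold are made-up values; indexing a bytearray gives plain ints in 0–255, and anything you store back must stay in that range):

a = bytearray(3715948544)   # ~3.5 GiB of zero bytes

a[12345] += 1               # read-modify-write of one 1-byte counter
if a[12345] > 200:          # made-up threshold
    print('counter 12345 exceeded the threshold')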

7 Comments

This is about as fast as the sequence multiplication from the other answer, approximately 1.95 seconds. Thank you!
@Pastafarianist It still has the same problem as the sequence multiplication – you can't really do anything useful with the giant buffer after creating it without things becoming terribly slow.
I've just tested bytearray vs array.array vs numpy. In my particular case, bytearray is the fastest, array.array is a close second and numpy loses by a factor of almost 2. I am running a particular calculation, where this array is basically a huge counter: I read a value, add +1, write it back and check if it has exceeded a threshold. With bytearray I get about 7.5 sec/iteration, array.array is about 7.75 seconds and numpy is 14.6 seconds.
@Pastafarianist: Have you considered collections.Counter? Or just a regular dict? The 7.5 seconds number sounds like you aren't using anywhere near all the array cells. (That, or maybe you mean that one read/increment/write/check cycle takes 7.5 seconds, which is crazy slow.)
@Pastafarianist: At that rate, using almost all the entries is going to take you about 14 hours, and that's if none of the counts go past 1. If the average value is 10, it'll take almost a week. You might want to look into Cython, or just write this in C. It sounded like a 7-minute startup lag was a deal-breaker, so 14 hours to a week of runtime probably isn't a great proposition.
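A minimal sketch of the collections.Counter idea suggested above, assuming the counters are sparse (only indices you actually touch take memory); the helper name and threshold are illustrative, not from the original code:

from collections import Counter

counts = Counter()

def bump(i, threshold=200):
    # Illustrative helper: increment counter i and report whether it crossed the threshold.
    counts[i] += 1
    return counts[i] > threshold

if bump(12345):
    print('counter 12345 exceeded the threshold')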
1

At first I thought numpy would be fastest, but as pointed out by Sven, bytearray is pretty quick for 10,000 elements. Try your luck with bytearray on 3 billion.

In [1]: import numpy as np

In [2]: import array, itertools

In [3]: %timeit array.array('B', itertools.repeat(0, 10000))
1000 loops, best of 3: 456 µs per loop

In [4]: %timeit np.zeros(10000, dtype='uint8')
1000000 loops, best of 3: 924 ns per loop

In [5]: %timeit bytearray(10000)
1000000 loops, best of 3: 328 ns per loop
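To repeat the comparison at the full size from the question, a one-shot timing is more practical than %timeit; a sketch, assuming ~3.5 GiB of RAM is free for each allocation:

import time
import numpy as np

N = 3715948544
for label, make in [('bytearray', lambda: bytearray(N)),
                    ('np.zeros', lambda: np.zeros(N, dtype='uint8'))]:
    t0 = time.perf_counter()
    a = make()
    print('%s: %.3f s' % (label, time.perf_counter() - t0))
    del a   # free the memory before the next allocation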

1 Comment

NumPy is of course the best solution, since it's the only one that lets you actually do something with the array after creating it.
