6

The code

import array, itertools
a = array.array('B', itertools.repeat(0, 3715948544))

takes almost 7 minutes to run on my machine (6m44s). The computer has 8 GB of RAM and runs Linux with CPython 3.4.3. How can I obtain an array-like object with 1-byte unsigned int entries faster, preferably using the Python standard library? NumPy can allocate it instantly (in less than 1 millisecond).
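For reference, the NumPy allocation I'm comparing against is just a zero-filled uint8 array; a minimal sketch (the exact call is an illustration, not part of the timing above):

import numpy as np

# dtype='uint8' gives 1-byte unsigned entries; zero-filled pages are handed
# out lazily by the OS, so this returns almost immediately.
a = np.zeros(3715948544, dtype='uint8')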

8 Comments
  • Why are you preallocating it? Commented Nov 3, 2015 at 21:30
  • You should give up on the standard library and use numpy instead. Commented Nov 3, 2015 at 21:31
  • Wait, if you know NumPy can solve your problems, why did you ask the question? Commented Nov 3, 2015 at 21:33
  • @user2357112, I want to have as few external dependencies as possible, because I will distribute this code later and I don't want to bother with setting up a lot of libraries. Commented Nov 3, 2015 at 21:37
  • What uses 3 billion values but isn't science-related enough that NumPy would be an extremely low bar? Could you distribute it with Anaconda? Commented Nov 3, 2015 at 21:40

3 Answers

6
a = array.array('B', [0]) * 3715948544

Sequence multiplication, analogous to how you'd create a giant list of zeros. Note that anything you want to do with this giant array is probably going to be as slow as your initial attempt to create it.
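If you want to sanity-check the timing and the result yourself, a rough sketch (numbers will differ per machine, and you need roughly 3.5 GiB of free RAM):

import array
import time

t0 = time.perf_counter()
a = array.array('B', [0]) * 3715948544   # sequence multiplication: repeat a one-element array
print(len(a), a.itemsize)                # 3715948544 entries, 1 byte each
print('created in %.2f s' % (time.perf_counter() - t0))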


1 Comment

This takes about 2 seconds on my computer. Thanks!
4

If you really can't use NumPy, you can see how far you get with the built-in bytearray:

a = bytearray(3715948544)

This should finish in a couple of seconds at most.
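A rough usage sketch (the index and threshold are made-up values; indexing a bytearray gives plain ints in 0–255, and anything you store back must stay in that range):

a = bytearray(3715948544)   # ~3.5 GiB of zero bytes

a[12345] += 1               # read-modify-write of one 1-byte counter
if a[12345] > 200:          # made-up threshold
    print('counter 12345 exceeded the threshold')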

7 Comments

This is about as fast as the sequence multiplication from the other answer, approximately 1.95 seconds. Thank you!
@Pastafarianist It still has the same problem as the sequence multiplication – you can't really do anything useful with the giant buffer after creating it without things becoming terribly slow.
I've just tested bytearray vs array.array vs numpy. In my particular case, bytearray is the fastest, array.array is a close second and numpy loses by a factor of almost 2. I am running a particular calculation, where this array is basically a huge counter: I read a value, add +1, write it back and check if it has exceeded a threshold. With bytearray I get about 7.5 sec/iteration, array.array is about 7.75 seconds and numpy is 14.6 seconds.
@Pastafarianist: Have you considered collections.Counter? Or just a regular dict? The 7.5 seconds number sounds like you aren't using anywhere near all the array cells. (That, or maybe you mean that one read/increment/write/check cycle takes 7.5 seconds, which is crazy slow.)
@Pastafarianist: At that rate, using almost all the entries is going to take you about 14 hours, and that's if none of the counts go past 1. If the average value is 10, it'll take almost a week. You might want to look into Cython, or just write this in C. It sounded like a 7-minute startup lag was a deal-breaker, so 14 hours to a week of runtime probably isn't a great proposition.
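A minimal sketch of the collections.Counter idea suggested above, assuming the counters are sparse (only indices you actually touch take memory); the helper name and threshold are illustrative, not from the original code:

from collections import Counter

counts = Counter()

def bump(i, threshold=200):
    # Illustrative helper: increment counter i and report whether it crossed the threshold.
    counts[i] += 1
    return counts[i] > threshold

if bump(12345):
    print('counter 12345 exceeded the threshold')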
1

At first I thought numpy would be fastest, but as pointed out by Sven, bytearray is pretty quick for 10,000 elements. Try your luck with bytearray on 3 billion.

In [1]: import numpy as np

In [2]: import array, itertools

In [3]: %timeit array.array('B', itertools.repeat(0, 10000))
1000 loops, best of 3: 456 µs per loop

In [4]: %timeit np.zeros(10000, dtype='uint8')
1000000 loops, best of 3: 924 ns per loop

In [5]: %timeit bytearray(10000)
1000000 loops, best of 3: 328 ns per loop
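To repeat the comparison at the full size from the question, a one-shot timing is more practical than %timeit; a sketch, assuming ~3.5 GiB of RAM is free for each allocation:

import time
import numpy as np

N = 3715948544
for label, make in [('bytearray', lambda: bytearray(N)),
                    ('np.zeros', lambda: np.zeros(N, dtype='uint8'))]:
    t0 = time.perf_counter()
    a = make()
    print('%s: %.3f s' % (label, time.perf_counter() - t0))
    del a   # free the memory before the next allocation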

1 Comment

NumPy is of course the best solution, since it's the only one that lets you actually do something with the array after creating it.
