17

I need to create a large numpy array containing random boolean values without hitting the swap.

My laptop has 8 GB of RAM. Creating a (1200, 2e6) array takes less than 2 s and use 2.29 GB of RAM:

>>> dd = np.ones((1200, int(2e6)), dtype=bool)
>>> dd.nbytes/1024./1024
2288.818359375

>>> dd.shape
(1200, 2000000)

For a relatively small (1200, 400e3), np.random.randint is still quite fast, taking roughly 5 s to generate a 458 MB array:

db = np.array(np.random.randint(2, size=(int(400e3), 1200)), dtype=bool)
print db.nbytes/1024./1024., 'Mb'

But if I double the size of the array to (1200, 800e3) I hit the swap, and it takes ~2.7 min to create db ;(

cmd = """
import numpy as np
db = np.array(np.random.randint(2, size=(int(800e3), 1200)), dtype=bool)
print db.nbytes/1024./1024., 'Mb'"""

print timeit.Timer(cmd).timeit(1)

Using random.getrandbits takes even longer (~8min), and also uses the swap:

from random import getrandbits
db = np.array([not getrandbits(1) for x in xrange(int(1200*800e3))], dtype=bool)

Using np.random.randint for a (1200, 2e6) just gives a MemoryError.

Is there a more efficient way to create a (1200, 2e6) random boolean array?

1 Answer 1

14

One problem with using np.random.randint is that it generates 64-bit integers, whereas numpy's np.bool dtype uses only 8 bits to represent each boolean value. You are therefore allocating an intermediate array 8x larger than necessary.

A workaround that avoids intermediate 64-bit dtypes is to generate a string of random bytes using np.random.bytes, which can be converted to an array of 8-bit integers using np.fromstring. These integers can then be converted to boolean values, for example by testing whether they are less than 255 * p, where p is the desired probability of each element being True:

import numpy as np

def random_bool(shape, p=0.5):
    n = np.prod(shape)
    x = np.fromstring(np.random.bytes(n), np.uint8, n)
    return (x < 255 * p).reshape(shape)

Benchmark:

In [1]: shape = 1200, int(2E6)

In [2]: %timeit random_bool(shape)
1 loops, best of 3: 12.7 s per loop

One important caveat is that the probability will be rounded down to the nearest multiple of 1/256 (for an exact multiple of 1/256 such as p=1/2 this should not affect accuracy).


Update:

An even faster method is to exploit the fact that you only need to generate a single random bit per 0 or 1 in your output array. You can therefore create a random array of 8-bit integers 1/8th the size of the final output, then convert it to np.bool using np.unpackbits:

def fast_random_bool(shape):
    n = np.prod(shape)
    nb = -(-n // 8)     # ceiling division
    b = np.fromstring(np.random.bytes(nb), np.uint8, nb)
    return np.unpackbits(b)[:n].reshape(shape).view(np.bool)

For example:

In [3]: %timeit fast_random_bool(shape)
1 loops, best of 3: 5.54 s per loop
Sign up to request clarification or add additional context in comments.

5 Comments

Your last solution will be even faster if, rather than doing .astype(np.bool), you instead go with .view(np.bool) or .astype(np.bool, copy=False), as either one will spare you a copy of the full array.
@Jaime Thanks - I always forget that .astype() returns a copy by default
thank's @ali_m this random boolean array is in the context of a numpy-broadcasting question : stackoverflow.com/q/34496409/3313834
The "generate bytes and compare to 255*p" strategy has the limitation that the probabilities are rounded to multiples of 1/256, and not always the "right" multiple of 1/256.
@user2357112 You're right - I've edited my answer to mention this caveat, although I'm not sure it's possible to do any better without allocating a larger intermediate array.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.