2

Based on the answers here it doesn't seem like there's an easy way to fill a 2D numpy array with data from a generator.

However, if someone can think of a way to vectorize or otherwise speed up the following function I would appreciate it.

The difference here is that I want to process the values from the generator in batches rather than create the whole array in memory. The only way I could think of doing that was with a for loop.

import numpy as np
from itertools import permutations

permutations_of_values = permutations(range(1,20), 7)

def array_from_generator(generator, arr):
    """Fills the numpy array provided with values from
    the generator provided. Number of columns in arr
    must match the number of values yielded by the 
    generator."""
    count = 0
    for row in arr:
        try:
            item = next(generator)
        except StopIteration:
            break
        row[:] = item
        count += 1
    return arr[:count,:]

batch_size = 100000

empty_array = np.empty((batch_size, 7), dtype=int)
batch_of_values = array_from_generator(permutations_of_values, empty_array)

print(batch_of_values[0:5])

Output:

[[ 1  2  3  4  5  6  7]
 [ 1  2  3  4  5  6  8]
 [ 1  2  3  4  5  6  9]
 [ 1  2  3  4  5  6 10]
 [ 1  2  3  4  5  6 11]]

Speed test:

%timeit array_from_generator(permutations_of_values, empty_array)
10 loops, best of 3: 137 ms per loop

ADDITION:

As suggested by @COLDSPEED (thanks) here is a version that uses a list to gather the data from the generator. It's about twice as fast as above code. Can anyone improve on this:

permutations_of_values = permutations(range(1,20), 7)

def array_from_generator2(generator, rows=batch_size):
    """Creates a numpy array from a specified number 
    of values from the generator provided."""
    data = []
    for row in range(rows):
        try:
            data.append(next(generator))
        except StopIteration:
            break
    return np.array(data)

batch_size = 100000

batch_of_values = array_from_generator2(permutations_of_values, rows=100000)

print(batch_of_values[0:5])

Output:

[[ 1  2  3  4  5  6  7]
 [ 1  2  3  4  5  6  8]
 [ 1  2  3  4  5  6  9]
 [ 1  2  3  4  5  6 10]
 [ 1  2  3  4  5  6 11]]

Speed test:

%timeit array_from_generator2(permutations_of_values, rows=100000)
10 loops, best of 3: 85.6 ms per loop
5
  • 1
    It should be simpler to fill a list and then call np.array on the resultant. Commented Sep 18, 2017 at 1:57
  • 1
    fromiter, as discussed in a couple of the linked answers, is the only way of creating an array directly from the output of a generator. Otherwise you need to create a list and build or fill in the array from that. Generators can save memory during intermediate processing (c.f. to the list equivalent), but aren't any faster. Commented Sep 18, 2017 at 2:11
  • 1
    fromiter would be great but it only works with series (1-dimensional arrays). Commented Sep 18, 2017 at 2:19
  • Can you know ahead of time the dimensions? Then you can still use fromiter Commented Sep 18, 2017 at 2:32
  • If you read the documentation, it states that fromiter creates "a new 1-dimensional array from an iterable object". What I am trying to do here is 2-dimensional because each item from the generator is a tuple of 7 values. Maybe it's time to extend fromiter to handle multi-dimensional iterators... Commented Sep 18, 2017 at 2:48

1 Answer 1

3

You can calculate the sizes ahead in essentially constant time. Just do that, and use numpy.fromiter:

In [1]: import math, from itertools import permutations, chain

In [2]: def n_chose_k(n, k, fac=math.factorial):
    ...:     return fac(n)/fac(n-k)
    ...:

In [3]: def permutations_to_array(r, k):
    ...:     n = len(r)
    ...:     size = int(n_chose_k(n, k))
    ...:     it = permutations(r, k)
    ...:     arr = np.fromiter(chain.from_iterable(it),
    ...:                       count=size,  dtype=int)
    ...:     arr.size = size//k, k
    ...:     return arr
    ...:

In [4]: arr = permutations_to_array(range(1,20), 7)

In [5]: arr.shape
Out[5]: (36279360, 7)

In [6]: arr[0:5]
Out[6]:
array([[ 1,  2,  3,  4,  5,  6,  7],
       [ 1,  2,  3,  4,  5,  6,  8],
       [ 1,  2,  3,  4,  5,  6,  9],
       [ 1,  2,  3,  4,  5,  6, 10],
       [ 1,  2,  3,  4,  5,  6, 11]])

This will work as long as r is limited to sequences with a len.

Edited to add an implementation I cooked up for a generator of batchsize*k chunks, with a trim option!

import math
from itertools import repeat, chain

import numpy as np

def n_chose_k(n, k, fac=math.factorial):
    return fac(n)/fac(n-k)

def permutations_in_batches(r, k, batchsize=None, fill=0, dtype=int, trim=False):
    n = len(r)
    size = int(n_chose_k(n, k))
    if batchsize is None or batchsize > size:
        batchsize = size
    perms = chain.from_iterable(permutations(r, k))
    count = batchsize*k
    remaining = size - count
    while remaining > 0:
        current = np.fromiter(perms, count=count, dtype=dtype)
        current.shape = batchsize, k
        yield current
        remaining -= count
    if remaining: # remaining is negative
        remaining = -remaining
        if not trim:
            padding = repeat(fill, remaining)
            finalcount = count
            finalshape = batchsize, k
        else:
            q = remaining//k # always divisible q%k==0
            finalcount = q*k
            padding = repeat(fill, remaining)
            finalshape = q, k
        current =  np.fromiter(chain(perms, padding), count=finalcount, dtype=dtype)
        current.shape = finalshape
    else: # remaining is 0
        current = np.fromiter(perms, count=batchsize, dtype=dtype)
        current.shape = batchsize, k
    yield current
Sign up to request clarification or add additional context in comments.

7 Comments

Where does n_chose_k come from?
@Bill sorry, forgot to put that in the answer
Very good thanks. I think the dimensions of the resulting array are not quite right though. Shouldn't it be count=size*k and arr.resize((size, k))?
@Bill, In my experience the fromiter count doesn't make a big difference in run time. And arr.reshape(-1,k) doesn't require this size either.
@Bill ended up cooking one up for fun if you want to see it
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.