Based on the answers here, it doesn't seem like there's an easy way to fill a 2D numpy array with data from a generator.
However, if someone can think of a way to vectorize or otherwise speed up the following function, I would appreciate it.
The difference here is that I want to process the values from the generator in batches rather than create the whole array in memory. The only way I could think of to do that was with a for loop.
import numpy as np
from itertools import permutations

permutations_of_values = permutations(range(1,20), 7)

def array_from_generator(generator, arr):
    """Fills the numpy array provided with values from
    the generator provided. Number of columns in arr
    must match the number of values yielded by the
    generator."""
    count = 0
    for row in arr:
        try:
            item = next(generator)
        except StopIteration:
            break
        row[:] = item
        count += 1
    return arr[:count,:]

batch_size = 100000
empty_array = np.empty((batch_size, 7), dtype=int)
batch_of_values = array_from_generator(permutations_of_values, empty_array)
print(batch_of_values[0:5])
Output:
[[ 1  2  3  4  5  6  7]
 [ 1  2  3  4  5  6  8]
 [ 1  2  3  4  5  6  9]
 [ 1  2  3  4  5  6 10]
 [ 1  2  3  4  5  6 11]]
Speed test:
%timeit array_from_generator(permutations_of_values, empty_array)
10 loops, best of 3: 137 ms per loop
ADDITION:
As suggested by @COLDSPEED (thanks), here is a version that uses a list to gather the data from the generator. It's about 1.6 times as fast as the code above (85.6 ms versus 137 ms per batch). Can anyone improve on this?
permutations_of_values = permutations(range(1,20), 7)

# Defined before the function because it is used as a default argument.
batch_size = 100000

def array_from_generator2(generator, rows=batch_size):
    """Creates a numpy array from a specified number
    of values from the generator provided."""
    data = []
    for row in range(rows):
        try:
            data.append(next(generator))
        except StopIteration:
            break
    return np.array(data)

batch_of_values = array_from_generator2(permutations_of_values, rows=100000)
print(batch_of_values[0:5])
Output:
[[ 1  2  3  4  5  6  7]
 [ 1  2  3  4  5  6  8]
 [ 1  2  3  4  5  6  9]
 [ 1  2  3  4  5  6 10]
 [ 1  2  3  4  5  6 11]]
Speed test:
%timeit array_from_generator2(permutations_of_values, rows=100000)
10 loops, best of 3: 85.6 ms per loop
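One possible simplification of the list-based version, sketched here under assumptions: itertools.islice can take the next batch from the generator without an explicit try/except, since it simply stops early when the generator runs out. The name batches_from_generator and its cols parameter are hypothetical, not part of the code above, and I have not timed this against the two functions above.

```python
from itertools import islice, permutations

import numpy as np


def batches_from_generator(generator, rows, cols):
    """Yield 2D arrays of at most `rows` rows until the generator is exhausted.

    Hypothetical helper for illustration only.
    """
    while True:
        # islice stops quietly when the generator is exhausted.
        chunk = list(islice(generator, rows))
        if not chunk:
            return
        # reshape keeps the result 2D with a fixed number of columns.
        yield np.array(chunk, dtype=int).reshape(-1, cols)


# Small example so it runs quickly: 12 permutations taken in batches of 5.
small_perms = permutations(range(1, 5), 2)
batch_shapes = [
    batch.shape for batch in batches_from_generator(small_perms, rows=5, cols=2)
]
print(batch_shapes)  # [(5, 2), (5, 2), (2, 2)]
```

Each batch is still built from an intermediate list, so peak memory per batch is similar to array_from_generator2; the gain is mainly less Python-level bookkeeping in the loop.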
np.array on the resultant.
fromiter, as discussed in a couple of the linked answers, is the only way of creating an array directly from the output of a generator. Otherwise you need to create a list and build or fill in the array from that. Generators can save memory during intermediate processing (compared with the list equivalent), but aren't any faster.
fromiter would be great but it only works with series (1-dimensional arrays).
fromiter creates "a new 1-dimensional array from an iterable object". What I am trying to do here is 2-dimensional because each item from the generator is a tuple of 7 values. Maybe it's time to extend fromiter to handle multi-dimensional iterators...
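One way around the 1-dimensional limitation discussed in these comments is to flatten the tuples before they reach fromiter and reshape afterwards. The following is a minimal sketch, not a benchmarked answer: array_from_generator3 and its rows and cols parameters are hypothetical names, and I have not measured whether it beats the list-based version.

```python
from itertools import chain, islice, permutations

import numpy as np


def array_from_generator3(generator, rows, cols):
    """Build an array of at most `rows` rows using a flat np.fromiter call.

    Hypothetical helper for illustration only.
    """
    # Take up to `rows` tuples, then chain them into one flat stream of ints.
    flat_values = chain.from_iterable(islice(generator, rows))
    # count is left at its default because the generator may run out early.
    flat = np.fromiter(flat_values, dtype=int)
    return flat.reshape(-1, cols)


perms = permutations(range(1, 20), 7)
first_batch = array_from_generator3(perms, rows=5, cols=7)
print(first_batch)
```

As far as I know, NumPy 1.23 and later also let fromiter accept a subarray dtype such as np.dtype((int, 7)), which would consume the tuples directly; check the documentation for the version you have installed before relying on that.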