I have implemented a data iterator that takes items from two NumPy arrays and performs very CPU-intensive computations on them before returning them. I want to parallelize this using Dask. Here is a simplified version of the iterator class:
import numpy as np
class DataIterator:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        item1, item2 = self.x[idx], self.y[idx]
        # Do some very heavy computations here by
        # calling other methods, then return
        return item1, item2
x = np.random.randint(20, size=(20,))
y = np.random.randint(50, size=(20,))
data_gen = DataIterator(x, y)
Right now, I iterate over the items using a simple for loop like this:
for i, (item1, item2) in enumerate(data_gen):
    print(item1, item2)
But this is very slow, since every item is processed sequentially. Could someone help me parallelize it with Dask?
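For context, here is a minimal sketch of the direction I have been looking at, using dask.delayed with the multiprocessing scheduler (which I understand is suited to CPU-bound work). The process function below is only a placeholder for the real heavy computation:

```python
import numpy as np
import dask
from dask import delayed

def process(item1, item2):
    # Placeholder for the real CPU-heavy computation;
    # here it just adds the two items together.
    return item1 + item2

x = np.random.randint(20, size=(20,))
y = np.random.randint(50, size=(20,))

# Build one lazy task per index, then compute them in parallel.
tasks = [delayed(process)(x[i], y[i]) for i in range(len(x))]
results = dask.compute(*tasks, scheduler="processes")
```

I am not sure whether wrapping each index in a separate delayed call like this is the right granularity, or whether the iterator class itself should be restructured.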