I have a pipeline working with LocalCluster:
from distributed import Client
client = Client()
list_of_queries = [...] # say 1_000 queries
loaded_data = client.map(sql_data_loader, list_of_queries)
processed_data = client.map(data_processor, loaded_data)
writer_results = client.map(data_writer, processed_data)
results = client.gather(writer_results)
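
The stage functions themselves shouldn't matter much here; simplified stand-ins of the same shape (placeholders, not my real functions) would be something like:

import time

def sql_data_loader(query):
    # stand-in: pretend to run the SQL query and return its rows
    time.sleep(1)
    return [query]

def data_processor(rows):
    # stand-in: pretend to transform the loaded rows
    time.sleep(1)
    return rows

def data_writer(rows):
    # stand-in: pretend to write the rows out and return a count
    time.sleep(1)
    return len(rows)
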
Everything works, but not quite as I would expect.
Looking at the dashboard's status page I see something like this:
sql_data_loader 900 / 1000
data_processor 0 / 1000
data_writer 0 / 1000
In other words, the stages run strictly one after another instead of overlapping: data_processor does not start executing until all 1000 queries have been loaded, and data_writer waits until data_processor has finished all of its futures.
Based on previous experience with dask, where dask.delayed was used instead of client.map, the expected behavior would be something more like the following, with all three stages overlapping (a sketch of the delayed-based setup is shown below the numbers for comparison):
sql_data_loader 50 / 1000
data_processor 10 / 1000
data_writer 5 / 1000
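
For comparison, the dask.delayed version of the same pipeline was set up roughly along these lines (a simplified sketch, not the exact code):

import dask
from dask import delayed

outputs = []
for query in list_of_queries:
    loaded = delayed(sql_data_loader)(query)      # one load per query
    processed = delayed(data_processor)(loaded)   # depends only on its own load
    written = delayed(data_writer)(processed)     # depends only on its own processing
    outputs.append(written)

results = dask.compute(*outputs)

With that setup the scheduler was free to interleave loading, processing, and writing across items, which is the behavior reflected in the numbers above.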
Is this a false expectation, or is there something I am missing in how the pipeline should be set up to get behavior similar to dask.delayed?