You could consider using a multiprocessing `Pool`. You can then use `map` to send chunks of your iterable to the workers of the `Pool`, which process them according to a given function. So, let's say your query is wrapped in a function, `query(data)`:
    def query(data):
        rslt = ...  # your actual lookup, e.g. query the database where data == 10
        if rslt.property == 'some condition here':
            return rslt
        # otherwise falls through and returns None
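As a minimal sketch of what that could look like, assuming a hypothetical SQLite database file mydb.sqlite with an items table and a value column (adjust the names and the driver to your actual setup):

    import sqlite3

    def query(data):
        # hypothetical sketch: open a connection per call (simple, not the fastest)
        with sqlite3.connect('mydb.sqlite') as conn:
            row = conn.execute(
                'SELECT * FROM items WHERE value = ?', (data,)
            ).fetchone()
        # stand-in for 'some condition here'
        if row is not None and row[-1] == 'wanted':
            return row
        return None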
We will use the pool like so:
    from multiprocessing import Pool

    with Pool() as pool:
        results = pool.map(query, data_list)
Now, to your requirement, we will find the first one:
    print(next(filter(None, results)))
Note that with `query` written this way, `results` will be a list of `rslt` objects and `None`s, and we are looking for the first non-`None` result.
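If it is possible that nothing matches, you can pass a default to `next()` so it returns that instead of raising `StopIteration`:

    # returns None instead of raising StopIteration when there is no match
    first_match = next(filter(None, results), None)
    print(first_match)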
A few notes:

- The `Pool` constructor's first argument is `processes`, which lets you choose how many worker processes the pool will hold (see the sketch after these notes). From the docs:

  > If processes is None then the number returned by os.cpu_count() is used.

- `map` also has a `chunksize` argument, which defaults to 1 and lets you choose the size of the chunks passed to the workers. From the docs:

  > This method chops the iterable into a number of chunks which it submits to the process pool as separate tasks. The (approximate) size of these chunks can be specified by setting chunksize to a positive integer.
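A minimal sketch combining both arguments (the specific numbers are just placeholders to tune):

    import os
    from multiprocessing import Pool

    # the default is os.cpu_count() workers; here both arguments are passed explicitly
    with Pool(processes=os.cpu_count()) as pool:
        results = pool.map(query, data_list, chunksize=50)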
Continuing with `map`, the docs recommend using `imap` with an explicit chunksize for very long iterables, for better efficiency:

> Note that it may cause high memory usage for very long iterables. Consider using imap() or imap_unordered() with explicit chunksize option for better efficiency.
And from the `imap` docs:

> The chunksize argument is the same as the one used by the map() method. For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.
So we could actually be more efficient and do:
    chunksize = 100
    processes = 10

    with Pool(processes=processes) as pool:
        print(next(filter(None, pool.imap(query, data_list, chunksize=chunksize))))
And here you could play with `chunksize` and even `processes` (from the `Pool` constructor above) and see which combination produces the best results.
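As a rough, hypothetical sketch of how you could compare combinations (the grids of values are just examples):

    import time
    from multiprocessing import Pool

    # time a few processes/chunksize combinations against the same data_list
    for processes in (2, 4, 8):
        for chunksize in (1, 10, 100):
            start = time.perf_counter()
            with Pool(processes=processes) as pool:
                first = next(filter(None, pool.imap(query, data_list, chunksize=chunksize)), None)
            elapsed = time.perf_counter() - start
            print(f"processes={processes}, chunksize={chunksize}: {elapsed:.2f}s")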
If you are interested, you could easily switch to threads instead of processes by simply changing your import statement to:
    from multiprocessing.dummy import Pool
As the docs say:
> multiprocessing.dummy replicates the API of multiprocessing but is no more than a wrapper around the threading module.
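So the rest of the code stays the same; for example, a thread-based version of the snippet above (same hypothetical names as before):

    from multiprocessing.dummy import Pool  # thread-based Pool with the same API

    with Pool(processes=10) as pool:
        print(next(filter(None, pool.imap(query, data_list, chunksize=100)), None))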
Hope this helps.