
Let's assume that I have 100k JSON files with quite a lot of data in them, and that data_jsons is the list of the names of these files.

Also, let's assume that I have 3 functions: 1) upload_data() 2) data_preprocess_1() 3) data_preprocess_2()

These functions can be called for each JSON file separately, so they are all parallelisable.

What is the best way to multiprocess my code as a whole?

One option (very roughly described) is the following:

import os
from multiprocessing import Pool


def upload_data(json_file):
    ...


def data_preprocess_1(uploaded):
    ...


def data_preprocess_2(preprocessed):
    ...


if __name__ == '__main__':

    # data_jsons: the list of JSON file names described above
    with Pool(processes=os.cpu_count()) as pool:
        temp_1 = pool.map(upload_data, data_jsons)

    with Pool(processes=os.cpu_count()) as pool:
        temp_2 = pool.map(data_preprocess_1, temp_1)

    with Pool(processes=os.cpu_count()) as pool:
        final = pool.map(data_preprocess_2, temp_2)

But as far as I understand, in this way I parallelise each function separately, whereas I could do it for all of them together and avoid loading temp_1 and temp_2 with all my data (which would take up quite a lot of memory).

The option (very roughly described) that I think avoids this is the following:

import os
from multiprocessing import Pool


def upload_data(json_file):
    ...


def data_preprocess_1(uploaded):
    ...


def data_preprocess_2(preprocessed):
    ...


def data_all(json_file):
    temp_1 = upload_data(json_file)
    temp_2 = data_preprocess_1(temp_1)
    return data_preprocess_2(temp_2)


if __name__ == '__main__':

    # data_jsons: the list of JSON file names described above
    with Pool(processes=os.cpu_count()) as pool:
        final = pool.map(data_all, data_jsons)

Is there any other option which I am missing?

Am I getting something wrong about the options I described?

Just to make clear: the reason why I do not want to merge these 3 functions into one is that the code in each of them performs a different sub-task.


1 Answer


For any optimization problem, start with benchmarks before anything else.

That said, you'll almost certainly want to have a mechanism like data_all() rather than using intermediate storage. For many cases where you might want to apply multiprocessing, the dominant cost is just moving objects from the memory of one process to the memory of another, and the only way AFAIK to offset that is to do more work for each bit of data transferred.

To your other question about whether there are any other options you're missing, there are tons. You can have different kinds of batching, streaming, or other kinds of manipulations and transformations at any step of the process that can alter the performance characteristics of the pipeline. Peak memory usage in particular can be reduced with other kinds of architectures, but whether that matters (or is practical) really depends on your exact data.
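As a rough illustration of the streaming idea (one possibility among many, not a recommendation specific to your data): the sketch below assumes the combined data_all() worker from the question, and the data/ directory and results.jsonl output file are invented placeholders. The point is that consuming results one by one with imap_unordered, instead of collecting them all with map, keeps peak memory in the parent roughly constant.

import glob
import json
import os
from multiprocessing import Pool


def data_all(json_file):
    # Placeholder for upload_data -> data_preprocess_1 -> data_preprocess_2
    # chained together, as in the second option of the question.
    with open(json_file) as f:
        return json.load(f)


if __name__ == '__main__':

    data_jsons = glob.glob('data/*.json')  # hypothetical location of the 100k files

    with Pool(processes=os.cpu_count()) as pool, open('results.jsonl', 'w') as out:
        # imap_unordered hands back each result as soon as a worker finishes it,
        # instead of building one huge list the way map() does, so only a few
        # results sit in the parent's memory at any time.  chunksize groups the
        # task submissions to reduce inter-process communication overhead.
        for result in pool.imap_unordered(data_all, data_jsons, chunksize=64):
            out.write(json.dumps(result) + '\n')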


4 Comments

Thank you for your answer (upvoted). Regarding your 2nd paragraph, you seem to agree with my 2nd option. Regarding your 3rd paragraph, as you say, there are countless options for batching, manipulation, etc. My question was mainly whether there is any obvious one which is a "game-changer". For example, in my post, the 2nd option is a game-changer compared with the 1st one, in the sense that the former is roughly one order better than the latter, generally speaking. In the same way, I was wondering if there is any other option which is "one order" better than my 2nd option (but it seems not?).
Not necessarily. There aren't any game-changers that will make it faster per se. However, if your results don't fit easily in RAM or have other undesirable properties, blindly throwing Pool.map at the problem won't work very well (it returns a list, so every result is cached in RAM at once). One approach in such scenarios is to iterate through your data in batches, apply Pool.map to sufficiently small batches, and yield results from those batches (or push them to a Pub/Sub system, store them in a DB or a file, whatever you're doing with the data); see the sketch after these comments.
If we're looking at this purely as a total-time optimization for data that's large enough to need multiprocessing, complicated enough that your data_all() takes longer than copying the data from one process to another, and small enough to fit easily in RAM, then you really only have the two options you first mentioned to worry about, and between those the data_all() approach is almost certainly better (though still maybe not faster than not multiprocessing at all). If you have more constraints, you'll get more complicated solutions like batching.
Yes, adding batching on top of Pool should make it even better. So, if I understand correctly, you also think that the 2nd option in my post is a game-changer compared to the 1st, and that there is no other obvious game-changer compared to the 2nd (unless we apply batching, which could be a new game-changer).
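A minimal sketch of the batching idea mentioned in the comments, assuming the same data_all() worker and data_jsons list as in the question; the batch size and what the caller does with each batch are placeholders.

import os
from multiprocessing import Pool


def data_all(json_file):
    ...  # upload_data -> data_preprocess_1 -> data_preprocess_2, as in the question


def batches(items, batch_size):
    # Yield successive slices of `items` of length at most `batch_size`.
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_in_batches(data_jsons, batch_size=1000):
    # Only one batch of results is held in the parent process at a time;
    # the caller can write each batch to disk, push it to a queue, etc.
    # before the next batch is processed.
    with Pool(processes=os.cpu_count()) as pool:
        for batch in batches(data_jsons, batch_size):
            yield pool.map(data_all, batch)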
