
Let's assume that I have 100k JSON files with quite a lot of data in them, and that data_jsons is the list of the names of these files.

Also, let's assume that I have 3 functions: 1) upload_data() 2) data_preprocess_1() 3) data_preprocess_2()

These functions can be called for each JSON file separately, so they are all parallelisable.

What is the best way to multiprocess my code as a whole?

One option (very roughly described) is the following:

import os
from multiprocessing import Pool


def upload_data(json_file):
    ...


def data_preprocess_1(uploaded):
    ...


def data_preprocess_2(preprocessed):
    ...


if __name__ == '__main__':

    # data_jsons: the list of JSON file names described above
    with Pool(processes=os.cpu_count()) as pool:
        temp_1 = pool.map(upload_data, data_jsons)

    with Pool(processes=os.cpu_count()) as pool:
        temp_2 = pool.map(data_preprocess_1, temp_1)

    with Pool(processes=os.cpu_count()) as pool:
        final = pool.map(data_preprocess_2, temp_2)

But as far as I understand, in this way I parallelise each function separately, whereas I could do it for all of them together and avoid loading temp_1 and temp_2 with all my data (which would take up quite a lot of memory).

The option (very roughly described) that I think avoids this is the following:

import os
from multiprocessing import Pool


def upload_data(json_file):
    ...


def data_preprocess_1(uploaded):
    ...


def data_preprocess_2(preprocessed):
    ...


def data_all(json_file):
    temp_1 = upload_data(json_file)
    temp_2 = data_preprocess_1(temp_1)
    return data_preprocess_2(temp_2)


if __name__ == '__main__':

    # data_jsons: the list of JSON file names described above
    with Pool(processes=os.cpu_count()) as pool:
        final = pool.map(data_all, data_jsons)

Is there any other option which I am missing?

Am I getting something wrong about the options I described?

Just to make clear: the reason why I do not want to merge these 3 functions into one is that the code in each of them performs a different sub-task.


1 Answer


For any optimization problem, start with benchmarks before anything else.

That said, you'll almost certainly want to have a mechanism like data_all() rather than using intermediate storage. For many cases where you might want to apply multiprocessing, the dominant cost is just moving objects from the memory of one process to the memory of another, and the only way AFAIK to offset that is to do more work for each bit of data transferred.

To your other question about whether there are any other options you're missing, there are tons. You can have different kinds of batching, streaming, or other kinds of manipulations and transformations at any step of the process that can alter the performance characteristics of the pipeline. Peak memory usage in particular can be reduced with other kinds of architectures, but whether that matters (or is practical) really depends on your exact data.
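As a rough illustration of the streaming idea (one possibility among many, not a recommendation specific to your data): the sketch below assumes the combined data_all() worker from the question, and the data/ directory and results.jsonl output file are invented placeholders. The point is that consuming results one by one with imap_unordered, instead of collecting them all with map, keeps peak memory in the parent roughly constant.

import glob
import json
import os
from multiprocessing import Pool


def data_all(json_file):
    # Placeholder for upload_data -> data_preprocess_1 -> data_preprocess_2
    # chained together, as in the second option of the question.
    with open(json_file) as f:
        return json.load(f)


if __name__ == '__main__':

    data_jsons = glob.glob('data/*.json')  # hypothetical location of the 100k files

    with Pool(processes=os.cpu_count()) as pool, open('results.jsonl', 'w') as out:
        # imap_unordered hands back each result as soon as a worker finishes it,
        # instead of building one huge list the way map() does, so only a few
        # results sit in the parent's memory at any time.  chunksize groups the
        # task submissions to reduce inter-process communication overhead.
        for result in pool.imap_unordered(data_all, data_jsons, chunksize=64):
            out.write(json.dumps(result) + '\n')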


4 Comments

Thank you for your answer (upvoted). Regarding your 2nd paragraph, you seem to agree with my 2nd option. Regarding your 3rd paragraph, as you say, there are countless options for batching, manipulation, etc. My question was mainly whether there is any obvious one which is a "game-changer". For example, in my post, the 2nd option is a game-changer compared with the 1st one, in the sense that the former is roughly one order better than the latter, generally speaking. In the same way, I was wondering if there is any other option which is "one order" better than my 2nd option (but it seems not?).
Not necessarily. There aren't any game-changers that will make it faster per se. However, if your results don't fit easily in RAM or have other undesirable properties, blindly throwing Pool.map at the problem won't work very well (it returns a list, so every result is cached in RAM at once). One approach in such scenarios is to iterate through your data in batches, apply Pool.map to sufficiently small batches, and yield results from those batches (or push them to a Pub/Sub system, store them in a DB or a file, whatever you're doing with the data); see the sketch after these comments.
If we're looking at this purely as a total-time optimization for data that's large enough to need multiprocessing, complicated enough that your data_all() takes longer than copying the data from one process to another, and small enough to fit easily in RAM, then you really only have the two options you first mentioned to worry about, and between those the data_all() approach is almost certainly better (though still maybe not faster than not multiprocessing at all). If you have more constraints, you'll get more complicated solutions like batching.
Yes, adding batching on top of Pool should make it even better. So, if I understand correctly, you also think that the 2nd option in my post is a game-changer compared to the 1st, and that there is no other obvious game-changer compared to the 2nd (unless we apply batching, which could be a new game-changer).
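A minimal sketch of the batching idea mentioned in the comments, assuming the same data_all() worker and data_jsons list as in the question; the batch size and what the caller does with each batch are placeholders.

import os
from multiprocessing import Pool


def data_all(json_file):
    ...  # upload_data -> data_preprocess_1 -> data_preprocess_2, as in the question


def batches(items, batch_size):
    # Yield successive slices of `items` of length at most `batch_size`.
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_in_batches(data_jsons, batch_size=1000):
    # Only one batch of results is held in the parent process at a time;
    # the caller can write each batch to disk, push it to a queue, etc.
    # before the next batch is processed.
    with Pool(processes=os.cpu_count()) as pool:
        for batch in batches(data_jsons, batch_size):
            yield pool.map(data_all, batch)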
