Let's assume that I have 100k JSON files, each with quite a lot of data in them, and that data_jsons is the list of the names of these files.
Also, let's assume that I have 3 functions:
1) upload_data()
2) data_preprocess_1()
3) data_preprocess_2()
These functions can be called on each JSON file separately, so they are all parallelisable.
What is the best way to multiprocess my code as a whole?
One option (very roughly described) is the following:
import os
from multiprocessing import Pool

def upload_data(json_file):
    ...

def data_preprocess_1(uploaded):
    ...

def data_preprocess_2(preprocessed_1):
    ...

if __name__ == '__main__':
    with Pool(processes=os.cpu_count()) as pool:
        # each stage is mapped over the full list before the next stage starts
        temp_1 = pool.map(upload_data, data_jsons)
        temp_2 = pool.map(data_preprocess_1, temp_1)
        final = pool.map(data_preprocess_2, temp_2)
But as far as I understand, this way I parallelise each function separately, whereas I could parallelise all of them together and avoid keeping temp_1 and temp_2 around with all my data in them (which will take up quite a lot of memory).
The option (again very roughly described) that I think avoids this is the following:
import os
from multiprocessing import Pool

def upload_data(json_file):
    ...

def data_preprocess_1(uploaded):
    ...

def data_preprocess_2(preprocessed_1):
    ...

def data_all(json_file):
    # run the whole pipeline on a single file
    uploaded = upload_data(json_file)
    preprocessed_1 = data_preprocess_1(uploaded)
    return data_preprocess_2(preprocessed_1)

if __name__ == '__main__':
    with Pool(processes=os.cpu_count()) as pool:
        final = pool.map(data_all, data_jsons)
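For reference, here is a minimal sketch of the same combined approach driven with imap_unordered, so that results are consumed one at a time instead of being collected into a single final list (the chunksize of 64 is just an arbitrary illustrative value):

import os
from multiprocessing import Pool

if __name__ == '__main__':
    with Pool(processes=os.cpu_count()) as pool:
        # imap_unordered yields each result as soon as a worker finishes it,
        # so the parent process never holds the full list of outputs;
        # chunksize=64 is only a guess to reduce task-dispatch overhead
        for result in pool.imap_unordered(data_all, data_jsons, chunksize=64):
            ...  # e.g. write each result to disk as it arrives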
Is there any other option that I am missing?
Is there anything wrong in my thinking about the options I described?
Just to be clear, the reason why I do not want to merge these 3 functions into one is that the code in each of them performs a different sub-task.