
I have a little script that does a few simple tasks. Running Python 3.7.

One of the tasks has to merge some files together which can be a little time consuming.

It loops through multiple directories, then each directory gets passed to the function. The function just loops through the files and merges them.

Instead of waiting for it to finish one directory before moving on to the next, then waiting again, and so on...

I'd like to use the available cores to have the script merge the PDFs in multiple directories at once, which should shave off some time.

I've got something like this:

if multi_directories:
    os.makedirs('merged', exist_ok=True)
    for directory in multi_directories:
        merge_pdfs(directory)

My merge PDF function looks like this:

def merge_pdfs(directory):
    root_dir = os.path.dirname(os.path.abspath(__file__))
    merged_dir_location = os.path.join(root_dir, 'merged')
    dir_title = directory.rsplit('/', 1)[-1]
    file_list = [file for file in os.listdir(directory)]
    # compute the output path once, outside the loop
    file_to_save = os.path.join(
        merged_dir_location,
        dir_title + "-merged.pdf"
    )
    merger = PdfFileMerger()
    for pdf in file_list:
        file_to_open = os.path.join(directory, pdf)
        # passing the path lets PdfFileMerger manage the file handle,
        # instead of leaving an open() handle unclosed
        merger.append(file_to_open)
    with open(file_to_save, "wb") as fout:
        merger.write(fout)
    merger.close()
    return True

This works great - but merge_pdfs runs slowly in some instances where there are a high number of PDFs in the directory.

Essentially, I want to be able to loop through multi_directories, create a new thread or process for each directory, and merge the PDFs at the same time.

I've looked at asyncio, multithreading, and a wealth of little snippets here and there, but can't seem to get any of it to work.

  • multiprocessing.Pool would probably be the way to go here. Create a pool that executes the merger, then use its apply function to iterate over the list of directories. Commented Nov 9, 2018 at 19:32
  • @torek - Are there some examples of that somewhere? Commented Nov 9, 2018 at 19:36
  • Probably lots here on StackOverflow. Commented Nov 9, 2018 at 19:51

1 Answer


You can do something like:

from multiprocessing import Pool

n_processes = 2
...
if multi_directories:
    os.makedirs('merged', exist_ok=True)
    # the context manager closes the pool when the work is done
    with Pool(n_processes) as pool:
        pool.map(merge_pdfs, multi_directories)

It should help if the bottleneck is CPU usage. But it may make things even worse if the bottleneck is disk I/O, because reading several files in parallel from one physical HDD is usually slower than reading them sequentially. Try it with different values of n_processes.
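One way to pick n_processes empirically is to time the same workload at a few pool sizes and compare. A minimal sketch of that idea, using a hypothetical fake_merge stand-in (a CPU-bound dummy, since your real merge_pdfs depends on files on disk):

```python
import time
from multiprocessing import Pool

def fake_merge(directory):
    # stand-in for merge_pdfs: burns a little CPU so timings are measurable
    total = 0
    for i in range(200_000):
        total += i * i
    return directory

def time_pool(n_processes, directories):
    """Return wall-clock seconds taken by pool.map with n_processes workers."""
    start = time.perf_counter()
    with Pool(n_processes) as pool:
        results = pool.map(fake_merge, directories)
    assert len(results) == len(directories)
    return time.perf_counter() - start

if __name__ == '__main__':
    # the __main__ guard matters: platforms that spawn workers (e.g. Windows)
    # re-import this module in each child process
    dirs = [f'dir{i}' for i in range(8)]
    for n in (1, 2, 4):
        print(f'{n} processes: {time_pool(n, dirs):.3f}s')
```

With a real, I/O-heavy merge, the sweet spot may well be a small number of workers for the reasons above, so it is worth measuring rather than assuming more processes is always faster.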

BTW, to make a list from an iterable, use list(): file_list = list(os.listdir(directory)). And since listdir() already returns a list, you can simply write file_list = os.listdir(directory).


1 Comment

This worked a treat - and on some of my bigger PDF sets it dropped the time this process took considerably (by multiple minutes)! Thanks for the explanations and the tweaks in other areas! Very appreciated.
