
I have a little script that does a few simple tasks. Running Python 3.7.

One of the tasks has to merge some files together which can be a little time consuming.

It loops through multiple directories, then each directory gets passed to the function. The function just loops through the files and merges them.

Instead of waiting for it to finish one directory before moving on to the next, then waiting again, and so on...

I'd like to use the available cores to have the script merge the PDFs in multiple directories at once, which should shave off some time.

I've got something like this:

if multi_directories:
    os.makedirs('merged', exist_ok=True)
    for directory in multi_directories:
        merge_pdfs(directory)

My merge PDF function looks like this:

def merge_pdfs(directory):
    root_dir = os.path.dirname(os.path.abspath(__file__))
    merged_dir_location = os.path.join(root_dir, 'merged')
    dir_title = directory.rsplit('/', 1)[-1]
    file_list = [file for file in os.listdir(directory)]
    # compute the output path once, outside the loop
    file_to_save = os.path.join(
        merged_dir_location,
        dir_title + "-merged.pdf"
    )
    merger = PdfFileMerger()
    for pdf in file_list:
        file_to_open = os.path.join(directory, pdf)
        # passing the path lets PdfFileMerger manage the file handle,
        # instead of leaving an open() handle unclosed
        merger.append(file_to_open)
    with open(file_to_save, "wb") as fout:
        merger.write(fout)
    merger.close()
    return True

This works great - but merge_pdfs runs slowly in some instances where there are a high number of PDFs in the directory.

Essentially, I want to be able to loop through multi_directories, create a new thread or process for each directory, and merge the PDFs at the same time.

I've looked at asyncio, multithreading, and a wealth of little snippets here and there, but can't seem to get any of it to work.

  • multiprocessing.Pool would probably be the way to go here. Create a pool that executes the merger, then use its apply function to iterate over the list of directories. Commented Nov 9, 2018 at 19:32
  • @torek - Are there some examples of that somewhere? Commented Nov 9, 2018 at 19:36
  • Probably lots here on StackOverflow. Commented Nov 9, 2018 at 19:51

1 Answer


You can do something like:

from multiprocessing import Pool

n_processes = 2
...
if multi_directories:
    os.makedirs('merged', exist_ok=True)
    # the context manager closes the pool when the work is done
    with Pool(n_processes) as pool:
        pool.map(merge_pdfs, multi_directories)

It should help if the bottleneck is CPU usage. But it may make things even worse if the bottleneck is disk I/O, because reading several files in parallel from one physical HDD is usually slower than reading them sequentially. Try it with different values of n_processes.
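One way to pick n_processes empirically is to time the same workload at a few pool sizes and compare. A minimal sketch of that idea, using a hypothetical fake_merge stand-in (a CPU-bound dummy, since your real merge_pdfs depends on files on disk):

```python
import time
from multiprocessing import Pool

def fake_merge(directory):
    # stand-in for merge_pdfs: burns a little CPU so timings are measurable
    total = 0
    for i in range(200_000):
        total += i * i
    return directory

def time_pool(n_processes, directories):
    """Return wall-clock seconds taken by pool.map with n_processes workers."""
    start = time.perf_counter()
    with Pool(n_processes) as pool:
        results = pool.map(fake_merge, directories)
    assert len(results) == len(directories)
    return time.perf_counter() - start

if __name__ == '__main__':
    # the __main__ guard matters: platforms that spawn workers (e.g. Windows)
    # re-import this module in each child process
    dirs = [f'dir{i}' for i in range(8)]
    for n in (1, 2, 4):
        print(f'{n} processes: {time_pool(n, dirs):.3f}s')
```

With a real, I/O-heavy merge, the sweet spot may well be a small number of workers for the reasons above, so it is worth measuring rather than assuming more processes is always faster.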

BTW, to make a list from an iterable, use list(): file_list = list(os.listdir(directory)). And since listdir() already returns a list, you can simply write file_list = os.listdir(directory).


1 Comment

This worked a treat - and on some of my bigger PDF sets it dropped the time this process took considerably (by multiple minutes)! Thanks for the explanations and the tweaks in other areas! Very appreciated.
