
I am creating some scripts to help with my database import while using Docker. I currently have a directory filled with data, and I want to import it as quickly as possible.

The import work is single-threaded, so I want to speed things up by handing out multiple jobs at once, one per core on my server.

This is the code I've written so far.

#!/usr/bin/python
import sys
import subprocess

cities = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10"]

for x in cities:
    dockerscript = "docker exec -it worker_1 perl import.pl ./public/%s %s mariadb" % (x, x)
    # Popen returns immediately without waiting, so all ten imports start at once
    p = subprocess.Popen(dockerscript, shell=True, stderr=subprocess.PIPE)

This works fine if I have at least 10 cores: each job gets its own core. What I want is to set it up so that, if I have 4 cores, the first 4 iterations of dockerscript run (jobs 1 to 4) while jobs 5 to 10 wait.

Once any of jobs 1 to 4 completes, job 5 starts, and so on until everything has been completed.

I'm just having a hard time figuring out how to do this in Python.

Thanks

2 Answers


You should use multiprocessing.Pool(), which will automatically create one worker process per core, and then submit your jobs to it. Each job is a function that calls subprocess to start Docker. Note that you need to make sure the jobs are synchronous, i.e. the Docker command must not return before it has finished working.
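A minimal sketch of that approach, reusing the cities list and the docker command from the question (run_import is just an illustrative name, and the command line is copied from the question, so adjust as needed):

#!/usr/bin/python
import subprocess
import multiprocessing

cities = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10"]

def run_import(city):
    # subprocess.call blocks until the docker command exits, so each job is synchronous
    return subprocess.call(['docker', 'exec', '-it', 'worker_1', 'perl',
        'import.pl', './public/%s' % city, city, 'mariadb'])

if __name__ == '__main__':
    # Pool() defaults to one worker process per CPU core
    pool = multiprocessing.Pool()
    pool.map(run_import, cities)
    pool.close()
    pool.join()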


1 Comment

When using multiprocessing.Pool(), is this really just a replacement line for subprocess.Popen()? Does the for loop "for x in cities" all stay the same?

John already has the answer, but there are a couple of subtleties worth mentioning. A thread pool is fine for this application because each thread just spends its time blocked, waiting for its subprocess to terminate. And you can use map with chunksize=1 so that the pool goes back to the parent to fetch a new job on each iteration.

#!/usr/bin/python
import sys
import subprocess
import multiprocessing.pool

cities = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10"]

def run_docker(city):
    # subprocess.call blocks until the docker command exits, so each
    # worker thread is tied up for the duration of one import
    return subprocess.call(['docker', 'exec', '-it', 'worker_1', 'perl',
        'import.pl', './public/{0}'.format(city), city, 'mariadb'])

# ThreadPool() defaults to one worker thread per CPU core
pool = multiprocessing.pool.ThreadPool()
results = pool.map(run_docker, cities, chunksize=1)
pool.close()
pool.join()

2 Comments

So compared to John's answer, what are map and chunksize=1 doing that just using multiprocessing.Pool() doesn't accomplish?
You can use multiprocessing.Pool equally well, except that it has more overhead. On Linux each worker is a fork, and on Windows each worker is a newly started Python process, plus pickling of the context needed to run the job. Then input/output needs to be piped between the processes.
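For comparison, the process-pool variant is essentially a drop-in swap for the ThreadPool line above. This is only a sketch, reusing run_docker and cities from the code above; on Windows the pool creation would also need to sit under an if __name__ == '__main__': guard, since the workers are separate Python processes.

import multiprocessing

if __name__ == '__main__':
    pool = multiprocessing.Pool()  # worker processes instead of threads
    results = pool.map(run_docker, cities, chunksize=1)
    pool.close()
    pool.join()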
