
I am creating some scripts to help with my database import while using Docker. I currently have a directory filled with data, and I want to import it as quickly as possible.

The import work is single-threaded, so I want to speed things up by handing out multiple jobs at once, one per core on my server.

This is the code I've written so far.

#!/usr/bin/python
import sys
import subprocess

cities = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10"]

for x in cities:
    dockerscript = "docker exec -it worker_1 perl import.pl ./public/%s %s mariadb" % (x, x)
    # Popen returns immediately without waiting, so all ten imports start at once
    p = subprocess.Popen(dockerscript, shell=True, stderr=subprocess.PIPE)

This works fine if I have at least 10 cores: each job gets its own core. What I want is to set it up so that, if I have 4 cores, the first 4 iterations of dockerscript run (jobs 1 to 4) while jobs 5 to 10 wait.

Once any of jobs 1 to 4 completes, job 5 starts, and so on until everything has been completed.

I'm just having a hard time figuring out how to do this in Python.

Thanks

2 Answers


You should use multiprocessing.Pool(), which will automatically create one worker process per core, and then submit your jobs to it. Each job is a function that calls subprocess to start Docker. Note that you need to make sure the jobs are synchronous, i.e. the Docker command must not return before it has finished working.
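A minimal sketch of that approach, reusing the cities list and the docker command from the question (run_import is just an illustrative name, and the command line is copied from the question, so adjust as needed):

#!/usr/bin/python
import subprocess
import multiprocessing

cities = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10"]

def run_import(city):
    # subprocess.call blocks until the docker command exits, so each job is synchronous
    return subprocess.call(['docker', 'exec', '-it', 'worker_1', 'perl',
        'import.pl', './public/%s' % city, city, 'mariadb'])

if __name__ == '__main__':
    # Pool() defaults to one worker process per CPU core
    pool = multiprocessing.Pool()
    pool.map(run_import, cities)
    pool.close()
    pool.join()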


1 Comment

When using multiprocessing.Pool(), is this really just a replacement line for subprocess.Popen()? Does the for loop "for x in cities" all stay the same?

John already has the answer, but there are a couple of subtleties worth mentioning. A thread pool is fine for this application because each thread just spends its time blocked, waiting for its subprocess to terminate. And you can use map with chunksize=1 so that the pool goes back to the parent to fetch a new job on each iteration.

#!/usr/bin/python
import sys
import subprocess
import multiprocessing.pool

cities = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10"]

def run_docker(city):
    # subprocess.call blocks until the docker command exits, so each
    # worker thread is tied up for the duration of one import
    return subprocess.call(['docker', 'exec', '-it', 'worker_1', 'perl',
        'import.pl', './public/{0}'.format(city), city, 'mariadb'])

# ThreadPool() defaults to one worker thread per CPU core
pool = multiprocessing.pool.ThreadPool()
results = pool.map(run_docker, cities, chunksize=1)
pool.close()
pool.join()

2 Comments

So compared to John's answer, what are map and chunksize=1 doing that just using multiprocessing.Pool() doesn't accomplish?
You can use multiprocessing.Pool equally well, except that it has more overhead. On Linux each worker is a fork, and on Windows each worker is a newly started Python process, plus pickling of the context needed to run the job. Then input/output needs to be piped between the processes.
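For comparison, the process-pool variant is essentially a drop-in swap for the ThreadPool line above. This is only a sketch, reusing run_docker and cities from the code above; on Windows the pool creation would also need to sit under an if __name__ == '__main__': guard, since the workers are separate Python processes.

import multiprocessing

if __name__ == '__main__':
    pool = multiprocessing.Pool()  # worker processes instead of threads
    results = pool.map(run_docker, cities, chunksize=1)
    pool.close()
    pool.join()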
