2

The use case is as follows: I have a script that runs a series of non-Python executables to reduce (pulsar) data. Right now I use subprocess.Popen(..., shell=True) and then subprocess's communicate function to capture the standard output and standard error from the non-Python executables, and I log the captured output using the Python logging module.
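
Roughly, each call currently looks like this (a minimal sketch; "reduce_step input.dat" is a made-up command standing in for the real executables):

    import logging
    import subprocess

    logging.basicConfig(level=logging.INFO)

    # One reduction step; the command string is a placeholder for the real tool.
    proc = subprocess.Popen("reduce_step input.dat", shell=True,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    logging.info("stdout:\n%s", out)
    if err:
        logging.error("stderr:\n%s", err)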

The problem is that only one core of the possible 8 gets used most of the time. I want to spawn multiple processes, each doing a part of the data set in parallel, and I want to keep track of progress. It is a script / program to analyze data from a low-frequency radio telescope (LOFAR). The easier it is to install / manage and test, the better. I was about to build code to manage all this, but I'm sure it must already exist in some easy library form.

1
  • "runs a series of non-python executables" All at the same time? Or serially? Please include a snippet of working code to explain what you're doing. Commented Nov 3, 2010 at 11:07

3 Answers

2

The subprocess module can start multiple processes for you just fine, and keep track of them. The problem, though, is reading the output from each process without blocking any of the others. Depending on the platform there are several ways of doing this: using the select module to see which process has data to be read, setting the output pipes to non-blocking with the fcntl module, or using threads to read each process's data (which is what subprocess.Popen.communicate itself does on Windows, since it doesn't have the other two options). In each case the devil is in the details, though.
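
For example, a rough sketch of the select-based approach on Unix might look like this (the command names and chunk files are placeholders for your real executables and data):

    import logging
    import os
    import select
    import subprocess

    logging.basicConfig(level=logging.INFO)

    # Placeholder commands standing in for the real data-reduction executables.
    commands = [["reduce_step", "chunk0.dat"], ["reduce_step", "chunk1.dat"]]

    procs = [subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
             for cmd in commands]
    pipes = dict((p.stdout, p) for p in procs)   # map each pipe to its process

    while pipes:
        # select blocks until at least one child has output ready (Unix only).
        readable, _, _ = select.select(list(pipes), [], [])
        for pipe in readable:
            data = os.read(pipe.fileno(), 4096)  # read whatever is available
            if data:
                logging.info("pid %d: %s", pipes[pipe].pid, data)
            else:
                pipes.pop(pipe).wait()           # EOF: this child is finished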

Something that handles all of this for you is Twisted, which can spawn as many processes as you want, and can call your callbacks with the data they produce (as well as in other situations).
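
A rough sketch using Twisted's ProcessProtocol might look like this (the "reduce_step" executable and chunk file names are made up):

    from twisted.internet import protocol, reactor

    class ReducerProtocol(protocol.ProcessProtocol):
        """Receives the output of one external reduction run."""
        running = 0   # class-level count of live children

        def __init__(self, name):
            self.name = name
            ReducerProtocol.running += 1

        def outReceived(self, data):
            # Hand the captured stdout to the logging module in real code.
            print("%s stdout: %r" % (self.name, data))

        def errReceived(self, data):
            print("%s stderr: %r" % (self.name, data))

        def processEnded(self, reason):
            ReducerProtocol.running -= 1
            if ReducerProtocol.running == 0:
                reactor.stop()

    # "reduce_step" is a placeholder for the real non-Python executable.
    for i in range(8):
        reactor.spawnProcess(ReducerProtocol("chunk%d" % i),
                             "reduce_step", ["reduce_step", "chunk%d.dat" % i])
    reactor.run()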



2

Maybe Celery will serve your needs.
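
For example, with a recent Celery you could wrap each external call in a task and fan the chunks out to workers (the module name, broker URL and the "reduce_step" command are all assumptions):

    # tasks.py -- hypothetical module name
    import subprocess
    from celery import Celery

    app = Celery("reduce", broker="amqp://")   # broker URL is an assumption

    @app.task
    def reduce_chunk(cmd):
        """Run one external reduction command; return its exit code and output."""
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        output, _ = proc.communicate()
        return proc.returncode, output

The script would then queue one task per chunk, e.g. reduce_chunk.delay(["reduce_step", "chunk0.dat"]), and collect the results as they finish.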


0

If I understand correctly what you are doing, I might suggest a slightly different approach. Try establishing a single unit of work as a function and then layer on the parallel processing after that. For example:

  1. Wrap the current functionality (calling subprocess and capturing output) into a single function. Have the function create a result object that can be returned; alternatively, the function could write out to files as you see fit.
  2. Create an iterable (list, etc.) that contains an input for each chunk of data for step 1.
  3. Create a multiprocessing Pool and then capitalize on its map() functionality to execute your function from step 1 for each of the items in step 2. See the python multiprocessing docs for details.

You could also use a worker/Queue model. The key, I think, is to encapsulate the current subprocess/output capture stuff into a function that does the work for a single chunk of data (whatever that is). Layering on the parallel processing piece is then quite straightforward using any of several techniques, only a couple of which were described here.
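
A minimal sketch of steps 1-3 with multiprocessing.Pool might look like this (the "reduce_step" executable and the chunk file names are placeholders):

    import logging
    import subprocess
    from multiprocessing import Pool   # standard library since Python 2.6

    logging.basicConfig(level=logging.INFO)

    def reduce_chunk(input_path):
        """Step 1: run one external reduction command for a single chunk of data."""
        cmd = ["reduce_step", input_path]   # placeholder executable name
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        output, _ = proc.communicate()
        return input_path, proc.returncode, output

    if __name__ == "__main__":
        chunks = ["chunk%d.dat" % i for i in range(8)]   # step 2: one input per chunk
        pool = Pool(processes=8)                         # step 3: one worker per core
        for path, code, output in pool.map(reduce_chunk, chunks):
            logging.info("%s exited with %s:\n%s", path, code, output)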

2 Comments

The problem is that the code has to run on cluster computers which have Python 2.5, and the multiprocessing module has only been in Python since 2.6. :/
It is nearly trivial to install Python in your home directory and use that one instead. That said, if you are on a cluster, you might have other options, including batch submission to a queueing system.
