I wrote a script to launch a number of processes (simple unit tests) in parallel. It runs N jobs total, with at most num_workers processes running at a time.
My first implementation ran the processes in batches of num_workers and seemed to work fine (I use the false command here to test the behavior):
import subprocess

errors = 0
num_workers = 10
N = 100
i = 0
while i < N:
    processes = []
    for j in range(i, min(i + num_workers, N)):
        p = subprocess.Popen(['false'])
        processes.append(p)
    [p.wait() for p in processes]
    exit_codes = [p.returncode for p in processes]
    errors += sum(int(e != 0) for e in exit_codes)
    i += num_workers
print(f"There were {errors}/{N} errors")
However, the tests do not take equal amounts of time, so a whole batch could end up idle waiting for one slow test to finish. I therefore rewrote it to assign a new task as soon as a running one completes:
import subprocess
import os

errors = 0
num_workers = 40
N = 100
assigned = 0
completed = 0
processes = set()
while completed < N:
    if assigned < N:
        # launch the next job
        p = subprocess.Popen(['false'])
        processes.add((assigned, p))
        assigned += 1
    if len(processes) >= num_workers or assigned == N:
        # pool is full (or everything is assigned): block until some child exits
        os.wait()
    # iterate over a copy so entries can be removed while iterating
    for i, p in frozenset(processes):
        if p.poll() is not None:
            completed += 1
            processes.remove((i, p))
            err = p.returncode
            print(i, err)
            if err != 0:
                errors += 1
print(f"There were {errors}/{N} errors")
However, this produces wrong results for the last few processes. For instance, the example above reports 98/100 errors instead of 100/100. I checked, and this has nothing to do with concurrency; the last 2 jobs report exit code 0 for some reason.
Why does this happen?
os.wait() clobbers the return code for poll(): once os.wait() has reaped a child, Popen.poll() can no longer retrieve its real exit status and falls back to 0, so those failures never reach the errors variable.
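A minimal sketch of the clobbering, assuming CPython on a POSIX system (os.waitstatus_to_exitcode needs Python 3.9+). When os.wait() reaps the child first, the later waitpid() inside poll() fails with ECHILD and subprocess falls back to reporting 0:

import os
import subprocess

p = subprocess.Popen(['false'])
pid, status = os.wait()                   # reaps the child before Popen can
assert pid == p.pid                       # it was our only child
print(os.waitstatus_to_exitcode(status))  # 1: the real exit code
print(p.poll())                           # 0: waitpid() fails with ECHILD, so poll() guesses

One possible fix, keeping the os.wait() structure, is to use the status os.wait() already returns instead of calling poll() afterwards. This is a sketch, not a drop-in replacement; the running dict is bookkeeping added here, not part of the original script:

import os
import subprocess

errors = 0
num_workers = 40
N = 100
assigned = 0
completed = 0
running = {}  # pid -> (job index, Popen); bookkeeping added for this sketch

while completed < N:
    # keep the pool full until all N jobs have been handed out
    while assigned < N and len(running) < num_workers:
        p = subprocess.Popen(['false'])
        running[p.pid] = (assigned, p)
        assigned += 1
    # os.wait() reports both who finished and how; use that status directly
    pid, status = os.wait()
    i, p = running.pop(pid)
    err = os.waitstatus_to_exitcode(status)  # Python 3.9+
    p.returncode = err  # we reaped the child ourselves, so record it on the Popen
    print(i, err)
    if err != 0:
        errors += 1
    completed += 1

print(f"There were {errors}/{N} errors")

Alternatively, dropping os.wait() entirely (for example, having a thread pool call subprocess.run for each job) avoids the double-reaping problem altogether.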