[RFC] Reducing process creation overhead in LLVM regression tests

On Windows, process creation incurs a significant overhead, especially when antivirus software is installed. I estimate the overhead per process on my work PC to be around 0.027 seconds, and Lit spawned 246,200 processes during check-llvm, i.e. roughly 6,600 seconds (nearly two hours) of CPU time spent just creating and destroying processes. This RFC demonstrates that we can reduce the time taken by check-llvm by around 20% on Windows by merging processes.

Inspired by @rgal’s observation that process creation is a significant overhead for Lit tests, I have been trying some ideas to reduce the number of process invocations during check-llvm, in the hope of reducing testing times and CI workload (judging by the CI logs on GitHub, running the regression tests on Windows takes about 8 minutes - example from a random pull request I found). The main approach I have explored is introducing new frontends for opt/llc that operate on multiple modules in one process, and mechanically “merging” a group of tests just before testing by replacing them with one test that invokes the new multi-module tool.

I drafted a very rough prototype of this idea here and the results are promising, with a reduction in testing runtime on my Windows PC from 950 to 775 seconds, or 236 to 61 seconds if only tests identified as mergeable are considered. Surprisingly, a noticeable difference on Linux (WSL) was also observed, though not as dramatic as the one observed on Windows.

Prototype implementation

Merged tests consist of a single RUN directive invoking the multi-module version of the tool (these are optmany and llcmany in my proof-of-concept and reside in the llvm/tools directory), the output of which is piped into FileCheck with the test file as the check file, as usual. Following this is a series of INPUT_FILE directives specifying the paths to the original test modules, which are interpreted by the multi-module tool. Finally, the FileCheck directives from the original tests are extracted and appended to the merged test, interspersed with checks for special boundary labels emitted by the multi-module tool.

An example of a merged test is as follows:

RUN: optmany -o - -hide-filename < %s -O2 -S | FileCheck %s
INPUT_FILE: llvm\test\Transforms\InstCombine\no-unwind-inline-asm.ll
INPUT_FILE: llvm\test\Transforms\InstCombine\unwind-inline-asm.ll
CHECK: TEST_BEGIN
CHECK-LABEL: INPUT_FILE llvm\test\Transforms\InstCombine\no-unwind-inline-asm.ll
CHECK: define dso_local void @test()
CHECK-NEXT: entry:
CHECK-NEXT: tail call void asm sideeffect
CHECK-NEXT: ret void
CHECK: TEST_END
CHECK: TEST_BEGIN
CHECK-LABEL: INPUT_FILE llvm\test\Transforms\InstCombine\unwind-inline-asm.ll
CHECK: define dso_local void @test()
CHECK-NEXT: entry:
CHECK-NEXT: invoke void asm sideeffect unwind
CHECK: %0 = landingpad { ptr, i32 }
CHECK: resume { ptr, i32 } %0
CHECK: TEST_END

These are created and run by a Python program called test_consolidator in llvm/utils. To try it, first update the information in test_consolidator.json in the project root, then invoke test_consolidator/main.py from the LLVM clone directory and, once it’s set up and prompts for a command, run all-merged to run all regression tests with as many merged as possible. Note that some TableGen tests fail when run this way; this is because they rely on relative paths, but the prototype implementation copies all tests to a temporary directory.

As for deciding which tests can be merged, the prototype implementation simply treats tests as mergeable when they have the exact same RUN directive (tests with multiple RUN directives are treated as multiple tests with one directive each), the same REQUIRES/UNSUPPORTED directives, and the same lit.local.cfg. This results in 17,832 regression tests (42% of all .ll tests) being identified as mergeable; these are converted into 5,272 merged tests, reducing the total number of processes invoked from 248,091 to 197,581.
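A minimal sketch of this grouping heuristic might look like the following (illustrative only, not the prototype’s actual code: the helper names are made up, multi-RUN tests are simply skipped rather than split, and the lit.local.cfg comparison is omitted):

import re
from collections import defaultdict
from pathlib import Path

RUN_RE = re.compile(r";\s*RUN:\s*(.*)")
CONSTRAINT_RE = re.compile(r";\s*(?:REQUIRES|UNSUPPORTED):.*")

def merge_key(test_path: Path):
    # Return a key such that tests sharing it are considered mergeable,
    # or None if the test does not qualify under this simplified model.
    text = test_path.read_text(errors="replace")
    runs = [m.group(1).strip() for m in RUN_RE.finditer(text)]
    if len(runs) != 1:
        return None  # the real consolidator splits these into one virtual test per RUN line
    constraints = tuple(sorted(m.group(0).strip() for m in CONSTRAINT_RE.finditer(text)))
    return (runs[0], constraints)

def group_mergeable(test_files):
    # Only groups containing more than one test are worth merging.
    groups = defaultdict(list)
    for path in test_files:
        key = merge_key(Path(path))
        if key is not None:
            groups[key].append(path)
    return {key: paths for key, paths in groups.items() if len(paths) > 1}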

I also experimented with parallelising optmany but this did not result in a meaningful speed improvement during actual test runs, as Lit is already running tests in parallel.

Prototype limitations

Swathes of code are copied from llc.cpp and optdriver.cpp into llcmany and optmany. This doesn’t just violate the DRY principle but also has the major flaw that changes made to this code will not be tested by the merged tests. If this idea is to be viable, this code must be extracted into library functions (e.g. optInit and optProcessModule) which are called by both opt and optmany (this follows the initiative already started by the extraction of optMain into a static library). This refactoring would be the main cost of implementing this solution.

The prototype test consolidator is quite slow to create the merged tests, taking 20 seconds to identify mergeable tests and 1 minute 50 seconds to create the merged tests on my PC. It is written in Python and relies on lots of regexes for parsing out FileCheck directives reliably(ish), among other things - I am quite new to writing regexes, so this is definitely a major area for optimisation. It should be possible to make this process fast enough to run just before testing. Also, it is almost certainly possible to construct a Lit test that confuses it (I only made sure that all mergeable tests normally invoked by check-llvm still pass after merging).
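For a flavour of what that parsing involves, a stripped-down sketch of the FileCheck-directive extraction could look like this (assuming only the default CHECK prefix; real tests may use custom --check-prefix values, multiple prefixes, and RUN-line continuations, all of which the prototype has to handle with more regexes):

import re

# Simplified: matches only the default CHECK prefix and its common suffixes.
FILECHECK_DIRECTIVE_RE = re.compile(
    r";\s*(CHECK(?:-NEXT|-SAME|-LABEL|-NOT|-DAG|-EMPTY|-COUNT-\d+)?):\s?(.*)$"
)

def extract_check_lines(test_text):
    # Yield (directive, payload) pairs for each FileCheck directive in a .ll test.
    for line in test_text.splitlines():
        match = FILECHECK_DIRECTIVE_RE.search(line)
        if match:
            yield match.group(1), match.group(2)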

Another issue is that when a merged test fails, it takes some work to find out which original test failed. A potential solution would be to automatically re-run the individual tests behind any merged tests reported as failures - this could be implemented by parsing Lit’s output or by integrating with Lit.
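Sketching the first option, the re-run step could be as simple as reading the INPUT_FILE lines back out of the failing merged test and handing the original files to llvm-lit one at a time (a hypothetical helper; it assumes the original test files are still on disk at the recorded paths):

import re
import subprocess

INPUT_FILE_RE = re.compile(r"^;?\s*INPUT_FILE:\s*(\S+)", re.MULTILINE)

def rerun_originals(merged_test_path, lit_command="llvm-lit"):
    # Re-run each original test behind a failed merged test and report which ones fail.
    with open(merged_test_path) as f:
        originals = INPUT_FILE_RE.findall(f.read())
    failures = []
    for original in originals:
        # -v shows the failing RUN line and FileCheck output for easier diagnosis.
        if subprocess.run([lit_command, "-v", original]).returncode != 0:
            failures.append(original)
    return failures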

Other approaches

A few other approaches I considered are:

  1. Mechanically merging as many opt/llc tests as possible by concatenating them into one module before testing.
    • I decided that this approach, while the least invasive (it would only require the introduction of one “merge tests” script that is run before testing), would be too fragile, as merging modules would invalidate the existing FileCheck directives, especially if identifiers, metadata IDs, etc. have to be uniquified.
  2. Compiling the tools to be tested as shared libraries and invoking them within the Lit process.
    • This is the most powerful approach in that it could theoretically eliminate the process creation overhead entirely, and it is also the most resilient to being upset by unusually structured Lit tests. However, it imposes an extremely invasive restriction on every tool the idea is applied to: the main function must be idempotent and must never call exit or similar, as that would exit the Lit worker itself - unless some hackery is done, like linking in a fake exit function, but that is too evil.

Is reducing process invocations to speed up regression testing a useful pursuit? Do you think the trade-off of extra maintenance, testing complexity, and fragility is worth the potential reduction in testing times? Does anyone have other ideas for achieving the same goal in a better way?

CC @jmorse

4 Likes

I do think that the general direction here is worth pursuing – this is most problematic on Windows, but increased process startup overhead is also one of the big downsides of enabling LLVM_LINK_LLVM_DYLIB builds on Linux, because dynamic relocations add significant overhead to short-running test processes.

However, I’m not really a fan of the specific approach you are using. This kind of pre- and post-processing of tests looks very fragile to me.

The broad alternative I’d consider is to support a daemonized running mode for tools like opt, and corresponding support in lit to send certain “commands” to the matching daemon instead, keeping a pool of such processes alive. This should be mostly transparent from the perspective of the test.

6 Likes

I also want to point out that both the merging and the daemon approach need to work well when tests are failing, especially when the tool just crashes. It needs to pinpoint exactly which tests are failing, hopefully without re-executing them (since failures can be non-deterministic).

I guess another competing idea would be going fully distributed so you can run the tests as wide as possible. Usually one lit test has very small dependencies, and you could distribute that very well, but we don’t encode fine-grained dependencies in CMake to allow that.

1 Like

I agree with Nikita that adding new binaries that can support running multiple files at the same time and using lit to merge the test lines isn’t the best option. The patch you have looks like it duplicates a bunch of code and will end up having a pretty high maintenance overhead.

However, I do think this work is quite valuable. I have recently spent a bunch of time working on eliminating process start overhead for testing across the monorepo by working towards enabling lit’s internal shell by default on Linux. That eliminates bash invocations and has resulted in a 10-15% speedup in a couple of the big test suites. We’ve been focusing on this primarily for premerge. Most of the time spent running premerge is spent running lit tests, and we’re trying to drive down test latency that way. We’ve mostly been focused on the low hanging fruit (like finishing up the enablement of the internal shell), but the next step on that front would probably be some sort of daemon mode.

There is prior art for a daemon mode in LLVM (D86351 WIP: llvm-buildozer and https://www.youtube.com/watch?v=usPL_DROn4k). I don’t think anyone has made it work specifically for testing, but I would certainly be interested in seeing it. If someone ends up putting in the work to do this, it would also be great if, in addition to the LLVM tools, we could get it working with clang. That would probably have a similar impact on the clang test suite and might have some impact on other test suites like libc++. However, the libc++ test suite doesn’t spend much time in process startup compared to the other test suites, last I checked (and disabling PIE doesn’t have a large impact on its test suite time). I think a daemon mode would have appropriate maintenance/test-speedup trade-offs, assuming the work is done to make sure it reports crashes correctly.

1 Like

Based purely on what you’ve described, my biggest concern would be that merging processes would mean those processes then share memory, rather than each one having a completely clean state to work with, which could theoretically mean tests pass when merged but fail when separate (or indeed vice versa). I acknowledge that this class of bugs is likely to be rare. Perhaps it would be sufficient to make this configurable at CMake time, with developers having it enabled and build bots having it disabled, or similar.

1 Like

FWIW, Carbon uses a single process to run tests (carbon-lang/testing/file_test and carbon-lang/toolchain/testing/file_test.cpp at trunk in carbon-language/carbon-lang on GitHub) that might be worth some inspiration. Combined with an LLVM busybox, perhaps it could work for most of LLVM’s test suite - though the current implementation’s probably closer to clang, so maybe trying it with in-process clang might be an easier prototype/increment.

4 Likes

Thanks everyone for the kind and constructive feedback. I agree the test merging approach isn’t ideal. I’m currently prototyping the daemon-based approach.

Thanks for looking into this, it’s something I’ve talked about often over lunch, but I have no action to show for it.

There is further prior art here to look at in the Chromium test launcher (2013 announcement, implementation). One of the innovations here was to re-run failing tests in isolation at the end of the test run to distinguish between failures caused by state carried over from prior test execution, vs true positive failures of the test when run by itself.

With a more centrally-controlled corporate project like Chromium, you can imagine that there is some dedicated cleanup crew that pulls data on test re-runs and tries to continuously keep them under control.

4 Likes

Hi everyone, I’ve been experimenting with a prototype for the daemonized test running idea which can be found here.

So far, I’ve just been applying the idea to opt and FileCheck and comparing performance in the llvm/Transforms folder. Lit collects the stdout and stderr from the daemon, as well as the return code, and displays them as normal. If the daemon exits unexpectedly during a test, Lit takes the exit code as the exit code for the test and restarts the daemon, so crashing is taken care of. To try the daemonized testing, invoke Lit with --use-daemon-tools.
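To give a feel for the mechanism, here is a minimal, self-contained sketch of the Lit-side handling (this is not the prototype’s code: the line-based protocol, the sentinel, and the restart policy are illustrative assumptions):

import subprocess

class ToolDaemon:
    # Keeps one long-lived tool process alive and feeds it one command per test.
    # Assumed protocol: one command line per request on stdin; the daemon replies
    # with the captured output followed by a sentinel line carrying the exit code.

    def __init__(self, argv):
        self.argv = argv
        self.proc = None

    def _ensure_running(self):
        if self.proc is None or self.proc.poll() is not None:
            self.proc = subprocess.Popen(
                self.argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
            )

    def run_command(self, command_line):
        self._ensure_running()
        output = []
        try:
            self.proc.stdin.write(command_line + "\n")
            self.proc.stdin.flush()
            for line in self.proc.stdout:
                if line.startswith("__EXIT__ "):  # sentinel ends one request
                    return int(line.split()[1]), "".join(output)
                output.append(line)
        except (BrokenPipeError, OSError):
            pass
        # The daemon crashed or closed its pipes: use its exit code as the test's
        # exit code and start a fresh daemon for the next request.
        code = self.proc.wait()
        self.proc = None
        return code, "".join(output)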

On Windows, running all the llvm/Transforms tests with regular Lit takes 137 seconds on my PC. With my prototype for running opt and FileCheck as daemons, it takes just 32 seconds, although 0.97% of tests in the Transforms directory fail incorrectly.

The biggest problem for the approach is static state in LLVM. Static variables are not reset between daemon tasks, so any static variable that may affect the output can break tests - for example, NumDevirtCalls in WholeProgramDevirt.cpp wasn’t being reset between runs, causing the Transforms/WholeProgramDevirt/import.ll test to fail unexpectedly, and I think the causes of the remaining failures are similar. The daemonization approach could only be applied universally if all state that is shared between modules and may change the output is removed and forbidden. Another idea I had is, for tests that fail when run using the daemon, to try re-running them without it - this would cover cases where daemonization causes incorrect failures, but it doesn’t solve the issue pointed out by jh7370 that shared state could cause tests to pass incorrectly.

The daemonization is transparent from the perspective of the Lit test, although it may be sensible to provide a way for a test to opt out of running tools in daemon mode in case it relies on state that can’t be properly reset, or even to make it opt-in on a per-invocation basis (e.g. RUN: %opt_daemon rather than RUN: opt).
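If the opt-in route were taken, the wiring in the test configuration could be roughly this (a sketch for llvm/test/lit.cfg.py; the config knob and the daemon-client flag are made-up names, and only config.substitutions is existing Lit API):

# Expand %opt_daemon to a daemonized opt when the feature is enabled, and fall
# back to plain opt otherwise, so tests using the substitution work either way.
if getattr(config, "use_daemon_tools", False):  # hypothetical knob set by --use-daemon-tools
    config.substitutions.append(("%opt_daemon", "opt --daemon-client"))
else:
    config.substitutions.append(("%opt_daemon", "opt"))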

3 Likes

An almost 80% decrease in test times is quite nice! Trying to figure out what to do with the command line options is probably one of the biggest obstacles to landing this. It would be really nice to fix this for good and scope cl::opt values to an LLVMContext or similar. That would be a really nice improvement in general. It would also open up the possibility of doing daemon mode in threads instead of processes, but I think there would be some problems with that around handling crashes and I/O. I’m not sure if anyone has done a design of how that would look, though.

I’m wondering if, for prototyping purposes, it would be possible to hack up cl:: so that you can reset the command line options that have been set back to their default values. I don’t think that would be difficult to implement, and it would allow for getting full results (e.g. running the entirety of check-llvm and similar). I would even be fine landing this with that approach, assuming there aren’t other issues with global state and we have a plan to do the proper thing in the future, ideally with resourcing to finish it.