[GSoC 2025] Input-Gen: A Scalable Framework for Stateful Input Generation

andrewka · April 10, 2025, 2:14pm

I am posting the project from my GSoC proposal here to invite discussion about possible next steps.
Description: This project aims to enhance Input-Gen, a scalable framework for stateful input generation, by extending its coverage and its compatibility with LLVM-supported languages (C/C++, Rust, Julia, and Swift). See the introductory Discourse post, "Introducing input-gen: Automatically generate runnable inputs for your IR". Input-Gen generates inputs for arbitrary program fragments by instrumenting LLVM Intermediate Representation (IR) code, then capturing and replaying program states. The tool operates in stages: module preparation, LLVM IR instrumentation, runtime execution, and input storage. Using the LLVM ComPile dataset, a large number of inputs can be generated and evaluated to measure Input-Gen's accuracy. The goal is to improve the tool’s accuracy on arbitrary IR files so that LLVM developers can adopt it for practical purposes such as comprehensive testing, performance tuning, and ML training.
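
To make "capturing and replaying program states" concrete, here is a toy sketch in Python. It is not Input-Gen's runtime, which works on instrumented LLVM IR. Every name in it (RecordingMemory, ReplayMemory, fragment) is hypothetical, and the value-generation policy (small random integers) is an assumption chosen only to keep the example short.

```python
import random


class RecordingMemory:
    """Toy stand-in for an instrumented runtime (hypothetical).

    On the first read of an address that was never written, it invents a
    value and records it as part of the generated input.
    """

    def __init__(self, seed):
        self._rng = random.Random(seed)
        self.initial_state = {}  # values the fragment read before writing them
        self._current = {}

    def load(self, addr):
        if addr not in self._current:
            value = self._rng.randint(0, 9)  # assumed generation policy
            self.initial_state[addr] = value
            self._current[addr] = value
        return self._current[addr]

    def store(self, addr, value):
        self._current[addr] = value


class ReplayMemory:
    """Memory pre-populated from a stored input; nothing is invented here."""

    def __init__(self, initial_state):
        self._current = dict(initial_state)

    def load(self, addr):
        return self._current[addr]

    def store(self, addr, value):
        self._current[addr] = value


def fragment(mem):
    """Hypothetical program fragment: sum a length-prefixed array."""
    n = mem.load("n")
    total = 0
    for i in range(n):
        total += mem.load(("a", i))
    mem.store("out", total)
    return total


def generate_input(seed):
    """Run the fragment once while recording the state it needed."""
    mem = RecordingMemory(seed)
    result = fragment(mem)
    return mem.initial_state, result


def replay(initial_state):
    """Re-run the fragment against a previously stored input."""
    return fragment(ReplayMemory(initial_state))
```

Calling generate_input(seed) returns the recorded initial state together with the result, and replay(state) reproduces that result without generating any new values. The address "out" is only ever written, so it never appears in the recorded input.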

Expected Results: Enhanced accuracy of the Input-Gen tool, a higher coverage percentage, and successful instrumentation and execution of generated inputs from IR bitcode files or modules. By the end of the GSoC timeline, Input-Gen is expected to instrument and execute a larger number of functions successfully, and to execute a higher number of basic blocks per IR file on average, relative to the results reported in the Input-Gen paper. This will be accomplished by directly editing input-gen.cpp and its associated files, found here.

Project Size: Medium
Requirement: Basic C & C++ skills, familiarity with LLVM IR features
Confirmed Mentors: Aiden Grossman, Ivan Ivanov, Johannes Doerfert

andrewka · April 10, 2025, 2:16pm

Current Individual Progress: Input-Gen has been run on two x86 systems and has shown promising results. Initial testing with the ComPile dataset demonstrated successful instrumentation and execution of generated inputs. The mass input generation was run using the run_local_mass_input_gen.sh script with the following configuration:

Dataset location: ~/.cache/huggingface/datasets/llvm-ml___com_pile (supplied to Hugging Face datasets.load_dataset())
LLVM installation directory: /path/to/llvm-input-gen-install
Jugfile data location: $SCRIPT_DIR/jugfile.jugdata
Output directory: /path/to/compile-input-gen-out

The shell arguments were:

VERBOSE=1 ADDITIONAL_FLAGS="--verbose -g" JUG=run START=0 END=99 LANGUAGE=c \
./scripts/run_local_mass_input_gen.sh

Verbosity was added to visually confirm the tool was executing correctly.
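
For repeated runs, the same invocation can be scripted. The sketch below uses only the environment variables and values shown in the command above. The helper names (mass_input_gen_env, run_mass_input_gen) are hypothetical, and the exact meaning of START and END (for example, whether END is inclusive) is not stated here, so they are passed through unchanged.

```python
import os
import subprocess


def mass_input_gen_env(start=0, end=99, language="c"):
    """Build the environment for the mass input generation script.

    The variable names and default values mirror the shell command above.
    The helper itself is hypothetical.
    """
    env = dict(os.environ)
    env.update(
        {
            "VERBOSE": "1",
            "ADDITIONAL_FLAGS": "--verbose -g",
            "JUG": "run",
            "START": str(start),
            "END": str(end),
            "LANGUAGE": language,
        }
    )
    return env


def run_mass_input_gen(start=0, end=99, language="c"):
    """Invoke the script from the repository root; raises if it fails."""
    return subprocess.run(
        ["./scripts/run_local_mass_input_gen.sh"],
        env=mass_input_gen_env(start, end, language),
        check=True,
    )
```

Calling run_mass_input_gen() with no arguments corresponds to the command shown above.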

The key statistics from the run are:

685 functions
665 inputs generated (all)
618 inputs generated for functions with normal exit paths
657 inputs ran (all)
645 inputs ran for functions with normal exit paths
7246 basic blocks from all IR files
3577 basic blocks executed
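
As a rough reading of these counts, the snippet below turns them into percentages. It assumes that each "inputs generated" and "inputs ran" count is measured against the 685 functions (one input per function), which the list does not state explicitly. The two "normal exit paths" counts are left out because their denominator is not given.

```python
# Counts copied from the list above.
stats = {
    "functions": 685,
    "inputs_generated": 665,
    "inputs_ran": 657,
    "basic_blocks_total": 7246,
    "basic_blocks_executed": 3577,
}


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Assumption: one generated input per function, so functions is the denominator.
generated_rate = percent(stats["inputs_generated"], stats["functions"])
ran_rate = percent(stats["inputs_ran"], stats["functions"])
block_coverage = percent(stats["basic_blocks_executed"], stats["basic_blocks_total"])

print(f"inputs generated: {generated_rate}%")      # 97.1%
print(f"inputs ran: {ran_rate}%")                  # 95.9%
print(f"basic blocks executed: {block_coverage}%")  # 49.4%
```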

A visual comparison with the results previously obtained using Input-Gen would be premature: some issues still need to be addressed, and the tool has not yet been run on enough files for a meaningful comparison. For now, these results serve as an indication that the tool is performing as expected.

One pressing issue when running Input-Gen is an error with branch-hints.ll in the lit test suite, which still needs to be addressed. Additionally, a modification was made to llvm/tools/input-gen/input-gen.cpp, as shown in the following diff:

@@ -354,7 +354,7 @@ public:
                          std::string RuntimeName) {
     if (ClCompileInputGenExecutables) {
       LLVM_DEBUG(dbgs() << "Compiling" << ExecutableName << "\n";
-      SmallVector<StringRef, 10> Args = {Clang,         "-ldl",     "-rdynamic",
+      SmallVector<StringRef, 10> Args = {Clang,         "-ldl",     "-rdynamic", "--gcc-toolchain=/packages/gcc/13.2.0",
                                          RuntimeName,   ModuleName, "-o",
                                          ExecutableName};

This was done because I was unable to build LLVM with LLVM_ENABLE_LIBCXX, which would provide clang++ with the C++ standard library (libc++) that Input-Gen needs. Instead, I relied on the GNU C++ standard library (libstdc++), but this is not a reliable solution.
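
One possible direction, sketched in Python rather than in input-gen.cpp itself, is to make the toolchain path optional instead of hardcoding it. The argument order mirrors the diff above. The function name build_link_command and the environment variable INPUT_GEN_GCC_TOOLCHAIN are hypothetical and are not existing Input-Gen options.

```python
import os


def build_link_command(clang, runtime_name, module_name, executable_name,
                       gcc_toolchain=None):
    """Assemble the clang command line used to build an executable.

    The --gcc-toolchain flag is added only when a path is supplied, so
    machines that do not need it are unaffected.
    """
    args = [clang, "-ldl", "-rdynamic"]
    if gcc_toolchain:
        args.append(f"--gcc-toolchain={gcc_toolchain}")
    args += [runtime_name, module_name, "-o", executable_name]
    return args


# Hypothetical environment variable; unset means "do not pass the flag".
toolchain = os.environ.get("INPUT_GEN_GCC_TOOLCHAIN")
command = build_link_command("clang++", "runtime.o", "module.bc", "exe", toolchain)
```

With the variable set to /packages/gcc/13.2.0, the resulting list matches the arguments in the modified line of the diff. With it unset, the list matches the original line.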

akorobeynikov · April 10, 2025, 3:56pm

Hello

It is a bit strange to see this here now:

It is usually the mentor who submits the project proposal. The project submission guidelines were posted in "LLVM+GSoC 2025: call for mentors and projects!"
The proposal submission deadline has already passed.

jdoerfert · April 10, 2025, 5:55pm

@akorobeynikov, from what Andrew told me, he thought he had to make a Discourse post after he submitted his project proposal. Maybe there was just some confusion going around.

~ J

akorobeynikov · April 10, 2025, 6:13pm

Not sure where he got this (see the link above with the instructions). The project was not listed in Open Projects, and until recently I was not aware of it or of its mentors.

We are having lots of irrelevant / spam / LLM-generated proposals this year.

andrewka · April 10, 2025, 6:35pm

@akorobeynikov, from my understanding, the link above is meant for those who are mentors of a GSoC project. I submitted my proposal as a contributor, and I created this post to allow open discussion of the proposed project. I submitted the proposal only recently, so I understand why people were not yet aware of it.
