asan coverage

I don’t view it as a competition, but assuming that we both succeed in our plans, LLVM would then end up with two very similar solutions for code coverage. Does it really make sense to have two?

It depends. If the two will indeed have the same functionality -- then no.
My understanding of your plans is that the upcoming coverage will
provide "counters" (== how many times a bb/edge was executed).
AsanCoverage produces booleans (== 1 iff a function/bb was executed),
which is less information, but faster.
How much faster -- I can't tell without your performance numbers.
For our early users the performance is critical and booleans are
sufficient.

If we end up needing both variants, we may want to keep them similar from
the user's perspective, e.g. have flag combinations like these:
-coverage=per-edge,counter
-coverage=per-function,counter
-coverage=per-block,counter
-coverage=per-function,boolean
-coverage=per-block,boolean

--kcc

Regarding performance: I've implemented a simple counter-based coverage instrumentation and compared it with AsanCoverage.

AsanCoverage produces code like this:

mov 0xe86cce(%rip),%al
test %al,%al
je 48b4a0 # to call __sanitizer_cov

callq 4715b0 <__sanitizer_cov>

A simple counter-based thing (which just increments counters and does nothing else useful) produces this:
incq 0xe719c6(%rip)

The performance is more or less the same, although the issue with false sharing still remains
(http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-October/066116.html)
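
To make the comparison concrete, here is a rough C-level sketch of the two schemes (illustrative only; the exact guard handling and the __sanitizer_cov signature in the real passes may differ):

#include <stdint.h>

void __sanitizer_cov(void);      /* sanitizer runtime hook, as in the listing above */

static uint8_t  cov_guard;       /* boolean scheme: one byte per function/bb */
static uint64_t cov_counter;     /* counter scheme: one 64-bit counter       */

void instrumented_block(void) {
  /* AsanCoverage-style boolean: corresponds to the mov/test/je sequence.
     The call is taken only the first time; afterwards the guard skips it. */
  if (!cov_guard) {
    cov_guard = 1;               /* sketch: the real pass may set the guard differently */
    __sanitizer_cov();           /* records that this block was reached */
  }

  /* Counter-based variant: corresponds to the single incq. No branch, but
     every execution writes memory, hence the false-sharing concern above. */
  cov_counter++;
}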

Do you have any more details about the planned clang coverage?

Thanks,

--kcc

Our instrumentation code is basically done now, so you can try it out and compare the performance. Just build with -finstr-profile-generate. You may want to hack the compiler-rt code in lib/profile/PGOProfiling.c to disable writing the output, since the current implementation is pretty naive.

If it turns out as you suggest, and the different kinds of instrumentation serve different purposes, then I think it would make sense to do both.

Our instrumentation code is basically done now, so you can try it out and
compare the performance. Just build with -finstr-profile-generate.

Is this in trunk?
clang-3.5: error: unknown argument: '-finstr-profile-generate'

--kcc

Ah, that’s -fprofile-instr-generate

I've built chromium with "-fprofile-instr-generate -fsanitize=address" -- the performance looks good!
The file format from r198638 is indeed rudimentary.
Do you already know what the real output format will look like?

Just to summarize what I think is important:

- minimal size on disk, minimal number of files
- minimal i/o while writing to disk, no lockf or some such
- separate process produces separate file(s)
- fast merging of outputs from different runs of the same binary
- only the coverage output files and the binary (+DSOs) are required to "symbolize" the coverage data (map the data to file names & line numbers)

Ideally, symbolizing the cov data will use the debug info in the binary (e.g. llvm-symbolizer, addr2line, or atos);
this is what we've done in AsanCov, but I don't see a clean way to make it work with counters...

--kcc

Kostya Serebryany <kcc@google.com> writes:

I've built chromium with "-fprofile-instr-generate -fsanitize=address"
-- the performance looks good!
The file format from r198638 is indeed rudimentary.
Do you already know what the real output format will look like?

We have an idea of what it will look like, but we're still working out
some details.

Just to summarize what I think is important:
- minimal size on disk, minimal number of files
- minimal i/o while writing to disk, no lockf or some such
- separate process produces separate file(s)
- fast merging of outputs from different runs of the same binary
- only the coverage output files and the binary (+DSOs) are required to
"symbolize" the coverage data (map the data to file names & line numbers)

I think we agree on all of these goals. Specifically, we're going for:

- A binary format, one file per program (rather than per-source or
  per-object).
- Each run of the process generates data for that run only.
- Merging outputs will be done by a separate tool.
- The data file and the binary should be sufficient for symbolizing.

Additionally, we want fast lookup of data for a particular function, but
that's less important for coverage than it is for PGO.

Ideally, symbolizing the cov data will use the debug info in the binary (e.g.
llvm-symbolizer, addr2line, or atos);
this is what we've done in AsanCov, but I don't see a clean way to make it
work with counters...

We may need some additional info. I haven't put a ton of thought into
this, but I'm hoping we can either (a) use debug info as is or add some
extra (valid) debug info to support this, or (b) add an extra
debug-info-like section to instrumented binaries with the information we
need.

We may need some additional info.

What kind of additional info?

I haven't put a ton of thought into
this, but I'm hoping we can either (a) use debug info as is or add some
extra (valid) debug info to support this, or (b) add an extra
debug-info-like section to instrumented binaries with the information we
need.

I'd try this data format (the real file would be a binary equivalent of this text sketch):

/path/to/binary/or/dso1 num_counters1
pc1 counter1
pc2 counter2
pc3 counter3
...
/path/to/binary/or/dso2 num_counters2
pc1 counter1
pc2 counter2
pc3 counter3
...
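
Sketched as C structs just to make the layout concrete (the names and types below are purely illustrative, not part of any actual proposal):

#include <stdint.h>

/* One module header, followed by num_counters records,
   repeated for the main binary and each DSO. */
struct cov_module_header {
  char     path[4096];     /* "/path/to/binary/or/dso" (fixed size only for simplicity) */
  uint64_t num_counters;
};

struct cov_record {
  uint64_t pc;             /* address of the instrumented function/bb/edge */
  uint64_t counter;        /* execution count, or 0/1 in boolean mode      */
};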

I don't see a straightforward way to produce such data today because
individual Instructions do not work as labels.
But I think this can be supported in LLVM codegen.
Here is a *raw* patch with comments, just to get the idea.

Index: lib/CodeGen/CodeGenPGO.cpp

We’re not going to use debug info at all. We’re emitting the counters in the clang front-end. We just need to emit separate info to show how to map those counters to source locations. Mapping to PCs and then using debug info to get from the PCs to the source locations just makes things harder and loses information in the process.

I understand why you don't want to rely on debug info and instead produce your own section.
We did this with our early version of llvm-based tsan and it was simpler to implement.

But here is a data point to support my suggestion:
a chromium binary built with asan, coverage, and -gline-tables-only is 1.6 GB.
The same binary is 1.1 GB when stripped, so the line tables account for 500 MB.
Separate line info for coverage will essentially double this amount.
The size of the binary is a serious concern for our users; please take it into consideration.

Thanks!
--kcc

Why is the binary size a concern for coverage testing?

Our users combine asan and coverage testing, and they do it on thousands of machines.
(An older blog post about using asan: http://blog.chromium.org/2012/04/fuzzing-for-security.html)
The binaries need to be shipped to virtual machines, where they will be run.
The VMs are very short on disk space, and the network bandwidth has a cost too.
We may be able to ship stripped binaries to those machines, but this will complicate the logic immensely.

Besides, zipped binaries are stored for several revisions every day, and the storage also costs money.
Just to give you a taste (https://commondatastorage.googleapis.com/chromium-browser-asan/index.html):

asan-symbolized-linux-release-252010.zip 2014-02-19 14:34:24 406.35MB
asan-symbolized-linux-release-252017.zip 2014-02-19 18:22:54 406.41MB
asan-symbolized-linux-release-252025.zip 2014-02-19 21:35:49 406.35MB
asan-symbolized-linux-release-252031.zip 2014-02-20 00:44:25 406.35MB
asan-symbolized-linux-release-252160.zip 2014-02-20 06:30:16 406.34MB
asan-symbolized-linux-release-252185.zip 2014-02-20 09:21:47 408.52MB
asan-symbolized-linux-release-252188.zip 2014-02-20 12:20:05 408.52MB
asan-symbolized-linux-release-252194.zip 2014-02-20 15:01:05 408.52MB
asan-symbolized-linux-release-252218.zip 2014-02-20 18:00:42 408.54MB
asan-symbolized-linux-release-252265.zip 2014-02-20 21:00:03 408.65MB
asan-symbolized-linux-release-252272.zip 2014-02-21 00:00:40 408.66MB

--kcc

Some more data on code size.

I've built CPU2006/483.xalancbmk with

a) -O2 -fsanitize=address -m64 -gline-tables-only -mllvm -asan-coverage=1
b) -O2 -fsanitize=address -m64 -gline-tables-only -fprofile-instr-generate

The first is 27 MB and the second is 48 MB.

The extra size comes from the __llvm_prf_* sections.
You may be able to make these sections more compact, but you will not make them tiny.

The instrumentation code generated by -asan-coverage=1 is less efficient than -fprofile-instr-generate
in several ways (slower, fatter, provides less data).
But it does not add any extra sections to the binary and wins in the overall binary size.
Ideally, I'd like to see such options for -fprofile-instr-generate as well.

--kcc

It might make sense to move at least some of the counters into the .bss section so they don’t take up space in the executable.

We’re also seeing that the instrumentation bloats the code size much more than expected and we’re still investigating to see why that is the case.

The __llvm_prf_cnts section is likely the largest. It's zero-initialized,
so it's a good candidate for .bss or similar. The counters are currently in
their own section to make writing them out easy (just one big array), but we
could change that. Or, is there linker magic that can make a special section
behave like the .bss?

Possibly. The zerofill stuff is a bit weird, but you should be able to
specify a large enough block to zerofill and a concrete section. A
separate section with the S_ZEROFILL attribute would probably work to
get it all zero-initialized; you'd just switch sections rather than
use the zerofill directive.

-eric

Heh, I'm a little lost here. Where can we specify this? I had a look
in MCSectionMachO.cpp, and S_ZEROFILL isn't accessible from LLVM IR.
Should we add logic somewhere to recognize these sections? Will that
actually reduce the executable size? (I tried hacking it in but that
didn't seem to save disk space.)

Also, this doesn't solve ELF. Can we do similar things there?
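
For the ELF side, a minimal C sketch of why the counter section currently costs file size (this reflects common toolchain defaults and is an illustration, not the actual instrumentation):

#include <stdint.h>

/* Zero-initialized static data normally lands in .bss (SHT_NOBITS),
   so it occupies no bytes in the executable file. */
static uint64_t plain_counters[1 << 20];

/* The same data forced into a named section is typically emitted as
   SHT_PROGBITS, so its zeros are written into the file -- which would
   explain why hacking a special section in didn't save disk space. */
__attribute__((section("__llvm_prf_cnts")))
static uint64_t sectioned_counters[1 << 20];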

For clarity, there are three __llvm_prf_* sections. Without having
seen Kostya's data, I'm speculating that __llvm_prf_cnts is the largest
section.

  - __llvm_prf_data is 32B per function, and has pointers into the
    other two sections. This section is necessary to avoid static
    initialization (implemented on Darwin, not quite on ELF).

  - __llvm_prf_names is the mangled name of every function. It should
    be on the same order of magnitude as __llvm_prf_data. This
    section is convenient for writing out the profile, since the names
    are effectively placed in one big char array whose bounds are known
    at link time.

  - __llvm_prf_cnts is 8B per region counter. Each function has one
    at entry and roughly one per CFG-relevant AST node (for, if, etc.).
    This section is convenient for writing out the profile, since the
    counters are effectively placed in one big array whose bounds are
    known at link time. However, I don't think the data in this
    section needs to be explicitly stored in the executable if we can
    somehow make it act like .bss (or the like).

In the latter two cases, it's possible to avoid the special sections if
there are good reasons, but it will add runtime overhead.
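
As a concrete illustration of the "one big array whose bounds are known at link time" point above: on ELF, one could dump the whole counter section in a single write using the linker-provided __start_/__stop_ symbols. This is only a sketch under that assumption; the actual compiler-rt writer may work differently.

#include <stdint.h>
#include <stdio.h>

/* ELF linkers define __start_<sec>/__stop_<sec> for sections whose names
   are valid C identifiers, which __llvm_prf_cnts is. */
extern uint64_t __start___llvm_prf_cnts[];
extern uint64_t __stop___llvm_prf_cnts[];

static void dump_counters(FILE *f) {
  size_t n = (size_t)(__stop___llvm_prf_cnts - __start___llvm_prf_cnts);
  fwrite(__start___llvm_prf_cnts, sizeof(uint64_t), n, f);  /* one big write */
}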

Heh, I'm a little lost here. Where can we specify this? I had a look
in MCSectionMachO.cpp, and S_ZEROFILL isn't accessible from LLVM IR.
Should we add logic somewhere to recognize these sections? Will that
actually reduce the executable size? (I tried hacking it in but that
didn't seem to save disk space.)

Also, this doesn't solve ELF. Can we do similar things there?

For clarity, there are three __llvm_prf_* sections. Without having
seen Kostya's data, I'm speculating that __llvm_prf_cnts is the largest
section.

This is what I see on 483.xalancbmk (objdump -h):
Idx Name              Size      VMA               LMA               File off  Algn
 15 __llvm_prf_names  0036fcb0  00000000010abf40  00000000010abf40  00cabf40  2**5
 16 __llvm_prf_data   001b77e0  000000000141bc00  000000000141bc00  0101bc00  2**5
 32 __llvm_prf_cnts   00123468  0000000001af2fe0  0000000001af2fe0  014f2fe0  2**5

__llvm_prf_names is larger than the other two combined.
483.xalancbmk is C++ and the function names are long. The same is true for
most of the code we care about.
Can't we use function PCs instead of function names in this table (and
retrieve the function names from debug info)?

  - __llvm_prf_data is 32B per function, and has pointers into the
    other two sections. This section is necessary to avoid static
    initialization (implemented on Darwin, not quite on ELF).

  - __llvm_prf_names is the mangled name of every function. It should
    be on the same order of magnitude as __llvm_prf_data. This
    section is convenient for writing out the profile, since the names
    are effectively placed in one big char array whose bounds are known
    at link time.

  - __llvm_prf_cnts is 8B per region counter. Each function has one
    at entry and roughly one per CFG-relevant AST node (for, if, etc.).
    This section is convenient for writing out the profile, since the
    counters are effectively placed in one big array whose bounds are
    known at link time. However, I don't think the data in this
    section needs to be explicitly stored in the executable if we can
    somehow make it act like .bss (or the like).

Why can't we simply create this buffer at startup?
We solve a similar task in asan, so it's definitely possible.

--kcc
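
A minimal userspace sketch of what "create this buffer at startup" could look like (the symbol and function names here are hypothetical, not compiler-rt's):

#define _GNU_SOURCE
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical: total number of counters, known to the instrumentation. */
extern const uint64_t __prf_num_counters;

static uint64_t *prf_counters;    /* allocated at startup instead of living in a data section */

__attribute__((constructor))
static void prf_init(void) {
  void *p = mmap(0, __prf_num_counters * sizeof(uint64_t),
                 PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p != MAP_FAILED)
    prf_counters = p;             /* anonymous mmap is zero-filled, like .bss,
                                     but never takes space in the executable */
}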

That’s a bit surprising. We should check to make sure there isn’t anything obviously wrong.

We need this to work for environments where minimal runtime support is available (e.g., in kernel code). We used to call malloc() to create some data structures at runtime but that prevented us from using it from kernel code. I’m hoping the additional size will not be such a problem that we have to implement separate solutions for different environments.