Understanding Linux perf report output

Question

Though I can intuitively get most of the results, I'm having hard time fully understanding the output of the perf report command especially for what concerns the call graph, so I wrote a stupid test to solve this issue of mine once for all.

The stupid test

I compiled what follows with:

gcc -Wall -pedantic perf-test.c -lm -o perf-test

No aggressive optimizations to avoid inlining and such.

#include <math.h>

#define N 10000000UL

#define USELESSNESS(n)                          \
    do {                                        \
        unsigned long i;                        \
        double x = 42;                          \
        for (i = 0; i < (n); i++) x = sin(x);   \
    } while (0)

void baz()
{
    USELESSNESS(N);
}

void bar()
{
    USELESSNESS(2 * N);
    baz();
}

void foo()
{
    USELESSNESS(3 * N);
    bar();
    baz();
}

int main()
{
    foo();
    return 0;
}

Flat profiling

perf record ./perf-test
perf report

With these I get:

  94,44%  perf-test  libm-2.19.so       [.] __sin_sse2
   2,09%  perf-test  perf-test          [.] sin@plt
   1,24%  perf-test  perf-test          [.] foo
   0,85%  perf-test  perf-test          [.] baz
   0,83%  perf-test  perf-test          [.] bar

Which sounds reasonable since the heavy work is actually performed by __sin_sse2 and sin@plt is probably just a wrapper, while the overhead of my functions take into account just the loop, overall: 3*N iterations for foo, 2*N for the other two.

Hierarchical profiling

perf record -g ./perf-test
perf report -G
perf report

Now the overhead columns that I get are two: Children (the output is sorted by this one by default) and Self (the same overhead of the flat profile).

Here is where I start feeling I miss something: regardless of the fact that I use -G or not I'm unable to explain the hierarchy in terms of "x calls y" or "y is called by x", for example:

without -G ("y is called by x"):

  -   94,34%    94,06%  perf-test  libm-2.19.so       [.] __sin_sse2
     - __sin_sse2
        + 43,67% foo
        + 41,45% main
        + 14,88% bar
  -   37,73%     0,00%  perf-test  perf-test          [.] main
       main
       __libc_start_main
  -   23,41%     1,35%  perf-test  perf-test          [.] foo
       foo
       main
       __libc_start_main
  -    6,43%     0,83%  perf-test  perf-test          [.] bar
       bar
       foo
       main
       __libc_start_main
  -    0,98%     0,98%  perf-test  perf-test          [.] baz
     - baz
        + 54,71% foo
        + 45,29% bar

Why __sin_sse2 is called by main (indirectly?), foo and bar but not by baz?
Why functions sometimes have a percent and a hierarchy attached (e.g., the last instance of baz) and sometimes not (e.g., the last instance of bar)?

with -G ("x calls y"):

  -   94,34%    94,06%  perf-test  libm-2.19.so       [.] __sin_sse2
     + __sin_sse2
     + __libc_start_main
     + main
  -   37,73%     0,00%  perf-test  perf-test          [.] main
     - main
        + 62,05% foo
        + 35,73% __sin_sse2
          2,23% sin@plt
  -   23,41%     1,35%  perf-test  perf-test          [.] foo
     - foo
        + 64,40% __sin_sse2
        + 29,18% bar
        + 3,98% sin@plt
          2,44% baz
       __libc_start_main
       main
       foo

How should I interpret the first three entries under __sin_sse2?
main calls foo and that's ok, but why if it calls __sin_sse2 and sin@plt (indirectly?) it does not also call bar and baz?
Why do __libc_start_main and main appear under foo? And why foo appears twice?

Suspect is that there are two levels of this hierarchy, in which the second actually represents the "x calls y"/"y is called by x" semantics, but I'm tired to guess so I'm asking here. And the documentation doesn't seem to help.

Sorry for the long post but I hope that all this context may help or act as a reference for someone else too.

Yes, I know but it's fast and it let me record all kinds of crazy events such as cache misses and branch misprediction per symbol, whereas AFAIK gprof can't do it. I used N quite big precisely to avoid what you mentioned; anyway I tried increasing it even further with no luck, I don't know, but I think that it's quite unlikely that no samples are collected even in a 100M iteration loop. — cYrus
– cYrus, Commented Jan 2, 2015 at 17:35
Thank you for posting. 2023, and this is still the best documentation I've found on this topic. I wanted to ask if you have perhaps since found any additional help resources about this? — plafratt
– plafratt, Commented Apr 20, 2023 at 16:22

Community · Accepted Answer · 2020-06-20 09:12:55Z

17

Alright, well, let's ignore temporarily the difference between caller and callee call-graphs, mostly because when I compare the results between these two options on my machine, I only see effects inside the kernel.kallsyms DSO for reasons I don't understand -- relatively new to this myself.

I found that for your example, it's a little easier to read the whole tree. So, using --stdio, let's look at the whole tree for __sin_sse2:

# Overhead    Command      Shared Object                  Symbol
# ........  .........  .................  ......................
#
    94.72%  perf-test  libm-2.19.so       [.] __sin_sse2
            |
            --- __sin_sse2
               |
               |--44.20%-- foo
               |          |
               |           --100.00%-- main
               |                     __libc_start_main
               |                     _start
               |                     0x0
               |
               |--27.95%-- baz
               |          |
               |          |--51.78%-- bar
               |          |          foo
               |          |          main
               |          |          __libc_start_main
               |          |          _start
               |          |          0x0
               |          |
               |           --48.22%-- foo
               |                     main
               |                     __libc_start_main
               |                     _start
               |                     0x0
               |
                --27.84%-- bar
                          |
                           --100.00%-- foo
                                     main
                                     __libc_start_main
                                     _start
                                     0x0

So, the way I read this is: 44% of the time, sin is called from foo; 27% of the time it's called from baz, and 27% from bar.

The documentation for -g is instructive:

 -g [type,min[,limit],order[,key]], --call-graph
       Display call chains using type, min percent threshold, optional print limit and order. type can be either:

       ·   flat: single column, linear exposure of call chains.

       ·   graph: use a graph tree, displaying absolute overhead rates.

       ·   fractal: like graph, but displays relative rates. Each branch of the tree is considered as a new profiled object.

               order can be either:
               - callee: callee based call graph.
               - caller: inverted caller based call graph.

               key can be:
               - function: compare on functions
               - address: compare on individual code addresses

               Default: fractal,0.5,callee,function.

The important piece here is that the default is fractal, and in fractal mode, each branch is a new object.

So, you can see that 50% of the time that baz is called, it's called from bar, and the other 50% it's called from foo.

This isn't always the most useful measure, so it's instructive to look at the results using -g graph:

94.72%  perf-test  libm-2.19.so       [.] __sin_sse2
        |
        --- __sin_sse2
           |
           |--41.87%-- foo
           |          |
           |           --41.48%-- main
           |                     __libc_start_main
           |                     _start
           |                     0x0
           |
           |--26.48%-- baz
           |          |
           |          |--13.50%-- bar
           |          |          foo
           |          |          main
           |          |          __libc_start_main
           |          |          _start
           |          |          0x0
           |          |
           |           --12.57%-- foo
           |                     main
           |                     __libc_start_main
           |                     _start
           |                     0x0
           |
            --26.38%-- bar
                      |
                       --26.17%-- foo
                                 main
                                 __libc_start_main
                                 _start
                                 0x0

This changes to using absolute percentages, where each percentage of time is reported for that call chain: So foo->bar is 26% of the total ticks (which in turn calls baz), and foo->baz (direct) is 12% of the total ticks.

I still have no idea why I don't see any differences between callee and caller graphs though, from the perspective of __sin_sse2.

Update

One thing I did change from your command line is how the callgraphs were gathered. Linux perf by default uses the frame pointer method of reconstructing callstacks. This can be a problem when the compiler uses -fomit-frame-pointer as a default. So I used

perf record --call-graph dwarf ./perf-test

edited Jun 20, 2020 at 9:12

CommunityBot

11 silver badge

answered Feb 4, 2015 at 15:08

Matthew G.

1,35010 silver badges26 bronze badges

Sign up to request clarification or add additional context in comments.

8 Comments

cYrus Over a year ago

The relative vs absolute percentages thing absolutely makes sense. The main point is that I'm just unable to came up with an output which is similar to yours (which seems legit since __sin_sse2 appears to be called directly by all the three functions, while mine features foo, main and bar only). Can you please post the exact flags that you used?

Matthew G. Over a year ago

@cYrus take a peek at my update... Will post exact command line for perf report after lunch.

cYrus Over a year ago

@MatthewG. well, it looks like it's all about using dwarf... I'll check if that solves all of my listed issues.

Matthew G. Over a year ago

@cYrus I copy and pasted your command line for compilation (though, had to move the linker flag to the end). Kernel is 3.13.0-37-generic (which may matter). The family of command lines for reports was perf report -g {graph,fractal},0.05,calle{e,r} --stdio > {graph,fracal}Calle{e,r}

hunter_tech Over a year ago

This answer may be misleading now. The default configure of caller/callee varies depending on perf's different version, which readers of this post should be aware of.

|

Collectives™ on Stack Overflow

Understanding Linux perf report output

The stupid test

Flat profiling

Hierarchical profiling

1 Answer 1

Update

8 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

The stupid test

Flat profiling

Hierarchical profiling

1 Answer 1

Update

8 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related