In this lab, you will experiment with CPU profiling of a simple application. The techniques you learn will be applicable to application profiling as well as profiling the entire system, will apply to user-mode and kernel-mode code, and will apply to any language or framework that has perf
symbol support through the /tmp/perf-$PID.map
convention.
The application you're about to profile is a very simple command-line tool that counts prime numbers in a given range, and uses OpenMP to parallelize the computation. But first, let's build it and make sure everything works. Navigate to the directory that contains primes.c and build it using the following command:
$ gcc -g -fno-omit-frame-pointer -fopenmp primes.c -o primes
Now, experiment with multiple executions of this application with varying numbers of threads by running the following command:
for i in `seq 1 2 16`; do \
OMP_NUM_THREADS=$i bash -c 'time ./primes'; \
done
Elapsed time should stabilize when you create enough threads to keep all the cores saturated throughout the entire run. Because the distribution of work between threads is not ideal, on an N-core system you will need more than N threads to keep them all busy.
It's time to find out what's taking so long to count our primes. Use the following command to get perf
's opinion. You will need a root shell for this.
# export OMP_NUM_THREADS=16
# perf record -g -F 997 -- ./primes
In the preceding command, the
-g
switch indicates that you want the call stack to be captured; the-F
switch lets you set the rate of sampling for events -- 997 times times per second in this case. The trace will include events aggregated across all threads spawned byprimes
. To include events from other processes while the program is running, specify-a
, and if you really want to trace only a single thread within a process, you can use use-t
.
To figure out where the bottleneck is, inspect perf
's report:
# perf report --stdio
It seems that we're spending a whole lot of time in the is_divisible
function. If you want to know where exactly, you can use the following command:
# perf annotate
Note that, as discussed in class, there may be skid in the displayed results. Notably, the specific instruction that seems to be taking most of the time might actually be masquerading for the previous or following instruction (or couple of instructions).
Now that you have the basic picture of what's happening in the application, it would be interesting to inspect everything else. The perf report is quite long, though:
# perf report --stdio | wc -l
1286
So instead of reading long reports, let's take flame graphs for a spin. If you haven't already, clone the FlameGraph repository, and then run the following commands to post-process the perf report:
# perf script > primes.stacks
# FlameGraph/stackcollapse-perf.pl primes.stacks > primes.collapsed
# FlameGraph/flamegraph.pl primes.collapsed > primes.svg
Open the resulting flame graph and enjoy the results. It is now immediately evident that there are no significant stack tree branches other than the is_prime
calling is_divisible
branch that we have already seen.
Finally, you are going to profile some Java code and generate flame graphs for that as well. To help perf
figure out the symbols (function names) for your Java code, you will use perf-map-agent, which attaches to your Java process and generates map files that map dynamically-generated Java machine code to class and method names. perf-map-agent
can also run the whole recording session for you, through a set of handy scripts such as
perf-java-record-stack
and perf-java-report-stack
.
First, run the Java application that we're about to profile. It is a Java version of the same prime-counting app:
$ java -XX:+PreserveFramePointer slowy/App
Do not hit ENTER yet. Instead, in another (non-root) shell, run jps
to find the process id for the Slowy app, and then run the collection tool. After you run perf-java-report-stack
, hit ENTER in the Java app's console so that the collection tool records some meaningful work.
$ jps
2144 Jps
2132 App
$ ./perf-java-report-stack 2132 # use the pid from the previous step
# hit ENTER in the java console
If everything went well, you should see the perf report
ncurses UI, showing the bottlenecks in the Java program. These will likely be App::main
and App::isPrime
, because the smaller methods were optimized away (inlined). You can repeat the experiment and run Slowy with the -XX:-Inline
switch to prevent this optimization and obtain more accurate results that include the App::isDivisible
method.
For some more fun, experiment with collecting CPU samples and then generating flame graphs while the following commands are running:
sha1sum /dev/zero
dd if=/dev/zero of=/dev/null bs=100K count=50K
You might get a lot of [unknown]
frames because of missing debuginfo. Try to obtain debuginfo for the missing components and then try again. If there's no choice, you can always build from source with the -g
option to make sure debuginfo will be available.