Standalone Pheno
kerncraft, a loop kernel analysis and performance modeling toolkit, comes with a powerful yet restricted benchmarking model that is invoked via kerncraft --pmodel Benchmark.
To overcome the limitations imposed by kerncraft's static code analysis, a standalone benchmark tool has been implemented that works with any loop code.
Given a binary, the benchmarking tool measures runtime, overall performance, memory bandwidth, data volume, and cycles per instruction, among other metrics.
This benchmarking tool is an extension to the kerncraft toolkit. Hence, installing kerncraft is sufficient to get access to the standalone benchmark tool.
- Prepare your kernel code (e.g. equip it with likwid markers) and create the binary.
- Get a valid machine file for your target machine. You can take one from the example machine files, write one yourself, or create one by running the executable likwid_bench_auto, which is provided by kerncraft.
- Run the standalone benchmark tool.
Examples that elucidate the usage of the stand-alone benchmarking tool can be found here.
usage: kc-pheno [-h] [--version] --machine MACHINE [--define KEY VALUE]
                [--verbose] [--store PICKLE]
                [--unit {cy/CL,cy/It,It/s,FLOP/s}] [--cores CORES]
                [--clean-intermediates] [--compiler COMPILER]
                [--compiler-flags COMPILER_FLAGS] [--datatype DATATYPE]
                --flops FLOPS --loop LOOP RANGE [--repetitions REPETITIONS]
                [--marker [MARKER]] [--no-phenoecm] [--ignore-warnings]
                FILE

Kerncraft's stand-alone benchmarking tool.

positional arguments:
  FILE                  Binary to be benchmarked.

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --machine MACHINE, -m MACHINE
                        Path to machine description yaml file.
  --define KEY VALUE, -D KEY VALUE
                        Define constant to be used in C code. Values must be
                        integer or match start-stop[:num[log[base]]]. If range
                        is given, all permutations will be tested. Overwrites
                        constants from testcase file. Constants with leading
                        underscores are adaptable, i.e. they can be
  --verbose, -v         Increase verbosity level.
  --store PICKLE        Adds results to PICKLE file for later processing.
  --unit {cy/CL,cy/It,It/s,FLOP/s}, -u {cy/CL,cy/It,It/s,FLOP/s}
                        Select the output unit, defaults to model specific if
                        not given.
  --cores CORES, -c CORES
                        Number of cores to be used in parallel. (default: 1)
                        The benchmark model will run the code with OpenMP on
                        as many physical cores.
  --clean-intermediates
                        If set will delete all intermediate files after
                        completion.
  --compiler COMPILER, -C COMPILER
                        Compiler to use, default is first in machine
                        description file.
  --compiler-flags COMPILER_FLAGS
                        Compiler flags to use. If not set, flags are taken
                        from machine description file (-std=c99 is always
                        added).
  --datatype DATATYPE   Datatype of sources and destinations of the kernel.
                        Defaults to 'double'.
  --flops FLOPS         Number of floating-point operations per inner-most
                        iteration of the kernel.
  --loop LOOP RANGE, -L LOOP RANGE
                        Define ranges of nested loops. The definition must
                        match 'start:[step:]end'. 'step' defaults to 1.
  --repetitions REPETITIONS, -R REPETITIONS
                        Number of kernel repetitions. Can be either a fixed
                        number, a variable name, or the specifier
                        'marker'.Specifying a variable name, the number of
                        repetitions is automatically adjusted when the
                        variable is changed.Specifying 'marker', the number of
                        repetitions is obtained from likwid-perfctr.
  --marker [MARKER]     Benchmark using likwid markers.

arguments for stand-alone benchmark model:
  benchmark

  --no-phenoecm         Disables the phenomenological ECM model building.
  --ignore-warnings     Ignore warnings about mismatched CPU model and
                        frequency.
All options of the standalone benchmark tool are listed above. As most of them are adopted directly from kerncraft, we only detail the differences in the following.
First of all, note that more information has to be provided via the command line, as the standalone benchmark tool cannot obtain the necessary information from static code analysis.
Besides the machine file and the binary, you must specify the loop ranges and the number of FLOPs.
The number of FLOPs provided with the option --flops corresponds to the number of floating-point operations in the inner-most loop.
Assuming that the loops are long enough, floating-point operations in outer loop iterations can be neglected.
This number can be obtained simply by counting all FLOPs in the inner-most loop.
Specifying the loop ranges is straightforward. Given a loop range [start, end) (and optionally a step size), the loop range is defined as --loop start:step:end.
If no step size is given, a stride of 1 is assumed.
If you iterate through more than one loop, just add more --loop specifiers.
Note that the last loop specified is assumed to be the inner-most loop. This matters when the loops have different step sizes.
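The arithmetic implied by --loop and --flops can be sketched as follows. This is only an illustration of how a half-open range with a step translates into an iteration count and a total FLOP count; the helper names loop_iterations and total_flops are invented for this sketch and are not part of the tool.

```c
/* Iterations of a loop over the half-open range [start, end) with a positive
 * step, i.e. for (i = start; i < end; i += step). */
long loop_iterations(long start, long end, long step)
{
    if (step <= 0 || end <= start)
        return 0;
    return (end - start + step - 1) / step;   /* ceiling division */
}

/* Total FLOPs of a loop nest: FLOPs per inner-most iteration, times the
 * iteration count of every loop, times the number of kernel repetitions.
 * Hypothetical helper; the arrays hold one entry per --loop specifier. */
double total_flops(const long *starts, const long *ends, const long *steps,
                   int depth, long flops_per_iteration, long repetitions)
{
    double total = (double)flops_per_iteration * (double)repetitions;
    for (int d = 0; d < depth; ++d)
        total *= (double)loop_iterations(starts[d], ends[d], steps[d]);
    return total;
}
```

For example, two nested loops over [1, 99) with step 1, 5 FLOPs per inner-most iteration, and 10 repetitions amount to 98 * 98 * 5 * 10 FLOPs.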
To obtain benchmark results that are not subject to performance fluctuations, a kernel must run sufficiently long. A popular way to extend the runtime is to run the kernel multiple times.
Important: compilers sometimes 'optimize' the repetition loop away, so that the kernel is not actually run multiple times.
To trick the compiler, you might want to add a bogus if statement inside the repetition loop, such as
if(CONDITION_NEVER_TRUE) dummy(a);
Make sure that the dummy function resides in a separate source file and the condition cannot be evaluated at compile time.
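A minimal sketch of such a repetition loop is shown below. The kernel scale_repeated and the flag never_true are made up for illustration; dummy is defined in the same block only to keep the sketch self-contained and should live in its own source file in a real build, as stated above.

```c
#include <stddef.h>

/* In a real build, move dummy() into a separate source file (e.g. dummy.c)
 * so that the compiler cannot inline it and see that it does nothing. */
void dummy(double *a) { (void)a; }

/* Read at run time; 'volatile' keeps the compiler from evaluating the
 * condition at compile time. Hypothetical name, always 0. */
volatile int never_true = 0;

/* Hypothetical kernel a[i] = s * b[i], repeated 'repetitions' times. */
void scale_repeated(double *a, const double *b, size_t n, double s,
                    int repetitions)
{
    for (int r = 0; r < repetitions; ++r) {
        for (size_t i = 0; i < n; ++i)
            a[i] = s * b[i];
        /* Never taken, but the compiler must assume dummy() may read a[],
         * so it cannot drop the repeated sweeps. */
        if (never_true) dummy(a);
    }
}
```

With this structure the number of repetitions passed to the binary is the value to hand to --repetitions.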
If you add such a repetition loop, you can specify the number of repetitions via --repetitions or -R.
There are four possibilities for the --repetitions option:
- the option is omitted, in which case the number of repetitions defaults to 1
- the value is given as a fixed number, e.g. --repetitions 10
- the value is determined by a variable, e.g. --repetitions N
- the value is obtained from likwid markers via the specifier 'marker'. In this case, you must also enable --marker and place the likwid marker just before the kernel code (inside the repetition loop).
The advantage of this method is that you can benchmark kernels whose number of repetitions is not known before the run, e.g. when using while loops.
The disadvantage is that the markers introduce an overhead that might or might not be negligible. If it is not negligible, a warning is issued.
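The marker-based case might look like the sketch below. The relaxation kernel smooth_until_converged and the region name smooth are invented for this example. LIKWID_MARKER_START and LIKWID_MARKER_STOP are LIKWID's marker macros; the fallback definitions turn them into no-ops when the code is built without -DLIKWID_PERFMON, and the header name may differ between LIKWID versions.

```c
#ifdef LIKWID_PERFMON
#include <likwid-marker.h>   /* older LIKWID releases provide these in likwid.h */
#else
#define LIKWID_MARKER_START(tag)
#define LIKWID_MARKER_STOP(tag)
#endif

/* Repeats a 1D relaxation sweep until the largest change drops to 'tol' or
 * below and returns the number of sweeps. The caller is expected to run
 * LIKWID_MARKER_INIT before and LIKWID_MARKER_CLOSE after the measurement. */
int smooth_until_converged(double *x, double *tmp, int n, double tol)
{
    int sweeps = 0;
    double change = tol + 1.0;
    while (change > tol) {
        /* The region is entered once per repetition, inside the loop. */
        LIKWID_MARKER_START("smooth");
        change = 0.0;
        for (int i = 1; i < n - 1; ++i) {
            tmp[i] = 0.5 * (x[i - 1] + x[i + 1]);
            double d = tmp[i] - x[i];
            if (d < 0.0) d = -d;
            if (d > change) change = d;
        }
        for (int i = 1; i < n - 1; ++i)
            x[i] = tmp[i];
        LIKWID_MARKER_STOP("smooth");
        ++sweeps;
    }
    return sweeps;
}
```

Because the number of sweeps depends on the data, it cannot be given as a fixed value or a variable, which is the situation the 'marker' specifier is meant for.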
One feature of the benchmarking tool is the automatic adjustment of variables to extend the runtime. This adjustment can apply to the number of repetitions, but also to other variables such as array sizes or tolerances.
To mark a variable as adjustable, just specify a range for it, e.g. instead of defining a variable N as -D N 10, you can also write -D N 10:1000. With this, the benchmark tool will use values from 10 to 1000 for N to obtain a sufficiently long runtime. In contrast to the loop ranges, you do not define a step size. The benchmarking tool will interpolate the variable value according to the runtime.
There are two modes for the adjustment:
- linear interpolation of the variable
- logarithmic interpolation of the variable
Either can be chosen by adding a specifier to the variable definition, e.g. -D N 10:1000:lin for the (incrementing) linear interpolation or -D tol 1e-5:1e-10:log for the (decrementing) logarithmic interpolation. The latter is particularly useful when adjusting tolerances.
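For reference, linear and logarithmic interpolation between two bounds can be written as below. The symbols a (first bound), b (second bound), and t (position between the bounds) are notation introduced here; how the tool derives t from the measured runtime is not specified in this description.

```latex
v_{\mathrm{lin}}(t) = a + t\,(b - a), \qquad
v_{\mathrm{log}}(t) = a \left(\frac{b}{a}\right)^{t}, \qquad t \in [0, 1]
```

For -D N 10:1000:lin, the midpoint t = 0.5 gives N = 505. For -D tol 1e-5:1e-10:log, the midpoint gives tol = 10^{-7.5}, roughly 3.2e-8, which shows why the logarithmic mode suits quantities that span several orders of magnitude.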
To mark loop ranges (more specifically, upper loop boundaries) and the number of repetitions as adjustable, we use a small workaround. Instead of specifying the range directly, we assign a variable to the loop or repetition count.
For repetitions we have already seen how this is done: --repetitions N. When the variable is changed, the number of repetitions changes accordingly.
For loop ranges this works similarly. After the end value, just add the variable specifier, e.g. --loop 1:999:N. Offsets are taken care of automatically.
Let's have a look at a small example:
We want to apply a simple stencil code to a one-dimensional array of size N=100. As the stencil accesses its neighbours, we must not loop from 0 to N, but from 1 to N-1. When --loop 1:99:N is added to the command line, the benchmarking tool detects the offset of 1 at both ends. When N is changed to e.g. N=1000, the loop range is changed accordingly to [1,999).
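The offset handling in this example can be sketched as follows, assuming that the distance between the upper loop bound and the variable is kept fixed while the start value stays untouched. The type loop_range and the helper rescale_range are hypothetical names used only for this illustration.

```c
/* Half-open loop range [start, end), as given by --loop start:end:N. */
struct loop_range {
    long start;
    long end;
};

/* Keeps the distance between the upper loop bound and the variable constant
 * when the variable changes from n_old to n_new. Hypothetical helper. */
struct loop_range rescale_range(struct loop_range r, long n_old, long n_new)
{
    long upper_offset = n_old - r.end;   /* 100 - 99 = 1 in the example */
    struct loop_range out = { r.start, n_new - upper_offset };
    return out;
}
```

Applied to the range [1, 99) with N going from 100 to 1000, this yields [1, 999), matching the example above.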
If not specified otherwise, the entire executable is benchmarked. As one is often interested in the performance of a specific region (e.g. a single kernel without preceding allocations and calculations), the standalone benchmark tool also supports likwid markers, including multiple markers.
After marking your kernel code with those markers and compiling it accordingly (-DLIKWID_PERFMON must be enabled), simply enable the markers by adding --marker to the command line. In this case, all likwid markers found in the binary are benchmarked and analysed.
If you are interested only in some of the regions in your code, you can also specify the marker names of those regions, separated by commas. With the command line argument --marker d2pt5,d2pt9 the tool ignores all other regions and analyses only 'd2pt5' and 'd2pt9'.
Important: Please note that there are some minor limitations concerning the names of the markers. Even though likwid supports arbitrary marker names, you must not choose marker names that start with a digit, nor the empty string. Otherwise the benchmarking tool may break.
Typically, the information about loop ranges and FLOPs differs between regions.
You can assign any piece of information to a specific region by prefixing it with the region name.
This is done in the same fashion for every type of information: put the marker name in front of the information, separated by a colon.
Example: the numbers of FLOPs for two-dimensional 5-point and 9-point stencils are 5 and 9, respectively. Therefore, we would add --flops d2pt5:5 and --flops d2pt9:9 to the command line parameters.
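To make the FLOP count concrete, here is one possible form of a 5-point stencil region named d2pt5. The function name d2pt5_sweep, the scaling factor s, and the row-major array layout are assumptions of this sketch; the fallback macros make the markers no-ops when the code is built without -DLIKWID_PERFMON, and the header name may differ between LIKWID versions.

```c
#ifdef LIKWID_PERFMON
#include <likwid-marker.h>   /* older LIKWID releases provide these in likwid.h */
#else
#define LIKWID_MARKER_START(tag)
#define LIKWID_MARKER_STOP(tag)
#endif

/* One sweep of a scaled 5-point stencil over the interior of an n x m grid
 * stored row by row. Each inner-most iteration performs 4 additions and
 * 1 multiplication, i.e. 5 FLOPs. */
void d2pt5_sweep(int n, int m, const double *a, double *b, double s)
{
    LIKWID_MARKER_START("d2pt5");
    for (int i = 1; i < n - 1; ++i)
        for (int j = 1; j < m - 1; ++j)
            b[i * m + j] = s * (a[i * m + j]
                                + a[(i - 1) * m + j] + a[(i + 1) * m + j]
                                + a[i * m + j - 1] + a[i * m + j + 1]);
    LIKWID_MARKER_STOP("d2pt5");
}
```

Counting the operations in the inner-most statement gives the value passed as --flops d2pt5:5.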
If you enabled marker support but did not specify all information for every region, the benchmark tool falls back to information that was not assigned to a specific region. E.g. --marker d2pt5 --flops 5 will still use 5 FLOPs per iteration for the region d2pt5.
This feature is especially convenient if you want to analyse one region only and do not want to add the region qualifier to every piece of information.
However, if no such 'default' information is found, the tool will break, e.g. when specifying --marker d2pt5 --flops d2pt9:9.
Assigning likwid regions works in the same manner for normal and adjustable variables, loop ranges, and the number of repetitions.
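The fallback rule for region-qualified information can be sketched as a small lookup: a value qualified with the region name wins, an unqualified value serves as the default, and the lookup fails if neither exists. The type flops_entry and the function lookup_flops are hypothetical and only mirror the behaviour described above, not the tool's actual code.

```c
#include <stddef.h>
#include <string.h>

struct flops_entry {
    const char *region;   /* NULL marks an unqualified 'default' entry */
    int flops;
};

/* Returns 1 and stores the value in *out if a value for 'region' exists,
 * preferring a region-qualified entry over the default; returns 0 if
 * neither is present. */
int lookup_flops(const struct flops_entry *entries, size_t count,
                 const char *region, int *out)
{
    for (size_t i = 0; i < count; ++i) {
        if (entries[i].region != NULL &&
            strcmp(entries[i].region, region) == 0) {
            *out = entries[i].flops;
            return 1;
        }
    }
    for (size_t i = 0; i < count; ++i) {
        if (entries[i].region == NULL) {
            *out = entries[i].flops;
            return 1;
        }
    }
    return 0;
}
```

With only an unqualified entry of 5, a lookup for d2pt5 succeeds; with only an entry for d2pt9, the same lookup fails, which corresponds to the failing command line shown above.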