Julian Hammer edited this page Jan 9, 2020 · 3 revisions

kerncraft, a loop kernel analysis and performance modeling toolkit, comes with a powerful yet restricted benchmarking model that is invoked via kerncraft --pmodel Benchmark. To overcome the limitations imposed by kerncraft's static code analysis, a standalone benchmark tool has been implemented that works with any loop code. Given a binary, the benchmarking tool measures the runtime, overall performance, memory bandwidth, data volume and cycles per instruction, among other metrics.

Installation

This benchmarking tool is an extension to the kerncraft toolkit. Hence, installing kerncraft is sufficient to get access to the standalone benchmark tool.

Usage

  1. Prepare your kernel code (e.g. equip it with likwid markers) and create the binary.

  2. Get a valid machine file for your target machine. You can take one from the example machine files, write one yourself, or create one by running the executable likwid_bench_auto, which is provided by kerncraft.

  3. Run the standalone benchmark tool.

Examples that elucidate the usage of the stand-alone benchmarking tool can be found here.

Basic Usage

    usage: kc-pheno [-h] [--version] --machine MACHINE [--define KEY VALUE]
                    [--verbose] [--store PICKLE]
                    [--unit {cy/CL,cy/It,It/s,FLOP/s}] [--cores CORES]
                    [--clean-intermediates] [--compiler COMPILER]
                    [--compiler-flags COMPILER_FLAGS] [--datatype DATATYPE]
                    --flops FLOPS --loop LOOP RANGE [--repetitions REPETITIONS]
                    [--marker [MARKER]] [--no-phenoecm] [--ignore-warnings]
                    FILE
    
    Kerncraft's stand-alone benchmarking tool.
    
    positional arguments:
      FILE                  Binary to be benchmarked.
    
    optional arguments:
      -h, --help            show this help message and exit
      --version             show program's version number and exit
      --machine MACHINE, -m MACHINE
                            Path to machine description yaml file.
      --define KEY VALUE, -D KEY VALUE
                            Define constant to be used in C code. Values must be
                            integer or match start-stop[:num[log[base]]]. If range
                            is given, all permutations will be tested. Overwrites
                            constants from testcase file. Constants with leading
                            underscores are adaptable, i.e. they can be
      --verbose, -v         Increase verbosity level.
      --store PICKLE        Adds results to PICKLE file for later processing.
      --unit {cy/CL,cy/It,It/s,FLOP/s}, -u {cy/CL,cy/It,It/s,FLOP/s}
                            Select the output unit, defaults to model specific if
                            not given.
      --cores CORES, -c CORES
                            Number of cores to be used in parallel. (default: 1)
                            The benchmark model will run the code with OpenMP on
                            as many physical cores.
      --clean-intermediates
                            If set will delete all intermediate files after
                            completion.
      --compiler COMPILER, -C COMPILER
                            Compiler to use, default is first in machine
                            description file.
      --compiler-flags COMPILER_FLAGS
                            Compiler flags to use. If not set, flags are taken
                            from machine description file (-std=c99 is always
                            added).
      --datatype DATATYPE   Datatype of sources and destinations of the kernel.
                            Defaults to 'double'.
      --flops FLOPS         Number of floating-point operations per inner-most
                            iteration of the kernel.
      --loop LOOP RANGE, -L LOOP RANGE
                            Define ranges of nested loops. The definition must
                            match 'start:[step:]end'. 'step' defaults to 1.
      --repetitions REPETITIONS, -R REPETITIONS
                            Number of kernel repetitions. Can be either a fixed
                            number, a variable name, or the specifier
                            'marker'.Specifying a variable name, the number of
                            repetitions is automatically adjusted when the
                            variable is changed.Specifying 'marker', the number of
                            repetitions is obtained from likwid-perfctr.
      --marker [MARKER]     Benchmark using likwid markers.
    
    arguments for stand-alone benchmark model:
      benchmark
    
      --no-phenoecm         Disables the phenomenological ECM model building.
      --ignore-warnings     Ignore warnings about mismatched CPU model and
                            frequency.

Above, all options for the standalone benchmark tool can be seen. As most of the options are directly adapted from kerncraft, we will only detail differences in the following. First of all, note that more information has to be provided via the command line as the standalone benchmark tool cannot obtain necessary information from the static code analysis. Besides the machine file and the binary, it is required to specify loop ranges and the number of flops.

Number of FLOPs

The number of FLOPs provided with the option --flops corresponds to the number of floating-point operations in the inner-most loop. Assuming that the loops are long enough, floating-point operations in outer loop iterations can be neglected. This number can simply be obtained by counting all FLOPs in the inner-most loop.
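To illustrate why counting only the inner-most FLOPs is enough, here is a minimal sketch of how a FLOP rate could be derived from the value passed to --flops. The helper name, its arguments and the runtime value are assumptions made for illustration; this is not kerncraft's API.

```python
def flop_rate(flops_per_iteration, inner_iterations, repetitions, runtime_s):
    """Return FLOP/s for one kernel run (hypothetical helper, not part of kerncraft)."""
    # FLOPs outside the inner-most loop are neglected, as described above.
    total_flops = flops_per_iteration * inner_iterations * repetitions
    return total_flops / runtime_s


# Example input: a 2D 5-point stencil on a 98 x 98 interior,
# repeated 10 times, with an assumed runtime of 0.5 s.
rate = flop_rate(flops_per_iteration=5, inner_iterations=98 * 98,
                 repetitions=10, runtime_s=0.5)
print(f"{rate:.0f} FLOP/s")
```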

Loop Ranges

The specification of the loop ranges is straightforward. Given a loop range [start, end) (and optionally a step size), the loop range is defined as --loop start:step:end. If no step size is given, a stride of 1 is assumed. If you iterate through more than one loop, just add more --loop specifiers. Note that the last loop specified is assumed to be the inner-most loop. This matters when the loops use different step sizes.
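The following sketch shows how a 'start:[step:]end' specification maps to an iteration count, assuming the half-open interval [start, end) described above. The function names are hypothetical and only the purely numeric forms are handled; it is not the parser used by the tool.

```python
from math import prod


def parse_loop_range(spec):
    """Parse 'start:end' or 'start:step:end' into a range over [start, end)."""
    fields = [int(field) for field in spec.split(":")]
    if len(fields) == 2:
        start, end = fields
        step = 1  # the step defaults to 1 when omitted
    elif len(fields) == 3:
        start, step, end = fields
    else:
        raise ValueError(f"expected 'start:[step:]end', got {spec!r}")
    return range(start, end, step)


def inner_iterations(specs):
    """Total number of inner-most iterations of a loop nest."""
    return prod(len(parse_loop_range(spec)) for spec in specs)


# Outer loop over 1..98, inner loop over 1..97 with a stride of 2.
total = inner_iterations(["1:99", "1:2:99"])
```

Here the outer loop contributes 98 iterations and the strided inner loop 49, so the nest executes 98 * 49 inner-most iterations.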

Repetitions

To obtain benchmark results that are not dominated by performance fluctuations, a kernel needs a sufficiently long runtime. One popular way to extend the runtime is to run the kernel multiple times.

Important: sometimes compilers automatically 'optimize' the code and do not run the kernel multiple times.

To prevent this, you can add a bogus if statement inside the repetition loop, for example:

if(CONDITION_NEVER_TRUE) dummy(a);

Make sure that the dummy function resides in a separate source file and the condition cannot be evaluated at compile time.

If you add such a repetition loop, you can specify the number of repetitions via --repetitions or -R. There are four possibilities for the --repetitions option:

  • the option is omitted, in which case the number of repetitions defaults to 1
  • the value is given as a fixed number, e.g. --repetitions 10
  • the value is determined by a variable, e.g. --repetitions N
  • the value is obtained from a likwid marker (--repetitions marker). In this case, you must also enable --marker and place the likwid marker just before the kernel code (inside the repetition loop).

The advantage of the marker-based method is that you can benchmark kernels whose number of repetitions is not known before the run, e.g. when using while loops. The disadvantage is that the markers introduce an overhead that may or may not be negligible. If it is not negligible, a warning is issued.
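The four cases of --repetitions listed above can be summarized in a few lines. This is only a sketch of the described behaviour: the function, its parameters and the idea that the marker call count arrives as a plain integer are assumptions, not the tool's actual code.

```python
def resolve_repetitions(option, defines, marker_call_count=None):
    """Return the repetition count for the four cases of --repetitions.

    option            -- None, a number as string, a variable name, or 'marker'
    defines           -- constants given via -D, e.g. {'N': 10}
    marker_call_count -- how often the marked region was entered
                         (assumed to be reported by likwid-perfctr)
    """
    if option is None:
        return 1  # option omitted
    if option == "marker":
        if marker_call_count is None:
            raise ValueError("'marker' needs a call count from a marker run")
        return marker_call_count
    if option.isdigit():
        return int(option)  # fixed number
    return defines[option]  # variable name; follows the variable's value
```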

Adjustable Variables

One feature of the benchmarking tool is the automatic adjustment of variables to extend the runtime. This adjustment can apply to the number of repetitions, but also to other variables such as array sizes or tolerances.

Basic Usage

To mark a variable as adjustable, just specify a range for it, e.g. instead of defining a variable N as -D N 10, you can also write -D N 10:1000. The benchmark tool will then use values from 10 to 1000 for N to obtain a sufficiently long runtime. In contrast to the loop ranges, you do not define a step size. The benchmarking tool interpolates the variable value according to the runtime.

There are two modes for the adjustment:

  • linear interpolation of the variable
  • logarithmic interpolation of the variable

Both can be chosen directly by appending a specifier to the variable definition, e.g. -D N 10:1000:lin for (incrementing) linear interpolation or -D tol 1e-5:1e-10:log for (decrementing) logarithmic interpolation. The latter is especially useful when adjusting tolerances.
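The difference between the two modes can be sketched as follows. How the tool actually picks the next value from the measured runtime is not described here, so the fraction parameter is a stand-in for that decision, and the function name is hypothetical.

```python
import math


def interpolate(start, stop, fraction, mode="lin"):
    """Value between start and stop; fraction=0 gives start, fraction=1 gives stop."""
    if mode == "lin":
        return start + fraction * (stop - start)
    if mode == "log":
        # Interpolate the exponent, so equal steps in fraction multiply the
        # value by a constant factor. Requires positive start and stop.
        return start * (stop / start) ** fraction
    raise ValueError(f"unknown mode {mode!r}")


halfway_n = interpolate(10, 1000, 0.5, "lin")       # arithmetic midpoint
halfway_tol = interpolate(1e-5, 1e-10, 0.5, "log")  # geometric mean of the bounds
```

For a tolerance spanning five orders of magnitude, the logarithmic mode spends equal effort on each decade, whereas linear interpolation would stay close to the larger bound for most of the range.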

Loop Ranges and Repetitions

To mark loop ranges (or more specifically: upper loop boundaries) and the number of repetitions as adjustable, we use a small workaround. Instead of specifying the range directly, we assign a variable to the loop or the repetition count.

For repetitions we have already seen how this can be done: --repetitions N. When the variable is changed, the number of repetitions is changed accordingly.

For loop ranges this works quite similarly. After the end value, just add the variable specifier, e.g. --loop 1:999:N. Note that offsets are taken care of automatically.

Let's have a look at a small example: we want to apply a simple stencil to a one-dimensional array of size N=100. As the stencil accesses its neighbours, we must not loop from 0 to N, but from 1 to N-1. Given --loop 1:99:N on the command line, the benchmarking tool detects the offset of 1 at both ends. When N is changed, e.g. to N=1000, the loop range changes accordingly to [1,999).
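The offset handling in this example can be reproduced with a few lines of arithmetic. This sketch assumes that the offsets are measured relative to index 0 and to the current value of the variable; the function name and signature are hypothetical.

```python
def adjust_loop_range(start, end, old_size, new_size):
    """Rescale the loop range [start, end) when the size variable changes.

    The distance of each bound from 0 and from old_size is kept.
    """
    lower_offset = start           # distance from index 0
    upper_offset = old_size - end  # distance from the array end
    return lower_offset, new_size - upper_offset


# --loop 1:99:N with N=100, then N is changed to 1000.
new_range = adjust_loop_range(start=1, end=99, old_size=100, new_size=1000)
```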

likwid Regions

Unless specified otherwise, the entire executable is benchmarked. As one is often interested in the performance of a specific region (e.g. a single kernel without the preceding allocations and calculations), the standalone benchmark tool also supports the use of (multiple) likwid markers.

After marking your kernel code with those markers and compiling it accordingly (-DLIKWID_PERFMON must be set), simply enable the markers by adding --marker to the command line. In this case, all likwid markers found in the binary are benchmarked and analysed.
If you are interested only in some of the regions in your code, you can also specify the marker names of those regions, separated by commas. With the command line argument --marker d2pt5,d2pt9 the tool ignores all other regions and analyses only 'd2pt5' and 'd2pt9'.

Important: Please note that there are some minor limitations concerning the names of the markers. Although likwid itself accepts arbitrary marker names, you must not use the empty string or a name starting with a digit. Otherwise the benchmarking tool may break.
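The rules for the --marker argument described above can be checked up front. The following is a minimal sketch under the assumption that the value arrives as a single comma-separated string (or as None when --marker is given without a value); the function is hypothetical and not part of kerncraft.

```python
def parse_marker_arg(value):
    """Return None for 'all regions' or a list of validated region names."""
    if value is None:  # plain --marker without a value
        return None
    names = value.split(",")
    for name in names:
        # Empty names and names starting with a digit are not supported.
        if name == "" or name[0].isdigit():
            raise ValueError(f"unsupported marker name: {name!r}")
    return names
```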

Typically, the information about loop ranges and FLOPs differs between the different regions. You can easily ascribe any information to a specific region by adding the region name to that information.
This is always done in the same fashion for any type of information: separated by a colon, just add the marker names in front of the information.

Example: the number of FLOPs for two-dimensional 5-point and 9-point stencils are 5 and 9, respectively. Therefore, we would add --flops d2pt5:5 and --flops d2pt9:9 to the command line parameters.

If you enable marker support but do not specify all information for every region, the benchmark tool falls back to information that is not ascribed to a specific region. E.g. --marker d2pt5 --flops 5 will still use 5 FLOPs per iteration for the region d2pt5.
This feature is especially convenient if you want to analyse only one region and do not want to add the region qualifier to every piece of information.
However, if no such 'default' information is found, the tool fails, e.g. when specifying --marker d2pt5 --flops d2pt9:9.

Assigning likwid regions works in the same manner for normal and adjustable variables, loop ranges, and the number of repetitions.
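The lookup order for region-specific information can be sketched for the --flops case as follows. The function and its error type are assumptions for illustration; only the 'region:value' syntax and the fallback rule are taken from the description above.

```python
def resolve_flops(values, region):
    """Pick the --flops value for a region, falling back to an unqualified default."""
    per_region = {}
    default = None
    for value in values:
        name, sep, number = value.rpartition(":")
        if sep:   # qualified form 'region:number'
            per_region[name] = int(number)
        else:     # unqualified form 'number'
            default = int(number)
    if region in per_region:
        return per_region[region]
    if default is None:
        raise KeyError(f"no --flops value for region {region!r}")
    return default
```

With the values d2pt5:5 and d2pt9:9, each region gets its own count; with only a plain 5, every region uses 5; with only d2pt9:9, a lookup for d2pt5 fails, which mirrors the failing command line mentioned above.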
