SpGEMM_Benchmarks

Notes

Kokkos 2.50 is used for the experiments below. The KokkosKernels algorithms described here are released as part of KokkosKernels version 2.60.
April 24, 2018: Kokkos::deep_copy has a minor bug in Kokkos 2.60 (https://github.com/kokkos/kokkos/issues/1583). Use Kokkos 2.50 until the next Kokkos release.

KNL Benchmark

This benchmark compares the following methods:

KokkosKernels methods
- KKSpGEMM (default algorithm in KokkosKernels)
- KKMEM
- KKDENSE
MKL methods (intel-18.0.128)
- MKL-INS: MKL inspector-executor.
- MKL7: two-phase MKL with no sorting (option 7).
- MKL8: two-phase MKL with output sorting (option 8).

The comparison is performed for both KNL's cache mode and flat DDR memory mode. All experiments use quadrant mode, which has a single NUMA domain. All runtimes correspond to the "NoReuse" case, in which both the symbolic and numeric phases are executed.

Setting up the compilation and run time parameters:

Setting up and replicating results on KNLs: explains how the experiments are compiled and run in cache mode. The same procedure can be followed to run in flat DDR mode. The only changes required are:
- Node allocation: allocate the node in flat memory mode.
- Run with "numactl --membind 0 executable ..." so that memory is allocated from DDR only and MCDRAM is not used.

Experiments on Flat DDR memory mode

KNL DDR RAW TABLE: gives a table of the runtimes of the 6 methods on 83 multiplications. For each matrix, the table gives the best runtime of each method over the different thread counts. The last 6 columns give the number of threads at which each method achieved its best performance. For example, on DDR all methods except MKL8 achieve their best runtime for audikw_1 with 128 threads.
KNL DDR Performance Profile: gives the performance profiles of the algorithms on KNL with flat DDR. For a given x, the y value is the number of problem cases for which a method is at most x times slower than the best result achieved among the compared methods on that individual problem. The y value at x=1 is therefore the number of problem cases for which the method achieved the best performance. The x value at which y reaches 83 is the largest slowdown the method showed on any problem, compared with the best observed performance for that problem over all methods.
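The two descriptions above can be sketched in a few lines of Python. This is only an illustration of how a raw-table entry and a performance-profile point are derived; the problem names and timings are hypothetical and are not taken from the benchmark data, and the variable names (`raw`, `best`, `slowdown`, `profile`) are assumptions of this sketch.

```python
# Hypothetical runtimes in seconds, keyed as raw[problem][method][threads].
raw = {
    "problem_a": {"KKSpGEMM": {64: 1.9, 128: 1.2}, "MKL8": {64: 2.5, 128: 2.8}},
    "problem_b": {"KKSpGEMM": {64: 4.0, 128: 3.0}, "MKL8": {64: 2.0, 128: 2.4}},
    "problem_c": {"KKSpGEMM": {64: 0.5, 128: 0.6}, "MKL8": {64: 0.8, 128: 0.75}},
}

# Raw-table step: best runtime per method, plus the thread count that achieved it.
best = {
    problem: {
        method: min((t, n) for n, t in by_threads.items())
        for method, by_threads in methods.items()
    }
    for problem, methods in raw.items()
}

# Slowdown of each method relative to the fastest method on the same problem.
slowdown = {}
for problem, methods in best.items():
    fastest = min(t for t, _ in methods.values())
    slowdown[problem] = {m: t / fastest for m, (t, _) in methods.items()}


def profile(method, x):
    """Number of problems on which `method` is at most x times slower than the best."""
    return sum(1 for ratios in slowdown.values() if ratios[method] <= x)


for method in ("KKSpGEMM", "MKL8"):
    print(method, [profile(method, x) for x in (1.0, 1.5, 2.0, 2.5)])
```

With this toy data, `profile(method, 1.0)` counts the problems a method wins outright, and the smallest x at which the count reaches the number of problems is that method's worst slowdown.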

Experiments on Cache mode

KNL CACHE MODE - RAW TABLE: gives the same table as above for cache mode.
KNL CACHE MODE - Performance Profile: gives the performance profiles of the algorithms on KNL in cache mode.

Experiments using processor masking

The experiments above are run with the following environment variable set.

export OMP_PROC_BIND=spread

Most methods achieve their best performance with this configuration on most of the datasets. However, when used without processor masking, this configuration runs 256 threads (or 128 threads) on the 68 cores of KNL, so the 256-thread runs have either 3 or 4 threads on each core. Below we run another benchmark using MPI masking, by adding mpirun -np 1 -map-by socket:PE=64 before the executable. This restricts execution to 64 cores. We observe results similar to those above.
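The thread counts per core mentioned above follow from simple division. The sketch below assumes an idealized, perfectly even spread of threads over cores; the actual placement under OMP_PROC_BIND=spread depends on the OpenMP runtime and place settings, and the helper name `threads_per_core` is hypothetical.

```python
def threads_per_core(num_threads, num_cores):
    """Spread threads as evenly as possible; return {threads_on_a_core: number_of_cores}."""
    base, extra = divmod(num_threads, num_cores)
    layout = {base + 1: extra, base: num_cores - extra}
    # Drop empty entries, e.g. when the threads divide evenly over the cores.
    return {per_core: cores for per_core, cores in layout.items() if cores > 0}


print(threads_per_core(256, 68))  # unmasked, 256 threads
print(threads_per_core(128, 68))  # unmasked, 128 threads
print(threads_per_core(256, 64))  # masked to 64 cores (assuming 256 threads are still used)
```

Under this assumption, 256 threads on 68 cores cannot be balanced, while restricting the run to 64 cores gives every core the same number of threads.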

KNL DDR MODE - Performance Profile with MPI Processor Masking
KNL CACHE MODE - Performance Profile with MPI Processor Masking

Power8 Benchmark

This benchmark compares the following methods:

KokkosKernels methods
- KKSpGEMM (default algorithm in KokkosKernels)
- KKMEM
- KKDENSE
ViennaCL OpenMP SpGEMM methods (v. 1.7.1).

All runtimes correspond to the "NoReuse" case, in which both the symbolic and numeric phases are executed.

Setting up the compilation and run time parameters:

Setting up and replicating results on Power8 CPUs: Explains how the experiments are compiled and run.

Experiment Results

Power8 RAW TABLE: gives a table of the runtimes of the compared methods on 83 multiplications on Power8.
Power8 Performance Profile: gives the performance profiles of the algorithms for Power8.

P100 Benchmark

This benchmark compares the following methods:

KokkosKernels methods
- KKSpGEMM (default algorithm in KokkosKernels)
- KKMEM
- KKLP
ViennaCL CUDA implementation (v. 1.7.1)
cuSPARSE (cuda-8)
Nsparse (v-1.2, July 2017)

All runtimes correspond to the "NoReuse" case, in which both the symbolic and numeric phases are executed.

Setting up the compilation and run time parameters:

Setting up and replicating results on P100 GPUs: explains how the experiments are compiled and run on P100 GPUs.

Experiment Results

P100 RAW TABLE: gives a table including the runtimes of the 6 methods on 81 multiplications.
P100 Performance Profile: gives the performance profiles of the algorithms for P100 GPUs.
KKSPGEMM Speedup w.r.t. Nsparse: gives the speedup of KKSPGEMM w.r.t. Nsparse. The multiplications are sorted by FLOPs, and each bar shows the geometric mean of the speedup over a group of 10 multiplications.
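As a rough illustration of how such bars could be computed, the sketch below sorts the multiplications by FLOPs, splits them into consecutive groups, and takes the geometric mean of the speedup in each group. It assumes that speedup means Nsparse runtime divided by KKSPGEMM runtime; the function name `grouped_geomean_speedup` and the record layout are hypothetical, and the sample numbers are invented for the example.

```python
import math


def grouped_geomean_speedup(records, group_size=10):
    """records: list of (flops, kkspgemm_time, nsparse_time) tuples.

    Returns one geometric-mean speedup per group of `group_size`
    multiplications, taken in order of increasing FLOPs.
    """
    ordered = sorted(records, key=lambda r: r[0])
    bars = []
    for start in range(0, len(ordered), group_size):
        group = ordered[start:start + group_size]
        # Assumed definition of speedup: Nsparse time / KKSPGEMM time.
        logs = [math.log(ns / kk) for _, kk, ns in group]
        bars.append(math.exp(sum(logs) / len(logs)))
    return bars


# Invented sample: four multiplications, grouped in pairs to keep the example short.
sample = [(4e9, 2.0, 4.0), (1e6, 1.0, 2.0), (1e8, 1.0, 0.5), (1e7, 1.0, 8.0)]
print(grouped_geomean_speedup(sample, group_size=2))
```

The geometric mean is computed through logarithms so that a speedup of 2 and a slowdown of 0.5 cancel to 1, which an arithmetic mean would not do.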