Welcome to the mkfit-hackathon wiki!
To disable optimization in nvcc: `-O0 -Xcicc -O0 -Xptxas -O0`
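For example, a full unoptimized debug build might look like this (a sketch only: the source file name and the `sm_60` target for the P100 are assumptions, not taken from the repo):

```
nvcc -O0 -Xcicc -O0 -Xptxas -O0 -arch=sm_60 -lineinfo -o multorture multorture.cu
```

`-lineinfo` keeps source-line correlation available in the profilers while optimization is off.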
nvprof commands:

```
nvprof ./multorture
nvprof --print-gpu-trace ./multorture
time nvprof --metrics flop_count_sp,flop_sp_efficiency,shared_store_transactions,shared_load_transactions,local_load_transactions,local_store_transactions,gld_transactions,gst_transactions,gld_throughput,gst_throughput,gld_requested_throughput,gld_efficiency,gst_requested_throughput,gst_efficiency,l2_read_transactions,l2_write_transactions,l2_utilization,l1_cache_global_hit_rate,l1_shared_utilization,l2_l1_read_hit_rate ./multorture
nvprof --events warps_launched,local_load --metrics ipc,ipc_instance,inst_executed,inst_fp_32,inst_integer,inst_per_warp ./multorture
```
Sample code
```cuda
__global__ void raw_reg_c_mult_loop_kn(const float* const a, const float* const b,
                                       float* c, const int N)
{
  const int nN = 1000;  // repeat the multiplication many times for stable timing
  for (int oLoop = 0; oLoop < nN; ++oLoop) {
    for (int n = threadIdx.x + blockIdx.x * blockDim.x;
         n < N;
         n += blockDim.x * gridDim.x) {
      // copy the n-th 6x6 input matrices into registers; the arrays use a
      // structure-of-arrays layout: element i of matrix n sits at [n + N*i]
      float a_ar[36];
      float b_ar[36];
      for (int i = 0; i < 36; ++i) {
        const int idx = n + N*i;
        a_ar[i] = a[idx];
        b_ar[i] = b[idx];
      }
      // column-major 6x6 product: c = a * b
      for (int i = 0; i < 6; ++i) {
        for (int j = 0; j < 6; ++j) {
          float c_tmp = 0.f;
          for (int k = 0; k < 6; ++k) {
            c_tmp += a_ar[i + 6*k] * b_ar[k + 6*j];
          }
          c[n + N*(i + 6*j)] = c_tmp;
        }
      }
    }
  } // oLoop
}
```
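A minimal host-side harness for the kernel above, as a sketch only: the launch configuration (112 blocks × 64 threads) and N = 7168 are taken from the GPU trace below; the rest of the scaffolding is illustrative, not the actual multorture code.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  const int N = 7168;                                   // number of 6x6 matrices
  const size_t bytes = size_t(N) * 36 * sizeof(float);  // 1,032,192 bytes per array
  float *a, *b, *c;
  cudaMalloc(&a, bytes);
  cudaMalloc(&b, bytes);
  cudaMalloc(&c, bytes);
  // inputs left uninitialized: fine for timing, not for checking results
  raw_reg_c_mult_loop_kn<<<112, 64>>>(a, b, c, N);      // grid/block from the trace
  cudaDeviceSynchronize();
  printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));
  cudaFree(a); cudaFree(b); cudaFree(c);
  return 0;
}
```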
Some extracted numbers with numerology
N = 7168 matrices (64 threads per block)
.. 1,032,192 bytes in 1 matrix array (7168 × 36 floats × 4 B)

```
Time      Calls  Avg       Min       Max       Name
1.8142ms  1      1.8142ms  1.8142ms  1.8142ms  raw_reg_c_mult_loop_kn(float const *, float const *, float*, int)

Start     Duration  Grid Size  Block Size  Regs  SSMem  DSMem  Device           Context  Stream
428.68ms  1.8094ms  (112 1 1)  (64 1 1)    96    0B     0B     Tesla P100-SXM2  1        7
```
```
flop_count_sp              Floating Point Operations (Single Precision)  3,096,576,000
  .. = 1000 × 7168 × 432 FLOPs (6×6×6 FMAs per matrix, counted as 2 FLOPs each)
  .. 1.7 TFLOPS; vs 9.5 TFLOPS in the specs (same percentage as flop_sp_efficiency -> OK)
flop_sp_efficiency         FLOP Efficiency (Peak Single)                 17.72%
shared_store_transactions  Shared Store Transactions                     0
shared_load_transactions   Shared Load Transactions                      0
local_load_transactions    Local Load Transactions                       0
local_store_transactions   Local Store Transactions                      0
gld_transactions           Global Load Transactions                      129,024,000
  .. 129024000 = 1000 × 7168 × 18 -> 18 transactions per 72 floats loaded => 16 usable bytes per load
gst_transactions           Global Store Transactions                     32,256,000
  .. 32256000 = 1000 × 7168 × 4.5 -> 4.5 transactions per 36 floats stored => 32 usable bytes per store
gld_throughput             Global Load Throughput                        2307.3 GB/s
  .. vs 720.9 GB/s HBM bandwidth ==> looks like caches are actually in use
  .. 4.19 GB per call; 32 bytes per load transaction
gst_throughput             Global Store Throughput                       576.82 GB/s
  .. 1.05 GB per call; 32 bytes per store transaction
l2_read_transactions       L2 Read Transactions                          63,539,508
  .. l2_read/gld = 0.49
l2_write_transactions      L2 Write Transactions                         32,256,013
l2_utilization             L2 Cache Utilization                          High (7)
ipc                        Executed IPC                                  1.348572
inst_executed              Instructions Executed                         174,497,792
inst_fp_32                 FP Instructions (Single)                      1,548,288,000
  .. = flop_count_sp / 2 -> the multiply-adds all issue as FMA
inst_integer               Integer Instructions                          1,928,213,504
inst_per_warp              Instructions per warp                         7.7901e+05
warps_launched             Warps Launched                                224
  .. 112 blocks × 64 threads = 224 warps
```
Notes related to the numerology above
- https://docs.nvidia.com/cuda/pascal-tuning-guide/index.html#l1-cache On Pascal the data access unit is 32B regardless of whether global loads are cached in L1.
- After moving to CUDA 9.2, the numerology is consistent between the expected and reported bandwidth; it looks like we reached the roofline here.
- Use of Eigen in CUDA code is (still) incomplete
- CUDA 9 requires a development version of Eigen
- The most straightforward 6x6 matrix multiplication, `c[i] = a[i]*b[i]`, would not compile (see the sketch after this list)
- The Eigen implementation is ~30% faster than our slowest, most naive version, but 9x slower than our best versions
- Optimizations that produced our fastest versions had no effect on Eigen
- It's a good option if you don't want to spend effort on optimization
- It could do better on more complex calculations; in our simple case the compiler completely unrolls the loops in the best versions
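For reference, a sketch of the failing direct form, assuming `Matrix66` is `Eigen::Matrix<float, 6, 6>` (consistent with the demangled names in the profile below); the kernel name here is invented:

```cuda
#include <Eigen/Dense>

using Matrix66 = Eigen::Matrix<float, 6, 6>;

// The straightforward per-matrix product that would not compile:
__global__ void eigen_direct_mult_kn(const Matrix66* a, const Matrix66* b,
                                     Matrix66* c, const int N)
{
  for (int n = threadIdx.x + blockIdx.x * blockDim.x; n < N;
       n += blockDim.x * gridDim.x) {
    c[n] = a[n] * b[n];  // direct product into global memory fails to build
  }
}
```

The working version below avoids this by evaluating into a register temporary first.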
Sample code:
```cuda
__global__ void eigen_reg_c_mult_kn(const Matrix66* __restrict__ a,
                                    const Matrix66* __restrict__ b,
                                    Matrix66* c, const int N)
{
  for (int n = threadIdx.x + blockIdx.x * blockDim.x; n < N;
       n += blockDim.x * gridDim.x) {
    Matrix66 c_reg;        // evaluate into a register temporary first,
    c_reg = a[n] * b[n];   // then copy to global memory
    c[n] = c_reg;
  }
}
```
```
==66068== Profiling result:
Time(%)  Time      Name
 21.33%  84.904ms  void naive_mult_kn<GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>, GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>, GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>>(float, int=6, int=6, int)
 15.92%  63.396ms  eigen_reg_c_mult_kn(Eigen::Matrix<float, int=6, int=6, int=0, int=6, int=6> const *, Eigen::Matrix<float, int=6, int=6, int=0, int=6, int=6> const *, Eigen::Matrix<float, int=6, int=6, int=0, int=6, int=6>*, int)
 15.74%  62.681ms  eigen_reg_mult_kn(Eigen::Matrix<float, int=6, int=6, int=0, int=6, int=6> const *, Eigen::Matrix<float, int=6, int=6, int=0, int=6, int=6> const *, Eigen::Matrix<float, int=6, int=6, int=0, int=6, int=6>*, int)
 10.73%  42.710ms  void reg_c_mult_kn<GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>, GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>, GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>>(float, int=6, int=6, int)
  9.31%  37.066ms  void shared_mult_kn<GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>, GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>, GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>>(float, int=6, int=6, int)
  7.71%  30.709ms  raw_naive_mult_kn(float const *, float const *, float*, int)
  5.14%  20.481ms  raw_shared_mult_kn(float const *, float const *, float*, int)
  4.51%  17.972ms  raw_reg_c_mult_kn(float const *, float const *, float*, int)
  4.01%  15.951ms  void reg_mult_kn<GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>, GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>, GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>>(float, int=6, int=6, int)
  1.80%  7.1675ms  raw_reg_mult_kn(float const *, float const *, float*, int)
```
Discuss and define tasks for the following days [wip]
- Develop one of the simple test function calls into a roofline benchmark (see the sketch after this list).
  - E.g. define 5-10 points that cover a range of (arithmetic intensity, memory use) values.
  - These will be used as reference points in comparisons with actual code performance.
- Re-establish performance of a more realistic setup on the GPU.
  - E.g. track fitting.
- Bottom-up estimates of tracking compute needs on GPUs.
  - E.g. take into account KF math for FLOP counts, simple track and hit data structures for data needs, and a few realistic cases of numbers of hits, seeds, and candidates to define the scale. Based on these inputs and the compute capabilities of a target GPU (P100?), find which scenarios can ideally come close to the roofline.
- Possible top-down estimates of tracking compute needs on GPUs from known sources.
  - E.g. use track building or full-event compute/performance counts from a working x86 example to project the required throughput and FLOPs.
- ...
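As a starting point for the roofline benchmark task above, a minimal sketch: the kernel streams N floats and performs K FMAs per element, so the template parameter scans arithmetic intensity while N scans memory use. The names here (`roofline_point_kn`, `K`) are invented for illustration.

```cuda
#include <cuda_runtime.h>

// One roofline point: 8 bytes of global traffic (one load + one store)
// and 2*K FLOPs per element => arithmetic intensity = K/4 FLOP/byte.
template <int K>
__global__ void roofline_point_kn(const float* __restrict__ in, float* out, const int N)
{
  for (int n = threadIdx.x + blockIdx.x * blockDim.x; n < N;
       n += blockDim.x * gridDim.x) {
    float x = in[n];
#pragma unroll
    for (int k = 0; k < K; ++k)
      x = fmaf(x, 1.000001f, 0.5f);  // 1 FMA = 2 FLOPs
    out[n] = x;
  }
}

// E.g. instantiate K = 1, 4, 16, 64, 256 to sweep from bandwidth-bound to
// compute-bound, and vary N to sweep the memory footprint:
//   roofline_point_kn<16><<<112, 64>>>(d_in, d_out, N);
```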