Welcome to the mkfit-hackathon wiki!

To disable optimization in nvcc: -O0 -Xcicc -O0 -Xptxas -O0
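
For reference, a full compile line might look like the following; the source file name and the -arch value are assumptions (sm_60 matches the Tesla P100 used below), not taken from the original build setup:

nvcc -O0 -Xcicc -O0 -Xptxas -O0 -arch=sm_60 -o multorture multorture.cu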

nvprof commands:

nvprof ./multorture
nvprof --print-gpu-trace ./multorture
time nvprof --metrics flop_count_sp,flop_sp_efficiency,shared_store_transactions,shared_load_transactions,local_load_transactions,local_store_transactions,gld_transactions,gst_transactions,gld_throughput,gst_throughput,gld_requested_throughput,gld_efficiency,gst_requested_throughput,gst_efficiency,l2_read_transactions,l2_write_transactions,l2_utilization,l1_cache_global_hit_rate,l1_shared_utilization,l2_l1_read_hit_rate ./multorture
nvprof --events warps_launched,local_load --metrics ipc,ipc_instance,inst_executed,inst_fp_32,inst_integer,inst_per_warp ./multorture

May 23, morning

Sample code

__global__ void raw_reg_c_mult_loop_kn(const float* const a, const float* const b, 
    float* c, const int N)
{
  // Repeat the whole multiplication nN times to stretch the kernel runtime.
  const int nN = 1000;
  for (int oLoop = 0; oLoop < nN; ++oLoop) {
    // Grid-stride loop over the N matrices; data are laid out matriplex-style,
    // i.e. element i of matrix n lives at index n + N*i.
    for (int n = threadIdx.x + blockIdx.x * blockDim.x;
         n < N;
         n += blockDim.x * gridDim.x) {

      // Stage both input matrices in registers (36 floats each).
      float a_ar[36];
      float b_ar[36];
      for (int i = 0; i < 36; ++i) {
        const int idx = n + N*i;
        a_ar[i] = a[idx];
        b_ar[i] = b[idx];
      }
      // 6x6 matrix product, accumulating each output element in a register.
      for (int i = 0; i < 6; ++i) {
        for (int j = 0; j < 6; ++j) {
          float c_tmp = 0.f;
          for (int k = 0; k < 6; ++k) {
            c_tmp += a_ar[i + 6*k] * b_ar[k + 6*j];
          }
          c[n + N*(i + 6*j)] = c_tmp;
        }
      }
    }
  } // oLoop
}
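
For context, a minimal host-side sketch of how this kernel might be allocated and launched; it is not part of the original benchmark. Only N = 7168 and the 64-thread blocks (giving the 112-block grid seen in the trace below) come from the measurements, and the variable names are illustrative:

#include <cuda_runtime.h>

int main()
{
  const int N = 7168;                                    // matrices per array
  const size_t bytes = size_t(N) * 36 * sizeof(float);   // 1,032,192 bytes per array

  float *d_a, *d_b, *d_c;
  cudaMalloc(&d_a, bytes);
  cudaMalloc(&d_b, bytes);
  cudaMalloc(&d_c, bytes);
  // ... fill d_a and d_b from the host with cudaMemcpy ...

  const int block = 64;                        // threads per block, as in the trace
  const int grid  = (N + block - 1) / block;   // 112 blocks for N = 7168
  raw_reg_c_mult_loop_kn<<<grid, block>>>(d_a, d_b, d_c, N);
  cudaDeviceSynchronize();

  cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
  return 0;
}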

Some extracted numbers with numerology

N = 7168 matrices, 64 threads per block
.. 7168 * 36 * 4 bytes = 1,032,192 bytes in one matrix array
Summary:   1.8142ms total  1 call  1.8142ms avg / 1.8142ms min / 1.8142ms max  raw_reg_c_mult_loop_kn(float const *, float const *, float*, int)
GPU trace: start 428.68ms, duration 1.8094ms, grid (112 1 1), block (64 1 1), 96 registers/thread, 0B static shared mem, 0B dynamic shared mem, Tesla P100-SXM2, context 1, stream 7
            flop_count_sp   Floating Point Operations(Single Precision)  3,096,576,000
.. 1.7 TFLOPS achieved vs ~9.5 TFLOPS single-precision peak from the specs (the ratio matches the 17.72% efficiency on the next line)
        flop_sp_efficiency                  FLOP Efficiency(Peak Single)      17.72%
 shared_store_transactions                     Shared Store Transactions           0
  shared_load_transactions                      Shared Load Transactions           0
   local_load_transactions                       Local Load Transactions           0
  local_store_transactions                      Local Store Transactions           0
          gld_transactions                      Global Load Transactions   129024000
.. 129024000 = 1000* 7168* 18 -> 18 loads per 72 floats => 16 usable bytes per load
          gst_transactions                     Global Store Transactions    32256000
.. 32256000 = 1000* 7168* 4.5 -> 4.5 stores per 36 floats => 32 usable bytes per store 
            gld_throughput                        Global Load Throughput  2307.3GB/s
.. vs 720.9 GB/s HBM bandwidth ==> looks like caches are actually in use
.. 4.19 GB per call; 32 bytes per load transaction
            gst_throughput                       Global Store Throughput  576.82GB/s
.. 1.05 GB per call; 32 bytes per store transaction
      l2_read_transactions                          L2 Read Transactions    63539508
.. l2_read/gld = 0.49
     l2_write_transactions                         L2 Write Transactions    32256013
            l2_utilization                          L2 Cache Utilization    High (7)
                       ipc                              Executed IPC    1.348572
             inst_executed                     Instructions Executed   174,497,792
                inst_fp_32                   FP Instructions(Single)  1,548,288,000
              inst_integer                      Integer Instructions  1,928,213,504
              inst_per_warp                     Instructions per warp  7.7901e+05
             warps_launched   224  (min = max = avg = total; 112 blocks × 2 warps per block)
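
As a cross-check of the numerology, the three big counters above can be reproduced from first principles; the short host-side snippet below only illustrates that arithmetic and is not part of the benchmark code:

#include <cstdio>

int main()
{
  const long long N = 7168, nN = 1000;

  // 6x6 * 6x6 product: 36 output elements, 6 FMAs each, counted as 2 FLOP per FMA.
  const long long flops = N * nN * 36 * 6 * 2;     // 3,096,576,000 = flop_count_sp

  // Loads: 72 floats (a_ar + b_ar) = 288 bytes per matrix pair, and as noted above
  // only 16 bytes of each 32-byte transaction end up useful -> 18 transactions.
  const long long gld = N * nN * (72 * 4 / 16);    // 129,024,000 = gld_transactions

  // Stores: 36 floats = 144 bytes per matrix, 32 useful bytes per transaction.
  const double gst = N * nN * (36 * 4 / 32.0);     // 32,256,000 = gst_transactions

  printf("flop_count_sp    %lld\n", flops);
  printf("gld_transactions %lld\n", gld);
  printf("gst_transactions %.0f\n", gst);
  return 0;
}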

May 23, afternoon

Notes related to the numerology above

Eigen Results

  • Use of Eigen in CUDA code is (still) incomplete
    • CUDA 9 requires a development version of Eigen
    • The most straightforward 6x6 matrix multiplication c[i] = a[i]*b[i] would not compile
  • Eigen implementation is ~30% faster than our slowest, most naive version, but 9x slower than our best versions
  • Optimizations that produced our fastest versions had no effect on Eigen
  • It's a good option if one doesn't want to spend effort on optimization
    • It could do better on more complex calculations; in our simple case the compiler completely unrolls the loops in the best hand-tuned versions

Sample code:

#include <Eigen/Dense>

using Matrix66 = Eigen::Matrix<float, 6, 6>;  // assumed; matches the instantiation in the profile below

__global__ void eigen_reg_c_mult_kn(const Matrix66* __restrict__ a, const Matrix66* __restrict__ b, Matrix66* c, const int N)
{
  // Grid-stride loop over N matrix pairs (array-of-structures layout): each thread
  // forms one full 6x6 product in registers and writes it back.
  for (int n = threadIdx.x + blockIdx.x * blockDim.x; n < N; n += blockDim.x * gridDim.x) {
    Matrix66 c_reg;
    c_reg = a[n] * b[n];
    c[n] = c_reg;
  }
}

==66068== Profiling result:
Time(%)      Time   Name
 21.33%  84.904ms   void naive_mult_kn<GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>, GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>, GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>>(float, int=6, int=6, int)
 15.92%  63.396ms   eigen_reg_c_mult_kn(Eigen::Matrix<float, int=6, int=6, int=0, int=6, int=6> const *, Eigen::Matrix<float, int=6, int=6, int=0, int=6, int=6> const *, Eigen::Matrix<float, int=6, int=6, int=0, int=6, int=6>*, int)
 15.74%  62.681ms   eigen_reg_mult_kn(Eigen::Matrix<float, int=6, int=6, int=0, int=6, int=6> const *, Eigen::Matrix<float, int=6, int=6, int=0, int=6, int=6> const *, Eigen::Matrix<float, int=6, int=6, int=0, int=6, int=6>*, int)
 10.73%  42.710ms   void reg_c_mult_kn<GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>, GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>, GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>>(float, int=6, int=6, int)
  9.31%  37.066ms   void shared_mult_kn<GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>, GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>, GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>>(float, int=6, int=6, int)
  7.71%  30.709ms   raw_naive_mult_kn(float const *, float const *, float*, int)
  5.14%  20.481ms   raw_shared_mult_kn(float const *, float const *, float*, int)
  4.51%  17.972ms   raw_reg_c_mult_kn(float const *, float const *, float*, int)
  4.01%  15.951ms   void reg_mult_kn<GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>, GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>, GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>>(float, int=6, int=6, int)
  1.80%  7.1675ms   raw_reg_mult_kn(float const *, float const *, float*, int)

May 24, morning

Discuss and define tasks for the following days [wip]

  • Develop one of the simple test function calls into a roofline benchmark (a sketch of what one such point could look like follows this list).
    • E.g. define 5-10 points that cover a range of (arithmetic intensity, memory use) values.
    • These will serve as reference points in comparisons with actual code performance.
  • Re-establish the performance of a more realistic setup on the GPU.
    • E.g. track fitting.
  • Bottom-up estimates of tracking compute needs on GPUs.
    • E.g. take into account the KF math for FLOP counts; simple track and hit data structures for data needs; and a few realistic numbers of hits, seeds, and candidates to set the scale. Based on these inputs and the capabilities of a target GPU (P100?), find which scenarios can ideally come close to the roofline.
  • Possible top-down estimates of tracking compute needs on GPUs from known sources.
    • E.g. use track building or full-event compute/performance counts from a working x86 example to project the required throughput and FLOPs.
  • ...
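
As a sketch of what one roofline point could look like (the kernel name, the FLOPS_PER_ELEM knob, and the constants are illustrative, not existing code), a streaming kernel whose arithmetic intensity is fixed at compile time would do:

template <int FLOPS_PER_ELEM>
__global__ void roofline_point_kn(const float* __restrict__ in, float* __restrict__ out, const int N)
{
  // Each thread streams one float in and one out (8 bytes of traffic) and performs
  // FLOPS_PER_ELEM/2 FMAs on it, so the arithmetic intensity is ~FLOPS_PER_ELEM/8 FLOP/byte.
  // Sweeping FLOPS_PER_ELEM over a handful of values gives the 5-10 reference points above.
  for (int n = threadIdx.x + blockIdx.x * blockDim.x; n < N; n += blockDim.x * gridDim.x) {
    float x = in[n];
#pragma unroll
    for (int i = 0; i < FLOPS_PER_ELEM / 2; ++i)
      x = x * 1.0000001f + 1.0e-7f;   // 1 FMA = 2 FLOP per iteration
    out[n] = x;
  }
}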