Home

Welcome to the mkfit-hackathon wiki!

To disable optimization in nvcc: -O0 -Xcicc -O0 -Xptxas -O0

nvprof commands:

nvprof   ./multorture
nvprof  --print-gpu-trace ./multorture
time nvprof --metrics flop_count_sp,flop_sp_efficiency,shared_store_transactions,shared_load_transactions,local_load_transactions,local_store_transactions,gld_transactions,gst_transactions,gld_throughput,gst_throughput,l2_read_transactions,l2_write_transactions,l2_utilization,l1_cache_global_hit_rate,l1_shared_utilization,l2_l1_read_hit_rate ./multorture
nvprof --events warps_launched,local_load --metrics ipc,ipc_instance,inst_executed,inst_fp_32,inst_integer,inst_per_warp ./multorture

Sample code

__global__ void raw_reg_c_mult_loop_kn(const float* const a, const float* const b, 
    float* c, const int N)
{

  int nN = 1000;
  for (int oLoop = 0; oLoop< nN; ++oLoop){
    for (int n = threadIdx.x + blockIdx.x * blockDim.x;
         n < N;
         n += blockDim.x * gridDim.x) {
      
      float a_ar[36];
      float b_ar[36];
      for (int i = 0; i < 36; ++i){
        const int idx = n + N*i;
        a_ar[i] = a[idx];
        b_ar[i] = b[idx];
      }
      for (int i = 0; i < 6; ++i) {
        for (int j = 0; j < 6; ++j) {
          float c_tmp = 0.f;
          for (int k = 0; k < 6; ++k) {
            c_tmp += a_ar[i + 6*k] * b_ar[k + 6*j];
          }
          c[n + N*(i + 6*j)] = c_tmp;
        }
      }
    }
  }//oLoop< nN; ++oLoop){
}

Some extracted numbers with numerology

N 7168 (64)
.. 1032192 bytes in 1 matrix
1.8142ms         1  1.8142ms  1.8142ms  1.8142ms  raw_reg_c_mult_loop_kn(float const *, float const *, float*, int)
428.68ms  1.8094ms            (112 1 1)        (64 1 1)        96        0B        0B
         -           -           -           -  Tesla P100-SXM2         1         7  
            flop_count_sp   Floating Point Operations(Single Precision)  3,096,576,000
.. 1.7 TFLOPS; vs 9.5 specs (same percentage -> OK)
        flop_sp_efficiency                  FLOP Efficiency(Peak Single)      17.72%
 shared_store_transactions                     Shared Store Transactions           0
  shared_load_transactions                      Shared Load Transactions           0
   local_load_transactions                       Local Load Transactions           0
  local_store_transactions                      Local Store Transactions           0
          gld_transactions                      Global Load Transactions   129024000
.. 129024000 = 1000* 7168* 18 -> 18 loads per 72 floats => 16 usable bytes per load
          gst_transactions                     Global Store Transactions    32256000
.. 32256000 = 1000* 7168* 4.5 -> 4.5 stores per 36 floats => 32 usable bytes per store 
            gld_throughput                        Global Load Throughput  2307.3GB/s
.. vs 720.9 GB/s HBM bandwidth ==> looks like caches are actually in use
.. 4.19 GB per call; 32 bytes per load transaction
            gst_throughput                       Global Store Throughput  576.82GB/s
.. 1.05 GB per call; 32 bytes per store transaction
      l2_read_transactions                          L2 Read Transactions    63539508
.. l2_read/gld = 0.49
     l2_write_transactions                         L2 Write Transactions    32256013
            l2_utilization                          L2 Cache Utilization    High (7)
                       ipc                              Executed IPC    1.348572
             inst_executed                     Instructions Executed   174,497,792
                inst_fp_32                   FP Instructions(Single)  1,548,288,000
              inst_integer                      Integer Instructions  1,928,213,504
              inst_per_warp                     Instructions per warp  7.7901e+05
             warps_launched         224         224         224         224

Notes related to the numerology above

https://docs.nvidia.com/cuda/pascal-tuning-guide/index.html#l1-cache On Pascal the data access unit is 32B regardless of whether global loads are cached in L1.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Home

Clone this wiki locally