Welcome to the mkfit-hackathon wiki!
To disable optimization in nvcc: `-O0 -Xcicc -O0 -Xptxas -O0`
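For example, a full unoptimized debug build might look like this (a sketch only: the source file name and the `sm_60` target for the P100 are assumptions, not taken from the repo):

```
nvcc -O0 -Xcicc -O0 -Xptxas -O0 -arch=sm_60 -lineinfo -o multorture multorture.cu
```

`-lineinfo` keeps source-line correlation available in the profilers while optimization is off.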
nvprof commands:

```
nvprof ./multorture
nvprof --print-gpu-trace ./multorture
time nvprof --metrics flop_count_sp,flop_sp_efficiency,shared_store_transactions,shared_load_transactions,local_load_transactions,local_store_transactions,gld_transactions,gst_transactions,gld_throughput,gst_throughput,gld_requested_throughput,gld_efficiency,gst_requested_throughput,gst_efficiency,l2_read_transactions,l2_write_transactions,l2_utilization,l1_cache_global_hit_rate,l1_shared_utilization,l2_l1_read_hit_rate ./multorture
nvprof --events warps_launched,local_load --metrics ipc,ipc_instance,inst_executed,inst_fp_32,inst_integer,inst_per_warp ./multorture
```
Sample code
```cuda
__global__ void raw_reg_c_mult_loop_kn(const float* const a, const float* const b,
                                       float* c, const int N)
{
  const int nN = 1000;  // repeat the multiplication many times for stable timing
  for (int oLoop = 0; oLoop < nN; ++oLoop) {
    for (int n = threadIdx.x + blockIdx.x * blockDim.x;
         n < N;
         n += blockDim.x * gridDim.x) {
      // copy the n-th 6x6 input matrices into registers; the arrays use a
      // structure-of-arrays layout: element i of matrix n sits at [n + N*i]
      float a_ar[36];
      float b_ar[36];
      for (int i = 0; i < 36; ++i) {
        const int idx = n + N*i;
        a_ar[i] = a[idx];
        b_ar[i] = b[idx];
      }
      // column-major 6x6 product: c = a * b
      for (int i = 0; i < 6; ++i) {
        for (int j = 0; j < 6; ++j) {
          float c_tmp = 0.f;
          for (int k = 0; k < 6; ++k) {
            c_tmp += a_ar[i + 6*k] * b_ar[k + 6*j];
          }
          c[n + N*(i + 6*j)] = c_tmp;
        }
      }
    }
  } // oLoop
}
```
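A minimal host-side harness for the kernel above, as a sketch only: the launch configuration (112 blocks × 64 threads) and N = 7168 are taken from the GPU trace below; the rest of the scaffolding is illustrative, not the actual multorture code.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  const int N = 7168;                                   // number of 6x6 matrices
  const size_t bytes = size_t(N) * 36 * sizeof(float);  // 1,032,192 bytes per array
  float *a, *b, *c;
  cudaMalloc(&a, bytes);
  cudaMalloc(&b, bytes);
  cudaMalloc(&c, bytes);
  // inputs left uninitialized: fine for timing, not for checking results
  raw_reg_c_mult_loop_kn<<<112, 64>>>(a, b, c, N);      // grid/block from the trace
  cudaDeviceSynchronize();
  printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));
  cudaFree(a); cudaFree(b); cudaFree(c);
  return 0;
}
```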
Some extracted numbers with numerology
N = 7168 matrices (64 threads per block)
.. 1,032,192 bytes in 1 matrix array (7168 × 36 floats × 4 B)

```
Time      Calls  Avg       Min       Max       Name
1.8142ms  1      1.8142ms  1.8142ms  1.8142ms  raw_reg_c_mult_loop_kn(float const *, float const *, float*, int)

Start     Duration  Grid Size  Block Size  Regs  SSMem  DSMem  Device           Context  Stream
428.68ms  1.8094ms  (112 1 1)  (64 1 1)    96    0B     0B     Tesla P100-SXM2  1        7
```
```
flop_count_sp              Floating Point Operations (Single Precision)  3,096,576,000
  .. = 1000 × 7168 × 432 FLOPs (6×6×6 FMAs per matrix, counted as 2 FLOPs each)
  .. 1.7 TFLOPS; vs 9.5 TFLOPS in the specs (same percentage as flop_sp_efficiency -> OK)
flop_sp_efficiency         FLOP Efficiency (Peak Single)                 17.72%
shared_store_transactions  Shared Store Transactions                     0
shared_load_transactions   Shared Load Transactions                      0
local_load_transactions    Local Load Transactions                       0
local_store_transactions   Local Store Transactions                      0
gld_transactions           Global Load Transactions                      129,024,000
  .. 129024000 = 1000 × 7168 × 18 -> 18 transactions per 72 floats loaded => 16 usable bytes per load
gst_transactions           Global Store Transactions                     32,256,000
  .. 32256000 = 1000 × 7168 × 4.5 -> 4.5 transactions per 36 floats stored => 32 usable bytes per store
gld_throughput             Global Load Throughput                        2307.3 GB/s
  .. vs 720.9 GB/s HBM bandwidth ==> looks like caches are actually in use
  .. 4.19 GB per call; 32 bytes per load transaction
gst_throughput             Global Store Throughput                       576.82 GB/s
  .. 1.05 GB per call; 32 bytes per store transaction
l2_read_transactions       L2 Read Transactions                          63,539,508
  .. l2_read/gld = 0.49
l2_write_transactions      L2 Write Transactions                         32,256,013
l2_utilization             L2 Cache Utilization                          High (7)
ipc                        Executed IPC                                  1.348572
inst_executed              Instructions Executed                         174,497,792
inst_fp_32                 FP Instructions (Single)                      1,548,288,000
  .. = flop_count_sp / 2 -> the multiply-adds all issue as FMA
inst_integer               Integer Instructions                          1,928,213,504
inst_per_warp              Instructions per warp                         7.7901e+05
warps_launched             Warps Launched                                224
  .. 112 blocks × 64 threads = 224 warps
```
Notes related to the numerology above
- https://docs.nvidia.com/cuda/pascal-tuning-guide/index.html#l1-cache On Pascal the data access unit is 32B regardless of whether global loads are cached in L1.
- After moving to CUDA 9.2, the numerology is consistent between the expected and reported bandwidth; it looks like we reached the roofline here.
- Use of Eigen in CUDA code is (still) incomplete
- CUDA 9 requires a development version of Eigen
- The most straightforward 6x6 matrix multiplication, `c[i] = a[i]*b[i]`, would not compile (see the sketch after this list)
- The Eigen implementation is ~30% faster than our slowest, most naive version, but 9x slower than our best versions
- Optimizations that produced our fastest versions had no effect on Eigen
- It's a good option if you don't want to spend effort on optimization
- It could do better on more complex calculations; in our simple case the compiler completely unrolls the loops in the best versions
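For reference, a sketch of the failing direct form, assuming `Matrix66` is `Eigen::Matrix<float, 6, 6>` (consistent with the demangled names in the profile below); the kernel name here is invented:

```cuda
#include <Eigen/Dense>

using Matrix66 = Eigen::Matrix<float, 6, 6>;

// The straightforward per-matrix product that would not compile:
__global__ void eigen_direct_mult_kn(const Matrix66* a, const Matrix66* b,
                                     Matrix66* c, const int N)
{
  for (int n = threadIdx.x + blockIdx.x * blockDim.x; n < N;
       n += blockDim.x * gridDim.x) {
    c[n] = a[n] * b[n];  // direct product into global memory fails to build
  }
}
```

The working version below avoids this by evaluating into a register temporary first.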
Sample code:
```cuda
__global__ void eigen_reg_c_mult_kn(const Matrix66* __restrict__ a,
                                    const Matrix66* __restrict__ b,
                                    Matrix66* c, const int N)
{
  for (int n = threadIdx.x + blockIdx.x * blockDim.x; n < N;
       n += blockDim.x * gridDim.x) {
    Matrix66 c_reg;        // evaluate into a register temporary first,
    c_reg = a[n] * b[n];   // then copy to global memory
    c[n] = c_reg;
  }
}
```
```
==66068== Profiling result:
Time(%)  Time      Name
 21.33%  84.904ms  void naive_mult_kn<GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>, GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>, GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>>(float, int=6, int=6, int)
 15.92%  63.396ms  eigen_reg_c_mult_kn(Eigen::Matrix<float, int=6, int=6, int=0, int=6, int=6> const *, Eigen::Matrix<float, int=6, int=6, int=0, int=6, int=6> const *, Eigen::Matrix<float, int=6, int=6, int=0, int=6, int=6>*, int)
 15.74%  62.681ms  eigen_reg_mult_kn(Eigen::Matrix<float, int=6, int=6, int=0, int=6, int=6> const *, Eigen::Matrix<float, int=6, int=6, int=0, int=6, int=6> const *, Eigen::Matrix<float, int=6, int=6, int=0, int=6, int=6>*, int)
 10.73%  42.710ms  void reg_c_mult_kn<GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>, GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>, GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>>(float, int=6, int=6, int)
  9.31%  37.066ms  void shared_mult_kn<GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>, GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>, GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>>(float, int=6, int=6, int)
  7.71%  30.709ms  raw_naive_mult_kn(float const *, float const *, float*, int)
  5.14%  20.481ms  raw_shared_mult_kn(float const *, float const *, float*, int)
  4.51%  17.972ms  raw_reg_c_mult_kn(float const *, float const *, float*, int)
  4.01%  15.951ms  void reg_mult_kn<GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>, GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>, GPlex<Matriplex::Matriplex<float, int=6, int=6, int=16>>>(float, int=6, int=6, int)
  1.80%  7.1675ms  raw_reg_mult_kn(float const *, float const *, float*, int)
```
Discuss and define tasks for the following days [wip]
- Develop one of the simple test function calls into a roofline benchmark (see the sketch after this list).
  - E.g. define 5-10 points that cover a range of (arithmetic intensity, memory use) values.
  - These will be used as reference points in comparisons with actual code performance.
- Re-establish performance of a more realistic setup on the GPU.
  - E.g. track fitting.
- Bottom-up estimates of tracking compute needs on GPUs.
  - E.g. take into account KF math for FLOP counts, simple track and hit data structures for data needs, and a few realistic cases of numbers of hits, seeds, and candidates to define the scale. Based on these inputs and the compute capabilities of a target GPU (P100?), find which scenarios can ideally come close to the roofline.
- Possible top-down estimates of tracking compute needs on GPUs from known sources.
  - E.g. use track building or full-event compute/performance counts from a working x86 example to project the required throughput and FLOPs.
- ...
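As a starting point for the roofline benchmark task above, a minimal sketch: the kernel streams N floats and performs K FMAs per element, so the template parameter scans arithmetic intensity while N scans memory use. The names here (`roofline_point_kn`, `K`) are invented for illustration.

```cuda
#include <cuda_runtime.h>

// One roofline point: 8 bytes of global traffic (one load + one store)
// and 2*K FLOPs per element => arithmetic intensity = K/4 FLOP/byte.
template <int K>
__global__ void roofline_point_kn(const float* __restrict__ in, float* out, const int N)
{
  for (int n = threadIdx.x + blockIdx.x * blockDim.x; n < N;
       n += blockDim.x * gridDim.x) {
    float x = in[n];
#pragma unroll
    for (int k = 0; k < K; ++k)
      x = fmaf(x, 1.000001f, 0.5f);  // 1 FMA = 2 FLOPs
    out[n] = x;
  }
}

// E.g. instantiate K = 1, 4, 16, 64, 256 to sweep from bandwidth-bound to
// compute-bound, and vary N to sweep the memory footprint:
//   roofline_point_kn<16><<<112, 64>>>(d_in, d_out, N);
```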