What are the key factors for optimizing performance when running on a GPU? #3380

FR13ndSDP · 2023-06-20T18:59:46Z

FR13ndSDP
Jun 20, 2023

I have an Nvidia V100 GPU and an RTX 3090 GPU. While running the ATPESC-codes/AMReX_Amr101 code, I observed that the execution time was nearly identical on both GPUs. I also ran some CUDA sample codes provided by Nvidia and found no significant performance difference between the two GPUs. However, when running the Tests/GPU/CNS code, the execution time on the RTX 3090 was almost twice as long as on the V100. What is the main factor causing the slower performance of the CNS code on the RTX 3090, and how can the code be tuned to get better performance on a specific device?

FR13ndSDP · 2023-06-20T19:11:22Z

FR13ndSDP
Jun 20, 2023
Author

Moreover, I've tried implementing a scheme with a larger stencil (6 points) for the convective term in CNS, and the speed difference between the two GPUs is even larger: the V100 is almost 10 times faster. This is the most performance-critical part:

for (MFIter mfi(S, TilingIfNotGPU()); mfi.isValid(); ++mfi)
{
    ...
    // general kernel launch on a thread box
    launch(xflxbx,
    [=] AMREX_GPU_DEVICE (const Box& tbox) noexcept
    {
        reconstruction_x(tbox,ql,qr,q,*lparm);
    });
    ...
}

AMREX_GPU_DEVICE
AMREX_FORCE_INLINE
void
reconstruction_x (amrex::Box const& bx,
             amrex::Array4<amrex::Real> const& ql,
             amrex::Array4<amrex::Real> const& qr,
             amrex::Array4<amrex::Real const> const& q,
             Parm const& parm) noexcept
{
    using amrex::Real;

    const auto lo = amrex::lbound(bx);
    const auto hi = amrex::ubound(bx);

    for (int n = 0; n < NPRIM; ++n) {
        for (int k = lo.z; k <= hi.z; ++k) {
            for (int j = lo.y; j <= hi.y; ++j) {
                for (int i = lo.x; i <= hi.x; ++i) {
                    Real V1 = q(i-3,j,k,n);
                    Real V2 = q(i-2,j,k,n);
                    Real V3 = q(i-1,j,k,n);
                    Real V4 = q(i,j,k,n);
                    Real V5 = q(i+1,j,k,n);
                    Real V6 = q(i+2,j,k,n);

                    // qL
                    Real s11 = 13.0*pow(V1-2.0*V2+V3,2) + 3.0*pow(V1-4.0*V2+3.0*V3,2);
                    Real s22 = 13.0*pow(V2-2.0*V3+V4,2) + 3.0*pow(V2-V4,2);
                    Real s33 = 13.0*pow(V3-2.0*V4+V5,2) + 3.0*pow(3.0*V3-4.0*V4+V5,2);

                    Real s55 = amrex::Math::abs(s11-s33);

                    Real a1 = pow(1.0+s55/(s11+parm.eps),6);
                    Real a2 = pow(1.0+s55/(s22+parm.eps),6);
                    Real a3 = pow(1.0+s55/(s33+parm.eps),6);
                    
                    Real invsum = 1.0/(a1+a2+a3);
                    Real b1 = a1*invsum;
                    Real b2 = a2*invsum;
                    Real b3 = a3*invsum;


                    a1 = 0.1*(b1<1.e-5? 0.0:1.0);
                    a2 = 0.6*(b2<1.e-5? 0.0:1.0);
                    a3 = 0.3*(b3<1.e-5? 0.0:1.0);

                    Real v1 = parm.oneSix*(2.0*V1-7.0*V2+5.0*V3);
                    Real v2 = parm.oneSix*(-V2-V3+2.0*V4);
                    Real v3 = parm.oneSix*(-4.0*V3+5.0*V4-V5);

                    invsum = 1.0/(a1+a2+a3);
                    Real w1 = a1*invsum;
                    Real w2 = a2*invsum;
                    Real w3 = a3*invsum;

                    ql(i,j,k,n) = V3+w1*v1+w2*v2+w3*v3;

                    // qR
                    s11 = 13.0*pow(V6-2.0*V5+V4,2) + 3.0*pow(V6-4.0*V5+3.0*V4,2);
                    s22 = 13.0*pow(V3-2.0*V4+V5,2) + 3.0*pow(V5-V3,2);
                    s33 = 13.0*pow(V4-2.0*V3+V2,2) + 3.0*pow(3.0*V4-4.0*V3+V2,2);

                   ...

                    qr(i,j,k,n) = V4+w1*v1+w2*v2+w3*v3;
                }
            }
        }
    }
}

0 replies

WeiqunZhang · 2023-06-20T23:13:40Z

WeiqunZhang
Jun 20, 2023
Maintainer

The ATPESC-codes/AMReX_Amr101 code does not have enough work for a GPU. It's like a race of only 1 meter, where walking beats driving a car: by the time the car has started, the walker has already finished.
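
To put the car analogy in numbers, here is a minimal sketch of a toy cost model in which every kernel launch pays a fixed overhead before any cell is processed. The `Device` struct, both function names, and any values you feed in are hypothetical illustrations, not measurements of a V100 or an RTX 3090.

```cpp
// Toy cost model (assumption): time = fixed launch overhead + cells / throughput.
// All names and numbers here are illustrative, not measured on any real GPU.
struct Device {
    double launch_overhead_us;  // fixed cost per kernel launch, in microseconds
    double cells_per_us;        // sustained throughput, in cells per microsecond
};

// Estimated time in microseconds to run one kernel over ncells cells.
double kernel_time_us (Device const& d, double ncells)
{
    return d.launch_overhead_us + ncells / d.cells_per_us;
}

// Fraction of the estimated kernel time that is launch overhead, not work.
double overhead_fraction (Device const& d, double ncells)
{
    return d.launch_overhead_us / kernel_time_us(d, ncells);
}
```

With an assumed 10 microsecond overhead on both devices and throughputs of 1000 and 2000 cells per microsecond, a 1000-cell kernel takes 11 versus 10.5 microseconds, so a device that is twice as fast looks almost identical. Only when the cell count is large does the throughput difference show up in the timing.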

6 replies

WeiqunZhang Jun 21, 2023
Maintainer

Maybe the blocking factor is too small so there are a lot of small boxes.
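
As a rough illustration of why small boxes matter, the sketch below counts how many boxes result when a cubic domain is chopped into cubic boxes of a given edge length. It is a simplification: the helper names are hypothetical, and real AMReX grids depend on runtime parameters such as `amr.blocking_factor` and `amr.max_grid_size` as well as on the refinement pattern.

```cpp
#include <cstdint>

// Number of boxes when a cubic domain with n_cell cells per side is chopped
// into cubic boxes of at most box_size cells per side (hypothetical helper).
std::int64_t num_boxes (std::int64_t n_cell, std::int64_t box_size)
{
    std::int64_t per_dim = (n_cell + box_size - 1) / box_size;  // ceiling division
    return per_dim * per_dim * per_dim;
}

// Number of cells in one full cubic box of the given edge length.
std::int64_t cells_per_box (std::int64_t box_size)
{
    return box_size * box_size * box_size;
}
```

For a 256^3 domain, an edge length of 8 gives 32768 boxes of 512 cells each, while an edge length of 128 gives 8 boxes of about 2 million cells each. If each box gets its own kernel launch, the first layout spends most of its time on launch overhead rather than on useful work.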

FR13ndSDP Jun 21, 2023
Author

Do you mean that the V100 is expected to be considerably faster than the RTX 3090? Could you kindly provide some insight into what might be causing the bottleneck on the 3090? Is it possibly the double-precision performance (V100: 8.2 TFLOPS, no data available for the RTX 3090), the memory bandwidth (V100: 1134 GB/sec, RTX 3090: 936 GB/sec), or the cache/shared memory size?

WeiqunZhang Jun 21, 2023
Maintainer

Sorry, I didn't realize the RTX 3090 has comparable memory bandwidth. Maybe it's the double-precision performance? This page says it's 1.1 TFLOPS. You can probably find out more details using Nvidia profiling tools.

https://www.gpuzoo.com/GPU-NVIDIA/GeForce_RTX_3090.html

WeiqunZhang Jun 21, 2023
Maintainer

Note that your latest kernel has a lot of pow. That might explain it.

FR13ndSDP Jun 21, 2023
Author

Thank you for your help. I will try to delve deeper into this using Nvidia tools. If I have any new results, I will provide feedback here.

FR13ndSDP · 2023-06-21T12:49:50Z

FR13ndSDP
Jun 21, 2023
Author

> Note that your latest kernel has a lot of pow. That might explain it.

It turns out the pow function is the culprit: replacing it with ordinary multiplication gives a considerable speedup. Still, the performance on the RTX 3090 is not comparable with the V100. I'll keep investigating.
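
For reference, a minimal sketch of the kind of rewrite described here, using the `s11` smoothness indicator from the kernel above. The helpers `sqr` and `pow6` are hypothetical names introduced for illustration, and `double` stands in for `amrex::Real`. Recent AMReX versions also appear to provide `amrex::Math::powi<N>(x)` for integer powers, but check that against your version before relying on it.

```cpp
#include <cmath>

// Hypothetical helpers, not part of the original kernel.
inline double sqr (double x) noexcept { return x * x; }

inline double pow6 (double x) noexcept
{
    double x2 = x * x;      // x^2
    return x2 * x2 * x2;    // x^6 with three multiplications in total
}

// s11 as written in the original kernel, calling the general pow routine.
inline double s11_pow (double V1, double V2, double V3)
{
    return 13.0*std::pow(V1-2.0*V2+V3, 2) + 3.0*std::pow(V1-4.0*V2+3.0*V3, 2);
}

// The same expression with pow replaced by plain multiplication.
inline double s11_mul (double V1, double V2, double V3)
{
    return 13.0*sqr(V1-2.0*V2+V3) + 3.0*sqr(V1-4.0*V2+3.0*V3);
}
```

The weights would follow the same pattern, for example `a1 = pow6(1.0 + s55/(s11 + parm.eps))`. The two forms of `s11` agree to rounding error, so the change should only affect speed.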

0 replies

zingale · 2023-06-21T12:51:15Z

zingale
Jun 21, 2023
Collaborator

From what I can find online, the V100 has 7 TFLOPS of double-precision performance, while the RTX 3090 has only 0.5 TFLOPS.
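
To see how these peak numbers could translate into kernel time, here is a roofline-style estimate, sketched under assumptions. It uses the figures quoted in this thread (7 and 0.5 TFLOPS, 1134 and 936 GB/s). The `Gpu` struct, the function names, and any flop and byte counts passed in are hypothetical, and real kernels rarely reach peak rates.

```cpp
#include <algorithm>

// Peak rates for one device; the values you pass in are assumptions.
struct Gpu {
    double fp64_tflops;    // peak double-precision rate, in TFLOPS
    double bandwidth_gbs;  // peak memory bandwidth, in GB/s
};

// Roofline-style lower bound on kernel time in seconds: the kernel can finish
// no sooner than its arithmetic or its memory traffic allows, whichever is slower.
double roofline_time_s (Gpu const& g, double flops, double bytes)
{
    double compute_s = flops / (g.fp64_tflops * 1.0e12);
    double memory_s  = bytes / (g.bandwidth_gbs * 1.0e9);
    return std::max(compute_s, memory_s);
}

// Arithmetic intensity (flops per byte) above which a kernel is compute-bound.
double ridge_point (Gpu const& g)
{
    return (g.fp64_tflops * 1.0e12) / (g.bandwidth_gbs * 1.0e9);
}
```

With these inputs the ridge point is about 6.2 flops per byte for the V100 and about 0.53 for the RTX 3090. A double-precision kernel doing 1 flop per byte would therefore be memory-bound on the V100 but compute-bound on the 3090, and the model puts the 3090 roughly 2.3 times slower. Arithmetic-heavy kernels push that gap wider, while kernels dominated by launch overhead show no gap at all.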

0 replies
