Expand and add more CUDA/HIP documentation
Document cuda_pool, cuda_scheduler, cuda_stream, cublas_handle, cusolver_handle, as well as expose
these with CUDA sender adaptors in the documentation. Adds a high-level example of using CUDA functionality.
msimberg committed Nov 1, 2024
1 parent bfa0be5 commit 8af4738
Showing 12 changed files with 456 additions and 74 deletions.
6 changes: 4 additions & 2 deletions docs/Doxyfile
@@ -7,7 +7,9 @@ GENERATE_RTF = NO
GENERATE_HTML = NO
QUIET = NO
OUTPUT_DIRECTORY = "$(PIKA_DOCS_DOXYGEN_OUTPUT_DIRECTORY)"
INPUT = "$(PIKA_DOCS_DOXYGEN_INPUT_ROOT)/libs/pika/init_runtime" \
INPUT = "$(PIKA_DOCS_DOXYGEN_INPUT_ROOT)/libs/pika/async_cuda" \
"$(PIKA_DOCS_DOXYGEN_INPUT_ROOT)/libs/pika/async_cuda_base" \
"$(PIKA_DOCS_DOXYGEN_INPUT_ROOT)/libs/pika/init_runtime" \
"$(PIKA_DOCS_DOXYGEN_INPUT_ROOT)/libs/pika/runtime" \
"$(PIKA_DOCS_DOXYGEN_INPUT_ROOT)/libs/pika/execution"
FILE_PATTERNS = *.cpp *.hpp *.cu
@@ -18,4 +20,4 @@ EXTRACT_ALL = YES
ENABLE_PREPROCESSING = YES
MACRO_EXPANSION = YES
EXPAND_ONLY_PREDEF = YES
PREDEFINED = PIKA_EXPORT=
PREDEFINED = PIKA_EXPORT= PIKA_NVCC_PRAGMA_HD_WARNING_DISABLE=
98 changes: 97 additions & 1 deletion docs/api.rst
Expand Up @@ -5,6 +5,8 @@
Distributed under the Boost Software License, Version 1.0. (See accompanying
file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)

:tocdepth: 3

.. _api:

=============
@@ -32,7 +34,6 @@ These headers are part of the public API, but are currently undocumented.
- ``pika/async_rw_mutex.hpp``
- ``pika/barrier.hpp``
- ``pika/condition_variable.hpp``
- ``pika/cuda.hpp``
- ``pika/execution.hpp``
- ``pika/latch.hpp``
- ``pika/mpi.hpp``
@@ -104,3 +105,98 @@ The ``pika/execution.hpp`` header provides functionality related to ``std::execution``
.. literalinclude:: ../examples/documentation/when_all_vector_documentation.cpp
:language: c++
:start-at: #include

.. _header_pika_cuda:

``pika/cuda.hpp``
=================

The ``pika/cuda.hpp`` header provides functionality related to CUDA and HIP. All functionality is
under the ``pika::cuda::experimental`` namespace, and class and function names contain ``cuda``
even when HIP support is enabled. CUDA and HIP functionality can be enabled with the CMake options
``PIKA_WITH_CUDA`` and ``PIKA_WITH_HIP``, respectively. In the following, whenever CUDA is
mentioned, it refers to both CUDA and HIP interchangeably.

.. note::
https://github.com/pika-org/pika/issues/116 tracks a potential renaming of the functionality
to avoid using ``cuda`` even when HIP is enabled. If you have feedback on a rename or just want
to follow along, please see that issue.

.. warning::
At the moment, ``nvcc`` cannot compile stdexec headers. Of the CUDA compilers, only ``nvc++`` is
able to compile stdexec headers. If you have stdexec support enabled in pika, either ensure that
``.cu`` files do not include stdexec headers, or use ``nvc++`` to compile your application.
However, ``nvc++`` does not officially support compiling device code. Use at your own risk.

For HIP there are no known restrictions.

The CUDA support in pika relies on four major components:

1. A pool of CUDA streams as well as cuBLAS and cuSOLVER handles. These streams and handles are used
in a round-robin fashion by various sender adaptors.
2. A CUDA scheduler, in the ``std::execution`` sense. This uses the CUDA pool to schedule work on a
GPU.
3. Sender adaptors. A few special-purpose sender adaptors, as well as customizations of a few
``std::execution`` adaptors are provided to help schedule different types of work on a GPU.
4. Polling of CUDA events integrated into the pika scheduling loop. This integration is essential to
avoid calling e.g. ``cudaStreamSynchronize`` on a pika task, which would block the underlying
worker thread and thus block progress of other work.
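
The round-robin idea behind the stream and handle pool (point 1 above) can be illustrated with a
minimal sketch in plain C++. This is not pika's actual implementation; ``fake_stream`` and
``round_robin_pool`` are hypothetical stand-ins for ``cudaStream_t``/``hipStream_t`` and
``cuda_pool``:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a CUDA/HIP stream; the real cuda_pool also holds
// cuBLAS and cuSOLVER handles alongside the streams.
struct fake_stream
{
    int id;
};

// Minimal sketch of the round-robin scheme: every request for a stream
// returns the next one in the pool, wrapping around at the end.
class round_robin_pool
{
    std::vector<fake_stream> streams_;
    std::size_t next_ = 0;

public:
    explicit round_robin_pool(std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
        {
            streams_.push_back(fake_stream{static_cast<int>(i)});
        }
    }

    fake_stream& get_next_stream()
    {
        fake_stream& s = streams_[next_];
        next_ = (next_ + 1) % streams_.size();
        return s;
    }
};
```

Spreading consecutive requests over independent streams in this way lets otherwise unrelated
pieces of work overlap on the GPU.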

The following example gives an overview of using the above CUDA functionalities in pika:

.. literalinclude:: ../examples/documentation/cuda_overview_documentation.cu
:language: c++
:start-at: #include
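
The event-polling integration described in point 4 can likewise be sketched without CUDA:
continuations are registered against an event, and a polling step run as part of the scheduling
loop fires them once the event is ready, instead of blocking a worker thread in
``cudaStreamSynchronize``. All names below are illustrative placeholders, not pika's API:

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a CUDA event; the real polling checks event
// completion (e.g. cudaEventQuery) without blocking.
struct fake_event
{
    int countdown;    // becomes ready once it has counted down to zero
    bool ready() { return countdown-- <= 0; }
};

// Sketch of the polling idea: pending (event, continuation) pairs are checked
// periodically; ready continuations run, the rest stay queued. No worker
// thread ever blocks waiting for an event.
class event_poller
{
    std::vector<std::pair<fake_event, std::function<void()>>> pending_;

public:
    void add(fake_event e, std::function<void()> continuation)
    {
        pending_.emplace_back(e, std::move(continuation));
    }

    // One iteration of the scheduling loop's polling step.
    void poll()
    {
        for (std::size_t i = 0; i < pending_.size();)
        {
            if (pending_[i].first.ready())
            {
                pending_[i].second();
                pending_.erase(pending_.begin() + i);
            }
            else { ++i; }
        }
    }

    bool empty() const { return pending_.empty(); }
};
```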

.. note::
   While ``cuda_pool`` gives direct access to streams and handles, the intended usage is to
   access them via the scheduler and sender adaptors.

.. doxygenclass:: pika::cuda::experimental::cuda_pool
.. doxygenclass:: pika::cuda::experimental::cuda_stream
.. doxygenclass:: pika::cuda::experimental::cublas_handle
.. doxygenclass:: pika::cuda::experimental::locked_cublas_handle
.. doxygenclass:: pika::cuda::experimental::cusolver_handle
.. doxygenclass:: pika::cuda::experimental::locked_cusolver_handle

..
.. literalinclude:: ../examples/documentation/cuda_pool_documentation.cpp
:language: c++
:start-at: #include
.. doxygenclass:: pika::cuda::experimental::cuda_scheduler

..
.. literalinclude:: ../examples/documentation/cuda_scheduler_documentation.cpp
:language: c++
:start-at: #include
.. doxygenvariable:: pika::cuda::experimental::then_with_stream

.. literalinclude:: ../examples/documentation/then_with_stream_documentation.cu
:language: c++
:start-at: #include


.. doxygenvariable:: pika::cuda::experimental::then_with_cublas

..
.. literalinclude:: ../examples/documentation/then_with_cublas_documentation.cpp
:language: c++
:start-at: #include
.. doxygenvariable:: pika::cuda::experimental::then_with_cusolver

..
.. literalinclude:: ../examples/documentation/then_with_cusolver_documentation.cpp
:language: c++
:start-at: #include
.. note::
   TODO: Is this useful? Just transfer to host explicitly?

.. doxygenvariable:: pika::cuda::experimental::then_on_host

..
.. literalinclude:: ../examples/documentation/then_on_host_documentation.cpp
:language: c++
:start-at: #include
2 changes: 2 additions & 0 deletions examples/documentation/CMakeLists.txt
@@ -5,11 +5,13 @@
# file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)

set(example_programs
# cuda_overview_documentation # TODO
drop_operation_state_documentation
drop_value_documentation
hello_world_documentation
init_hpp_documentation
split_tuple_documentation
# then_with_stream_documentation # TODO
unpack_documentation
when_all_vector_documentation
)
84 changes: 84 additions & 0 deletions examples/documentation/cuda_overview_documentation.cu
@@ -0,0 +1,84 @@
// Copyright (c) 2023 ETH Zurich
//
// SPDX-License-Identifier: BSL-1.0
// Distributed under the Boost Software License, Version 1.0. (See accompanying
// file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)

#include <pika/cuda.hpp>
#include <pika/execution.hpp>
#include <pika/init.hpp>

#include <fmt/printf.h>

#include <iostream>
#include <utility>

#if defined(PIKA_HAVE_CUDA)
# include <cublas_v2.h>
# include <cuda_runtime.h>
using stream_t = cudaStream_t;
using blas_handle_t = cublasHandle_t;
#elif defined(PIKA_HAVE_HIP)
# include <hip/hip_runtime.h>
# include <hipblas.h>
using stream_t = hipStream_t;
using blas_handle_t = hipblasHandle_t;
#endif

__global__ void kernel() { printf("Hello from kernel! threadIdx.x: %d\n", threadIdx.x); }

int main(int argc, char* argv[])
{
    namespace cu = pika::cuda::experimental;
    namespace ex = pika::execution::experimental;
    namespace tt = pika::this_thread::experimental;

    pika::start(argc, argv);
    ex::thread_pool_scheduler cpu_sched{};

    // Create a pool of CUDA streams and cuBLAS/cuSOLVER handles
    cu::cuda_pool pool{};
    // Then create a scheduler
    cu::cuda_scheduler cuda_sched{pool};

    {
        // There is no error checking of CUDA/HIP calls below!
        constexpr std::size_t n = 2048;
        double* a = nullptr;
        double* b = nullptr;
        double* c = nullptr;
        cudaMalloc(&a, sizeof(double) * n * n);
        cudaMalloc(&b, sizeof(double) * n * n);
        cudaMalloc(&c, sizeof(double) * n * n);

        // TODO: enable polling
        auto s = ex::just(42) | ex::continues_on(cuda_sched) |

            // CUDA kernel through a lambda
            ex::then([](int x) { printf("Hello from the GPU! x: %d\n", x); }) |

            // Explicitly launch a CUDA kernel with a stream
            // See https://github.com/eth-cscs/whip
            cu::then_with_stream([](stream_t stream) { kernel<<<32, 1, 0, stream>>>(); }) |

            // Explicitly run a continuation on the host, keeping the CUDA scheduler active.
            // This will run after the kernel above has finished, but in an unspecified
            // execution context.
            cu::then_on_host([] { fmt::print("Hello from the CPU!\n"); }) |

            // Launch a cuBLAS/cuSOLVER kernel; blas_gemm is a placeholder for a real call
            // such as cublasDgemm
            cu::then_with_cublas([&](blas_handle_t handle) { blas_gemm(handle, a, b, c); });

        tt::sync_wait(std::move(s));

        // We know that all work has completed so we can safely free the memory
        cudaFree(a);
        cudaFree(b);
        cudaFree(c);
    }

    pika::finalize();
    pika::stop();

    return 0;
}
52 changes: 52 additions & 0 deletions examples/documentation/then_with_stream_documentation.cu
@@ -0,0 +1,52 @@
// Copyright (c) 2024 ETH Zurich
//
// SPDX-License-Identifier: BSL-1.0
// Distributed under the Boost Software License, Version 1.0. (See accompanying
// file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)

#include <pika/cuda.hpp>
#include <pika/execution.hpp>
#include <pika/init.hpp>

#include <fmt/printf.h>

#include <iostream>
#include <utility>

// TODO: Just use whip?
#if defined(PIKA_HAVE_CUDA)
# include <cuda_runtime.h>
using stream_t = cudaStream_t;
#elif defined(PIKA_HAVE_HIP)
# include <hip/hip_runtime.h>
using stream_t = hipStream_t;
#endif

__global__ void kernel(int x) { printf("Hello from kernel! x: %d, threadIdx.x: %d\n", x, threadIdx.x); }

int main(int argc, char* argv[])
{
    namespace cu = pika::cuda::experimental;
    namespace ex = pika::execution::experimental;
    namespace tt = pika::this_thread::experimental;

    pika::start(argc, argv);
    ex::thread_pool_scheduler cpu_sched{};
    cu::cuda_pool pool{};
    cu::cuda_scheduler cuda_sched{pool};

    {
        // There is no error checking of CUDA/HIP calls below!
        // TODO: enable polling
        // The value sent by just is passed through to the callable, followed by the stream
        auto s = ex::just(42) | ex::continues_on(cuda_sched) |
            cu::then_with_stream([](int x, stream_t stream) { kernel<<<32, 1, 0, stream>>>(x); });

        tt::sync_wait(std::move(s));
    }

    pika::finalize();
    pika::stop();

    return 0;
}
