Expand and add more CUDA/HIP documentation
Document cuda_pool, cuda_scheduler, cuda_stream, cublas_handle, cusolver_handle, as well as expose
these with CUDA sender adaptors in the documentation. Adds a high-level example of using CUDA functionality.
msimberg committed Nov 1, 2024
1 parent bfa0be5 commit 8af4738
Showing 12 changed files with 456 additions and 74 deletions.
6 changes: 4 additions & 2 deletions docs/Doxyfile
@@ -7,7 +7,9 @@ GENERATE_RTF = NO
GENERATE_HTML = NO
QUIET = NO
OUTPUT_DIRECTORY = "$(PIKA_DOCS_DOXYGEN_OUTPUT_DIRECTORY)"
INPUT = "$(PIKA_DOCS_DOXYGEN_INPUT_ROOT)/libs/pika/init_runtime" \
INPUT = "$(PIKA_DOCS_DOXYGEN_INPUT_ROOT)/libs/pika/async_cuda" \
"$(PIKA_DOCS_DOXYGEN_INPUT_ROOT)/libs/pika/async_cuda_base" \
"$(PIKA_DOCS_DOXYGEN_INPUT_ROOT)/libs/pika/init_runtime" \
"$(PIKA_DOCS_DOXYGEN_INPUT_ROOT)/libs/pika/runtime" \
"$(PIKA_DOCS_DOXYGEN_INPUT_ROOT)/libs/pika/execution"
FILE_PATTERNS = *.cpp *.hpp *.cu
@@ -18,4 +20,4 @@ EXTRACT_ALL = YES
ENABLE_PREPROCESSING = YES
MACRO_EXPANSION = YES
EXPAND_ONLY_PREDEF = YES
PREDEFINED = PIKA_EXPORT=
PREDEFINED = PIKA_EXPORT= PIKA_NVCC_PRAGMA_HD_WARNING_DISABLE=
98 changes: 97 additions & 1 deletion docs/api.rst
Expand Up @@ -5,6 +5,8 @@
Distributed under the Boost Software License, Version 1.0. (See accompanying
file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)

:tocdepth: 3

.. _api:

=============
@@ -32,7 +34,6 @@ These headers are part of the public API, but are currently undocumented.
- ``pika/async_rw_mutex.hpp``
- ``pika/barrier.hpp``
- ``pika/condition_variable.hpp``
- ``pika/cuda.hpp``
- ``pika/execution.hpp``
- ``pika/latch.hpp``
- ``pika/mpi.hpp``
@@ -104,3 +105,98 @@ The ``pika/execution.hpp`` header provides functionality related to ``std::execution``
.. literalinclude:: ../examples/documentation/when_all_vector_documentation.cpp
:language: c++
:start-at: #include

.. _header_pika_cuda:

``pika/cuda.hpp``
=================

The ``pika/cuda.hpp`` header provides functionality related to CUDA and HIP. All functionality is
under the ``pika::cuda::experimental`` namespace, and class and function names contain ``cuda``
even when HIP support is enabled. CUDA and HIP functionality can be enabled with the CMake options
``PIKA_WITH_CUDA`` and ``PIKA_WITH_HIP``, respectively. In the following, whenever CUDA is
mentioned, it refers to both CUDA and HIP interchangeably.

.. note::
https://github.com/pika-org/pika/issues/116 tracks a potential renaming of the functionality
to avoid using ``cuda`` even when HIP is enabled. If you have feedback on a rename or just want
to follow along, please see that issue.

.. warning::
At the moment, ``nvcc`` cannot compile stdexec headers. Of the CUDA compilers, only ``nvc++`` is
able to compile stdexec headers. If you have stdexec support enabled in pika, either ensure that
``.cu`` files do not include stdexec headers, or use ``nvc++`` to compile your application.
However, ``nvc++`` does not officially support compiling device code. Use at your own risk.

For HIP there are no known restrictions.

The CUDA support in pika relies on four major components:

1. A pool of CUDA streams as well as cuBLAS and cuSOLVER handles. These streams and handles are used
in a round-robin fashion by various sender adaptors.
2. A CUDA scheduler, in the ``std::execution`` sense. This uses the CUDA pool to schedule work on a
GPU.
3. Sender adaptors. A few special-purpose sender adaptors, as well as customizations of a few
``std::execution`` adaptors are provided to help schedule different types of work on a GPU.
4. Polling of CUDA events integrated into the pika scheduling loop. This integration is essential to
avoid calling e.g. ``cudaStreamSynchronize`` on a pika task, which would block the underlying
worker thread and thus block progress of other work.
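
The round-robin idea behind the stream and handle pool (point 1 above) can be illustrated with a
minimal sketch in plain C++. This is not pika's actual implementation; ``fake_stream`` and
``round_robin_pool`` are hypothetical stand-ins for ``cudaStream_t``/``hipStream_t`` and
``cuda_pool``:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a CUDA/HIP stream; the real cuda_pool also holds
// cuBLAS and cuSOLVER handles alongside the streams.
struct fake_stream
{
    int id;
};

// Minimal sketch of the round-robin scheme: every request for a stream
// returns the next one in the pool, wrapping around at the end.
class round_robin_pool
{
    std::vector<fake_stream> streams_;
    std::size_t next_ = 0;

public:
    explicit round_robin_pool(std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
        {
            streams_.push_back(fake_stream{static_cast<int>(i)});
        }
    }

    fake_stream& get_next_stream()
    {
        fake_stream& s = streams_[next_];
        next_ = (next_ + 1) % streams_.size();
        return s;
    }
};
```

Spreading consecutive requests over independent streams in this way lets otherwise unrelated
pieces of work overlap on the GPU.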

The following example gives an overview of using the above CUDA functionalities in pika:

.. literalinclude:: ../examples/documentation/cuda_overview_documentation.cu
:language: c++
:start-at: #include
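
The event-polling integration described in point 4 can likewise be sketched without CUDA:
continuations are registered against an event, and a polling step run as part of the scheduling
loop fires them once the event is ready, instead of blocking a worker thread in
``cudaStreamSynchronize``. All names below are illustrative placeholders, not pika's API:

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a CUDA event; the real polling checks event
// completion (e.g. cudaEventQuery) without blocking.
struct fake_event
{
    int countdown;    // becomes ready once it has counted down to zero
    bool ready() { return countdown-- <= 0; }
};

// Sketch of the polling idea: pending (event, continuation) pairs are checked
// periodically; ready continuations run, the rest stay queued. No worker
// thread ever blocks waiting for an event.
class event_poller
{
    std::vector<std::pair<fake_event, std::function<void()>>> pending_;

public:
    void add(fake_event e, std::function<void()> continuation)
    {
        pending_.emplace_back(e, std::move(continuation));
    }

    // One iteration of the scheduling loop's polling step.
    void poll()
    {
        for (std::size_t i = 0; i < pending_.size();)
        {
            if (pending_[i].first.ready())
            {
                pending_[i].second();
                pending_.erase(pending_.begin() + i);
            }
            else { ++i; }
        }
    }

    bool empty() const { return pending_.empty(); }
};
```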

.. note::
   While ``cuda_pool`` gives direct access to streams and handles, the intended usage is to
   access them via the scheduler and sender adaptors.

.. doxygenclass:: pika::cuda::experimental::cuda_pool
.. doxygenclass:: pika::cuda::experimental::cuda_stream
.. doxygenclass:: pika::cuda::experimental::cublas_handle
.. doxygenclass:: pika::cuda::experimental::locked_cublas_handle
.. doxygenclass:: pika::cuda::experimental::cusolver_handle
.. doxygenclass:: pika::cuda::experimental::locked_cusolver_handle

..
.. literalinclude:: ../examples/documentation/cuda_pool_documentation.cpp
:language: c++
:start-at: #include
.. doxygenclass:: pika::cuda::experimental::cuda_scheduler

..
.. literalinclude:: ../examples/documentation/cuda_scheduler_documentation.cpp
:language: c++
:start-at: #include
.. doxygenvariable:: pika::cuda::experimental::then_with_stream

.. literalinclude:: ../examples/documentation/then_with_stream_documentation.cu
:language: c++
:start-at: #include


.. doxygenvariable:: pika::cuda::experimental::then_with_cublas

..
.. literalinclude:: ../examples/documentation/then_with_cublas_documentation.cpp
:language: c++
:start-at: #include
.. doxygenvariable:: pika::cuda::experimental::then_with_cusolver

..
.. literalinclude:: ../examples/documentation/then_with_cusolver_documentation.cpp
:language: c++
:start-at: #include
.. note::
   TODO: Is this useful? Just transfer to host explicitly?

.. doxygenvariable:: pika::cuda::experimental::then_on_host

..
.. literalinclude:: ../examples/documentation/then_on_host_documentation.cpp
:language: c++
:start-at: #include
2 changes: 2 additions & 0 deletions examples/documentation/CMakeLists.txt
@@ -5,11 +5,13 @@
# file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)

set(example_programs
# cuda_overview_documentation # TODO
drop_operation_state_documentation
drop_value_documentation
hello_world_documentation
init_hpp_documentation
split_tuple_documentation
# then_with_stream_documentation # TODO
unpack_documentation
when_all_vector_documentation
)
84 changes: 84 additions & 0 deletions examples/documentation/cuda_overview_documentation.cu
@@ -0,0 +1,84 @@
// Copyright (c) 2023 ETH Zurich
//
// SPDX-License-Identifier: BSL-1.0
// Distributed under the Boost Software License, Version 1.0. (See accompanying
// file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)

#include <pika/cuda.hpp>
#include <pika/execution.hpp>
#include <pika/init.hpp>

#include <fmt/printf.h>

#include <iostream>
#include <utility>

#if defined(PIKA_HAVE_CUDA)
# include <cublas_v2.h>
# include <cuda_runtime.h>
using stream_t = cudaStream_t;
using blas_handle_t = cublasHandle_t;
#elif defined(PIKA_HAVE_HIP)
# include <hip/hip_runtime.h>
# include <hipblas.h>
using stream_t = hipStream_t;
using blas_handle_t = hipblasHandle_t;
#endif

__global__ void kernel() { printf("Hello from kernel! threadIdx.x: %d\n", threadIdx.x); }

int main(int argc, char* argv[])
{
    namespace cu = pika::cuda::experimental;
    namespace ex = pika::execution::experimental;
    namespace tt = pika::this_thread::experimental;

    pika::start(argc, argv);
    ex::thread_pool_scheduler cpu_sched{};

    // Create a pool of CUDA streams and cuBLAS/cuSOLVER handles
    cu::cuda_pool pool{};
    // Then create a scheduler
    cu::cuda_scheduler cuda_sched{pool};

    {
        // There is no error checking of CUDA/HIP calls below!
        constexpr std::size_t n = 2048;
        double* a = nullptr;
        double* b = nullptr;
        double* c = nullptr;
        cudaMalloc(&a, sizeof(double) * n * n);
        cudaMalloc(&b, sizeof(double) * n * n);
        cudaMalloc(&c, sizeof(double) * n * n);

        // TODO: enable polling
        auto s = ex::just(42) | ex::continues_on(cuda_sched) |

            // CUDA kernel through a lambda
            ex::then([](int x) { printf("Hello from the GPU! x: %d\n", x); }) |

            // Explicitly launch a CUDA kernel with a stream
            // See https://github.com/eth-cscs/whip
            cu::then_with_stream([](stream_t stream) { kernel<<<32, 1, 0, stream>>>(); }) |

            // Explicitly run a continuation on the host, keeping the CUDA scheduler active.
            // This will run after the kernel above has finished, but in an unspecified
            // execution context.
            cu::then_on_host([] { fmt::print("Hello from the CPU!\n"); }) |

            // Launch a cuBLAS/cuSOLVER kernel; blas_gemm is a placeholder for a real call
            // such as cublasDgemm
            cu::then_with_cublas([&](blas_handle_t handle) { blas_gemm(handle, a, b, c); });

        tt::sync_wait(std::move(s));

        // We know that all work has completed so we can safely free the memory
        cudaFree(a);
        cudaFree(b);
        cudaFree(c);
    }

    pika::finalize();
    pika::stop();

    return 0;
}
52 changes: 52 additions & 0 deletions examples/documentation/then_with_stream_documentation.cu
@@ -0,0 +1,52 @@
// Copyright (c) 2024 ETH Zurich
//
// SPDX-License-Identifier: BSL-1.0
// Distributed under the Boost Software License, Version 1.0. (See accompanying
// file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)

#include <pika/cuda.hpp>
#include <pika/execution.hpp>
#include <pika/init.hpp>

#include <fmt/printf.h>

#include <iostream>
#include <utility>

// TODO: Just use whip?
#if defined(PIKA_HAVE_CUDA)
# include <cuda_runtime.h>
using stream_t = cudaStream_t;
#elif defined(PIKA_HAVE_HIP)
# include <hip/hip_runtime.h>
using stream_t = hipStream_t;
#endif

__global__ void kernel(int x) { printf("Hello from kernel! x: %d, threadIdx.x: %d\n", x, threadIdx.x); }

int main(int argc, char* argv[])
{
    namespace cu = pika::cuda::experimental;
    namespace ex = pika::execution::experimental;
    namespace tt = pika::this_thread::experimental;

    pika::start(argc, argv);
    ex::thread_pool_scheduler cpu_sched{};
    cu::cuda_pool pool{};
    cu::cuda_scheduler cuda_sched{pool};

    {
        // There is no error checking of CUDA/HIP calls below!
        // TODO: enable polling
        // The value sent by just is passed through to the callable, followed by the stream
        auto s = ex::just(42) | ex::continues_on(cuda_sched) |
            cu::then_with_stream([](int x, stream_t stream) { kernel<<<32, 1, 0, stream>>>(x); });

        tt::sync_wait(std::move(s));
    }

    pika::finalize();
    pika::stop();

    return 0;
}
