Skip to content

Commit

Permalink
Expand and add more CUDA/HIP documentation
Browse files Browse the repository at this point in the history
Document cuda_pool, cuda_scheduler, cuda_stream, cublas_handle, cusolver_handle, as well as expose
these with CUDA sender adaptors in the documentation. Adds a high-level example of using CUDA functionality.
  • Loading branch information
msimberg committed Nov 1, 2024
1 parent bfa0be5 commit 9f0b8bc
Show file tree
Hide file tree
Showing 10 changed files with 320 additions and 74 deletions.
6 changes: 4 additions & 2 deletions docs/Doxyfile
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,9 @@ GENERATE_RTF = NO
GENERATE_HTML = NO
QUIET = NO
OUTPUT_DIRECTORY = "$(PIKA_DOCS_DOXYGEN_OUTPUT_DIRECTORY)"
INPUT = "$(PIKA_DOCS_DOXYGEN_INPUT_ROOT)/libs/pika/init_runtime" \
INPUT = "$(PIKA_DOCS_DOXYGEN_INPUT_ROOT)/libs/pika/async_cuda" \
"$(PIKA_DOCS_DOXYGEN_INPUT_ROOT)/libs/pika/async_cuda_base" \
"$(PIKA_DOCS_DOXYGEN_INPUT_ROOT)/libs/pika/init_runtime" \
"$(PIKA_DOCS_DOXYGEN_INPUT_ROOT)/libs/pika/runtime" \
"$(PIKA_DOCS_DOXYGEN_INPUT_ROOT)/libs/pika/execution"
FILE_PATTERNS = *.cpp *.hpp *.cu
Expand All @@ -18,4 +20,4 @@ EXTRACT_ALL = YES
ENABLE_PREPPROCESSING = YES
MACRO_EXPANSION = YES
EXPAND_ONLY_PREDEF = YES
PREDEFINED = PIKA_EXPORT=
PREDEFINED = PIKA_EXPORT= PIKA_NVCC_PRAGMA_HD_WARNING_DISABLE=
98 changes: 97 additions & 1 deletion docs/api.rst
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,8 @@
Distributed under the Boost Software License, Version 1.0. (See accompanying
file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)

:tocdepth: 3

.. _api:

=============
Expand Down Expand Up @@ -32,7 +34,6 @@ These headers are part of the public API, but are currently undocumented.
- ``pika/async_rw_mutex.hpp``
- ``pika/barrier.hpp``
- ``pika/condition_variable.hpp``
- ``pika/cuda.hpp``
- ``pika/execution.hpp``
- ``pika/latch.hpp``
- ``pika/mpi.hpp``
Expand Down Expand Up @@ -104,3 +105,98 @@ The ``pika/execution.hpp`` header provides functionality related to ``std::execu
.. literalinclude:: ../examples/documentation/when_all_vector_documentation.cpp
:language: c++
:start-at: #include

.. _header_pika_cuda:

``pika/cuda.hpp``
=================

The ``pika/cuda.hpp`` header provides functionality related to CUDA and HIP. All functionality is
under the ``pika::cuda::experimental`` namespace and class and function names contain ``cuda``, even
when HIP support is enabled. CUDA and HIP functionality can be enabled with the CMake options
``PIKA_WITH_CUDA`` and ``PIKA_WITH_HIP``, respectively. In the following, whenever CUDA is
mentioned, it refers to to CUDA and HIP interchangeably.

.. note::
https://github.com/pika-org/pika/issues/116 tracks a potential renaming of the functionality
to avoid using ``cuda`` even when HIP is enabled. If you have feedback on a rename or just want
to follow along, please see that issue.

.. warning::
At the moment, ``nvcc`` can not compile stdexec headers. Of the CUDA compilers, only ``nvc++`` is
able to compile stdexec headers. If you have stdexec support enabled in pika, either ensure that
``.cu`` files do not include stdexec headers, or use ``nvc++`` to compile your application.
However, ``nvc++`` does not officially support compiling device code. Use at your own risk.

For HIP there are no known restrictions.

The CUDA support in pika relies on four major components:

1. A pool of CUDA streams as well as cuBLAS and cuSOLVER handles. These streams and handles are used
in a round-robin fashion by various sender adaptors.
2. A CUDA scheduler, in the ``std::execution`` sense. This uses the CUDA pool to schedule work on a
GPU.
3. Sender adaptors. A few special-purpose sender adaptors, as well as customizations of a few
``std::execution`` adaptors are provided to help schedule different types of work on a GPU.
4. Polling of CUDA events integrated into the pika scheduling loop. This integration is essential to
avoid calling e.g. ``cudaStreamSynchronize`` on a pika task, which would block the underlying
worker thread and thus block progress of other work.

The following example gives an overview of using the above CUDA functionalities in pika:

.. literalinclude:: ../examples/documentation/cuda_overview_documentation.cu
:language: c++
:start-at: #include

TODO: Note that while cuda_pool gives direct access to streams and handles, the intended usage is to
access them via the scheduler and sender adaptors.

.. doxygenclass:: pika::cuda::experimental::cuda_pool
.. doxygenclass:: pika::cuda::experimental::cuda_stream
.. doxygenclass:: pika::cuda::experimental::cublas_handle
.. doxygenclass:: pika::cuda::experimental::locked_cublas_handle
.. doxygenclass:: pika::cuda::experimental::cusolver_handle
.. doxygenclass:: pika::cuda::experimental::locked_cusolver_handle

..
.. literalinclude:: ../examples/documentation/cuda_pool_documentation.cpp
:language: c++
:start-at: #include
.. doxygenclass:: pika::cuda::experimental::cuda_scheduler

..
.. literalinclude:: ../examples/documentation/cuda_scheduler_documentation.cpp
:language: c++
:start-at: #include
.. doxygenvariable:: pika::cuda::experimental::then_with_stream

.. literalinclude:: ../examples/documentation/then_with_stream_documentation.cu
:language: c++
:start-at: #include


.. doxygenvariable:: pika::cuda::experimental::then_with_cublas

..
.. literalinclude:: ../examples/documentation/then_with_cublas_documentation.cpp
:language: c++
:start-at: #include
.. doxygenvariable:: pika::cuda::experimental::then_with_cusolver

..
.. literalinclude:: ../examples/documentation/then_with_cusolver_documentation.cpp
:language: c++
:start-at: #include
.. note::
TODO: Is this useful? Just transfer to host explicitly?

.. doxygenvariable:: pika::cuda::experimental::then_on_host

..
.. literalinclude:: ../examples/documentation/then_on_host_documentation.cpp
:language: c++
:start-at: #include
2 changes: 2 additions & 0 deletions examples/documentation/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -5,11 +5,13 @@
# file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)

set(example_programs
# cuda_overview_documentation # TODO
drop_operation_state_documentation
drop_value_documentation
hello_world_documentation
init_hpp_documentation
split_tuple_documentation
# then_with_stream_documentation # TODO
unpack_documentation
when_all_vector_documentation
)
Expand Down
70 changes: 65 additions & 5 deletions libs/pika/async_cuda/include/pika/async_cuda/cuda_pool.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,10 @@
#include <vector>

namespace pika::cuda::experimental {
/// \brief A locked cuBLAS handle.
///
/// A handle that provides thread-safe access to a \ref cublas_handle. The locked handle is
/// immovable.
class locked_cublas_handle
{
cublas_handle& handle;
Expand All @@ -36,9 +40,17 @@ namespace pika::cuda::experimental {
locked_cublas_handle& operator=(locked_cublas_handle&&) = delete;
locked_cublas_handle& operator=(locked_cublas_handle const&) = delete;

/// \brief Access the underlying cuBLAS handle.
///
/// \return a reference to the \ref cublas_handle. The returned handle is not thread-safe
/// and must be used within the lifetime of the \ref locked_cublas_handle.
PIKA_EXPORT cublas_handle const& get() noexcept;
};

/// \brief A locked cuSOLVER handle.
///
/// A handle that provides thread-safe access to a \ref cusolver_handle. The locked handle is
/// immovable.
class locked_cusolver_handle
{
cusolver_handle& handle;
Expand All @@ -52,15 +64,24 @@ namespace pika::cuda::experimental {
locked_cusolver_handle& operator=(locked_cusolver_handle&&) = delete;
locked_cusolver_handle& operator=(locked_cusolver_handle const&) = delete;

/// \brief Access the underlying cuSOLVER handle.
///
/// \return a reference to the \ref cusolver_handle. The returned handle is not thread-safe
/// and must be used within the lifetime of the \ref locked_cusolver_handle.
PIKA_EXPORT cusolver_handle const& get() noexcept;
};

/// A pool of CUDA streams, used for scheduling work on a CUDA device.
/// \brief A pool of CUDA streams, used for scheduling work on a CUDA device.
///
/// The pool initializes a set of CUDA (thread-local) streams on construction and provides
/// access to the streams in a round-robin fashion. The pool also gives access to cuBLAS and
/// cuSOLVER handles.
///
/// The pool initializes a set of CUDA (thread-local) streams on
/// construction and provides access to the streams in a round-robin
/// fashion. The pool is movable and copyable with reference semantics.
/// Copies of a pool still refer to the original pool of streams.
/// The pool is movable and copyable with reference semantics. Copies of a pool still refer to
/// the original pool of streams. A moved-from pool can't be used, except to check if it is
/// valid with \ref valid().
///
/// The pool is equality comparable and formattable.
class cuda_pool
{
private:
Expand Down Expand Up @@ -135,25 +156,64 @@ namespace pika::cuda::experimental {
std::shared_ptr<pool_data> data;

public:
/// \brief Construct a pool of CUDA streams and handles.
///
/// \param device the CUDA device used for scheduling work
/// \param num_normal_priority_streams_per_thread the number of normal priority streams per
/// thread
/// \param num_high_priority_streams_per_thread the number of high priority streams per
/// thread
/// \param flags flags used to construct CUDA streams
/// \param num_cublas_handles the number of cuBLAS handles to create for the whole pool
/// \param num_cusolver_handles the number of cuSOLVER handles to create for the whole pool
PIKA_EXPORT explicit cuda_pool(int device = 0,
std::size_t num_normal_priority_streams_per_thread = 3,
std::size_t num_high_priority_streams_per_thread = 3, unsigned int flags = 0,
std::size_t num_cublas_handles = 16, std::size_t num_cusolver_handles = 16);

/// \brief Move constructor.
PIKA_NVCC_PRAGMA_HD_WARNING_DISABLE
cuda_pool(cuda_pool&&) = default;
/// \brief Copy constructor.
PIKA_NVCC_PRAGMA_HD_WARNING_DISABLE
cuda_pool(cuda_pool const&) = default;
/// \brief Move assignment operator.
PIKA_NVCC_PRAGMA_HD_WARNING_DISABLE
cuda_pool& operator=(cuda_pool&&) = default;
/// \brief Copy assignment operator.
PIKA_NVCC_PRAGMA_HD_WARNING_DISABLE
cuda_pool& operator=(cuda_pool const&) = default;

/// \brief Check if the pool is valid.
///
/// \return true if the pool refers to a valid pool, false otherwise (e.g. if the pool has
/// been moved out from)
PIKA_EXPORT bool valid() const noexcept;

/// \brief Check if the pool is valid.
///
/// See \ref valid().
PIKA_EXPORT explicit operator bool() noexcept;

/// \brief Get a reference to the next CUDA stream.
///
/// \param priority the priority of the stream
PIKA_EXPORT cuda_stream const& get_next_stream(
pika::execution::thread_priority priority = pika::execution::thread_priority::normal);

/// \brief Get a locked cuBLAS handle.
///
/// \param stream the CUDA stream to use with the cuBLAS handle
///
/// \return a locked cuBLAS handle, which is released for reuse on destruction
PIKA_EXPORT locked_cublas_handle get_cublas_handle(
cuda_stream const& stream, cublasPointerMode_t pointer_mode);

/// \brief Get a locked cuSOLVER handle.
///
/// \param stream the CUDA stream to use with the cuSOLVER handle
///
/// \return a locked cuSOLVER handle, which is released for reuse on destruction
PIKA_EXPORT locked_cusolver_handle get_cusolver_handle(cuda_stream const& stream);

/// \cond NOINTERNAL
Expand Down
26 changes: 23 additions & 3 deletions libs/pika/async_cuda/include/pika/async_cuda/cuda_scheduler.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -21,27 +21,46 @@
namespace pika::cuda::experimental {
/// A scheduler for running work on a CUDA pool.
///
/// Provides access to scheduling work on a CUDA context represented by a
/// cuda_pool.
/// Provides access to scheduling work on a CUDA context represented by a \ref cuda_pool. Models
/// the `std::execution` scheduler concept.
///
/// Move and copy constructible. The scheduler has reference semantics with respect to the
/// associated CUDA pool.
///
/// Equality comparable.
class cuda_scheduler
{
private:
cuda_pool pool;
pika::execution::thread_priority priority;

public:
/// \brief TODO
PIKA_EXPORT
cuda_scheduler(cuda_pool pool);
cuda_scheduler(cuda_scheduler&&) = default;
cuda_scheduler(cuda_scheduler const&) = default;
cuda_scheduler& operator=(cuda_scheduler&&) = default;
cuda_scheduler& operator=(cuda_scheduler const&) = default;
~cuda_scheduler(){};
~cuda_scheduler() = default; // TODO: Default?

/// \brief Return the \ref cuda_pool associated with this scheduler.
PIKA_EXPORT cuda_pool const& get_pool() const noexcept;

/// \brief Return the next available CUDA stream from the pool.
///
/// Equivalent to accessing the stream via \ref get_pool().
PIKA_EXPORT cuda_stream const& get_next_stream();

/// \brief Return the next available cuBLAS handle from the pool.
///
/// Equivalent to accessing the cuBLAS handle via \ref get_pool().
PIKA_EXPORT locked_cublas_handle get_cublas_handle(
cuda_stream const& stream, cublasPointerMode_t pointer_mode);

/// \brief Return the next available cuSOLVER handle from the pool.
///
/// Equivalent to accessing the cuSOLVER handle via \ref get_pool().
PIKA_EXPORT locked_cusolver_handle get_cusolver_handle(cuda_stream const& stream);

/// \cond NOINTERNAL
Expand All @@ -56,6 +75,7 @@ namespace pika::cuda::experimental {
}
/// \endcond

/// TODO
friend cuda_scheduler tag_invoke(pika::execution::experimental::with_priority_t,
cuda_scheduler const& scheduler, pika::execution::thread_priority priority)
{
Expand Down
25 changes: 13 additions & 12 deletions libs/pika/async_cuda/include/pika/async_cuda/then_on_host.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -212,17 +212,7 @@ namespace pika::cuda::experimental {
};
} // namespace then_on_host_detail

// NOTE: This is not a customization of pika::execution::experimental::then.
// It retains the cuda_scheduler execution context from the predecessor
// sender, but does not run the continuation on a CUDA device. Instead, it
// runs the continuation in the polling thread used by the cuda_scheduler on
// the CPU. The continuation is run only after synchronizing all previous
// events scheduled on the cuda_scheduler. Blocking in the callable given to
// then_on_host blocks other work scheduled on cuda_scheduler from
// completing. Heavier work should be transferred to a host scheduler as
// soon as possible.
inline constexpr struct then_on_host_t final
: pika::functional::detail::tag_fallback<then_on_host_t>
struct then_on_host_t final : pika::functional::detail::tag_fallback<then_on_host_t>
{
private:
template <typename Sender, typename F>
Expand All @@ -245,5 +235,16 @@ namespace pika::cuda::experimental {
return pika::execution::experimental::detail::partial_algorithm<then_on_host_t, F>{
PIKA_FORWARD(F, f)};
}
} then_on_host{};
};

/// NOTE: This is not a customization of pika::execution::experimental::then.
/// It retains the cuda_scheduler execution context from the predecessor
/// sender, but does not run the continuation on a CUDA device. Instead, it
/// runs the continuation in the polling thread used by the cuda_scheduler on
/// the CPU. The continuation is run only after synchronizing all previous
/// events scheduled on the cuda_scheduler. Blocking in the callable given to
/// then_on_host blocks other work scheduled on cuda_scheduler from
/// completing. Heavier work should be transferred to a host scheduler as
/// soon as possible.
inline constexpr then_on_host_t then_on_host{};
} // namespace pika::cuda::experimental
Loading

0 comments on commit 9f0b8bc

Please sign in to comment.