While reviewing the use of alpaka buffers in the CMS code, we have seen a recurring pattern that relies on the (undocumented) behaviour of the underlying back-ends.
Consider this example:
using Host = alpaka::DevCpu;
using HostPlatform = alpaka::PlatformCpu;
auto host = alpaka::getDevByIdx(HostPlatform{}, 0);
using Platform = ...;
using Device = alpaka::Dev<Platform>;
using Queue = alpaka::Queue<Device, alpaka::NonBlocking>;
Platform platform{};
auto device = alpaka::getDevByIdx(platform, 0);
Queue queue(device);
auto dbuf = alpaka::allocBuf<Elem, Idx>(device, extent);
{
#if use_1_a
  // 1.a allocate the host buffer in pageable system memory
  auto hbuf = alpaka::allocBuf<Elem, Idx>(host, extent);
#elif use_1_b
  // 1.b allocate the host buffer in pinned system memory
  auto hbuf = alpaka::allocMappedBuf<Elem, Idx>(host, platform, extent);
#else
  // 1.c allocate the host buffer in async or cached system memory (CMS only)
  auto hbuf = alpaka::allocCachedBuf<Elem, Idx>(host, queue, extent);
#endif
  // 2. fill the host buffer hbuf
  std::memset(hbuf.data(), ...);
  // 3. copy the content of the host buffer to the device buffer
  alpaka::memcpy(queue, dbuf, hbuf);
  // 4. the host buffer goes out of scope before the asynchronous copy is guaranteed to complete
}
In principle we can observe different behaviours, depending on how the buffer was allocated in step 1 and on which device back-end is being used:
- if the buffer was allocated with allocBuf (1.a), in principle there is no synchronisation, and the memory may be freed and reused before the copy completes (or even starts);
- if the buffer was allocated with allocMappedBuf (1.b), in principle there is no synchronisation; however, for some back-ends the call in the destructor of the buffer (e.g. the call to cudaFreeHost) is likely to block and synchronise with all back-end (e.g. CUDA) activity, making the copy safe;
- if the buffer was allocated with allocCachedBuf (1.c), the buffer is guaranteed to be valid until all operations in queue have completed (assuming the buffer and the copy use the same queue).
Note: allocCachedBuf(host, queue, extent) is a CMS implementation similar to allocAsyncBuf(queue, extent). I'm working to improve its performance and eventually upstream it to alpaka :-)
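For cases 1.a and 1.b the simplest workaround available today is an explicit synchronisation before the host buffer goes out of scope; a minimal sketch, reusing the host, queue, dbuf, Elem, Idx and extent from the example above:
{
  // 1.a allocate the host buffer in pageable system memory
  auto hbuf = alpaka::allocBuf<Elem, Idx>(host, extent);
  // 2. fill the host buffer hbuf
  std::memset(hbuf.data(), ...);
  // 3. copy the content of the host buffer to the device buffer
  alpaka::memcpy(queue, dbuf, hbuf);
  // 4. wait for all work submitted to the queue (including the copy) to complete,
  //    so that hbuf can safely go out of scope and its memory be freed or reused
  alpaka::wait(queue);
}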
Given that even such a simple example is error-prone, we have been wondering how we could improve the situation.
A couple of ideas:
- add some free functions or methods to the Buffer classes to check whether the destructor of a buffer is blocking, async-safe, or neither; this would allow the user code to check the guaranteed behaviour and react accordingly, e.g. by adding an explicit synchronisation (see the sketch after this list);
- implement some way to make the destructor of an existing buffer async-safe, waiting for a queue or an event before freeing the memory.
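A purely hypothetical sketch of what these two ideas could look like from the user's side (isBufDestructorBlocking, isBufDestructorAsyncSafe and makeDestructorWaitFor are invented names, not existing alpaka functions):
// idea 1: query the guaranteed behaviour of the buffer's destructor (hypothetical functions)
if (not alpaka::isBufDestructorAsyncSafe(hbuf) and not alpaka::isBufDestructorBlocking(hbuf)) {
  // the destructor gives no guarantee: add an explicit synchronisation before hbuf is destroyed
  alpaka::wait(queue);
}

// idea 2: make the destructor of an existing buffer async-safe (hypothetical function),
// deferring the deallocation until all work currently in the queue has completed
alpaka::makeDestructorWaitFor(hbuf, queue);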
Q1 - There are functions alpaka::allocAsyncBuf and alpaka::allocAsyncBufIfSupported which use queues. Are we assuming that the user deliberately selected allocBuf for a reason?
Q2 - Can't we create something like alpaka::allocAsyncMappedBuf similar to allocAsyncBuf?
Q1 - There are functions alpaka::allocAsyncBuf and alpaka::allocAsyncBufIfSupported which use queues. Are we assuming that the user deliberately selected allocBuf for a reason?
Yes.
As a library, alpaka shouldn't be restricting what users are supposed to do, though of course it can restrict what is or isn't supported.
But it would be nice to be able to catch at compile time (better) or at run time what is and isn't supported.
Q2 - Can't we create something like alpaka::allocAsyncMappedBuf similar to allocAsyncBuf?
There is no native support for this in CUDA, ROCm, etc.
We have implemented it in the CMS code; it is what I am referring to as allocCachedBuf.
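For comparison, the two calls side by side; a single queue identifies a single device, so allocAsyncBuf alone cannot express the "host buffer plus device queue" case that allocCachedBuf (the CMS extension, not part of alpaka) addresses:
// alpaka: queue-ordered allocation on the device that owns the queue
auto dev_buf = alpaka::allocAsyncBuf<Elem, Idx>(queue, extent);
// CMS extension: host allocation whose lifetime is tied to a device queue,
// guaranteed valid until all operations already submitted to that queue have completed
auto host_buf = alpaka::allocCachedBuf<Elem, Idx>(host, queue, extent);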