
[JNI] Enables fabric handles for CUDA async memory pools #17526

Merged: 6 commits into rapidsai:branch-25.02 on Dec 9, 2024

Conversation

@abellina (Contributor) commented Dec 5, 2024

Closes #17525
Depends on rapidsai/rmm#1743

Description

This PR adds a CUDA_ASYNC_FABRIC allocation mode to RmmAllocationMode and pipes the corresponding options into RMM's cuda_async_memory_resource: fabric as the handle type, and read_write as the memory protection mode (the only mode supported by the pools, and the one required for IPC).

If CUDA_ASYNC is used, fabric handles are not requested, and the memory protection is none.
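
For illustration, here is a minimal, self-contained sketch of constructing such a pool with these options. It assumes the cuda_async_memory_resource constructor extended in rapidsai/rmm#1743 accepts the export handle type and an access flag as trailing arguments; the exact signature and parameter order in RMM may differ.

```cpp
#include <cstddef>
#include <memory>

#include <rmm/mr/device/cuda_async_memory_resource.hpp>

// Sketch only: build an async pool whose allocations can be exported via
// fabric handles, with read_write protection (required for IPC). The trailing
// constructor arguments are assumed from rapidsai/rmm#1743 and may differ.
std::unique_ptr<rmm::mr::cuda_async_memory_resource> make_fabric_pool(
  std::size_t initial_pool_size, std::size_t release_threshold)
{
  using mr_t = rmm::mr::cuda_async_memory_resource;
  return std::make_unique<mr_t>(initial_pool_size,
                                release_threshold,
                                mr_t::allocation_handle_type::fabric,
                                mr_t::access_flags::read_write);
}
```

With CUDA_ASYNC, the same constructor would be called without requesting an export handle type, so no fabric handles are created.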

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@abellina requested a review from a team as a code owner December 5, 2024 15:33
copy-pr-bot (bot) commented Dec 5, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


github-actions bot added the "Java" (Affects Java cuDF API) label Dec 5, 2024
@abellina added the "5 - DO NOT MERGE" (Hold off on merging; see PR for details), "non-breaking" (Non-breaking change), "Performance" (Performance related issue), and "improvement" (Improvement / enhancement to an existing function) labels Dec 5, 2024
@abellina (Contributor, Author) commented Dec 5, 2024

For extra tests: fabric is not supported by all versions of CUDA that RAPIDS supports. I could add tests around this, checking for compatibility beforehand, but it would require more code changes.
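
For reference, a compatibility probe along those lines might look like the sketch below. This is not code from the PR; it assumes the CUDA runtime attribute cudaDevAttrMemoryPoolSupportedHandleTypes and the cudaMemHandleTypeFabric enumerator (CUDA 12.3+). On older toolkits the check simply reports no fabric support.

```cpp
#include <cuda_runtime.h>

// Sketch: report whether `device` can export fabric handles from memory pools,
// so tests could skip CUDA_ASYNC_FABRIC on unsupported CUDA versions/GPUs.
bool device_supports_fabric(int device)
{
#if CUDART_VERSION >= 12030  // cudaMemHandleTypeFabric first appears in CUDA 12.3
  int handle_types = 0;
  if (cudaDeviceGetAttribute(
        &handle_types, cudaDevAttrMemoryPoolSupportedHandleTypes, device) != cudaSuccess) {
    return false;
  }
  return (handle_types & cudaMemHandleTypeFabric) != 0;
#else
  (void)device;
  return false;  // toolkit too old to request fabric handles at all
#endif
}
```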

@jlowe (Member) left a comment


copyrights otherwise lgtm

Inline review comment (Member): 2024 copyrights

A second inline review comment (Member): 2024 copyrights

Comment on lines 786 to 792
auto [handle_type, prot_flag] = !fabric ?
std::pair{
rmm::mr::cuda_async_memory_resource::allocation_handle_type::none,
rmm::mr::cuda_async_memory_resource::access_flags::none} :
std::pair{
rmm::mr::cuda_async_memory_resource::allocation_handle_type::fabric,
rmm::mr::cuda_async_memory_resource::access_flags::read_write};

Nit: It's a little easier to read without the negation

Suggested change:
auto [handle_type, prot_flag] = fabric ?
  std::pair{
    rmm::mr::cuda_async_memory_resource::allocation_handle_type::fabric,
    rmm::mr::cuda_async_memory_resource::access_flags::read_write} :
  std::pair{
    rmm::mr::cuda_async_memory_resource::allocation_handle_type::none,
    rmm::mr::cuda_async_memory_resource::access_flags::none};

Signed-off-by: Alessandro Bellina <[email protected]>
@abellina (Contributor, Author) commented Dec 8, 2024

/ok to test

@abellina removed the "5 - DO NOT MERGE" (Hold off on merging; see PR for details) label Dec 8, 2024
Signed-off-by: Alessandro Bellina <[email protected]>
@abellina (Contributor, Author) commented Dec 8, 2024

/ok to test

@abellina (Contributor, Author) commented Dec 9, 2024

/ok to test

@abellina (Contributor, Author) commented Dec 9, 2024

I saw a consistent failure in existing async tests, around "invalid device ordinal". It happened at the places where the async allocator is instantiated. I have changed the code so that, if fabric is not selected, it follows the exact path it used to follow (8d27a92). I'll try to repro this, but my guess is that the memory access protection APIs don't work on older GPUs (this was a V100 in CI).

@abellina (Contributor, Author) commented Dec 9, 2024

It turns out that passing none as the access flag does not work. I filed a follow-on issue in RMM: rapidsai/rmm#1753. For the purposes of this PR, I will pass nullopt, skipping the call to cudaMemPoolSetAccess in all cases except when we use fabric, where I'll call it with read+write permissions. This is the intended behavior.
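
As a minimal sketch of that intent (assuming the optional export-handle and access-flag parameters added in rapidsai/rmm#1743; names and optional types are illustrative, not the PR's exact code):

```cpp
#include <cstddef>
#include <memory>
#include <optional>

#include <rmm/mr/device/cuda_async_memory_resource.hpp>

// Sketch: only the fabric path requests an export handle type and an access
// flag; the non-fabric path passes nullopt for both, so RMM never calls
// cudaMemPoolSetAccess and behaves exactly as it did before this PR.
std::unique_ptr<rmm::mr::cuda_async_memory_resource> make_async_pool(
  bool fabric, std::size_t initial_pool_size, std::size_t release_threshold)
{
  using mr_t = rmm::mr::cuda_async_memory_resource;
  std::optional<mr_t::allocation_handle_type> handle_type{};
  std::optional<mr_t::access_flags> access{};
  if (fabric) {
    handle_type = mr_t::allocation_handle_type::fabric;
    access      = mr_t::access_flags::read_write;
  }
  return std::make_unique<mr_t>(initial_pool_size, release_threshold, handle_type, access);
}
```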

@abellina (Contributor, Author) commented Dec 9, 2024

@jlowe fyi

@abellina (Contributor, Author) commented Dec 9, 2024

/merge

rapids-bot (bot) merged commit a79077c into rapidsai:branch-25.02 Dec 9, 2024
88 of 89 checks passed
@abellina deleted the add_fabric_handles branch December 9, 2024 19:06
rapids-bot bot pushed a commit that referenced this pull request Dec 9, 2024
This is a follow-up to #17526, where fabric handles can be enabled from RMM. That PR also sets the memory access protection flag (`cudaMemPoolSetAccess`), but I have learned that this second flag is not needed on the owner device. In fact, it causes confusion because the owning device fails when calling this function with some of the flags (access none). `cudaMemPoolSetAccess` is meant to be called only from peer processes that have imported the pool's handle. In our case, UCX handles this on the peer's side, and it does not need to be anywhere in RMM or cuDF.

Sorry for the noise. I'd like to get this fix in, and then I am going to fix RMM by removing that API.

Authors:
  - Alessandro Bellina (https://github.com/abellina)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Nghia Truong (https://github.com/ttnghia)
  - Jason Lowe (https://github.com/jlowe)

URL: #17553
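
For context only, here is a sketch of the peer-side flow the commit message above describes; none of this lives in RMM or cuDF (UCX performs it in this setup). It assumes CUDA 12.3+ for fabric handle support, and the function and parameter names are illustrative.

```cpp
#include <cuda_runtime.h>

// Sketch: the importing peer, not the pool's owner, imports the shareable
// fabric handle and then calls cudaMemPoolSetAccess for its own device.
cudaMemPool_t import_fabric_pool(void* shareable_handle, int local_device)
{
  cudaMemPool_t pool{};
  if (cudaMemPoolImportFromShareableHandle(
        &pool, shareable_handle, cudaMemHandleTypeFabric, /*flags=*/0) != cudaSuccess) {
    return nullptr;
  }

  // Grant this device read/write access to allocations from the imported pool.
  cudaMemAccessDesc desc{};
  desc.location.type = cudaMemLocationTypeDevice;
  desc.location.id   = local_device;
  desc.flags         = cudaMemAccessFlagsProtReadWrite;
  cudaMemPoolSetAccess(pool, &desc, 1);
  return pool;
}
```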
Labels: improvement (Improvement / enhancement to an existing function), Java (Affects Java cuDF API), non-breaking (Non-breaking change), Performance (Performance related issue)
Development: Successfully merging this pull request may close: [FEA][JNI] Add fabric handle support for CUDA async pools in cuDF JNI
3 participants