
[JNI] Enables fabric handles for CUDA async memory pools #17526

Merged: 6 commits into rapidsai:branch-25.02 on Dec 9, 2024

Conversation

@abellina (Contributor) commented Dec 5, 2024

Closes #17525
Depends on rapidsai/rmm#1743

Description

This PR adds a CUDA_ASYNC_FABRIC allocation mode to RmmAllocationMode and pipes the corresponding options into RMM's cuda_async_memory_resource: fabric as the handle type, and read_write as the memory protection mode (the only mode supported by the pools, and the one required for IPC).

If CUDA_ASYNC is used, fabric handles are not requested, and the memory protection is none.
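
For illustration, here is a minimal, self-contained sketch of constructing such a pool with these options. It assumes the cuda_async_memory_resource constructor extended in rapidsai/rmm#1743 accepts the export handle type and an access flag as trailing arguments; the exact signature and parameter order in RMM may differ.

```cpp
#include <cstddef>
#include <memory>

#include <rmm/mr/device/cuda_async_memory_resource.hpp>

// Sketch only: build an async pool whose allocations can be exported via
// fabric handles, with read_write protection (required for IPC). The trailing
// constructor arguments are assumed from rapidsai/rmm#1743 and may differ.
std::unique_ptr<rmm::mr::cuda_async_memory_resource> make_fabric_pool(
  std::size_t initial_pool_size, std::size_t release_threshold)
{
  using mr_t = rmm::mr::cuda_async_memory_resource;
  return std::make_unique<mr_t>(initial_pool_size,
                                release_threshold,
                                mr_t::allocation_handle_type::fabric,
                                mr_t::access_flags::read_write);
}
```

With CUDA_ASYNC, the same constructor would be called without requesting an export handle type, so no fabric handles are created.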

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@abellina requested a review from a team as a code owner December 5, 2024 15:33
copy-pr-bot (bot) commented Dec 5, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


github-actions bot added the "Java" (Affects Java cuDF API) label Dec 5, 2024
@abellina added the "5 - DO NOT MERGE" (Hold off on merging; see PR for details), "non-breaking" (Non-breaking change), "Performance" (Performance related issue), and "improvement" (Improvement / enhancement to an existing function) labels Dec 5, 2024
@abellina (Contributor, Author) commented Dec 5, 2024

For extra tests: fabric is not supported by all versions of CUDA that RAPIDS supports. I could add tests around this, checking for compatibility beforehand, but it would require more code changes.
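
For reference, a compatibility probe along those lines might look like the sketch below. This is not code from the PR; it assumes the CUDA runtime attribute cudaDevAttrMemoryPoolSupportedHandleTypes and the cudaMemHandleTypeFabric enumerator (CUDA 12.3+). On older toolkits the check simply reports no fabric support.

```cpp
#include <cuda_runtime.h>

// Sketch: report whether `device` can export fabric handles from memory pools,
// so tests could skip CUDA_ASYNC_FABRIC on unsupported CUDA versions/GPUs.
bool device_supports_fabric(int device)
{
#if CUDART_VERSION >= 12030  // cudaMemHandleTypeFabric first appears in CUDA 12.3
  int handle_types = 0;
  if (cudaDeviceGetAttribute(
        &handle_types, cudaDevAttrMemoryPoolSupportedHandleTypes, device) != cudaSuccess) {
    return false;
  }
  return (handle_types & cudaMemHandleTypeFabric) != 0;
#else
  (void)device;
  return false;  // toolkit too old to request fabric handles at all
#endif
}
```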

@jlowe (Member) left a comment


copyrights otherwise lgtm

Inline review comment (Member): 2024 copyrights

A second inline review comment (Member): 2024 copyrights

Comment on lines 786 to 792
auto [handle_type, prot_flag] = !fabric ?
std::pair{
rmm::mr::cuda_async_memory_resource::allocation_handle_type::none,
rmm::mr::cuda_async_memory_resource::access_flags::none} :
std::pair{
rmm::mr::cuda_async_memory_resource::allocation_handle_type::fabric,
rmm::mr::cuda_async_memory_resource::access_flags::read_write};

Nit: It's a little easier to read without the negation

Suggested change:
auto [handle_type, prot_flag] = fabric ?
  std::pair{
    rmm::mr::cuda_async_memory_resource::allocation_handle_type::fabric,
    rmm::mr::cuda_async_memory_resource::access_flags::read_write} :
  std::pair{
    rmm::mr::cuda_async_memory_resource::allocation_handle_type::none,
    rmm::mr::cuda_async_memory_resource::access_flags::none};

Signed-off-by: Alessandro Bellina <[email protected]>
@abellina (Contributor, Author) commented Dec 8, 2024

/ok to test

@abellina removed the "5 - DO NOT MERGE" (Hold off on merging; see PR for details) label Dec 8, 2024
Signed-off-by: Alessandro Bellina <[email protected]>
@abellina (Contributor, Author) commented Dec 8, 2024

/ok to test

@abellina (Contributor, Author) commented Dec 9, 2024

/ok to test

@abellina (Contributor, Author) commented Dec 9, 2024

I saw a consistent failure in existing async tests, around "invalid device ordinal". It happened at the places where the async allocator is instantiated. I have changed the code so that, if fabric is not selected, it follows the exact path it used to follow (8d27a92). I'll try to repro this, but my guess is that the memory access protection APIs don't work on older GPUs (this was a V100 in CI).

@abellina (Contributor, Author) commented Dec 9, 2024

It turns out that passing none as the access flag does not work. I filed a follow-on issue in RMM: rapidsai/rmm#1753. For the purposes of this PR, I will pass nullopt, skipping the call to cudaMemPoolSetAccess in all cases except when we use fabric, where I'll call it with read+write permissions. This is the intended behavior.
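
As a minimal sketch of that intent (assuming the optional export-handle and access-flag parameters added in rapidsai/rmm#1743; names and optional types are illustrative, not the PR's exact code):

```cpp
#include <cstddef>
#include <memory>
#include <optional>

#include <rmm/mr/device/cuda_async_memory_resource.hpp>

// Sketch: only the fabric path requests an export handle type and an access
// flag; the non-fabric path passes nullopt for both, so RMM never calls
// cudaMemPoolSetAccess and behaves exactly as it did before this PR.
std::unique_ptr<rmm::mr::cuda_async_memory_resource> make_async_pool(
  bool fabric, std::size_t initial_pool_size, std::size_t release_threshold)
{
  using mr_t = rmm::mr::cuda_async_memory_resource;
  std::optional<mr_t::allocation_handle_type> handle_type{};
  std::optional<mr_t::access_flags> access{};
  if (fabric) {
    handle_type = mr_t::allocation_handle_type::fabric;
    access      = mr_t::access_flags::read_write;
  }
  return std::make_unique<mr_t>(initial_pool_size, release_threshold, handle_type, access);
}
```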

@abellina (Contributor, Author) commented Dec 9, 2024

@jlowe fyi

@abellina (Contributor, Author) commented Dec 9, 2024

/merge

rapids-bot (bot) merged commit a79077c into rapidsai:branch-25.02 Dec 9, 2024
88 of 89 checks passed
@abellina deleted the add_fabric_handles branch December 9, 2024 19:06
rapids-bot bot pushed a commit that referenced this pull request Dec 9, 2024
This is a follow-up to #17526, where fabric handles can be enabled from RMM. That PR also sets the memory access protection flag (`cudaMemPoolSetAccess`), but I have learned that this second flag is not needed on the owner device. In fact, it causes confusion because the owning device fails when calling this function with some of the flags (access none). `cudaMemPoolSetAccess` is meant to be called only from peer processes that have imported the pool's handle. In our case, UCX handles this on the peer's side, and it does not need to be anywhere in RMM or cuDF.

Sorry for the noise. I'd like to get this fix in, and then I am going to fix RMM by removing that API.

Authors:
  - Alessandro Bellina (https://github.com/abellina)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Nghia Truong (https://github.com/ttnghia)
  - Jason Lowe (https://github.com/jlowe)

URL: #17553
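
For context only, here is a sketch of the peer-side flow the commit message above describes; none of this lives in RMM or cuDF (UCX performs it in this setup). It assumes CUDA 12.3+ for fabric handle support, and the function and parameter names are illustrative.

```cpp
#include <cuda_runtime.h>

// Sketch: the importing peer, not the pool's owner, imports the shareable
// fabric handle and then calls cudaMemPoolSetAccess for its own device.
cudaMemPool_t import_fabric_pool(void* shareable_handle, int local_device)
{
  cudaMemPool_t pool{};
  if (cudaMemPoolImportFromShareableHandle(
        &pool, shareable_handle, cudaMemHandleTypeFabric, /*flags=*/0) != cudaSuccess) {
    return nullptr;
  }

  // Grant this device read/write access to allocations from the imported pool.
  cudaMemAccessDesc desc{};
  desc.location.type = cudaMemLocationTypeDevice;
  desc.location.id   = local_device;
  desc.flags         = cudaMemAccessFlagsProtReadWrite;
  cudaMemPoolSetAccess(pool, &desc, 1);
  return pool;
}
```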
Labels: improvement (Improvement / enhancement to an existing function), Java (Affects Java cuDF API), non-breaking (Non-breaking change), Performance (Performance related issue)
Development: Successfully merging this pull request may close: [FEA][JNI] Add fabric handle support for CUDA async pools in cuDF JNI
3 participants