Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] cuFile driver closing causes segfault upon program termination #17121

Open
kingcrimsontianyu opened this issue Oct 18, 2024 · 4 comments
Open
Assignees
Labels
bug Something isn't working

Comments

@kingcrimsontianyu
Copy link
Contributor

kingcrimsontianyu commented Oct 18, 2024

Describe the bug

cuDF accesses the cuFile API via the cufile_shim object that has a static storage duration, meaning its destructor is called after the main function returns. The cuFileDriverClose() internally calls CUDA API, resulting in UB (usually manifested as segfault), and therefore should not be called here. However, even in the absence of cuFileDriverClose(), cuFile will implicitly close the driver, during which process some CUDA calls are still made, likely causing segfault. The best way to clean up the resources needs to be revisited in the future.

For the time being, segfault cannot be completely avoided under the GDS I/O path, but at least cuFileDriverClose() should be removed from the destructor of cufile_shim.

Related issues from KvikIO:
rapidsai/kvikio#497

Steps/Code to reproduce bug
Run any program using GDS I/O.

Expected behavior
Free of segmentation fault.

Environment overview (please complete the following information)
N/A

Environment details
N/A

Additional context
N/A

@kingcrimsontianyu kingcrimsontianyu added the bug Something isn't working label Oct 18, 2024
@kingcrimsontianyu kingcrimsontianyu self-assigned this Oct 18, 2024
@kingcrimsontianyu kingcrimsontianyu moved this to To be revisited in libcudf Oct 18, 2024
@vyasr
Copy link
Contributor

vyasr commented Oct 18, 2024

We've tackled similar issues in the past by leaking the resource and allow the OS to reclaim the objects when the process is terminated (see e.g. rapidsai/rmm#1375). That seems like the thing you're suggesting with removing the explicit close, but it doesn't resolve the other issue if cufile is also executing past main due to the static duration of the object created by cufile.

@kingcrimsontianyu
Copy link
Contributor Author

I filed an internal bug, and the cuFile issue will be addressed in a future CUDA release 😀

rapids-bot bot pushed a commit that referenced this issue Nov 1, 2024
This PR makes small improvements for the I/O code. Specifically,
- Place type constraint on a template class to allow only for rvalue argument. In addition, replace `std::move` with `std::forward` to make the code more *apparently* consistent with the convention, i.e. use `std::move()` on the rvalue references, and `std::forward` on the forwarding references (Effective modern C++ item 25).
- Alleviate (but not completely resolve) an existing cuFile driver close issue by removing the explicit driver close call. See #17121 
- Minor typo fix (`struct` → `class`).

Authors:
  - Tianyu Liu (https://github.com/kingcrimsontianyu)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #17105
@wence-
Copy link
Contributor

wence- commented Nov 20, 2024

@kingcrimsontianyu I suppose this means this needs to remain open until we are on the newer cuFile?

@kingcrimsontianyu
Copy link
Contributor Author

@wence- Yes. I think so.

Summary of an update from cuFile team:

  • For cuFile's implicit driver close that happens at program exit, cuFile still needs to clean up the CUDA resources, such as internal stream, and cannot let them leak, even though making CUDA API calls after main returns means UB.
  • The cufile library does not have any means to know if the user has called cudaDeviceReset already (such as in our case where nvbench resets the device before main returns). They will reach out to the CUDA team to explore viable options. Ideally, during implicit driver close, cuFile should skip the cleanup step when it is aware that the user has reset the device.
  • For now, the workaround is to explicitly call cuFileDriverClose prior to calling cudaDeviceReset (before main returns).

cc. @madsbk who has rolled out rapidsai/kvikio#514 for KvikIO using the same workaround.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: To be revisited
Development

No branches or pull requests

3 participants