[BUG] cuFile driver closing causes segfault upon program termination #17121
Comments
We've tackled similar issues in the past by leaking the resource and allowing the OS to reclaim the objects when the process terminates (see e.g. rapidsai/rmm#1375). That seems to be what you're suggesting by removing the explicit close, but it doesn't resolve the other issue if cuFile is also executing past main due to the static duration of the object created by cuFile.
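The "leak the resource" approach mentioned above can be sketched as follows. This is a minimal illustration and not the actual rmm or cuDF code: `driver_handle`, `leaked_driver`, and the two counters are hypothetical names introduced only for this example.

```cpp
// Counters used only to observe construction/destruction in this sketch.
int g_open_count  = 0;
int g_close_count = 0;

// Hypothetical stand-in for a resource whose teardown is unsafe once main()
// has returned (for example, because it would call into a runtime that may
// already be shut down).
struct driver_handle {
  driver_handle() { ++g_open_count; }
  ~driver_handle() { ++g_close_count; }  // never runs for the leaked instance
};

// Leaky singleton: the object is heap-allocated on first use and never
// deleted, so no destructor runs during static destruction. The OS reclaims
// the memory when the process exits.
driver_handle& leaked_driver()
{
  static driver_handle* instance = new driver_handle{};
  return *instance;
}
```

Every call to `leaked_driver()` returns the same instance, and its destructor is never invoked. As the comment above notes, this only prevents the explicit teardown; it does nothing about cleanup that the wrapped library performs on its own at exit.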
I filed an internal bug, and the cuFile issue will be addressed in a future CUDA release 😀
This PR makes small improvements to the I/O code. Specifically,
- Place a type constraint on a template class so that it accepts only rvalue arguments. In addition, replace `std::move` with `std::forward` to make the code more *apparently* consistent with the convention, i.e. use `std::move()` on rvalue references and `std::forward` on forwarding references (Effective Modern C++, Item 25).
- Alleviate (but not completely resolve) an existing cuFile driver close issue by removing the explicit driver close call. See #17121
- Minor typo fix (`struct` → `class`).

Authors:
- Tianyu Liu (https://github.com/kingcrimsontianyu)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #17105
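The first bullet of the PR description can be illustrated with a small sketch. It uses a function template rather than the class template from the PR, and `sink` and `accepts` are hypothetical names, so treat it as an illustration of the convention rather than the code that was merged.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>

// T&& is a forwarding reference, so std::forward is the conventional cast.
// The enable_if constraint rejects lvalues: for an lvalue argument T deduces
// to U&, T&& collapses to U&, and is_rvalue_reference is false.
template <typename T,
          typename std::enable_if<std::is_rvalue_reference<T&&>::value, int>::type = 0>
std::size_t sink(T&& value)
{
  std::string stored = std::forward<T>(value);  // moves, since only rvalues get here
  return stored.size();
}

// Detection helper: true when sink(arg) compiles for an argument of type T.
template <typename T, typename = void>
struct accepts : std::false_type {};

template <typename T>
struct accepts<T, decltype(void(sink(std::declval<T>())))> : std::true_type {};
```

With the constraint in place, `sink(std::string{"abc"})` compiles, while passing a named `std::string` variable without `std::move` is rejected at compile time.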
@kingcrimsontianyu I suppose this means this needs to remain open until we are on the newer cuFile?
@wence- Yes. I think so. Summary of an update from cuFile team:
cc. @madsbk who has rolled out rapidsai/kvikio#514 for KvikIO using the same workaround.
Describe the bug
cuDF accesses the cuFile API via the `cufile_shim` object, which has static storage duration, meaning its destructor is called after the main function returns. `cuFileDriverClose()` internally calls the CUDA API, resulting in UB (usually manifested as a segfault), and therefore should not be called here. However, even in the absence of `cuFileDriverClose()`, cuFile will implicitly close the driver, during which some CUDA calls are still made, likely causing a segfault. The best way to clean up the resources needs to be revisited in the future.

For the time being, a segfault cannot be completely avoided under the GDS I/O path, but at least `cuFileDriverClose()` should be removed from the destructor of `cufile_shim`.

Related issues from KvikIO:
rapidsai/kvikio#497
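The destruction-order hazard described above can be modeled without CUDA or cuFile. In the sketch below, `fake_runtime`, `fake_shim`, `fake_driver_close`, `get_shim`, and `use_runtime` are all hypothetical stand-ins, and a boolean flag plays the role of "the CUDA runtime is still usable". Real CUDA teardown is not observable through such a flag; the point is only to show how a static-duration object can outlive something it depends on.

```cpp
#include <cstdio>

// Hypothetical flag standing in for "the runtime is still usable".
bool g_runtime_alive = false;

// Stand-in for a runtime with its own static-duration state.
struct fake_runtime {
  fake_runtime() { g_runtime_alive = true; }
  ~fake_runtime() { g_runtime_alive = false; }
};

// Stand-in for a close call that needs the runtime.
// Returns false when it would have touched a runtime that is already gone.
bool fake_driver_close() { return g_runtime_alive; }

// Stand-in for a shim whose destructor performs an explicit close.
struct fake_shim {
  ~fake_shim()
  {
    if (!fake_driver_close()) { std::puts("close attempted after runtime teardown"); }
  }
};

// The shim is constructed on first use and destroyed after main() returns.
fake_shim& get_shim()
{
  static fake_shim shim;
  return shim;
}

// The runtime is initialized lazily. Function-local statics are destroyed in
// reverse order of construction, so if this is first called after get_shim(),
// the runtime is torn down before the shim's destructor runs.
void use_runtime()
{
  static fake_runtime runtime;
  (void)runtime;
}
```

If a program calls `get_shim()` and then `use_runtime()`, both work normally while `main` is running. After `main` returns, `fake_runtime` is destroyed first, and the shim's destructor then attempts its close against a runtime that no longer exists. In this model that produces a printed message; in the real case described in this issue, the equivalent call is undefined behavior.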
Steps/Code to reproduce bug
Run any program using GDS I/O.
Expected behavior
The program terminates without a segmentation fault.
Environment overview (please complete the following information)
N/A
Environment details
N/A
Additional context
N/A