Dynamic Parallelism | implementation strategy #94
The generated PTX from a C++ program using dynamic parallelism will tend to include the following:

```ptx
.extern .func (.param .b64 func_retval0) cudaGetParameterBufferV2
(
    .param .b64 cudaGetParameterBufferV2_param_0,             // Function pointer.
    .param .align 4 .b8 cudaGetParameterBufferV2_param_1[12], // Grid size.
    .param .align 4 .b8 cudaGetParameterBufferV2_param_2[12], // Block size.
    .param .b32 cudaGetParameterBufferV2_param_3              // Shared mem.
);

.extern .func (.param .b32 func_retval0) cudaLaunchDeviceV2
(
    .param .b64 cudaLaunchDeviceV2_param_0, // Param buffer.
    .param .b64 cudaLaunchDeviceV2_param_1  // Stream.
);
```

This is inserted by nvcc when device-side triple-chevron syntax is used. These appear to be the updated ABIs: the V2 ABIs are much simpler, and building up the PTX for them seems to be pretty straightforward. I should have a PTX-based solution for this in PR form quite soon.

I will likely just copy the launch macro that we currently have in cust, and either add it to a shared location or copy it directly into the cuda_std module; we can decide what to do with it in the PR. To clarify: the launch macro extracts the block and grid size declarations quite nicely, which is why I want to use the macro and then feed that code into the PTX ASM.
Well, as it turns out, a great deal of the code (if not all) is already in place: https://github.com/Rust-GPU/Rust-CUDA/tree/master/crates/cuda_std/src/rt . I had originally been searching for this in the docs and was not able to find it; looking in the code, there it is. I will enable that.
Well ... once again, I find myself in need of another feature. This time, dynamic parallelism.
Looks like this is also part of the C++ runtime API, similar to cooperative groups, for which I already have a PR.
I'm considering a similar strategy for implementing this feature. I would love to just pin down the PTX, but that has proven to be a bit unclear; still, I will definitely start my search in the PTX ISA and see if there are any quick wins. If not, then I will probably take a similar approach to the one used for the cooperative groups API.
Thoughts?
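For context on what a PTX-level implementation would have to do: the device-side launch path is, roughly, (1) request a parameter buffer via `cudaGetParameterBufferV2`, (2) copy each kernel argument into that buffer, (3) hand the buffer to `cudaLaunchDeviceV2`. The packing step in (2) can be modelled on the host. This is a hedged sketch under the assumption that arguments are placed at naturally aligned offsets; the `push_param` helper is hypothetical, not part of any existing API:

```rust
/// Hypothetical helper: append one kernel argument's bytes to a
/// parameter buffer at the argument's natural alignment, returning
/// the offset it was written at. This models the packing a
/// parameter-buffer-based device launch would need; it is a sketch,
/// not an existing API.
fn push_param(buf: &mut Vec<u8>, bytes: &[u8], align: usize) -> usize {
    // Round the current length up to the required alignment.
    let offset = (buf.len() + align - 1) / align * align;
    buf.resize(offset, 0); // zero out any padding bytes
    buf.extend_from_slice(bytes);
    offset
}

fn main() {
    let mut buf = Vec::new();
    // e.g. a u32 argument followed by a u64 argument:
    let off_a = push_param(&mut buf, &7u32.to_ne_bytes(), 4);
    let off_b = push_param(&mut buf, &9u64.to_ne_bytes(), 8);
    assert_eq!(off_a, 0);  // u32 lands at offset 0
    assert_eq!(off_b, 8);  // u64 is padded out to offset 8
    assert_eq!(buf.len(), 16);
}
```

Whether the real V2 ABI uses exactly natural alignment for each argument is something the PTX ISA docs and nvcc output would need to confirm, but some packing of this shape is what any implementation will end up generating.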