Dynamic Parallelism | implementation strategy #94

Open
thedodd opened this issue Nov 9, 2022 · 2 comments · May be fixed by #96

thedodd commented Nov 9, 2022

Well ... once again, I find myself in need of another feature. This time, dynamic parallelism.

Looks like this is also part of the C++ runtime API, similar to cooperative groups, for which I already have a PR.

I'm considering using a similar strategy to implement this feature. I would love to just pin down the PTX, but so far that has proven a bit unclear; still, I will definitely start my search in the PTX ISA and see if there are any quick wins. If not, I'll probably take a similar approach to the one used for the cooperative groups API.

Thoughts?

thedodd commented Nov 14, 2022

PTX generated from a C++ program that uses dynamic parallelism tends to include the following .extern declarations (comments are mine, based on studying the PTX):

.extern .func  (.param .b64 func_retval0) cudaGetParameterBufferV2
(
	.param .b64 cudaGetParameterBufferV2_param_0, // Function pointer.
	.param .align 4 .b8 cudaGetParameterBufferV2_param_1[12], // Grid size.
	.param .align 4 .b8 cudaGetParameterBufferV2_param_2[12], // Block size.
	.param .b32 cudaGetParameterBufferV2_param_3 // Shared mem.
)
;
.extern .func  (.param .b32 func_retval0) cudaLaunchDeviceV2
(
	.param .b64 cudaLaunchDeviceV2_param_0, // Param buffer.
	.param .b64 cudaLaunchDeviceV2_param_1 // Stream.
)
;

nvcc inserts these declarations when the device-side triple-chevron syntax is used. They appear to be updated V2 ABIs compared to what is documented here.

The V2 ABIs are much simpler, and building up the PTX for them seems pretty straightforward. I should have a PTX-based solution for this in PR form quite soon.
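
For reference, here is a sketch of how those two entry points could be declared on the Rust side, matching the parameter layout in the PTX above. The Dim3 struct and the raw u64 stream handle are my assumptions about the ABI, not confirmed bindings:

use core::ffi::c_void;

// Mirrors CUDA's dim3: three u32s, 12 bytes total, matching the
// `.param .align 4 .b8 ...[12]` parameters in the PTX above.
#[repr(C)]
pub struct Dim3 {
    pub x: u32,
    pub y: u32,
    pub z: u32,
}

extern "C" {
    // Allocates a parameter buffer for a device-side launch of `func`;
    // returns null on failure (the `.param .b64 func_retval0` above).
    pub fn cudaGetParameterBufferV2(
        func: *const c_void, // function pointer (param_0, .b64)
        grid_dim: Dim3,      // grid size (param_1, 12 bytes)
        block_dim: Dim3,     // block size (param_2, 12 bytes)
        shared_mem: u32,     // dynamic shared memory in bytes (param_3, .b32)
    ) -> *mut c_void;

    // Launches the kernel described by `parameter_buffer` on `stream`;
    // the i32 return value is a cudaError_t (the .b32 retval above).
    pub fn cudaLaunchDeviceV2(parameter_buffer: *mut c_void, stream: u64) -> i32;
}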

I will likely just copy the launch macro that we currently have in cust, either moving it to a shared location or copying it directly into the cuda_std module; we can decide what to do with it in the PR. To clarify: the launch macro already extracts block and grid size declarations quite nicely, which is why I want to reuse it and feed the extracted values into the PTX asm.
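
To make the intended ergonomics concrete, here is roughly the call shape I have in mind. Everything named here is hypothetical: dlaunch! is a placeholder for the device-side macro, and child_kernel, out_ptr, len, and stream are illustrative:

// Hypothetical device-side launch. The <<<grid, block, shared_mem, stream>>>
// shape mirrors cust's host-side launch! macro, which already parses the
// grid and block dimensions; the expansion would bottom out in the
// cudaGetParameterBufferV2 / cudaLaunchDeviceV2 calls shown above.
unsafe {
    dlaunch!(child_kernel<<<(16, 1, 1), (256, 1, 1), 0, stream>>>(out_ptr, len));
}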

thedodd commented Nov 15, 2022

Well, as it turns out, a great deal of the code (if not all) is already in place: https://github.com/Rust-GPU/Rust-CUDA/tree/master/crates/cuda_std/src/rt. I had originally been searching the docs for this and was not able to find it; looking in the code, there it is.

I will enable that rt module and start experimenting with it, comparing the PTX it generates against that of an equivalent C++ program compiled via nvcc.
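
For that experiment, a minimal parent kernel along the following lines should be enough to diff against nvcc's output. This is only a sketch, wired directly to the extern declarations I sketched in my previous comment rather than to whatever the rt module actually exposes; #[kernel] is cuda_std's attribute, and taking a kernel's address as a device function pointer is an assumption about the codegen:

// Hypothetical child kernel; it takes no parameters, so the parameter
// buffer can be left empty after allocation.
#[kernel]
pub unsafe fn child() {
    // ... child work ...
}

// Parent kernel whose generated PTX can be diffed against nvcc's.
#[kernel]
pub unsafe fn parent() {
    // Ask the device runtime for a launch parameter buffer.
    let buf = cudaGetParameterBufferV2(
        child as *const c_void,
        Dim3 { x: 1, y: 1, z: 1 },  // grid: one block
        Dim3 { x: 32, y: 1, z: 1 }, // block: one warp
        0,                          // no dynamic shared memory
    );
    if !buf.is_null() {
        // Stream handle 0: fire-and-forget on the default stream.
        cudaLaunchDeviceV2(buf, 0);
    }
}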

@thedodd thedodd linked a pull request Nov 21, 2022 that will close this issue