Add Cooperative Groups API integration #87
base: master
Conversation
@RDambrosio016 whenever you get some time (no rush), let me know what you think. I am testing this out as I go on a fairly large project of mine, which brought about this need in the first place. Overall, the bridging code is quite simple. I've given an outline of how I think this should be exposed overall. Let me know what you think, happy to modify things as I go. Also, for this first pass, I would like to keep focused only on the grid-level components of the cooperative groups API, as well as the basic cooperative launch host-side function. We can add multi-device and the other cooperative group components later.
This works as follows:
- Users build their Cuda code via `CudaBuilder` as normal.
- If they want to use the cooperative groups API, then in their `build.rs`, just after building their PTX, they will:
  - Create a `cuda_builder::cg::CooperativeGroups` instance,
  - Add any needed opts for building the Cooperative Groups API bridge code (`-arch=sm_*` and so on),
  - Add their newly built PTX code to be linked with the CG API, which can include multiple PTX, cubin or fatbin files,
  - Call `.compile(..)`, which will spit out a fully linked `cubin`.
- In the user's main application code, instead of using `launch!` to schedule their GPU work, they will now use `launch_cooperative!`.
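The steps above could look roughly like this in a `build.rs`. This is a hedged sketch only: the `CooperativeGroups` builder methods (`arch`, `link_ptx`, `compile`), the crate name `gpu_crate`, and all paths are assumptions for illustration, not a confirmed interface, and building requires a CUDA toolchain.

```rust
// build.rs — hypothetical sketch of the proposed workflow.
// Method names on `CooperativeGroups` are illustrative assumptions.
use cuda_builder::CudaBuilder;
use cuda_builder::cg::CooperativeGroups;

fn main() {
    let ptx_path = "target/gpu/kernels.ptx";

    // 1. Build the GPU crate to PTX as normal.
    CudaBuilder::new("gpu_crate")
        .copy_to(ptx_path)
        .build()
        .unwrap();

    // 2. Link the freshly built PTX against the Cooperative Groups
    //    bridge code, producing a fully linked cubin.
    CooperativeGroups::new()
        .arch("sm_70")                        // assumed opt, e.g. -arch=sm_70
        .link_ptx(ptx_path)                   // multiple PTX/cubin/fatbin files allowed
        .compile("target/gpu/kernels.cubin")  // emits the linked cubin
        .unwrap();
}
```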
This looks neat, but if I'm not mistaken, those functions map directly to single PTX intrinsics; wouldn't it be easier to use inline assembly? Though I haven't actually looked into this, so I'm not sure whether they map to more than one PTX instruction.
I started down that path at first, and for a few of the pertinent functions the corresponding PTX was clear. I was using a base C++ program compiled down to PTX to verify, in addition to cross-referencing with the PTX ISA spec. However, many of the interfaces were not as clear, and this seemed to be a potentially more reliable way to generate the needed code. Perhaps we can replace some of the clear interfaces with some ASM instead. Happy to iterate on this in the future.
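For the interfaces that do map cleanly to one PTX instruction, the inline-assembly route discussed above might look like the following sketch. This is an illustrative assumption, not code from the PR: it assumes the `nvptx64-nvidia-cuda` target, and `thread_rank_in_warp` is a hypothetical name; only simple special-register reads are this clean, while things like grid-wide sync expand to multiple instructions.

```rust
// Hypothetical sketch: expressing a single-instruction intrinsic with
// inline PTX assembly instead of bridged C++ (nvptx64 target assumed).
use core::arch::asm;

#[inline(always)]
unsafe fn thread_rank_in_warp() -> u32 {
    let out: u32;
    // %laneid is a predefined PTX special register holding the calling
    // thread's lane index within its warp (0..=31).
    asm!("mov.u32 {r}, %laneid;", r = out(reg32) out);
    out
}
```

The trade-off noted in the conversation stands: for anything that expands to more than one instruction (e.g. a grid sync), compiling the vendor's C++ bridge is likely more reliable than hand-written asm.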
todo:
Wrap `cuLaunchCooperativeKernel` in a nice interface. We can add the cooperative multi-device bits later, along with all of the other bits from the cooperative API.
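On the host side, the intended usage could be sketched as below. This is a hedged illustration, not the PR's final API: it assumes `launch_cooperative!` mirrors the existing `launch!` macro's syntax, and the module-loading call, kernel name, and buffer setup are placeholders; actually running it requires a CUDA-capable GPU.

```rust
// Hypothetical host-side sketch: launching from the linked cubin with
// `launch_cooperative!` (assumed to mirror `launch!`'s syntax).
use cust::prelude::*;

fn run() -> Result<(), cust::error::CudaError> {
    let _ctx = cust::quick_init()?;
    let module = Module::from_file("target/gpu/kernels.cubin")?; // linked cubin from build.rs
    let kernel = module.get_function("my_kernel")?;              // placeholder kernel name
    let stream = Stream::new(StreamFlags::NON_BLOCKING, None)?;

    let data = DeviceBuffer::from_slice(&[0u32; 1024])?;
    unsafe {
        // Cooperative launch: all blocks are resident simultaneously,
        // so grid-wide sync inside the kernel is valid.
        launch_cooperative!(
            kernel<<<4, 256, 0, stream>>>(data.as_device_ptr(), data.len())
        )?;
    }
    stream.synchronize()
}
```

Unlike `launch!`, a cooperative launch must respect the device's co-residency limits (the grid cannot exceed what fits on the device at once), which is presumably something the nice wrapper over `cuLaunchCooperativeKernel` would surface.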