Allowed Caliper to control Kokkos fencing behavior #373
base: master
Conversation
Hold off on merging this one, there's something we ought to discuss |
I pinged you in an email, let me know what you think |
Oh good, you already implemented this, I thought I'd have to deal with the Kokkos fencing stuff now 😄 Do I interpret this correctly that you tell Kokkos "I don't need auto-fencing", but then tell it to fence explicitly in each of the callbacks? Generally, Caliper shouldn't need fencing at all. For CUDA, Caliper can map GPU times to the regions that launched them without any need for explicit fencing (you can give it a try). TBH I still don't understand why the tool interface should have to concern itself with that; that should be a switch in Kokkos that the user can set. Oh well. And since Caliper shouldn't typically require fencing at all, I can't spontaneously think of any scenario where it would be beneficial to have more fine-grained control over it. |
!!!!!!! Very cool, let me test that out. If this works, then yeah, delete fences, and get ready for Sandia users to beat down your door... Should be just CALI_CONFIG=cuda-activity-report(profile.kokkos)? |
I think it should be optional, in any case, due to the reasons I outlined in my email (you're not really just timing your own kernel, in a multi-stream case) |
Yeah, |
Okay, so I think I'll write the logic as follows:
And then as you gain the ability to do HIP fence-free profiling, we just keep adding to case 1 |
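A minimal sketch of what that case split might look like; the helper name and the list below are illustrative, not the actual Caliper code:

    // Illustrative only: the names here are made up for the sketch.
    #include <set>
    #include <string>

    // Case 1: configs that rely on CUPTI activity records and therefore
    // don't need Kokkos to fence around kernels on their behalf.
    static const std::set<std::string> fence_free_configs {
        "cuda-activity-report"
        // a future "hip-activity-report" would be added here once HIP
        // gains the same fence-free capability
    };

    // Case 2: everything else keeps the synchronous (fenced) behavior.
    static bool config_needs_fencing(const std::string& config_name)
    {
        return fence_free_configs.count(config_name) == 0;
    }

The point of keeping it as a plain list is exactly what the comment says: growing case 1 later is a one-line change.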
Might be easier to just give the kokkos service a config flag that turns on fencing. Is there actually another situation besides the UVM event mapping where we really need it? Do people insist on synchronous times with e.g. a runtime-report? |
So, the problem is that if you don't do synchronous timings, you confuse your users. You claim that a dot-product on three elements took a minute, when in reality a kernel on another stream hogged the device for all but a few microseconds of that time. By the way, I was writing your side of this argument to Christian earlier today. But think about it: you write

    call_huge_calculation_on_cuda(stream2);
    call_small_dot_product(stream1);

Depending on how the scheduler works, you'll happily submit on stream2, then stream1, and then the big calculation hogs the device. But you've still started your timing on the kernel in stream1 (unless Caliper is doing something very interesting?). Your users write "why does Caliper think a dot product takes a minute" and then everybody is confused. |
Timing via Cupti isn't like timing on a CPU. |
First: hi! Hope you're enjoying AMD. Second: it can resolve it, but the timings won't necessarily be accurate in the sense of "how long does my dot kernel take to run"; it might answer "how long does my dot kernel take to get scheduled, possibly delayed by other kernels, and then eventually run." Which is not at all a useless answer, but it's also not always the answer a user will want |
Had my first day today. I haven't sent out an email yet (only bc of procrastination), so I suspect this is news to @daboehme 🤣
What I'm saying is that it will be accurate for how long the kernel takes to run, bc that's exactly what a CUPTI activity record provides: the runtime of the kernel, absent everything else.
What you are describing is what you would get if you used cudaEvent timings, not CUPTI |
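To make the distinction concrete, here is a small, self-contained sketch of the cudaEvent-style measurement being described; the two kernels and the stream setup are placeholders for illustration, not code from this PR:

    // Sketch: host-side cudaEvent timing of an async launch includes any
    // time the kernel spends waiting for the device, unlike a CUPTI
    // activity record, which reports only the kernel's own execution time.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void busy_kernel() {            // placeholder "huge calculation"
        for (volatile int i = 0; i < 100000000; ++i) {}
    }
    __global__ void tiny_kernel() {}           // placeholder "small dot product"

    int main() {
        cudaStream_t stream1, stream2;
        cudaStreamCreate(&stream1);
        cudaStreamCreate(&stream2);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        busy_kernel<<<1024, 256, 0, stream2>>>();   // hogs the device
        cudaEventRecord(start, stream1);
        tiny_kernel<<<1, 32, 0, stream1>>>();       // waits for resources
        cudaEventRecord(stop, stream1);

        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        // "ms" is dominated by the wait behind busy_kernel, not by
        // tiny_kernel's actual runtime on the device.
        printf("event-timed: %.3f ms\n", ms);
        return 0;
    }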
Ah! That's actually news to me. Cool, in that case I don't see a reason to ever fence. And congrats on the first day, and getting to surprise Boehme, that's always a good day. |
It is! Well, congratulations 😄
Exactly! Well, unless you do weird stuff like the UVM event to kokkos space mapping. Otherwise I think all the artificial fencing is probably more harmful than beneficial for understanding performance. |
Path Host Time GPU Time GPU %
cudaFreeHost 0.000039
test_two 3.329040 1.532707 46.040502
Kokkos::Tools::Experim~~: Tool Requested Fence 1.671173
cudaDeviceSynchronize 1.026064
cudaLaunchKernel 0.614629 1.532707 249.371064
cudaFuncSetCacheConfig 0.000005
cudaFuncGetAttributes 0.000007
test_one 3.272529 1.533013 46.844918
Kokkos::Tools::Experim~~: Tool Requested Fence 1.669054
cudaDeviceSynchronize 1.022471
cudaLaunchKernel 0.557666 1.533013 274.898070
cudaFuncSetCacheConfig 0.000006
cudaFuncGetAttributes 0.000008
Kokkos::View::initialization [v2] 0.000113 0.000030 26.590754
Kokkos::Tools::Experim~~: Tool Requested Fence 0.000025
cudaDeviceSynchronize 0.000012
Kokkos::Impl::ViewValu~~setting values in view 0.000030
cudaStreamSynchronize 0.000021
cudaLaunchKernel 0.000017 0.000030 173.327938
cudaPointerGetAttributes 0.000006
Kokkos::View::initialization [v1] 0.000182 0.000031 17.277801
Kokkos::Tools::Experim~~: Tool Requested Fence 0.000024
cudaDeviceSynchronize 0.000011
Kokkos::Impl::ViewValu~~setting values in view 0.000029
cudaStreamSynchronize 0.000018
cudaLaunchKernel 0.000016 0.000030 189.246237
cudaMemcpyToSymbol 0.000017 0.000001 6.528709
cudaFuncSetCacheConfig 0.000006
cudaFuncGetAttributes 0.000013
cudaPointerGetAttributes 0.000014
Kokkos::Tools::Experime~~e: Tool Requested Fence 2.237524
cudaDeviceSynchronize 0.955436
cudaDeviceSetCacheConfig 0.000013
cudaMemset 0.000046
cudaHostAlloc 0.000026
cudaFree 0.000563
cudaLaunchKernel 0.000048 0.000006 13.201375
cudaMemcpy 0.000488 0.000041 8.321780
cudaMalloc 0.000409
Kokkos::CudaInternal::i~~on space initialization 0.000057
cudaDeviceSynchronize 0.000036
cudaSetDevice 0.000016
cudaStreamCreate 0.000694
(base) [dzpolia@kokkos-dev-2 cuda-build]$ ./core/unit_test/KokkosCore_UnitTest_Develop --kokkos-tools-library=$HOME/src/caliper/nkb/src/libcaliper.so --kokkos-tools-args="cuda-activity-report(profile.kokkos)"
(REBUILD WITHOUT FENCES)
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from defaultdevicetype
[ RUN ] defaultdevicetype.development_test
[ OK ] defaultdevicetype.development_test (3125 ms)
[----------] 1 test from defaultdevicetype (3125 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (3125 ms total)
[ PASSED ] 1 test.
Path Host Time GPU Time GPU %
cudaFreeHost 0.000037
test_two 1.293494 1.391292 107.560713
cudaLaunchKernel 0.833609 1.391292 166.899775
cudaFuncSetCacheConfig 0.000008
cudaFuncGetAttributes 0.000007
test_one 1.229367 1.391136 113.158774
cudaLaunchKernel 0.769919 1.391136 180.686070
cudaFuncSetCacheConfig 0.000006
cudaFuncGetAttributes 0.000008
Nifty result. Actually cool in both cases, Caliper giving you some visibility into what Kokkos does under the hood. I'll refine that |
Actually, here's the cool result (with some refinement):
(base) [dzpolia@kokkos-dev-2 cuda-build]$ ./core/unit_test/KokkosCore_UnitTest_Develop --kokkos-tools-library=$HOME/src/caliper/nkb/src/libcaliper.so --kokkos-tools-args="cuda-activity-report(profile.kokkos)"
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from defaultdevicetype
[ RUN ] defaultdevicetype.development_test
[ OK ] defaultdevicetype.development_test (3117 ms)
[----------] 1 test from defaultdevicetype (3118 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (3118 ms total)
[ PASSED ] 1 test.
Path Host Time GPU Time GPU %
cudaFreeHost 0.000038
test_two 1.263474 1.391103 110.101493
cudaLaunchKernel 0.781364 1.391103 178.035232
cudaFuncSetCacheConfig 0.000006
cudaFuncGetAttributes 0.000008
test_one 1.225919 1.391138 113.477202
cudaLaunchKernel 0.738680 1.391138 188.327482
cudaFuncSetCacheConfig 0.000006
cudaFuncGetAttributes 0.000008
Kokkos::View::initialization [v2] 0.000077 0.000030 39.004349
Kokkos::Impl::ViewValu~~setting values in view 0.000021
cudaStreamSynchronize 0.000011
cudaLaunchKernel 0.000026 0.000030 116.596818
cudaPointerGetAttributes 0.000006
Kokkos::View::initialization [v1] 0.000161 0.000032 19.537230
Kokkos::Impl::ViewValu~~setting values in view 0.000028
cudaStreamSynchronize 0.000016
cudaLaunchKernel 0.000018 0.000030 171.277255
cudaMemcpyToSymbol 0.000022 0.000001 5.140208
cudaFuncSetCacheConfig 0.000006
cudaFuncGetAttributes 0.000013
cudaPointerGetAttributes 0.000015
cudaDeviceSetCacheConfig 0.000012
cudaMemset 0.000048
cudaHostAlloc 0.000026
cudaFree 0.000417
cudaLaunchKernel 0.000048 0.000006 12.817001
cudaMemcpy 0.028340 0.000039 0.138658
cudaMalloc 0.000368
Kokkos::CudaInternal::i~~on space initialization 0.000088
cudaDeviceSynchronize 0.000048
cudaSetDevice 0.000016
cudaStreamCreate 0.000647
(base) [dzpolia@kokkos-dev-2 cuda-build]$ ./core/unit_test/KokkosCore_UnitTest_Develop --kokkos-tools-library=$HOME/src/caliper/nkb/src/libcaliper.so --kokkos-tools-args="runtime-report(profile.kokkos)"
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from defaultdevicetype
[ RUN ] defaultdevicetype.development_test
[ OK ] defaultdevicetype.development_test (6608 ms)
[----------] 1 test from defaultdevicetype (6608 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (6608 ms total)
[ PASSED ] 1 test.
Path Time (E) Time (I) Time % (E) Time % (I)
test_two 0.864759 2.130433 13.085650 32.237999
Kokkos::Tools::Experim~~: Tool Requested Fence 1.265674 1.265674 19.152349 19.152349
test_one 0.871407 2.135599 13.186248 32.316171
Kokkos::Tools::Experim~~: Tool Requested Fence 1.264192 1.264192 19.129923 19.129923
Kokkos::View::initialization [v2] 0.000031 0.000072 0.000469 0.001090
Kokkos::Tools::Experim~~: Tool Requested Fence 0.000010 0.000010 0.000151 0.000151
Kokkos::Impl::ViewValu~~setting values in view 0.000031 0.000031 0.000469 0.000469
Kokkos::View::initialization [v1] 0.000076 0.000117 0.001150 0.001770
Kokkos::Tools::Experim~~: Tool Requested Fence 0.000011 0.000011 0.000166 0.000166
Kokkos::Impl::ViewValu~~setting values in view 0.000030 0.000030 0.000454 0.000454
Kokkos::Tools::Experime~~e: Tool Requested Fence 0.828718 0.828718 12.540272 12.540272
Kokkos::CudaInternal::i~~on space initialization 0.000028 0.000028 0.000424 0.000424
Note how Caliper reports the fences it induces. So in the runtime-report case, it needs fences; in the cuda-activity-report case, it doesn't. What's fun here is that as you develop the HIP capability to do the same, I just add "hip-activity-report" to the list of things that don't require fencing, but until you do, people get accurate results with runtime-report. And they see how much time is coming from tool-induced fencing, if they want to adjust for that. This is a very neat thing. I'll push code |
@daboehme also note the total test runtime. This is a pathological example, but we're cutting runtime in half on the test |
Is this ready to merge now @DavidPoliakoff ? |
In my book it is, if you think it's good to go, merge it |
Ping |
Oh, sorry for the delay. I took a closer look at this again - it seems the only thing that disables the fencing now is if |
Why turn it off, though? If I'm not doing cuda-activity-report, the answers I get will be garbage, no? |
There are other configs (cuda-activity-profile) which also work for asynchronous tasks. Or you can have custom configs. |
What else works for asynchronous tasks? I basically want to limit it to those, as people will use the custom config case, and not know what they're doing, and ask me why Kokkos/Caliper says their kernels only take 30 microseconds |
There's the cuda-activity-profile (in addition to -report), whatever we'll do for HIP, etc. Plus maybe you want to see the asynchronous times. I'm just saying it would be nice to be more configurable for people who do know what they're doing. |
I'd rather maintain a list. The asynchronous times are completely meaningless. The ratio of "people who will ask me why the config their buddy gave them reports a kernel they know takes 1 second as taking exactly the length of a CUDA/HIP kernel launch" to "people who will be curious to see kernel times in a non-asynchrony-aware case" is way too high for me to want to expose that option. I'll ask in the group about this, though. |
An aggressive proposal might be for configs to say what resources they require fences of (this is independently useful information). Then Kokkos could just query "ah, does the requested config need fencing of [resource]?" |
Yeah, but that would introduce additional requirements in services that don't necessarily have much to do with kokkos. Unfortunately there is not really a reliable automatic way to determine if you need fencing or not, you just have to encode that somewhere. It can be left on by default if you think it confuses users otherwise. |
It actually provides a lot of value outside of Kokkos. If a user wants to know "what are the configurations that allow me to get asynchronous timings, and CUDA, and produce a trace" for example, they currently look in docs, perhaps? But if configs documented their capabilities, we could allow people to query these kinds of things. Just a thought, not one I'm pushing too hard. What I am pushing very hard against is making this an independent configuration setting. I think it is a very, very, very bad idea, that will become one of the main sources of questions I get about why Caliper is doing something unintuitive. If you tell me "this needs to be an independent setting" I'll do it, but I think it's a bad move to separate this from a list of configs, that buys us nothing but bug reports we'll hate to debug. Edit to add: I don't mean that last aggressively, I just want to highlight how dangerous I think the idea is |
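For what it's worth, a rough sketch of the capability-metadata idea; every type and name here is hypothetical, and nothing like this exists in Caliper today:

    // Hypothetical sketch of per-config capability metadata and a query.
    #include <iostream>
    #include <map>
    #include <string>

    struct ConfigCapabilities {
        bool asynchronous_gpu_timing;   // GPU time attributed without fencing
        bool produces_trace;            // values below are placeholders
    };

    static const std::map<std::string, ConfigCapabilities> registry {
        { "cuda-activity-report",  { true,  false } },
        { "cuda-activity-profile", { true,  false } },
        { "runtime-report",        { false, false } }
    };

    int main() {
        // "Which configs give me asynchronous GPU timings?" -- and the kokkos
        // service could ask the inverse question to decide whether to fence.
        for (const auto& entry : registry)
            if (entry.second.asynchronous_gpu_timing)
                std::cout << entry.first << '\n';
        return 0;
    }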
If you don't want a global config variable, maybe just make turning it off explicitly available via the |
So, could somebody give me a use case where |
Sorry, somehow I hit the close button, ignore that |
Ok, it seems we have different philosophies here. The fencing behavior has a major effect on what is being measured. There are clearly situations where you would want to turn it off (e.g., cuda-activity-report), and therefore it would be nice to have it runtime-configurable as an option. It wouldn't be unintuitive either: you'd leave the defaults exactly as they are now (on by default, off with cuda-activity-report), so users don't have to deal with this setting, but they can change it if they want to. It just gives us more flexibility in the long term. What for, I don't know yet, but I'm sure there are more situations where you'd want to turn fencing off. |
I don't want to add a footgun, which turning off fencing would be, without a "here's how this footgun will do something other than put bullets in feet." A user will see "turns off fencing," skip the part about it screwing up measurements except in very specific cases, use it, and get confused. We have too many users to just throw that around. The closest I can come to this being a good idea would be an unlisted, undocumented option that configs could define. Then my code could inspect the config, rather than using the name. That would provide some value. But I'm wholesale opposed to letting users set this independently without a concrete reason. It's your project; if you want to say "I won't do this unless it's an option," I'll do it. I'm just saying it's a very very very bad idea to have this |
It's certainly an "expert option" and doesn't need to be advertised much. If people use it without understanding what it does, that's on them. But if someone comes with "I want to do X, but that requires turning off fencing," then I don't want the answer to be "too bad, we can't do that." It's just a way to future-proof this. |
Okay, it'll probably be tomorrow before I can get back around to this, but I'll do it that way |
Sounds good. If you don't have the time I can take a stab at it. It's not difficult. I just don't have a setup to really test it. |
Many Kokkos PRs ago we added the ability for a tool to control the fencing behavior of Kokkos. In the case of Caliper, this can minimize overhead in one place currently, and in more places in the future.
What I do right now is just tell Kokkos to fence only if a callback subscribes to a given event. This could allow us in the future to do things like "only profile Regions" and not induce fences in parallel kernels. There's also room to do extremely clever things with GPU backends where, using CUPTI, we don't introduce syncs at all, but that's beyond the scope of what I want to do here.
Note that this does introduce some header files, but they are extremely backwards compatible; we have about six years of history of these files not breaking anything.
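As a reference point, a hedged sketch of the settings hook this relies on; the struct layout and names below are reproduced from memory (the real definitions live in Kokkos' profiling interface headers, which should be checked before reuse), and the config check is a placeholder:

    // Sketch only: the tool declares whether it needs Kokkos' automatic
    // global fencing, and then fences itself in the callbacks it cares about.
    #include <cstdint>

    extern "C" {

    // Approximation of the settings struct Kokkos hands to the tool.
    struct Kokkos_Tools_ToolSettings {
        bool requires_global_fencing;
    };

    void kokkosp_request_tool_settings(const uint32_t /*num_settings*/,
                                       Kokkos_Tools_ToolSettings* settings)
    {
        // Placeholder decision: ask Kokkos not to auto-fence only when the
        // active config can attribute GPU time without fences
        // (e.g. cuda-activity-report); otherwise keep fencing on.
        bool config_is_fence_free = false;  // placeholder
        settings->requires_global_fencing = !config_is_fence_free;
    }

    } // extern "C"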