Allowed Caliper to control Kokkos fencing behavior #373
base: master
Conversation
Hold off on merging this one, there's something we ought to discuss |
I pinged you in an email, let me know what you think |
Oh good, you already implemented this, I thought I'd have to deal with the Kokkos fencing stuff now 😄 Do I interpret this correctly that you tell Kokkos "I don't need auto-fencing", but then tell it to fence explicitly in each of the callbacks? Generally, Caliper shouldn't need fencing at all. For CUDA, Caliper can map GPU times to the regions that launched them without any need for explicit fencing (you can give it a try). TBH I still don't understand why the tool interface should have to concern itself with that; that should be a switch in Kokkos that the user can set. Oh well. And since Caliper shouldn't typically require fencing at all, I can't spontaneously think of any scenario where it would be beneficial to have more fine-grained control over it. |
!!!!!!! Very cool, let me test that out. If this works, then yeah, delete fences, and get ready for Sandia users to beat down your door... Should be just CALI_CONFIG=cuda-activity-report(profile.kokkos)? |
I think it should be optional, in any case, due to the reasons I outlined in my email (you're not really just timing your own kernel, in a multi-stream case) |
Yeah, |
Okay, so I think I'll write the logic as follows:
And then as you gain the ability to do HIP fence-free profiling, we just keep adding to case 1 |
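A minimal sketch of what that case split might look like; the helper name and the list below are illustrative, not the actual Caliper code:

    // Illustrative only: the names here are made up for the sketch.
    #include <set>
    #include <string>

    // Case 1: configs that rely on CUPTI activity records and therefore
    // don't need Kokkos to fence around kernels on their behalf.
    static const std::set<std::string> fence_free_configs {
        "cuda-activity-report"
        // a future "hip-activity-report" would be added here once HIP
        // gains the same fence-free capability
    };

    // Case 2: everything else keeps the synchronous (fenced) behavior.
    static bool config_needs_fencing(const std::string& config_name)
    {
        return fence_free_configs.count(config_name) == 0;
    }

The point of keeping it as a plain list is exactly what the comment says: growing case 1 later is a one-line change.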
Might be easier to just give the kokkos service a config flag that turns on fencing. Is there actually another situation besides the UVM event mapping where we really need it? Do people insist on synchronous times with e.g. a runtime-report? |
So, the problem is that if you don't do synchronous timings, you confuse your users. You claim that a dot-product on three elements took a minute, when in reality a kernel on another stream hogged the device for all but a few microseconds of that time. By the way, I was writing your side of this argument to Christian earlier today. But think about it: you write

    call_huge_calculation_on_cuda(stream2);
    call_small_dot_product(stream1);

Depending on how the scheduler works, you'll happily submit on stream2, then stream1, and then the big calculation hogs the device. But you've still started your timing on the kernel in stream1 (unless Caliper is doing something very interesting?). Your users write "why does Caliper think a dot product takes a minute" and then everybody is confused. |
Timing via Cupti isn't like timing on a CPU. |
First: hi! Hope you're enjoying AMD. Second: it can resolve it, but the timings won't necessarily be accurate in the sense of "how long does my dot kernel take to run"; it might answer "how long does my dot kernel take to get scheduled, possibly delayed by other kernels, and then eventually run." Which is not at all a useless answer, but it's also not always the answer a user will want |
Had my first day today. I haven't sent out an email yet (only bc of procrastination), so I suspect this is news to @daboehme 🤣
What I'm saying is that it will be accurate for how long the kernel takes to run, bc that's exactly what a CUPTI activity record provides: the runtime of the kernel, absent everything else.
What you are describing is what you would get if you used cudaEvent timings, not CUPTI |
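To make the distinction concrete, here is a small, self-contained sketch of the cudaEvent-style measurement being described; the two kernels and the stream setup are placeholders for illustration, not code from this PR:

    // Sketch: host-side cudaEvent timing of an async launch includes any
    // time the kernel spends waiting for the device, unlike a CUPTI
    // activity record, which reports only the kernel's own execution time.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void busy_kernel() {            // placeholder "huge calculation"
        for (volatile int i = 0; i < 100000000; ++i) {}
    }
    __global__ void tiny_kernel() {}           // placeholder "small dot product"

    int main() {
        cudaStream_t stream1, stream2;
        cudaStreamCreate(&stream1);
        cudaStreamCreate(&stream2);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        busy_kernel<<<1024, 256, 0, stream2>>>();   // hogs the device
        cudaEventRecord(start, stream1);
        tiny_kernel<<<1, 32, 0, stream1>>>();       // waits for resources
        cudaEventRecord(stop, stream1);

        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        // "ms" is dominated by the wait behind busy_kernel, not by
        // tiny_kernel's actual runtime on the device.
        printf("event-timed: %.3f ms\n", ms);
        return 0;
    }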
Ah! That's actually news to me. Cool, in that case I don't see a reason to ever fence. And congrats on the first day, and getting to surprise Boehme, that's always a good day. |
It is! Well, congratulations 😄
Exactly! Well, unless you do weird stuff like the UVM event to kokkos space mapping. Otherwise I think all the artificial fencing is probably more harmful than beneficial for understanding performance. |
Path Host Time GPU Time GPU %
cudaFreeHost 0.000039
test_two 3.329040 1.532707 46.040502
Kokkos::Tools::Experim~~: Tool Requested Fence 1.671173
cudaDeviceSynchronize 1.026064
cudaLaunchKernel 0.614629 1.532707 249.371064
cudaFuncSetCacheConfig 0.000005
cudaFuncGetAttributes 0.000007
test_one 3.272529 1.533013 46.844918
Kokkos::Tools::Experim~~: Tool Requested Fence 1.669054
cudaDeviceSynchronize 1.022471
cudaLaunchKernel 0.557666 1.533013 274.898070
cudaFuncSetCacheConfig 0.000006
cudaFuncGetAttributes 0.000008
Kokkos::View::initialization [v2] 0.000113 0.000030 26.590754
Kokkos::Tools::Experim~~: Tool Requested Fence 0.000025
cudaDeviceSynchronize 0.000012
Kokkos::Impl::ViewValu~~setting values in view 0.000030
cudaStreamSynchronize 0.000021
cudaLaunchKernel 0.000017 0.000030 173.327938
cudaPointerGetAttributes 0.000006
Kokkos::View::initialization [v1] 0.000182 0.000031 17.277801
Kokkos::Tools::Experim~~: Tool Requested Fence 0.000024
cudaDeviceSynchronize 0.000011
Kokkos::Impl::ViewValu~~setting values in view 0.000029
cudaStreamSynchronize 0.000018
cudaLaunchKernel 0.000016 0.000030 189.246237
cudaMemcpyToSymbol 0.000017 0.000001 6.528709
cudaFuncSetCacheConfig 0.000006
cudaFuncGetAttributes 0.000013
cudaPointerGetAttributes 0.000014
Kokkos::Tools::Experime~~e: Tool Requested Fence 2.237524
cudaDeviceSynchronize 0.955436
cudaDeviceSetCacheConfig 0.000013
cudaMemset 0.000046
cudaHostAlloc 0.000026
cudaFree 0.000563
cudaLaunchKernel 0.000048 0.000006 13.201375
cudaMemcpy 0.000488 0.000041 8.321780
cudaMalloc 0.000409
Kokkos::CudaInternal::i~~on space initialization 0.000057
cudaDeviceSynchronize 0.000036
cudaSetDevice 0.000016
cudaStreamCreate 0.000694
(base) [dzpolia@kokkos-dev-2 cuda-build]$ ./core/unit_test/KokkosCore_UnitTest_Develop --kokkos-tools-library=$HOME/src/caliper/nkb/src/libcaliper.so --kokkos-tools-args="cuda-activity-report(profile.kokkos)"
(REBUILD WITHOUT FENCES)
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from defaultdevicetype
[ RUN ] defaultdevicetype.development_test
[ OK ] defaultdevicetype.development_test (3125 ms)
[----------] 1 test from defaultdevicetype (3125 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (3125 ms total)
[ PASSED ] 1 test.
Path Host Time GPU Time GPU %
cudaFreeHost 0.000037
test_two 1.293494 1.391292 107.560713
cudaLaunchKernel 0.833609 1.391292 166.899775
cudaFuncSetCacheConfig 0.000008
cudaFuncGetAttributes 0.000007
test_one 1.229367 1.391136 113.158774
cudaLaunchKernel 0.769919 1.391136 180.686070
cudaFuncSetCacheConfig 0.000006
cudaFuncGetAttributes 0.000008
Nifty result. Actually cool in both cases, Caliper giving you some visibility into what Kokkos does under the hood. I'll refine that |
Actually, here's the cool result (with some refinement):
(base) [dzpolia@kokkos-dev-2 cuda-build]$ ./core/unit_test/KokkosCore_UnitTest_Develop --kokkos-tools-library=$HOME/src/caliper/nkb/src/libcaliper.so --kokkos-tools-args="cuda-activity-report(profile.kokkos)"
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from defaultdevicetype
[ RUN ] defaultdevicetype.development_test
[ OK ] defaultdevicetype.development_test (3117 ms)
[----------] 1 test from defaultdevicetype (3118 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (3118 ms total)
[ PASSED ] 1 test.
Path Host Time GPU Time GPU %
cudaFreeHost 0.000038
test_two 1.263474 1.391103 110.101493
cudaLaunchKernel 0.781364 1.391103 178.035232
cudaFuncSetCacheConfig 0.000006
cudaFuncGetAttributes 0.000008
test_one 1.225919 1.391138 113.477202
cudaLaunchKernel 0.738680 1.391138 188.327482
cudaFuncSetCacheConfig 0.000006
cudaFuncGetAttributes 0.000008
Kokkos::View::initialization [v2] 0.000077 0.000030 39.004349
Kokkos::Impl::ViewValu~~setting values in view 0.000021
cudaStreamSynchronize 0.000011
cudaLaunchKernel 0.000026 0.000030 116.596818
cudaPointerGetAttributes 0.000006
Kokkos::View::initialization [v1] 0.000161 0.000032 19.537230
Kokkos::Impl::ViewValu~~setting values in view 0.000028
cudaStreamSynchronize 0.000016
cudaLaunchKernel 0.000018 0.000030 171.277255
cudaMemcpyToSymbol 0.000022 0.000001 5.140208
cudaFuncSetCacheConfig 0.000006
cudaFuncGetAttributes 0.000013
cudaPointerGetAttributes 0.000015
cudaDeviceSetCacheConfig 0.000012
cudaMemset 0.000048
cudaHostAlloc 0.000026
cudaFree 0.000417
cudaLaunchKernel 0.000048 0.000006 12.817001
cudaMemcpy 0.028340 0.000039 0.138658
cudaMalloc 0.000368
Kokkos::CudaInternal::i~~on space initialization 0.000088
cudaDeviceSynchronize 0.000048
cudaSetDevice 0.000016
cudaStreamCreate 0.000647
(base) [dzpolia@kokkos-dev-2 cuda-build]$ ./core/unit_test/KokkosCore_UnitTest_Develop --kokkos-tools-library=$HOME/src/caliper/nkb/src/libcaliper.so --kokkos-tools-args="runtime-report(profile.kokkos)"
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from defaultdevicetype
[ RUN ] defaultdevicetype.development_test
[ OK ] defaultdevicetype.development_test (6608 ms)
[----------] 1 test from defaultdevicetype (6608 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (6608 ms total)
[ PASSED ] 1 test.
Path Time (E) Time (I) Time % (E) Time % (I)
test_two 0.864759 2.130433 13.085650 32.237999
Kokkos::Tools::Experim~~: Tool Requested Fence 1.265674 1.265674 19.152349 19.152349
test_one 0.871407 2.135599 13.186248 32.316171
Kokkos::Tools::Experim~~: Tool Requested Fence 1.264192 1.264192 19.129923 19.129923
Kokkos::View::initialization [v2] 0.000031 0.000072 0.000469 0.001090
Kokkos::Tools::Experim~~: Tool Requested Fence 0.000010 0.000010 0.000151 0.000151
Kokkos::Impl::ViewValu~~setting values in view 0.000031 0.000031 0.000469 0.000469
Kokkos::View::initialization [v1] 0.000076 0.000117 0.001150 0.001770
Kokkos::Tools::Experim~~: Tool Requested Fence 0.000011 0.000011 0.000166 0.000166
Kokkos::Impl::ViewValu~~setting values in view 0.000030 0.000030 0.000454 0.000454
Kokkos::Tools::Experime~~e: Tool Requested Fence 0.828718 0.828718 12.540272 12.540272
Kokkos::CudaInternal::i~~on space initialization 0.000028 0.000028 0.000424 0.000424
Note how Caliper reports the fences it induces. So in the runtime-report case, it needs fences; in the cuda-activity-report case, it doesn't. What's fun here is that as you develop the HIP capability to do the same, I just add "hip-activity-report" to the list of things that don't require fencing, but until you do, people get accurate results with runtime-report. And they see how much time is coming from tool-induced fencing, if they want to adjust for that. This is a very neat thing. I'll push code |
@daboehme also note the total test runtime. This is a pathological example, but we're cutting runtime in half on the test |
Is this ready to merge now @DavidPoliakoff ? |
In my book it is, if you think it's good to go, merge it |
Ping |
Oh, sorry for the delay. I took a closer look at this again - it seems the only thing that disables the fencing now is if |
Why turn it off, though? If I'm not doing cuda-activity-report, the answers I get will be garbage, no? |
There are other configs (cuda-activity-profile) which also work for asynchronous tasks. Or you can have custom configs. |
What else works for asynchronous tasks? I basically want to limit it to those, as people will use the custom config case, and not know what they're doing, and ask me why Kokkos/Caliper says their kernels only take 30 microseconds |
There's the cuda-activity-profile (in addition to -report), whatever we'll do for HIP, etc. Plus maybe you want to see the asynchronous times. I'm just saying it would be nice to be more configurable for people who do know what they're doing. |
I'd rather maintain a list. The asynchronous times are completely meaningless. The ratio of "people who will ask me why the config their buddy gave them reports a kernel they know takes 1 second as taking exactly the length of a CUDA/HIP kernel launch" to "people who will be curious to see kernel times in a non-asynchrony-aware case" is way too high for me to want to expose that option. I'll ask in the group about this, though. |
An aggressive proposal might be for configs to say what resources they require fences of (this is independently useful information). Then Kokkos could just query "ah, does the requested config need fencing of [resource]?" |
Yeah, but that would introduce additional requirements in services that don't necessarily have much to do with kokkos. Unfortunately there is not really a reliable automatic way to determine if you need fencing or not, you just have to encode that somewhere. It can be left on by default if you think it confuses users otherwise. |
It actually provides a lot of value outside of Kokkos. If a user wants to know "what are the configurations that allow me to get asynchronous timings, and CUDA, and produce a trace" for example, they currently look in docs, perhaps? But if configs documented their capabilities, we could allow people to query these kinds of things. Just a thought, not one I'm pushing too hard. What I am pushing very hard against is making this an independent configuration setting. I think it is a very, very, very bad idea, that will become one of the main sources of questions I get about why Caliper is doing something unintuitive. If you tell me "this needs to be an independent setting" I'll do it, but I think it's a bad move to separate this from a list of configs, that buys us nothing but bug reports we'll hate to debug. Edit to add: I don't mean that last aggressively, I just want to highlight how dangerous I think the idea is |
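For what it's worth, a rough sketch of the capability-metadata idea; every type and name here is hypothetical, and nothing like this exists in Caliper today:

    // Hypothetical sketch of per-config capability metadata and a query.
    #include <iostream>
    #include <map>
    #include <string>

    struct ConfigCapabilities {
        bool asynchronous_gpu_timing;   // GPU time attributed without fencing
        bool produces_trace;            // values below are placeholders
    };

    static const std::map<std::string, ConfigCapabilities> registry {
        { "cuda-activity-report",  { true,  false } },
        { "cuda-activity-profile", { true,  false } },
        { "runtime-report",        { false, false } }
    };

    int main() {
        // "Which configs give me asynchronous GPU timings?" -- and the kokkos
        // service could ask the inverse question to decide whether to fence.
        for (const auto& entry : registry)
            if (entry.second.asynchronous_gpu_timing)
                std::cout << entry.first << '\n';
        return 0;
    }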
If you don't want a global config variable, maybe just make turning it off explicitly available via the |
So, could somebody give me a use case where |
Sorry, somehow I hit the close button, ignore that |
Ok, it seems we have different philosophies here. The fencing behavior has a major effect on what is being measured. There are clearly situations where you would want to turn it off (e.g., cuda-activity-report), and therefore it would be nice to have it runtime-configurable as an option. It wouldn't be unintuitive either: you'd leave the defaults exactly as they are now (on by default, off with cuda-activity-report), so users don't have to deal with this setting, but they can change it if they want to. It just gives us more flexibility in the long term. What for, I don't know yet, but I'm sure there are more situations where you'd want to turn fencing off. |
I don't want to add a footgun, which turning off fencing would be, without a "here's how this footgun will do something other than put bullets in feet." A user will see "turns off fencing," skip the part about it screwing up measurements except in very specific cases, use it, and get confused. We have too many users to just throw that around. The closest I can come to this being a good idea would be an unlisted, undocumented option that configs could define. Then my code could inspect the config, rather than using the name. That would provide some value. But I'm wholesale opposed to letting users set this independently without a concrete reason. It's your project; if you want to say "I won't do this unless it's an option," I'll do it. I'm just saying it's a very very very bad idea to have this |
It's certainly an "expert option" and doesn't need to be advertised much. If people use it without understanding what it does, that's on them. But if someone comes with "I want to do X, but that requires turning off fencing," then I don't want the answer to be "too bad, we can't do that." It's just a way to future-proof this. |
Okay, it'll probably be tomorrow before I can get back around to this, but I'll do it that way |
Sounds good. If you don't have the time I can take a stab at it. It's not difficult. I just don't have a setup to really test it. |
Many Kokkos PRs ago we added the ability for a tool to control the fencing behavior of Kokkos. In the case of Caliper, this can minimize overhead in one place currently, and in more places in the future.
What I do right now is just tell Kokkos to fence only if a callback subscribes to a given event. This could allow us in the future to do things like "only profile Regions" and not induce fences in parallel kernels. There's also room to do extremely clever things with GPU backends where, using CUPTI, we don't introduce syncs at all, but that's beyond the scope of what I want to do here.
Note that this does introduce some header files, but they are extremely backwards compatible; we have about six years of history of these files not breaking anything.
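As a reference point, a hedged sketch of the settings hook this relies on; the struct layout and names below are reproduced from memory (the real definitions live in Kokkos' profiling interface headers, which should be checked before reuse), and the config check is a placeholder:

    // Sketch only: the tool declares whether it needs Kokkos' automatic
    // global fencing, and then fences itself in the callbacks it cares about.
    #include <cstdint>

    extern "C" {

    // Approximation of the settings struct Kokkos hands to the tool.
    struct Kokkos_Tools_ToolSettings {
        bool requires_global_fencing;
    };

    void kokkosp_request_tool_settings(const uint32_t /*num_settings*/,
                                       Kokkos_Tools_ToolSettings* settings)
    {
        // Placeholder decision: ask Kokkos not to auto-fence only when the
        // active config can attribute GPU time without fences
        // (e.g. cuda-activity-report); otherwise keep fencing on.
        bool config_is_fence_free = false;  // placeholder
        settings->requires_global_fencing = !config_is_fence_free;
    }

    } // extern "C"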