-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[vulkan] Reduce descriptor sets, use official headers, improve allocator, remove module destructor #8452
Conversation
…t per SPIR-V module. Use an amalgamation for the binary module returned by CodeGen Vulkan Dev. Decode amalgamation and cache each kernel module inside of cache entry.
Hook up dump() and dump_module() methods to debug options.
@steven-johnson Okay, testing this PR on the configs I have locally has everything passing. Can we re-enable testing on the Buildbots to see if there's any device or driver differences I need to catch? |
|
Done, please start testing |
I’ll investigate the three failures:
|
…ut validation errors.
Fixes added for #8466 |
@steven-johnson I believe the GPU lifetime and device memory leak errors are actually caused by the Validation Layer shared lib itself when invoked with ctest --parallel. It doesn't seem very robust at all. It's useful for diagnosing issues when they come up, but it doesn't seem production worthy. I'd like to suggest we remove the VK_INSTANCE_LAYERS env var from the buildbot config, and keep the Vulkan tests enabled. |
Done, change made and buildbot master restarted -- please try again :-) |
So, other than the build failures from LLVM20 interface changes, all the Vulkan tests and apps are actually passing now and reporting "SUCCESS". However, there's exception's being thrown at shutdown that I can't reproduce. |
(FYI: I'm turning off Vulkan testing for now since literally every buildbot is reporting failures with it and it makes finding "real" failures painful, lmk if this makes your work unreasonable and I will revert) EDIT: going to re-enable after offline discussion with @derek-gerstmann |
Okay, updated drivers didn't help. The Vulkan spec has a specific section on "Device Lost", which seems to imply that the device driver is allowed to do whatever it wants, and the app has to recover. I'll add some diagnostics and look at using the Debug extensions to see if we can handle this gracefully. |
FYI: I ssh'ed in and both systems are running the official driver @560 right now; 565 shows as available, but won't install due do a missing dep on libssl1.1. I don't want to mess with it without @abadams on site to support unless something goes wrong; also, a quick google shows several recent hits of people reporting instability with 565 on Ubuntu, so I don't want to touch it unless I hear specifically from you that you think it will move the needle, please LMK. |
Thanks! Yes, @abadams was kind enough to update the workers to @560. If there’s instability being reported we could drop to @550 which is the current release version. |
… of shutdown/cleanup.
Disable other Halide_TARGETS to speed up testing.
Register debug callbacks to try and diagnose potential driver issues. Update src/runtime/HalideRuntimeVulkan.h with guarded typedefs to match Vulkan header. Add VK_KHR_MAINTENANCE_5_EXTENSION_NAME to optional device extensions.
module dtor like all other runtimes.
Update comment in module destructor to match latest findings. Added issue link. Cleanup Vulkan handle initializations ... use VK_NULL_HANDLE.
@steven-johnson @abadams This is ready to merge. I reverted all the custom destructor stuff and we're back to using the module destructor like all other runtimes. As we discussed during the group meeting, we can disable Vulkan testing on the Linux build bots for now ... we'll need to test on other configs until there's a confirmed fix for #8497 |
Done
Are any of our other existing bots suitable or do we need a new one? |
Thanks!
We should be able to test on Windows. I'm building it locally now to confirm. We also plan on getting a Raspberry Pi 5 and add that as a new bot. |
In order to avoid using up all descriptor sets for complicated pipelines, this PR changes the Vulkan CodeGen such that we encode each Kernel entry-point into it's own SPIR-V module and bind them as separate shaders to avoid running out of resources on constrained devices.
Remove stale mini_vulkan.h and instead use official Vulkan Headers for runtime. Added local copy of the subset of ANSI-C headers we need to dependencies/vulkan, and a script to make it easy to update to new release branches (same as SPIR-V).
Split function pointer interfaces into three API categories ... Loader, Instance and Device. Now, only Loader functions exist for the lifetime of the module. Instance and Device function pointers are managed for the lifetime of the context, and Device function pointers now skip the instance call chain.
Changed VulkanMemoryAllocator to iterate across all memory types to allow all heaps to be utilized by using priority flags (rather than requiring all). Also reduced minimum allocation size to 16KB (instead of 32MB) to reduce memory pressure.
Fixed small integer encoding to be sign extended and packed to 32bits.
Removed global VkCommandPool from context since the API type definition changes between platforms, and we only use them for small transient command buffers that we destroy immediately. Use ScopedVulkanCommandPoolAndBuffer to avoid leaking if an error occurs within function scope.
Remove module attribute destructor (since we can't guarantee the driver doesn't register an atexit call which may get invoked before).
Fixes #7235
Fixes #8297
Fixes #8296
Fixes #8295
Fixes #8294
Fixes #8466
Re-enable performance wrap test #7559 .