
[vulkan] Reduce descriptor sets, use official headers, improve allocator, remove module destructor #8452

Merged (65 commits) on Dec 9, 2024

Conversation

@derek-gerstmann (Contributor) commented Oct 31, 2024

To avoid using up all descriptor sets for complicated pipelines, this PR changes the Vulkan CodeGen so that each Kernel entry point is encoded into its own SPIR-V module and bound as a separate shader, avoiding resource exhaustion on constrained devices.

Removed the stale mini_vulkan.h and switched the runtime to the official Vulkan Headers. Added a local copy of the subset of ANSI-C headers we need to dependencies/vulkan, and a script that makes it easy to update to new release branches (same as SPIR-V).

Split the function pointer interfaces into three API categories: Loader, Instance, and Device. Now only Loader functions exist for the lifetime of the module; Instance and Device function pointers are managed for the lifetime of the context, and Device function pointers now skip the instance call chain.

Changed VulkanMemoryAllocator to iterate across all memory types so that all heaps can be utilized, treating flags as priorities (rather than requiring all of them to match). Also reduced the minimum allocation size from 32MB to 16KB to reduce memory pressure.

Fixed small integer encoding to be sign extended and packed to 32 bits.

Removed the global VkCommandPool from the context, since the API type definition changes between platforms and we only use command pools for small transient command buffers that we destroy immediately. Use ScopedVulkanCommandPoolAndBuffer to avoid leaking if an error occurs within function scope.

Removed the module attribute destructor, since we can't guarantee the driver doesn't register an atexit call that may get invoked before it.

Fixes #7235
Fixes #8297
Fixes #8296
Fixes #8295
Fixes #8294
Fixes #8466

Re-enable performance wrap test #7559.

@derek-gerstmann (Contributor, Author)

@steven-johnson Okay, testing this PR on the configs I have locally shows everything passing. Can we re-enable testing on the buildbots to see if there are any device or driver differences I need to catch?

@steven-johnson (Contributor)

> @steven-johnson Okay, testing this PR on the configs I have locally shows everything passing. Can we re-enable testing on the buildbots to see if there are any device or driver differences I need to catch?

halide/build_bot#294

@steven-johnson (Contributor)

> @steven-johnson Okay, testing this PR on the configs I have locally shows everything passing. Can we re-enable testing on the buildbots to see if there are any device or driver differences I need to catch?

halide/build_bot#294

Done, please start testing

@derek-gerstmann (Contributor, Author)

I’ll investigate the three failures:

The following tests FAILED:
correctness_bounds 
correctness_debug_to_file
correctness_device_buffer_copies_with_profile

@derek-gerstmann (Contributor, Author)

Fixes added for #8466

@derek-gerstmann added the bug, gpu, documentation, and release_notes labels on Nov 6, 2024
@derek-gerstmann (Contributor, Author) commented Nov 7, 2024

@steven-johnson I believe the GPU lifetime and device memory leak errors are actually caused by the Validation Layer shared lib itself when invoked with ctest --parallel. It doesn't seem very robust at all. It's useful for diagnosing issues when they come up, but it doesn't seem production worthy. I'd like to suggest we remove the VK_INSTANCE_LAYERS env var from the buildbot config, and keep the Vulkan tests enabled.

@steven-johnson (Contributor)

> @steven-johnson I believe the GPU lifetime and device memory leak errors are actually caused by the Validation Layer shared lib itself when invoked with ctest --parallel. It doesn't seem very robust at all. It's useful for diagnosing issues when they come up, but it doesn't seem production worthy. I'd like to suggest we remove the VK_INSTANCE_LAYERS env var from the buildbot config, and keep the Vulkan tests enabled.

Done, change made and buildbot master restarted -- please try again :-)

@derek-gerstmann (Contributor, Author)

So, other than the build failures from the LLVM 20 interface changes, all the Vulkan tests and apps are actually passing now and reporting "SUCCESS". However, there are exceptions being thrown at shutdown that I can't reproduce.

@steven-johnson (Contributor) commented Dec 2, 2024

(FYI: I'm turning off Vulkan testing for now since literally every buildbot is reporting failures with it and it makes finding "real" failures painful, lmk if this makes your work unreasonable and I will revert)

EDIT: going to re-enable after offline discussion with @derek-gerstmann

@derek-gerstmann (Contributor, Author)

Okay, updated drivers didn't help. The Vulkan spec has a specific section on "Device Lost", which seems to imply that the device driver is allowed to do whatever it wants, and the app has to recover. I'll add some diagnostics and look at using the Debug extensions to see if we can handle this gracefully.

@steven-johnson (Contributor)

FYI: I ssh'ed in and both systems are running the official driver @560 right now; 565 shows as available, but won't install due to a missing dep on libssl1.1. I don't want to mess with it without @abadams on site to support unless something goes wrong; also, a quick google shows several recent hits of people reporting instability with 565 on Ubuntu, so I don't want to touch it unless I hear specifically from you that you think it will move the needle, please LMK.

@derek-gerstmann (Contributor, Author)

> FYI: I ssh'ed in and both systems are running the official driver @560 right now; 565 shows as available, but won't install due to a missing dep on libssl1.1. I don't want to mess with it without @abadams on site to support unless something goes wrong; also, a quick google shows several recent hits of people reporting instability with 565 on Ubuntu, so I don't want to touch it unless I hear specifically from you that you think it will move the needle, please LMK.

Thanks! Yes, @abadams was kind enough to update the workers to @560. If there’s instability being reported we could drop to @550 which is the current release version.

Derek Gerstmann added 10 commits on December 3, 2024, including:
- Disable other Halide_TARGETS to speed up testing.
- Register debug callbacks to try and diagnose potential driver issues.
- Update src/runtime/HalideRuntimeVulkan.h with guarded typedefs to match the Vulkan header.
- Add VK_KHR_MAINTENANCE_5_EXTENSION_NAME to optional device extensions.
Derek Gerstmann added 5 commits on December 9, 2024, including:
- Update comment in module destructor to match latest findings. Added issue link.
- Cleanup Vulkan handle initializations ... use VK_NULL_HANDLE.
@derek-gerstmann (Contributor, Author)

@steven-johnson @abadams This is ready to merge. I reverted all the custom destructor stuff and we're back to using the module destructor like all other runtimes. As we discussed during the group meeting, we can disable Vulkan testing on the Linux build bots for now ... we'll need to test on other configs until there's a confirmed fix for #8497

@steven-johnson (Contributor)

> we can disable Vulkan testing on the Linux build bots for now

Done

> we'll need to test on other configs until there's a confirmed fix for #8497

Are any of our other existing bots suitable or do we need a new one?

@derek-gerstmann (Contributor, Author)

> > we can disable Vulkan testing on the Linux build bots for now
>
> Done

Thanks!

> Are any of our other existing bots suitable or do we need a new one?

We should be able to test on Windows; I'm building it locally now to confirm. We also plan to get a Raspberry Pi 5 and add it as a new bot.
