
[vulkan] Reduce descriptor sets, use official headers, improve allocator, remove module destructor #8452

Merged (65 commits) on Dec 9, 2024

Conversation

@derek-gerstmann (Contributor) commented Oct 31, 2024

To avoid using up all descriptor sets for complicated pipelines, this PR changes the Vulkan CodeGen so that each Kernel entry point is encoded into its own SPIR-V module and bound as a separate shader, avoiding resource exhaustion on constrained devices.

Removed the stale mini_vulkan.h and switched the runtime to the official Vulkan Headers. Added a local copy of the subset of ANSI-C headers we need to dependencies/vulkan, and a script that makes it easy to update to new release branches (same as SPIR-V).

Split the function pointer interfaces into three API categories: Loader, Instance, and Device. Now only Loader functions exist for the lifetime of the module; Instance and Device function pointers are managed for the lifetime of the context, and Device function pointers now skip the instance call chain.

Changed VulkanMemoryAllocator to iterate across all memory types so that all heaps can be utilized, treating flags as priorities (rather than requiring all of them to match). Also reduced the minimum allocation size from 32MB to 16KB to reduce memory pressure.

Fixed small integer encoding to be sign extended and packed to 32 bits.

Removed the global VkCommandPool from the context, since the API type definition changes between platforms and we only use command pools for small transient command buffers that we destroy immediately. Use ScopedVulkanCommandPoolAndBuffer to avoid leaking if an error occurs within function scope.

Removed the module attribute destructor, since we can't guarantee the driver doesn't register an atexit call that may get invoked before it.

Fixes #7235
Fixes #8297
Fixes #8296
Fixes #8295
Fixes #8294
Fixes #8466

Re-enable performance wrap test #7559.

@derek-gerstmann (Contributor, Author)

@steven-johnson Okay, testing this PR on the configs I have locally shows everything passing. Can we re-enable testing on the buildbots to see if there are any device or driver differences I need to catch?

@steven-johnson (Contributor)

> @steven-johnson Okay, testing this PR on the configs I have locally shows everything passing. Can we re-enable testing on the buildbots to see if there are any device or driver differences I need to catch?

halide/build_bot#294

@steven-johnson (Contributor)

> @steven-johnson Okay, testing this PR on the configs I have locally shows everything passing. Can we re-enable testing on the buildbots to see if there are any device or driver differences I need to catch?

halide/build_bot#294

Done, please start testing

@derek-gerstmann (Contributor, Author)

I’ll investigate the three failures:

The following tests FAILED:
correctness_bounds 
correctness_debug_to_file
correctness_device_buffer_copies_with_profile

@derek-gerstmann (Contributor, Author)

Fixes added for #8466

@derek-gerstmann added the bug, gpu, documentation, and release_notes labels on Nov 6, 2024
@derek-gerstmann (Contributor, Author) commented Nov 7, 2024

@steven-johnson I believe the GPU lifetime and device memory leak errors are actually caused by the Validation Layer shared lib itself when invoked with ctest --parallel. It doesn't seem very robust at all. It's useful for diagnosing issues when they come up, but it doesn't seem production worthy. I'd like to suggest we remove the VK_INSTANCE_LAYERS env var from the buildbot config, and keep the Vulkan tests enabled.

@steven-johnson (Contributor)

> @steven-johnson I believe the GPU lifetime and device memory leak errors are actually caused by the Validation Layer shared lib itself when invoked with ctest --parallel. It doesn't seem very robust at all. It's useful for diagnosing issues when they come up, but it doesn't seem production worthy. I'd like to suggest we remove the VK_INSTANCE_LAYERS env var from the buildbot config, and keep the Vulkan tests enabled.

Done, change made and buildbot master restarted -- please try again :-)

@derek-gerstmann (Contributor, Author)

So, other than the build failures from the LLVM 20 interface changes, all the Vulkan tests and apps are actually passing now and reporting "SUCCESS". However, there are exceptions being thrown at shutdown that I can't reproduce.

@steven-johnson (Contributor) commented Dec 2, 2024

(FYI: I'm turning off Vulkan testing for now since literally every buildbot is reporting failures with it and it makes finding "real" failures painful, lmk if this makes your work unreasonable and I will revert)

EDIT: going to re-enable after offline discussion with @derek-gerstmann

@derek-gerstmann (Contributor, Author)

Okay, updated drivers didn't help. The Vulkan spec has a specific section on "Device Lost", which seems to imply that the device driver is allowed to do whatever it wants, and the app has to recover. I'll add some diagnostics and look at using the Debug extensions to see if we can handle this gracefully.

@steven-johnson (Contributor)

FYI: I ssh'ed in and both systems are running the official driver @560 right now; 565 shows as available, but won't install due to a missing dep on libssl1.1. I don't want to mess with it without @abadams on site to support unless something goes wrong; also, a quick google shows several recent hits of people reporting instability with 565 on Ubuntu, so I don't want to touch it unless I hear specifically from you that you think it will move the needle, please LMK.

@derek-gerstmann (Contributor, Author)

> FYI: I ssh'ed in and both systems are running the official driver @560 right now; 565 shows as available, but won't install due to a missing dep on libssl1.1. I don't want to mess with it without @abadams on site to support unless something goes wrong; also, a quick google shows several recent hits of people reporting instability with 565 on Ubuntu, so I don't want to touch it unless I hear specifically from you that you think it will move the needle, please LMK.

Thanks! Yes, @abadams was kind enough to update the workers to @560. If there’s instability being reported we could drop to @550 which is the current release version.

Derek Gerstmann added 10 commits on December 3, 2024, including:
- Disable other Halide_TARGETS to speed up testing.
- Register debug callbacks to try and diagnose potential driver issues.
- Update src/runtime/HalideRuntimeVulkan.h with guarded typedefs to match the Vulkan header.
- Add VK_KHR_MAINTENANCE_5_EXTENSION_NAME to optional device extensions.
Derek Gerstmann added 5 commits on December 9, 2024, including:
- Update comment in module destructor to match latest findings. Added issue link.
- Cleanup Vulkan handle initializations ... use VK_NULL_HANDLE.
@derek-gerstmann (Contributor, Author)

@steven-johnson @abadams This is ready to merge. I reverted all the custom destructor stuff and we're back to using the module destructor like all other runtimes. As we discussed during the group meeting, we can disable Vulkan testing on the Linux build bots for now ... we'll need to test on other configs until there's a confirmed fix for #8497

@steven-johnson (Contributor)

> we can disable Vulkan testing on the Linux build bots for now

Done

> we'll need to test on other configs until there's a confirmed fix for #8497

Are any of our other existing bots suitable or do we need a new one?

@derek-gerstmann (Contributor, Author)

> > we can disable Vulkan testing on the Linux build bots for now
>
> Done

Thanks!

> Are any of our other existing bots suitable or do we need a new one?

We should be able to test on Windows; I'm building it locally now to confirm. We also plan to get a Raspberry Pi 5 and add it as a new bot.
