Stream Compaction Status #3

Open
2 of 5 tasks
WilliamKHo opened this issue Nov 25, 2017 · 3 comments

@WilliamKHo
Owner

WilliamKHo commented Nov 25, 2017

branch

Okay, I finally brought my stream-compaction code into Pollux.

Bad news: It slows down rendering and resultant images are noticeably incorrect:

streamcompactiondegub02

Good news: It can definitely be made to work and we would likely see really good wins

Info:

  • 👍 Inspection of buffers through the GPU Frame Capture in Xcode confirms that the prefix-sum scan and scatter logic are working. (woot!) On top of that, compared to other kernel calls, the double-kernel scan-of-scans procedure takes virtually no time at all (~1100μs for 1 million elements on my 2013 MacBook!)

  • 👍 A slight bottleneck in the kern_evaluateRays kernel can be sped up with a little refactoring of the prefix sum scan kernel.

  • 👍 kernComputeIntersections and kernShadeMaterials actually do get faster as a result of stream compaction, due to early termination of coherent threads. The time cost of the stream compaction itself is just still too large.

  • 👎 Far and away the major bottleneck is the kernScatterRays and kernCopyBack step, in which I scattered the unterminated rays into the second Ray buffer and copied them back to set the bounces of all subsequent rays to 0. This was grossly naive, so the cost is not surprising. kernCopyBack can probably be replaced with a ping-pong step between ray bounces. kernScatterRays, however, is still a read-and-write-heavy kernel that will cost time, so the best way to offset that would be to better leverage the work that it allows us to skip.

  • 👎 The visual bug of the black light is caused by the fact that rays that hit lights are terminated and their remaining bounces are set to 0, which is currently the criterion by which rays get discarded. "Lit" rays thus never actually make it to FINAL_GATHER. RayCompaction could be tweaked to do a full partition on the array of all rays, hopefully a trivial task 😰 . This would probably also solve the striated visual bug above, which I believe is related to errors in the final buffer of arrays passed to FINAL_GATHER as a result of compaction.

  • 👎 One other way to leverage stream compaction is to reduce the number of threads we dispatch for kernComputeIntersections, kernShadeMaterials, and RayCompaction. As far as I can tell, this would require more synchronization between CPU and GPU, and multiple commandEncoders per iteration, since the number of compacted rays can't be known ahead of time. I don't know if this has any significant drawbacks.
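
For reference, the evaluate / scan / scatter pipeline described in the bullets above can be modeled on the CPU in a few lines. This is only a sketch for checking buffer contents against a frame capture, not the Metal kernels themselves. The `(ray_id, remaining_bounces)` tuple layout and the names `exclusive_scan` and `compact` are assumptions made for illustration.

```python
from itertools import accumulate


def exclusive_scan(flags):
    """Exclusive prefix sum: out[i] is the sum of flags[0..i-1]."""
    if not flags:
        return []
    return [0] + list(accumulate(flags))[:-1]


def compact(rays, is_alive):
    """Keep only the rays for which is_alive(ray) is true, preserving order.

    Mirrors the three GPU steps: evaluate (flags), scan (offsets), scatter.
    """
    flags = [1 if is_alive(ray) else 0 for ray in rays]  # evaluate step
    offsets = exclusive_scan(flags)                      # scan step
    count = (offsets[-1] + flags[-1]) if rays else 0     # number of survivors
    out = [None] * count
    for i, ray in enumerate(rays):                       # scatter step
        if flags[i]:
            out[offsets[i]] = ray
    return out, count


# Hypothetical ray layout: (ray_id, remaining_bounces).
rays = [(0, 3), (1, 0), (2, 1), (3, 0), (4, 2)]
alive, alive_count = compact(rays, lambda ray: ray[1] > 0)
```

Note that this version drops the terminated rays entirely, which is the same behavior that keeps "lit" rays from reaching FINAL_GATHER.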

TL;DR
Steps to make Stream Compaction fast enough and work correctly

  • Eliminate kernEvaluateRays kernel and refactor to include in first prefix sum scan (doesn't save much time overall, would increase code complexity unnecessarily)
  • Ping-Pong Ray buffers between ray bounce calculations.
  • ??? Second prefix-sum-scan on terminated rays so that final compacted Ray buffer is partitioned into unterminated and terminated rays. (I'm pretty sure this is the silver bullet)
  • ??? FINAL_GATHER at every ray bounce. This might be too expensive, and I actually think the above solution would be better.
  • ??? Find a way to dynamically change the number of threadGroups dispatched for ray bounce computations after eliminating terminated Rays
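
A CPU sketch of the partition idea and the dispatch-size arithmetic from the list above, assuming a hypothetical `(ray_id, remaining_bounces)` ray layout. The names `partition_rays` and `threadgroups_needed` are made up for illustration and do not exist in the branch.

```python
def exclusive_scan(values):
    """Return (exclusive prefix sums, total) for a list of integers."""
    out = []
    total = 0
    for v in values:
        out.append(total)
        total += v
    return out, total


def partition_rays(rays, is_alive):
    """Stable partition: unterminated rays first, terminated rays after them."""
    alive_flags = [1 if is_alive(ray) else 0 for ray in rays]
    dead_flags = [1 - f for f in alive_flags]
    alive_offsets, alive_count = exclusive_scan(alive_flags)
    dead_offsets, _ = exclusive_scan(dead_flags)  # the "second scan"
    out = [None] * len(rays)
    for i, ray in enumerate(rays):
        if alive_flags[i]:
            out[alive_offsets[i]] = ray
        else:
            # Terminated rays are packed directly behind the live ones,
            # so they are still present for a later gather pass.
            out[alive_count + dead_offsets[i]] = ray
    return out, alive_count


def threadgroups_needed(active_count, threads_per_group):
    """Ceiling division: threadgroups required to cover active_count rays."""
    return (active_count + threads_per_group - 1) // threads_per_group


rays = [(0, 3), (1, 0), (2, 1), (3, 0), (4, 2)]
partitioned, alive_count = partition_rays(rays, lambda ray: ray[1] > 0)
```

Since `dead_offsets[i]` always equals `i - alive_offsets[i]`, the second scan could in principle be replaced by that subtraction. Whether that is worth doing in the actual kernels is untested here.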
@WilliamKHo WilliamKHo self-assigned this Nov 25, 2017
@WilliamKHo WilliamKHo changed the title from "Stream Compaction Issues" to "Stream Compaction Status" Nov 25, 2017
@WilliamKHo
Owner Author

@YoussefV see this update. I'm pretty sure I can get this working the way it's supposed to tomorrow, but I should sleep tonight. Feel free to look at the branch and make comments/criticisms.

@WilliamKHo
Owner Author

It works, but it's too slow. I need to pass a buffer between shaders that records which rays were culled at each iteration, both to get better and earlier thread termination and so that kernScatterRays doesn't re-partition the entire array every time.
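
A rough CPU model of that idea: carry the live-ray count from one bounce to the next so that each pass only touches the live prefix, and swap the two ray buffers instead of copying back. The ray layout `(ray_id, remaining_bounces)`, the `bounce` callback, and the separate `finished` list are all assumptions made to keep the sketch short. The real kernels would need some equivalent place to keep terminated rays.

```python
def trace(rays, max_bounces, bounce, is_alive):
    """Ping-pong two buffers, processing only the live prefix on each pass."""
    src = list(rays)
    dst = [None] * len(rays)
    active = len(rays)   # carried between passes instead of being recomputed
    finished = []        # terminated rays, kept for a final gather
    pass_sizes = []      # how many rays each pass actually had to process

    for _ in range(max_bounces):
        if active == 0:
            break
        pass_sizes.append(active)
        write = 0
        for i in range(active):      # only the live prefix is read
            ray = bounce(src[i])
            if is_alive(ray):
                dst[write] = ray     # compacted into the other buffer
                write += 1
            else:
                finished.append(ray)
        src, dst = dst, src          # swap roles: no copy-back step
        active = write

    finished.extend(src[:active])    # rays still alive when the loop ends
    return finished, pass_sizes


# Hypothetical bounce: just consume one bounce from the ray's budget.
rays = [(0, 3), (1, 1), (2, 2)]
finished, pass_sizes = trace(
    rays,
    max_bounces=5,
    bounce=lambda ray: (ray[0], ray[1] - 1),
    is_alive=lambda ray: ray[1] > 0,
)
```

In this model the amount of work per pass shrinks as rays terminate, which is the win the compaction step is supposed to pay for.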

@WilliamKHo
Owner Author

screen shot 2017-11-25 at 5 52 21 pm

No ugly visual bugs though so that's good

@YVin3D YVin3D self-assigned this Nov 26, 2017
2 participants