Merge branch 'perf-tests'

Scthe · Aug 26, 2024 · 3acd5df
2 parents 501f019 + 91f8e51

Showing 33 changed files with 941 additions and 137 deletions.
diff --git a/README.md b/README.md
@@ -31,13 +31,15 @@ https://github.com/user-attachments/assets/02859b92-a940-42b6-8381-dcac4b81b4d4
     * The second pass is dispatched for every tile and blends its hair segments in a front-to-back order. Done by dividing each depth bin into slices, assigning segments to each, and blending.
         * It uses a task queue internally. Each "processor" grabs the next tile from a list once it's done with the current tile.
 * Separate [strand-space shading calculation](https://youtu.be/ool2E8SQPGU?si=T0YirLDpKp83CjD2&t=1339). Instead of calculating shading for every pixel, I precalculate the values for every strand. You can select how many points are shaded for each strand. The last point always fades to transparency for a nice, thin tip.
-    * **Kajiya-Kay diffuse, Marschner specular.** Although I do not calculate depth maps for lights, so TT lobe's weight is 0 by default. I like how the current initial scene looks and reconfiguring lights is booooring!
+    * **Kajiya-Kay diffuse, Marschner specular.** However, I do not calculate depth maps for lights, so TT lobe's weight is 0 by default. I like how the current initial scene looks and reconfiguring lights is booooring!
     * **Fake multiple scattering** [like in UE5](https://blog.selfshadow.com/publications/s2016-shading-course/karis/s2016_pbs_epic_hair.pdf#page=39). See "Physically based hair shading in Unreal" by Brian Karis slide 39 if SIGGRAPH does not allow link.
     * **Fake attenuation** mimicking [Beer–Lambert law](https://en.wikipedia.org/wiki/Beer%E2%80%93Lambert_law).
     * It also **casts and receives shadows as well as AO**. You can also randomize some settings for each strand.
-* [LOD](https://youtu.be/ool2E8SQPGU?si=Zv-1N5Y4-nWvlB6v&t=1643) - the user has strand% slider. In a production system, you would automate this and increase hair width with distance. The randomization happens [in my blender exporter](scripts/tfx_exporter.py).
+* [LOD](https://youtu.be/ool2E8SQPGU?si=Zv-1N5Y4-nWvlB6v&t=1643). The user has a strand% slider. In a production system, you would automate this and increase hair width with distance. The randomization happens [in my blender exporter](scripts/tfx_exporter.py).
+* [Tile sort](https://youtu.be/ool2E8SQPGU?si=85yOaqCmYkUR9nHL&t=1803). Ensures stable frametimes. Sorting is approximate (buckets).
 * Blender exporter for the older Blender hair system. It's actually the same file format as I've used in my TressFX ports ([1](https://github.com/Scthe/TressFX-OpenGL), [2](https://github.com/Scthe/WebFX), [3](https://github.com/Scthe/Rust-Vulkan-TressFX)).
 * Uses [Sintel Lite 2.57b](http://www.blendswap.com/blends/view/7093) by BenDansie as a 3D model. There were no changes to "make it work" or optimize. Only selecting how many points per each strand.
+    * You might notice that Sintel's hair is less dense than the one showcased in FIFA. This is actually not good as it means we have to process more depth bins/slices till the pixel/tile saturates. Reminds me of similar nonobvious tradeoffs from [Nanite WebGPU](https://github.com/Scthe/nanite-webgpu/tree/master). On the other hand, the tile pass is cheaper.
 
 ### Features: Physics simulation
 
@@ -66,22 +68,21 @@ Check [src/constants.ts](src/constants.ts) for full documentation.
 
 I'm using Robin Taillandier and Jon Valdes's presentation ["Every Strand Counts: Physics and Rendering Behind Frostbite’s Hair"](https://www.youtube.com/watch?v=ool2E8SQPGU) as a reference point.
 
-* No skinning to triangles. If a character has a beard, it should move based on the underlying mesh.
-* There is a [pass that takes all strands and writes their shaded values](https://youtu.be/ool2E8SQPGU?si=HKPzUIWsHh75qBps&t=1333) (in strand-space) into a buffer. I do this for every strand, Frostbite only for visible ones. This pass is entirely separate from rasterization.
-* No hair color from texture. The shading pass has the `strandIdx`, so it's a matter of fetching uv and sampling texture.
-* Frostbite uses a software rasterizer to write to a depth (and maybe normal) buffer. This is a bit of a problem because of how software rasterizers work. So I re-render the hair using a hardware rasterizer just for depth and normals. Only the color is software rasterized.
+* **No skinning to triangles.** If a character has a beard, it should move based on the underlying mesh.
+* We both have a [pass that takes all strands and writes their shaded values](https://youtu.be/ool2E8SQPGU?si=HKPzUIWsHh75qBps&t=1333) (in strand-space) into a buffer. I do this for every strand, **Frostbite only for visible ones**.
+* **No hair color from texture.** The shading pass has the `strandIdx`, so it's a matter of fetching uv and sampling texture. This tech was not needed for my demo app.
+* **Frostbite uses a software rasterizer to write to a depth (and maybe normal) buffer.** This is a problem because of how software rasterizers work. **So I re-render the hair using a hardware rasterizer just for depth and normals.** Only the color is software rasterized.
     * Depth is not a problem (just an atomic op on a separate buffer), normals are. However, the Frostbite presentation does not mention normals. Don't they need them for AO or other stuff? Hair shading can omit AO (I even have supplementary [Beer–Lambert law](https://en.wikipedia.org/wiki/Beer%E2%80%93Lambert_law) attenuation). But what about the skin from which the hair grows? Is it faked in diffuse texture? Or is the hair always dense?
     * I also use a hardware rasterizer to render hair into shadow maps. Again, it's not complicated, but someone would have to spend time writing it. And I can't be bothered.
-* No pre-sorting of tiles, which can result in some frames taking a bit longer than others.
-* No curly hair subdivisions.
-    * The algorithm they use is part of my Blender exporter. In Blender, each hair is a spline. I convert it to equidistant points. Although implementing this in software rasterizer is *a bit* different.
-* No specialized support for [headgear](https://youtu.be/ool2E8SQPGU?si=aAFV_WnUwxJPoIRM&t=2071) like headbands. In Frostbite it requires content authoring to mark selected points as non-dynamic.
-* No automatic LODs.Instead, you have a slider that works [exactly like Frostbite's system](https://youtu.be/ool2E8SQPGU?si=NTmreF8azhRz4sVB&t=1646). I randomize the strand order in my Blender exporter.
-* A different set of constraints. We both have stretch/length constraints and colliders (both Signed Distance Fields and primitives).
+* **No curly hair subdivisions.**
+    * The algorithm they use is part of my Blender exporter. In Blender, each hair is a spline. I convert it to equidistant points. However, implementing this in a software rasterizer is *a bit* different.
+* **No specialized support for [headgear](https://youtu.be/ool2E8SQPGU?si=aAFV_WnUwxJPoIRM&t=2071) like headbands.** Frostbite requires content authoring to mark selected points as non-dynamic.
+* **LOD is manual instead of automatic.** Frostbite [automatically calculates rendered strand count](https://youtu.be/ool2E8SQPGU?si=NTmreF8azhRz4sVB&t=1646). I give you control over this parameter.
+* **I simulate all hair strands. Frostbite can choose how much and interpolate the rest.**
+* **A different set of constraints.** We both have stretch/length constraints and colliders (both Signed Distance Fields and primitives).
     * I have extra global shape constraints, based on my experience with [TressFX](https://github.com/Scthe/Rust-Vulkan-TressFX). I assume that Frostbite also has this, but maybe under a different term (like "shape matching")?
     * Frostbite has a global length constraint.
     * We have different implementations for local shape constraints. Mine is based on "A Triangle Bending Constraint Model for Position-Based Dynamics" - [Kelager10](http://image.diku.dk/kenny/download/kelager.niebe.ea10.pdf).
-* I simulate all hair strands. Frostbite can choose how much and interpolate the rest.
 
 Some things were not explained in the presentation, so I gave my best guess. E.g. the aero grid update step takes wind and colliders as input.  But does it do fluid simulation for nice turbulence and vortexes? Possible, but not likely. I just mark 3 regions: lull (inside the mesh), half-lull (grid point is shielded by a collider, half strength), and full strength.
 
@@ -91,7 +92,7 @@ Ofc. I cannot rival Frostbite's performance. I am a single person and I have muc
 ## Usage
 
 * Firefox does not support WebGPU. Use Chrome instead.
-* Use the `[W, S, A, D]` keys to move and `[Z, SPACEBAR]` to fly up or down. `[Shift]` to move faster. `[E]` to toggle depth pyramid debug mode.
+* Use the `[W, S, A, D]` keys to move and `[Z, SPACEBAR]` to fly up or down. `[Shift]` to move faster.
 * As all browsers enforce VSync, use the "Profile" button for accurate timings.
 
 ### Running the app locally
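
Reviewer note: the README changes above pair the fine pass task queue ("each processor grabs the next tile from a list") with the new tile sort that "ensures stable frametimes". Below is a minimal CPU sketch of why the order of the tile list matters. It is not the repository's WGSL code. The function name `simulateTaskQueue` and the per-tile `tileCosts` values are hypothetical stand-ins for the time needed to blend each tile.

```typescript
// Hypothetical CPU model of a task queue shared by several "processors".
// tileCosts[i] is an assumed cost (e.g. segment count) of the i-th tile in the list.
function simulateTaskQueue(tileCosts: number[], processorCount: number): number {
  // busyUntil[p] = time at which processor p finishes its current tile
  const busyUntil: number[] = new Array(processorCount).fill(0);

  for (const cost of tileCosts) {
    // The processor that becomes free first grabs the next tile from the list.
    let next = 0;
    for (let p = 1; p < processorCount; p++) {
      if (busyUntil[p] < busyUntil[next]) next = p;
    }
    busyUntil[next] += cost;
  }

  // The pass is done only when the slowest processor is done.
  return Math.max(...busyUntil);
}

// One expensive tile at the end of the list vs. at the start of the list.
const costs = [1, 1, 1, 1, 8];
const unsorted = simulateTaskQueue(costs, 2);
const sorted = simulateTaskQueue([...costs].sort((a, b) => b - a), 2);
console.log({ unsorted, sorted });
```

In this toy input the expensive tile is picked up last when the list is unsorted, so one processor idles while the other finishes it. Processing the heaviest tiles first lets the cheap tiles fill the gaps.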

diff --git a/deno.json b/deno.json
@@ -2,7 +2,8 @@
   "tasks": {
     "start": "DENO_NO_PACKAGE_JSON=1 && deno run --allow-read=. --allow-write=. --unstable-webgpu src/index.deno.ts",
     "compile": "DENO_NO_PACKAGE_JSON=1 && deno compile --allow-read=. --allow-write=. --unstable-webgpu src/index.deno.ts",
-    "test": "DENO_NO_PACKAGE_JSON=1 && deno test --allow-read=. --allow-write=. --unstable-webgpu src"
+    "test": "DENO_NO_PACKAGE_JSON=1 && deno test --allow-read=. --allow-write=. --unstable-webgpu src",
+    "testSort": "DENO_NO_PACKAGE_JSON=1 && deno test --allow-read=. --allow-write=. --unstable-webgpu src/passes/swHair/hairTileSortPass.test.ts"
   },
   "imports": {
     "png": "https://deno.land/x/[email protected]/mod.ts",

diff --git a/makefile b/makefile
@@ -15,6 +15,9 @@ run:
 test:
 	$(DENO) task test
 
+testSort:
+	$(DENO) task testSort
+
 # Generate .exe
 compile:
 	$(DENO) task compile

diff --git a/src/constants.ts b/src/constants.ts
@@ -43,12 +43,18 @@ type RGBColor = [number, number, number];
 
 export const DISPLAY_MODE = {
   FINAL: 0,
+  /** Hair tiles using segment count per-tile buffer */
   TILES: 1,
-  HW_RENDER: 2,
-  USED_SLICES: 3,
-  DEPTH: 4,
-  NORMALS: 5,
-  AO: 6,
+  /** Hair tiles using PPLL */
+  TILES_PPLL: 2,
+  /** Hardware rasterize */
+  HW_RENDER: 3,
+  /** HairFinePass' slices per pixel. Not super accurate due to per pixel/tile early-out optimizations */
+  USED_SLICES: 4,
+  /** zBuffer clamped to sensible values */
+  DEPTH: 5,
+  NORMALS: 6,
+  AO: 7,
 };
 
 export type HairFile =
@@ -236,6 +242,10 @@ export const CONFIG = {
      */
     invalidTilesPerSegmentThreshold: 64,
 
+    ////// SORT PASS
+    sortBuckets: 64,
+    sortBucketSize: 16,
+
     ////// FINE PASS
     /** This is like slices per pixel in original Frostbite presentation, but the slices are inside each depth bin */
     slicesPerPixel: 8,
@@ -247,6 +257,10 @@ export const CONFIG = {
     finePassWorkgroupSizeX: 1,
     /** Where to store the PPLL slice heads data */
     sliceHeadsMemory: 'workgroup' as SliceHeadsMemory,
+    /** Given distance between pixel and strand, how to calculate alpha? Can be linear 0-1 from strand edge to middle. Or quadratic (faster, denser, but more error prone and 'blocky'). */
+    alphaQuadratic: false,
+    /** Alpha comes from pixel's distance to strand. Multiply to make strands "fatter". Faster pixel/tile convergence at the cost of anti-aliasing (fuzzy edges). */
+    alphaMultipler: 1.1,
 
     ////// LOD
     lodRenderPercent: 100, // LOD %. Fun fact, performance is NOT linear. Range [0..100]
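
Reviewer note: the new `alphaQuadratic` and `alphaMultipler` settings describe how a pixel's distance to a strand becomes alpha. The sketch below is one possible reading of those doc comments, not the shader's actual code. The function `strandAlpha`, its parameters, and the exact quadratic form `1 - x^2` (which could work on squared distance and so skip a square root) are assumptions made for illustration.

```typescript
const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

// Hypothetical helper. `distance` is the pixel's distance to the strand center line,
// `halfWidth` is half of the strand's width, both in the same units.
function strandAlpha(
  distance: number,
  halfWidth: number,
  quadratic: boolean,
  multiplier: number
): number {
  const x = distance / halfWidth; // 0 in the middle of the strand, 1 on its edge
  // Linear: alpha falls off evenly from middle to edge.
  // Quadratic (assumed form): 1 - x^2 is never smaller than 1 - x on [0, 1], so strands look denser.
  const base = quadratic ? 1 - x * x : 1 - x;
  // A multiplier above 1 makes strands "fatter"; clamp keeps alpha in [0, 1].
  return clamp01(base * multiplier);
}

console.log(strandAlpha(0.5, 1, false, 1)); // linear falloff
console.log(strandAlpha(0.5, 1, true, 1)); // quadratic falloff
```

Under these assumptions, higher alpha per segment means a pixel or tile saturates after fewer segments, which matches the "faster convergence" wording in the comments.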

diff --git a/src/passes/README.md b/src/passes/README.md
@@ -19,12 +19,13 @@ Passes:
    2. [ShadowMapPass](shadowMapPass) to update shadow map. Has separate `GPURenderPipeline` for meshes and hair. Uses a hardware rasterizer for hair, but you should change this if you have extra time.
    3. [DrawMeshesPass](drawMeshes) draws solid objects. This also includes a special code for the ball collider.
    4. [HairTilesPass](swHair/hairTilesPass.ts) software rasterizes hair segments into tiles. Or, to be more precise, into each tile's depth bins. Dispatches a thread for each hair segment.
-   5. [HairFinePass](swHair/hairFinePass.ts) software rasterizes each tile and writes the final pixel colors into the buffer. It contains the main part of the order-independent transparency implementation. It uses a task queue internally. Each "processor" grabs the next tile from a list once it's done with the current one. Dispatches a thread for each processor.
-   6. [HairCombinePass](hairCombine) writes the software-rasterized hair into the HDR texture. Has special code for debug modes.
-   7. Update depth and normal buffers using [hardware rasterizer](hwHair).
-   8. [AoPass](aoPass) - GTAO.
-   9. [HairShadingPass](hairShadingPass) updates the shading for each hair strand. Requires AO and normals. Dispatches a thread for each shading point on each hair strand.
-      1. You might consider moving this before the software rasterizer if you want.
+   5. [HairTileSortPass](swHair/hairTileSortPass.ts) sorts the tiles by the segment count (decreasing order). Used to better balance workload. The sorting is approximate (based on buckets).
+   6. [HairFinePass](swHair/hairFinePass.ts) software rasterizes each tile and writes the final pixel colors into the buffer. It contains the main part of the order-independent transparency implementation. It uses a task queue internally. Each "processor" grabs the next tile from a list once it's done with the current one. Dispatches a thread for each processor.
+   7. [HairCombinePass](hairCombine) writes the software-rasterized hair into the HDR texture. Has special code for debug modes.
+   8. Update depth and normal buffers using [hardware rasterizer](hwHair).
+   9. [AoPass](aoPass) - GTAO.
+   10. [HairShadingPass](hairShadingPass) updates the shading for each hair strand. Requires AO and normals. Dispatches a thread for each shading point on each hair strand.
+       1. You might consider moving this before the software rasterizer if you want.
 3. Finish
    1. [DrawGizmoPass](drawGizmo) renders the move gizmo for the ball collider.
    2. [DrawSdfColliderPass](drawSdfCollider) and [DrawGridDbgPass](drawGridDbg) are debug views for physics simulation.
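
Reviewer note: the pass list above says HairTileSortPass sorts tiles by segment count in decreasing order, approximately, based on buckets. A minimal CPU sketch of such a bucketed sort follows. It is not the pass's WGSL implementation. The function `sortTilesApprox` is hypothetical, and mapping `sortBuckets: 64` and `sortBucketSize: 16` to "number of buckets" and "segment counts per bucket" is an assumption based only on the config names.

```typescript
// Hypothetical CPU sketch of an approximate, bucket-based tile sort.
// Input: segment count for each tile. Output: tile indices, fullest buckets first.
function sortTilesApprox(
  segmentCounts: number[],
  bucketCount = 64, // assumed meaning of CONFIG sortBuckets
  bucketSize = 16 // assumed meaning of CONFIG sortBucketSize
): number[] {
  // Counts above the last bucket's range all land in the last bucket.
  const bucketOf = (count: number): number =>
    Math.min(Math.floor(count / bucketSize), bucketCount - 1);

  // 1. Histogram: how many tiles fall into each bucket.
  const tilesInBucket = new Array<number>(bucketCount).fill(0);
  for (const count of segmentCounts) tilesInBucket[bucketOf(count)]++;

  // 2. Write offsets: walk from the fullest bucket down to the emptiest one.
  const writeOffset = new Array<number>(bucketCount).fill(0);
  let offset = 0;
  for (let b = bucketCount - 1; b >= 0; b--) {
    writeOffset[b] = offset;
    offset += tilesInBucket[b];
  }

  // 3. Scatter each tile index into its bucket's range of the output list.
  const tileList = new Array<number>(segmentCounts.length).fill(0);
  segmentCounts.forEach((count, tileIdx) => {
    const b = bucketOf(count);
    tileList[writeOffset[b]++] = tileIdx;
  });
  return tileList;
}

console.log(sortTilesApprox([3, 200, 40, 17, 5000]));
console.log(sortTilesApprox([17, 30])); // same bucket, so their relative order is not sorted
```

Tiles inside one bucket keep their original order, which is what makes the result approximate. For load balancing this is usually enough, since only the rough "heavy tiles first" ordering matters.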

diff --git a/src/passes/_shared/shared.ts b/src/passes/_shared/shared.ts
@@ -77,3 +77,25 @@ export const useDepthStencilAttachment = (
     depthStoreOp,
   };
 };
+
+// TODO [LOW] use everywhere
+export const createComputePipeline = (
+  device: GPUDevice,
+  passClass: PassClass,
+  shaderText: string,
+  name = '',
+  mainFn = 'main'
+): GPUComputePipeline => {
+  const shaderModule = device.createShaderModule({
+    label: labelShader(passClass, name),
+    code: shaderText,
+  });
+  return device.createComputePipeline({
+    label: labelPipeline(passClass, name),
+    layout: 'auto',
+    compute: {
+      module: shaderModule,
+      entryPoint: mainFn,
+    },
+  });
+};
diff --git a/src/passes/hairCombine/hairCombinePass.ts b/src/passes/hairCombine/hairCombinePass.ts
@@ -91,6 +91,7 @@ export class HairCombinePass {
       hairTilesBuffer,
       hairTileSegmentsBuffer,
       hairRasterizerResultsBuffer,
+      hairSegmentCountPerTileBuffer,
     } = ctx;
     const b = SHADER_PARAMS.bindings;
 
@@ -99,6 +100,7 @@ export class HairCombinePass {
       bindBuffer(b.tilesBuffer, hairTilesBuffer),
       bindBuffer(b.tileSegmentsBuffer, hairTileSegmentsBuffer),
       bindBuffer(b.rasterizeResultBuffer, hairRasterizerResultsBuffer),
+      bindBuffer(b.segmentCountPerTile, hairSegmentCountPerTileBuffer),
     ]);
   };
 }
diff --git a/src/passes/hairCombine/hairCombinePass.wgsl.ts b/src/passes/hairCombine/hairCombinePass.wgsl.ts
@@ -5,13 +5,15 @@ import * as SHADER_SNIPPETS from '../_shaderSnippets/shaderSnippets.wgls.ts';
 import { BUFFER_HAIR_TILE_SEGMENTS } from '../swHair/shared/hairTileSegmentsBuffer.ts';
 import { BUFFER_HAIR_RASTERIZER_RESULTS } from '../swHair/shared/hairRasterizerResultBuffer.ts';
 import { SHADER_TILE_UTILS } from '../swHair/shaderImpl/tileUtils.wgsl.ts';
+import { BUFFER_SEGMENT_COUNT_PER_TILE } from '../swHair/shared/segmentCountPerTileBuffer.ts';
 
 export const SHADER_PARAMS = {
   bindings: {
     renderUniforms: 0,
     tilesBuffer: 1,
     tileSegmentsBuffer: 2,
     rasterizeResultBuffer: 3,
+    segmentCountPerTile: 4,
   },
 };
 
@@ -31,6 +33,7 @@ ${RenderUniformsBuffer.SHADER_SNIPPET(b.renderUniforms)}
 ${BUFFER_HAIR_TILES_RESULT(b.tilesBuffer, 'read')}
 ${BUFFER_HAIR_TILE_SEGMENTS(b.tileSegmentsBuffer, 'read')}
 ${BUFFER_HAIR_RASTERIZER_RESULTS(b.rasterizeResultBuffer, 'read')}
+${BUFFER_SEGMENT_COUNT_PER_TILE(b.segmentCountPerTile, 'read')}
 
 
 @vertex
@@ -62,8 +65,8 @@ fn main_fs(
   let tileXY = getHairTileXY_FromPx(fragPositionPx);
   let displayMode = getDisplayMode();
 
-  if (displayMode == DISPLAY_MODE_TILES) {
-    result.color = renderTileSegmentCount(viewportSizeU32, tileXY);
+  if (displayMode == DISPLAY_MODE_TILES || displayMode == DISPLAY_MODE_TILES_PPLL) {
+    result.color = renderTileSegmentCount(displayMode, viewportSizeU32, tileXY);
 
   } else {
     var color = vec4f(0.0, 0.0, 0.0, 1.0);
@@ -95,19 +98,26 @@ fn getDebugTileColor(tileXY: vec2u) -> vec4f {
 }
 
 fn renderTileSegmentCount(
+  displayMode: u32,
   viewportSize: vec2u,
   tileXY: vec2u
 ) -> vec4f {
   var color = vec4f(0.0, 0.0, 0.0, 1.0);
 
   // output: segment count in each tile normalized by UI provided value
   let maxSegmentsCount = getDbgTileModeMaxSegments();
-  let segments = getSegmentCountInTiles(viewportSize, maxSegmentsCount, tileXY);
+  var segments = 0u;
+  if (displayMode == DISPLAY_MODE_TILES) {
+    segments = getSegmentCountInTiles_Count(viewportSize, maxSegmentsCount, tileXY);
+  } else {
+    segments = getSegmentCountInTiles_PPLL(viewportSize, maxSegmentsCount, tileXY);
+  }
+  
   color.r = f32(segments) / f32(maxSegmentsCount);
   color.g = 1.0 - color.r;
 
   // dbg: tile bounds
-  // let tileIdx: u32 = getHairTileIdx(viewportSize, tileXY, 0u);
+  // let tileIdx: u32 = getHairTileDepthBinIdx(viewportSize, tileXY, 0u);
   // color.r = f32((tileIdx * 17) % 33) / 33.0;
   // color.a = 1.0;
   
@@ -117,7 +127,7 @@ fn renderTileSegmentCount(
   return color;
 }
 
-fn getSegmentCountInTiles(
+fn getSegmentCountInTiles_PPLL(
   viewportSize: vec2u,
   maxSegmentsCount: u32,
   tileXY: vec2u
@@ -142,4 +152,13 @@ fn getSegmentCountInTiles(
   return count;
 }
 
+fn getSegmentCountInTiles_Count(
+  viewportSize: vec2u,
+  maxSegmentsCount: u32,
+  tileXY: vec2u
+) -> u32 {
+  let tileIdx = getHairTileIdx(viewportSize, tileXY);
+  return _hairSegmentCountPerTile[tileIdx];
+}
+
 `;
diff --git a/src/passes/passCtx.ts b/src/passes/passCtx.ts
@@ -30,4 +30,6 @@ export interface PassCtx {
   hairTilesBuffer: GPUBuffer;
   hairTileSegmentsBuffer: GPUBuffer;
   hairRasterizerResultsBuffer: GPUBuffer;
+  hairTileListBuffer: GPUBuffer;
+  hairSegmentCountPerTileBuffer: GPUBuffer;
 }
diff --git a/src/passes/renderUniformsBuffer.ts b/src/passes/renderUniformsBuffer.ts
@@ -36,6 +36,7 @@ export class RenderUniformsBuffer {
 
     const DISPLAY_MODE_FINAL = ${DISPLAY_MODE.FINAL}u;
     const DISPLAY_MODE_TILES = ${DISPLAY_MODE.TILES}u;
+    const DISPLAY_MODE_TILES_PPLL = ${DISPLAY_MODE.TILES_PPLL}u;
     const DISPLAY_MODE_HW_RENDER = ${DISPLAY_MODE.HW_RENDER}u;
     const DISPLAY_MODE_USED_SLICES = ${DISPLAY_MODE.USED_SLICES}u;
     const DISPLAY_MODE_DEPTH = ${DISPLAY_MODE.DEPTH}u;
@@ -398,7 +399,10 @@ export class RenderUniformsBuffer {
     const hr = CONFIG.hairRender;
     let extraData = 0;
 
-    if (c.displayMode === DISPLAY_MODE.TILES) {
+    if (
+      c.displayMode === DISPLAY_MODE.TILES ||
+      c.displayMode === DISPLAY_MODE.TILES_PPLL
+    ) {
       extraData = hr.dbgTileModeMaxSegments;
     } else if (c.displayMode === DISPLAY_MODE.USED_SLICES) {
       extraData = hr.dbgSlicesModeMaxSlices;