diff --git a/README.md b/README.md index f5cb851..58d41c3 100644 --- a/README.md +++ b/README.md @@ -31,13 +31,15 @@ https://github.com/user-attachments/assets/02859b92-a940-42b6-8381-dcac4b81b4d4 * The second pass is dispatched for every tile and blends its hair segments in a front-to-back order. Done by dividing each depth bin into slices, assigning segments to each, and blending. * It uses a task queue internally. Each "processor" grabs the next tile from a list once it's done with the current tile. * Separate [strand-space shading calculation](https://youtu.be/ool2E8SQPGU?si=T0YirLDpKp83CjD2&t=1339). Instead of calculating shading for every pixel, I precalculate the values for every strand. You can select how many points are shaded for each strand. The last point always fades to transparency for a nice, thin tip. - * **Kajiya-Kay diffuse, Marschner specular.** Although I do not calculate depth maps for lights, so TT lobe's weight is 0 by default. I like how the current initial scene looks and reconfiguring lights is booooring! + * **Kajiya-Kay diffuse, Marschner specular.** However, I do not calculate depth maps for lights, so TT lobe's weight is 0 by default. I like how the current initial scene looks and reconfiguring lights is booooring! * **Fake multiple scattering** [like in UE5](https://blog.selfshadow.com/publications/s2016-shading-course/karis/s2016_pbs_epic_hair.pdf#page=39). See "Physically based hair shading in Unreal" by Brian Karis slide 39 if SIGGRAPH does not allow link. * **Fake attenuation** mimicking [Beer–Lambert law](https://en.wikipedia.org/wiki/Beer%E2%80%93Lambert_law). * It also **casts and receives shadows as well as AO**. You can also randomize some settings for each strand. -* [LOD](https://youtu.be/ool2E8SQPGU?si=Zv-1N5Y4-nWvlB6v&t=1643) - the user has strand% slider. In a production system, you would automate this and increase hair width with distance. The randomization happens [in my blender exporter](scripts/tfx_exporter.py). 
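The "fake attenuation mimicking Beer–Lambert law" mentioned above boils down to exponential falloff with traversed hair depth. A minimal TypeScript sketch of the idea (the function name and the scalar `absorption` coefficient are illustrative, not the repo's actual shader code, which may use per-channel values):

```typescript
// Beer–Lambert-style attenuation: light traveling through `depth` units of
// hair is attenuated exponentially. Larger absorption => darker inner strands.
function beerLambertTransmittance(depth: number, absorption: number): number {
  return Math.exp(-absorption * depth);
}

// Strands deeper inside the hair volume receive exponentially less light.
const near = beerLambertTransmittance(0.1, 2.0); // ~0.82
const far = beerLambertTransmittance(1.0, 2.0); // ~0.14
```

This is cheap enough to evaluate per shading point and gives the "self-shadowed" look without real light-space depth maps.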
+* [LOD](https://youtu.be/ool2E8SQPGU?si=Zv-1N5Y4-nWvlB6v&t=1643). The user has a strand% slider. In a production system, you would automate this and increase hair width with distance. The randomization happens [in my blender exporter](scripts/tfx_exporter.py). +* [Tile sort](https://youtu.be/ool2E8SQPGU?si=85yOaqCmYkUR9nHL&t=1803). Ensures stable frametimes. Sorting is approximate (buckets). * Blender exporter for the older Blender hair system. It's actually the same file format as I've used in my TressFX ports ([1](https://github.com/Scthe/TressFX-OpenGL), [2](https://github.com/Scthe/WebFX), [3](https://github.com/Scthe/Rust-Vulkan-TressFX)). * Uses [Sintel Lite 2.57b](http://www.blendswap.com/blends/view/7093) by BenDansie as a 3D model. There were no changes to "make it work" or optimize. Only selecting how many points per strand. + * You might notice that Sintel's hair is less dense than the one showcased in FIFA. This actually hurts performance, as it means we have to process more depth bins/slices until the pixel/tile saturates. Reminds me of similar non-obvious tradeoffs from [Nanite WebGPU](https://github.com/Scthe/nanite-webgpu/tree/master). On the other hand, the tile pass is cheaper. ### Features: Physics simulation @@ -66,22 +68,21 @@ Check [src/constants.ts](src/constants.ts) for full documentation. I'm using Robin Taillandier and Jon Valdes's presentation ["Every Strand Counts: Physics and Rendering Behind Frostbite’s Hair"](https://www.youtube.com/watch?v=ool2E8SQPGU) as a reference point. -* No skinning to triangles. If a character has a beard, it should move based on the underlying mesh. -* There is a [pass that takes all strands and writes their shaded values](https://youtu.be/ool2E8SQPGU?si=HKPzUIWsHh75qBps&t=1333) (in strand-space) into a buffer. I do this for every strand, Frostbite only for visible ones. This pass is entirely separate from rasterization. -* No hair color from texture. 
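The approximate tile sort above can be sketched on the CPU as a single-pass bucket sort: tiles are binned by segment count, then emitted heaviest-bucket first so the task queue starts with the most expensive tiles (names and defaults are illustrative; the defaults mirror the new `sortBuckets`/`sortBucketSize` config values):

```typescript
// Approximate descending sort via bucketing: O(n) instead of a full sort.
// Tiles within one bucket keep arbitrary order - good enough for balancing
// the fine pass workload across "processors".
function sortTilesByWorkload(
  segmentsPerTile: number[],
  bucketSize = 16,
  bucketCount = 64
): number[] {
  const buckets: number[][] = Array.from({ length: bucketCount }, () => []);
  segmentsPerTile.forEach((count, tileIdx) => {
    if (count === 0) return; // empty tiles are skipped entirely
    const b = Math.min(Math.floor(count / bucketSize), bucketCount - 1);
    buckets[b].push(tileIdx);
  });
  return buckets.reverse().flat(); // heaviest buckets first
}

sortTilesByWorkload([0, 1, 16, 17, 0, 0, 15, 0, 4096]);
// → [8, 2, 3, 1, 6]: tile 8 (4096 segments) first; tiles 2 and 3 land in the
// same bucket, so their relative order is not guaranteed - "approximate" sort
```

On the GPU this is split into two dispatches (count per bucket, then prefix offsets plus scatter), since there is no global ordering within a single compute pass.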
The shading pass has the `strandIdx`, so it's a matter of fetching uv and sampling texture. -* Frostbite uses a software rasterizer to write to a depth (and maybe normal) buffer. This is a bit of a problem because of how software rasterizers work. So I re-render the hair using a hardware rasterizer just for depth and normals. Only the color is software rasterized. +* **No skinning to triangles.** If a character has a beard, it should move based on the underlying mesh. +* We both have a [pass that takes all strands and writes their shaded values](https://youtu.be/ool2E8SQPGU?si=HKPzUIWsHh75qBps&t=1333) (in strand-space) into a buffer. I do this for every strand, **Frostbite only for visible ones**. +* **No hair color from texture.** The shading pass has the `strandIdx`, so it's a matter of fetching uv and sampling texture. This tech was not needed for my demo app. +* **Frostbite uses a software rasterizer to write to a depth (and maybe normal) buffer.** This is a problem because of how software rasterizers work. **So I re-render the hair using a hardware rasterizer just for depth and normals.** Only the color is software rasterized. * Depth is not a problem (just an atomic op on a separate buffer), normals are. However, the Frostbite presentation does not mention normals. Don't they need them for AO or other stuff? Hair shading can omit AO (I even have supplementary [Beer–Lambert law](https://en.wikipedia.org/wiki/Beer%E2%80%93Lambert_law) attenuation). But what about the skin from which the hair grows? Is it faked in diffuse texture? Or is the hair always dense? * I also use a hardware rasterizer to render hair into shadow maps. Again, it's not complicated, but someone would have to spend time writing it. And I can't be bothered. -* No pre-sorting of tiles, which can result in some frames taking a bit longer than others. -* No curly hair subdivisions. - * The algorithm they use is part of my Blender exporter. In Blender, each hair is a spline. 
I convert it to equidistant points. Although implementing this in software rasterizer is *a bit* different. -* No specialized support for [headgear](https://youtu.be/ool2E8SQPGU?si=aAFV_WnUwxJPoIRM&t=2071) like headbands. In Frostbite it requires content authoring to mark selected points as non-dynamic. -* No automatic LODs.Instead, you have a slider that works [exactly like Frostbite's system](https://youtu.be/ool2E8SQPGU?si=NTmreF8azhRz4sVB&t=1646). I randomize the strand order in my Blender exporter. -* A different set of constraints. We both have stretch/length constraints and colliders (both Signed Distance Fields and primitives). +* **No curly hair subdivisions.** + * The algorithm they use is part of my Blender exporter. In Blender, each hair is a spline. I convert it to equidistant points. However, implementing this in a software rasterizer is *a bit* different. +* **No specialized support for [headgear](https://youtu.be/ool2E8SQPGU?si=aAFV_WnUwxJPoIRM&t=2071) like headbands.** Frostbite requires content authoring to mark selected points as non-dynamic. +* **LOD is manual instead of automatic.** Frostbite [automatically calculates rendered strand count](https://youtu.be/ool2E8SQPGU?si=NTmreF8azhRz4sVB&t=1646). I give you control over this parameter. +* **I simulate all hair strands. Frostbite can choose how many and interpolate the rest.** +* **A different set of constraints.** We both have stretch/length constraints and colliders (both Signed Distance Fields and primitives). * I have extra global shape constraints, based on my experience with [TressFX](https://github.com/Scthe/Rust-Vulkan-TressFX). I assume that Frostbite also has this, but maybe under a different term (like "shape matching")? * Frostbite has a global length constraint. * We have different implementations for local shape constraints. Mine is based on "A Triangle Bending Constraint Model for Position-Based Dynamics" - [Kelager10](http://image.diku.dk/kenny/download/kelager.niebe.ea10.pdf). 
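For context, a stretch/length constraint in position-based dynamics projects two neighboring hair points back toward their rest distance. A minimal sketch under standard PBD assumptions (not the repo's actual implementation; names are illustrative):

```typescript
type Vec3 = [number, number, number];

// PBD distance constraint. `stiffness` in [0,1]; w0/w1 are inverse masses
// (w = 0 pins a point, e.g. the strand root attached to the scalp).
function stretchConstraint(
  p0: Vec3, p1: Vec3,
  restLength: number, stiffness: number,
  w0: number, w1: number
): [Vec3, Vec3] {
  const d: Vec3 = [p1[0] - p0[0], p1[1] - p0[1], p1[2] - p0[2]];
  const len = Math.hypot(d[0], d[1], d[2]);
  if (len === 0 || w0 + w1 === 0) return [p0, p1]; // degenerate or fully pinned
  // Correction magnitude, distributed proportionally to inverse masses.
  const s = (stiffness * (len - restLength)) / (len * (w0 + w1));
  const q0: Vec3 = [p0[0] + w0 * s * d[0], p0[1] + w0 * s * d[1], p0[2] + w0 * s * d[2]];
  const q1: Vec3 = [p1[0] - w1 * s * d[0], p1[1] - w1 * s * d[1], p1[2] - w1 * s * d[2]];
  return [q0, q1];
}
```

With the root pinned (`w0 = 0`), only the free point moves: a segment stretched to length 2 with `restLength = 1` snaps back to length 1 in one projection.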
-* I simulate all hair strands. Frostbite can choose how much and interpolate the rest. Some things were not explained in the presentation, so I gave my best guess. E.g. the aero grid update step takes wind and colliders as input. But does it do fluid simulation for nice turbulence and vortexes? Possible, but not likely. I just mark 3 regions: lull (inside the mesh), half-lull (grid point is shielded by a collider, half strength), and full strength. @@ -91,7 +92,7 @@ Ofc. I cannot rival Frostbite's performance. I am a single person and I have muc ## Usage * Firefox does not support WebGPU. Use Chrome instead. -* Use the `[W, S, A, D]` keys to move and `[Z, SPACEBAR]` to fly up or down. `[Shift]` to move faster. `[E]` to toggle depth pyramid debug mode. +* Use the `[W, S, A, D]` keys to move and `[Z, SPACEBAR]` to fly up or down. `[Shift]` to move faster. * As all browsers enforce VSync, use the "Profile" button for accurate timings. ### Running the app locally diff --git a/deno.json b/deno.json index c03c5be..97a62c6 100644 --- a/deno.json +++ b/deno.json @@ -2,7 +2,8 @@ "tasks": { "start": "DENO_NO_PACKAGE_JSON=1 && deno run --allow-read=. --allow-write=. --unstable-webgpu src/index.deno.ts", "compile": "DENO_NO_PACKAGE_JSON=1 && deno compile --allow-read=. --allow-write=. --unstable-webgpu src/index.deno.ts", - "test": "DENO_NO_PACKAGE_JSON=1 && deno test --allow-read=. --allow-write=. --unstable-webgpu src" + "test": "DENO_NO_PACKAGE_JSON=1 && deno test --allow-read=. --allow-write=. --unstable-webgpu src", + "testSort": "DENO_NO_PACKAGE_JSON=1 && deno test --allow-read=. --allow-write=. 
--unstable-webgpu src/passes/swHair/hairTileSortPass.test.ts" }, "imports": { "png": "https://deno.land/x/pngs@0.1.1/mod.ts", diff --git a/makefile b/makefile index f64d44a..8f4095a 100644 --- a/makefile +++ b/makefile @@ -15,6 +15,9 @@ run: test: $(DENO) task test +testSort: + $(DENO) task testSort + # Generate .exe compile: $(DENO) task compile diff --git a/src/constants.ts b/src/constants.ts index c3a8d86..f0fa0ce 100644 --- a/src/constants.ts +++ b/src/constants.ts @@ -43,12 +43,18 @@ type RGBColor = [number, number, number]; export const DISPLAY_MODE = { FINAL: 0, + /** Hair tiles using segment count per-tile buffer */ TILES: 1, - HW_RENDER: 2, - USED_SLICES: 3, - DEPTH: 4, - NORMALS: 5, - AO: 6, + /** Hair tiles using PPLL */ + TILES_PPLL: 2, + /** Hardware rasterizer */ + HW_RENDER: 3, + /** HairFinePass' slices per pixel. Not super accurate due to per pixel/tile early-out optimizations */ + USED_SLICES: 4, + /** zBuffer clamped to sensible values */ + DEPTH: 5, + NORMALS: 6, + AO: 7, }; export type HairFile = @@ -236,6 +242,10 @@ export const CONFIG = { */ invalidTilesPerSegmentThreshold: 64, + ////// SORT PASS + sortBuckets: 64, + sortBucketSize: 16, + ////// FINE PASS /** This is like slices per pixel in original Frostbite presentation, but the slices are inside each depth bin */ slicesPerPixel: 8, @@ -247,6 +257,10 @@ export const CONFIG = { finePassWorkgroupSizeX: 1, /** Where to store the PPLL slice heads data */ sliceHeadsMemory: 'workgroup' as SliceHeadsMemory, + /** Given distance between pixel and strand, how to calculate alpha? Can be linear 0-1 from strand edge to middle. Or quadratic (faster, denser, but more error prone and 'blocky'). */ + alphaQuadratic: false, + /** Alpha comes from pixel's distance to strand. Multiply to make strands "fatter". Faster pixel/tile convergence at the cost of anti-aliasing (fuzzy edges). */ + alphaMultipler: 1.1, ////// LOD lodRenderPercent: 100, // LOD %. Fun fact, performance is NOT linear. 
Range [0..100] diff --git a/src/passes/README.md b/src/passes/README.md index e5297fb..7fee5d3 100644 --- a/src/passes/README.md +++ b/src/passes/README.md @@ -19,12 +19,13 @@ Passes: 2. [ShadowMapPass](shadowMapPass) to update shadow map. Has separate `GPURenderPipeline` for meshes and hair. Uses a hardware rasterizer for hair, but you should change this if you have extra time. 3. [DrawMeshesPass](drawMeshes) draws solid objects. This also includes a special code for the ball collider. 4. [HairTilesPass](swHair/hairTilesPass.ts) software rasterizes hair segments into tiles. Or, to be more precise, into each tile's depth bins. Dispatches a thread for each hair segment. - 5. [HairFinePass](swHair/hairFinePass.ts) software rasterizes each tile and writes the final pixel colors into the buffer. It contains the main part of the order-independent transparency implementation. It uses a task queue internally. Each "processor" grabs the next tile from a list once it's done with the current one. Dispatches a thread for each processor. - 6. [HairCombinePass](hairCombine) writes the software-rasterized hair into the HDR texture. Has special code for debug modes. - 7. Update depth and normal buffers using [hardware rasterizer](hwHair). - 8. [AoPass](aoPass) - GTAO. - 9. [HairShadingPass](hairShadingPass) updates the shading for each hair strand. Requires AO and normals. Dispatches a thread for each shading point on each hair strand. - 1. You might consider moving this before the software rasterizer if you want. + 5. [HairTileSortPass](swHair/hairTileSortPass.ts) sorts the tiles by the segment count (decreasing order). Used to better balance workload. The sorting is approximate (based on buckets). + 6. [HairFinePass](swHair/hairFinePass.ts) software rasterizes each tile and writes the final pixel colors into the buffer. It contains the main part of the order-independent transparency implementation. It uses a task queue internally. 
Each "processor" grabs the next tile from a list once it's done with the current one. Dispatches a thread for each processor. + 7. [HairCombinePass](hairCombine) writes the software-rasterized hair into the HDR texture. Has special code for debug modes. + 8. Update depth and normal buffers using [hardware rasterizer](hwHair). + 9. [AoPass](aoPass) - GTAO. + 10. [HairShadingPass](hairShadingPass) updates the shading for each hair strand. Requires AO and normals. Dispatches a thread for each shading point on each hair strand. + 1. You might consider moving this before the software rasterizer if you want. 3. Finish 1. [DrawGizmoPass](drawGizmo) renders the move gizmo for the ball collider. 2. [DrawSdfColliderPass](drawSdfCollider) and [DrawGridDbgPass](drawGridDbg) are debug views for physics simulation. diff --git a/src/passes/_shared/shared.ts b/src/passes/_shared/shared.ts index d88f918..77d495e 100644 --- a/src/passes/_shared/shared.ts +++ b/src/passes/_shared/shared.ts @@ -77,3 +77,25 @@ export const useDepthStencilAttachment = ( depthStoreOp, }; }; + +// TODO [LOW] use everywhere +export const createComputePipeline = ( + device: GPUDevice, + passClass: PassClass, + shaderText: string, + name = '', + mainFn = 'main' +): GPUComputePipeline => { + const shaderModule = device.createShaderModule({ + label: labelShader(passClass, name), + code: shaderText, + }); + return device.createComputePipeline({ + label: labelPipeline(passClass, name), + layout: 'auto', + compute: { + module: shaderModule, + entryPoint: mainFn, + }, + }); +}; diff --git a/src/passes/hairCombine/hairCombinePass.ts b/src/passes/hairCombine/hairCombinePass.ts index 1101dac..1eb890d 100644 --- a/src/passes/hairCombine/hairCombinePass.ts +++ b/src/passes/hairCombine/hairCombinePass.ts @@ -91,6 +91,7 @@ export class HairCombinePass { hairTilesBuffer, hairTileSegmentsBuffer, hairRasterizerResultsBuffer, + hairSegmentCountPerTileBuffer, } = ctx; const b = SHADER_PARAMS.bindings; @@ -99,6 +100,7 @@ 
export class HairCombinePass { bindBuffer(b.tilesBuffer, hairTilesBuffer), bindBuffer(b.tileSegmentsBuffer, hairTileSegmentsBuffer), bindBuffer(b.rasterizeResultBuffer, hairRasterizerResultsBuffer), + bindBuffer(b.segmentCountPerTile, hairSegmentCountPerTileBuffer), ]); }; } diff --git a/src/passes/hairCombine/hairCombinePass.wgsl.ts b/src/passes/hairCombine/hairCombinePass.wgsl.ts index fcb806b..6e46457 100644 --- a/src/passes/hairCombine/hairCombinePass.wgsl.ts +++ b/src/passes/hairCombine/hairCombinePass.wgsl.ts @@ -5,6 +5,7 @@ import * as SHADER_SNIPPETS from '../_shaderSnippets/shaderSnippets.wgls.ts'; import { BUFFER_HAIR_TILE_SEGMENTS } from '../swHair/shared/hairTileSegmentsBuffer.ts'; import { BUFFER_HAIR_RASTERIZER_RESULTS } from '../swHair/shared/hairRasterizerResultBuffer.ts'; import { SHADER_TILE_UTILS } from '../swHair/shaderImpl/tileUtils.wgsl.ts'; +import { BUFFER_SEGMENT_COUNT_PER_TILE } from '../swHair/shared/segmentCountPerTileBuffer.ts'; export const SHADER_PARAMS = { bindings: { @@ -12,6 +13,7 @@ export const SHADER_PARAMS = { tilesBuffer: 1, tileSegmentsBuffer: 2, rasterizeResultBuffer: 3, + segmentCountPerTile: 4, }, }; @@ -31,6 +33,7 @@ ${RenderUniformsBuffer.SHADER_SNIPPET(b.renderUniforms)} ${BUFFER_HAIR_TILES_RESULT(b.tilesBuffer, 'read')} ${BUFFER_HAIR_TILE_SEGMENTS(b.tileSegmentsBuffer, 'read')} ${BUFFER_HAIR_RASTERIZER_RESULTS(b.rasterizeResultBuffer, 'read')} +${BUFFER_SEGMENT_COUNT_PER_TILE(b.segmentCountPerTile, 'read')} @vertex @@ -62,8 +65,8 @@ fn main_fs( let tileXY = getHairTileXY_FromPx(fragPositionPx); let displayMode = getDisplayMode(); - if (displayMode == DISPLAY_MODE_TILES) { - result.color = renderTileSegmentCount(viewportSizeU32, tileXY); + if (displayMode == DISPLAY_MODE_TILES || displayMode == DISPLAY_MODE_TILES_PPLL) { + result.color = renderTileSegmentCount(displayMode, viewportSizeU32, tileXY); } else { var color = vec4f(0.0, 0.0, 0.0, 1.0); @@ -95,6 +98,7 @@ fn getDebugTileColor(tileXY: vec2u) -> vec4f { } fn 
renderTileSegmentCount( + displayMode: u32, viewportSize: vec2u, tileXY: vec2u ) -> vec4f { @@ -102,12 +106,18 @@ fn renderTileSegmentCount( // output: segment count in each tile normalized by UI provided value let maxSegmentsCount = getDbgTileModeMaxSegments(); - let segments = getSegmentCountInTiles(viewportSize, maxSegmentsCount, tileXY); + var segments = 0u; + if (displayMode == DISPLAY_MODE_TILES) { + segments = getSegmentCountInTiles_Count(viewportSize, maxSegmentsCount, tileXY); + } else { + segments = getSegmentCountInTiles_PPLL(viewportSize, maxSegmentsCount, tileXY); + } + color.r = f32(segments) / f32(maxSegmentsCount); color.g = 1.0 - color.r; // dbg: tile bounds - // let tileIdx: u32 = getHairTileIdx(viewportSize, tileXY, 0u); + // let tileIdx: u32 = getHairTileDepthBinIdx(viewportSize, tileXY, 0u); // color.r = f32((tileIdx * 17) % 33) / 33.0; // color.a = 1.0; @@ -117,7 +127,7 @@ fn renderTileSegmentCount( return color; } -fn getSegmentCountInTiles( +fn getSegmentCountInTiles_PPLL( viewportSize: vec2u, maxSegmentsCount: u32, tileXY: vec2u @@ -142,4 +152,13 @@ fn getSegmentCountInTiles( return count; } +fn getSegmentCountInTiles_Count( + viewportSize: vec2u, + maxSegmentsCount: u32, + tileXY: vec2u +) -> u32 { + let tileIdx = getHairTileIdx(viewportSize, tileXY); + return _hairSegmentCountPerTile[tileIdx]; +} + `; diff --git a/src/passes/passCtx.ts b/src/passes/passCtx.ts index 871b60d..2187244 100644 --- a/src/passes/passCtx.ts +++ b/src/passes/passCtx.ts @@ -30,4 +30,6 @@ export interface PassCtx { hairTilesBuffer: GPUBuffer; hairTileSegmentsBuffer: GPUBuffer; hairRasterizerResultsBuffer: GPUBuffer; + hairTileListBuffer: GPUBuffer; + hairSegmentCountPerTileBuffer: GPUBuffer; } diff --git a/src/passes/renderUniformsBuffer.ts b/src/passes/renderUniformsBuffer.ts index 601ef94..4abf71e 100644 --- a/src/passes/renderUniformsBuffer.ts +++ b/src/passes/renderUniformsBuffer.ts @@ -36,6 +36,7 @@ export class RenderUniformsBuffer { const DISPLAY_MODE_FINAL = 
${DISPLAY_MODE.FINAL}u; const DISPLAY_MODE_TILES = ${DISPLAY_MODE.TILES}u; + const DISPLAY_MODE_TILES_PPLL = ${DISPLAY_MODE.TILES_PPLL}u; const DISPLAY_MODE_HW_RENDER = ${DISPLAY_MODE.HW_RENDER}u; const DISPLAY_MODE_USED_SLICES = ${DISPLAY_MODE.USED_SLICES}u; const DISPLAY_MODE_DEPTH = ${DISPLAY_MODE.DEPTH}u; @@ -398,7 +399,10 @@ export class RenderUniformsBuffer { const hr = CONFIG.hairRender; let extraData = 0; - if (c.displayMode === DISPLAY_MODE.TILES) { + if ( + c.displayMode === DISPLAY_MODE.TILES || + c.displayMode === DISPLAY_MODE.TILES_PPLL + ) { extraData = hr.dbgTileModeMaxSegments; } else if (c.displayMode === DISPLAY_MODE.USED_SLICES) { extraData = hr.dbgSlicesModeMaxSlices; diff --git a/src/passes/swHair/hairFinePass.ts b/src/passes/swHair/hairFinePass.ts index 03ffffa..2509217 100644 --- a/src/passes/swHair/hairFinePass.ts +++ b/src/passes/swHair/hairFinePass.ts @@ -4,6 +4,7 @@ import { Dimensions } from '../../utils/index.ts'; import { assertIsGPUTextureView, bindBuffer, + cmdClearWholeBuffer, getItemsPerThread, } from '../../utils/webgpu.ts'; import { BindingsCache } from '../_shared/bindingsCache.ts'; @@ -46,20 +47,11 @@ export class HairFinePass { this.hairSlicesDataBuffer = createHairSlicesDataBuffer(device); } - /** Clears to 0. We cannot select a number */ - clearFramebuffer(ctx: PassCtx) { + cmdClearBeforeRender(ctx: PassCtx) { if (this.hairSlicesHeadsBuffer) { - ctx.cmdBuf.clearBuffer( - this.hairSlicesHeadsBuffer, - 0, - this.hairSlicesHeadsBuffer.size - ); + cmdClearWholeBuffer(ctx.cmdBuf, this.hairSlicesHeadsBuffer); } - ctx.cmdBuf.clearBuffer( - this.hairRasterizerResultsBuffer, - 0, - this.hairRasterizerResultsBuffer.size - ); + cmdClearWholeBuffer(ctx.cmdBuf, this.hairRasterizerResultsBuffer); // TODO not needed? 
} onViewportResize = (device: GPUDevice, viewportSize: Dimensions) => { @@ -107,6 +99,7 @@ export class HairFinePass { depthTexture, hairTilesBuffer, hairTileSegmentsBuffer, + hairTileListBuffer, } = ctx; const b = SHADER_PARAMS.bindings; assertIsGPUTextureView(depthTexture); @@ -117,6 +110,7 @@ export class HairFinePass { bindBuffer(b.tileSegmentsBuffer, hairTileSegmentsBuffer), bindBuffer(b.hairSlicesData, this.hairSlicesDataBuffer), bindBuffer(b.rasterizerResult, this.hairRasterizerResultsBuffer), + bindBuffer(b.tileList, hairTileListBuffer), object.bindHairData(b.hairData), object.bindPointsPositions(b.hairPositions), object.bindTangents(b.hairTangents), @@ -124,7 +118,7 @@ { binding: b.depthTexture, resource: depthTexture }, ]; - // no needed if using local memory + // not needed if using workgroup/registers memory if (this.hairSlicesHeadsBuffer) { entries.push(bindBuffer(b.hairSlicesHeads, this.hairSlicesHeadsBuffer)); } diff --git a/src/passes/swHair/hairFinePass.wgsl.ts b/src/passes/swHair/hairFinePass.wgsl.ts index 64d4c92..c146fb3 100644 --- a/src/passes/swHair/hairFinePass.wgsl.ts +++ b/src/passes/swHair/hairFinePass.wgsl.ts @@ -14,6 +14,7 @@ import { BUFFER_HAIR_SHADING } from '../../scene/hair/hairShadingBuffer.ts'; import { SHADER_TILE_UTILS } from './shaderImpl/tileUtils.wgsl.ts'; import { SW_RASTERIZE_HAIR } from './shaderImpl/swRasterizeHair.wgsl.ts'; import { BUFFER_HAIR_TANGENTS } from '../../scene/hair/hairTangentsBuffer.ts'; +import { BUFFER_TILE_LIST } from './shared/tileListBuffer.ts'; export const SHADER_PARAMS = { workgroupSizeX: CONFIG.hairRender.finePassWorkgroupSizeX, @@ -29,6 +30,7 @@ export const SHADER_PARAMS = { depthTexture: 8, hairShading: 9, hairTangents: 10, + tileList: 11, }, }; @@ -61,14 +63,13 @@ ${BUFFER_HAIR_RASTERIZER_RESULTS(b.rasterizerResult, 'read_write')} ${BUFFER_HAIR_SLICES_HEADS(b.hairSlicesHeads, 'read_write')} ${BUFFER_HAIR_SLICES_DATA(b.hairSlicesData, 'read_write')} 
${BUFFER_HAIR_SHADING(b.hairShading, 'read')} +${BUFFER_TILE_LIST(b.tileList, 'read')} @group(0) @binding(${b.depthTexture}) var _depthTexture: texture_depth_2d; struct FineRasterParams { - viewModelMat: mat4x4f, - projMat: mat4x4f, // START: vec4u strandsCount: u32, // u32's first pointsPerStrand: u32, @@ -84,21 +85,24 @@ struct FineRasterParams { ${SHADER_IMPL_PROCESS_HAIR_SEGMENT()} ${SHADER_IMPL_REDUCE_HAIR_SLICES()} +var _local_invocation_index: u32; +var _tileStartOffset: u32; +var _isDone: bool; @compute @workgroup_size(${c.workgroupSizeX}, 1, 1) fn main( @builtin(global_invocation_id) global_id: vec3, + @builtin(local_invocation_index) local_invocation_index: u32, // threadId inside workgroup ) { let processorId = global_id.x; let viewportSize: vec2f = _uniforms.viewport.xy; let maxDrawnSegments: u32 = _uniforms.maxDrawnHairSegments; let strandsCount: u32 = _hairData.strandsCount; let pointsPerStrand: u32 = _hairData.pointsPerStrand; + _local_invocation_index = local_invocation_index; let params = FineRasterParams( - _uniforms.modelViewMat, - _uniforms.projMatrix, strandsCount, pointsPerStrand, vec2u(viewportSize), @@ -110,16 +114,23 @@ fn main( // clear memory before starting work _clearSlicesHeadPtrs(processorId); - + + // tile count based on screen size. 
Used to check if tile is valid let tileCount2d = getTileCount(params.viewportSizeU32); let tileCount = tileCount2d.x * tileCount2d.y; - var tileIdx = _getNextTileIdx(); + // size of task queue + let tilesToProcess = _hairTileData.drawnTiles; + var tileIdx = _getNextTileIdx(tilesToProcess); - while (tileIdx < tileCount) { + while (!workgroupUniformLoad(&_isDone)) { let tileXY = getTileXY(params.viewportSizeU32, tileIdx); var tileBoundsPx: vec4u = getTileBoundsPx(params.viewportSizeU32, tileXY); - for (var depthBin = 0u; depthBin < TILE_DEPTH_BINS_COUNT; depthBin += 1u) { + for ( + var depthBin = 0u; + depthBin < TILE_DEPTH_BINS_COUNT && tileIdx < tileCount; + depthBin += 1u + ) { let allPixelsDone = processTile( params, maxDrawnSegments, @@ -127,26 +138,29 @@ fn main( depthBin, &tileBoundsPx ); - if (allPixelsDone) { break; } + if (allPixelsDone) { // early out for whole tile + // debugColorWholeTile(tileBoundsPx, vec4f(1., 0., 0., 1.)); + break; + } } // move to next tile - tileIdx = _getNextTileIdx(); + tileIdx = _getNextTileIdx(tilesToProcess); } } fn processTile( - p: FineRasterParams, + params: FineRasterParams, maxDrawnSegments: u32, tileXY: vec2u, depthBin: u32, tileBoundsPx: ptr ) -> bool { - let MAX_PROCESSED_SEGMENTS = p.strandsCount * p.pointsPerStrand; // just in case + let MAX_PROCESSED_SEGMENTS = params.strandsCount * params.pointsPerStrand; // just in case - let tileDepth = _getTileDepth(p.viewportSizeU32, tileXY, depthBin); + let tileDepth = _getTileDepth(params.viewportSizeU32, tileXY, depthBin); if (tileDepth.y == 0.0) { return false; } // no depth written means empty tile - var segmentPtr = _getTileSegmentPtr(p.viewportSizeU32, tileXY, depthBin); + var segmentPtr = _getTileSegmentPtr(params.viewportSizeU32, tileXY, depthBin); var segmentData = vec3u(); // [strandIdx, segmentIdx, nextPtr] var processedSegmentCnt = 0u; @@ -157,7 +171,7 @@ fn processTile( while (processedSegmentCnt < MAX_PROCESSED_SEGMENTS){ if (_getTileSegment(maxDrawnSegments, 
segmentPtr, &segmentData)) { let writtenSliceDataCount = processHairSegment( - p, + params, (*tileBoundsPx), tileDepth, sliceDataOffset, segmentData.x, segmentData.y // strandIdx, segmentIdx @@ -183,9 +197,9 @@ fn processTile( // this also clears the current processor state for next tile // debugColorWholeTile(tileBoundsPx, vec4f(1., 0., 0., 1.)); let allPixelsDone = reduceHairSlices( - p.processorId, - p.viewportSizeU32, - p.dbgSlicesModeMaxSlices, + params.processorId, + params.viewportSizeU32, + params.dbgSlicesModeMaxSlices, tileBoundsPx ); diff --git a/src/passes/swHair/hairTileSortPass.countTiles.wgsl.ts b/src/passes/swHair/hairTileSortPass.countTiles.wgsl.ts new file mode 100644 index 0000000..46475d1 --- /dev/null +++ b/src/passes/swHair/hairTileSortPass.countTiles.wgsl.ts @@ -0,0 +1,99 @@ +import { RenderUniformsBuffer } from '../renderUniformsBuffer.ts'; +import { BUFFER_SEGMENT_COUNT_PER_TILE } from './shared/segmentCountPerTileBuffer.ts'; +import * as SHADER_SNIPPETS from '../_shaderSnippets/shaderSnippets.wgls.ts'; +import { SHADER_TILE_UTILS } from './shaderImpl/tileUtils.wgsl.ts'; +import { u32_type } from '../../utils/webgpu.ts'; +import { CONFIG } from '../../constants.ts'; + +export const SHADER_PARAMS = { + workgroupSizeX: 256, + bindings: { + renderUniforms: 0, + segmentCountPerTile: 1, + sortBuckets: 2, + }, +}; + +/////////////////////////// +/// SORT UTILS +/////////////////////////// + +export const SORT_BUCKETS_BUFFER = ( + bindingIdx: number, + pass: 'count-tiles' | 'sort' +) => /* wgsl */ ` + +const SORT_BUCKETS = ${CONFIG.hairRender.sortBuckets}u; +const BUCKET_SIZE = ${CONFIG.hairRender.sortBucketSize}u; + +fn calcTileSortBucket(segmentCount: u32) -> u32 { + let key = segmentCount / BUCKET_SIZE; + return clamp(key, 0u, SORT_BUCKETS - 1u); +} + +struct SortBucket { + // 1 pass: WRITE: inc for each tile that has segment count in this bucket + // 2 pass: READ: to get offsets + tileCount: ${u32_type(pass === 'count-tiles' ? 
'read_write' : 'read')}, + // 1 pass: - + // 2 pass: WRITE: in-bucket offsets + writeOffset: ${u32_type(pass === 'count-tiles' ? 'read' : 'read_write')}, +} + +@group(0) @binding(${bindingIdx}) +var _buckets: array; +`; + +/////////////////////////// +/// SHADER CODE +/////////////////////////// +const c = SHADER_PARAMS; +const b = SHADER_PARAMS.bindings; + +export const SHADER_CODE = () => /* wgsl */ ` + +${SHADER_SNIPPETS.GENERIC_UTILS} +${SHADER_TILE_UTILS} +${SORT_BUCKETS_BUFFER(b.sortBuckets, 'count-tiles')} + +${RenderUniformsBuffer.SHADER_SNIPPET(b.renderUniforms)} +${BUFFER_SEGMENT_COUNT_PER_TILE(b.segmentCountPerTile, 'read')} + + +@compute +@workgroup_size(${c.workgroupSizeX}, 1, 1) +fn main( + @builtin(global_invocation_id) global_id: vec3, + @builtin(local_invocation_index) local_invocation_index: u32, // threadId inside workgroup +) { + let tileIdx = global_id.x; + let viewportSize: vec2f = _uniforms.viewport.xy; + + /*if (local_invocation_index == 0u) { + for (var i = 0u; i < SORT_BUCKETS; i++) { + atomicStore(_subResults[i], 0u); + } + } + workgroupBarrier();*/ + + let screenTileCount_2d = getTileCount(vec2u(viewportSize)); + let screenTileCount = screenTileCount_2d.x * screenTileCount_2d.y; + let isValidTile = tileIdx < screenTileCount; + + let segmentCount = _hairSegmentCountPerTile[tileIdx]; + if (isValidTile && segmentCount > 0u) { + let sortBucket = calcTileSortBucket(segmentCount); + // atomicAdd(&_subResults[sortBucket], 1u); + atomicAdd(&_buckets[sortBucket].tileCount, 1u); + } + /*workgroupBarrier(); + + if (local_invocation_index == 0u) { + for (var i = 0u; i < SORT_BUCKETS; i++) { + let bucketValue = atomicLoad(_subResults[i]); + _segmentsInBucket[i] = bucketValue; + } + }*/ +} + +`; diff --git a/src/passes/swHair/hairTileSortPass.sort.wgsl.ts b/src/passes/swHair/hairTileSortPass.sort.wgsl.ts new file mode 100644 index 0000000..af589fc --- /dev/null +++ b/src/passes/swHair/hairTileSortPass.sort.wgsl.ts @@ -0,0 +1,77 @@ +import { 
RenderUniformsBuffer } from '../renderUniformsBuffer.ts'; +import { BUFFER_SEGMENT_COUNT_PER_TILE } from './shared/segmentCountPerTileBuffer.ts'; +import * as SHADER_SNIPPETS from '../_shaderSnippets/shaderSnippets.wgls.ts'; +import { SHADER_TILE_UTILS } from './shaderImpl/tileUtils.wgsl.ts'; +import { SORT_BUCKETS_BUFFER } from './hairTileSortPass.countTiles.wgsl.ts'; +import { BUFFER_TILE_LIST } from './shared/tileListBuffer.ts'; + +export const SHADER_PARAMS = { + workgroupSizeX: 256, + bindings: { + renderUniforms: 0, + segmentCountPerTile: 1, + tileList: 2, + sortBuckets: 3, + }, +}; + +/////////////////////////// +/// SHADER CODE +/////////////////////////// +const c = SHADER_PARAMS; +const b = SHADER_PARAMS.bindings; + +export const SHADER_CODE = () => /* wgsl */ ` + +${SHADER_SNIPPETS.GENERIC_UTILS} +${SHADER_TILE_UTILS} + +${RenderUniformsBuffer.SHADER_SNIPPET(b.renderUniforms)} +${BUFFER_SEGMENT_COUNT_PER_TILE(b.segmentCountPerTile, 'read')} +${BUFFER_TILE_LIST(b.tileList, 'read_write')} +${SORT_BUCKETS_BUFFER(b.sortBuckets, 'sort')} + + +var _bucketOffsets: array; + +@compute +@workgroup_size(${c.workgroupSizeX}, 1, 1) +fn main( + @builtin(global_invocation_id) global_id: vec3, + @builtin(local_invocation_index) local_invocation_index: u32, // threadId inside workgroup +) { + let tileIdx = global_id.x; + let viewportSize: vec2f = _uniforms.viewport.xy; + + if (local_invocation_index == 0u) { + calculateOffsetsForEachBucket(); + } + workgroupBarrier(); + + let screenTileCount_2d = getTileCount(vec2u(viewportSize)); + let screenTileCount = screenTileCount_2d.x * screenTileCount_2d.y; + let isValidTile = tileIdx < screenTileCount; + + let segmentCount = _hairSegmentCountPerTile[tileIdx]; + if (isValidTile && segmentCount > 0u) { + let sortBucket = calcTileSortBucket(segmentCount); + let bucketOffset = _bucketOffsets[sortBucket]; + let inBucketOffset = atomicAdd(&_buckets[sortBucket].writeOffset, 1u); + let offset = bucketOffset + inBucketOffset; + 
_hairTileData.data[offset] = tileIdx; + + // add to tile counter + atomicAdd(&_hairTileData.drawnTiles, 1u); + } +} + +fn calculateOffsetsForEachBucket() { + var offset = 0u; + for (var i = 0u; i < SORT_BUCKETS; i++) { + let idx = SORT_BUCKETS - 1u - i; // reverse sort: heavier tiles first + _bucketOffsets[idx] = offset; + offset += _buckets[idx].tileCount; + } +} + +`; diff --git a/src/passes/swHair/hairTileSortPass.test.ts b/src/passes/swHair/hairTileSortPass.test.ts new file mode 100644 index 0000000..78a03e0 --- /dev/null +++ b/src/passes/swHair/hairTileSortPass.test.ts @@ -0,0 +1,183 @@ +import { + createGpuDevice_TESTS, + createMockPassCtx, +} from '../../sys_deno/testUtils.ts'; +import { Dimensions, clamp } from '../../utils/index.ts'; +import { + cmdCopyToReadbackBuffer, + createGPU_StorageBuffer, + createReadbackBuffer, + readBufferToCPU, +} from '../../utils/webgpu.ts'; +import { RenderUniformsBuffer } from '../renderUniformsBuffer.ts'; +import { CONFIG } from '../../constants.ts'; +import { HairTileSortPass } from './hairTileSortPass.ts'; +import { + createHairTileListBuffer, + parseTileList, +} from './shared/tileListBuffer.ts'; +import { createArray } from '../../utils/arrays.ts'; +import { assertEquals, assertLessOrEqual } from 'assert'; + +const TILE_COUNT = 16; +const TEST_PREFFIX = 'test-tiles-sort'; + +Deno.test('HairTileSortPass', async () => { + CONFIG.hairRender.sortBuckets = 8; + const [device, reportWebGPUErrAsync] = await createGpuDevice_TESTS(); + + const viewportSize: Dimensions = { + width: TILE_COUNT * CONFIG.hairRender.tileSize, + height: 1, + }; + const uniforms = new RenderUniformsBuffer(device); + const pass = new HairTileSortPass(device); + pass.onViewportResize(device, viewportSize); + const bucketsReadbackBuffer = createReadbackBuffer( + device, + pass.bucketsDataBuffer + ); + + const hairTileListBuffer = createHairTileListBuffer(device, viewportSize); + const hairTileListReadbackBuffer = createReadbackBuffer( + device, + 
hairTileListBuffer + ); + + const segmentsPerTile = new Uint32Array(TILE_COUNT); + segmentsPerTile.fill(0); + segmentsPerTile[1] = 1; + segmentsPerTile[6] = 15; + segmentsPerTile[2] = 16; + segmentsPerTile[3] = 17; + segmentsPerTile[8] = 4096; + const hairSegmentCountPerTileBuffer = createGPU_StorageBuffer( + device, + 'test-hair-segments-per-tile', + segmentsPerTile + ); + + // start execute + const cmdBuf = device.createCommandEncoder(); + + // prepare params + const passCtx = createMockPassCtx(device, cmdBuf); + passCtx.globalUniforms = uniforms; + passCtx.viewport = viewportSize; + passCtx.hairSegmentCountPerTileBuffer = hairSegmentCountPerTileBuffer; + passCtx.hairTileListBuffer = hairTileListBuffer; + uniforms.update(passCtx); + + // execute pass + const computePass = cmdBuf.beginComputePass({ + label: `${TEST_PREFFIX}-compute-pass`, + }); + pass.cmdClearBeforeRender(passCtx); + pass.cmdSortHairTiles(passCtx); + computePass.end(); + + // finalize + cmdCopyToReadbackBuffer( + cmdBuf, + pass.bucketsDataBuffer, + bucketsReadbackBuffer + ); + cmdCopyToReadbackBuffer( + cmdBuf, + hairTileListBuffer, + hairTileListReadbackBuffer + ); + device.queue.submit([cmdBuf.finish()]); + + await reportWebGPUErrAsync(); + + // read back + const bucketsData = await readBufferToCPU(Uint32Array, bucketsReadbackBuffer); + // console.log('bucketsData', typedArr2str(bucketsData)); + const tilesListData = await readBufferToCPU( + Uint32Array, + hairTileListReadbackBuffer + ); + // console.log('tilesListData', parseTileList(tilesListData)); + + // cleanup + device.destroy(); + + // test pass 0 + const gpuBucketData = parseBucketData(bucketsData); + // printDebugBuckets(gpuBucketData); + assertBucketCount(segmentsPerTile, gpuBucketData); + + // test pass 1 + const tileListParsed = parseTileList(tilesListData); + const fullTileData = tileListParsed.data.map((tileIdx) => { + const segments = segmentsPerTile[tileIdx]; + return { + tileIdx, + bucket: calcTileSortBucket(segments), + 
segments, + }; + }); + // console.log('tilesListData', fullTileData); + for (let i = 1; i < fullTileData.length; i++) { + const prev = fullTileData[i - 1]; + const now = fullTileData[i]; + // bucket must not increase + assertLessOrEqual(now.bucket, prev.bucket); + } +}); + +type GPUBucketData = ReturnType<typeof parseBucketData>; + +function parseBucketData(data: Uint32Array) { + const { sortBuckets } = CONFIG.hairRender; + + return createArray(sortBuckets).map((_, i) => { + const tileCount = data[2 * i]; + const writeOffset = data[2 * i + 1]; + return { tileCount, writeOffset }; + }); +} + +// deno-lint-ignore no-unused-vars +function printDebugBuckets(bucketData: GPUBucketData) { + const { sortBucketSize } = CONFIG.hairRender; + const p = (a: number) => ' '.repeat(4 - String(a).length) + a; + + let start = 0; + bucketData.forEach(({ tileCount, writeOffset }) => { + const end = start + sortBucketSize - 1; + + const dataStr = + tileCount > 0 + ? `tileCount=${tileCount}, writeOffset=${writeOffset}` + : '-'; + console.log(`Bucket [${p(start)} : ${p(end)}] ${dataStr}`); + start += sortBucketSize; + }); +} + +function assertBucketCount( + segmentsPerTile: Uint32Array, + gpuBucketsData: GPUBucketData +) { + const expected: number[] = createArray(CONFIG.hairRender.sortBuckets); + expected.fill(0); + + segmentsPerTile.forEach((segmentCnt) => { + if (segmentCnt == 0) return; + const bucketIdx = calcTileSortBucket(segmentCnt); + expected[bucketIdx] += 1; + }); + + assertEquals( + expected, + gpuBucketsData.map((e) => e.tileCount) + ); +} + +function calcTileSortBucket(segmentCount: number) { + const { sortBuckets, sortBucketSize } = CONFIG.hairRender; + const key = Math.floor(segmentCount / sortBucketSize); + return clamp(key, 0, sortBuckets - 1); +} diff --git a/src/passes/swHair/hairTileSortPass.ts b/src/passes/swHair/hairTileSortPass.ts new file mode 100644 index 0000000..8d54fab --- /dev/null +++ b/src/passes/swHair/hairTileSortPass.ts @@ -0,0 +1,177 @@ +import { BYTES_U32, CONFIG } from
'../../constants.ts'; +import { Dimensions } from '../../utils/index.ts'; +import { + bindBuffer, + cmdClearWholeBuffer, + getItemsPerThread, +} from '../../utils/webgpu.ts'; +import { BindingsCache } from '../_shared/bindingsCache.ts'; +import { createComputePipeline, createLabel } from '../_shared/shared.ts'; +import { PassCtx } from '../passCtx.ts'; +import * as SHADER_COUNT_TILES from './hairTileSortPass.countTiles.wgsl.ts'; +import * as SHADER_SORT from './hairTileSortPass.sort.wgsl.ts'; +import { getTileCount } from './shared/utils.ts'; +import { createHairTileListBuffer } from './shared/tileListBuffer.ts'; +import { assignResourcesToBindings2 } from '../_shared/shared.ts'; + +const NAME_COUNT_TILES = 'count-tiles'; +const NAME_SORT = 'sort'; + +/** + * Usual approximate bucket sort (split tiles into buckets). + * + * Pass 1: Count tiles for each bucket. + * Pass 2: Write tiles to the array based on their bucket. E.g. + * Move stuff from bucket 0 to the start of the array. + * Then all stuff from bucket 1, etc. + * + * This pass is needed only as an optimization. See below for an older version without it: + * - https://github.com/Scthe/frostbitten-hair-webgpu/tree/501f01969b4bc65cb7df3b901c1ced4e2da0c84b + */ +export class HairTileSortPass { + public static NAME: string = 'HairTileSortPass'; + + private readonly pipelineCountPerBucket: GPUComputePipeline; + private readonly pipelineSort: GPUComputePipeline; + private readonly bindingsCache = new BindingsCache(); + + public tileListBuffer: GPUBuffer = undefined!; // see this.handleViewportResize() + public bucketsDataBuffer: GPUBuffer; + + constructor(device: GPUDevice) { + const extraUsage = CONFIG.isTest ?
GPUBufferUsage.COPY_SRC : 0; + this.bucketsDataBuffer = device.createBuffer({ + label: createLabel(HairTileSortPass, 'sortBuckets'), + size: CONFIG.hairRender.sortBuckets * 2 * BYTES_U32, + usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST | extraUsage, + }); + + this.pipelineCountPerBucket = createComputePipeline( + device, + HairTileSortPass, + SHADER_COUNT_TILES.SHADER_CODE(), + NAME_COUNT_TILES + ); + this.pipelineSort = createComputePipeline( + device, + HairTileSortPass, + SHADER_SORT.SHADER_CODE(), + NAME_SORT + ); + } + + cmdClearBeforeRender(ctx: PassCtx) { + if (this.tileListBuffer) { + ctx.cmdBuf.clearBuffer(this.tileListBuffer, 0, BYTES_U32); + } + cmdClearWholeBuffer(ctx.cmdBuf, this.bucketsDataBuffer); + } + + onViewportResize = (device: GPUDevice, viewportSize: Dimensions) => { + this.bindingsCache.clear(); + + if (this.tileListBuffer) { + this.tileListBuffer.destroy(); + } + + this.tileListBuffer = createHairTileListBuffer(device, viewportSize); + }; + + cmdSortHairTiles(ctx: PassCtx) { + const { cmdBuf, profiler } = ctx; + + const computePass = cmdBuf.beginComputePass({ + label: HairTileSortPass.NAME, + timestampWrites: profiler?.createScopeGpu(HairTileSortPass.NAME), + }); + + this.cmdCountTilesPerBucket(ctx, computePass); + this.cmdSort(ctx, computePass); + + computePass.end(); + } + + private getDispatchDims_EachTile(ctx: PassCtx, workgroupSize: number) { + const tileCount = getTileCount(ctx.viewport); + return getItemsPerThread(tileCount.width * tileCount.height, workgroupSize); + } + + ///////////////////////// + /// Count Tiles Per Bucket + + private cmdCountTilesPerBucket( + ctx: PassCtx, + computePass: GPUComputePassEncoder + ) { + const bindings = this.bindingsCache.getBindings(NAME_COUNT_TILES, () => + this.createBindings_countTilesPerBucket(ctx) + ); + computePass.setPipeline(this.pipelineCountPerBucket); + computePass.setBindGroup(0, bindings); + + // dispatch + const workgroupsX = this.getDispatchDims_EachTile( + ctx, + 
SHADER_COUNT_TILES.SHADER_PARAMS.workgroupSizeX + ); + computePass.dispatchWorkgroups(workgroupsX, 1, 1); + } + + private createBindings_countTilesPerBucket = (ctx: PassCtx): GPUBindGroup => { + const { device, globalUniforms, hairSegmentCountPerTileBuffer } = ctx; + const b = SHADER_COUNT_TILES.SHADER_PARAMS.bindings; + + return assignResourcesToBindings2( + HairTileSortPass, + NAME_COUNT_TILES, + device, + this.pipelineCountPerBucket, + [ + globalUniforms.createBindingDesc(b.renderUniforms), + bindBuffer(b.segmentCountPerTile, hairSegmentCountPerTileBuffer), + bindBuffer(b.sortBuckets, this.bucketsDataBuffer), + ] + ); + }; + + ///////////////////////// + /// Sort + + private cmdSort(ctx: PassCtx, computePass: GPUComputePassEncoder) { + const bindings = this.bindingsCache.getBindings(NAME_SORT, () => + this.createBindings_Sort(ctx) + ); + computePass.setPipeline(this.pipelineSort); + computePass.setBindGroup(0, bindings); + + // dispatch + const workgroupsX = this.getDispatchDims_EachTile( + ctx, + SHADER_SORT.SHADER_PARAMS.workgroupSizeX + ); + computePass.dispatchWorkgroups(workgroupsX, 1, 1); + } + + private createBindings_Sort = (ctx: PassCtx): GPUBindGroup => { + const { + device, + globalUniforms, + hairSegmentCountPerTileBuffer, + hairTileListBuffer, + } = ctx; + const b = SHADER_SORT.SHADER_PARAMS.bindings; + + return assignResourcesToBindings2( + HairTileSortPass, + NAME_SORT, + device, + this.pipelineSort, + [ + globalUniforms.createBindingDesc(b.renderUniforms), + bindBuffer(b.segmentCountPerTile, hairSegmentCountPerTileBuffer), + bindBuffer(b.tileList, hairTileListBuffer), + bindBuffer(b.sortBuckets, this.bucketsDataBuffer), + ] + ); + }; +} diff --git a/src/passes/swHair/hairTilesPass.ts b/src/passes/swHair/hairTilesPass.ts index 9618309..51c581f 100644 --- a/src/passes/swHair/hairTilesPass.ts +++ b/src/passes/swHair/hairTilesPass.ts @@ -4,6 +4,7 @@ import { Dimensions } from '../../utils/index.ts'; import { assertIsGPUTextureView, bindBuffer, + 
cmdClearWholeBuffer, getItemsPerThread, } from '../../utils/webgpu.ts'; import { BindingsCache } from '../_shared/bindingsCache.ts'; @@ -16,6 +17,7 @@ import { PassCtx } from '../passCtx.ts'; import { SHADER_CODE, SHADER_PARAMS } from './hairTilesPass.wgsl.ts'; import { createHairTileSegmentsBuffer } from './shared/hairTileSegmentsBuffer.ts'; import { createHairTilesResultBuffer } from './shared/hairTilesResultBuffer.ts'; +import { createHairSegmentCountPerTileBuffer } from './shared/segmentCountPerTileBuffer.ts'; export class HairTilesPass { public static NAME: string = 'HairTilesPass'; @@ -26,6 +28,7 @@ export class HairTilesPass { /** result framebuffer as flat buffer */ public hairTilesBuffer: GPUBuffer = undefined!; // see this.handleViewportResize() public hairTileSegmentsBuffer: GPUBuffer = undefined!; // see this.handleViewportResize() + public segmentCountPerTileBuffer: GPUBuffer = undefined!; // see this.handleViewportResize() constructor(device: GPUDevice) { const shaderModule = device.createShaderModule({ @@ -42,9 +45,9 @@ export class HairTilesPass { }); } - /** Clears to 0. 
We cannot select a number */ - clearFramebuffer(ctx: PassCtx) { - ctx.cmdBuf.clearBuffer(this.hairTilesBuffer, 0, this.hairTilesBuffer.size); + cmdClearBeforeRender(ctx: PassCtx) { + cmdClearWholeBuffer(ctx.cmdBuf, this.hairTilesBuffer); + cmdClearWholeBuffer(ctx.cmdBuf, this.segmentCountPerTileBuffer); ctx.cmdBuf.clearBuffer(this.hairTileSegmentsBuffer, 0, BYTES_U32); } @@ -57,12 +60,19 @@ export class HairTilesPass { if (this.hairTileSegmentsBuffer) { this.hairTileSegmentsBuffer.destroy(); } + if (this.segmentCountPerTileBuffer) { + this.segmentCountPerTileBuffer.destroy(); + } this.hairTilesBuffer = createHairTilesResultBuffer(device, viewportSize); this.hairTileSegmentsBuffer = createHairTileSegmentsBuffer( device, viewportSize ); + this.segmentCountPerTileBuffer = createHairSegmentCountPerTileBuffer( + device, + viewportSize + ); }; cmdDrawHairToTiles(ctx: PassCtx, hairObject: HairObject) { @@ -119,6 +129,7 @@ export class HairTilesPass { globalUniforms.createBindingDesc(b.renderUniforms), bindBuffer(b.tilesBuffer, this.hairTilesBuffer), bindBuffer(b.tileSegmentsBuffer, this.hairTileSegmentsBuffer), + bindBuffer(b.segmentCountPerTileBuffer, this.segmentCountPerTileBuffer), object.bindHairData(b.hairData), object.bindPointsPositions(b.hairPositions), object.bindTangents(b.hairTangents), diff --git a/src/passes/swHair/hairTilesPass.wgsl.ts b/src/passes/swHair/hairTilesPass.wgsl.ts index 71af885..cabdcfa 100644 --- a/src/passes/swHair/hairTilesPass.wgsl.ts +++ b/src/passes/swHair/hairTilesPass.wgsl.ts @@ -8,6 +8,7 @@ import { BUFFER_HAIR_TILES_RESULT } from './shared/hairTilesResultBuffer.ts'; import { BUFFER_HAIR_TILE_SEGMENTS } from './shared/hairTileSegmentsBuffer.ts'; import { SHADER_TILE_UTILS } from './shaderImpl/tileUtils.wgsl.ts'; import { CONFIG } from '../../constants.ts'; +import { BUFFER_SEGMENT_COUNT_PER_TILE } from './shared/segmentCountPerTileBuffer.ts'; /* 1) I've also tested per-strand dispatch version - 
https://github.com/Scthe/frostbitten-hair-webgpu/blob/d6306a69ab1cde4ef1321fc98c2040fd64ccac37/src/passes/swHair/hairTilesPass.perStrand.wgsl.ts . @@ -33,6 +34,7 @@ export const SHADER_PARAMS = { tilesBuffer: 4, depthTexture: 5, tileSegmentsBuffer: 6, + segmentCountPerTileBuffer: 7, }, }; @@ -56,6 +58,7 @@ ${BUFFER_HAIR_POINTS_POSITIONS(b.hairPositions)} ${BUFFER_HAIR_TANGENTS(b.hairTangents)} ${BUFFER_HAIR_TILES_RESULT(b.tilesBuffer, 'read_write')} ${BUFFER_HAIR_TILE_SEGMENTS(b.tileSegmentsBuffer, 'read_write')} +${BUFFER_SEGMENT_COUNT_PER_TILE(b.segmentCountPerTileBuffer, 'read_write')} @group(0) @binding(${b.depthTexture}) var _depthTexture: texture_depth_2d; @@ -228,6 +231,9 @@ fn processTile( nextPtr, prevPtr, strandIdx, segmentIdx ); + + // store for sorting + _incTileSegmentCount(viewportSize, tileXY); } } diff --git a/src/passes/swHair/shaderImpl/processHairSegment.wgsl.ts b/src/passes/swHair/shaderImpl/processHairSegment.wgsl.ts index 385f925..87aa6ab 100644 --- a/src/passes/swHair/shaderImpl/processHairSegment.wgsl.ts +++ b/src/passes/swHair/shaderImpl/processHairSegment.wgsl.ts @@ -9,21 +9,23 @@ for (var x: u32 = 0u; x < TILE_SIZE; x += 1u) { }} */ +import { CONFIG } from '../../../constants.ts'; + export const SHADER_IMPL_PROCESS_HAIR_SEGMENT = () => /* wgsl */ ` fn processHairSegment( - p: FineRasterParams, + params: FineRasterParams, tileBoundsPx: vec4u, tileDepth: vec2f, sliceDataOffset: u32, strandIdx: u32, segmentIdx: u32 ) -> u32 { var writtenSliceDataCount: u32 = 0u; - let segmentCount = p.pointsPerStrand - 1; + let segmentCount = params.pointsPerStrand - 1; let projParams = ProjectHairParams( - p.pointsPerStrand, - p.viewportSize, - p.fiberRadius, + params.pointsPerStrand, + params.viewportSize, + params.fiberRadius, ); let projSegm = projectHairSegment( projParams, @@ -80,11 +82,17 @@ fn processHairSegment( let interpW = interpolateHairQuad(projSegm, posPx); let t = interpW.y; // 0 .. 
1 let hairDepth: f32 = interpolateHairF32(interpW, projSegm.depthsProj); - // TODO [IGNORE] instead of linear, have quadratic interp? It makes strands "fatter", so user would provide lower fiber radius. Which is good for us. - let alpha = 1.0 - abs(interpW.x * 2. - 1.); // interpW.x is in 0..1. Turn it so strand middle is 1.0 and then 0.0 at edges. + + // interpW.x is in 0..1. Transform it so strand middle is 1.0 and then 0.0 at edges. + var alpha = 1.0 - abs(interpW.x * 2. - 1.); + if (${CONFIG.hairRender.alphaQuadratic}) { // see CONFIG docs + alpha = sqrt(alpha); + } + // optimization: -0.5ms with x1.1 'fatter' strands. Fills the pixel/tiles faster + alpha = saturate(alpha * ${CONFIG.hairRender.alphaMultipler}); // sample depth buffer, depth test with GL_LESS - let depthTextSamplePx: vec2i = vec2i(i32(posPx_u32.x), i32(p.viewportSize.y - y)); // wgpu's naga requiers vec2i.. + let depthTextSamplePx: vec2i = vec2i(i32(posPx_u32.x), i32(params.viewportSize.y - y)); // wgpu's naga requires vec2i.. let depthBufferValue: f32 = textureLoad(_depthTexture, depthTextSamplePx, 0); if (hairDepth >= depthBufferValue) { continue; @@ -93,14 +101,16 @@ // calculate final color let tFullStrand = (f32(segmentIdx) + t) / f32(segmentCount); // let color = vec4f(1.0 - t, t, 0.0, alpha); // red at root, green at tip + // Either shade here and store RGBA per slice or at least + // (strandIdx: u32, tFullStrand: f16, alpha: f16). + // Either way it's u32 for nextSlicePtr and 2*u32 for payload. var color = _sampleShading(strandIdx, tFullStrand); color.a = color.a * alpha; - let sliceIdx = getSliceIdx(tileDepth, hairDepth); - + // insert into per-slice linked list - // WARNING: Both lines below can be slow!
- let previousPtr: u32 = _setSlicesHeadPtr(p.processorId, pxInTile, sliceIdx, nextSliceDataPtr); - _setSliceData(p.processorId, nextSliceDataPtr, color, previousPtr); + let sliceIdx = getSliceIdx(tileDepth, hairDepth); + let previousPtr: u32 = _setSlicesHeadPtr(params.processorId, pxInTile, sliceIdx, nextSliceDataPtr); + _setSliceData(params.processorId, nextSliceDataPtr, color, previousPtr); writtenSliceDataCount += 1u; } CY0 += CC0.B; diff --git a/src/passes/swHair/shaderImpl/reduceHairSlices.wgsl.ts b/src/passes/swHair/shaderImpl/reduceHairSlices.wgsl.ts index 78d177c..4d6aef9 100644 --- a/src/passes/swHair/shaderImpl/reduceHairSlices.wgsl.ts +++ b/src/passes/swHair/shaderImpl/reduceHairSlices.wgsl.ts @@ -30,7 +30,7 @@ fn reduceHairSlices( var sliceCount = select(0u, u32(finalColor.r * f32(dbgSlicesModeMaxSlices)), isDbgSliceCnt); // debug value // START: ITERATE SLICES (front to back) - // TODO [NOW] is it faster if we get start/end values from 'processHairSegment'? ATM it's loop on consts, so might be quite fast. And only 4 iters with current settings.. + // We know the start/end slices from 'processHairSegment'. 
But iterating with consts is -0.4ms faster var s: u32 = 0u; for (; s < SLICES_PER_PIXEL; s += 1u) { if (isPixelDone(finalColor) && !isDbgSliceCnt) { diff --git a/src/passes/swHair/shaderImpl/tileUtils.wgsl.ts b/src/passes/swHair/shaderImpl/tileUtils.wgsl.ts index cc54895..f72f85c 100644 --- a/src/passes/swHair/shaderImpl/tileUtils.wgsl.ts +++ b/src/passes/swHair/shaderImpl/tileUtils.wgsl.ts @@ -12,15 +12,20 @@ fn getTileCount(viewportSize: vec2u) -> vec2u { ); } -fn getHairTileIdx(viewportSize: vec2u, tileXY: vec2u, depthBin: u32) -> u32 { +fn getHairTileDepthBinIdx(viewportSize: vec2u, tileXY: vec2u, depthBin: u32) -> u32 { let tileCount = getTileCount(viewportSize); return ( - tileXY.y * tileCount.x * TILE_DEPTH_BINS_COUNT + + tileXY.y * tileCount.x * TILE_DEPTH_BINS_COUNT + tileXY.x * TILE_DEPTH_BINS_COUNT + depthBin ); } +fn getHairTileIdx(viewportSize: vec2u, tileXY: vec2u) -> u32 { + let tileCount = getTileCount(viewportSize); + return tileXY.y * tileCount.x + tileXY.x; +} + /** Changes tileIdx into (tileX, tileY) coordinates (NOT IN PIXELS!) 
*/ fn getTileXY(viewportSize: vec2u, tileIdx: u32) -> vec2u { let tileCount = getTileCount(viewportSize); diff --git a/src/passes/swHair/shared/hairRasterizerResultBuffer.ts b/src/passes/swHair/shared/hairRasterizerResultBuffer.ts index e7e1e70..101aa72 100644 --- a/src/passes/swHair/shared/hairRasterizerResultBuffer.ts +++ b/src/passes/swHair/shared/hairRasterizerResultBuffer.ts @@ -18,10 +18,6 @@ fn _setRasterizerResult(viewportSize: vec2u, posPx: vec2u, color: vec4f) { let idx = viewportSize.x * posPx.y + posPx.x; _hairRasterizerResults.data[idx] = color; } - -fn _getNextTileIdx() -> u32 { - return atomicAdd(&_hairRasterizerResults.tileQueueAtomicIdx, 1u); -} `; export const BUFFER_HAIR_RASTERIZER_RESULTS = ( diff --git a/src/passes/swHair/shared/hairSliceHeadsBuffer.ts b/src/passes/swHair/shared/hairSliceHeadsBuffer.ts index c6c9738..c5c8123 100644 --- a/src/passes/swHair/shared/hairSliceHeadsBuffer.ts +++ b/src/passes/swHair/shared/hairSliceHeadsBuffer.ts @@ -7,12 +7,26 @@ const ENTRIES_PER_PROCESSOR = CONFIG.hairRender.tileSize * CONFIG.hairRender.tileSize * CONFIG.hairRender.slicesPerPixel; +const PROCESSOR_COUNT = CONFIG.hairRender.processorCount; const SLICE_HEADS_MEMORY = CONFIG.hairRender.sliceHeadsMemory; export const getLocalMemoryRequirements = () => SLICE_HEADS_MEMORY === 'workgroup' ? 
ENTRIES_PER_PROCESSOR * BYTES_U32 : 0; +export const MEMORY_PARALLEL_SIZE = () => { + if (SLICE_HEADS_MEMORY === 'workgroup') + return CONFIG.hairRender.finePassWorkgroupSizeX; + if (SLICE_HEADS_MEMORY === 'global') return PROCESSOR_COUNT; + return 1; // registers +}; + +export const MEMORY_PROCESSOR_OFFSET = () => { + if (SLICE_HEADS_MEMORY === 'workgroup') return '_local_invocation_index'; + if (SLICE_HEADS_MEMORY === 'global') return 'processorId'; + return '0u'; // registers +}; + /////////////////////////// /// SHADER CODE - SHARED - UTILS /////////////////////////// @@ -25,13 +39,13 @@ fn _getHeadsSliceIdx( processorId: u32, pixelInTile: vec2u, sliceIdx: u32, ) -> u32 { - let offset = _getHeadsProcessorOffset(processorId); - let offsetInProcessor = ( - pixelInTile.y * TILE_SIZE * SLICES_PER_PIXEL + - pixelInTile.x * SLICES_PER_PIXEL + - sliceIdx + let OFFSET = ${MEMORY_PARALLEL_SIZE()}u; + return ( + pixelInTile.y * OFFSET * TILE_SIZE * SLICES_PER_PIXEL + + pixelInTile.x * OFFSET * SLICES_PER_PIXEL + + sliceIdx * OFFSET + + ${MEMORY_PROCESSOR_OFFSET()} ); - return offset + offsetInProcessor; } fn _setSlicesHeadPtr( @@ -54,12 +68,12 @@ fn _getSlicesHeadPtr( } fn _clearSlicesHeadPtrs(processorId: u32) { - let offset = _getHeadsProcessorOffset(processorId); - let count = ${ENTRIES_PER_PROCESSOR}u; - - for (var i: u32 = 0u; i < count; i += 1u) { - _hairSliceHeads[offset + i] = INVALID_SLICE_DATA_PTR; - } + for (var y: u32 = 0u; y < TILE_SIZE; y += 1u) { + for (var x: u32 = 0u; x < TILE_SIZE; x += 1u) { + for (var s: u32 = 0u; s < SLICES_PER_PIXEL; s += 1u) { + _clearSliceHeadPtr(processorId, vec2u(x, y), s); + } + }} } fn _clearSliceHeadPtr( @@ -94,10 +108,6 @@ const BUFFER_HAIR_SLICES_HEADS_GLOBAL = ( @group(0) @binding(${bindingIdx}) var _hairSliceHeads: array; -fn _getHeadsProcessorOffset(processorId: u32) -> u32 { - return processorId * ${ENTRIES_PER_PROCESSOR}; -} - ${SHARED_UTILS} `; @@ -107,17 +117,17 @@ ${SHARED_UTILS} const LOCAL_MEMORY_ACCESS = 
SLICE_HEADS_MEMORY === 'workgroup' ? 'workgroup' : 'private'; +const LOCAL_MEMORY_SIZE = + SLICE_HEADS_MEMORY === 'workgroup' + ? ENTRIES_PER_PROCESSOR * CONFIG.hairRender.finePassWorkgroupSizeX + : ENTRIES_PER_PROCESSOR; const BUFFER_HAIR_SLICES_HEADS_LOCAL = ( _bindingIdx: number, _access: 'read_write' ) => /* wgsl */ ` -var<${LOCAL_MEMORY_ACCESS}> _hairSliceHeads: array; - -fn _getHeadsProcessorOffset(processorId: u32) -> u32 { - return 0u; -} +var<${LOCAL_MEMORY_ACCESS}> _hairSliceHeads: array; ${SHARED_UTILS} `; @@ -138,17 +148,12 @@ function createHairSlicesHeadsBuffer_GLOBAL(device: GPUDevice): GPUBuffer { return device.createBuffer({ label: `hair-slices-heads`, - size, + size: Math.max(size, WEBGPU_MINIMAL_BUFFER_SIZE), usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST, }); } function createHairSlicesHeadsBuffer_LOCAL(_device: GPUDevice): undefined { - const { finePassWorkgroupSizeX: grSize, sliceHeadsMemory } = - CONFIG.hairRender; - if (grSize !== 1 && sliceHeadsMemory === 'workgroup') { - throw new Error(`Expected finePassWorkgroupSizeX to be 1, was ${grSize}`); - } calcMemoryReqs(); return undefined; } @@ -159,12 +164,11 @@ export const createHairSlicesHeadsBuffer: Allocator = : createHairSlicesHeadsBuffer_LOCAL; function calcMemoryReqs() { - const { tileSize, slicesPerPixel, processorCount } = CONFIG.hairRender; + const { processorCount } = CONFIG.hairRender; - const entriesPerProcessor = tileSize * tileSize * slicesPerPixel; - const entries = processorCount * entriesPerProcessor; + const entries = processorCount * ENTRIES_PER_PROCESSOR; const bytesPerEntry = BYTES_U32; - const size = Math.max(entries * bytesPerEntry, WEBGPU_MINIMAL_BUFFER_SIZE); + const size = entries * bytesPerEntry; const memRegionNames: Record = { global: 'VRAM', @@ -176,7 +180,7 @@ function calcMemoryReqs() { STATS.update('Slices heads', `${memRegionName} ${formatBytes(size)}`); STATS.update( ' \\ Per processor', - formatBytes(entriesPerProcessor * bytesPerEntry) + 
formatBytes(ENTRIES_PER_PROCESSOR * bytesPerEntry) ); return size; diff --git a/src/passes/swHair/shared/hairTileSegmentsBuffer.ts b/src/passes/swHair/shared/hairTileSegmentsBuffer.ts index 47879f1..bc13b7e 100644 --- a/src/passes/swHair/shared/hairTileSegmentsBuffer.ts +++ b/src/passes/swHair/shared/hairTileSegmentsBuffer.ts @@ -4,7 +4,7 @@ import { WEBGPU_MINIMAL_BUFFER_SIZE, u32_type, } from '../../../utils/webgpu.ts'; -import { getTileCount } from './hairTilesResultBuffer.ts'; +import { getTileCount } from './utils.ts'; import { Dimensions } from '../../../utils/index.ts'; import { STATS } from '../../../stats.ts'; import { formatBytes } from '../../../utils/string.ts'; diff --git a/src/passes/swHair/shared/hairTilesResultBuffer.ts b/src/passes/swHair/shared/hairTilesResultBuffer.ts index ff5bd8b..c4fa1d7 100644 --- a/src/passes/swHair/shared/hairTilesResultBuffer.ts +++ b/src/passes/swHair/shared/hairTilesResultBuffer.ts @@ -1,8 +1,9 @@ import { BYTES_U32, CONFIG } from '../../../constants.ts'; import { STATS } from '../../../stats.ts'; -import { Dimensions, divideCeil } from '../../../utils/index.ts'; +import { Dimensions } from '../../../utils/index.ts'; import { formatBytes } from '../../../utils/string.ts'; import { StorageAccess, u32_type } from '../../../utils/webgpu.ts'; +import { getTileCount } from './utils.ts'; /////////////////////////// /// SHADER CODE @@ -16,7 +17,7 @@ fn _storeTileHead( depthMin: f32, depthMax: f32, nextPtr: u32 ) -> u32 { - let tileIdx: u32 = getHairTileIdx(viewportSize, tileXY, depthBin); + let tileIdx: u32 = getHairTileDepthBinIdx(viewportSize, tileXY, depthBin); // store depth // TODO [IGNORE] low precision. 
Convert this into 0-1 inside the bounding sphere and then quantisize @@ -44,7 +45,7 @@ fn _storeTileHead( const getTileDepth = /* wgsl */ ` fn _getTileDepth(viewportSize: vec2u, tileXY: vec2u, depthBin: u32) -> vec2f { - let tileIdx: u32 = getHairTileIdx(viewportSize, tileXY, depthBin); + let tileIdx: u32 = getHairTileDepthBinIdx(viewportSize, tileXY, depthBin); let tile = _hairTilesResult[tileIdx]; return vec2f( f32(MAX_U32 - tile.minDepth) / f32(MAX_U32), @@ -53,7 +54,7 @@ fn _getTileDepth(viewportSize: vec2u, tileXY: vec2u, depthBin: u32) -> vec2f { } fn _getTileSegmentPtr(viewportSize: vec2u, tileXY: vec2u, depthBin: u32) -> u32 { - let tileIdx: u32 = getHairTileIdx(viewportSize, tileXY, depthBin); + let tileIdx: u32 = getHairTileDepthBinIdx(viewportSize, tileXY, depthBin); let myPtr = _hairTilesResult[tileIdx].tileSegmentPtr; return _translateHeadPointer(myPtr); } @@ -96,14 +97,6 @@ fn _translateHeadPointer(segmentPtr: u32) -> u32 { /// GPU BUFFER /////////////////////////// -export const getTileCount = (viewportSize: Dimensions): Dimensions => { - const { tileSize } = CONFIG.hairRender; - return { - width: divideCeil(viewportSize.width, tileSize), - height: divideCeil(viewportSize.height, tileSize), - }; -}; - export function createHairTilesResultBuffer( device: GPUDevice, viewportSize: Dimensions diff --git a/src/passes/swHair/shared/segmentCountPerTileBuffer.ts b/src/passes/swHair/shared/segmentCountPerTileBuffer.ts new file mode 100644 index 0000000..8daf563 --- /dev/null +++ b/src/passes/swHair/shared/segmentCountPerTileBuffer.ts @@ -0,0 +1,45 @@ +import { BYTES_U32 } from '../../../constants.ts'; +import { Dimensions } from '../../../utils/index.ts'; +import { StorageAccess, u32_type } from '../../../utils/webgpu.ts'; +import { getTileCount } from './utils.ts'; + +/////////////////////////// +/// SHADER CODE +/////////////////////////// + +export const BUFFER_SEGMENT_COUNT_PER_TILE = ( + bindingIdx: number, + access: StorageAccess +) => /* wgsl */ ` + 
+@group(0) @binding(${bindingIdx}) +var _hairSegmentCountPerTile: array<${u32_type(access)}>; + +${access == 'read_write' ? incTileSegmentCount : ''} +`; + +const incTileSegmentCount = /* wgsl */ ` + fn _incTileSegmentCount(viewportSize: vec2u, tileXY: vec2u) { + let tileIdx = getHairTileIdx(viewportSize, tileXY); + atomicAdd(&_hairSegmentCountPerTile[tileIdx], 1u); + } +`; + +/////////////////////////// +/// GPU BUFFER +/////////////////////////// + +export function createHairSegmentCountPerTileBuffer( + device: GPUDevice, + viewportSize: Dimensions +): GPUBuffer { + const tileCount = getTileCount(viewportSize); + const entries = tileCount.width * tileCount.height; + const size = entries * BYTES_U32; + + return device.createBuffer({ + label: `hair-segments-per-tile`, + size, + usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST, + }); +} diff --git a/src/passes/swHair/shared/tileListBuffer.ts b/src/passes/swHair/shared/tileListBuffer.ts new file mode 100644 index 0000000..5cf7703 --- /dev/null +++ b/src/passes/swHair/shared/tileListBuffer.ts @@ -0,0 +1,82 @@ +import { BYTES_U32, CONFIG } from '../../../constants.ts'; +import { Dimensions } from '../../../utils/index.ts'; +import { StorageAccess, u32_type } from '../../../utils/webgpu.ts'; +import { getTileCount } from './utils.ts'; + +/////////////////////////// +/// SHADER CODE +/////////////////////////// + +export const BUFFER_TILE_LIST = ( + bindingIdx: number, + access: StorageAccess +) => /* wgsl */ ` + +struct TilesList { + drawnTiles: ${u32_type(access)}, + // processedTiles: u32, // TODO [NO] move here instead of hairTileSegmentsBuffer? Or is it better if fine pass stays read-only there? + data: array, // tileIds +} + +@group(0) @binding(${bindingIdx}) +var _hairTileData: TilesList; + +${access == 'read_write' ? '' : getNextTileToProcess} +`; + +const getNextTileToProcess = /* wgsl */ ` + fn _getNextTileIdx(tileCount: u32) -> u32 { + // we could do 'atomicAdd(_, 1)' on each thread. 
But which thread in wkgrp + // receives the smallest value? It is the one that decides if we are done. + // 'atomicAdd(_, 1)' gives us no such guarantee inside a wkgrp. And clever ways + // to find this are more complicated than the following code. + if (_local_invocation_index == 0u) { + let wkgrpThreadCnt = ${CONFIG.hairRender.finePassWorkgroupSizeX}u; + _tileStartOffset = atomicAdd(&_hairRasterizerResults.tileQueueAtomicIdx, wkgrpThreadCnt); + _isDone = _tileStartOffset >= tileCount; + } + + // workgroupUniformLoad() has implicit barrier + let tileStartOffset = workgroupUniformLoad(&_tileStartOffset); + let idx = tileStartOffset + _local_invocation_index; + return _hairTileData.data[idx]; + } +`; + +/////////////////////////// +/// GPU BUFFER +/////////////////////////// + +export function createHairTileListBuffer( + device: GPUDevice, + viewportSize: Dimensions +): GPUBuffer { + const tileCount2d = getTileCount(viewportSize); + const tileCount = tileCount2d.width * tileCount2d.height; + + // 4 because I will probably forget to increment this if I add more fields + const entries = 4 + tileCount; + + const extraUsage = CONFIG.isTest ?
GPUBufferUsage.COPY_SRC : 0;
+
+  return device.createBuffer({
+    label: `hair-tile-list`,
+    size: entries * BYTES_U32,
+    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST | extraUsage,
+  });
+
+  /*const data = new Uint32Array(tileCount + 1);
+  data[0] = tileCount;
+  for (let i = 0; i < tileCount; i++) {
+    data[i + 1] = i;
+  }
+
+  return createGPU_StorageBuffer(device, `hair-tile-list`, data);*/
+}
+
+export function parseTileList(data: Uint32Array) {
+  return {
+    drawnTiles: data[0],
+    data: Array.from(data.slice(1, 1 + data[0])),
+  };
+}
diff --git a/src/passes/swHair/shared/utils.ts b/src/passes/swHair/shared/utils.ts
new file mode 100644
index 0000000..2b3ff18
--- /dev/null
+++ b/src/passes/swHair/shared/utils.ts
@@ -0,0 +1,10 @@
+import { CONFIG } from '../../../constants.ts';
+import { Dimensions, divideCeil } from '../../../utils/index.ts';
+
+export const getTileCount = (viewportSize: Dimensions): Dimensions => {
+  const { tileSize } = CONFIG.hairRender;
+  return {
+    width: divideCeil(viewportSize.width, tileSize),
+    height: divideCeil(viewportSize.height, tileSize),
+  };
+};
diff --git a/src/renderer.ts b/src/renderer.ts
index 34167d0..6a72690 100644
--- a/src/renderer.ts
+++ b/src/renderer.ts
@@ -37,6 +37,7 @@ import { DrawGridDbgPass } from './passes/drawGridDbg/drawGridDbgPass.ts';
 import { GridPostSimPass } from './passes/simulation/gridPostSimPass.ts';
 import { GridPreSimPass } from './passes/simulation/gridPreSimPass.ts';
 import { DrawGizmoPass } from './passes/drawGizmo/drawGizmoPass.ts';
+import { HairTileSortPass } from './passes/swHair/hairTileSortPass.ts';
 
 export class Renderer {
   public readonly cameraCtrl: Camera;
@@ -67,6 +68,7 @@ export class Renderer {
   private readonly hairTilesPass: HairTilesPass;
   private readonly hairShadingPass: HairShadingPass;
   private readonly hairFinePass: HairFinePass;
+  private readonly hairTileSortPass: HairTileSortPass;
   private readonly hairCombinePass: HairCombinePass;
   private readonly presentPass:
PresentPass;
@@ -111,6 +113,7 @@ export class Renderer {
       NORMALS_TEX_FORMAT
     );
     this.hairTilesPass = new HairTilesPass(device);
+    this.hairTileSortPass = new HairTileSortPass(device);
     this.hairShadingPass = new HairShadingPass(device);
     this.hairFinePass = new HairFinePass(device);
     this.hairCombinePass = new HairCombinePass(device, HDR_RENDER_TEX_FORMAT);
@@ -253,13 +256,23 @@ export class Renderer {
       return;
     }
 
-    this.hairTilesPass.clearFramebuffer(ctx);
-    this.hairFinePass.clearFramebuffer(ctx);
+    this.hairTilesPass.cmdClearBeforeRender(ctx);
+    this.hairFinePass.cmdClearBeforeRender(ctx);
+    this.hairTileSortPass.cmdClearBeforeRender(ctx);
+
+    // hair rasterize pass 1
     this.hairTilesPass.cmdDrawHairToTiles(ctx, hairObject);
-    if (displayMode !== DISPLAY_MODE.TILES) {
+
+    if (
+      displayMode !== DISPLAY_MODE.TILES &&
+      displayMode !== DISPLAY_MODE.TILES_PPLL
+    ) {
+      this.hairTileSortPass.cmdSortHairTiles(ctx);
+      // hair rasterize pass 2
       this.hairFinePass.cmdRasterizeSlicesHair(ctx, hairObject);
     }
+
+    // combine meshes + hair
     this.hairCombinePass.cmdCombineRasterResults(ctx);
 
     this.updateResourcesForNextFrame(ctx, hairObject);
@@ -310,8 +323,11 @@ export class Renderer {
       // hair:
       hairTilesBuffer: this.hairTilesPass.hairTilesBuffer,
       hairTileSegmentsBuffer: this.hairTilesPass.hairTileSegmentsBuffer,
+      hairSegmentCountPerTileBuffer:
+        this.hairTilesPass.segmentCountPerTileBuffer,
       hairRasterizerResultsBuffer:
         this.hairFinePass.hairRasterizerResultsBuffer,
+      hairTileListBuffer: this.hairTileSortPass.tileListBuffer,
     };
   }
@@ -333,6 +349,7 @@ export class Renderer {
     this.drawBackgroundGradientPass.onViewportResize();
     this.hairTilesPass.onViewportResize(this.device, viewportSize);
     this.hairFinePass.onViewportResize(this.device, viewportSize);
+    this.hairTileSortPass.onViewportResize(this.device, viewportSize);
     this.hairCombinePass.onViewportResize();
     this.aoPass.onViewportResize();
     this.hairShadingPass.onViewportResize();
diff --git a/src/scene/sdfCollider/sdfCollider.ts b/src/scene/sdfCollider/sdfCollider.ts
index 4685156..d2f2a76 100644
--- a/src/scene/sdfCollider/sdfCollider.ts
+++ b/src/scene/sdfCollider/sdfCollider.ts
@@ -59,7 +59,9 @@ export class SDFCollider {
     const [boundsMin, boundsMax] = bounds;
     const size = vec3.subtract(boundsMax, boundsMin);
     const cellSize = vec3.scale(size, 1 / (dims - 1));
-    console.log(`SDF collider '${name}' (dims=${dims}, cellSize=${cellSize}), bounds:`, bounds); // prettier-ignore
+    if (!CONFIG.isTest) {
+      console.log(`SDF collider '${name}' (dims=${dims}, cellSize=${cellSize}), bounds:`, bounds); // prettier-ignore
+    }
   }
 
   bindTexture = (bindingIdx: number): GPUBindGroupEntry => ({
diff --git a/src/sys_deno/testUtils.ts b/src/sys_deno/testUtils.ts
index 74b08d4..a4f3065 100644
--- a/src/sys_deno/testUtils.ts
+++ b/src/sys_deno/testUtils.ts
@@ -115,6 +115,8 @@ export const createMockPassCtx = (
     hairRasterizerResultsBuffer: undefined!,
     hairTilesBuffer: undefined!,
     hairTileSegmentsBuffer: undefined!,
+    hairTileListBuffer: undefined!,
+    hairSegmentCountPerTileBuffer: undefined!,
   };
 };
diff --git a/src/sys_web/gui.ts b/src/sys_web/gui.ts
index eecf7c2..027f1d0 100644
--- a/src/sys_web/gui.ts
+++ b/src/sys_web/gui.ts
@@ -57,6 +57,7 @@ export function initializeGUI(
   const modeDummy = createDummy(CONFIG, 'displayMode', [
     { label: 'Final', value: DISPLAY_MODE.FINAL },
     { label: 'DBG: tiles', value: DISPLAY_MODE.TILES },
+    { label: 'DBG: tiles PPLL', value: DISPLAY_MODE.TILES_PPLL },
     { label: 'DBG: slices cnt', value: DISPLAY_MODE.USED_SLICES },
     { label: 'DBG: hw-render', value: DISPLAY_MODE.HW_RENDER },
     { label: 'DBG: depth', value: DISPLAY_MODE.DEPTH },
@@ -128,7 +129,10 @@ export function initializeGUI(
 
   function onDisplayModeChange() {
     const mode = CONFIG.displayMode;
-    setVisible(tileSegmentsCtrl, mode === DISPLAY_MODE.TILES);
+    setVisible(
+      tileSegmentsCtrl,
+      mode === DISPLAY_MODE.TILES || mode === DISPLAY_MODE.TILES_PPLL
+    );
     setVisible(slicesCtrl, mode === DISPLAY_MODE.USED_SLICES);
    setVisible(showTilesCtrl, mode === DISPLAY_MODE.FINAL);
   }
diff --git a/src/utils/webgpu.ts b/src/utils/webgpu.ts
index b8096b5..1e811a4 100644
--- a/src/utils/webgpu.ts
+++ b/src/utils/webgpu.ts
@@ -36,7 +36,7 @@ export async function createGpuDevice() {
     // 16Kb is the default limit on Chrome, provided to cover for undefined default limit
     adapter.limits.maxComputeWorkgroupStorageSize || getBytes(16, 'KB')
   );
-  requiredLimits.maxStorageBuffersPerShaderStage = 9;
+  requiredLimits.maxStorageBuffersPerShaderStage = 10;
 
   // create device
   const device = await adapter?.requestDevice({
@@ -127,6 +127,10 @@ export function createGPU_StorageBuffer(
   );
 }
 
+export function cmdClearWholeBuffer(cmdBuf: GPUCommandEncoder, buf: GPUBuffer) {
+  cmdBuf.clearBuffer(buf, 0, buf.size);
+}
+
 export const getItemsPerThread = divideCeil;
 
 export type StorageAccess = 'read_write' | 'read';
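
As a sanity check on the `TilesList` layout from `tileListBuffer.ts` above, the header-plus-ids encoding can be exercised CPU-side with a plain `Uint32Array`. This is an illustrative sketch, not part of the patch: `encodeTileList` and `decodeTileList` are hypothetical helpers that mirror what `createHairTileListBuffer` and `parseTileList` assume (word 0 holds `drawnTiles`, tile ids start at word 1, 4 header words reserved to match the `4 + tileCount` allocation).

```typescript
// CPU-side sketch of the tile list layout (hypothetical helpers, for illustration only).
const TILE_LIST_HEADER_WORDS = 4;

function encodeTileList(tileIds: number[], tileCount: number): Uint32Array {
  const data = new Uint32Array(TILE_LIST_HEADER_WORDS + tileCount);
  data[0] = tileIds.length; // drawnTiles
  tileIds.forEach((id, i) => {
    data[1 + i] = id; // tile ids start at word 1
  });
  return data;
}

function decodeTileList(data: Uint32Array) {
  const drawnTiles = data[0];
  return {
    drawnTiles,
    // Array.from, not `Array(...slice)`: spreading a 1-element slice into the
    // Array constructor would create a sparse array of that length instead.
    data: Array.from(data.slice(1, 1 + drawnTiles)),
  };
}
```

Round-tripping `[7, 3, 12]` through these helpers returns `drawnTiles: 3` and the same ids; the single-id case is exactly where the `Array(...spread)` form would silently go wrong.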