From 85a6db41c5288fce028bddc7f6a33a1e9884c7d7 Mon Sep 17 00:00:00 2001
From: Ozan Bozkurt <ozan@tinqs.com>
Date: Mon, 15 Jun 2026 22:48:25 +0100
Subject: [PATCH] post: add engine improvement roadmap (Tier A-D)

---
 gpu-skinned-herds.html     | 23 +++++++++++++++++------
 posts/gpu-skinned-herds.md | 36 ++++++++++++++++++++++++++++++++----
 2 files changed, 49 insertions(+), 10 deletions(-)
diff --git a/gpu-skinned-herds.html b/gpu-skinned-herds.html
index 6c99e2a..2cbce02 100644
--- a/gpu-skinned-herds.html
+++ b/gpu-skinned-herds.html
@@ -380,12 +380,23 @@ NORMAL = normalize((skin * vec4(NORMAL, 0.0)).xyz);</code></pre>
   <li><strong>Impostor tier (2D billboards)</strong> — sprite atlas indexed by view-angle and animation-frame. For very far agents.</li>
 </ul>
 <p>One abstraction, three detail levels. The same code that drives 30 animals today will drive thousands of colonists at launch.</p>
-<h2>What's deliberately not here</h2>
-<ul>
-  <li><strong>No C# wrapper.</strong> Instantiate from GDScript via <code>ClassDB.instantiate()</code> — the binding surface is small and stable.</li>
-  <li><strong>No automatic <code>AnimationPlayer</code> integration.</strong> You drive poses at bake time. We give you the texture. Freedom to animate however you want.</li>
-  <li><strong>No GPU occlusion culling.</strong> That's the game's job. The engine provides the tool; the game decides what to draw.</li>
-</ul>
+<h2>The engine roadmap — where we push next</h2>
+<p>The 10,000-agent load test pointed at exactly where the engine can still win. The bottleneck at extreme scale isn't the GPU skinning but the CPU path feeding it: C# builds arrays, marshals them as Variants into GDScript, and GDScript then loops over every instance calling <code>set_instance_transform</code> / <code>set_instance_custom_data</code>. That is three layers of per-instance overhead, all on the main thread, so the fixes have to go into the engine itself.</p>
+<h3>Tier A — kill the per-frame CPU path</h3>
+<p><strong>Bulk instance-upload API.</strong> Add a <code>set_instance_data_bulk()</code> method that does a single memcpy into the MultiMesh buffer instead of N scripted per-instance calls. That means one marshalled call and one copy per herd per frame, instead of thousands of calls.</p>
+<p><strong>GPU-driven cull + indirect multi-draw.</strong> A compute pass classifies each instance by frustum visibility, distance, and LOD tier on the GPU and writes one indirect draw buffer per tier, so the CPU stops iterating instances entirely. This pairs with bulk upload: together they leave the main thread doing ~zero per-instance work.</p>
+<p><strong>GPU dead-reckoning of position.</strong> Store per-instance velocity in custom data and advance each transform from <code>TIME</code> in the vertex shader. The CPU then touches an instance only when a sim snapshot arrives (roughly every 0.4 s), not every frame.</p>
+<h3>Tier B — skinning core upgrades</h3>
+<p><strong>Dual-quaternion skinning.</strong> 2 texels per bone instead of 4. That halves palette VRAM, halves per-vertex texel fetches, and fixes the "candy-wrapper" collapse that linear blend skinning produces on twisting joints. The trade-off is that a dual quaternion encodes only rotation and translation, so any bone scale would need separate handling.</p>
+<p><strong>Mat4x3 storage.</strong> The bottom row of an affine bone matrix is always <code>(0,0,0,1)</code>, so dropping it cuts each bone from 4 texels to 3 and saves 25% of palette VRAM with zero quality loss. A quick win if dual-quat is too big a step.</p>
+<p><strong>Reduced-bone far LOD.</strong> Drop fingers, tail, and face bones for the far tier — fewer fetches where detail isn't visible.</p>
+<h3>Tier C — visibility &amp; render passes</h3>
+<p><strong>Frustum-cull integration.</strong> MultiMesh draws every instance up to <code>visible_instance_count</code>, whether or not it is on screen. Wire per-instance frustum culling into the engine's visibility system.</p>
+<p><strong>Shadow-pass LOD.</strong> The skinning shader runs again in the depth/shadow pass, which is often a hidden ~2× vertex cost. Skip skinning or drop shadow casters beyond a set distance.</p>
+<h3>Tier D — quality &amp; pipeline</h3>
+<p><strong>In-shader clip cross-fade.</strong> Blend two clip blocks per instance (second custom slot + blend factor) instead of hard state cuts — brings hero-rig smoothness to the whole crowd without a real skeleton.</p>
+<p><strong>Threaded bake.</strong> Move the palette bake to a worker thread so the first encounter with a new animal type never hitches the main thread.</p>
+<p>The recommended order: bulk upload (directly fixes the measured bottleneck, small, low-risk) → mat4x3 storage (immediate VRAM win) → GPU-driven cull + indirect draw (removes the CPU from the loop entirely, unlocks tens of thousands of agents) → dual-quaternion skinning (the skinning-quality leap). The first two are about a day of work each and their gains compound; the latter two are the deep engine investments that make this a genuinely AAA crowd platform.</p>
 <h2>Get the build</h2>
 <p>Pre-built editor binaries with <code>agent_skinned</code> and the GPU-driven palette baked in — no engine compile required. The game's <code>animal_perf_test.tscn</code> lets you spawn 10/100/1,000/10,000 animals and read live FPS:</p>
 <p>| Platform | Binary |</p>
diff --git a/posts/gpu-skinned-herds.md b/posts/gpu-skinned-herds.md
index 143f23a..c11fe50 100644
--- a/posts/gpu-skinned-herds.md
+++ b/posts/gpu-skinned-herds.md
@@ -164,11 +164,39 @@ The platform supports three tiers by distance, all driven by the same `(clip, co
 
 One abstraction, three detail levels. The same code that drives 30 animals today will drive thousands of colonists at launch.
 
-## What's deliberately not here
+## The engine roadmap — where we push next
 
-- **No C# wrapper.** Instantiate from GDScript via `ClassDB.instantiate()` — the binding surface is small and stable.
-- **No automatic `AnimationPlayer` integration.** You drive poses at bake time. We give you the texture. Freedom to animate however you want.
-- **No GPU occlusion culling.** That's the game's job. The engine provides the tool; the game decides what to draw.
+The 10,000-agent load test pointed at exactly where the engine can still win. The bottleneck at extreme scale isn't the GPU skinning but the CPU path feeding it: C# builds arrays, marshals them as Variants into GDScript, and GDScript then loops over every instance calling `set_instance_transform` / `set_instance_custom_data`. That is three layers of per-instance overhead, all on the main thread, so the fixes have to go into the engine itself.
+
+### Tier A — kill the per-frame CPU path
+
+**Bulk instance-upload API.** Add a `set_instance_data_bulk()` method that does a single memcpy into the MultiMesh buffer instead of N scripted per-instance calls. That means one marshalled call and one copy per herd per frame, instead of thousands of calls.
+
+**GPU-driven cull + indirect multi-draw.** A compute pass classifies each instance by frustum visibility, distance, and LOD tier on the GPU and writes one indirect draw buffer per tier, so the CPU stops iterating instances entirely. This pairs with bulk upload: together they leave the main thread doing ~zero per-instance work.
+
+**GPU dead-reckoning of position.** Store per-instance velocity in custom data and advance each transform from `TIME` in the vertex shader. The CPU then touches an instance only when a sim snapshot arrives (roughly every 0.4 s), not every frame.
+
+### Tier B — skinning core upgrades
+
+**Dual-quaternion skinning.** 2 texels per bone instead of 4. That halves palette VRAM, halves per-vertex texel fetches, and fixes the "candy-wrapper" collapse that linear blend skinning produces on twisting joints. The trade-off is that a dual quaternion encodes only rotation and translation, so any bone scale would need separate handling.
+
+**Mat4x3 storage.** The bottom row of an affine bone matrix is always `(0,0,0,1)`, so dropping it cuts each bone from 4 texels to 3 and saves 25% of palette VRAM with zero quality loss. A quick win if dual-quat is too big a step.
+
+**Reduced-bone far LOD.** Drop fingers, tail, and face bones for the far tier — fewer fetches where detail isn't visible.
+
+### Tier C — visibility & render passes
+
+**Frustum-cull integration.** MultiMesh draws every instance up to `visible_instance_count`, whether or not it is on screen. Wire per-instance frustum culling into the engine's visibility system.
+
+**Shadow-pass LOD.** The skinning shader runs again in the depth/shadow pass, which is often a hidden ~2× vertex cost. Skip skinning or drop shadow casters beyond a set distance.
+
+### Tier D — quality & pipeline
+
+**In-shader clip cross-fade.** Blend two clip blocks per instance (second custom slot + blend factor) instead of hard state cuts — brings hero-rig smoothness to the whole crowd without a real skeleton.
+
+**Threaded bake.** Move the palette bake to a worker thread so the first encounter with a new animal type never hitches the main thread.
+
+The recommended order: bulk upload (directly fixes the measured bottleneck, small, low-risk) → mat4x3 storage (immediate VRAM win) → GPU-driven cull + indirect draw (removes the CPU from the loop entirely, unlocks tens of thousands of agents) → dual-quaternion skinning (the skinning-quality leap). The first two are about a day of work each and their gains compound; the latter two are the deep engine investments that make this a genuinely AAA crowd platform.
 
 ## Get the build