diff --git a/gpu-skinned-herds.html b/gpu-skinned-herds.html
index 2cbce02..6c99e2a 100644
--- a/gpu-skinned-herds.html
+++ b/gpu-skinned-herds.html
@@ -380,23 +380,12 @@ NORMAL = normalize((skin * vec4(NORMAL, 0.0)).xyz);
One abstraction, three detail levels. The same code that drives 30 animals today will drive thousands of colonists at launch.
-The 10,000-agent load test pointed at exactly where the engine can still win. The bottleneck at extreme scale isn't the GPU skinning — it's the CPU path feeding it: C# builds arrays, marshals Variants into GDScript, GDScript loops per-instance set_instance_transform / set_instance_custom_data. Three layers of per-instance overhead, all on the main thread. The fixes are engine-deep.
Bulk instance-upload API. Add set_instance_data_bulk() that does a single memcpy into the MultiMesh buffer instead of N scripted per-instance calls. One marshalled call + one copy per herd per frame instead of thousands.
GPU-driven cull + indirect multi-draw. A compute pass classifies frustum, distance, and LOD-tier on the GPU and writes an indirect draw buffer per tier — the CPU stops iterating instances entirely. Pairs with bulk upload: together the main thread does ~zero per-instance work.
-GPU dead-reckoning of position. Store per-instance velocity in custom data. Advance transforms from TIME in the vertex shader. The CPU only touches an instance on a sim snapshot (~every 0.4s), not every frame.
Dual-quaternion skinning. 2 texels per bone instead of 4. Halves palette VRAM, halves per-vertex texel fetches, and fixes the "candy-wrapper" collapse on twisting joints that linear blend skinning has. A real engine-grade upgrade.
-Mat4x3 storage. The 4th column of a bone matrix is always (0,0,0,1) — dropping it saves 25% VRAM with zero quality loss. A quick win if dual-quat is too big a step.
Reduced-bone far LOD. Drop fingers, tail, and face bones for the far tier — fewer fetches where detail isn't visible.
-Frustum-cull integration. MultiMesh draws everything in visible_instance_count — wire per-instance frustum culling into the engine's visibility system.
Shadow-pass LOD. The skinning shader runs again in the depth/shadow pass. Skip skinning or drop shadow casters beyond a distance — often a hidden ~2× vertex cost.
-In-shader clip cross-fade. Blend two clip blocks per instance (second custom slot + blend factor) instead of hard state cuts — brings hero-rig smoothness to the whole crowd without a real skeleton.
-Threaded bake. Move the palette bake to a worker thread so first-encounter of a new animal type never hitches the main thread.
-The recommended order: bulk upload (directly fixes the measured bottleneck, small, low-risk) → mat4x3 storage (immediate VRAM win) → GPU-driven cull + indirect draw (removes CPU from the loop entirely, unlocks tens of thousands) → dual-quaternion skinning (the skinning-quality leap). The first two are a day each and compounding; the latter two are the deep engine investments that make this a genuinely AAA crowd platform.
+No C# wrapper. Instantiate from GDScript via ClassDB.instantiate() — the binding surface is small and stable.
+No automatic AnimationPlayer integration. You drive poses at bake time. We give you the texture. Freedom to animate however you want.
Pre-built editor binaries with agent_skinned and the GPU-driven palette baked in — no engine compile required. The game's animal_perf_test.tscn lets you spawn 10/100/1,000/10,000 animals and read live FPS:
| Platform | Binary |
diff --git a/posts/gpu-skinned-herds.md b/posts/gpu-skinned-herds.md
index c11fe50..143f23a 100644
--- a/posts/gpu-skinned-herds.md
+++ b/posts/gpu-skinned-herds.md
@@ -164,39 +164,11 @@ The platform supports three tiers by distance, all driven by the same `(clip, co
 
 One abstraction, three detail levels. The same code that drives 30 animals today will drive thousands of colonists at launch.
 
-## The engine roadmap — where we push next
+## What's deliberately not here
 
-The 10,000-agent load test pointed at exactly where the engine can still win. The bottleneck at extreme scale isn't the GPU skinning — it's the CPU path feeding it: C# builds arrays, marshals Variants into GDScript, GDScript loops per-instance `set_instance_transform` / `set_instance_custom_data`. Three layers of per-instance overhead, all on the main thread. The fixes are engine-deep.
-
-### Tier A — kill the per-frame CPU path
-
-**Bulk instance-upload API.** Add `set_instance_data_bulk()` that does a single memcpy into the MultiMesh buffer instead of N scripted per-instance calls. One marshalled call + one copy per herd per frame instead of thousands.
-
-**GPU-driven cull + indirect multi-draw.** A compute pass classifies frustum, distance, and LOD-tier on the GPU and writes an indirect draw buffer per tier — the CPU stops iterating instances entirely. Pairs with bulk upload: together the main thread does ~zero per-instance work.
-
-**GPU dead-reckoning of position.** Store per-instance velocity in custom data. Advance transforms from `TIME` in the vertex shader. The CPU only touches an instance on a sim snapshot (~every 0.4s), not every frame.
-
-### Tier B — skinning core upgrades
-
-**Dual-quaternion skinning.** 2 texels per bone instead of 4. Halves palette VRAM, halves per-vertex texel fetches, and fixes the "candy-wrapper" collapse on twisting joints that linear blend skinning has. A real engine-grade upgrade.
-
-**Mat4x3 storage.** The 4th column of a bone matrix is always `(0,0,0,1)` — dropping it saves 25% VRAM with zero quality loss. A quick win if dual-quat is too big a step.
-
-**Reduced-bone far LOD.** Drop fingers, tail, and face bones for the far tier — fewer fetches where detail isn't visible.
-
-### Tier C — visibility & render passes
-
-**Frustum-cull integration.** MultiMesh draws everything in `visible_instance_count` — wire per-instance frustum culling into the engine's visibility system.
-
-**Shadow-pass LOD.** The skinning shader runs again in the depth/shadow pass. Skip skinning or drop shadow casters beyond a distance — often a hidden ~2× vertex cost.
-
-### Tier D — quality & pipeline
-
-**In-shader clip cross-fade.** Blend two clip blocks per instance (second custom slot + blend factor) instead of hard state cuts — brings hero-rig smoothness to the whole crowd without a real skeleton.
-
-**Threaded bake.** Move the palette bake to a worker thread so first-encounter of a new animal type never hitches the main thread.
-
-The recommended order: bulk upload (directly fixes the measured bottleneck, small, low-risk) → mat4x3 storage (immediate VRAM win) → GPU-driven cull + indirect draw (removes CPU from the loop entirely, unlocks tens of thousands) → dual-quaternion skinning (the skinning-quality leap). The first two are a day each and compounding; the latter two are the deep engine investments that make this a genuinely AAA crowd platform.
+- **No C# wrapper.** Instantiate from GDScript via `ClassDB.instantiate()` — the binding surface is small and stable.
+- **No automatic `AnimationPlayer` integration.** You drive poses at bake time. We give you the texture. Freedom to animate however you want.
+- **No GPU occlusion culling.** That's the game's job. The engine provides the tool; the game decides what to draw.
 
 ## Get the build