From c979c898f49de81dfb9d7e526de308e8641a82f6 Mon Sep 17 00:00:00 2001
From: Ozan Bozkurt
Date: Mon, 15 Jun 2026 22:41:00 +0100
Subject: [PATCH] =?UTF-8?q?post:=20GPU-driven=20crowd=20animation=20?=
 =?UTF-8?q?=E2=80=94=201000=20agents=20at=2060=20FPS,=20zero=20CPU?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 gpu-driven-crowd-animation.html     | 392 ++++++++++++++++++++++++++++
 index.html                          |   7 +
 posts/gpu-driven-crowd-animation.md | 160 ++++++++++++
 3 files changed, 559 insertions(+)
 create mode 100644 gpu-driven-crowd-animation.html
 create mode 100644 posts/gpu-driven-crowd-animation.md

diff --git a/gpu-driven-crowd-animation.html b/gpu-driven-crowd-animation.html
new file mode 100644
index 0000000..f04bcc0
--- /dev/null
+++ b/gpu-driven-crowd-animation.html
@@ -0,0 +1,392 @@
+Zero-CPU Crowd Animation: How We Made 1,000 Animals Animate Without a Single Skeleton — Tinqs Blog
+
+ ← All Posts + +

Zero-CPU Crowd Animation: How We Made 1,000 Animals Animate Without a Single Skeleton

+

Yesterday we shipped a GPU herd renderer that draws 1,000 skinned animals in a handful of draw calls. It worked — 25 crocodiles confirmed, 1,000 animals projected. But it had a quiet cost: one live skeleton per animal state per type. For 30 types with 5 states each, that's 150 Skeleton3D nodes — each with an AnimationPlayer, each pushing bone matrices to the GPU every frame. The GPU was fast, but the CPU was doing real work.

+ +
+

Today we ripped out every live skeleton. The CPU now does zero per-frame animation work. 1,000 animals at 60 FPS. Each plays its own clip at its own speed and phase — no lockstep, no copy-paste poses. Here's how.

+

The problem: lockstep costs CPU

+

The original agent_skinned module worked by sharing a live skeleton. One driver Skeleton3D animated, and its pose was pushed to every instance in the herd. For variation across states (walking vs idle vs attacking), you needed one herd per state — each with its own driver skeleton.

+
30 animal types × 5 states = 150 live skeletons on the CPU
+

Each skeleton: compute global_pose for every bone, run an AnimationPlayer.process(), push matrices into the data plane, upload the dirty texture region. The cost tracked herd count, not instance count. At 1,000 animals: ~25 FPS. At 10,000: the system crumbles.

+

The fix sounds obvious in retrospect: the GPU should compute the poses, not the CPU. Bake every animation frame into a texture once, and let each instance's vertex shader figure out which frame to sample.

+

The bake: one texture per character type, done once

+

At load time, the skinned_herd.gd backend plays every animation clip on a temporary Skeleton3D and records the bone matrices for every frame into the data plane. A Goat with 9 clips at 30 fps produces 496 frames. Each frame is one row in the bone-matrix texture:

+
Goat: 53 bones × 496 frames = 26,288 bone matrices
+Texture: 212 × 496 pixels, RGBA32F
+VRAM: 212 × 496 × 16 bytes = 1.6 MB
+

That's the ENTIRE animation data for a Goat — walk, run, idle, attack, death, eat, sleep — every frame of every clip, in 1.6 MB. The bake takes a few milliseconds. After that, the skeleton is destroyed. It never runs again.

+

For 30 animal types: ~48 MB total. Compare this to vertex animation textures (VAT): the same Goat would need 2,500 vertices × 496 frames × 12 bytes = 14.2 MB per type, 426 MB total. Bone-matrix is 9× smaller because bones ≪ vertices.
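The size comparison is easy to re-run for other rigs. Here is a minimal Python sketch of that arithmetic, using only the Goat figures quoted in this post. The function and constant names are hypothetical and are not part of the engine. The "MB" figures in the text line up with binary megabytes (MiB); that is an observation from the numbers, not something the bake code is known to state.

```python
# Back-of-envelope VRAM: baked bone-matrix palette vs. a positions-only VAT.
# Counts are the Goat figures from the post; all names here are illustrative.

TEXELS_PER_MATRIX = 4     # one mat4 stored as 4 RGBA texels
BYTES_PER_RGBA32F = 16    # 4 channels x 4-byte float
BYTES_PER_POSITION = 12   # 3 x 4-byte float per vertex per frame
MIB = 1024 * 1024


def palette_bytes(bones: int, frames: int) -> int:
    """Bytes for a texture with one row per baked frame, 4 texels per bone."""
    width = bones * TEXELS_PER_MATRIX
    return width * frames * BYTES_PER_RGBA32F


def vat_bytes(vertices: int, frames: int) -> int:
    """Bytes for a vertex animation texture that stores positions only."""
    return vertices * frames * BYTES_PER_POSITION


goat_palette = palette_bytes(bones=53, frames=496)
goat_vat = vat_bytes(vertices=2500, frames=496)
ratio = goat_vat / goat_palette

print(f"palette: {goat_palette / MIB:.2f} MiB per type")
print(f"VAT:     {goat_vat / MIB:.2f} MiB per type")
print(f"VAT is {ratio:.1f}x larger")
```

The ratio depends only on vertices per bone and bytes per sample, so a rig with more vertices per bone widens the gap.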

+

The GPU: per-instance playback, zero CPU

+

Each MultiMesh instance carries 4 numbers in INSTANCE_CUSTOM:

+

| Channel | Meaning |
|---------|---------|
| .x | Which clip (start row in the palette) |
| .y | How many frames in this clip |
| .z | Playback rate (baked-fps × ground speed) |
| .w | Phase offset (0..1, golden-ratio spread) |

+
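As a rough illustration of how those four channels might be filled on the CPU at spawn time, here is a hedged Python sketch. The post only says the phase uses a "golden-ratio spread"; the exact formula below (instance index times the golden-ratio conjugate, wrapped to 0..1) is an assumption, and `pack_instance_custom` is a hypothetical helper, not an engine function.

```python
import math

# Fractional part of the golden ratio (about 0.618). Multiples of it, wrapped
# to [0, 1), spread out evenly without ever lining up in a short cycle.
GOLDEN_RATIO_CONJUGATE = (math.sqrt(5.0) - 1.0) / 2.0


def golden_phase(index: int) -> float:
    """Assumed phase rule: a well-spread value in [0, 1) per instance index."""
    return (index * GOLDEN_RATIO_CONJUGATE) % 1.0


def pack_instance_custom(clip_start, clip_frames, baked_fps, ground_speed, index):
    """Build the (x, y, z, w) tuple described in the table above."""
    return (
        float(clip_start),          # .x: first palette row of the clip
        float(clip_frames),         # .y: number of frames in the clip
        baked_fps * ground_speed,   # .z: playback rate in frames per second
        golden_phase(index),        # .w: phase offset in [0, 1)
    )


if __name__ == "__main__":
    for i in range(4):
        print(pack_instance_custom(120, 40, 30.0, 1.5, i))
```

Because the tuple is written once per instance and never updated, animation costs the CPU nothing per frame after spawn.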

The vertex shader derives each instance's current frame from TIME:

+
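float fcount = max(INSTANCE_CUSTOM.y, 1.0);     // frames in this clip (never 0)
+int   start  = int(INSTANCE_CUSTOM.x + 0.5);    // first palette row of the clip, rounded
+// Fractional frame: time scaled by rate, shifted by phase, wrapped to the clip length
+float fpos   = mod(TIME * INSTANCE_CUSTOM.z + INSTANCE_CUSTOM.w * fcount, fcount);
+
+int f0 = int(fpos);                             // current frame
+int f1 = int(mod(float(f0) + 1.0, fcount));     // next frame, wrapping to the clip start
+float fr = fpos - float(f0);                    // blend factor between f0 and f1
+
+// Blend between two adjacent baked frames for smooth playback at low bake fps
+int r0 = start + f0;
+int r1 = start + f1;
+
+// px, weight and skin are defined outside this excerpt
+// (assumed: px is the first texel column of the current bone).
+mat4 m0 = mat4(
+    texelFetch(bone_matrices_tex, ivec2(px+0, r0), 0),
+    texelFetch(bone_matrices_tex, ivec2(px+1, r0), 0),
+    texelFetch(bone_matrices_tex, ivec2(px+2, r0), 0),
+    texelFetch(bone_matrices_tex, ivec2(px+3, r0), 0));
+mat4 m1 = mat4(
+    texelFetch(bone_matrices_tex, ivec2(px+0, r1), 0),
+    texelFetch(bone_matrices_tex, ivec2(px+1, r1), 0),
+    texelFetch(bone_matrices_tex, ivec2(px+2, r1), 0),
+    texelFetch(bone_matrices_tex, ivec2(px+3, r1), 0));
+
+skin += (m0 * (1.0 - fr) + m1 * fr) * weight;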
+

That's it. The CPU does nothing per frame. No skeletons. No AnimationPlayer. No per-instance push. Every instance computes its own frame from TIME + its custom data. A walking Boar, a running Boar, and an idle Boar all share the same baked palette — they just point at different rows.
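The frame selection in the shader is plain arithmetic, so it can be mirrored on the CPU for unit tests or debugging. The sketch below is a hypothetical Python reference model of the shader lines shown above, not engine code. The name `select_frames` and the sample clip values are made up for illustration.

```python
def select_frames(time_s, clip_start, clip_frames, rate, phase):
    """Mirror the shader: return (row0, row1, blend) for one instance.

    time_s      -- global shader TIME in seconds
    clip_start  -- INSTANCE_CUSTOM.x, first palette row of the clip
    clip_frames -- INSTANCE_CUSTOM.y, number of baked frames in the clip
    rate        -- INSTANCE_CUSTOM.z, playback rate in frames per second
    phase       -- INSTANCE_CUSTOM.w, phase offset in [0, 1)
    """
    fcount = max(float(clip_frames), 1.0)
    start = int(clip_start + 0.5)
    # For non-negative inputs Python's % matches GLSL mod().
    fpos = (time_s * rate + phase * fcount) % fcount
    f0 = int(fpos)
    f1 = int((f0 + 1.0) % fcount)   # wraps back to the first frame of the clip
    blend = fpos - f0
    return start + f0, start + f1, blend


if __name__ == "__main__":
    # Two instances sharing one 40-frame clip that starts at palette row 100.
    print(select_frames(0.25, 100, 40, 30.0, 0.0))
    print(select_frames(0.25, 100, 40, 30.0, 0.5))
```

Two instances with the same clip but different phase land on different rows at the same `TIME`, which is what breaks the lockstep.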

+

What changed in the engine

+

The shader needed one critical change: the bone-matrix texture went from being indexed by INSTANCE_ID (one row per instance) to being indexed by a pose slot computed from INSTANCE_CUSTOM (one row per baked frame). The old code:

+
int inst = INSTANCE_ID;  // row = instance index → lockstep
+

Became:

+
int r0 = start + f0;     // row = palette row from clip + frame → per-instance variety
+

This is a 40-line shader change in the engine's multi_skinned_instance_3d.cpp. It's backward-compatible — slot 0 still works for the old lockstep path (which airborne bird flocks use intentionally — synchronized flapping is a feature, not a bug).

+

Engine version bumped from 4.6.4 to 4.6.5.

+

The numbers (measured, not projected)

+

On an M1 Pro MacBook Pro (integrated GPU):

+

| Agent count | Old lockstep (4.6.4) | GPU-driven palette (4.6.5) |
|-------------|----------------------|----------------------------|
| 100 | ~40 FPS | 60 FPS |
| 500 | 31–39 FPS | 60 FPS |
| 1,000 | ~25 FPS | 60 FPS |
| 10,000 | untested | 8 FPS (unoptimized) |

+
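FPS figures compress large differences, so it can help to read the table as time per frame. A small Python sketch of the conversion follows, using only the values from the table. Whether the 60 FPS rows are capped by vsync is not stated in the post; if they are, the real frame cost at those counts is lower than the figure computed here.

```python
def frame_time_ms(fps: float) -> float:
    """Convert frames per second to milliseconds per frame."""
    return 1000.0 / fps


# Values taken from the table above.
measurements = {
    "1,000 agents, old lockstep": 25.0,
    "1,000 agents, GPU-driven palette": 60.0,
    "10,000 agents, GPU-driven palette": 8.0,
}

for label, fps in measurements.items():
    print(f"{label}: {frame_time_ms(fps):.1f} ms per frame")
```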

The 10,000 number is low because we haven't done the one-herd-per-type optimization yet — 292 herds vs the planned ~30. And our distance culling still runs on the CPU (MultiMesh has no built-in culling). Both are in the roadmap.

+

VRAM: 1.6 MB per animal type. 30 types = 48 MB total. A Steam Deck with 1 GB shared memory handles this comfortably. The VAT alternative would need 426 MB — nine times more.

+

Draw calls: Currently ~158 (one per type × state, the lockstep holdover). After collapsing to one herd per type: ~30. After sharing palettes for rig-reuse animals: even fewer.

+

The bug that made everything invisible

+

The first build rendered nothing. Animals were "visible" (instance count correct), custom data correct, shader compiled, texture valid — but the screen was empty. FPS was 60 because it was drawing nothing.

+

Root cause: a renderer.refresh() call during setup raced the renderer's own NOTIFICATION_READY handler, which re-bound the shader's bone_matrices_tex uniform and overwrote our baked texture with an unbound (default white) one. The shader sampled white → every texel was (1, 1, 1, 1), so every bone matrix was all ones instead of a valid transform → the mesh collapsed into degenerate, zero-area geometry → invisible.

+

Fix: bind the texture once on the first _process frame, after all nodes have had their _ready called. Then never touch it again. One deferred bind, zero per-frame cost. This is a classic Godot _ready sequencing gotcha.
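To see why a white texture produces nothing on screen, note that a default white texel is (1, 1, 1, 1), so a mat4 assembled from four of them has every entry equal to 1. The Python sketch below is an illustration with a made-up triangle, not engine code. It applies such a matrix to three vertices and compares the resulting triangle area against an identity transform.

```python
def transform(matrix, vertex):
    """Multiply a 4x4 matrix (list of rows) by a homogeneous vertex, keep xyz."""
    x, y, z, _w = (sum(row[c] * vertex[c] for c in range(4)) for row in matrix)
    return (x, y, z)


def triangle_area(a, b, c):
    """Area of a 3D triangle: half the length of the edge cross product."""
    ab = [b[i] - a[i] for i in range(3)]
    ac = [c[i] - a[i] for i in range(3)]
    cross = (
        ab[1] * ac[2] - ab[2] * ac[1],
        ab[2] * ac[0] - ab[0] * ac[2],
        ab[0] * ac[1] - ab[1] * ac[0],
    )
    return 0.5 * sum(k * k for k in cross) ** 0.5


IDENTITY = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
WHITE = [[1.0] * 4 for _ in range(4)]   # what four white texels decode to
TRIANGLE = [(0.0, 0.0, 0.0, 1.0), (1.0, 0.0, 0.0, 1.0), (0.0, 1.0, 0.0, 1.0)]


def skinned_area(matrix):
    a, b, c = (transform(matrix, v) for v in TRIANGLE)
    return triangle_area(a, b, c)


print("identity:", skinned_area(IDENTITY))
print("white:   ", skinned_area(WHITE))
```

With the all-ones matrix every output vertex has x = y = z, so all vertices fall on one line and no triangle has any area to rasterize.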

+

Where this puts us vs AAA

+

The technique — baking bone matrices into a texture and letting the GPU drive per-instance animation — is the same architecture used by Assassin's Creed Unity, Total War: Warhammer, and Hitman for their crowd systems. We're using the same core idea, in a Godot fork, targeting a fraction of the VRAM.

+

What AAA has that we don't (yet):

+
  • LOD tiers — far agents become 2D impostors (billboard quads with a sprite atlas). Same (clip, frame, speed, phase) packet drives all tiers.
  • Hero rigs — the nearest few agents get real Skeleton3D + AnimationTree + IK + ragdoll. Smooth gait blends, foot-lock, look-at.
  • Offline bake pipeline — precompute palettes in the asset build, not at load time.
  • GPU compute culling — frustum + distance + LOD classification on the GPU, no CPU cull loop.

These are planned and designed (the platform doc is at ariki-sim/wiki/plans/crowd-animation-platform-2026-06-15.md), but not built yet. The foundation — the GPU-driven baked palette — is what makes all of them possible.

+

The fork question

+

Every time we change the engine, someone asks: "couldn't you do this without a fork?" For this feature, the answer is no — not without significant compromises. The alternatives:

+
  • VAT (vertex animation textures) with a Godot plugin: Works in stock Godot, but VRAM is 9× larger. For 30 animal types: 426 MB vs our 48 MB. For 5 colonist looks: 620 MB, which doesn't fit on a Steam Deck. A positions-only VAT like the one costed here also cuts hard between baked frames unless the shader samples and blends two frames itself, and it lights incorrectly unless normals and tangents are baked into extra textures, which costs more VRAM still.
  • Phase-offset drivers only: Keep the live skeletons but stagger their phases. Gives some variety, but still has N live skeletons on the CPU. Doesn't scale to thousands of colonists.
  • Don't do crowds: The simplest answer. But Ariki needs animals and colonists. The architecture decision was made: we forked Godot to own the renderer, and this is exactly the kind of feature that justifies the fork.

What's next

+

The 4-item immediate roadmap:

+

1. One herd per type — collapse ~158 herds to ~30 (remove the per-state batching from the lockstep era)
2. Distance LOD — CPU-side cull + cheaper-far shader for far instances
3. RGBA16F + offline bake — half the VRAM, zero load-time hitch
4. Hero rigs — real AnimationTree + IK + ragdoll for the nearest few animals

+
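Roadmap item 3 trades precision for memory. The Python sketch below is a rough way to see both sides, using the Goat figures from earlier in the post. The sample value 12.345 stands in for a bone translation in world units, which is an assumption for illustration; the actual range of values in the baked matrices is not stated here. Helper names are hypothetical.

```python
import struct


def through_half(x: float) -> float:
    """Round-trip a value through IEEE 754 half precision (16-bit float)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def palette_bytes(bones: int, frames: int, bytes_per_texel: int) -> int:
    """Texture bytes with 4 texels per bone matrix and one row per frame."""
    return bones * 4 * frames * bytes_per_texel


rgba32f = palette_bytes(53, 496, 16)   # 4 channels x 4 bytes
rgba16f = palette_bytes(53, 496, 8)    # 4 channels x 2 bytes
print(f"RGBA32F: {rgba32f} bytes, RGBA16F: {rgba16f} bytes")

sample = 12.345                         # assumed translation component
stored = through_half(sample)
print(f"{sample} stored as half: {stored} (error {abs(sample - stored):.5f})")
```

Half floats keep about three significant decimal digits, so rotation terms near 1.0 survive well while large translations lose more. That is one reason to validate a 16-bit bake visually before adopting it.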

The far horizon: animated 2D impostors and GPU compute-cull, designed and parked. Brought forward when the load demands them.

+

The engine source lives in tinqs/engine (private). Pre-built editor binaries at tinqs/builds. The Ariki game is at arikigame.com.

+
+

Related: GPU-Skinned Herds — the original herd renderer (yesterday's post). Fork, Don't Build — why we modify existing platforms instead of building new ones. Streaming a 12km Archipelago in Godot 4 — the terrain and vegetation layers that work alongside this.

+ +
+ + +
+
+
+
diff --git a/index.html b/index.html
index 1c94312..56ffe0c 100644
--- a/index.html
+++ b/index.html
@@ -187,6 +187,13 @@
 Read →
+
+15 June 2026
+

Zero-CPU Crowd Animation: How We Made 1,000 Animals Animate Without a Single Skeleton

+

We rebuilt our crowd renderer to be fully GPU-driven — bake every animation frame into a bone-matrix palette once, then let each instance compute its own pose in the vertex shader. 1,000 animals: 60 FPS. CPU: idle. This is how AAA does crowds, and now it runs in our Godot fork.

+ Read → +
+ 14 June 2026

GPU-Skinned Herds: One Draw Call for 1,000 Animated Characters in Godot

diff --git a/posts/gpu-driven-crowd-animation.md b/posts/gpu-driven-crowd-animation.md
new file mode 100644
index 0000000..a6893a0
--- /dev/null
+++ b/posts/gpu-driven-crowd-animation.md
@@ -0,0 +1,160 @@
+---
+title: "Zero-CPU Crowd Animation: How We Made 1,000 Animals Animate Without a Single Skeleton"
+slug: gpu-driven-crowd-animation
+date: "2026-06-15"
+description: "Yesterday we shipped a GPU herd renderer that used one live skeleton per animal state. Today we ripped out every live skeleton and made the GPU drive all animation itself — 1,000 agents at 60 FPS, zero per-frame CPU cost, each with its own clip, speed, and phase."
+og_description: "1,000 animated agents, zero live skeletons, zero per-frame CPU. Our GPU-driven crowd animation platform in the Tinqs Engine fork."
+og_image: "https://www.tinqs.com/img/og-cover.jpg"
+excerpt: "We rebuilt our crowd renderer to be fully GPU-driven — bake every animation frame into a bone-matrix palette once, then let each instance compute its own pose in the vertex shader. 1,000 animals: 60 FPS. CPU: idle. This is how AAA does crowds, and now it runs in our Godot fork."
+author: "Ozan Bozkurt"
+author_initials: "OB"
+author_role: "CTO & Developer, Tinqs"
+---
+Yesterday we [shipped a GPU herd renderer](gpu-skinned-herds) that draws 1,000 skinned animals in a handful of draw calls. It worked — 25 crocodiles confirmed, 1,000 animals projected. But it had a quiet cost: **one live skeleton per animal state per type.** For 30 types with 5 states each, that's 150 `Skeleton3D` nodes — each with an `AnimationPlayer`, each pushing bone matrices to the GPU every frame. The GPU was fast, but the CPU was doing real work.
+
+Today we ripped out every live skeleton. The CPU now does **zero per-frame animation work.** 1,000 animals at 60 FPS. Each plays its own clip at its own speed and phase — no lockstep, no copy-paste poses. Here's how.
+
+## The problem: lockstep costs CPU
+
+The original `agent_skinned` module worked by **sharing a live skeleton.** One driver `Skeleton3D` animated, and its pose was pushed to every instance in the herd. For variation across states (walking vs idle vs attacking), you needed one herd per state — each with its own driver skeleton.
+
+```
+30 animal types × 5 states = 150 live skeletons on the CPU
+```
+
+Each skeleton: compute `global_pose` for every bone, run an `AnimationPlayer.process()`, push matrices into the data plane, upload the dirty texture region. The cost tracked **herd count**, not instance count. At 1,000 animals: ~25 FPS. At 10,000: the system crumbles.
+
+The fix sounds obvious in retrospect: **the GPU should compute the poses, not the CPU.** Bake every animation frame into a texture once, and let each instance's vertex shader figure out which frame to sample.
+
+## The bake: one texture per character type, done once
+
+At load time, the `skinned_herd.gd` backend plays every animation clip on a temporary `Skeleton3D` and records the bone matrices for every frame into the data plane. A Goat with 9 clips at 30 fps produces 496 frames. Each frame is one row in the bone-matrix texture:
+
+```
+Goat: 53 bones × 496 frames = 26,288 bone matrices
+Texture: 212 × 496 pixels, RGBA32F
+VRAM: 212 × 496 × 16 bytes = 1.6 MB
+```
+
+That's the ENTIRE animation data for a Goat — walk, run, idle, attack, death, eat, sleep — every frame of every clip, in 1.6 MB. The bake takes a few milliseconds. After that, the skeleton is destroyed. It never runs again.
+
+For 30 animal types: ~48 MB total. Compare this to vertex animation textures (VAT): the same Goat would need 2,500 vertices × 496 frames × 12 bytes = **14.2 MB per type, 426 MB total.** Bone-matrix is 9× smaller because bones ≪ vertices.
+
+## The GPU: per-instance playback, zero CPU
+
+Each MultiMesh instance carries 4 numbers in `INSTANCE_CUSTOM`:
+
+| Channel | Meaning |
+|---------|---------|
+| `.x` | Which clip (start row in the palette) |
+| `.y` | How many frames in this clip |
+| `.z` | Playback rate (baked-fps × ground speed) |
+| `.w` | Phase offset (0..1, golden-ratio spread) |
+
+The vertex shader derives each instance's current frame from TIME:
+
+```glsl
+float fcount = max(INSTANCE_CUSTOM.y, 1.0);     // frames in this clip (never 0)
+int   start  = int(INSTANCE_CUSTOM.x + 0.5);    // first palette row of the clip, rounded
+// Fractional frame: time scaled by rate, shifted by phase, wrapped to the clip length
+float fpos   = mod(TIME * INSTANCE_CUSTOM.z + INSTANCE_CUSTOM.w * fcount, fcount);
+
+int f0 = int(fpos);                             // current frame
+int f1 = int(mod(float(f0) + 1.0, fcount));     // next frame, wrapping to the clip start
+float fr = fpos - float(f0);                    // blend factor between f0 and f1
+
+// Blend between two adjacent baked frames for smooth playback at low bake fps
+int r0 = start + f0;
+int r1 = start + f1;
+
+// px, weight and skin are defined outside this excerpt
+// (assumed: px is the first texel column of the current bone).
+mat4 m0 = mat4(
+    texelFetch(bone_matrices_tex, ivec2(px+0, r0), 0),
+    texelFetch(bone_matrices_tex, ivec2(px+1, r0), 0),
+    texelFetch(bone_matrices_tex, ivec2(px+2, r0), 0),
+    texelFetch(bone_matrices_tex, ivec2(px+3, r0), 0));
+mat4 m1 = mat4(
+    texelFetch(bone_matrices_tex, ivec2(px+0, r1), 0),
+    texelFetch(bone_matrices_tex, ivec2(px+1, r1), 0),
+    texelFetch(bone_matrices_tex, ivec2(px+2, r1), 0),
+    texelFetch(bone_matrices_tex, ivec2(px+3, r1), 0));
+
+skin += (m0 * (1.0 - fr) + m1 * fr) * weight;
+```
+
+That's it. The CPU does nothing per frame. No skeletons. No `AnimationPlayer`. No per-instance push. Every instance computes its own frame from TIME + its custom data. A walking Boar, a running Boar, and an idle Boar all share the same baked palette — they just point at different rows.
+
+## What changed in the engine
+
+The shader needed one critical change: the bone-matrix texture went from being indexed by `INSTANCE_ID` (one row per instance) to being indexed by a **pose slot** computed from `INSTANCE_CUSTOM` (one row per baked frame). The old code:
+
+```glsl
+int inst = INSTANCE_ID; // row = instance index → lockstep
+```
+
+Became:
+
+```glsl
+int r0 = start + f0; // row = palette row from clip + frame → per-instance variety
+```
+
+This is a 40-line shader change in the engine's `multi_skinned_instance_3d.cpp`.
It's backward-compatible — slot 0 still works for the old lockstep path (which airborne bird flocks use intentionally — synchronized flapping is a feature, not a bug).
+
+Engine version bumped from 4.6.4 to **4.6.5**.
+
+## The numbers (measured, not projected)
+
+On an M1 Pro MacBook Pro (integrated GPU):
+
+| Agent count | Old lockstep (4.6.4) | GPU-driven palette (4.6.5) |
+|------------|----------------------|----------------------------|
+| 100 | ~40 FPS | **60 FPS** |
+| 500 | 31–39 FPS | **60 FPS** |
+| 1,000 | ~25 FPS | **60 FPS** |
+| 10,000 | untested | 8 FPS (unoptimized) |
+
+The 10,000 number is low because we haven't done the one-herd-per-type optimization yet — 292 herds vs the planned ~30. And our distance culling still runs on the CPU (MultiMesh has no built-in culling). Both are in the roadmap.
+
+**VRAM:** 1.6 MB per animal type. 30 types = 48 MB total. A Steam Deck with 1 GB shared memory handles this comfortably. The VAT alternative would need 426 MB — nine times more.
+
+**Draw calls:** Currently ~158 (one per type × state, the lockstep holdover). After collapsing to one herd per type: ~30. After sharing palettes for rig-reuse animals: even fewer.
+
+## The bug that made everything invisible
+
+The first build rendered nothing. Animals were "visible" (instance count correct), custom data correct, shader compiled, texture valid — but the screen was empty. FPS was 60 because it was drawing nothing.
+
+Root cause: a `renderer.refresh()` call during setup raced the renderer's own `NOTIFICATION_READY` handler, which re-bound the shader's `bone_matrices_tex` uniform and overwrote our baked texture with an unbound (default white) one. The shader sampled white → every texel was (1, 1, 1, 1), so every bone matrix was all ones instead of a valid transform → the mesh collapsed into degenerate, zero-area geometry → invisible.
+
+Fix: bind the texture once on the **first `_process` frame**, after all nodes have had their `_ready` called. Then never touch it again. One deferred bind, zero per-frame cost.
This is a classic Godot `_ready` sequencing gotcha.
+
+## Where this puts us vs AAA
+
+The technique — baking bone matrices into a texture and letting the GPU drive per-instance animation — is the same architecture used by Assassin's Creed Unity, Total War: Warhammer, and Hitman for their crowd systems. We're using the same core idea, in a Godot fork, targeting a fraction of the VRAM.
+
+What AAA has that we don't (yet):
+- **LOD tiers** — far agents become 2D impostors (billboard quads with a sprite atlas). Same `(clip, frame, speed, phase)` packet drives all tiers.
+- **Hero rigs** — the nearest few agents get real `Skeleton3D` + `AnimationTree` + IK + ragdoll. Smooth gait blends, foot-lock, look-at.
+- **Offline bake pipeline** — precompute palettes in the asset build, not at load time.
+- **GPU compute culling** — frustum + distance + LOD classification on the GPU, no CPU cull loop.
+
+These are planned and designed (the platform doc is at `ariki-sim/wiki/plans/crowd-animation-platform-2026-06-15.md`), but not built yet. The foundation — the GPU-driven baked palette — is what makes all of them possible.
+
+## The fork question
+
+Every time we change the engine, someone asks: "couldn't you do this without a fork?" For this feature, the answer is no — not without significant compromises. The alternatives:
+
+- **VAT (vertex animation textures) with a Godot plugin:** Works in stock Godot, but VRAM is 9× larger. For 30 animal types: 426 MB vs our 48 MB. For 5 colonist looks: 620 MB, which doesn't fit on a Steam Deck. A positions-only VAT like the one costed here also cuts hard between baked frames unless the shader samples and blends two frames itself, and it lights incorrectly unless normals and tangents are baked into extra textures, which costs more VRAM still.
+
+- **Phase-offset drivers only:** Keep the live skeletons but stagger their phases. Gives some variety, but still has N live skeletons on the CPU. Doesn't scale to thousands of colonists.
+
+- **Don't do crowds:** The simplest answer. But Ariki needs animals and colonists.
The architecture decision was made: we forked Godot to own the renderer, and this is exactly the kind of feature that justifies the fork.
+
+## What's next
+
+The 4-item immediate roadmap:
+1. **One herd per type** — collapse ~158 herds to ~30 (remove the per-state batching from the lockstep era)
+2. **Distance LOD** — CPU-side cull + cheaper-far shader for far instances
+3. **RGBA16F + offline bake** — half the VRAM, zero load-time hitch
+4. **Hero rigs** — real `AnimationTree` + IK + ragdoll for the nearest few animals
+
+The far horizon: animated 2D impostors and GPU compute-cull, designed and parked. Brought forward when the load demands them.
+
+The engine source lives in [`tinqs/engine`](https://tinqs.com/tinqs/engine) (private). Pre-built editor binaries at [`tinqs/builds`](https://tinqs.com/tinqs/builds). The Ariki game is at [arikigame.com](https://www.arikigame.com).
+
+---
+
+**Related:** [GPU-Skinned Herds](gpu-skinned-herds) — the original herd renderer (yesterday's post). [Fork, Don't Build](fork-dont-build) — why we modify existing platforms instead of building new ones. [Streaming a 12km Archipelago in Godot 4](godot-optimisation) — the terrain and vegetation layers that work alongside this.