Zero-CPU Crowd Animation: How We Made 1,000 Animals Animate Without a Single Skeleton
Yesterday we shipped a GPU herd renderer that draws 1,000 skinned animals in a handful of draw calls. It worked: 25 crocodiles confirmed, 1,000 animals projected. But it had a quiet cost: one live skeleton per animal state per type. For 30 types with 5 states each, that's 150 Skeleton3D nodes, each with an AnimationPlayer, each pushing bone matrices to the GPU every frame. The GPU was fast, but the CPU was doing real work.
Today we ripped out every live skeleton. The CPU now does zero per-frame animation work. 1,000 animals at 60 FPS. Each plays its own clip at its own speed and phase — no lockstep, no copy-paste poses. Here's how.
The problem: lockstep costs CPU
The original agent_skinned module worked by sharing a live skeleton. One driver Skeleton3D animated, and its pose was pushed to every instance in the herd. For variation across states (walking vs idle vs attacking), you needed one herd per state, each with its own driver skeleton.
30 animal types × 5 states = 150 live skeletons on the CPU
For each skeleton, every frame, the CPU had to compute global_pose for every bone, run AnimationPlayer.process(), push matrices into the data plane, and upload the dirty texture region. The cost tracked herd count, not instance count. At 1,000 animals: ~25 FPS. At 10,000: the system crumbles.
The fix sounds obvious in retrospect: the GPU should compute the poses, not the CPU. Bake every animation frame into a texture once, and let each instance's vertex shader figure out which frame to sample.
The bake: one texture per character type, done once
At load time, the skinned_herd.gd backend plays every animation clip on a temporary Skeleton3D and records the bone matrices for every frame into the data plane. A Goat with 9 clips at 30 fps produces 496 frames. Each frame is one row in the bone-matrix texture:
Goat: 53 bones × 496 frames = 26,288 bone matrices
Texture: 212 × 496 pixels, RGBA32F
VRAM: 212 × 496 × 16 bytes = 1.6 MB
That's the ENTIRE animation data for a Goat (walk, run, idle, attack, death, eat, sleep): every frame of every clip, in 1.6 MB. The bake takes a few milliseconds. After that, the skeleton is destroyed. It never runs again.
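To make the row layout concrete, here is a minimal Python sketch of how clips could be packed into consecutive texture rows. It is not the skinned_herd.gd code: layout_clips, ClipRange, and the clip lengths are hypothetical, and it assumes each clip is sampled at a fixed bake rate and appended directly after the previous one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipRange:
    name: str
    start_row: int    # first texture row of the clip (what INSTANCE_CUSTOM.x would hold)
    frame_count: int  # number of baked rows (what INSTANCE_CUSTOM.y would hold)


def layout_clips(clip_lengths_s, bake_fps=30):
    """Assign consecutive texture rows to clips, in insertion order.

    Returns (ranges_by_name, total_rows). total_rows is the texture height.
    """
    ranges = {}
    next_row = 0
    for name, seconds in clip_lengths_s.items():
        frames = max(1, round(seconds * bake_fps))  # at least one row per clip
        ranges[name] = ClipRange(name, next_row, frames)
        next_row += frames
    return ranges, next_row


# Hypothetical clip lengths in seconds (not the real Goat data).
ranges, total_rows = layout_clips({"idle": 2.0, "walk": 1.0, "run": 0.8})
for clip in ranges.values():
    print(clip.name, clip.start_row, clip.frame_count)
print("texture height:", total_rows)
```

With a table like this, switching an instance from walk to run only means writing a different start row and frame count into its custom data.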
For 30 animal types: ~48 MB total. Compare this to vertex animation textures (VAT): the same Goat would need 2,500 vertices × 496 frames × 12 bytes = 14.2 MB per type, 426 MB total. The bone-matrix palette is about 9× smaller because bones ≪ vertices.
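The arithmetic behind these figures is easy to check. The sketch below is a small Python calculator, assuming 4 RGBA32F texels per bone matrix (which is where the 212-pixel width comes from) and 12 bytes per vertex per frame for VAT; the function names are ours, for illustration only.

```python
BYTES_PER_TEXEL_RGBA32F = 16  # 4 channels x 4-byte float
TEXELS_PER_BONE = 4           # one mat4 stored as 4 RGBA texels
MIB = 1024 * 1024


def palette_size(bones, frames):
    """Return (width_px, height_px, bytes) for a bone-matrix palette."""
    width = bones * TEXELS_PER_BONE
    return width, frames, width * frames * BYTES_PER_TEXEL_RGBA32F


def vat_bytes(vertices, frames, bytes_per_vertex=12):
    """Bytes for a position-only vertex animation texture."""
    return vertices * frames * bytes_per_vertex


width, height, palette = palette_size(bones=53, frames=496)
vat = vat_bytes(vertices=2500, frames=496)

print(width, height)            # 212 496
print(round(palette / MIB, 1))  # 1.6
print(round(vat / MIB, 1))      # 14.2
print(round(vat / palette, 1))  # 8.8
```

The numbers line up with the post when MB is read as MiB, and the exact ratio of about 8.8 is what gets rounded to 9×.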
The GPU: per-instance playback, zero CPU
Each MultiMesh instance carries 4 numbers in INSTANCE_CUSTOM:
| Channel | Meaning |
|---|---|
| .x | Which clip (start row in the palette) |
| .y | How many frames in this clip |
| .z | Playback rate (baked-fps × ground speed) |
| .w | Phase offset (0..1, golden-ratio spread) |
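As a rough illustration of how those four channels could be filled at spawn time, here is a Python sketch. It assumes "golden-ratio spread" means taking the fractional part of the instance index times the golden-ratio conjugate, a common low-discrepancy trick; the post does not spell out the exact formula, and golden_phase, pack_custom, and speed_scale are hypothetical names.

```python
import math

GOLDEN_RATIO_CONJUGATE = (math.sqrt(5.0) - 1.0) / 2.0  # ~0.618


def golden_phase(index):
    """Phase in [0, 1) that keeps neighbouring indices far apart."""
    return (index * GOLDEN_RATIO_CONJUGATE) % 1.0


def pack_custom(start_row, frame_count, baked_fps, speed_scale, index):
    """Build the (x, y, z, w) tuple described in the table above.

    speed_scale is an assumed multiplier derived from ground speed.
    """
    return (
        float(start_row),         # .x: clip start row
        float(frame_count),       # .y: frames in the clip
        baked_fps * speed_scale,  # .z: playback rate in frames per second
        golden_phase(index),      # .w: phase offset
    )


custom = pack_custom(start_row=60, frame_count=30, baked_fps=30.0,
                     speed_scale=1.5, index=1)
print(custom)
```

In Godot the tuple would be written once per instance as the MultiMesh custom data; nothing needs to be updated per frame unless the instance changes clip or speed.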
The vertex shader derives each instance's current frame from TIME:
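float fcount = max(INSTANCE_CUSTOM.y, 1.0);   // frames in this clip (never 0)
int start = int(INSTANCE_CUSTOM.x + 0.5);     // clip's first palette row, rounded
// Playhead in frames: time * rate, shifted by the phase offset, wrapped to the clip length
float fpos = mod(TIME * INSTANCE_CUSTOM.z + INSTANCE_CUSTOM.w * fcount, fcount);

int f0 = int(fpos);                           // current baked frame
int f1 = int(mod(float(f0) + 1.0, fcount));   // next frame, wrapping to the clip start
float fr = fpos - float(f0);                  // blend factor between f0 and f1

// Blend between two adjacent baked frames for smooth playback at low bake fps
int r0 = start + f0;
int r1 = start + f1;

// px is the first texel column of the current bone's matrix (4 texels per bone);
// px, weight and skin come from the surrounding per-bone skinning loop.
mat4 m0 = mat4(
    texelFetch(bone_matrices_tex, ivec2(px+0, r0), 0),
    texelFetch(bone_matrices_tex, ivec2(px+1, r0), 0),
    texelFetch(bone_matrices_tex, ivec2(px+2, r0), 0),
    texelFetch(bone_matrices_tex, ivec2(px+3, r0), 0));
mat4 m1 = mat4(
    texelFetch(bone_matrices_tex, ivec2(px+0, r1), 0),
    texelFetch(bone_matrices_tex, ivec2(px+1, r1), 0),
    texelFetch(bone_matrices_tex, ivec2(px+2, r1), 0),
    texelFetch(bone_matrices_tex, ivec2(px+3, r1), 0));

skin += (m0 * (1.0 - fr) + m1 * fr) * weight;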
That's it. The CPU does nothing per frame. No skeletons. No AnimationPlayer. No per-instance push. Every instance computes its own frame from TIME + its custom data. A walking Boar, a running Boar, and an idle Boar all share the same baked palette; they just point at different rows.
What changed in the engine
The shader needed one critical change: the bone-matrix texture went from being indexed by INSTANCE_ID (one row per instance) to being indexed by a pose slot computed from INSTANCE_CUSTOM (one row per baked frame). The old code:
int inst = INSTANCE_ID; // row = instance index → lockstep
Became:
int r0 = start + f0; // row = palette row from clip + frame → per-instance variety
This is a 40-line shader change in the engine's multi_skinned_instance_3d.cpp. It's backward-compatible: slot 0 still works for the old lockstep path, which airborne bird flocks use intentionally (synchronized flapping is a feature, not a bug).
Engine version bumped from 4.6.4 to 4.6.5.
The numbers (measured, not projected)
On an M1 Pro MacBook Pro (integrated GPU):
| Agent count | Old lockstep (4.6.4) | GPU-driven palette (4.6.5) |
|---|---|---|
| 100 | ~40 FPS | 60 FPS |
| 500 | 31–39 FPS | 60 FPS |
| 1,000 | ~25 FPS | 60 FPS |
| 10,000 | untested | 8 FPS (unoptimized) |
The 10,000 number is low because we haven't done the one-herd-per-type optimization yet: 292 herds vs the planned ~30. Our distance culling also still runs on the CPU (MultiMesh has no built-in culling). Both are on the roadmap.
VRAM: 1.6 MB per animal type. 30 types = 48 MB total. A Steam Deck with 1 GB shared memory handles this comfortably. The VAT alternative would need 426 MB, roughly nine times more.
Draw calls: currently ~158 (one per type × state, the lockstep holdover). After collapsing to one herd per type: ~30. After sharing palettes for rig-reuse animals: even fewer.
The bug that made everything invisible
The first build rendered nothing. Animals were "visible" (instance count correct), custom data correct, shader compiled, texture valid, but the screen was empty. FPS was 60 because it was drawing nothing.
Root cause: a renderer.refresh() call during setup raced the renderer's own NOTIFICATION_READY handler, which re-bound the shader's bone_matrices_tex uniform, overwriting our baked texture with an unbound (default white) one. The shader sampled white, so every bone matrix came back filled with ones instead of a valid transform. The skinned mesh collapsed into degenerate, zero-area geometry, and nothing was visible.
Fix: bind the texture once on the first _process frame, after all nodes have had their _ready called. Then never touch it again. One deferred bind, zero per-frame cost. This is a classic Godot _ready sequencing gotcha.
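The ordering problem and the one-shot fix can be modelled in a few lines of Python. This is a toy simulation, not Godot code: FakeRenderer and HerdBackend are made-up stand-ins, and on_ready only imitates a ready handler that resets the uniform to a default.

```python
class FakeRenderer:
    """Stand-in for the renderer node; not a Godot class."""

    def __init__(self):
        self.bone_matrices_tex = None

    def on_ready(self):
        # Models the ready handler re-binding the uniform to a default.
        self.bone_matrices_tex = "default_white"


class HerdBackend:
    """Binds the baked palette once, on the first processed frame."""

    def __init__(self, renderer, baked_texture):
        self.renderer = renderer
        self.baked_texture = baked_texture
        self._bound = False

    def process(self):
        # One-shot bind: later frames skip this branch entirely.
        if not self._bound:
            self.renderer.bone_matrices_tex = self.baked_texture
            self._bound = True


renderer = FakeRenderer()
backend = HerdBackend(renderer, "goat_palette")

# Buggy order: binding during setup is overwritten when ready fires.
renderer.bone_matrices_tex = "goat_palette"
renderer.on_ready()
after_setup = renderer.bone_matrices_tex

# Fixed order: the first process call runs after every ready handler.
backend.process()
after_first_frame = renderer.bone_matrices_tex
print(after_setup, after_first_frame)
```

The flag keeps the steady-state cost at a single boolean check per frame, which matches the "bind once, then never touch it again" rule.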
Where this puts us vs AAA
The technique (baking bone matrices into a texture and letting the GPU drive per-instance animation) is the same architecture used by Assassin's Creed Unity, Total War: Warhammer, and Hitman for their crowd systems. We're using the same core idea, in a Godot fork, targeting a fraction of the VRAM.
What AAA has that we don't (yet):
- LOD tiers: far agents become 2D impostors (billboard quads with a sprite atlas). The same (clip, frame, speed, phase) packet drives all tiers.
- Hero rigs: the nearest few agents get a real Skeleton3D + AnimationTree + IK + ragdoll. Smooth gait blends, foot-lock, look-at.
- Offline bake pipeline: precompute palettes in the asset build, not at load time.
- GPU compute culling: frustum + distance + LOD classification on the GPU, no CPU cull loop.
These are planned and designed (the platform doc is at ariki-sim/wiki/plans/crowd-animation-platform-2026-06-15.md), but not built yet. The foundation — the GPU-driven baked palette — is what makes all of them possible.
The fork question
Every time we change the engine, someone asks: "couldn't you do this without a fork?" For this feature, the answer is no, not without significant compromises. The alternatives:
- VAT (vertex animation textures) with a Godot plugin: Works in stock Godot, but VRAM is 9× larger. For 30 animal types: 426 MB vs our 48 MB. For 5 colonist looks: 620 MB, which doesn't fit on a Steam Deck. VAT also can't blend frames (hard cuts between baked frames, no smooth playback) and can't skin normals/tangents (incorrect lighting).
- Phase-offset drivers only: Keep the live skeletons but stagger their phases. Gives some variety, but still has N live skeletons on the CPU. Doesn't scale to thousands of colonists.
- Don't do crowds: The simplest answer. But Ariki needs animals and colonists. The architecture decision was made: we forked Godot to own the renderer, and this is exactly the kind of feature that justifies the fork.
What's next
The 4-item immediate roadmap:
1. One herd per type: collapse ~158 herds to ~30 (remove the per-state batching from the lockstep era)
2. Distance LOD: CPU-side cull plus a cheaper shader for far instances
3. RGBA16F + offline bake: half the VRAM, zero load-time hitch
4. Hero rigs: real AnimationTree + IK + ragdoll for the nearest few animals
The far horizon: animated 2D impostors and GPU compute-cull, designed and parked. Brought forward when the load demands them.
The engine source lives in tinqs/engine (private). Pre-built editor binaries at tinqs/builds. The Ariki game is at arikigame.com.
Related: GPU-Skinned Herds — the original herd renderer (yesterday's post). Fork, Don't Build — why we modify existing platforms instead of building new ones. Streaming a 12km Archipelago in Godot 4 — the terrain and vegetation layers that work alongside this.