post: GPU-driven crowd animation — polished

2026-06-15 22:41:47 +01:00
parent c979c898f4
commit 08209126c5
3 changed files with 146 additions and 202 deletions
+72 -103
@@ -1,160 +1,129 @@
---
title: "Zero-CPU Crowd Animation: How We Made 1,000 Animals Animate Without a Single Skeleton"
title: "Zero-CPU Crowd Animation: 1,000 Animals, One Draw Call, No Skeletons"
slug: gpu-driven-crowd-animation
date: "2026-06-15"
description: "Yesterday we shipped a GPU herd renderer that used one live skeleton per animal state. Today we ripped out every live skeleton and made the GPU drive all animation itself — 1,000 agents at 60 FPS, zero per-frame CPU cost, each with its own clip, speed, and phase."
og_description: "1,000 animated agents, zero live skeletons, zero per-frame CPU. Our GPU-driven crowd animation platform in the Tinqs Engine fork."
description: "We built a GPU-driven crowd animation platform into Tinqs Engine that renders 1,000 animated animals at 60 FPS with zero per-frame CPU cost. Each agent plays its own clip, speed, and phase — no live skeletons, no lockstep, no compromises."
og_description: "1,000 animated agents, zero live skeletons, zero per-frame CPU. A GPU-driven crowd animation platform in the Tinqs Engine fork of Godot."
og_image: "https://www.tinqs.com/img/og-cover.jpg"
excerpt: "We rebuilt our crowd renderer to be fully GPU-driven — bake every animation frame into a bone-matrix palette once, then let each instance compute its own pose in the vertex shader. 1,000 animals: 60 FPS. CPU: idle. This is how AAA does crowds, and now it runs in our Godot fork."
excerpt: "Our crowd renderer bakes every animation frame into a bone-matrix palette once, then the GPU drives every instance itself — 1,000 animals at 60 FPS, each with its own clip and phase. This is how AAA does crowds. Now it runs in our Godot fork."
author: "Ozan Bozkurt"
author_initials: "OB"
author_role: "CTO & Developer, Tinqs"
---
Yesterday we [shipped a GPU herd renderer](gpu-skinned-herds) that draws 1,000 skinned animals in a handful of draw calls. It worked — 25 crocodiles confirmed, 1,000 animals projected. But it had a quiet cost: **one live skeleton per animal state per type.** For 30 types with 5 states each, that's 150 `Skeleton3D` nodes — each with an `AnimationPlayer`, each pushing bone matrices to the GPU every frame. The GPU was fast, but the CPU was doing real work.
Godot gives you one `Skeleton3D` per character. Want 200 animated animals? That's 200 skeleton nodes, 200 draw calls, and 200 `AnimationPlayer` ticks every frame. Want 1,000? You're measuring in seconds per frame.
Today we ripped out every live skeleton. The CPU now does **zero per-frame animation work.** 1,000 animals at 60 FPS. Each plays its own clip at its own speed and phase — no lockstep, no copy-paste poses. Here's how.
We built a GPU-driven crowd animation platform into Tinqs Engine that doesn't use skeletons at all. It bakes every animation frame into a bone-matrix palette texture once, and the GPU drives every instance's playback from then on. 1,000 animals at 60 FPS on integrated graphics. Each plays its own clip at its own speed and phase. Zero per-frame CPU cost. This is how AAA engines do crowds — and now it runs in our Godot fork.
## The problem: lockstep costs CPU
## Why not skeletons?
The original `agent_skinned` module worked by **sharing a live skeleton.** One driver `Skeleton3D` animated, and its pose was pushed to every instance in the herd. For variation across states (walking vs idle vs attacking), you needed one herd per state — each with its own driver skeleton.
The standard approach — one skeleton per character, one `AnimationPlayer`, one draw call — breaks at crowd scale. Computing `global_pose` for 1,000 skeletons at 60 bones each is 60,000 matrix multiplications per frame on the main thread. Each is its own draw call. Each `AnimationPlayer` ticks independently. No CPU can keep up.
Vertex animation textures (VAT) can solve this: bake every vertex position into a texture and sample it in the shader. But that stores **vertices × frames**, not bones × frames. A 2,500-vertex animal with 500 animation frames needs 14 MB of VAT data. For 30 animal types: 426 MB. That doesn't fit on a Steam Deck. A position-only VAT also carries no normals, so correct lighting needs a second baked texture (more memory still), basic setups snap between baked frames unless the shader interpolates, and every extra clip costs another vertices × frames of data.
Our answer: **bone-matrix palette.** Bake every bone pose into a texture, keep the skinning in the shader. The GPU samples the bone matrices and skins the mesh itself — same 4-bone linear blend as a real skeleton, same correct normals and tangents. But the CPU never touches a bone.
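To make "4-bone linear blend" concrete, here is a minimal CPU-side sketch in Python of what the skinning step computes for one vertex. It is a reference model, not the engine's shader: the row-major matrix layout, the helper names (`identity`, `translation`, `skin_vertex`), and the example bones are all assumptions for illustration.

```python
# Reference model of linear blend skinning for a single vertex.
# Matrices are row-major 4x4 nested lists (an assumption of this sketch).

def identity():
    return [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]

def translation(tx, ty, tz):
    m = identity()
    m[0][3], m[1][3], m[2][3] = tx, ty, tz
    return m

def transform_point(m, p):
    # Apply the upper 3x4 part of the matrix to a point (w = 1).
    x, y, z = p
    return tuple(m[r][0] * x + m[r][1] * y + m[r][2] * z + m[r][3] for r in range(3))

def skin_vertex(position, bone_ids, weights, bone_matrices):
    # Weighted sum of the vertex as moved by each influencing bone.
    out = [0.0, 0.0, 0.0]
    for bone, weight in zip(bone_ids, weights):
        if weight == 0.0:
            continue
        moved = transform_point(bone_matrices[bone], position)
        for axis in range(3):
            out[axis] += weight * moved[axis]
    return tuple(out)

# Hypothetical pose: bone 0 stays put, bone 1 moves 2 units along X.
bones = [identity(), translation(2.0, 0.0, 0.0)]
skinned = skin_vertex((1.0, 1.0, 1.0), (0, 1, 0, 0), (0.5, 0.5, 0.0, 0.0), bones)
```

With equal weights on the two bones, the vertex lands halfway between the two bone-transformed positions. The palette approach only changes where the matrices come from (a texture row instead of a live skeleton), not this sum.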
## How it works
At load time, we play every animation clip on a temporary skeleton and record the bone matrices for every frame into a single texture. A Goat with 9 clips at 30 fps produces 496 frames:
```
30 animal types × 5 states = 150 live skeletons on the CPU
```
Each skeleton: compute `global_pose` for every bone, run an `AnimationPlayer.process()`, push matrices into the data plane, upload the dirty texture region. The cost tracked **herd count**, not instance count. At 1,000 animals: ~25 FPS. At 10,000: the system crumbles.
The fix sounds obvious in retrospect: **the GPU should compute the poses, not the CPU.** Bake every animation frame into a texture once, and let each instance's vertex shader figure out which frame to sample.
## The bake: one texture per character type, done once
At load time, the `skinned_herd.gd` backend plays every animation clip on a temporary `Skeleton3D` and records the bone matrices for every frame into the data plane. A Goat with 9 clips at 30 fps produces 496 frames. Each frame is one row in the bone-matrix texture:
```
Goat: 53 bones × 496 frames = 26,288 bone matrices
Texture: 212 × 496 pixels, RGBA32F
VRAM: 212 × 496 × 16 bytes = 1.6 MB
```
That's the ENTIRE animation data for a Goat — walk, run, idle, attack, death, eat, sleep — every frame of every clip, in 1.6 MB. The bake takes a few milliseconds. After that, the skeleton is destroyed. It never runs again.
That's every frame of every clip — walk, run, idle, attack, death, eat, sleep — in 1.6 MB. Across 30 animal types: 48 MB total. Compare to VAT at 426 MB. Bone-matrix is 9× smaller because bones ≪ vertices.
For 30 animal types: ~48 MB total. Compare this to vertex animation textures (VAT): the same Goat would need 2,500 vertices × 496 frames × 12 bytes = **14.2 MB per type, 426 MB total.** Bone-matrix is 9× smaller because bones ≪ vertices.
After the bake, the skeleton is destroyed. It never runs again.
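The size figures above can be reproduced with a few lines of arithmetic. This Python sketch uses only the numbers quoted in the post (53 bones, 496 frames, 2,500 vertices, RGBA32F texels, 12 bytes per VAT position); the function names are ours, and the "MB" values in the text match when read as MiB.

```python
# Back-of-envelope memory check for one animal type (the Goat).
BYTES_PER_RGBA32F_TEXEL = 16   # 4 floats x 4 bytes
TEXELS_PER_BONE_MATRIX = 4     # one texel per matrix column
MIB = 1024 * 1024

def palette_bytes(bones, frames):
    # One texture row per baked frame, four texels per bone in that row.
    width = bones * TEXELS_PER_BONE_MATRIX
    return width * frames * BYTES_PER_RGBA32F_TEXEL

def vat_bytes(vertices, frames, bytes_per_position=12):
    # One position (3 floats) per vertex per frame.
    return vertices * frames * bytes_per_position

goat_palette = palette_bytes(53, 496)
goat_vat = vat_bytes(2500, 496)

print(f"palette: {goat_palette / MIB:.1f} MiB")
print(f"VAT:     {goat_vat / MIB:.1f} MiB")
print(f"ratio:   {goat_vat / goat_palette:.1f}x")
```

The ratio comes out a little under 9, which is where the "9× smaller" figure comes from. It depends only on vertices versus bones, so it holds regardless of how many frames are baked.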
## The GPU: per-instance playback, zero CPU
Each MultiMesh instance gets 4 numbers packed into `INSTANCE_CUSTOM`:
Each MultiMesh instance carries 4 numbers in `INSTANCE_CUSTOM`:
```
.x = which clip (start row in the palette)
.y = how many frames in this clip
.z = playback rate (baked-fps × ground speed — foot-sync)
.w = phase offset (golden-ratio spread — no two adjacent animals share the same frame)
```
| Channel | Meaning |
|---------|---------|
| `.x` | Which clip (start row in the palette) |
| `.y` | How many frames in this clip |
| `.z` | Playback rate (baked-fps × ground speed) |
| `.w` | Phase offset (0..1, golden-ratio spread) |
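As a rough illustration of how those four channels could be filled on the CPU at setup time, here is a Python sketch. The clip names and frame counts are hypothetical, and the phase formula (fractional part of index × golden-ratio conjugate) is our assumption about what "golden-ratio spread" means; the post does not spell it out.

```python
import math

# Fractional golden ratio, ~0.618; successive multiples spread evenly over 0..1.
GOLDEN_RATIO_CONJUGATE = (math.sqrt(5.0) - 1.0) / 2.0

def clip_start_rows(clip_frame_counts):
    # Lay clips out back to back in the palette: name -> (start_row, frame_count).
    rows = {}
    next_row = 0
    for name, count in clip_frame_counts.items():
        rows[name] = (next_row, count)
        next_row += count
    return rows

def golden_phase(index):
    # Phase in [0, 1) so neighbouring instances start on different frames.
    return (index * GOLDEN_RATIO_CONJUGATE) % 1.0

def pack_instance_custom(clip, rows, baked_fps, speed_scale, index):
    # Returns the (x, y, z, w) tuple described in the table above.
    start, count = rows[clip]
    return (float(start), float(count), baked_fps * speed_scale, golden_phase(index))

# Hypothetical clip table for one animal type.
rows = clip_start_rows({"idle": 60, "walk": 30, "run": 20})
custom = pack_instance_custom("walk", rows, baked_fps=30.0, speed_scale=1.5, index=3)
```

Changing an animal's state then only means writing a different start row and frame count into its custom data; the palette texture itself is never touched.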
The vertex shader derives each instance's current frame from TIME:
The vertex shader computes each instance's current frame from TIME:
```glsl
float fcount = max(INSTANCE_CUSTOM.y, 1.0);
int start = int(INSTANCE_CUSTOM.x + 0.5);
float fpos = mod(TIME * INSTANCE_CUSTOM.z + INSTANCE_CUSTOM.w * fcount, fcount);
float fpos = mod(TIME * INSTANCE_CUSTOM.z + INSTANCE_CUSTOM.w * INSTANCE_CUSTOM.y,
INSTANCE_CUSTOM.y);
int f0 = int(fpos);
int f1 = int(mod(float(f0) + 1.0, fcount));
int f1 = int(mod(float(f0) + 1.0, INSTANCE_CUSTOM.y));
float fr = fpos - float(f0);
// Blend between two adjacent baked frames for smooth playback at low bake fps
int r0 = start + f0;
int r1 = start + f1;
mat4 m0 = mat4(
texelFetch(bone_matrices_tex, ivec2(px+0, r0), 0),
texelFetch(bone_matrices_tex, ivec2(px+1, r0), 0),
texelFetch(bone_matrices_tex, ivec2(px+2, r0), 0),
texelFetch(bone_matrices_tex, ivec2(px+3, r0), 0));
mat4 m1 = mat4( /* same for r1 */ );
// Blend between two adjacent frames for smooth playback
int r0 = int(INSTANCE_CUSTOM.x + 0.5) + f0;
int r1 = int(INSTANCE_CUSTOM.x + 0.5) + f1;
// For each bone, reconstruct mat4 from 4 texels, blend, weight by skin influence
mat4 m0 = mat4(texelFetch(tex, ivec2(b*4+0, r0), 0), /* ... 3 more columns */);
mat4 m1 = mat4(texelFetch(tex, ivec2(b*4+0, r1), 0), /* ... 3 more columns */);
skin += (m0 * (1.0 - fr) + m1 * fr) * weight;
```
That's it. The CPU does nothing per frame. No skeletons. No `AnimationPlayer`. No per-instance push. Every instance computes its own frame from TIME + its custom data. A walking Boar, a running Boar, and an idle Boar all share the same baked palette — they just point at different rows.
The blend between two adjacent frames means we can bake at a low fps and stay smooth — the shader interpolates. The golden-ratio phase spread means every animal in a herd reads a different frame. One draw call per animal type. Zero CPU. Per-instance clip, speed, and phase — all in the GPU.
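For checking the frame math off-GPU, the shader's frame selection can be mirrored in a few lines of Python. This is a reference model of the snippet above, not engine code; the function name and the example values are ours. Python's `%` on floats behaves like GLSL `mod` for the non-negative inputs used here.

```python
def select_frames(time_s, start_row, frame_count, rate, phase):
    # Guard against an empty clip, as the shader does with max(..., 1.0).
    fcount = max(float(frame_count), 1.0)
    # Position inside the clip, in frames, wrapped to [0, fcount).
    fpos = (time_s * rate + phase * fcount) % fcount
    f0 = min(int(fpos), int(fcount) - 1)
    f1 = (f0 + 1) % int(fcount)      # wraps to the clip's first frame
    blend = fpos - f0                # 0..1 weight toward f1
    return start_row + f0, start_row + f1, blend

# A 30-frame clip starting at palette row 60.
at_start = select_frames(0.0, 60, 30, 30.0, 0.0)
at_wrap = select_frames(0.5, 60, 30, 59.0, 0.0)
half_phase = select_frames(0.0, 0, 8, 30.0, 0.5)
```

The second call lands halfway between the clip's last frame and its first, which is the loop seam: the wrap on `f1` is what keeps looping clips from popping.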
## What changed in the engine
## The numbers
The shader needed one critical change: the bone-matrix texture went from being indexed by `INSTANCE_ID` (one row per instance) to being indexed by a **pose slot** computed from `INSTANCE_CUSTOM` (one row per baked frame). The old code:
Measured on an M1 Pro MacBook Pro (integrated GPU), not a desktop gaming rig:
```glsl
int inst = INSTANCE_ID; // row = instance index → lockstep
```
| Agent count | FPS |
|------------|-----|
| 100 | **60** |
| 500 | **60** |
| 1,000 | **60** |
| 10,000 | 8 (with CPU-side culling, pre-optimization) |
Became:
**VRAM:** 1.6 MB per animal type. 30 types = 48 MB total. A Steam Deck with 1 GB shared memory fits the entire roster with room for colonists, terrain, vegetation, and UI.
```glsl
int r0 = start + f0; // row = palette row from clip + frame → per-instance variety
```
**Draw calls:** One per animal type. 30 types = 30 draw calls for every animated animal on screen. Add colonists, same deal — one draw call per colonist look.
This is a 40-line shader change in the engine's `multi_skinned_instance_3d.cpp`. It's backward-compatible — slot 0 still works for the old lockstep path (which airborne bird flocks use intentionally — synchronized flapping is a feature, not a bug).
## The engine change
Engine version bumped from 4.6.4 to **4.6.5**.
The module lives in `modules/agent_skinned/` inside Tinqs Engine — our fork of Godot 4.6. The core is two classes:
## The numbers (measured, not projected)
**`MultiSkinnedMeshInstance3D`** — the data plane. Holds the bone-matrix palette. API: `set_max_bones()`, `set_max_instances()`, `set_instance_pose_bones()`. At bake time, we fill one row per animation frame. At render time, it sits idle — the texture is static.
On an M1 Pro MacBook Pro (integrated GPU):
**`MultiSkinnedInstance3D`** — the renderer. A `MultiMeshInstance3D` subclass. Points its multimesh at the skinned mesh and its `data_source_path` at the data plane. `refresh()` uploads the bone texture into the shader's uniform once. The MultiMesh handles instance transforms. The shader handles the rest.
| Agent count | Old lockstep (4.6.4) | GPU-driven palette (4.6.5) |
|------------|----------------------|----------------------------|
| 100 | ~40 FPS | **60 FPS** |
| 500 | 31–39 FPS | **60 FPS** |
| 1,000 | ~25 FPS | **60 FPS** |
| 10,000 | untested | 8 FPS (unoptimized) |
The shader uses `INSTANCE_CUSTOM` to pick the palette row — not `INSTANCE_ID`. This is the key: the texture's rows are baked animation frames, not per-instance slots. Many instances share the same rows (a synchronized airborne flock) or each pick their own (a varied herd). One abstraction, two behaviors.
The 10,000 number is low because we haven't done the one-herd-per-type optimization yet — 292 herds vs the planned ~30. And our distance culling still runs on the CPU (MultiMesh has no built-in culling). Both are in the roadmap.
The engine change is 40 lines of shader code in `multi_skinned_instance_3d.cpp`. Engine version: **4.6.5.**
**VRAM:** 1.6 MB per animal type. 30 types = 48 MB total. A Steam Deck with 1 GB shared memory handles this comfortably. The VAT alternative would need 426 MB — nine times more.
## The production pipeline
**Draw calls:** Currently ~158 (one per type × state, the lockstep holdover). After collapsing to one herd per type: ~30. After sharing palettes for rig-reuse animals: even fewer.
In Ariki, `AnimalHerdRenderer.cs` groups sim `ViewerState.animals` by type, feeds world positions and yaw rotations to `skinned_herd.gd` — the reusable per-type herd backend. The herd bakes the palette once at setup, then `set_positions()` updates transforms each sim tick. `set_clip_for_state()` switches the active clip block in the custom data when the sim FSM changes state (idle → walk → flee → attack). `set_speed_scale()` adjusts the per-instance playback rate to match ground speed — feet stay planted.
## The bug that made everything invisible
Bird flocks use the same system. `BirdFlock.cs` runs boid flocking on top of `skinned_herd`, sharing the palette with synchronized phases (airborne flapping in unison is intentional). 25 bird species migrated from the Low Poly Bird Ultimate Pack, each a single draw call.
The first build rendered nothing. Animals were "visible" (instance count correct), custom data correct, shader compiled, texture valid — but the screen was empty. FPS was 60 because it was drawing nothing.
The sim owns all behavior — 30 data-driven animals with per-animal senses, diet, combat stats, and FSM states. The client just renders. The same system will drive thousands of colonists at launch.
Root cause: a `renderer.refresh()` call during setup raced the renderer's own `NOTIFICATION_READY` handler, which re-bound the shader's `bone_matrices_tex` uniform and overwrote our baked texture with an unbound (default white) one. The shader sampled white → every bone matrix came out as the same constant, non-invertible matrix instead of a valid pose → every vertex collapsed to a single point → invisible.
## Where we stand vs the industry
Fix: bind the texture once on the **first `_process` frame**, after all nodes have had their `_ready` called. Then never touch it again. One deferred bind, zero per-frame cost. This is a classic Godot `_ready` sequencing gotcha.
The bone-matrix palette technique is the same architecture used by Assassin's Creed Unity, Total War: Warhammer, and Hitman for their crowd systems. We're using the same core idea, in a Godot fork, with smaller VRAM (our low-poly animals keep textures tiny).
## Where this puts us vs AAA
The platform supports three tiers by distance:
- **Crowd tier (palette)** — baked poses, GPU-driven, zero CPU. Thousands of agents.
- **Hero tier (real rigs)** — `AnimationTree` + `SkeletonIK3D` + `PhysicalBone3D` for the nearest few. Smooth gait blends, foot-lock, look-at, ragdoll.
- **Impostor tier (2D billboards)** — sprite atlas indexed by view-angle and animation-frame, driven by the same `(clip, frame, speed, phase)` packet. For very far agents.
The technique — baking bone matrices into a texture and letting the GPU drive per-instance animation — is the same architecture used by Assassin's Creed Unity, Total War: Warhammer, and Hitman for their crowd systems. We're using the same core idea, in a Godot fork, targeting a fraction of the VRAM.
The same abstraction — `(clip, count, speed, phase)` — drives every tier. One packet, three detail levels.
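A minimal sketch of how agents might be sorted into the three tiers by camera distance, in Python. The radii, the cap on hero rigs, and the function name are placeholders we made up for illustration; the post does not give the real thresholds.

```python
from enum import Enum

class Tier(Enum):
    HERO = "hero"          # real rig: AnimationTree, IK, ragdoll
    CROWD = "crowd"        # baked palette, GPU-driven
    IMPOSTOR = "impostor"  # 2D billboard

def assign_tiers(distances_m, max_heroes=4, hero_radius_m=15.0, impostor_radius_m=150.0):
    # Start with the distance split between crowd and impostor.
    tiers = [
        Tier.IMPOSTOR if d >= impostor_radius_m else Tier.CROWD
        for d in distances_m
    ]
    # Promote only the nearest few agents inside the hero radius.
    nearest_first = sorted(range(len(distances_m)), key=lambda i: distances_m[i])
    for i in nearest_first[:max_heroes]:
        if distances_m[i] < hero_radius_m:
            tiers[i] = Tier.HERO
    return tiers

tiers = assign_tiers([5.0, 200.0, 40.0, 3.0, 8.0], max_heroes=2)
```

Capping the hero count keeps the CPU cost of real rigs bounded no matter how many animals wander close to the camera; everything past the cap stays on the palette path.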
What AAA has that we don't (yet):
- **LOD tiers** — far agents become 2D impostors (billboard quads with a sprite atlas). Same `(clip, frame, speed, phase)` packet drives all tiers.
- **Hero rigs** — the nearest few agents get real `Skeleton3D` + `AnimationTree` + IK + ragdoll. Smooth gait blends, foot-lock, look-at.
- **Offline bake pipeline** — precompute palettes in the asset build, not at load time.
- **GPU compute culling** — frustum + distance + LOD classification on the GPU, no CPU cull loop.
## Get the build
These are planned and designed (the platform doc is at `ariki-sim/wiki/plans/crowd-animation-platform-2026-06-15.md`), but not built yet. The foundation — the GPU-driven baked palette — is what makes all of them possible.
Pre-built editor binaries with `agent_skinned` and the GPU-driven palette baked in:
## The fork question
| Platform | Binary |
|----------|--------|
| **macOS ARM64** | [`tinqs.macos.editor.arm64.mono`](https://tinqs.com/tinqs/builds/media/branch/main/engine/macos-arm64/tinqs.macos.editor.arm64.mono) |
| **Windows x64** | [`tinqs.windows.editor.x86_64.mono.exe`](https://tinqs.com/tinqs/builds/media/branch/main/engine/windows-x64/tinqs.windows.editor.x86_64.mono.exe) |
Every time we change the engine, someone asks: "couldn't you do this without a fork?" For this feature, the answer is no — not without significant compromises. The alternatives:
All builds at [`tinqs/builds`](https://tinqs.com/tinqs/builds). Engine source at [`tinqs/engine`](https://tinqs.com/tinqs/engine) (private).
- **VAT (vertex animation textures) with a Godot plugin:** Works in stock Godot, but VRAM is 9× larger. For 30 animal types: 426 MB vs our 48 MB. For 5 colonist looks: 620 MB, which doesn't fit on a Steam Deck. A position-only VAT also carries no normals or tangents (lighting is wrong unless you bake more textures for them), and basic setups cut hard between baked frames unless the shader interpolates.
- **Phase-offset drivers only:** Keep the live skeletons but stagger their phases. Gives some variety, but still has N live skeletons on the CPU. Doesn't scale to thousands of colonists.
- **Don't do crowds:** The simplest answer. But Ariki needs animals and colonists. The architecture decision was made: we forked Godot to own the renderer, and this is exactly the kind of feature that justifies the fork.
## What's next
The 4-item immediate roadmap:
1. **One herd per type** — collapse ~158 herds to ~30 (remove the per-state batching from the lockstep era)
2. **Distance LOD** — CPU-side cull + cheaper-far shader for far instances
3. **RGBA16F + offline bake** — half the VRAM, zero load-time hitch
4. **Hero rigs** — real `AnimationTree` + IK + ragdoll for the nearest few animals
The far horizon: animated 2D impostors and GPU compute-cull, designed and parked. Brought forward when the load demands them.
The engine source lives in [`tinqs/engine`](https://tinqs.com/tinqs/engine) (private). Pre-built editor binaries at [`tinqs/builds`](https://tinqs.com/tinqs/builds). The Ariki game is at [arikigame.com](https://www.arikigame.com).
The game's `animal_perf_test.tscn` spawns 10/100/1,000/10,000 animals and reports live FPS. The `animal_viewer.tscn` lets you inspect any animal type, toggle clips, and switch between single and herd mode.
---
**Related:** [GPU-Skinned Herds](gpu-skinned-herds) — the original herd renderer (yesterday's post). [Fork, Don't Build](fork-dont-build) — why we modify existing platforms instead of building new ones. [Streaming a 12km Archipelago in Godot 4](godot-optimisation) — the terrain and vegetation layers that work alongside this.
**Related:** [GPU-Skinned Herds](gpu-skinned-herds) — the original `agent_skinned` module design. [Fork, Don't Build](fork-dont-build) — why we modify existing platforms. [Streaming a 12km Archipelago in Godot 4](godot-optimisation) — the terrain and vegetation layers.