---
title: "Zero-CPU Crowd Animation: How We Made 1,000 Animals Animate Without a Single Skeleton"
slug: gpu-driven-crowd-animation
date: "2026-06-15"
description: "Yesterday we shipped a GPU herd renderer that used one live skeleton per animal state. Today we ripped out every live skeleton and made the GPU drive all animation itself — 1,000 agents at 60 FPS, zero per-frame CPU cost, each with its own clip, speed, and phase."
og_description: "1,000 animated agents, zero live skeletons, zero per-frame CPU. Our GPU-driven crowd animation platform in the Tinqs Engine fork."
og_image: "https://www.tinqs.com/img/og-cover.jpg"
excerpt: "We rebuilt our crowd renderer to be fully GPU-driven — bake every animation frame into a bone-matrix palette once, then let each instance compute its own pose in the vertex shader. 1,000 animals: 60 FPS. CPU: idle. This is how AAA does crowds, and now it runs in our Godot fork."
author: "Ozan Bozkurt"
author_initials: "OB"
author_role: "CTO & Developer, Tinqs"
---
Yesterday we [shipped a GPU herd renderer](gpu-skinned-herds) that draws 1,000 skinned animals in a handful of draw calls. It worked — 25 crocodiles confirmed, 1,000 animals projected. But it had a quiet cost: **one live skeleton per animal state per type.** For 30 types with 5 states each, that's 150 `Skeleton3D` nodes — each with an `AnimationPlayer`, each pushing bone matrices to the GPU every frame. The GPU was fast, but the CPU was doing real work.

Today we ripped out every live skeleton. The CPU now does **zero per-frame animation work.** 1,000 animals at 60 FPS. Each plays its own clip at its own speed and phase — no lockstep, no copy-paste poses. Here's how.

## The problem: lockstep costs CPU

The original `agent_skinned` module worked by **sharing a live skeleton.** One driver `Skeleton3D` animated, and its pose was pushed to every instance in the herd. For variation across states (walking vs idle vs attacking), you needed one herd per state — each with its own driver skeleton.

```
30 animal types × 5 states = 150 live skeletons on the CPU
```

Each skeleton had to advance its `AnimationPlayer`, compute `global_pose` for every bone, push the matrices into the data plane, and upload the dirty texture region. The cost tracked **herd count**, not instance count. At 1,000 animals we measured ~25 FPS. We never measured 10,000 on this path, and we would not expect it to hold up.

The fix sounds obvious in retrospect: **the GPU should compute the poses, not the CPU.** Bake every animation frame into a texture once, and let each instance's vertex shader figure out which frame to sample.

## The bake: one texture per character type, done once

At load time, the `skinned_herd.gd` backend plays every animation clip on a temporary `Skeleton3D` and records the bone matrices for every frame into the data plane. A Goat with 9 clips at 30 fps produces 496 frames. Each frame is one row in the bone-matrix texture:

```
Goat: 53 bones × 496 frames = 26,288 bone matrices
Texture: 212 × 496 pixels, RGBA32F
VRAM: 212 × 496 × 16 bytes = 1.6 MB
```

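To make the layout concrete, here is a small Python model of how a palette like this can be addressed. It assumes what the numbers above imply: each bone matrix occupies 4 horizontally adjacent RGBA32F texels (one per `mat4` column), and each baked frame is one row. `PaletteLayout` and its method names are invented for illustration; they are not taken from `skinned_herd.gd`.

```python
from dataclasses import dataclass

TEXELS_PER_BONE = 4  # assumption: one RGBA texel per mat4 column


@dataclass(frozen=True)
class PaletteLayout:
    """Hypothetical description of a baked bone-matrix texture."""
    bones: int
    frames: int

    @property
    def width(self) -> int:
        # 4 texels per bone, laid out side by side
        return self.bones * TEXELS_PER_BONE

    @property
    def height(self) -> int:
        # one texture row per baked frame
        return self.frames

    def texel(self, bone: int, frame: int, column: int) -> tuple:
        """Return the (x, y) texel holding one column of one bone matrix."""
        if not (0 <= bone < self.bones):
            raise IndexError("bone out of range")
        if not (0 <= frame < self.frames):
            raise IndexError("frame out of range")
        if not (0 <= column < TEXELS_PER_BONE):
            raise IndexError("column out of range")
        return (bone * TEXELS_PER_BONE + column, frame)


goat = PaletteLayout(bones=53, frames=496)
print(goat.width, goat.height)   # 212 496
print(goat.texel(52, 495, 3))    # (211, 495), the last texel in the texture
```

The same `(x, y)` pair is what a `texelFetch`-style lookup would use, which is why the texture comes out 212 pixels wide for a 53-bone rig.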
That's the ENTIRE animation data for a Goat — walk, run, idle, attack, death, eat, sleep — every frame of every clip, in 1.6 MB. The bake takes a few milliseconds. After that, the skeleton is destroyed. It never runs again.

For 30 animal types: ~48 MB total. Compare this to vertex animation textures (VAT): the same Goat would need 2,500 vertices × 496 frames × 12 bytes = **14.2 MB per type, 426 MB total.** The bone-matrix palette is roughly 9× smaller because bones ≪ vertices: each frame stores 53 matrices (64 bytes each) instead of 2,500 positions (12 bytes each).

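The size comparison is easy to reproduce. The sketch below plugs in the Goat's numbers from above. It assumes a position-only VAT at 12 bytes per vertex, as in the text, and the function names are hypothetical. The post's "MB" figures line up with binary megabytes (MiB), so that is the unit used here.

```python
MIB = 1024 * 1024


def palette_bytes(bones: int, frames: int) -> int:
    # one mat4 per bone per frame, stored as 4 RGBA32F texels (16 bytes each)
    return bones * 4 * 16 * frames


def vat_bytes(vertices: int, frames: int) -> int:
    # position-only VAT: three float32 values per vertex per frame
    return vertices * 12 * frames


goat_palette = palette_bytes(bones=53, frames=496)
goat_vat = vat_bytes(vertices=2500, frames=496)
types = 30

print(f"palette: {goat_palette / MIB:.1f} MiB per type, {types * goat_palette / MIB:.0f} MiB total")
# palette: 1.6 MiB per type, 48 MiB total
print(f"VAT:     {goat_vat / MIB:.1f} MiB per type, {types * goat_vat / MIB:.0f} MiB total")
# VAT:     14.2 MiB per type, 426 MiB total
print(f"ratio:   {goat_vat / goat_palette:.1f}x")
# ratio:   8.8x
```

The ratio depends only on the per-frame cost, 30,000 bytes for the vertices against 3,392 bytes for the bones, so it holds for any clip length on this rig.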
## The GPU: per-instance playback, zero CPU

Each MultiMesh instance carries 4 numbers in `INSTANCE_CUSTOM`:

| Channel | Meaning |
|---------|---------|
| `.x` | Which clip (start row in the palette) |
| `.y` | How many frames in this clip |
| `.z` | Playback rate in frames per second (baked-fps × ground speed) |
| `.w` | Phase offset (0..1, golden-ratio spread) |

The vertex shader derives each instance's current frame from TIME:

```glsl
// Excerpt: `px`, `weight`, and `skin` come from the surrounding
// per-bone skinning loop (not shown).
float fcount = max(INSTANCE_CUSTOM.y, 1.0);
int start = int(INSTANCE_CUSTOM.x + 0.5);
float fpos = mod(TIME * INSTANCE_CUSTOM.z + INSTANCE_CUSTOM.w * fcount, fcount);

// The clamp guards against mod() rounding up to exactly fcount.
int f0 = min(int(fpos), int(fcount + 0.5) - 1);
int f1 = int(mod(float(f0) + 1.0, fcount));
float fr = fpos - float(f0);

// Blend between two adjacent baked frames for smooth playback at low bake fps
int r0 = start + f0;
int r1 = start + f1;

mat4 m0 = mat4(
    texelFetch(bone_matrices_tex, ivec2(px+0, r0), 0),
    texelFetch(bone_matrices_tex, ivec2(px+1, r0), 0),
    texelFetch(bone_matrices_tex, ivec2(px+2, r0), 0),
    texelFetch(bone_matrices_tex, ivec2(px+3, r0), 0));
mat4 m1 = mat4(
    texelFetch(bone_matrices_tex, ivec2(px+0, r1), 0),
    texelFetch(bone_matrices_tex, ivec2(px+1, r1), 0),
    texelFetch(bone_matrices_tex, ivec2(px+2, r1), 0),
    texelFetch(bone_matrices_tex, ivec2(px+3, r1), 0));

skin += (m0 * (1.0 - fr) + m1 * fr) * weight;
```

That's it. The CPU does nothing per frame. No skeletons. No `AnimationPlayer`. No per-instance push. Every instance computes its own frame from TIME + its custom data. A walking Boar, a running Boar, and an idle Boar all share the same baked palette — they just point at different rows.

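A CPU-side Python model can help when checking this math by hand. The sketch below packs the four `INSTANCE_CUSTOM` values and then mirrors the shader's frame selection. Several things in it are assumptions rather than the engine's actual code: the clip frame counts are invented, clips are assumed to sit back to back in the palette, the "golden-ratio spread" is assumed to mean `fract(index × 0.618…)`, and the `min(...)` clamp is a defensive guard for this model. All function names are hypothetical.

```python
GOLDEN_RATIO_CONJUGATE = (5 ** 0.5 - 1) / 2  # about 0.618


def build_clip_table(frame_counts):
    """Lay clips out back to back; returns {clip: (start_row, frame_count)}."""
    table, row = {}, 0
    for clip, count in frame_counts.items():
        table[clip] = (row, count)
        row += count
    return table


def pack_custom(clip_table, clip, baked_fps, ground_speed, instance_index):
    """Build the (x, y, z, w) tuple: start row, frame count, rate, phase."""
    start_row, frame_count = clip_table[clip]
    phase = (instance_index * GOLDEN_RATIO_CONJUGATE) % 1.0
    return (float(start_row), float(frame_count), baked_fps * ground_speed, phase)


def select_frames(time_s, custom):
    """Mirror of the shader math; returns (row0, row1, blend_factor)."""
    start_row, frame_count, rate, phase = custom
    fcount = max(frame_count, 1.0)
    count = int(fcount + 0.5)
    start = int(start_row + 0.5)
    # Python's % on floats is floor-based, like GLSL mod()
    fpos = (time_s * rate + phase * fcount) % fcount
    f0 = min(int(fpos), count - 1)
    f1 = (f0 + 1) % count  # wraps back to the clip's first frame
    return start + f0, start + f1, fpos - f0


clips = build_clip_table({"idle": 60, "walk": 30, "run": 20})  # invented counts
for i in range(3):
    custom = pack_custom(clips, "walk", baked_fps=30.0, ground_speed=1.0, instance_index=i)
    print(i, select_frames(2.0, custom))
```

Three instances playing the same clip at the same `time_s` land on different rows purely because of their phase, which is the "no lockstep" property described above.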
## What changed in the engine

The shader needed one critical change: the bone-matrix texture went from being indexed by `INSTANCE_ID` (one row per instance) to being indexed by a **pose slot** computed from `INSTANCE_CUSTOM` (one row per baked frame). The old code:

```glsl
int inst = INSTANCE_ID; // row = instance index → lockstep
```

Became:

```glsl
int r0 = start + f0; // row = palette row from clip + frame → per-instance variety
```

This is a 40-line shader change in the engine's `multi_skinned_instance_3d.cpp`. It's backward-compatible — slot 0 still works for the old lockstep path (which airborne bird flocks use intentionally — synchronized flapping is a feature, not a bug).

Engine version bumped from 4.6.4 to **4.6.5**.

## The numbers (measured, not projected)

On an M1 Pro MacBook Pro (integrated GPU):

| Agent count | Old lockstep (4.6.4) | GPU-driven palette (4.6.5) |
|------------|----------------------|----------------------------|
| 100 | ~40 FPS | **60 FPS** |
| 500 | 31–39 FPS | **60 FPS** |
| 1,000 | ~25 FPS | **60 FPS** |
| 10,000 | untested | 8 FPS (unoptimized) |

The 10,000 number is low because we haven't done the one-herd-per-type optimization yet — 292 herds vs the planned ~30. And our distance culling still runs on the CPU (MultiMesh has no built-in per-instance culling). Both are in the roadmap.

**VRAM:** 1.6 MB per animal type. 30 types = 48 MB total. A Steam Deck, whose GPU gets 1 GB of the shared memory pool by default, handles this comfortably. The VAT alternative would need 426 MB, roughly nine times more.

**Draw calls:** Currently ~158 (one per type × state, the lockstep holdover). After collapsing to one herd per type: ~30. After sharing palettes for rig-reuse animals: even fewer.

## The bug that made everything invisible

The first build rendered nothing. Animals were "visible" (instance count correct), custom data correct, shader compiled, texture valid — but the screen was empty. FPS was 60 because it was drawing nothing.

Root cause: a `renderer.refresh()` call during setup raced the renderer's own `NOTIFICATION_READY` handler, which re-bound the shader's `bone_matrices_tex` uniform — overwriting our baked texture with an unbound (default white) one. The shader fetched from that placeholder → every bone matrix came out degenerate instead of a real pose (a true identity matrix would at least have drawn the bind pose) → the mesh collapsed to a point → invisible.

Fix: bind the texture once on the **first `_process` frame**, after all nodes have had their `_ready` called. Then never touch it again. One deferred bind, zero per-frame cost. This is a classic Godot `_ready` sequencing gotcha.

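Why a placeholder texture produces nothing on screen can be checked with a few lines of Python. This is a sketch under assumptions: the exact texel values the shader read are not known, so it tries the two likely cases, all ones (an in-range fetch from a white texture) and all zeros (a common result for out-of-range fetches), and compares them with a proper identity matrix. The helper names are made up for the example.

```python
def transform(m, p):
    """Apply a 4x4 matrix (list of rows) to point p = (x, y, z); return xyz."""
    v = (p[0], p[1], p[2], 1.0)
    out = [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]
    return tuple(out[:3])


def triangle_area(a, b, c):
    """Half the length of the cross product of two edges."""
    u = [b[i] - a[i] for i in range(3)]
    w = [c[i] - a[i] for i in range(3)]
    cross = (u[1] * w[2] - u[2] * w[1],
             u[2] * w[0] - u[0] * w[2],
             u[0] * w[1] - u[1] * w[0])
    return 0.5 * sum(x * x for x in cross) ** 0.5


def skinned_area(m, tri):
    return triangle_area(*(transform(m, p) for p in tri))


IDENTITY = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
ALL_ONES = [[1.0] * 4 for _ in range(4)]   # every texel read as (1, 1, 1, 1)
ALL_ZEROS = [[0.0] * 4 for _ in range(4)]  # every texel read as (0, 0, 0, 0)

TRI = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

print(skinned_area(IDENTITY, TRI))   # 0.5, triangle unchanged
print(skinned_area(ALL_ONES, TRI))   # 0.0, vertices fall on the line x = y = z
print(skinned_area(ALL_ZEROS, TRI))  # 0.0, vertices all land on the origin
```

In both placeholder cases every triangle ends up with zero area, so the rasterizer has nothing to draw even though the draw call itself succeeds.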
## Where this puts us vs AAA

The technique — baking bone matrices into a texture and letting the GPU drive per-instance animation — is the same architecture used by Assassin's Creed Unity, Total War: Warhammer, and Hitman for their crowd systems. We're using the same core idea, in a Godot fork, targeting a fraction of the VRAM.

What AAA has that we don't (yet):
- **LOD tiers** — far agents become 2D impostors (billboard quads with a sprite atlas). Same `(clip, frame, speed, phase)` packet drives all tiers.
- **Hero rigs** — the nearest few agents get real `Skeleton3D` + `AnimationTree` + IK + ragdoll. Smooth gait blends, foot-lock, look-at.
- **Offline bake pipeline** — precompute palettes in the asset build, not at load time.
- **GPU compute culling** — frustum + distance + LOD classification on the GPU, no CPU cull loop.

These are planned and designed (the platform doc is at `ariki-sim/wiki/plans/crowd-animation-platform-2026-06-15.md`), but not built yet. The foundation — the GPU-driven baked palette — is what makes all of them possible.

## The fork question

Every time we change the engine, someone asks: "couldn't you do this without a fork?" For this feature, the answer is no — not without significant compromises. The alternatives:

- **VAT (vertex animation textures) with a Godot plugin:** Works in stock Godot, but VRAM is 9× larger. For 30 animal types: 426 MB vs our 48 MB. For 5 colonist looks: 620 MB — doesn't fit on a Steam Deck. VAT also can't blend frames (hard cuts between baked frames, no smooth playback) and can't skin normals/tangents (incorrect lighting).

- **Phase-offset drivers only:** Keep the live skeletons but stagger their phases. Gives some variety, but still has N live skeletons on the CPU. Doesn't scale to thousands of colonists.

- **Don't do crowds:** The simplest answer. But Ariki needs animals and colonists. The architecture decision was made: we forked Godot to own the renderer, and this is exactly the kind of feature that justifies the fork.

## What's next

The 4-item immediate roadmap:
1. **One herd per type** — collapse ~158 herds to ~30 (remove the per-state batching from the lockstep era)
2. **Distance LOD** — CPU-side cull + cheaper-far shader for far instances
3. **RGBA16F + offline bake** — half the VRAM, zero load-time hitch
4. **Hero rigs** — real `AnimationTree` + IK + ragdoll for the nearest few animals

The far horizon: animated 2D impostors and GPU compute-cull, designed and parked. Brought forward when the load demands them.

The engine source lives in [`tinqs/engine`](https://tinqs.com/tinqs/engine) (private). Pre-built editor binaries at [`tinqs/builds`](https://tinqs.com/tinqs/builds). The Ariki game is at [arikigame.com](https://www.arikigame.com).

---

**Related:** [GPU-Skinned Herds](gpu-skinned-herds) — the original herd renderer (yesterday's post). [Fork, Don't Build](fork-dont-build) — why we modify existing platforms instead of building new ones. [Streaming a 12km Archipelago in Godot 4](godot-optimisation) — the terrain and vegetation layers that work alongside this.