GPU-Skinned Herds: One Draw Call for 1,000 Animated Characters in Godot
Godot gives you one Skeleton3D per character. Want 200 animated animals? That's 200 skeleton nodes, 200 draw calls, and 200 AnimationPlayer ticks every frame. Want 1,000? You're measuring in seconds per frame.
We built a GPU-driven crowd animation platform into Tinqs Engine that doesn't use skeletons at all. It bakes every animation frame into a bone-matrix palette texture once, and the GPU drives every instance's playback from then on. 1,000 animals at 60 FPS on integrated graphics. Each plays its own clip at its own speed and phase. Zero per-frame CPU cost. This is how AAA engines do crowds — and now it runs in our Godot fork.
Why the engine needs to change
The standard Godot approach — one Skeleton3D + one MeshInstance3D per character — works for a handful of animated entities. It breaks down hard at crowd scale:
- CPU bone transforms. Computing global_pose for 1,000 skeletons × 60 bones each = 60,000 matrix multiplications per frame, all on the main thread.
- Draw call explosion. Each MeshInstance3D is its own draw call. Even with MultiMesh, there's no built-in path for skinned meshes: MultiMeshInstance3D only handles static geometry.
- AnimationPlayer sprawl. Each skeleton needs its own AnimationPlayer and its own process() tick.
Vertex animation textures (VAT) can solve this — bake every vertex position into a texture and sample it in the shader. But that stores vertices × frames, not bones × frames. A 2,500-vertex animal with 500 animation frames needs 14 MB of VAT data. For 30 animal types: 426 MB. That doesn't fit on a Steam Deck. And VAT can't blend frames for smooth playback, can't skin normals for correct lighting, and locks you into one animation per bake.
Our answer: bone-matrix palette. Bake every bone pose into a texture, keep the skinning in the shader. The GPU samples the bone matrices and skins the mesh itself — same 4-bone linear blend as a real skeleton, same correct normals and tangents. But the CPU never touches a bone.
How it works: two classes, one texture
The module lives in modules/agent_skinned/ inside Tinqs Engine. Two classes, one job.
MultiSkinnedMeshInstance3D — the data plane
Holds the bone-matrix palette. Allocates an ImageTexture of size [4 × max_bones, total_frames] in RGBA32F — each texel is one column of a 4×4 bone matrix, each row is one baked animation frame. At load time, we play every animation clip on a temporary skeleton and record the bone matrices for every frame:
Goat: 53 bones, 9 clips, 496 baked frames in total
Texture: 212 × 496 pixels, RGBA32F
VRAM: 212 × 496 × 16 bytes = 1.6 MB
That's every frame of every clip — walk, run, idle, attack, death, eat, sleep — in 1.6 MB. Across 30 animal types: 48 MB total. Compare to VAT at 426 MB. Bone-matrix is 9× smaller because bones ≪ vertices.
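The size comparison is easy to reproduce. The sketch below is plain Python arithmetic, not engine code; `palette_bytes` and `vat_bytes` are hypothetical helper names, and the 12-bytes-per-sample VAT format (one RGB32F position per vertex per frame) is an assumption chosen because it matches the 14 MB figure quoted above.

```python
# Back-of-envelope VRAM arithmetic for the two baking strategies.
# Assumption: VAT stores one RGB32F position (12 bytes) per vertex per frame.

BYTES_RGBA32F = 16  # four 32-bit floats per texel
BYTES_RGB32F = 12   # assumed VAT texel format


def palette_bytes(bones: int, frames: int) -> int:
    """Bone-matrix palette: 4 texels (matrix columns) per bone, one row per frame."""
    return 4 * bones * frames * BYTES_RGBA32F


def vat_bytes(vertices: int, frames: int) -> int:
    """Vertex animation texture: one position texel per vertex per frame."""
    return vertices * frames * BYTES_RGB32F


goat_palette = palette_bytes(bones=53, frames=496)
generic_vat = vat_bytes(vertices=2500, frames=500)

print(goat_palette)                # 1682432 bytes, about 1.6 MiB
print(generic_vat)                 # 15000000 bytes, about 14.3 MiB
print(generic_vat / goat_palette)  # roughly 8.9
```

The ratio depends only on vertices versus 4 × bones (and the texel format), which is why the gap widens for denser meshes.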
After the bake, the skeleton is destroyed. It never runs again. The API is straightforward:
var data := MultiSkinnedMeshInstance3D.new()
data.set_max_bones(53)
data.set_max_instances(496) # palette rows = baked frames
# Bake: play each clip, seek to each frame, record bone matrices.
# Pseudocode: clips, skeleton.seek() and bone_transforms stand in for the temporary rig.
var row := 0
for clip in clips:
    for frame in clip.frames:
        skeleton.seek(frame.time)
        data.set_instance_pose_bones(row, bone_transforms)
        row += 1 # next palette row
The data plane stores matrices column-major — 4 texels per bone = 4 columns of a 4×4 transform. The getter matches the layout, and a doctest asserts it so a transpose can't silently regress.
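To make the layout concrete, here is a minimal Python sketch of column-major packing and the matching getter. It is an illustration only: the palette is modelled as a dictionary keyed by texel coordinate, and `pack_bone` and `unpack_bone` are hypothetical names, not the module's bound methods.

```python
# Sketch: store a 4x4 bone matrix as four RGBA texels, one column per texel.
# Matrices are nested lists indexed m[row][col]; the palette maps (x, y) to a texel.

def pack_bone(palette, frame_row, bone, m):
    """Write matrix m for one bone into texels (bone*4 + col, frame_row)."""
    for col in range(4):
        palette[(bone * 4 + col, frame_row)] = tuple(m[r][col] for r in range(4))


def unpack_bone(palette, frame_row, bone):
    """Read the four column texels back and rebuild the m[row][col] matrix."""
    cols = [palette[(bone * 4 + col, frame_row)] for col in range(4)]
    return [[cols[c][r] for c in range(4)] for r in range(4)]


# A translation by (1, 2, 3): the offset lives in the fourth column.
translate = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 2.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 0.0, 1.0],
]

palette = {}
pack_bone(palette, frame_row=7, bone=2, m=translate)
print(palette[(2 * 4 + 3, 7)])  # the translation column: (1.0, 2.0, 3.0, 1.0)
```

A round-trip check of this kind (pack, unpack, compare) is the sort of assertion that keeps a transpose from slipping back in.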
MultiSkinnedInstance3D — the renderer
A MultiMeshInstance3D subclass. Set its multimesh with the skinned mesh and instance transforms, point its data_source_path at the data plane. Call refresh() once — it uploads the bone texture into the shader material's bone_matrices_tex uniform.
Each MultiMesh instance carries 4 numbers in INSTANCE_CUSTOM (enable multimesh.use_custom_data):
| Channel | Meaning |
|---|---|
| .x | Which clip (start row in the palette) |
| .y | How many frames in this clip |
| .z | Playback rate (baked-fps × ground speed — foot-sync) |
| .w | Phase offset (golden-ratio spread — no two adjacent animals share the same frame) |
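As a rough illustration of how those four channels might be filled, the Python sketch below builds the custom data for a small herd. The exact phase formula is not given in this post, so the fractional-multiple-of-the-golden-ratio-conjugate spread used here is an assumption, as are the helper name `instance_custom` and the example clip values.

```python
import math

# Assumed spread: fractional part of index * 0.618..., a common low-discrepancy sequence.
GOLDEN_RATIO_CONJUGATE = (math.sqrt(5.0) - 1.0) / 2.0


def instance_custom(index, clip_start_row, clip_frames, rate):
    """Return (x, y, z, w) for one instance; w is a phase in [0, 1)."""
    phase = (index * GOLDEN_RATIO_CONJUGATE) % 1.0
    return (float(clip_start_row), float(clip_frames), float(rate), phase)


# Five animals on a hypothetical 32-frame walk clip starting at palette row 120.
herd = [instance_custom(i, 120, 32, 30.0) for i in range(5)]
for custom in herd:
    print(custom)
```

The phase is kept normalised to [0, 1) because the shader multiplies `.w` by the clip's frame count before wrapping.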
The vertex shader derives each instance's current frame from TIME:
float fpos = mod(TIME * INSTANCE_CUSTOM.z + INSTANCE_CUSTOM.w * INSTANCE_CUSTOM.y,
        INSTANCE_CUSTOM.y);
int f0 = int(fpos);
int f1 = int(mod(float(f0) + 1.0, INSTANCE_CUSTOM.y));
float fr = fpos - float(f0);
// Blend between two adjacent frames for smooth playback at low bake fps
int r0 = int(INSTANCE_CUSTOM.x + 0.5) + f0;
int r1 = int(INSTANCE_CUSTOM.x + 0.5) + f1;
// For each bone (up to 4 per vertex), reconstruct mat4 from 4 texels, blend, weight.
// b is this influence's bone index and weight its skin weight (per-vertex loop not shown).
mat4 m0 = mat4(
        texelFetch(bone_matrices_tex, ivec2(b * 4 + 0, r0), 0),
        texelFetch(bone_matrices_tex, ivec2(b * 4 + 1, r0), 0),
        texelFetch(bone_matrices_tex, ivec2(b * 4 + 2, r0), 0),
        texelFetch(bone_matrices_tex, ivec2(b * 4 + 3, r0), 0));
mat4 m1 = mat4(
        texelFetch(bone_matrices_tex, ivec2(b * 4 + 0, r1), 0),
        texelFetch(bone_matrices_tex, ivec2(b * 4 + 1, r1), 0),
        texelFetch(bone_matrices_tex, ivec2(b * 4 + 2, r1), 0),
        texelFetch(bone_matrices_tex, ivec2(b * 4 + 3, r1), 0));
skin += (m0 * (1.0 - fr) + m1 * fr) * weight;
// Apply skin to vertex, normal, tangent
VERTEX = (skin * vec4(VERTEX, 1.0)).xyz;
NORMAL = normalize((skin * vec4(NORMAL, 0.0)).xyz);
The shader uses INSTANCE_CUSTOM to pick the palette row — not INSTANCE_ID. This is the key: the texture's rows are baked animation frames, not per-instance slots. Many instances share the same rows (a synchronized airborne flock) or each pick their own (a varied herd). One abstraction, two behaviors.
The blend between two adjacent frames means we can bake at a low fps and stay smooth — the shader interpolates. The golden-ratio phase spread means every animal in a herd reads a different frame. One draw call per animal type. Zero CPU. Per-instance clip, speed, and phase — all in the GPU.
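The frame selection is simple enough to check on the CPU. The following Python sketch mirrors the shader's arithmetic for one instance; `sample_rows` is a hypothetical helper, and the clip values (start row 100, 10 frames, 30 frames per second) are made up for the example.

```python
import math


def sample_rows(time_s, custom):
    """Mirror the shader: return (row0, row1, blend) for one instance at time_s."""
    start, frames, rate, phase = custom
    # Same expression as the shader's mod(); inputs are non-negative here.
    fpos = math.fmod(time_s * rate + phase * frames, frames)
    f0 = int(fpos)
    f1 = (f0 + 1) % int(frames)  # wrap the last frame back to the first
    fr = fpos - f0
    base = int(start + 0.5)      # round the float-encoded start row
    return base + f0, base + f1, fr


clip = (100.0, 10.0, 30.0, 0.0)    # hypothetical clip block, phase 0
print(sample_rows(0.25, clip))     # (107, 108, 0.5)
print(sample_rows(0.3125, clip))   # (109, 100, 0.375): the blend wraps to frame 0
print(sample_rows(0.5, clip))      # (105, 106, 0.0): second pass through the clip
```

The second call shows why the wrap matters: on the last frame of a looping clip the blend partner is the clip's first row, not the first row of whatever clip follows it in the palette.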
The shader ships as the default material on MultiSkinnedInstance3D. It includes an albedo_tex uniform — the caller sets it from the source mesh's material so herds texture out of the box. No ShaderMaterial assembly required unless you want custom shading.
The numbers
Measured on an M1 Pro MacBook Pro (integrated GPU):
| Agent count | FPS |
|---|---|
| 100 | 60 |
| 500 | 60 |
| 1,000 | 60 |
| 10,000 | 8 (with CPU-side culling, pre-optimization) |
VRAM: 1.6 MB per animal type. 30 types = 48 MB total. A Steam Deck, whose default VRAM carve-out is 1 GB of its 16 GB of unified memory, fits the entire roster.
Draw calls: One per animal type. 30 types = 30 draw calls for every animated animal on screen. Future colonists share the same architecture — one draw call per colonist look.
What's driving it
In Ariki, the sim tracks animal migration across a 12km archipelago. AnimalHerdRenderer.cs groups sim ViewerState.animals by type, feeds world positions and yaw rotations to skinned_herd.gd — the reusable per-type herd backend. The herd bakes the palette once at setup, then set_positions() updates transforms each sim tick. set_clip_for_state() switches the active clip block in the custom data when the sim FSM changes state. set_speed_scale() adjusts the per-instance playback rate to match ground speed — feet stay planted.
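One plausible way to compute the foot-synced playback rate is sketched below in Python. The formula is an assumption, not the code in skinned_herd.gd: it scales the baked frame rate by the ratio of the animal's current ground speed to the catalog's per-animal reference speed, and `playback_rate` is a hypothetical name.

```python
def playback_rate(baked_fps, ground_speed, reference_speed):
    """Frames per second to advance so the stride matches the distance covered.

    Assumption: the clip was authored for reference_speed (metres per second),
    so moving at half that speed should play the clip at half rate.
    """
    if reference_speed <= 0.0:
        return baked_fps  # no usable reference: fall back to the authored rate
    return baked_fps * (ground_speed / reference_speed)


print(playback_rate(30.0, 2.0, 4.0))  # 15.0: walking at half the reference speed
print(playback_rate(30.0, 4.0, 4.0))  # 30.0: exactly the authored speed
print(playback_rate(30.0, 0.0, 4.0))  # 0.0: standing still freezes the stride
```

Because the rate lives in per-instance custom data, each animal in the same draw call can run at its own stride speed.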
The sim owns all behavior — 30 data-driven animals with per-animal senses, diet, combat stats, and FSM states (graze, drink, sleep, hunt, flee, scavenge, die). The client just renders. This is the same code in single-player and multiplayer — the sim is the host.
Bird flocks use the same system. BirdFlock.cs runs boid flocking on top of skinned_herd, sharing the palette with synchronized phases (airborne flapping in unison is intentional). 25 bird species, each a single draw call.
Per-instance custom data means a walking Boar, a running Boar, an idle Boar, and an attacking Boar all share the same baked palette — they just point at different rows. The renderer groups by type, not by state. One palette, one draw call, any number of states.
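The "different rows, same palette" idea comes down to a small lookup table. Here is a minimal Python sketch, assuming clips are laid out back to back in bake order; `build_clip_table` and the frame counts are hypothetical.

```python
def build_clip_table(clip_frame_counts):
    """Map each clip name to (start_row, frame_count) in a shared palette.

    clip_frame_counts is a list of (name, frames) pairs in bake order.
    Returns the table and the total number of palette rows.
    """
    table = {}
    row = 0
    for name, frames in clip_frame_counts:
        table[name] = (row, frames)
        row += frames
    return table, row


# Hypothetical frame counts for three of a boar's clips.
table, total_rows = build_clip_table([("idle", 120), ("walk", 32), ("run", 24)])
print(table["walk"])  # (120, 32): goes into INSTANCE_CUSTOM.x and .y
print(total_rows)     # 176 palette rows for these three clips
```

Switching an animal from walking to running then means writing a different (start row, frame count) pair into its custom data; the palette texture and the draw call stay the same.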
Two bugs we shipped and fixed
The module had data-plane doctests from day one — round-trip pose get/set, dirty tracking, size clamping, AABB, column-major layout. All green. Then we put it on screen and two things were wrong.
Bug 1: Shader compile failure. The default skinning shader treated TANGENT as a vec4. Godot 4 exposes it as a vec3. Fixed in one line, and we added the albedo_tex uniform so herds texture out of the box.
Bug 2: Bone matrices stored transposed. The initial data plane wrote basis rows (standard Godot Transform3D.basis is row-major), but the shader reads mat4(c0,c1,c2,c3) as columns. Every bone matrix was transposed — the mesh crumpled. Not a scale bug, not an orientation bug — a layout mismatch. Fixed by storing column-major, with a doctest to prevent regression.
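The effect of the mismatch is easy to reproduce outside the engine. The Python sketch below is illustrative only (the helper names are hypothetical): it feeds the same translation matrix to a column-based multiply twice, once packed as columns and once packed as rows.

```python
def columns_of(m):
    """Pack m[row][col] as four column tuples, the order mat4(c0, c1, c2, c3) expects."""
    return [tuple(m[r][c] for r in range(4)) for c in range(4)]


def rows_of(m):
    """The buggy packing: hand the rows over as if they were columns."""
    return [tuple(m[r]) for r in range(4)]


def transform_point(cols, p):
    """Multiply a column-packed matrix by the point (x, y, z, 1); return xyz."""
    v = (p[0], p[1], p[2], 1.0)
    return tuple(sum(cols[c][r] * v[c] for c in range(4)) for r in range(3))


translate_x5 = [
    [1.0, 0.0, 0.0, 5.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

correct = transform_point(columns_of(translate_x5), (1.0, 1.0, 1.0))
buggy = transform_point(rows_of(translate_x5), (1.0, 1.0, 1.0))
print(correct)  # (6.0, 1.0, 1.0)
print(buggy)    # (1.0, 1.0, 1.0): the translation ended up in w and was dropped
```

With a pure translation the offset simply vanishes; with rotated bones the basis is transposed as well, which inverts each bone's rotation.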
The lesson: doctests catch logic. Rendering catches truth. You need both.
The engine change
The module is 40 lines of shader code and ~500 lines of C++ in the engine's modules/agent_skinned/. The critical detail is in the shader: the bone-matrix texture is indexed by a pose slot computed from INSTANCE_CUSTOM, not by INSTANCE_ID. This is what decouples the palette from the instance count — the texture stores animation frames, the MultiMesh stores instance transforms, and the shader bridges them.
Engine version: 4.6.5.
No C# wrapper is generated — instantiate from GDScript via ClassDB.instantiate() and call the bound methods. The binding surface is small and stable. See ariki-game/scenes/animals/skinned_herd.gd for the reference backend.
The production pipeline
Each animal model ships as a game-ready GLB with baked animation clips. A catalog file maps each animal to its clips, default state, and per-animal speed reference for foot-sync.
At runtime, AnimalHerdRenderer spawns one skinned_herd per animal type. The herd bakes the palette from the model's clips. Animation logic maps sim FSM states to clip keywords (attack → attack/bite, flee → run/gallop, wander → walk). The renderer lerps positions between sim ticks for smooth motion and writes per-instance custom data each frame. Zero per-frame CPU on the animation path.
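The state-to-clip mapping and the between-tick smoothing can be sketched in a few lines of Python. This is an illustration under assumptions, not the game's code: `STATE_KEYWORDS`, `clip_for_state`, `lerp_position` and the clip names are hypothetical, and only the three state mappings named above are included.

```python
# Sim FSM state -> clip-name keywords, tried in order (from the examples above).
STATE_KEYWORDS = {
    "attack": ("attack", "bite"),
    "flee": ("run", "gallop"),
    "wander": ("walk",),
}


def clip_for_state(state, clip_names, default_clip):
    """Pick the first clip whose name contains one of the state's keywords."""
    for keyword in STATE_KEYWORDS.get(state, ()):
        for name in clip_names:
            if keyword in name.lower():
                return name
    return default_clip  # unknown state or no matching clip in this model


def lerp_position(prev, curr, alpha):
    """Interpolate between the last two sim-tick positions; alpha in [0, 1]."""
    return tuple(a + (b - a) * alpha for a, b in zip(prev, curr))


clips = ["Idle", "Walk", "Gallop"]          # hypothetical clip names in a GLB
print(clip_for_state("flee", clips, "Idle"))    # Gallop
print(clip_for_state("attack", clips, "Idle"))  # Idle (no attack clip baked)
print(lerp_position((0.0, 0.0, 0.0), (2.0, 0.0, 4.0), 0.5))  # (1.0, 0.0, 2.0)
```

Keyword matching keeps the catalog tolerant of models that name the same motion differently, at the cost of needing a sensible default when nothing matches.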
The platform
Each animal type gets one draw call. The GPU palette handles thousands at zero CPU cost. A distance LOD drops far instances to a cheaper shader path, and a cull radius hides everything beyond the horizon. Palette VRAM is halved with RGBA16F storage, cached to disk between runs. The nearest few animals get promoted to real skeletons with crossfades and head look-at — hidden from the palette so they don't double-render.
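The tiering described here can be pictured as a per-instance classification by camera distance. The Python sketch below is a guess at the shape of that logic, not the shipped implementation: `assign_tiers`, the tier names and all thresholds are hypothetical.

```python
def assign_tiers(distances, max_skeletons, lod_distance, cull_distance):
    """Assign each instance a render tier from its distance to the camera.

    The nearest max_skeletons visible instances are promoted to real skeletons;
    the rest use the palette, a cheaper far path, or are culled.
    """
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    tiers = [""] * len(distances)
    promoted = 0
    for i in order:  # nearest first
        d = distances[i]
        if d > cull_distance:
            tiers[i] = "culled"
        elif promoted < max_skeletons:
            tiers[i] = "skeleton"  # must also be hidden from the palette draw
            promoted += 1
        elif d > lod_distance:
            tiers[i] = "palette_lod"
        else:
            tiers[i] = "palette"
    return tiers


# Five animals, hypothetical thresholds in metres.
print(assign_tiers([5.0, 300.0, 40.0, 2.0, 900.0],
                   max_skeletons=2, lod_distance=100.0, cull_distance=500.0))
# ['skeleton', 'palette_lod', 'palette', 'skeleton', 'culled']
```

Whatever the real thresholds are, the important invariant is that each instance lands in exactly one tier, so a promoted animal never renders twice.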
Stock Godot has no answer for this. Skeleton3D per character caps at ~20. MultiMesh can't skin. There is no built-in crowd animation path. The bone-matrix palette technique is the approach documented in NVIDIA GPU Gems 3 as the standard for GPU crowd animation — the same class of technique used across the industry for rendering thousands of animated characters.
25 bird species share the same platform. Each flock is one synced draw call — airborne flapping in unison is a feature, not a bug. Same code drives 30 animals today. Same code will drive thousands of colonists at launch.
What's deliberately not here
- No C# wrapper. Instantiate from GDScript via ClassDB.instantiate(); the binding surface is small and stable.
- No automatic AnimationPlayer integration. You drive poses at bake time. We give you the texture. Freedom to animate however you want.
- No GPU occlusion culling. That's the game's job. The engine provides the tool; the game decides what to draw.
Get the build
Pre-built editor binaries with agent_skinned and the GPU-driven palette baked in — no engine compile required. The game's animal_perf_test.tscn lets you spawn 10/100/1,000/10,000 animals and read live FPS:
| Platform | Binary |
|---|---|
| macOS ARM64 | tinqs.macos.editor.arm64.mono |
| Windows x64 | tinqs.windows.editor.x86_64.mono.exe |
All builds at tinqs/builds — engine source is private, but the binaries are yours. See manifest.json for checksums and build details.
The engine source lives in tinqs/engine (private). Module docs: modules/agent_skinned/README.md and .agents/wiki/agent-skinned-gpu-herd.md.
Related: Fork, Don't Build — why we modify existing platforms instead of building new ones. Streaming a 12km Archipelago in Godot 4 — the terrain and vegetation streaming layers that work alongside this.