<meta name="description" content="Godot has no built-in way to render 1,000 skinned characters in one draw call. We built a GPU-driven crowd animation platform into Tinqs Engine that does: 1,000 animals at 60 FPS, each with its own clip and phase, zero per-frame CPU. Pre-built binaries for macOS and Windows.">
<meta property="og:description" content="One draw call, 1,000 animated characters, zero CPU. GPU-driven crowd animation platform built into the Tinqs Engine fork of Godot.">
<meta name="twitter:description" content="One draw call, 1,000 animated characters, zero CPU. GPU-driven crowd animation platform built into the Tinqs Engine fork of Godot.">
"description":"Godot has no built-in way to render 1,000 skinned characters in one draw call. We built a GPU-driven crowd animation platform into Tinqs Engine that does — 1,000 animals at 60 FPS, each with its own clip and phase, zero per-frame CPU. Pre-built binaries for macOS and Windows."
<p class="post__lead">Godot gives you one <code>Skeleton3D</code> per character. Want 200 animated animals? That's 200 skeleton nodes, 200 draw calls, and 200 <code>AnimationPlayer</code> ticks every frame. Want 1,000? You're measuring in seconds per frame.</p>
<p>We built a GPU-driven crowd animation platform into Tinqs Engine that doesn't use skeletons at all. It bakes every animation frame into a bone-matrix palette texture once, and the GPU drives every instance's playback from then on. 1,000 animals at 60 FPS on integrated graphics. Each plays its own clip at its own speed and phase. Zero per-frame CPU cost. This is how AAA engines do crowds — and now it runs in our Godot fork.</p>
<p>The standard Godot approach — one <code>Skeleton3D</code> + one <code>MeshInstance3D</code> per character — works for a handful of animated entities. It breaks down hard at crowd scale:</p>
<li><strong>CPU bone transforms.</strong> Computing <code>global_pose</code> for 1,000 skeletons × 60 bones each = 60,000 matrix multiplications per frame, all on the main thread.</li>
<li><strong>Draw call explosion.</strong> Each <code>MeshInstance3D</code> is its own draw call. Even with MultiMesh, there's no built-in path for skinned meshes — <code>MultiMeshInstance3D</code> only handles static geometry.</li>
<li><strong>AnimationPlayer sprawl.</strong> Each skeleton needs its own <code>AnimationPlayer</code> and its own <code>process()</code> tick.</li>
<p>Vertex animation textures (VAT) can solve this — bake every vertex position into a texture and sample it in the shader. But that stores <strong>vertices × frames</strong>, not bones × frames. A 2,500-vertex animal with 500 animation frames needs 14 MB of VAT data. For 30 animal types: 426 MB. That doesn't fit on a Steam Deck. And VAT can't blend frames for smooth playback, can't skin normals for correct lighting, and locks you into one animation per bake.</p>
<p>Our answer: <strong>bone-matrix palette.</strong> Bake every bone pose into a texture, keep the skinning in the shader. The GPU samples the bone matrices and skins the mesh itself — same 4-bone linear blend as a real skeleton, same correct normals and tangents. But the CPU never touches a bone.</p>
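<p>To make "4-bone linear blend" concrete, here is a minimal CPU-side sketch in Python of the arithmetic a skinning vertex shader performs per vertex. The function names and the nested-list matrix representation are illustrative assumptions, not the module's code.</p>

```python
# Minimal sketch of 4-bone linear blend skinning (LBS) on the CPU.
# Matrices are 4x4 nested lists in row-major order; all names are hypothetical.

def mat_vec(m, v):
    """Multiply a 4x4 matrix by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def blend_matrices(mats, weights):
    """Weighted sum of bone matrices: skin = sum(w_i * M_i)."""
    return [[sum(w * m[r][c] for m, w in zip(mats, weights))
             for c in range(4)] for r in range(4)]

def skin_vertex(position, bone_mats, weights):
    """Skin one vertex position (x, y, z) with up to 4 weighted bones."""
    skin = blend_matrices(bone_mats, weights)
    x, y, z, _ = mat_vec(skin, [position[0], position[1], position[2], 1.0])
    return (x, y, z)

def translation(tx, ty, tz):
    """Build a pure translation matrix."""
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]

# Two bones pull the vertex in different directions with equal weight;
# the remaining two bone slots carry zero weight.
bones = [translation(2, 0, 0), translation(0, 4, 0),
         translation(0, 0, 0), translation(0, 0, 0)]
skinned = skin_vertex((1.0, 1.0, 1.0), bones, [0.5, 0.5, 0.0, 0.0])
```

<p>Normals go through the same blended matrix with a <code>w</code> of 0, so translation is ignored and only the rotation part applies.</p>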
<p>The module lives in <code>modules/agent_skinned/</code> inside <a href="https://tinqs.com/tinqs/engine" style="color: var(--c-lime);">Tinqs Engine</a>. Two classes, one job.</p>
<p>Holds the bone-matrix palette. Allocates an <code>ImageTexture</code> of size <code>[4 × max_bones, total_frames]</code> in RGBA32F — each texel is one column of a 4×4 bone matrix, each row is one baked animation frame. At load time, we play every animation clip on a temporary skeleton and record the bone matrices for every frame:</p>
<pre><code>Goat: 53 bones, 9 clips, 496 baked frames in total
Texture: 212 × 496 pixels, RGBA32F
VRAM: 212 × 496 × 16 bytes = 1.6 MB</code></pre>
<p>That's every frame of every clip — walk, run, idle, attack, death, eat, sleep — in 1.6 MB. Across 30 animal types: <strong>48 MB total.</strong> Compare to VAT at 426 MB. Bone-matrix is 9× smaller because bones ≪ vertices.</p>
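<p>The size comparison can be checked with a few lines of arithmetic. This sketch assumes the VAT stores one RGB32F position (12 bytes) per vertex per frame, which is an assumption that roughly reproduces the ~14 MB figure quoted earlier. The palette numbers (4 RGBA32F texels per bone, 16 bytes each, 496 frames) come from the goat example.</p>

```python
# Back-of-envelope VRAM comparison: vertex animation texture vs bone palette.
# Assumption: VAT stores one RGB32F position (12 bytes) per vertex per frame.
# Palette figures (4 RGBA32F texels per bone, 16 bytes each) are from the text.

MIB = 1024 * 1024

def vat_bytes(vertices, frames, bytes_per_texel=12):
    """Bytes needed to store every vertex position for every frame."""
    return vertices * frames * bytes_per_texel

def palette_bytes(bones, frames, texels_per_bone=4, bytes_per_texel=16):
    """Bytes needed to store every bone matrix for every frame."""
    return bones * texels_per_bone * frames * bytes_per_texel

goat_vat = vat_bytes(vertices=2500, frames=496)
goat_palette = palette_bytes(bones=53, frames=496)

print(f"VAT:     {goat_vat / MIB:.1f} MiB per animal")
print(f"Palette: {goat_palette / MIB:.1f} MiB per animal")
print(f"Ratio:   {goat_vat / goat_palette:.1f}x")
```

<p>The ratio depends only on vertices versus bones (and bytes per texel), not on the frame count, which is why it holds across clips of any length.</p>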
<p>After the bake, the skeleton is destroyed. It never runs again. The API is straightforward:</p>
<p>The data plane stores matrices column-major — 4 texels per bone = 4 columns of a 4×4 transform. The getter matches the layout, and a doctest asserts it so a transpose can't silently regress.</p>
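<p>A small Python sketch of that layout, assuming texel <code>x = 4 * bone + column</code> and <code>y = frame</code> as implied by the <code>[4 × max_bones, total_frames]</code> texture size. The helper names are hypothetical. It shows the round trip that a layout test would assert.</p>

```python
# Sketch: pack a 4x4 bone matrix into 4 RGBA texels, one texel per column,
# and read it back. Addressing (x = 4 * bone + column, y = frame) is an
# assumption based on the [4 x max_bones, total_frames] texture size.

def texel_coord(bone, column, frame):
    """Texel (x, y) holding `column` of `bone` at baked `frame`."""
    return (4 * bone + column, frame)

def pack_column_major(m):
    """m is a row-major 4x4 nested list; returns 4 texels, one per column."""
    return [tuple(m[r][c] for r in range(4)) for c in range(4)]

def unpack_column_major(texels):
    """Rebuild the row-major 4x4 matrix from 4 column texels."""
    return [[texels[c][r] for c in range(4)] for r in range(4)]

bone = [[1, 0, 0, 5],
        [0, 1, 0, 6],
        [0, 0, 1, 7],
        [0, 0, 0, 1]]   # translation (5, 6, 7) lives in the last column

texels = pack_column_major(bone)
```

<p>If the writer emitted rows instead of columns, <code>texels[3]</code> would be <code>(0, 0, 0, 1)</code> rather than the translation, which is the kind of silent transpose the layout test guards against.</p>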
<p>A <code>MultiMeshInstance3D</code> subclass. Set its multimesh with the skinned mesh and instance transforms, point its <code>data_source_path</code> at the data plane. Call <code>refresh()</code> once — it uploads the bone texture into the shader material's <code>bone_matrices_tex</code> uniform.</p>
<p>Each MultiMesh instance carries 4 numbers in <code>INSTANCE_CUSTOM</code> (enable <code>multimesh.use_custom_data</code>):</p>
<pre><code>NORMAL = normalize((skin * vec4(NORMAL, 0.0)).xyz);</code></pre>
<p>The shader uses <code>INSTANCE_CUSTOM</code> to pick the palette row — not <code>INSTANCE_ID</code>. This is the key: the texture's rows are baked animation frames, not per-instance slots. Many instances share the same rows (a synchronized airborne flock) or each pick their own (a varied herd). One abstraction, two behaviors.</p>
<p>The blend between two adjacent frames means we can bake at a low fps and stay smooth — the shader interpolates. The golden-ratio phase spread means every animal in a herd reads a different frame. One draw call per animal type. Zero CPU. Per-instance clip, speed, and phase — all in the GPU.</p>
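<p>The row lookup and frame blend can be sketched on the CPU. This version assumes the four custom values are clip start row, frame count, speed, and phase (in the range 0 to 1), and that clips are baked at a fixed <code>bake_fps</code>. Both the packing and the 15 fps default are assumptions for illustration, not the shipped shader.</p>

```python
import math

def sample_rows(custom, time_s, bake_fps=15.0):
    """Return (row_a, row_b, blend) for one instance.

    custom = (clip_start, frame_count, speed, phase); this packing and the
    bake_fps value are assumptions for illustration.
    """
    clip_start, frame_count, speed, phase = custom
    # Position inside the clip, in frames, wrapped so the clip loops.
    cursor = (time_s * speed * bake_fps + phase * frame_count) % frame_count
    frame_a = int(math.floor(cursor))
    frame_b = (frame_a + 1) % frame_count   # wrap back to the clip's first frame
    blend = cursor - frame_a                # 0.0 .. 1.0 between the two rows
    return (int(clip_start) + frame_a, int(clip_start) + frame_b, blend)

def lerp(a, b, t):
    """Linear interpolation, applied per matrix element between the two rows."""
    return a + (b - a) * t

# An instance half a second into a 20-frame clip that starts at palette row 100.
row_a, row_b, blend = sample_rows((100, 20, 1.0, 0.0), 0.5)
```

<p>Because <code>frame_b</code> wraps inside the clip block, the last baked frame blends into the first one of the same clip rather than into the next clip's rows.</p>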
<p>The shader ships as the default material on <code>MultiSkinnedInstance3D</code>. It includes an <code>albedo_tex</code> uniform — the caller sets it from the source mesh's material so herds texture out of the box. No <code>ShaderMaterial</code> assembly required unless you want custom shading.</p>
<h2>The numbers</h2>
<p>Measured on an M1 Pro MacBook Pro (integrated GPU):</p>
<p><strong>VRAM:</strong> 1.6 MB per animal type. 30 types = 48 MB total. Even on a Steam Deck, where the GPU's default reserved allocation is 1 GB of the 16 GB unified memory, the entire roster fits.</p>
<p><strong>Draw calls:</strong> One per animal type. 30 types = 30 draw calls for every animated animal on screen. Future colonists share the same architecture — one draw call per colonist look.</p>
<h2>What's driving it</h2>
<p>In <a href="https://www.arikigame.com" style="color: var(--c-lime);">Ariki</a>, the sim tracks animal migration across a 12 km archipelago. <code>AnimalHerdRenderer.cs</code> groups sim <code>ViewerState.animals</code> by type and feeds world positions and yaw rotations to <code>skinned_herd.gd</code>, the reusable per-type herd backend. The herd bakes the palette once at setup, then <code>set_positions()</code> updates transforms each sim tick. <code>set_clip_for_state()</code> switches the active clip block in the custom data when the sim FSM changes state. <code>set_speed_scale()</code> adjusts the per-instance playback rate to match ground speed, so feet stay planted.</p>
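<p>One plausible way to derive that playback rate is to divide the instance's ground speed by the speed at which the clip was authored. The sketch below assumes such a reference speed exists per clip. The name <code>ref_speed</code> and the clamp limits are hypothetical.</p>

```python
def speed_scale(ground_speed, ref_speed, min_scale=0.25, max_scale=3.0):
    """Playback rate that keeps feet planted: 1.0 when moving at ref_speed.

    ref_speed is the ground speed the clip was authored for (assumed to be
    known per clip); the clamp range is an arbitrary illustrative choice.
    """
    if ref_speed <= 0.0:
        return 1.0   # no usable reference: play at the authored rate
    return max(min_scale, min(max_scale, ground_speed / ref_speed))

# A walk clip authored at 1.5 m/s, played on an animal moving at 3 m/s.
rate = speed_scale(3.0, 1.5)
```

<p>Clamping keeps a nearly stationary animal from freezing mid-stride and a sprinting one from blurring its legs.</p>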
<p>The sim owns all behavior — 30 data-driven animals with per-animal senses, diet, combat stats, and FSM states (graze, drink, sleep, hunt, flee, scavenge, die). The client just renders. This is the same code in single-player and multiplayer — the sim is the host.</p>
<p>Bird flocks use the same system. <code>BirdFlock.cs</code> runs boid flocking on top of <code>skinned_herd</code>, sharing the palette with synchronized phases (airborne flapping in unison is intentional). 25 bird species migrated from the Low Poly Bird Ultimate Pack, each a single draw call.</p>
<p>Per-instance custom data means a walking Boar, a running Boar, an idle Boar, and an attacking Boar all share the same baked palette — they just point at different rows. The renderer groups by type, not by state. One palette, one draw call, any number of states.</p>
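<p>A sketch of how per-instance custom data might be filled for a mixed-state herd. The clip table, its row ranges, and the <code>(clip_start, frame_count, speed, phase)</code> packing are assumptions for illustration. The phase uses the fractional part of the index times the golden ratio conjugate, a common low-discrepancy spread.</p>

```python
import math

GOLDEN_RATIO_CONJUGATE = (math.sqrt(5.0) - 1.0) / 2.0   # ~0.618

# Hypothetical clip table: state -> (first palette row, frame count).
CLIPS = {"idle": (0, 40), "walk": (40, 24), "run": (64, 16), "attack": (80, 30)}

def phase_for(index):
    """Phase in [0, 1) that keeps neighbouring instances out of step."""
    return (index * GOLDEN_RATIO_CONJUGATE) % 1.0

def custom_data(index, state, speed=1.0):
    """Four floats for one instance: clip start row, frame count, speed, phase."""
    clip_start, frame_count = CLIPS[state]
    return (float(clip_start), float(frame_count), speed, phase_for(index))

# Four boars in four different states, all reading the same palette texture.
herd = [custom_data(i, s) for i, s in enumerate(["walk", "run", "idle", "attack"])]
```

<p>Changing an animal's state only rewrites its four floats. The palette texture and the draw call are untouched.</p>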
<p>The module had data-plane doctests from day one — round-trip pose get/set, dirty tracking, size clamping, AABB, column-major layout. All green. Then we put it on screen and two things were wrong.</p>
<p><strong>Bug 1: Shader compile failure.</strong> The default skinning shader treated <code>TANGENT</code> as a <code>vec4</code>. Godot 4 exposes it as a <code>vec3</code>. Fixed in one line, and we added an <code>albedo_tex</code> uniform in the same pass so herds texture out of the box.</p>
<p><strong>Bug 2: Bone matrices stored transposed.</strong> The initial data plane wrote basis rows (standard Godot <code>Transform3D.basis</code> is row-major), but the shader reads <code>mat4(c0,c1,c2,c3)</code> as columns. Every bone matrix was transposed — the mesh crumpled. Not a scale bug, not an orientation bug — a layout mismatch. Fixed by storing column-major, with a doctest to prevent regression.</p>
<p>The module is 40 lines of shader code and ~500 lines of C++ in the engine's <code>modules/agent_skinned/</code>. The critical detail is in the shader: the bone-matrix texture is indexed by a <strong>pose slot</strong> computed from <code>INSTANCE_CUSTOM</code>, not by <code>INSTANCE_ID</code>. This is what decouples the palette from the instance count — the texture stores animation frames, the MultiMesh stores instance transforms, and the shader bridges them.</p>
<p>Engine version: <strong>4.6.5.</strong></p>
<p>No C# wrapper is generated — instantiate from GDScript via <code>ClassDB.instantiate()</code> and call the bound methods. The binding surface is small and stable. See <code>ariki-game/scenes/animals/skinned_herd.gd</code> for the reference backend.</p>
<h2>The production pipeline</h2>
<p>The <code>migrate_animals.py</code> tool converts polyperfect FBX packs to game-ready GLBs — imports, cleans hierarchy, rebuilds named NLA clips from frame ranges, strips duplicate meshes, bakes into the flat <code>assets/models/glbs/</code> directory. Each animal gets a catalog entry in <code>animals_catalog.json</code> with clip metadata, default state mapping, and an <code>animSpeedRef</code> for foot-sync.</p>
<p>At runtime, <code>AnimalHerdRenderer</code> spawns one <code>skinned_herd</code> per animal type. The herd bakes the palette from the catalog GLB's clips. <code>AnimalAnimationLogic</code> maps sim FSM states to clip keywords (attack → "attack"/"bite", flee → "run"/"gallop", wander → "walk"). The renderer lerps positions between sim ticks for smooth motion and writes per-instance custom data each frame. That write is the only per-frame CPU work: no bone is ever evaluated on the CPU.</p>
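<p>The state-to-clip mapping can be sketched as a keyword search over the clip names in a GLB. The keyword lists for attack, flee, and wander come from the paragraph above. The fallback to an "idle" clip, the function name, and the sample clip names are assumptions.</p>

```python
# Sketch: resolve a sim FSM state to a clip name by keyword, with a fallback.
# Keyword lists for attack/flee/wander come from the text; the rest is assumed.
STATE_KEYWORDS = {
    "attack": ["attack", "bite"],
    "flee": ["run", "gallop"],
    "wander": ["walk"],
}

def clip_for_state(state, clip_names, fallback_keyword="idle"):
    """Return the first clip whose name contains a keyword for `state`."""
    lowered = [(name, name.lower()) for name in clip_names]
    # Try the state's keywords in priority order, then the fallback keyword.
    for keyword in STATE_KEYWORDS.get(state, []) + [fallback_keyword]:
        for name, low in lowered:
            if keyword in low:
                return name
    # Nothing matched: use the first clip if there is one.
    return clip_names[0] if clip_names else None

chosen = clip_for_state("attack", ["Idle_A", "Walk", "Bite_Front"])
```

<p>Matching by keyword rather than exact name lets one mapping cover asset packs whose clips are named slightly differently per animal.</p>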
<h2>Where we stand vs the industry</h2>
<p>The bone-matrix palette technique is the same architecture used by Assassin's Creed Unity, Total War: Warhammer, and Hitman for their crowd systems. We're using the same core idea, in a Godot fork, with smaller VRAM — our low-poly animals keep textures tiny.</p>
<p>The platform supports three tiers by distance, all driven by the same <code>(clip, count, speed, phase)</code> packet:</p>
<ul>
<li><strong>Crowd tier (palette)</strong> — baked poses, GPU-driven, zero CPU. Thousands of agents.</li>
<li><strong>Hero tier (real rigs)</strong> — <code>AnimationTree</code> + <code>SkeletonIK3D</code> + <code>PhysicalBone3D</code> for the nearest few. Smooth gait blends, foot-lock, look-at, ragdoll.</li>
<li><strong>Impostor tier (2D billboards)</strong> — sprite atlas indexed by view-angle and animation-frame. For very far agents.</li>
</ul>
<p>One abstraction, three detail levels. The same code that drives 30 animals today will drive thousands of colonists at launch.</p>
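<p>Tier selection by distance reduces to a pair of thresholds. The radii below are placeholder values chosen for illustration, not measured or shipped settings.</p>

```python
def pick_tier(distance_m, hero_radius=15.0, crowd_radius=150.0):
    """Choose a detail tier from camera distance (thresholds are assumed)."""
    if distance_m <= hero_radius:
        return "hero"       # real rig: AnimationTree, IK, ragdoll
    if distance_m <= crowd_radius:
        return "crowd"      # baked palette, GPU-driven
    return "impostor"       # 2D billboard from a sprite atlas

tiers = [pick_tier(d) for d in (5.0, 80.0, 400.0)]
```

<p>In practice some hysteresis around each threshold would help avoid agents flickering between tiers as the camera moves.</p>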
<p>The 10,000-agent load test pointed at exactly where the engine can still win. The bottleneck at extreme scale isn't the GPU skinning — it's the CPU path feeding it: C# builds arrays, marshals Variants into GDScript, GDScript loops per-instance <code>set_instance_transform</code> / <code>set_instance_custom_data</code>. Three layers of per-instance overhead, all on the main thread. The fixes are engine-deep.</p>
<h3>Tier A — kill the per-frame CPU path</h3>
<p><strong>Bulk instance-upload API.</strong> Add <code>set_instance_data_bulk()</code> that does a single memcpy into the MultiMesh buffer instead of N scripted per-instance calls. One marshalled call + one copy per herd per frame instead of thousands.</p>
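<p>The idea behind a bulk upload is to build one contiguous float buffer per herd and hand it over in a single call. The sketch below assumes a layout of 12 transform floats (three rows of a 3×4 matrix) followed by 4 custom-data floats per instance. That layout and the helper names are assumptions, and <code>set_instance_data_bulk()</code> itself is a proposal, not an existing API.</p>

```python
import math
from array import array

FLOATS_PER_INSTANCE = 16   # assumed: 12 transform floats + 4 custom-data floats

def pack_instance(x, y, z, yaw, custom):
    """One instance: 3x4 transform rows (yaw about +Y) followed by custom data."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [c, 0.0, s, x,
            0.0, 1.0, 0.0, y,
            -s, 0.0, c, z,
            *custom]

def pack_herd(instances):
    """Flatten every instance into one contiguous float32 buffer."""
    buf = array("f")
    for x, y, z, yaw, custom in instances:
        buf.extend(pack_instance(x, y, z, yaw, custom))
    return buf

herd = [(1.0, 0.0, 2.0, 0.0, (40.0, 24.0, 1.0, 0.25)),
        (3.0, 0.0, 4.0, 0.0, (64.0, 16.0, 1.5, 0.5))]
buffer = pack_herd(herd)
```

<p>With the data already in this shape on the caller's side, the engine side reduces to a bounds check and one copy into the instance buffer.</p>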
<p><strong>GPU-driven cull + indirect multi-draw.</strong> A compute pass classifies frustum, distance, and LOD-tier on the GPU and writes an indirect draw buffer per tier — the CPU stops iterating instances entirely. Pairs with bulk upload: together the main thread does ~zero per-instance work.</p>
<p><strong>GPU dead-reckoning of position.</strong> Store per-instance velocity in custom data. Advance transforms from <code>TIME</code> in the vertex shader. The CPU only touches an instance on a sim snapshot (~every 0.4s), not every frame.</p>
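<p>Dead reckoning is linear extrapolation from the last snapshot. A minimal sketch of the per-instance arithmetic the vertex shader would perform, with hypothetical names:</p>

```python
def dead_reckon(snapshot_pos, velocity, snapshot_time, now):
    """Extrapolate position linearly from the last sim snapshot."""
    dt = now - snapshot_time
    return tuple(p + v * dt for p, v in zip(snapshot_pos, velocity))

# Snapshot taken at t = 1.0 s; render half a second later.
pos = dead_reckon((10.0, 0.0, 5.0), (2.0, 0.0, -1.0), 1.0, 1.5)
```

<p>The error grows with time since the snapshot, so this relies on snapshots arriving often enough that the drift stays invisible.</p>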
<h3>Tier B — skinning core upgrades</h3>
<p><strong>Dual-quaternion skinning.</strong> 2 texels per bone instead of 4. Halves palette VRAM, halves per-vertex texel fetches, and fixes the "candy-wrapper" collapse on twisting joints that linear blend skinning has. A real engine-grade upgrade.</p>
<p><strong>Mat4x3 storage.</strong> The bottom row of an affine bone matrix is always <code>(0,0,0,1)</code>. Storing the other three rows as 3 texels per bone, instead of 4 column texels, saves 25% VRAM with zero quality loss. A quick win if dual-quat is too big a step.</p>
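<p>A sketch of the 3-texel variant: keep the three rows that carry rotation, scale, and translation, and re-append the constant <code>(0,0,0,1)</code> vector on read. Matrices here are row-major nested lists in the usual column-vector convention, and the helper names are hypothetical.</p>

```python
def pack_3x4(m):
    """Keep rows 0..2 of a row-major 4x4 affine matrix: 3 texels per bone."""
    return [tuple(m[r]) for r in range(3)]

def unpack_3x4(texels):
    """Rebuild the full matrix by re-appending the constant bottom row."""
    return [list(t) for t in texels] + [[0.0, 0.0, 0.0, 1.0]]

def palette_bytes(bones, frames, texels_per_bone, bytes_per_texel=16):
    """Palette size in bytes for a given texel count per bone."""
    return bones * texels_per_bone * frames * bytes_per_texel

bone = [[1.0, 0.0, 0.0, 5.0],
        [0.0, 1.0, 0.0, 6.0],
        [0.0, 0.0, 1.0, 7.0],
        [0.0, 0.0, 0.0, 1.0]]

texels = pack_3x4(bone)
saving = 1.0 - palette_bytes(53, 496, 3) / palette_bytes(53, 496, 4)
```

<p>The trade-off is that each texel now holds a row rather than a column, so the shader has to assemble the matrix accordingly.</p>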
<p><strong>Reduced-bone far LOD.</strong> Drop fingers, tail, and face bones for the far tier — fewer fetches where detail isn't visible.</p>
<h3>Tier C — visibility & render passes</h3>
<p><strong>Frustum-cull integration.</strong> MultiMesh draws everything in <code>visible_instance_count</code> — wire per-instance frustum culling into the engine's visibility system.</p>
<p><strong>Shadow-pass LOD.</strong> The skinning shader runs again in the depth/shadow pass. Skip skinning or drop shadow casters beyond a distance — often a hidden ~2× vertex cost.</p>
<h3>Tier D — quality & pipeline</h3>
<p><strong>In-shader clip cross-fade.</strong> Blend two clip blocks per instance (second custom slot + blend factor) instead of hard state cuts — brings hero-rig smoothness to the whole crowd without a real skeleton.</p>
<p><strong>Threaded bake.</strong> Move the palette bake to a worker thread so first-encounter of a new animal type never hitches the main thread.</p>
<p>The recommended order: bulk upload (directly fixes the measured bottleneck, small, low-risk) → mat4x3 storage (immediate VRAM win) → GPU-driven cull + indirect draw (removes CPU from the loop entirely, unlocks tens of thousands) → dual-quaternion skinning (the skinning-quality leap). The first two are a day each and compounding; the latter two are the deep engine investments that make this a genuinely AAA crowd platform.</p>
<p>Pre-built editor binaries with <code>agent_skinned</code> and the GPU-driven palette baked in — no engine compile required. The game's <code>animal_perf_test.tscn</code> lets you spawn 10/100/1,000/10,000 animals and read live FPS:</p>
<p>All builds are at <a href="https://tinqs.com/tinqs/builds" style="color: var(--c-lime);"><code>tinqs/builds</code></a>. The engine source is private, but the binaries are yours. See <a href="https://tinqs.com/tinqs/builds/src/branch/main/manifest.json" style="color: var(--c-lime);"><code>manifest.json</code></a> for checksums and build details.</p>
<p><strong>Related:</strong> <a href="fork-dont-build" style="color: var(--c-lime);">Fork, Don't Build</a>: why we modify existing platforms instead of building new ones. <a href="godot-optimisation" style="color: var(--c-lime);">Streaming a 12km Archipelago in Godot 4</a>: the terrain and vegetation streaming layers that work alongside this.</p>