<title>How Pi Agents Build, Test, and Ship Game Code with Oracle-Backed Flows — Tinqs Blog</title>
<meta name="description" content="We use Pi flows with oracle-backed gates to make agents compile, test, drive the live game, measure feel, fix CI failures, and ship green PRs, all autonomously.">
<meta property="og:title" content="How Pi Agents Build, Test, and Ship Game Code with Oracle-Backed Flows">
<meta property="og:description" content="Pi flows + oracle-backed gates: agents that compile, test, drive the game, measure feel, fix CI, and ship green PRs.">
<meta name="twitter:title" content="How Pi Agents Build, Test, and Ship Game Code with Oracle-Backed Flows">
<meta name="twitter:description" content="Pi flows + oracle-backed gates: agents that compile, test, drive the game, measure feel, fix CI, and ship green PRs.">
"description":"We use Pi flows with oracle-backed gates to make agents compile, test, drive the live game, measure feel, fix CI failures, and ship green PRs — all autonomously."
<p class="post__lead">Think of a restaurant kitchen during dinner rush. The head chef doesn't cook every dish. She runs the pass: each plate gets inspected before it leaves. One cook handles sauces, another pastry, another the grill. The expediter calls orders, coordinates timing, makes sure table 4's mains don't arrive before table 2's starters. A dish comes back? It goes straight to the station that messed up, with a ticket explaining exactly what's wrong. That kitchen runs on flows. So does our game engine.</p>
<p><strong>The kitchen</strong> = Pi (the agent harness). <strong>The recipe</strong> = a JavaScript flow (<code>.flow.mjs</code>). <strong>The line cooks</strong> = agents (each with a station and tools). <strong>The pass</strong> = the flow engine (routes finished work). <strong>The head chef's inspection</strong> = the five gates. <strong>The order ticket</strong> = a spawn task or <code>tinqs flow run</code>. <strong>"Send it back!"</strong> = the fix loop.</p>
<p>You run <code>tinqs flow run game-feature --task 'add a double-jump with cooldown'</code> or click Run Flow on the dashboard. The ticket hits the kitchen. What follows is not one agent doing everything — it's a brigade running their stations.</p>
<figcaption style="color:#9aa7b4;font-size:0.85rem;margin-top:8px;">A real failure loops back to <em>implement</em> with gate evidence (bounded to three tries); anything green falls through to the judge.</figcaption>
<h2>The Five Gates: What the Head Chef Checks</h2>
<p>In a kitchen, the head chef doesn't trust — she verifies. Every plate hits the pass and gets inspected. Our flows have the same instinct. Each gate is a sub-agent with one job, one tool, and absolute veto power.</p>
<span class="kitchen-col__title kitchen-col__title--kitchen">In the Kitchen</span>
<p><strong>Taste the sauce.</strong> Seasoning right? Acid balanced? The dish might look perfect but taste flat.</p>
</div>
<div class="kitchen-col">
<span class="kitchen-col__title kitchen-col__title--reality">In the Flow</span>
<p><span class="gate gate--test">G2 · Tests</span> Runs <code>dotnet test</code>. Parses which assertions broke. Code that passes the build but fails on logic gets caught here.</p>
</div>
<div class="kitchen-col">
<span class="kitchen-col__title kitchen-col__title--kitchen">In the Kitchen</span>
<p><strong>Does it work?</strong> Pick it up. Does the sauce hold? Does the plating survive the walk to table 6?</p>
</div>
<div class="kitchen-col">
<span class="kitchen-col__title kitchen-col__title--reality">In the Flow</span>
<p><span class="gate gate--behave">G3 · Behaviour</span> Sends <code>{"jump":true}</code> to the LIVE game. Samples the player body 30 times at 50ms intervals. Did the character actually jump? Did the double-jump fire? This is the ground-truth oracle, and it is what makes game dev fundamentally different from web dev.</p>
</div>
<div class="kitchen-col">
<span class="kitchen-col__title kitchen-col__title--kitchen">In the Kitchen</span>
<p><strong>How does it feel?</strong> The steak is cooked but chewy. The sauce is seasoned but gloopy. Edible ≠ good.</p>
</div>
<div class="kitchen-col">
<span class="kitchen-col__title kitchen-col__title--reality">In the Flow</span>
<p><span class="gate gate--feel">G4 · Feel</span> Measures apex height, airtime, liftoff latency, rise/fall asymmetry, landing settle. Numeric thresholds. A jump that works but takes 400ms to lift off fails. Behaviour says it happened. Feel says it felt good.</p>
</div>
<div class="kitchen-col">
<span class="kitchen-col__title kitchen-col__title--kitchen">In the Kitchen</span>
<p><strong>How does it look?</strong> Is the garnish wilting? Sauce smeared? Does it match the menu photo?</p>
</div>
<div class="kitchen-col">
<span class="kitchen-col__title kitchen-col__title--reality">In the Flow</span>
<p>Any red gate → evidence sent back to the cook → fix → re-enter the inspection line. Three chances max, then the head chef escalates to a human. This is the same instinct that makes a good kitchen work: catch it early, send it back with a clear note, give them a chance to fix it, but don't let the same dish circle the pass forever.</p>
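<p>The send-it-back rule can be sketched in a few lines of plain JavaScript. This is a minimal illustration, not the actual flow engine: <code>runWithFixLoop</code>, <code>implement</code>, and the gate functions are hypothetical names, and they are synchronous here only to keep the sketch short. The two details taken from the text are the three-try limit and the fact that the red gate's evidence goes back to the implementer.</p>

```javascript
// Hypothetical sketch of the bounded fix loop. In a real flow, `implement`
// and each gate would be agent calls rather than plain functions.
const MAX_TRIES = 3; // "three chances max" from the text

function runWithFixLoop(implement, gates, task) {
  let evidence = null; // nothing to fix on the first attempt
  for (let attempt = 1; attempt <= MAX_TRIES; attempt++) {
    const work = implement(task, evidence);

    // Walk the inspection line and stop at the first red gate.
    let failed = null;
    for (const gate of gates) {
      const result = gate(work);
      if (result.verdict !== "pass") {
        failed = result;
        break;
      }
    }

    if (failed === null) return { verdict: "pass", attempts: attempt };
    evidence = failed; // the ticket that goes back to the cook
  }
  // Out of tries: hand the last evidence to a human.
  return { verdict: "escalate", attempts: MAX_TRIES, evidence };
}
```

<p>Keeping the evidence as the only thing that flows backwards means each retry starts from a specific failure rather than a vague "try again".</p>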
<p>A kitchen doesn't redesign the whole line when it adds a new dish. It adds a station. Same in flows. We started with three gates: build, test, vision. Behaviour and feel came later, each a single-file extension. Gates aren't hardcoded. They're sub-agents called from JavaScript flows. Want a linting gate? Add an <code>agent()</code> call with a linter. Security scan? Same pattern. Asset bundle size check? Write the tool, declare the agent, wire it in.</p>
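<p>To make "add a station" concrete, here is a rough sketch in which each gate is one entry in a list and a lint gate is one more entry. The gate names, the <code>work</code> fields, and the <code>inspect</code> helper are all assumptions for illustration; the real gates are sub-agents, not inline functions.</p>

```javascript
// Hypothetical gate list. Each gate returns { verdict, findings }.
const gates = [
  {
    name: "build",
    run: (work) => ({ verdict: work.compiles ? "pass" : "fail", findings: [] }),
  },
  {
    name: "tests",
    run: (work) => ({ verdict: work.testsGreen ? "pass" : "fail", findings: [] }),
  },
];

// Adding a lint gate is one more entry; the rest of the line is untouched.
gates.push({
  name: "lint",
  run: (work) => ({
    verdict: work.lintWarnings === 0 ? "pass" : "fail",
    findings: [`${work.lintWarnings} lint warnings`],
  }),
});

// Run the gates in order and report the first red one.
function inspect(work) {
  for (const gate of gates) {
    const result = gate.run(work);
    if (result.verdict !== "pass") return { gate: gate.name, ...result };
  }
  return { gate: null, verdict: "pass", findings: [] };
}
```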
<p>Agents can extend the flow at runtime. If the behaviour gate keeps failing because the game window isn't focused, an agent notices the pattern and inserts a pre-condition gate that checks window focus. The flow engine handles routing; the agents handle decisions. This is what makes flows fundamentally different from a script — the pipeline isn't fixed at compile time. It's a graph that agents read, understand, and modify while they run.</p>
<h2>The CI Loop: The Dish That Came Back After It Left</h2>
<p>Gates inspect plates at the pass. But what about after the plate leaves the kitchen? What about the customer who finds a hair in their soup after it's been served?</p>
<p>Most coding agents don't care. They write code, push, walk away. A human discovers the broken CI build an hour later. That's the equivalent of a cook plating a dish, sending it out, and never checking if the diner is still alive.</p>
<p>The agent pushes, calls <code>ci_wait</code>. If CI fails, it reads <code>ci_logs</code>, fixes the exact error, pushes again. DeepSeek V4 parses compiler errors the way a cook reads a ticket: "missing import" = forgot the salt, "type mismatch" = wrong pan size, "module not found" = ingredient not in stock. Pattern-matched and fixed in seconds.</p>
<p>Adding a health check endpoint to a Go service. Agent wrote the handler and test, pushed. CI failed — the test imported a package that didn't exist on the runner. Agent read <code>ci_logs</code>, saw <code>go: module not found</code>, added the missing <code>go.mod</code> replace directive, pushed again. CI passed. PR opened. <strong>4 minutes. $0.06.</strong></p>
<p>Three safeguards prevent the kitchen grinding to a halt: <strong>retry limit</strong> (3, same dish doesn't circle forever), <strong>diff budget</strong> (retries only touch files already on the ticket), and <strong>hallucination detection</strong> (if the cook claims the customer loved it without actually asking the waiter, they get corrected).</p>
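<p>Two of those safeguards, the retry limit and the diff budget, are easy to sketch. The code below is an assumption-heavy illustration: the <code>ci</code> object stands in for the <code>ci_wait</code> and <code>ci_logs</code> tools with made-up method signatures, <code>fix</code> stands in for the agent's repair step, and the limit is read as "up to three retries after the first push". Everything is synchronous to keep it short.</p>

```javascript
// Hypothetical helper: a retry may only touch files already on the ticket.
function withinDiffBudget(ticketFiles, retryFiles) {
  const allowed = new Set(ticketFiles);
  return retryFiles.every((file) => allowed.has(file));
}

// Hypothetical CI loop. `ci.wait`, `ci.logs`, and `ci.push` are stand-ins,
// not the real tool signatures.
function ciLoop(ci, fix, ticketFiles, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const run = ci.wait(); // returns once the pipeline has finished
    if (run.status === "success") return { status: "green", retries: attempt };
    if (attempt === maxRetries) break; // no retries left

    const patch = fix(ci.logs(run)); // repair based on the actual log
    if (!withinDiffBudget(ticketFiles, patch.files)) {
      return { status: "rejected", reason: "diff budget exceeded", retries: attempt };
    }
    ci.push(patch);
  }
  return { status: "escalate", retries: maxRetries };
}
```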
<p>This morning, I ran three flows. Each is a different kitchen, a different brigade, a different dish. Here's what actually happened — real flow logs, real verdicts, nothing staged.</p>
<p><strong>deep-implement</strong>: "Build the tinqs-gitea-read extension: list_org_repos, read_repo_file, list_repo_dir, search_repos." Nine steps, 14 minutes. Verdict: <span class="gate gate--test">PASS</span>. 31/31 vitest tests green, zero new TypeScript errors, session-level caching, path traversal protection. Every <code>execute()</code> body fully wired, with no stubs and no placeholders. Like a saucier who doesn't just list ingredients but actually makes the sauce.</p>
<p><strong>game-feature</strong>: "Make the player jump." Build: <span class="gate gate--build">PASS</span>. Tests: <span class="gate gate--test">PASS</span>. Behaviour/Feel/Visual: <span style="color:#f59e0b;">NOT RUN</span>, because no live game instance was reachable. The flow didn't silently skip the visual gate. It <strong>hard-stopped</strong> and reported honestly: "FAIL — the feature has not been verified in-game." This is the kitchen saying: "The dish is cooked, but nobody tasted it. I'm not sending it out."</p>
<p><strong>cto-infra</strong> — "Synthesize cost, stability, and VCS research into an AWS architecture decision." Four research streams fed into one CTO agent. Output: 14 requirements mapped to specific decisions, cost-vs-stability tradeoffs resolved with dollar figures, EC2+EBS over Fargate+EFS, RDS Multi-AZ mandatory, S3+CloudFront for LFS. Like an executive chef reading four menu proposals, reconciling them into one service, and pricing every plate.</p>
<h2>Dinner Rush Recovery: The Crash That Interrupted Service</h2>
<p>Earlier today, a machine crash cut off a flow mid-stream — the kitchen lost power during dinner rush. Nineteen tests were left red. Contracts written, implementation half-done. Half-cooked dishes on every station.</p>
<p>Here's how the brigade actually worked. The <strong>vision-preflight</strong> agent — the chef who checks the gas is on before anyone starts cooking — verified <code>GEMINI_API_KEY</code> was set and <code>game_frames</code> could reach the live game. Both green in under a second. Without this, the whole kitchen would prep for an hour only to discover the oven doesn't work.</p>
<p>The <strong>project-context-reader</strong> — the commis who reads the entire recipe book — ingested <code>PlayerController.cs</code>, <code>PlayerAnimController.cs</code>, <code>PlayerAnimationLogic.cs</code>, the test files, the manifest. The <strong>feature-planner</strong> — the sous-chef who breaks down the order into station tasks — decomposed 19 failures into four fix groups: vegetation manifest (146 broken <code>prefabPath</code> items), animation controller (crouch parameter not plumbed), jump physics (coyote time, variable height, air control — all missing), and animation tree (entire state machine absent).</p>
<p>Then the <strong>game-builder</strong> — the line cook at the hot station — read each test failure like a dish ticket, traced it to the source, and started cooking. Coyote time: 100ms grace period after feet leave the ground. Variable jump height: velocity scaled by hold duration, tap gives 3.5, full hold gives 6.5. Air control: horizontal speed cut 40% while airborne. Jump phases: minimum 0.15s on jump_start before transitioning up. Landing timer: wait the full animation length, not length-minus-blend. Animation tree: <code>jump_start → jump → jump_land</code> states with 0.1s blends.</p>
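<p>The jump rules above boil down to a little arithmetic. The sketch below restates them in JavaScript for illustration only; the real implementation is C# in the game. The 100ms grace period, the 3.5 and 6.5 velocities, and the 40% air-speed cut come from the text. The linear scaling and the <code>MAX_HOLD_MS</code> value are assumptions.</p>

```javascript
const COYOTE_MS = 100;     // grace period after feet leave the ground
const TAP_VELOCITY = 3.5;  // jump velocity for a tap
const FULL_VELOCITY = 6.5; // jump velocity for a full hold
const AIR_CONTROL = 0.6;   // horizontal speed cut 40% while airborne
const MAX_HOLD_MS = 200;   // ASSUMPTION: hold time that counts as "full"

// Coyote time: a jump is still allowed shortly after leaving the ground.
function canJump(onGround, msSinceLeftGround) {
  return onGround || msSinceLeftGround <= COYOTE_MS;
}

// Variable jump height: ASSUMED linear interpolation between tap and full hold.
function jumpVelocity(holdMs) {
  const t = Math.min(Math.max(holdMs / MAX_HOLD_MS, 0), 1);
  return TAP_VELOCITY + (FULL_VELOCITY - TAP_VELOCITY) * t;
}

// Air control: reduced horizontal speed while airborne.
function horizontalSpeed(groundSpeed, onGround) {
  return onGround ? groundSpeed : groundSpeed * AIR_CONTROL;
}
```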
<p>Then the inspection line: <strong>build-verifier</strong> compiled. <strong>Test-runner</strong> ran the suite. <strong>Behavioral-prober</strong> sent <code>{"jump":true}</code> to the live game and sampled the player body. <strong>Feel-judge</strong> measured apex, airtime, liftoff latency. <strong>Animation-vision-judge</strong> captured 8 frames, gridded them, had <code>gemini-2.5-flash</code> scan for T-poses and foot-slide.</p>
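<p>What the feel-judge measures can be approximated from the sampled player positions. The sketch below assumes samples shaped like <code>{ t, y }</code> (milliseconds since the input, player height), sorted by time. That shape, the function names, and the threshold values are hypothetical; only the metric names and the idea of numeric thresholds come from the text.</p>

```javascript
// Derive liftoff latency, apex height, and airtime from position samples.
function measureJump(samples, groundY = 0, eps = 0.01) {
  const airborne = samples.filter((s) => s.y > groundY + eps);
  if (airborne.length === 0) return { jumped: false };

  const liftoffMs = airborne[0].t; // first sample above the ground
  const landed = samples.find((s) => s.t > liftoffMs && s.y <= groundY + eps);
  const apex = Math.max(...airborne.map((s) => s.y)) - groundY;

  return {
    jumped: true,
    liftoffMs,
    apex,
    // null means the player was still airborne at the last sample
    airtimeMs: landed ? landed.t - liftoffMs : null,
  };
}

// ASSUMED thresholds; the real numbers live in the feel gate's config.
function feelVerdict(metrics, limits = { maxLiftoffMs: 150, minApex: 1.0 }) {
  const findings = [];
  if (!metrics.jumped) {
    findings.push("player never left the ground");
  } else {
    if (metrics.liftoffMs > limits.maxLiftoffMs) {
      findings.push(`liftoff took ${metrics.liftoffMs}ms`);
    }
    if (metrics.apex < limits.minApex) {
      findings.push(`apex ${metrics.apex} below ${limits.minApex}`);
    }
  }
  return { verdict: findings.length === 0 ? "pass" : "fail", findings };
}
```

<p>Because sampling happens every 50ms, a latency measured this way is only accurate to the sampling interval, which is worth remembering when choosing thresholds.</p>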
<p>Anything red → ticket back to the cook with the specific failure → fix → re-enter the line. Bounded to 3 returns. Anything green → falls through. All green → <strong>game-judge</strong> gives the final verdict.</p>
<p>This flow is a file at <code>.pi/flows/flows/game-feature.flow.mjs</code>. I trigger it by running <code>tinqs flow run game-feature</code> or clicking Run Flow on the dashboard. It dispatches agents, runs gates, loops on failures, reports a verdict. The dashboard at <code>:33634</code> is the control plane — spawn, steer mid-run, inspect state. That's the whole product.</p>
<p>Every flow lives in <code>.pi/flows/flows/*.flow.mjs</code> and is spawnable by name. You run <code>tinqs flow run &lt;name&gt; [task]</code> or click Run Flow on the dashboard.</p>
<p>"Add wall-running" becomes the task argument. The flow reads it, wires it through the agents, routes it through the gates. The JavaScript is the recipe. The conversation provides the context.</p>
<li><strong>game-feature</strong> — "add a double-jump" or "fix the 19 red tests" → brigade assembles, cooks, inspects, plates</li>
<li><strong>deep-implement</strong> — "build the gitea-read extension" → research → plan → implement → test → review → judge</li>
<li><strong>cto-infra</strong> — "reconcile cost, stability, and VCS research into architecture decisions" → 4 research streams → 1 synthesis agent → 14 requirements mapped to decisions</li>
<li><strong>flows:new</strong> — "I need a flow that..." → the Flow Architect reads the agent catalog, selects cooks, designs the recipe, writes the <code>.flow.mjs</code></li>
<p>In a real kitchen, cooks don't shout instructions across the room. They place finished plates on the pass. The expediter reads the ticket, checks the plate, routes it to the next station or to the dining room. Nobody yells. Nobody grabs someone else's pan.</p>
<p>Flows work the same way. Agents never talk to each other directly. When the game-builder finishes, it returns a result object — placing its work on the pass. The flow engine — the expediter — records it and routes it. The next agent receives the return value directly from <code>await flow.agent("game-builder")</code>.</p>
<span class="kitchen-col__title kitchen-col__title--kitchen">What People Expect</span>
<p>Agents chatting freely, PM-slack style: "Hey test-runner, I just pushed some code, can you check it? Also the jump feels off, maybe tune the velocity?"</p>
<p>Agent A returns <code>{ verdict: "pass", findings: ["coyote_time=100ms"] }</code> → flow engine records it → Agent B receives the result as a direct return value of <code>await flow.agent("A")</code>. No chatter. Structured handoff.</p>
<p>Why? Because unstructured chatter is how hallucination cascades start. Agent A confidently states something wrong. Agent B builds on it. Agent C compounds it. Three agents later, they're collectively wrong about a file that doesn't exist, and nobody can trace where the error came from. The pass — structured result-passing via typed return values from each <code>agent()</code> call — makes every handoff auditable, verifiable, and debuggable.</p>
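<p>A structured handoff is cheap to enforce. The sketch below is a stand-in for the flow engine, not its real API: <code>createFlow</code> and <code>validateResult</code> are hypothetical, and the result shape is the <code>{ verdict, findings }</code> object shown earlier, with <code>"fail"</code> assumed as the other verdict value. Every result is checked and logged before the next agent sees it.</p>

```javascript
// Reject anything that does not match { verdict: "pass" | "fail", findings: string[] }.
function validateResult(agentName, result) {
  const okVerdict =
    result && (result.verdict === "pass" || result.verdict === "fail");
  const okFindings =
    result &&
    Array.isArray(result.findings) &&
    result.findings.every((f) => typeof f === "string");
  if (!okVerdict || !okFindings) {
    throw new TypeError(`${agentName} returned a malformed result`);
  }
  return result;
}

// Hypothetical stand-in for the flow engine: `agents` maps a name to an
// async function. Each handoff is validated and appended to an audit trail.
function createFlow(agents) {
  const trail = [];
  return {
    trail,
    async agent(name, input) {
      const result = validateResult(name, await agents[name](input));
      trail.push({ agent: name, input, result });
      return result;
    },
  };
}
```

<p>When something goes wrong three agents later, the trail shows which agent first returned the bad finding.</p>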
<p>Pi itself is built for solo interactive work: you ask, it does, you review. The orchestration layer I wrote on top inverts that. Pi becomes the kitchen. The flow engine becomes the expediter. Agents become line cooks who place plates on the pass, never shouting across the room.</p>
<h2>The Setup: Extensions, Agents, and 15–20 Flows</h2>
<p>"How did you set this up?" is the question I get most often. Here's the honest answer: there's no dashboard with drag-and-drop. You write three kinds of files.</p>
<p><strong style="color:#f59e0b;">Extensions</strong> are TypeScript tools that agents call. Each is about 300 lines, MIT licensed:</p>
<th style="padding:8px 12px;color:#c9935a;">What agents call it for</th>
</tr>
</thead>
<tbody>
<tr style="border-bottom:1px solid #1c2230;"><td style="padding:7px 12px;color:#e6edf3;"><code>verify_build</code></td><td style="padding:7px 12px;color:#cdd7e2;">Compile the game + sim, return file:line errors</td></tr>
<tr style="border-bottom:1px solid #1c2230;"><td style="padding:7px 12px;color:#e6edf3;"><code>drive_game</code></td><td style="padding:7px 12px;color:#cdd7e2;">Send input to the live game, sample player body</td></tr>
<tr style="border-bottom:1px solid #1c2230;"><td style="padding:7px 12px;color:#e6edf3;"><code>game_frames</code></td><td style="padding:7px 12px;color:#cdd7e2;">Capture screenshot sequences for vision judging</td></tr>
<tr style="border-bottom:1px solid #1c2230;"><td style="padding:7px 12px;color:#e6edf3;"><code>ci_status</code></td><td style="padding:7px 12px;color:#cdd7e2;">Check Gitea Actions pipeline state for a branch</td></tr>
<tr style="border-bottom:1px solid #1c2230;"><td style="padding:7px 12px;color:#e6edf3;"><code>ci_logs</code></td><td style="padding:7px 12px;color:#cdd7e2;">Fetch full build log from the most recent failed run</td></tr>
<tr style="border-bottom:1px solid #1c2230;"><td style="padding:7px 12px;color:#e6edf3;"><code>ci_wait</code></td><td style="padding:7px 12px;color:#cdd7e2;">Poll every 15 seconds until the pipeline finishes</td></tr>
<tr style="border-bottom:1px solid #1c2230;"><td style="padding:7px 12px;color:#e6edf3;"><code>gen_image</code></td><td style="padding:7px 12px;color:#cdd7e2;">Generate brand/marketing images via fal.ai flux-2-pro</td></tr>
<tr><td style="padding:7px 12px;color:#e6edf3;"><code>agent_catalog</code></td><td style="padding:7px 12px;color:#cdd7e2;">List available agents with their tools, inputs, outputs</td></tr>
</tbody>
</table>
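<p>As a rough idea of what a tool like <code>ci_wait</code> does, here is a minimal polling sketch. It is written in JavaScript rather than the extension's TypeScript, and it is not the extension's code: <code>getStatus</code>, the status strings, and the <code>maxPolls</code> cap are assumptions. Only the 15-second interval comes from the table.</p>

```javascript
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Poll a status function until the pipeline leaves a non-terminal state.
// `getStatus` is a hypothetical async function returning a status string.
async function ciWait(
  getStatus,
  { intervalMs = 15000, maxPolls = 240, sleep = defaultSleep } = {}
) {
  for (let i = 0; i < maxPolls; i++) {
    const status = await getStatus();
    // ASSUMED status names; anything else is treated as finished.
    if (status !== "pending" && status !== "running") return status;
    await sleep(intervalMs);
  }
  return "timeout"; // give up instead of polling forever
}
```

<p>Passing <code>sleep</code> in as an option keeps the loop testable without waiting on real timers.</p>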
<p><strong style="color:#f59e0b;">Agents</strong> are Markdown files with YAML frontmatter. Each declares its role, model tier, tools, inputs, and outputs:</p>
<pre><code>---
name: game-builder
description: Implements game features in C# (Godot)
</code></pre>
<p><strong style="color:#f59e0b;">Flows</strong> are JavaScript modules (<code>.flow.mjs</code>) that coordinate agents with real control flow. I have about <strong>15–20 flows</strong> running across different domains:</p>
<p>The setup is not a product you install. It's a stack: Pi as the agent harness, custom extensions as the tool layer, markdown agents as the role layer, JavaScript flows as the orchestration layer. The whole thing lives in <code>.pi/flows/</code>. Version-controlled. CI-tested. Spawned via <code>tinqs flow run</code> or the dashboard.</p>
<p>"Do you define the process with these trees, or do the agents freestyle?" Both. The recipe says what to make and in what order. The technique is how each cook executes their station.</p>
<p>The flow's JavaScript is the recipe. It says: first the prep cook dices onions, then the saucier makes the base, then the grill cook sears the protein. After every station, the plate hits the pass for inspection. <strong>This order is not negotiable.</strong> A cook cannot skip the inspection because they feel confident. The inspection runs. Period.</p>
<p>Inside their station, a cook has full agency. How they dice the onions — brunoise or rough chop — is their call. Which pan they use, how they adjust the heat, whether they taste midway. The game-builder decides which files to read, which approach to take. Nobody tells it "edit line 247." It figures that out with <code>grep</code>, <code>find</code>, and reading code.</p>
<p>This balance is everything. Too much recipe → agents can't handle surprises. Too much freestyle → agents hallucinate, skip checks, ship broken code. The recipe guarantees the right things happen in the right order — preflight before build, build before test, test before ship. The technique handles the messy, unpredictable reality of actual code.</p>
<p>And when a recipe is wrong? Another flow improves it. A meta-flow reads performance data, spots bottlenecks — "the feel gate keeps failing because the cook doesn't know the jump velocity threshold" — edits the <code>.flow.mjs</code> to pass that threshold into the builder's inputs, and commits the change. <strong>Flows that edit flows.</strong> The kitchen that renovates itself between services.</p>
<p>You don't use a paring knife to butcher a cow. You don't use a cleaver to supreme an orange. Different work needs different blades. Flows use <strong>role-based model tiers</strong> — each agent declares the blade it needs, and the engine hands it the right one at dispatch time.</p>
<p>Two reasons. <strong>It's free</strong> — no usage limits, which matters when your game-builder reads 800-line files and writes 200-line diffs ten times a session. <strong>It's genuinely good at C# and Godot</strong> — I've had it write a full lighting module for our Godot fork by reading Unity API docs and adapting patterns. No agent had pulled that off before. DeepSeek can't do multimodal, so vision goes to Gemini — but for everything else, it's the chef's knife you reach for 90% of the time.</p>
<p>The point of the knife rack: you configure this <strong>once</strong>. Every agent declares <code>model: @coding</code> and gets DeepSeek V4 automatically. Swap models globally without touching any flow or agent file. The right blade, every time, no thinking required.</p>
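<p>The knife rack amounts to one lookup table. The sketch below shows the idea under stated assumptions: <code>resolveModel</code> and the <code>@vision</code> tier name are hypothetical, and the model strings are illustrative labels rather than verified provider identifiers. Only <code>@coding</code> and the models it and the vision work map to are taken from the text.</p>

```javascript
// Hypothetical tier table: configured once, shared by every agent.
const MODEL_TIERS = {
  "@coding": "deepseek-v4",      // illustrative label
  "@vision": "gemini-2.5-flash", // "@vision" is an assumed tier name
};

// Turn an agent's declared `model:` value into a concrete model name.
function resolveModel(declared, tiers = MODEL_TIERS) {
  if (!declared.startsWith("@")) return declared; // already a concrete model
  const model = tiers[declared];
  if (!model) throw new Error(`unknown model tier: ${declared}`);
  return model;
}
```

<p>Swapping the model behind a tier is then a one-line change to the table, with no edits to any agent or flow file.</p>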
<p>The oracle tools — <code>verify_build</code>, <code>drive_game</code>, <code>game_frames</code> — are the durable assets. About 300 lines of TypeScript each, MIT licensed, reusable in any Pi project. The flow engine composes them; the agents route through them.</p>
<p>A year ago we had a supervisor written in 1,050 lines of hardcoded TypeScript that did one thing: verify agent output compiled and passed tests. We deleted it. The same verification now runs as a composable flow with five gates, live-game testing, and CI integration. Sometimes the best architecture decision is knowing what to delete.</p>
<p><em>The flow-native brain runs on our <a href="https://tinqs.com/tinqs/pi">Pi fork</a> inside <a href="https://tinqs.com">Tinqs Studio</a>. The oracle extensions are MIT licensed and reusable in any Pi project.</em></p>