<title>How Pi Agents Build, Test, and Ship Game Code with Oracle-Backed Flows — Tinqs Blog</title>
<meta name="description" content="We use Pi flows with oracle-backed gates to make agents compile, test, drive the live game, measure feel, fix CI failures, and ship green PRs — all autonomously.">
<meta property="og:title" content="How Pi Agents Build, Test, and Ship Game Code with Oracle-Backed Flows">
<meta property="og:description" content="Pi flows + oracle-backed gates: agents that compile, test, drive the game, measure feel, fix CI, and ship green PRs.">
<meta name="twitter:title" content="How Pi Agents Build, Test, and Ship Game Code with Oracle-Backed Flows">
<meta name="twitter:description" content="Pi flows + oracle-backed gates: agents that compile, test, drive the game, measure feel, fix CI, and ship green PRs.">
<h1 class="post__title">How Pi Agents Build, Test, and Ship Game Code with Oracle-Backed Flows</h1>
<p class="post__lead">When we ask Pi to build a feature for Ariki — say, "add a double-jump with a cooldown indicator" — five things happen. The agent writes the code. A build gate compiles it. A test gate runs the test suite. A behaviour gate drives the live game and checks the character actually double-jumps. A feel gate measures apex height, airtime, and landing settle. And if CI disagrees with any of it, the agent reads the failure log and fixes it. None of this is magic. It's Pi flows.</p>
<h2>What Happens When You Ask Pi to Build Something</h2>
<p>The flow starts the same way every agent task does: context, then plan, then implement. That's the standard loop. What makes it interesting is what happens <em>after</em> implementation — a ladder of five gates, each run by a specialised sub-agent with its own tools and its own pass/fail authority.</p>
<figcaption style="color:#9aa7b4;font-size:0.85rem;margin-top:8px;">A real failure loops back to <em>implement</em> with gate evidence (bounded to three tries); anything green falls through to the judge.</figcaption>
<p>Each gate is a sub-agent with one job and one tool.</p>
<p><strong style="color:#f59e0b;">G1 — Build.</strong> Runs <code>dotnet build</code> on the game and sim. Returns PASS/FAIL with file:line errors. If the code doesn't compile, nothing proceeds.</p>
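<p>As a rough illustration of how a build gate could turn compiler output into PASS/FAIL with file:line errors, here is a minimal sketch. It assumes the standard MSBuild diagnostic layout (<code>File.cs(line,col): error CODE: message [project]</code>); the <code>BuildError</code> and <code>BuildGateResult</code> shapes and the <code>parseBuildOutput</code> name are hypothetical and are not taken from the actual <code>verify_build</code> tool.</p>

```typescript
// Hypothetical result shapes; the real verify_build output is not shown in the post.
interface BuildError {
  file: string;
  line: number;
  column: number;
  code: string;
  message: string;
}

interface BuildGateResult {
  pass: boolean;
  errors: BuildError[];
}

// Matches the standard MSBuild diagnostic layout:
//   path/File.cs(12,34): error CS1002: ; expected [/path/Project.csproj]
const ERROR_LINE = /^(.+?)\((\d+),(\d+)\): error (\w+): (.*?)(?: \[.*\])?$/;

function parseBuildOutput(output: string): BuildGateResult {
  const seen = new Set<string>();
  const errors: BuildError[] = [];
  for (const raw of output.split(/\r?\n/)) {
    const match = ERROR_LINE.exec(raw.trim());
    if (match === null) continue; // warnings and progress lines are skipped
    const [, file, line, column, code, message] = match;
    // MSBuild repeats each error in its end-of-build summary, so de-duplicate.
    const key = `${file}:${line}:${column}:${code}`;
    if (seen.has(key)) continue;
    seen.add(key);
    errors.push({ file, line: Number(line), column: Number(column), code, message });
  }
  // Assumption: no parsed errors means PASS; a real gate would also check the exit code.
  return { pass: errors.length === 0, errors };
}
```

<p>Returning structured errors rather than raw text is what lets a failure loop back to <em>implement</em> with evidence the agent can act on directly.</p>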
<p><strong style="color:#f59e0b;">G2 — Tests.</strong> Runs <code>dotnet test</code> and parses results. The agent reads which tests broke and fixes assertions, mocks, or test setup.</p>
<p><strong style="color:#f59e0b;">G3 — Behaviour (live game).</strong> This is the one that makes game dev different from web dev. The agent sends input to the running game — <code>{"jump": true}</code> — and samples the player body 30 times at 50ms intervals. It checks: did the character actually jump? Did the double-jump fire? Was there a cooldown? The <code>drive_game</code> tool is the ground-truth oracle for whether a movement feature works in-game, not just in tests.</p>
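<p>One way to answer "did the double-jump fire?" from a run of body samples is to count upward velocity impulses. The sketch below is an assumption about how such a check might look, not the <code>drive_game</code> implementation: the <code>BodySample</code> shape, the <code>minKick</code> threshold, and both function names are hypothetical.</p>

```typescript
// Hypothetical sample shape; the post does not show what drive_game returns.
interface BodySample {
  tMs: number; // time since the input was sent
  y: number;   // vertical position
  vy: number;  // vertical velocity
}

// Count upward impulses: moments where vertical velocity rises sharply and
// ends positive. Landing also raises vy (from negative to zero), so the
// `vy > 0` condition keeps a landing from being counted as a jump.
function countJumpImpulses(samples: BodySample[], minKick = 2): number {
  let count = 0;
  let prev: BodySample | undefined;
  for (const cur of samples) {
    if (prev !== undefined && cur.vy > 0 && cur.vy - prev.vy > minKick) count++;
    prev = cur;
  }
  return count;
}

// A double-jump shows up as exactly two impulses within the sample window.
function checkDoubleJump(samples: BodySample[]): { pass: boolean; evidence: string } {
  const impulses = countJumpImpulses(samples);
  return {
    pass: impulses === 2,
    evidence: `${impulses} upward impulse(s) in ${samples.length} samples`,
  };
}
```

<p>The <code>evidence</code> string is what a failing gate would hand back to the implementing agent, so it states what was observed, not just that the check failed.</p>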
<p><strong style="color:#f59e0b;">G4 — Feel (measured game-feel).</strong> Behaviour checks whether it worked. Feel checks whether it felt good. The agent measures apex height, airtime, liftoff latency, rise/fall asymmetry, and landing settle. Numeric metrics with thresholds. A jump that technically works but takes 400ms to lift off fails the feel gate.</p>
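<p>To make "numeric metrics with thresholds" concrete, here is a small sketch that derives three of the metrics named above from position samples and compares them against limits. The sample shape, the threshold fields, and the specific limit values used in any call are assumptions for illustration; the post does not state the real thresholds.</p>

```typescript
// Hypothetical shapes; the post names the metrics but not their data format or limits.
interface FeelSample {
  tMs: number; // time since the jump input was sent
  y: number;   // vertical position
}

interface FeelMetrics {
  apexHeight: number;
  airtimeMs: number;
  liftoffLatencyMs: number;
}

interface FeelThresholds {
  minApexHeight: number;
  maxLiftoffLatencyMs: number;
}

function measureFeel(samples: FeelSample[], groundY = 0, eps = 0.01): FeelMetrics | null {
  const airborne = samples.filter((s) => s.y > groundY + eps);
  if (airborne.length === 0) return null; // the character never left the ground
  const first = airborne[0];
  const last = airborne[airborne.length - 1];
  return {
    apexHeight: Math.max(...airborne.map((s) => s.y)) - groundY,
    airtimeMs: last.tMs - first.tMs, // resolution is limited by the sampling interval
    liftoffLatencyMs: first.tMs,
  };
}

// Returns one message per violated threshold; an empty list means the gate passes.
function feelFailures(m: FeelMetrics, t: FeelThresholds): string[] {
  const failures: string[] = [];
  if (m.apexHeight < t.minApexHeight) {
    failures.push(`apex ${m.apexHeight} < ${t.minApexHeight}`);
  }
  if (m.liftoffLatencyMs > t.maxLiftoffLatencyMs) {
    failures.push(`liftoff ${m.liftoffLatencyMs}ms > ${t.maxLiftoffLatencyMs}ms`);
  }
  return failures;
}
```

<p>With a hypothetical limit of 150ms on liftoff latency, a jump whose first airborne sample arrives at 400ms produces one failure message even though the behaviour gate would have passed it.</p>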
<p><strong style="color:#f59e0b;">G5 — Visual.</strong> Captures frame sequences from the live game and feeds them to a vision model. Checks: is the animation playing? Is the cooldown indicator visible? Are there visual artifacts?</p>
<p>Anything green falls through to the judge. Anything red loops back to implement with the failure evidence — the agent reads what went wrong, fixes it, and re-enters the gate ladder. Three retries max, then escalation to a human.</p>
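<p>The routing rule above can be sketched as a bounded loop. This is a simplified, synchronous illustration and not the Pi flow engine: the <code>Gate</code>, <code>GateResult</code>, and <code>Outcome</code> types and the <code>runLadder</code> function are hypothetical, real gates would be asynchronous sub-agent calls, and the sketch reads "three retries" as up to three loops back to implement.</p>

```typescript
// Hypothetical types; real gates would be asynchronous sub-agent calls.
type GateResult = { pass: boolean; evidence: string };

interface Gate {
  name: string;
  run: () => GateResult;
}

type Outcome = { next: "judge" } | { next: "escalate"; evidence: string };

// Runs gates in order and stops at the first failure, returning its evidence.
function firstFailure(gates: Gate[]): string | null {
  for (const gate of gates) {
    const result = gate.run();
    if (!result.pass) return `${gate.name}: ${result.evidence}`;
  }
  return null;
}

function runLadder(
  gates: Gate[],
  implement: (evidence: string) => void, // re-implements using the failure evidence
  maxRetries = 3,
): Outcome {
  let retries = 0;
  for (;;) {
    const failure = firstFailure(gates);
    if (failure === null) return { next: "judge" }; // everything green
    if (retries >= maxRetries) return { next: "escalate", evidence: failure };
    retries++;
    implement(failure); // red: loop back with the evidence, then re-enter the ladder
  }
}
```

<p>Re-entering the ladder from the first gate after every fix is a deliberate choice in this sketch: a change that repairs a feel failure can still break the build.</p>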
<h2>Composability: Gates Are Cheap to Add</h2>
<p>The flow started with three gates — build, test, vision. Behaviour and feel were added later, each as a one-file extension. Gates are not hardcoded. They're sub-agents declared in a flow config. Want a linting gate? Add a sub-agent with a linter tool. Want a security scan? Same pattern. Want a gate that checks asset bundle sizes haven't bloated? Write the tool, declare the sub-agent, wire it into the flow.</p>
<p>Agents themselves can extend the flow. If a sub-agent notices a pattern of failures — "the last three behaviour checks failed because the game window wasn't focused" — it can insert a pre-condition gate that checks window focus before proceeding. The flow engine handles routing; the agents handle decisions.</p>
<p>This is what makes flows fundamentally different from a script: the pipeline is not fixed at compile time. It's a graph that agents read, understand, and modify at runtime.</p>
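<p>To illustrate what "a graph that agents modify at runtime" might mean in practice, here is a sketch in which the gate list is plain data and adding a gate is a pure function over it. The <code>GateSpec</code> and <code>FlowConfig</code> shapes, the field names, and <code>insertGateAfter</code> are all hypothetical; the actual Pi flow config format is not shown in this post.</p>

```typescript
// Hypothetical config shapes; the real Pi flow config format is not shown in the post.
interface GateSpec {
  id: string;
  tool: string; // the single tool this gate's sub-agent is allowed to call
}

interface FlowConfig {
  gates: GateSpec[];
  maxRetries: number;
}

// The flow as it started: build, test, vision.
const initialFlow: FlowConfig = {
  gates: [
    { id: "build", tool: "verify_build" },
    { id: "tests", tool: "run_tests" },     // tool name is illustrative
    { id: "visual", tool: "game_frames" },
  ],
  maxRetries: 3,
};

// Returns a new config with `gate` placed directly after the gate named `afterId`.
// The input is left untouched so a running flow never sees a half-edited ladder.
function insertGateAfter(flow: FlowConfig, afterId: string, gate: GateSpec): FlowConfig {
  const index = flow.gates.findIndex((g) => g.id === afterId);
  if (index === -1) throw new Error(`unknown gate: ${afterId}`);
  const gates = [...flow.gates.slice(0, index + 1), gate, ...flow.gates.slice(index + 1)];
  return { ...flow, gates };
}

// Adding the behaviour gate later is a one-line change to the data.
const extendedFlow = insertGateAfter(initialFlow, "tests", { id: "behaviour", tool: "drive_game" });
```

<p>Because the ladder is data, the same function serves a human editing the config and an agent inserting a pre-condition gate mid-run.</p>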
<h2>The CI Loop: Agents That Fix Their Own Builds</h2>
<p>Gates handle pre-push verification. But what about after push? What about CI?</p>
<p>Most coding agents don't care if the code compiles on the CI runner. They write, they push, they walk away. A human discovers the broken build an hour later.</p>
<p>We closed this loop with the <code>tinqs-ci</code> extension, a set of three tools that give agents post-push autonomy.</p>
<p>The agent pushes its branch, calls <code>ci_wait</code>, and if CI fails, reads <code>ci_logs</code>, fixes the issue, pushes again, and polls again. DeepSeek V4 parses compiler errors, identifies files and lines, and fixes them. A missing import, a type mismatch, a module not found — pattern-matched and corrected in seconds.</p>
<p>A real example from last week: adding a health check endpoint to a Go service. Agent wrote the handler and test, pushed. CI failed — the test imported a package that didn't exist on the runner. Agent read <code>ci_logs</code>, saw <code>go: module not found</code>, added the missing <code>go.mod</code> replace directive, pushed again. CI passed. PR opened. <strong>4 minutes. $0.06.</strong></p>
<p>Three safeguards prevent runaway loops: <strong>retry limit</strong> (3, hard-coded in the orchestrator), <strong>diff budget</strong> (retries only touch files already in the changeset), and <strong>hallucination detection</strong> (if the agent claims CI passed without calling <code>ci_status</code>, it gets corrected).</p>
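<p>The push, wait, read, fix cycle and its safeguards can be sketched as follows. This is a synchronous simplification under stated assumptions: the <code>CiTools</code> interface, <code>ciFixLoop</code>, <code>claimIsGrounded</code>, and the outcome names are hypothetical, and the real <code>ci_wait</code>, <code>ci_logs</code>, and <code>ci_status</code> tools are asynchronous with interfaces this post does not show.</p>

```typescript
type CiStatus = "passed" | "failed";

// Hypothetical tool signatures standing in for the tinqs-ci tools.
interface CiTools {
  push: (files: string[]) => void;
  ciWait: () => CiStatus; // stands in for ci_wait
  ciLogs: () => string;   // stands in for ci_logs
}

type CiOutcome = "green" | "retry-limit" | "diff-budget";

function ciFixLoop(
  tools: CiTools,
  changeset: string[],
  proposeFix: (logs: string) => string[], // returns the files the fix touches
  maxRetries = 3,
): CiOutcome {
  const allowed = new Set(changeset);
  tools.push(changeset);
  for (let retry = 0; ; retry++) {
    if (tools.ciWait() === "passed") return "green";
    // Retry limit: stop after maxRetries fix attempts.
    if (retry >= maxRetries) return "retry-limit";
    const touched = proposeFix(tools.ciLogs());
    // Diff budget: a retry may only touch files already in the changeset.
    if (!touched.every((f) => allowed.has(f))) return "diff-budget";
    tools.push(touched);
  }
}

// Hallucination check: a "CI passed" claim counts only if ci_status was actually called.
function claimIsGrounded(claimsPass: boolean, toolCalls: string[]): boolean {
  return !claimsPass || toolCalls.some((call) => call === "ci_status");
}
```

<p>Note that <code>"green"</code> can only come from a status the tool returned, never from the agent's own summary, which is the same principle the hallucination check enforces after the fact.</p>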
<p>The oracle tools — <code>verify_build</code>, <code>drive_game</code>, <code>game_frames</code> — are the durable assets. About 300 lines of TypeScript each, MIT licensed, reusable in any Pi project. The flow engine composes them; the agents route through them.</p>
<p>A year ago we had a supervisor written in 1,050 lines of hardcoded TypeScript that did one thing: verify agent output compiled and passed tests. We deleted it. The same verification now runs as a composable flow with five gates, live-game testing, and CI integration. Sometimes the best architecture decision is knowing what to delete.</p>
<p><em>The flow-native brain runs on our <a href="https://tinqs.com/tinqs/pi">Pi fork</a> inside <a href="https://tinqs.com">Tinqs Studio</a>. The oracle extensions are MIT licensed and reusable in any Pi project.</em></p>