How Pi Agents Build, Test, and Ship Code with Oracle-Backed Flows
When we ask Pi to build a feature for Ariki (say, "add a double-jump with a cooldown indicator"), a fixed sequence runs. The agent writes the code. A build gate compiles it. A test gate runs the test suite. A behaviour gate drives the live game and checks that the character actually double-jumps. A feel gate measures apex height, airtime, and landing settle. A visual gate checks that the cooldown indicator actually shows up on screen. And if CI disagrees with any of it, the agent reads the failure log and fixes it. None of this is magic. It's Pi flows.
What Happens When You Ask Pi to Build Something
The flow starts the same way every agent task does: context, then plan, then implement. That's the standard loop. What makes it interesting is what happens after implementation — a ladder of five gates, each run by a specialised sub-agent with its own tools and its own pass/fail authority.
The Five Gates
Each gate is a sub-agent with one job and one tool.
G1 — Build. Runs dotnet build on the game and sim. Returns PASS/FAIL with file:line errors. If the code doesn't compile, nothing proceeds.
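As a rough illustration of "PASS/FAIL with file:line errors", here is a minimal sketch that turns `dotnet build` output into a structured verdict. It assumes the usual MSBuild diagnostic layout (`file(line,col): error CODE: message [project]`). The `BuildError` and `GateResult` shapes and the `parseBuildOutput` name are hypothetical; the post does not show what `verify_build` actually returns.

```typescript
// Hypothetical result shapes, not the actual verify_build output.
interface BuildError {
  file: string;
  line: number;
  column: number;
  code: string;
  message: string;
}

interface GateResult {
  status: "PASS" | "FAIL";
  errors: BuildError[];
}

// Matches MSBuild-style diagnostics such as:
//   src/Player.cs(42,17): error CS0103: The name 'x' does not exist [/repo/Game.csproj]
const MSBUILD_ERROR =
  /^\s*(.+?)\((\d+),(\d+)\):\s+error\s+([A-Z]+\d+):\s+(.*?)(?:\s+\[[^\]]*\])?\s*$/;

function parseBuildOutput(output: string, exitCode: number): GateResult {
  const seen = new Set<string>();
  const errors: BuildError[] = [];
  for (const rawLine of output.split(/\r?\n/)) {
    const m = MSBUILD_ERROR.exec(rawLine);
    if (m === null) continue; // warnings and progress lines are ignored
    const [, file, line, column, code, message] = m;
    const key = `${file}:${line}:${column}:${code}`;
    if (seen.has(key)) continue; // the build summary repeats each error
    seen.add(key);
    errors.push({ file, line: Number(line), column: Number(column), code, message });
  }
  // A non-zero exit code fails the gate even if no line could be parsed.
  const status = exitCode === 0 && errors.length === 0 ? "PASS" : "FAIL";
  return { status, errors };
}
```

Returning parsed locations rather than raw text gives the implementing agent something precise to act on when the gate loops back.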
G2 — Tests. Runs dotnet test and parses results. The agent reads which tests broke and fixes assertions, mocks, or test setup.
G3 — Behaviour (live game). This is the one that makes game dev different from web dev. The agent sends input to the running game — {"jump": true} — and samples the player body 30 times at 50ms intervals. It checks: did the character actually jump? Did the double-jump fire? Was there a cooldown? The drive_game tool is the ground-truth oracle for whether a movement feature works in-game, not just in tests.
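To make the sampling idea concrete, the sketch below decides "did the double-jump fire?" from a series of body samples. Everything here is an assumption for illustration: the `BodySample` shape, the positive-is-up velocity convention, and the rule that any sudden upward gain in vertical velocity counts as a jump impulse. The real `drive_game` payload is not shown in this post, and the cooldown check is left out.

```typescript
// Hypothetical sample shape: one reading of the player body per poll.
interface BodySample {
  tMs: number; // time since the input was sent
  vy: number;  // vertical velocity, positive = up (assumed convention)
}

interface BehaviourVerdict {
  status: "PASS" | "FAIL";
  impulses: number;
  evidence: string;
}

// Gravity only ever reduces vy, so a sudden increase that leaves the body
// moving upward must come from a jump. Landing also raises vy (negative
// back to zero), which is why the new velocity has to be strictly positive.
function checkDoubleJump(samples: BodySample[], minImpulse = 1.0): BehaviourVerdict {
  let impulses = 0;
  for (let i = 1; i < samples.length; i++) {
    const prev = samples[i - 1];
    const curr = samples[i];
    if (curr.vy - prev.vy > minImpulse && curr.vy > 0) impulses++;
  }
  const status = impulses >= 2 ? "PASS" : "FAIL";
  return {
    status,
    impulses,
    evidence: `${impulses} upward impulse(s) in ${samples.length} samples`,
  };
}
```

With 30 samples at 50 ms the check observes 1.5 seconds of play, so both jumps have to land inside that window to be counted.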
G4 — Feel (measured game-feel). Behaviour checks whether it worked. Feel checks whether it felt good. The agent measures apex height, airtime, liftoff latency, rise/fall asymmetry, and landing settle. Numeric metrics with thresholds. A jump that technically works but takes 400ms to lift off fails the feel gate.
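A sketch of how such metrics could be derived from position samples and compared against thresholds. The `FeelSample` shape, the function names, and every threshold value are assumptions for illustration, not the project's actual tuning. Landing settle is omitted because the post does not define it.

```typescript
// Hypothetical sample shape for one jump recorded from the live game.
interface FeelSample {
  tMs: number;       // timestamp of the sample
  y: number;         // vertical position, positive = up (assumed)
  grounded: boolean;
}

interface JumpMetrics {
  liftoffLatencyMs: number; // input sent -> first airborne sample
  airtimeMs: number;        // first airborne sample -> first grounded sample
  apexHeight: number;       // highest point above the take-off position
  riseMs: number;
  fallMs: number;
}

function measureJump(samples: FeelSample[], inputAtMs: number): JumpMetrics | null {
  const liftoff = samples.findIndex((s) => s.tMs >= inputAtMs && !s.grounded);
  if (liftoff < 1) return null; // never left the ground, or no grounded baseline
  const landing = samples.findIndex((s, i) => i > liftoff && s.grounded);
  if (landing === -1) return null; // still airborne when sampling stopped

  const groundY = samples[liftoff - 1].y;
  let apex = samples[liftoff];
  for (let i = liftoff + 1; i < landing; i++) {
    if (samples[i].y > apex.y) apex = samples[i];
  }
  return {
    liftoffLatencyMs: samples[liftoff].tMs - inputAtMs,
    airtimeMs: samples[landing].tMs - samples[liftoff].tMs,
    apexHeight: apex.y - groundY,
    riseMs: apex.tMs - samples[liftoff].tMs,
    fallMs: samples[landing].tMs - apex.tMs,
  };
}

// Threshold values are supplied by the caller; none are taken from the post.
interface FeelThresholds {
  maxLiftoffLatencyMs: number;
  minApexHeight: number;
  minAirtimeMs: number;
}

// Returns the list of failed checks; an empty list means the gate passes.
function checkFeel(m: JumpMetrics, t: FeelThresholds): string[] {
  const failures: string[] = [];
  if (m.liftoffLatencyMs > t.maxLiftoffLatencyMs) failures.push("liftoff_latency");
  if (m.apexHeight < t.minApexHeight) failures.push("apex_height");
  if (m.airtimeMs < t.minAirtimeMs) failures.push("airtime");
  return failures;
}
```

Timing metrics measured this way are only as fine as the sampling interval, so at 50 ms per sample the liftoff latency is quantised to 50 ms steps.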
G5 — Visual. Captures frame sequences from the live game and feeds them to a vision model. Checks: is the animation playing? Is the cooldown indicator visible? Are there visual artifacts?
A run that clears every gate falls through to the judge. Any red gate loops back to implement with the failure evidence: the agent reads what went wrong, fixes it, and re-enters the gate ladder. Three retries max, then escalation to a human.
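The retry rule can be sketched as a small loop. This is an illustration under stated assumptions, not the orchestrator's code: gates are modelled as synchronous functions (real gates would be asynchronous tool calls), the ladder stops at the first red gate, and the `Gate`, `Verdict`, and `runLadder` names are hypothetical.

```typescript
interface Verdict {
  gate: string;
  pass: boolean;
  evidence: string;
}

interface Gate {
  name: string;
  run: () => Verdict;
}

interface LadderOutcome {
  outcome: "judge" | "escalate";
  retries: number;
  lastFailure?: Verdict;
}

function runLadder(
  gates: Gate[],
  reimplement: (failure: Verdict) => void,
  maxRetries = 3,
): LadderOutcome {
  let retries = 0;
  for (;;) {
    // Walk the ladder top to bottom and stop at the first red gate.
    let failure: Verdict | undefined;
    for (const gate of gates) {
      const verdict = gate.run();
      if (!verdict.pass) {
        failure = verdict;
        break;
      }
    }
    if (failure === undefined) return { outcome: "judge", retries };
    if (retries >= maxRetries) {
      return { outcome: "escalate", retries, lastFailure: failure };
    }
    retries++;
    reimplement(failure); // hand the evidence back to the implement step
  }
}
```

Under this reading, "three retries max" means one initial pass plus up to three re-entries, so a task that is still red on the fourth pass goes to a human with the last failure attached.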
Composability: Gates Are Cheap to Add
The flow started with three gates: build, test, and visual. Behaviour and feel were added later, each as a one-file extension. Gates are not hardcoded. They're sub-agents declared in a flow config. Want a linting gate? Add a sub-agent with a linter tool. Want a security scan? Same pattern. Want a gate that checks asset bundle sizes haven't bloated? Write the tool, declare the sub-agent, wire it into the flow.
Agents themselves can extend the flow. If a sub-agent notices a pattern of failures — "the last three behaviour checks failed because the game window wasn't focused" — it can insert a pre-condition gate that checks window focus before proceeding. The flow engine handles routing; the agents handle decisions.
This is what makes flows fundamentally different from a script: the pipeline is not fixed at compile time. It's a graph that agents read, understand, and modify at runtime.
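One way to picture "gates as data" is a flow that is just a list of declarations, which code (or an agent) can extend without touching the engine. The sketch below is a simplification under loud assumptions: it models the flow as a linear list rather than a graph, the `GateSpec` shape is invented, and the tool names `run_tests`, `run_linter`, and `check_window_focus` are hypothetical (only `verify_build`, `drive_game`, and `game_frames` appear in this post).

```typescript
// Hypothetical declaration of one gate: an id plus the single tool it owns.
interface GateSpec {
  id: string;
  tool: string;
}

type Flow = readonly GateSpec[];

// The original three gates.
const initialFlow: Flow = [
  { id: "build", tool: "verify_build" },
  { id: "tests", tool: "run_tests" }, // assumed tool name
  { id: "visual", tool: "game_frames" },
];

// Adding a gate at the end returns a new flow and leaves the old one intact.
function appendGate(flow: Flow, gate: GateSpec): Flow {
  return [...flow, gate];
}

// Inserting a pre-condition gate in front of an existing one.
function insertBefore(flow: Flow, targetId: string, gate: GateSpec): Flow {
  const idx = flow.findIndex((g) => g.id === targetId);
  if (idx === -1) throw new Error(`no gate with id "${targetId}"`);
  return [...flow.slice(0, idx), gate, ...flow.slice(idx)];
}

// Behaviour arrives later as one more declaration, placed before visual.
const withBehaviour = insertBefore(initialFlow, "visual", {
  id: "behaviour",
  tool: "drive_game",
});

// An agent that keeps seeing unfocused-window failures adds a pre-condition.
const withFocusCheck = insertBefore(withBehaviour, "behaviour", {
  id: "window_focus",
  tool: "check_window_focus", // assumed tool name
});

// A linting gate is the same pattern.
const withLint = appendGate(withFocusCheck, { id: "lint", tool: "run_linter" });
```

Because each change returns a new list, the engine can keep the previous flow around for comparison or rollback, which is one reasonable choice when agents are allowed to edit the pipeline.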
The CI Loop: Agents That Fix Their Own Builds
Gates handle pre-push verification. But what about after push? What about CI?
Most coding agents don't care if the code compiles on the CI runner. They write, they push, they walk away. A human discovers the broken build an hour later.
We closed this loop with the tinqs-ci extension — three tools that give agents post-push autonomy:
- ci_status — checks pipeline state for a branch
- ci_logs — fetches the full build log from the most recent failed run
- ci_wait — polls every 15 seconds until the pipeline finishes
The agent pushes its branch, calls ci_wait, and if CI fails, reads ci_logs, fixes the issue, pushes again, and polls again. DeepSeek V4 parses compiler errors, identifies files and lines, and fixes them. A missing import, a type mismatch, a module not found — pattern-matched and corrected in seconds.
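The push, wait, fix cycle boils down to a small decision table. The sketch below is an assumption-laden illustration: the pipeline state names and the `AgentStep` shape are hypothetical, and only the 15-second polling interval comes from the text above. The retry cap is handled separately.

```typescript
// Assumed pipeline states; the real ci_status vocabulary is not shown here.
type PipelineState = "pending" | "running" | "success" | "failure";

const POLL_INTERVAL_MS = 15000; // ci_wait polls every 15 seconds

type AgentStep =
  | { kind: "poll_again"; afterMs: number }
  | { kind: "read_logs_and_fix" }
  | { kind: "open_pr" };

// Maps one observed pipeline state to what the agent does next.
function nextStep(state: PipelineState): AgentStep {
  switch (state) {
    case "pending":
    case "running":
      return { kind: "poll_again", afterMs: POLL_INTERVAL_MS };
    case "failure":
      return { kind: "read_logs_and_fix" }; // ci_logs, fix, push, poll again
    case "success":
      return { kind: "open_pr" };
  }
}

// Replays a sequence of observed states into the steps the agent would take.
function traceSteps(observed: PipelineState[]): string[] {
  return observed.map((state) => nextStep(state).kind);
}
```

For example, `traceSteps(["running", "failure", "running", "success"])` yields `poll_again`, `read_logs_and_fix`, `poll_again`, `open_pr`, which is the shape of the health-check example that follows.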
A real example from last week: adding a health check endpoint to a Go service. Agent wrote the handler and test, pushed. CI failed — the test imported a package that didn't exist on the runner. Agent read ci_logs, saw go: module not found, added the missing go.mod replace directive, pushed again. CI passed. PR opened. 4 minutes. $0.06.
Three safeguards prevent runaway loops: retry limit (3, hard-coded in the orchestrator), diff budget (retries only touch files already in the changeset), and hallucination detection (if the agent claims CI passed without calling ci_status, it gets corrected).
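The three safeguards can be read as checks on a proposed retry. A minimal sketch follows, assuming a hypothetical `RetryAttempt` record that the orchestrator could assemble before letting a retry proceed. The field names and violation labels are invented for illustration.

```typescript
// Hypothetical summary of what the agent wants to do on this retry.
interface RetryAttempt {
  retriesUsed: number;     // retries already spent on this task
  changeset: string[];     // files in the original changeset
  touchedFiles: string[];  // files this retry wants to modify
  claimsCiPassed: boolean; // the agent's message says CI is green
  toolCalls: string[];     // tool names the agent actually called this turn
}

type Violation = "retry_limit" | "diff_budget" | "unverified_ci_claim";

function checkSafeguards(attempt: RetryAttempt, maxRetries = 3): Violation[] {
  const violations: Violation[] = [];

  // 1. Retry limit: stop after maxRetries and escalate instead.
  if (attempt.retriesUsed >= maxRetries) violations.push("retry_limit");

  // 2. Diff budget: a retry may only touch files already in the changeset.
  const allowed = new Set(attempt.changeset);
  if (attempt.touchedFiles.some((file) => !allowed.has(file))) {
    violations.push("diff_budget");
  }

  // 3. Hallucination detection: a "CI passed" claim needs a ci_status call.
  if (attempt.claimsCiPassed && !attempt.toolCalls.includes("ci_status")) {
    violations.push("unverified_ci_claim");
  }

  return violations;
}
```

An empty result lets the retry through. Anything else is either fed back to the agent as a correction or, for the retry limit, turned into an escalation.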
The Numbers
Over three weeks of running the orchestrator:
- 87 tasks completed end-to-end
- 23 tasks needed at least one CI retry (26%)
- 19 of those 23 resolved on the first retry
- 4 tasks hit the retry limit and escalated to a human
- 0 tasks produced a merged PR that later broke something else
The 26% retry rate is roughly what we'd expect from a junior developer. The difference: the agent fixes it in about 30 seconds.
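For readers who want to check the percentages, the arithmetic behind the list above is short. The counts are taken from the post; the variable names are just labels, and the "clean first push" figure is derived by subtraction rather than reported directly.

```typescript
// Counts reported above.
const tasks = 87;
const retried = 23;
const fixedOnFirstRetry = 19;
const escalated = 4;

// Whole-number percentage, rounded to the nearest integer.
const pct = (part: number, whole: number): number =>
  Math.round((part / whole) * 100);

const summary = {
  retryRatePct: pct(retried, tasks),                 // 23 / 87 -> 26
  cleanFirstPushPct: pct(tasks - retried, tasks),    // 64 / 87 -> 74
  firstRetryFixPct: pct(fixedOnFirstRetry, retried), // 19 / 23 -> 83
  escalationPct: pct(escalated, tasks),              //  4 / 87 -> 5
};

// Every retried task either resolved on its first retry or escalated.
const accountedFor = fixedOnFirstRetry + escalated === retried; // true
```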
The Architecture
| Layer | What | How |
|---|---|---|
| Flow engine | pi-flows orchestrator | Composes agents, gates and decision points |
| Oracle gates | verify_build, drive_game, game_frames | Return structured PASS/FAIL with evidence |
| Sub-agents | G1 build · G2 tests · G3 behaviour · G4 feel · G5 visual | Role-split, each with its own toolset |
| CI loop | tinqs-ci extension | ci_status, ci_logs, ci_wait — polls Gitea Actions, reads logs, retries |
| Decision | Agent-loop Reflexion | Self-reflect on failures, retry (≤3) or escalate |
| Visualization | FlowDashboard | Real-time pipeline state |
The oracle tools — verify_build, drive_game, game_frames — are the durable assets. About 300 lines of TypeScript each, MIT licensed, reusable in any Pi project. The flow engine composes them; the agents route through them.
A year ago we had a supervisor written in 1,050 lines of hardcoded TypeScript that did one thing: verify agent output compiled and passed tests. We deleted it. The same verification now runs as a composable flow with five gates, live-game testing, and CI integration. Sometimes the best architecture decision is knowing what to delete.
The flow-native brain runs on our Pi fork inside Tinqs Studio.