How Pi Agents Build, Test, and Ship Code with Oracle-Backed Flows
-When we ask Pi to build a feature for Ariki — say, "add a double-jump with a cooldown indicator" — five things happen. The agent writes the code. A build gate compiles it. A test gate runs the test suite. A behaviour gate drives the live game and checks the character actually double-jumps. A feel gate measures apex height, airtime, and landing settle. And if CI disagrees with any of it, the agent reads the failure log and fixes it. None of this is magic. It's Pi flows.
+Think of a restaurant kitchen during dinner rush. The head chef doesn't cook every dish. She runs the pass — each plate gets inspected before it leaves. One cook handles sauces, another pastry, another the grill. The expediter calls orders, coordinates timing, makes sure table 4's mains don't arrive before table 2's starters. A dish comes back? It goes straight to the station that messed up, with a ticket explaining exactly what's wrong. That kitchen runs on flows. So does our game engine.
What Happens When You Ask Pi to Build Something
-The flow starts the same way every agent task does: context, then plan, then implement. That's the standard loop. What makes it interesting is what happens after implementation — a ladder of five gates, each run by a specialised sub-agent with its own tools and its own pass/fail authority.
+The kitchen = Pi (the agent harness). The recipe = a flow YAML (the DAG). The line cooks = agents (each with a station and tools). The pass = the flow engine (routes finished work). The head chef's inspection = the five gates. The order ticket = a slash command. "Send it back!" = the fix loop.
+What Happens When You Type a Slash Command
+You type /game-feature add a double-jump with cooldown and hit enter. The ticket hits the kitchen. What follows is not one agent doing everything — it's a brigade running their stations.
The Five Gates
-Each gate is a sub-agent with one job and one tool.
+The Five Gates: What the Head Chef Checks
+In a kitchen, the head chef doesn't trust — she verifies. Every plate hits the pass and gets inspected. Our flows have the same instinct. Each gate is a sub-agent with one job, one tool, and absolute veto power.
-G1 — Build. Runs dotnet build on the game and sim. Returns PASS/FAIL with file:line errors. If the code doesn't compile, nothing proceeds.
Check the base. Is the protein cooked through? If the chicken is raw, the whole plate stops here. Nothing else matters.
+G1 · Build Runs dotnet build. PASS/FAIL with file:line errors. Won't compile? Nothing proceeds.
Taste the sauce. Seasoning right? Acid balanced? The dish might look perfect but taste flat.
+G2 · Tests Runs dotnet test. Parses which assertions broke. Fixed code that passes build but fails logic gets caught here.
Does it work? Pick it up. Does the sauce hold? Does the plating survive the walk to table 6?
+G3 · Behaviour Sends {"jump":true} to the LIVE game. Samples the player body 30 times at 50ms. Did the character actually jump? Double-jump fire? This is the ground-truth oracle — what makes game dev fundamentally different from web dev.
How does it feel? The steak is cooked but chewy. The sauce is seasoned but gloopy. Edible ≠ good.
+G4 · Feel Measures apex height, airtime, liftoff latency, rise/fall asymmetry, landing settle. Numeric thresholds. A jump that works but takes 400ms to lift off fails. Behaviour says it happened. Feel says it felt good.
+How does it look? Is the garnish wilting? Sauce smeared? Does it match the menu photo?
+G5 · Visual Captures 8 frames at 100ms intervals, grids them, feeds to gemini-2.5-flash. Checks: T-pose? Foot-slide? Frozen animation? Wrong clip? Missing transitions?
G2 — Tests. Runs dotnet test and parses results. The agent reads which tests broke and fixes assertions, mocks, or test setup.
Any red gate → evidence sent back to the cook → fix → re-enter the inspection line. Three chances max, then the head chef escalates to a human. This is the same instinct that makes a good kitchen work: catch it early, send it back with a clear note, give them a chance to fix it, but don't let the same dish circle the pass forever.
+G3 — Behaviour (live game). This is the one that makes game dev different from web dev. The agent sends input to the running game — {"jump": true} — and samples the player body 30 times at 50ms intervals. It checks: did the character actually jump? Did the double-jump fire? Was there a cooldown? The drive_game tool is the ground-truth oracle for whether a movement feature works in-game, not just in tests.
Composability: Adding a New Station
+A kitchen doesn't redesign the whole line when they add a new dish. They add a station. Same in flows. Started with three gates — build, test, vision. Behaviour and feel came later, each a single-file extension. Gates aren't hardcoded. They're sub-agents declared in YAML. Want a linting gate? Add a sub-agent with a linter. Security scan? Same pattern. Asset bundle size check? Write the tool, declare the agent, wire it in.
-G4 — Feel (measured game-feel). Behaviour checks whether it worked. Feel checks whether it felt good. The agent measures apex height, airtime, liftoff latency, rise/fall asymmetry, and landing settle. Numeric metrics with thresholds. A jump that technically works but takes 400ms to lift off fails the feel gate.
+Agents can extend the flow at runtime. If the behaviour gate keeps failing because the game window isn't focused, an agent notices the pattern and inserts a pre-condition gate that checks window focus. The flow engine handles routing; the agents handle decisions. This is what makes flows fundamentally different from a script — the pipeline isn't fixed at compile time. It's a graph that agents read, understand, and modify while they run.
+G5 — Visual. Captures frame sequences from the live game and feeds them to a vision model. Checks: is the animation playing? Is the cooldown indicator visible? Are there visual artifacts?
Anything green falls through to the judge. Anything red loops back to implement with the failure evidence — the agent reads what went wrong, fixes it, and re-enters the gate ladder. Three retries max, then escalation to a human.
+The CI Loop: The Dish That Came Back After It Left
+Gates inspect plates at the pass. But what about after the plate leaves the kitchen? What about the customer who finds a hair in their soup after it's been served?
-Composability: Gates Are Cheap to Add
-The flow started with three gates — build, test, vision. Behaviour and feel were added later, each as a one-file extension. Gates are not hardcoded. They're sub-agents declared in a flow config. Want a linting gate? Add a sub-agent with a linter tool. Want a security scan? Same pattern. Want a gate that checks asset bundle sizes haven't bloated? Write the tool, declare the sub-agent, wire it into the flow.
+Most coding agents don't care. They write code, push, walk away. A human discovers the broken CI build an hour later. That's the equivalent of a cook plating a dish, sending it out, and never checking if the diner is still alive.
-Agents themselves can extend the flow. If a sub-agent notices a pattern of failures — "the last three behaviour checks failed because the game window wasn't focused" — it can insert a pre-condition gate that checks window focus before proceeding. The flow engine handles routing; the agents handle decisions.
- -This is what makes flows fundamentally different from a script: the pipeline is not fixed at compile time. It's a graph that agents read, understand, and modify at runtime.
- -The CI Loop: Agents That Fix Their Own Builds
-Gates handle pre-push verification. But what about after push? What about CI?
- -Most coding agents don't care if the code compiles on the CI runner. They write, they push, they walk away. A human discovers the broken build an hour later.
- -We closed this loop with the tinqs-ci extension — three tools that give agents post-push autonomy:
We closed this loop with three tools — the waiter who brings the plate back:
-
-- ci_status — checks pipeline state for a branch
-- ci_logs — fetches the full build log from the most recent failed run
-- ci_wait — polls every 15 seconds until the pipeline finishes
+- ci_wait — stands by the table, polls every 15 seconds until the diner finishes
+- ci_status — checks: did they enjoy it or send it back?
+- ci_logs — reads the complaint card: exactly what was wrong
The agent pushes its branch, calls ci_wait, and if CI fails, reads ci_logs, fixes the issue, pushes again, and polls again. DeepSeek V4 parses compiler errors, identifies files and lines, and fixes them. A missing import, a type mismatch, a module not found — pattern-matched and corrected in seconds.
The agent pushes, calls ci_wait. If CI fails, it reads ci_logs, fixes the exact error, pushes again. DeepSeek V4 parses compiler errors the way a cook reads a ticket: "missing import" = forgot the salt, "type mismatch" = wrong pan size, "module not found" = ingredient not in stock. Pattern-matched and fixed in seconds.
A real example from last week: adding a health check endpoint to a Go service. Agent wrote the handler and test, pushed. CI failed — the test imported a package that didn't exist on the runner. Agent read ci_logs, saw go: module not found, added the missing go.mod replace directive, pushed again. CI passed. PR opened. 4 minutes. $0.06.
Adding a health check endpoint to a Go service. Agent wrote the handler and test, pushed. CI failed — the test imported a package that didn't exist on the runner. Agent read ci_logs, saw go: module not found, added the missing go.mod replace directive, pushed again. CI passed. PR opened. 4 minutes. $0.06.
Three safeguards prevent runaway loops: retry limit (3, hard-coded in the orchestrator), diff budget (retries only touch files already in the changeset), and hallucination detection (if the agent claims CI passed without calling ci_status, it gets corrected).
Three safeguards prevent the kitchen grinding to a halt: retry limit (3, same dish doesn't circle forever), diff budget (retries only touch files already on the ticket), and hallucination detection (if the cook claims the customer loved it without actually asking the waiter, they get corrected).
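The post-push loop and its three safeguards can be sketched roughly as follows. This is a minimal TypeScript sketch, not the tinqs-ci extension itself: the `CiTools` interface, `ciFixLoop`, `withinDiffBudget`, and `claimIsGrounded` are hypothetical names, and the synchronous callbacks stand in for the real tools, which presumably run asynchronously (ci_wait polls every 15 seconds).

```typescript
// Hypothetical stand-ins for the ci_wait / ci_logs tools; the real signatures are not shown in the article.
type CiState = "passed" | "failed";
interface CiTools {
  push: () => void;      // push the branch
  wait: () => CiState;   // stand-in for ci_wait: block until the pipeline finishes
  logs: () => string;    // stand-in for ci_logs: log of the most recent failed run
}
interface Fix {
  files: string[];       // files the proposed fix would touch
  apply: () => void;
}

// Diff budget: a retry may only touch files already in the changeset.
function withinDiffBudget(changeset: string[], touched: string[]): boolean {
  return touched.every((f) => changeset.indexOf(f) !== -1);
}

// Hallucination check: a "CI passed" claim only counts if ci_status was actually called.
function claimIsGrounded(toolCallLog: string[]): boolean {
  return toolCallLog.indexOf("ci_status") !== -1;
}

// Push, wait, and on failure read the log, fix, and push again, up to maxRetries times.
function ciFixLoop(
  ci: CiTools,
  proposeFix: (log: string) => Fix,
  changeset: string[],
  maxRetries: number
): "passed" | "escalate" {
  ci.push();
  for (let attempt = 0; ; attempt++) {
    if (ci.wait() === "passed") return "passed";
    if (attempt >= maxRetries) return "escalate";                  // retry limit reached
    const fix = proposeFix(ci.logs());
    if (!withinDiffBudget(changeset, fix.files)) return "escalate"; // fix strays outside the changeset
    fix.apply();
    ci.push();
  }
}
```

With `maxRetries` set to 3, a pipeline that never goes green is pushed four times in total (the initial push plus three retries) before the loop escalates.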
The Numbers
Over three weeks of running the orchestrator:
@@ -316,6 +448,196 @@+
A Real Flow in Action: Fixing 19 Tests After a Crash
+This morning, a machine crash cut off a flow mid-stream. Nineteen tests were left red — contracts written, implementation half-done. The task: finish the interrupted jump & locomotion animation work and make them all green.
+ +I typed one slash command into Pi:
+ +/game-feature Finish the leftover jump & locomotion animation work — make the 19 FAILING tests GREEN. They are existing RED contracts written by an earlier animation flow that a machine crash cut off mid-stream; the contracts are already written, so IMPLEMENT to satisfy them (do not rewrite the contracts).
+
+What happened next was fully autonomous. Here's the flow, verbatim — this is the exact YAML that runs in production:
+
+name: game-feature
+description: Build a PLAYABLE game feature and prove it in the LIVE game.
+task_required: true
+
+steps:
+  # G0: Pre-flight — validate vision CAN run before any build work
+  - id: preflight
+    agent: vision-preflight
+    task: Check GEMINI_API_KEY is set AND game_frames reaches a live instance.
+      If EITHER fails, STOP — vision is not optional.
+
+  # Context + plan
+  - id: context
+    agent: project-context-reader
+    blockedBy: [preflight]
+
+  - id: plan
+    agent: feature-planner
+    blockedBy: [context]
+
+  # TDD: write tests FIRST (different agent than implementer)
+  - id: test-author
+    agent: test-author
+    blockedBy: [plan]
+
+  - id: implement
+    agent: game-builder
+    blockedBy: [test-author]
+
+  # G1–G5: Oracle gates (build, tests, behaviour, feel, visual)
+  - id: build
+    agent: build-verifier
+  - id: tests
+    agent: test-runner
+  - id: behavior
+    agent: behavioral-prober  # drives LIVE game via drive_game
+  - id: feel
+    agent: feel-judge  # apex, airtime, latency, rise/fall
+  - id: visual
+    agent: animation-vision-judge  # multimodal gemini-2.5-flash
+
+  # Self-recurring fix-loop: bounded loop back to implement with evidence
+  - id: fix-loop
+    type: agent-loop-decision
+    agent: flow-decision
+    loop_target: implement
+    exit_target: report
+    max_iterations: 3
+
+  # Final judge: one honest verdict
+  - id: report
+    agent: game-judge
+
+Eighteen steps, seven custom agents, five oracle gates, and one judge. The whole thing runs as a slash command.
+ +Here's what actually happened. The vision-preflight agent fired first — checked that GEMINI_API_KEY was set and that game_frames could reach the live game instance. Both passed in under a second. Without this gate, the rest of the flow would be meaningless — we'd do all the build work only to discover the vision judge can't run. So we check first.
The project-context-reader ingested PlayerController.cs, PlayerAnimController.cs, PlayerAnimationLogic.cs, the test files, and the manifest. The feature-planner decomposed the 19 failures into four fix groups: (1) vegetation manifest — 146 items with broken prefabPath, (2) animation controller — crouch parameter not plumbed through, (3) jump physics — coyote time, variable height, air control all unimplemented, (4) animation tree — state machine missing entirely.
Then the game-builder agent went to work. It read the test failure messages, traced each one to the source, and started implementing. Coyote time: a 100ms grace period after IsOnFloor() becomes false. Variable jump height: scale velocity by key hold duration, 3.5 at tap, 6.5 at 300ms hold. Air control: reduce horizontal velocity by 40% when airborne. Jump phases: minimum 0.15s duration on jump_start before transitioning to airborne. Landing timer: wait full jump_land length + one frame, not length - blend. Animation tree: state machine with jump_start → jump → jump_land states, 0.1s blend transitions.
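Two of the rules in that list, variable jump height and coyote time, can be sketched in a few lines. This is an illustrative TypeScript sketch only: the game code itself is C#, the function and constant names here are hypothetical, and the linear interpolation between the tap and full-hold velocities is an assumption (the text gives only the two endpoints).

```typescript
// Hypothetical sketch; values come from the paragraph above, names are illustrative.
const COYOTE_MS = 100;      // grace period after leaving the floor
const TAP_VELOCITY = 3.5;   // jump velocity for a tap
const HELD_VELOCITY = 6.5;  // jump velocity at a full hold
const FULL_HOLD_MS = 300;   // hold duration that reaches HELD_VELOCITY

// Assumption: velocity scales linearly with hold time and is clamped at 300 ms.
function jumpVelocity(holdMs: number): number {
  const t = Math.min(Math.max(holdMs / FULL_HOLD_MS, 0), 1);
  return TAP_VELOCITY + (HELD_VELOCITY - TAP_VELOCITY) * t;
}

// A jump is allowed on the floor, or within the coyote window after leaving it.
function canJump(onFloor: boolean, msSinceLeftFloor: number): boolean {
  return onFloor || msSinceLeftFloor <= COYOTE_MS;
}
```

Under these assumptions a 150 ms hold lands halfway between the two endpoints, and a jump pressed 80 ms after walking off a ledge still fires.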
The build-verifier compiled it. Test-runner ran the suite. Behavioral-prober sent {"jump": true} to the live game and sampled the player body 30 times. Feel-judge measured apex height, airtime, and liftoff latency against thresholds. Animation-vision-judge grabbed 8 frames at 100ms intervals, composed them into a grid, and had gemini-2.5-flash check for T-poses, foot-slide, frozen frames, and missing transitions.
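The step from sampled body positions to feel metrics can be sketched as follows. This is a minimal TypeScript sketch, not the feel-judge's actual code: the `BodySample` shape, the function names, the Y-up convention, and the latency threshold parameter are all assumptions for illustration. The article states only that the body is sampled 30 times at 50 ms and that apex height, airtime, and liftoff latency are checked against thresholds.

```typescript
// Hypothetical sample shape; the real drive_game output format is not shown in the article.
interface BodySample {
  tMs: number;       // time since the jump input was sent
  y: number;         // vertical position (assumption: Y points up)
  onFloor: boolean;
}

interface FeelMetrics {
  apexHeight: number;
  airtimeMs: number;
  liftoffLatencyMs: number;
}

// Derive feel metrics from one jump; returns null if no complete jump was observed.
function measureFeel(samples: BodySample[]): FeelMetrics | null {
  if (samples.length === 0) return null;
  const groundY = samples[0].y;
  let liftoff = -1;
  let landing = -1;
  let apex = 0;
  for (let i = 0; i < samples.length; i++) {
    const s = samples[i];
    if (liftoff < 0) {
      if (s.onFloor) continue;  // still waiting for liftoff
      liftoff = i;
    }
    if (s.onFloor) {            // first floor contact after liftoff
      landing = i;
      break;
    }
    apex = Math.max(apex, s.y - groundY);
  }
  if (liftoff < 0 || landing < 0) return null;
  return {
    apexHeight: apex,
    airtimeMs: samples[landing].tMs - samples[liftoff].tMs,
    liftoffLatencyMs: samples[liftoff].tMs - samples[0].tMs,
  };
}

// Threshold check; the limit is passed in because the real threshold values are not given.
function passesLatency(m: FeelMetrics, maxLiftoffLatencyMs: number): boolean {
  return m.liftoffLatencyMs <= maxLiftoffLatencyMs;
}
```

At a 50 ms sampling interval every time-based metric has a resolution of one sample, so thresholds tighter than 50 ms would need denser sampling.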
Any red gate → evidence fed back to the game-builder → fix → re-enter the gate ladder. Bounded to 3 retries per the max_iterations in the loop decision. Any green gate → falls through to the next. All green → the game-judge produces the final honest verdict.
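The gate ladder with its bounded fix loop can be sketched like this. It is a simplified TypeScript model, not the flow engine: `Gate`, `runGateLadder`, and the `fix` callback are hypothetical names, and stopping at the first red gate (rather than collecting every failure) is an assumption. Only the loop back to implement and the limit of 3 come from the flow YAML (`loop_target`, `max_iterations`).

```typescript
type GateResult = { pass: boolean; evidence: string };
type Gate = { id: string; run: () => GateResult };
type Verdict = { status: "green" | "escalate"; fixAttempts: number; lastEvidence: string };

// Run gates in order; on the first red gate, hand the evidence to `fix` and start over.
// After maxIterations fix attempts, stop and escalate to a human.
function runGateLadder(
  gates: Gate[],
  fix: (gateId: string, evidence: string) => void,
  maxIterations: number
): Verdict {
  let fixAttempts = 0;
  while (true) {
    let failed: { id: string; evidence: string } | null = null;
    for (const gate of gates) {
      const result = gate.run();
      if (!result.pass) {
        failed = { id: gate.id, evidence: result.evidence };
        break;
      }
    }
    if (failed === null) {
      return { status: "green", fixAttempts: fixAttempts, lastEvidence: "" };
    }
    if (fixAttempts >= maxIterations) {
      return { status: "escalate", fixAttempts: fixAttempts, lastEvidence: failed.evidence };
    }
    fix(failed.id, failed.evidence);  // stand-in for re-running the implement step
    fixAttempts++;
  }
}
```

Re-entering the ladder from the top after each fix means a change that repairs the feel gate but breaks the build is caught on the next pass.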
This isn't a demo. It's running right now, as I write this, in a Pi session on my machine. The flow is a file at .pi/flows/flows/game-feature.yaml. I trigger it with a slash command. It dispatches sub-agents, runs them through oracle gates, loops on failures, and reports a verdict. That's it.
The Flow-as-Command Pattern
+Every flow registers as a slash command. .pi/flows/flows/game-feature.yaml becomes /game-feature. Type it in Pi, describe what you want, hit enter. The flow architect dispatches the DAG, the dashboard shows agent cards with live status, and you watch it happen — or walk away and check the result later.
This is the pattern that makes flows different from scripts. Flows are not hardcoded pipelines you invoke from the terminal. They're slash commands you type in conversation. You describe what you want in natural language, the flow wires it through the agents, and the agents route through the gates. The YAML is the skeleton; the conversation is the context.
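The mapping from flow file to slash command can be sketched as two small functions. This is a hypothetical TypeScript sketch of the naming convention described above, not Pi's registration code: `flowCommand` and `parseInvocation` are illustrative names, and namespaced commands such as /flows:new are not covered.

```typescript
// Derive the slash command from a flow file path, e.g. game-feature.yaml -> /game-feature.
function flowCommand(flowPath: string): string | null {
  const match = /([^\/\\]+)\.ya?ml$/.exec(flowPath);
  return match ? "/" + match[1] : null;
}

// Split a typed line into the command and the natural-language task that follows it.
function parseInvocation(line: string): { command: string; task: string } | null {
  const match = /^(\/\S+)\s*([\s\S]*)$/.exec(line.trim());
  return match ? { command: match[1], task: match[2] } : null;
}
```

Everything after the command name is passed through untouched, which is what lets the task stay free-form text.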
+ +A few flows I use daily:
+
+- /game-feature — "add wall-running" or "fix the 19 red tests from the crash" → research, plan, implement, five gates, judge
+- /review — "review the last PR" → research → review with code-quality agent
+- /flows:new — "I need a flow that..." → the Flow Architect reads the agent catalog, selects agents, designs a DAG, and writes the YAML
The slash command is the interface. The flow is the implementation. The oracle gates are the safety net.
+ +How Agents Communicate (It's Not Chat)
+A common question: are the agents constantly talking to each other? The answer is no — and that's deliberate. Agents don't chat. They pass structured results through the flow engine bus.
+ +Each agent runs in an isolated session with scoped tools and file access. When agent A finishes, it calls finish({ summary: "...", artifacts: "...", files: "..." }). The flow engine records the result. Agent B receives exactly what it needs via template variables — ${{result.A.summary}}, ${{result.A.artifacts}}, ${{result.A.files}} — wired through the inputs: block in the flow YAML.
This is not agent-to-agent chatter. It's a publish/subscribe bus where the flow engine is the broker. Agents never directly invoke each other. They never read each other's raw output unless the flow explicitly wires it. The DAG's blockedBy edges define who waits for whom; the inputs: block defines what data flows across the edge.
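The wiring of results into a downstream agent's prompt can be sketched as a template substitution. This is a minimal TypeScript sketch under stated assumptions: `ResultStore` and `renderInputs` are hypothetical names, all result fields are treated as strings, and failing loudly on an unresolved variable is a design choice of this sketch rather than documented engine behaviour. Only the `${{result.<step>.<field>}}` syntax comes from the text.

```typescript
type StepResult = { [field: string]: string };
type ResultStore = { [stepId: string]: StepResult };

// Replace each ${{result.<step>.<field>}} with the value recorded when that step finished.
function renderInputs(template: string, results: ResultStore): string {
  return template.replace(
    /\$\{\{\s*result\.([\w-]+)\.(\w+)\s*\}\}/g,
    (whole: string, stepId: string, field: string): string => {
      const step = results[stepId];
      if (step === undefined || step[field] === undefined) {
        // An agent should never receive a prompt with a silently missing input.
        throw new Error("unresolved template variable: " + whole);
      }
      return step[field];
    }
  );
}
```

Because an agent sees only what the template names, anything a flow does not wire explicitly stays invisible to it.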
Why not let agents talk freely? Because unstructured chatter is the fastest path to hallucination cascades. Agent A confidently states something wrong, agent B builds on it, agent C compounds it. By the time a human notices, you have three agents collectively wrong about a file that doesn't exist. Structured result-passing with typed outputs (verdict: pass, findings: ["missing import", "type mismatch"]) keeps each agent's output machine-readable and verifiable by the gates.
Pi itself is designed for solo interactive work — you ask, it does, you review. The orchestration layer I wrote on top inverts that pattern. Pi becomes the agent harness; the flow engine becomes the conductor. Agents don't talk to each other. They talk to the engine. The engine talks to the gates. The gates talk to the live game. That's the architecture.
+ +The Setup: Extensions, Agents, and 15–20 Flows
+"How did you set this up?" is the question I get most often. Here's the honest answer: there's no dashboard with drag-and-drop. You write three kinds of files.
+ +Extensions are TypeScript tools that agents call. Each is about 300 lines, MIT licensed:

| Extension | What agents call it for |
|---|---|
| verify_build | Compile the game + sim, return file:line errors |
| drive_game | Send input to the live game, sample player body |
| game_frames | Capture screenshot sequences for vision judging |
| ci_status | Check Gitea Actions pipeline state for a branch |
| ci_logs | Fetch full build log from the most recent failed run |
| ci_wait | Poll every 15 seconds until the pipeline finishes |
| gen_image | Generate brand/marketing images via fal.ai flux-2-pro |
| agent_catalog | List available agents with their tools, inputs, outputs |

Agents are Markdown files with YAML frontmatter. Each declares its role, model tier, tools, inputs, and outputs:
+ +---
+name: game-builder
+description: Implements game features in C# (Godot)
+model: @coding
+tools: read, write, edit, bash, verify_build, drive_game
+inputs: [context, plan, build_fail, behaviour_fail, feel_fail, visual_fail]
+outputs: [summary, files]
+---
+You are a game developer. Task: ${{task}}
+Context: ${{input.context}}
+
+Flows are YAML DAGs that wire agents together. I have about 15–20 flows running across different domains:
+
+- Game dev: /game-feature, /review, /bug-hunt, /refactor
+- Design: /concept-art, /sound-design (plans → ElevenLabs generation → judge evaluates with other models)
+- Marketing: /brand-image, /trailer-clip (Sora 2 video generation → vision judge)
+- Infra: /ci-fix, /deploy-check, /tstudio-jobs (action runners on AWS Lambda, workspace management)
+- Meta: A flow that periodically reads and improves the other flows — yes, flows that edit flows
The setup is not a product you install. It's a stack: Pi as the agent harness, custom extensions as the tool layer, markdown agents as the role layer, YAML flows as the orchestration layer. The whole thing lives in .pi/flows/. Version-controlled. CI-tested. Slash-command invoked.
Structure vs. Freestyle: The Skeleton and the Muscle
+"Do you define the process with these trees, or do the agents freestyle a bit?" Both — and knowing which is which is the whole game.
+ +The skeleton is rigid. The flow YAML defines exactly which agents run, in what order, with what dependencies (blockedBy), what inputs they receive, and which gates they must pass. The DAG is not negotiable. An agent cannot decide to skip the build gate because it feels confident. The build gate runs. Period.
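How blockedBy edges fix the order can be sketched as a wave-by-wave topological sort. This is a minimal TypeScript sketch, not the flow engine's scheduler: `Step` mirrors only the `id` and `blockedBy` fields from the YAML, and `dispatchOrder` is a hypothetical name.

```typescript
interface Step {
  id: string;
  blockedBy?: string[];
}

// Return step ids in an order where every step comes after everything it is blockedBy.
function dispatchOrder(steps: Step[]): string[] {
  const done: string[] = [];
  let pending = steps.slice();
  while (pending.length > 0) {
    // A step is ready once all of its dependencies have finished.
    const ready = pending.filter((s) =>
      (s.blockedBy || []).every((dep) => done.indexOf(dep) !== -1)
    );
    if (ready.length === 0) {
      const stuck = pending.map((s) => s.id).join(", ");
      throw new Error("cycle or unknown dependency among: " + stuck);
    }
    for (const s of ready) done.push(s.id);
    pending = pending.filter((s) => done.indexOf(s.id) === -1);
  }
  return done;
}
```

The order falls out of the edges alone, so the position of a step in the file does not matter and an agent has no way to jump ahead of a step it is blocked by.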
The muscle is autonomous. Inside its step, an agent has full agency. The game-builder decides which files to read, which approach to take, which code to write. It discovers project structure with grep and find. It runs the test suite to understand failures. It writes the fix and verifies it compiles. No human tells it "edit line 247 of PlayerController.cs." The agent figures that out.
Think of it like a company: the org chart (DAG) defines reporting lines and handoff points. The people (agents) do the actual work their own way. The compliance department (gates) checks everything before it ships. The CEO (judge) signs off.
+ +This balance is why the system works at all. Too much structure → agents can't adapt to unexpected situations. Too much freestyle → agents hallucinate, skip checks, ship broken code. The skeleton guarantees the right things happen in the right order. The muscle handles the messy reality of actual code.
+ +And when a flow's skeleton is wrong? The meta-flow improves it. It reads flow performance data, identifies bottlenecks ("the feel gate keeps failing because the game-builder doesn't know the jump velocity threshold"), edits the YAML to wire that threshold into the builder's inputs, and commits the change. Flows that improve flows. That's the endgame.
+ +Model Strategy: DeepSeek for Code, Gemini for Vision
+"Which DeepSeek model?" The short answer: DeepSeek V4 for coding-heavy agents, DeepSeek V4 Flash for fast routing decisions. The long answer: model selection is not one-size-fits-all.
+ +Flows use role-based model tiers — each agent declares a tier (@coding, @planning, @research, @fast, @compact, @vision), and the engine resolves it to a concrete model at dispatch time. This means you can swap models globally without touching any agent or flow file.

| Tier | Model | Used for |
|---|---|---|
| @coding | deepseek/deepseek-v4 | Reading, writing, editing code — the game-builder, fixer, test-author |
| @planning | deepseek/deepseek-v4 | Flow architect, feature planner — decomposing tasks, designing DAGs |
| @fast | deepseek/deepseek-v4-flash | Routing decisions — gate pass/fail, fork choices, loop exit checks |
| @research | deepseek/deepseek-v4 | Codebase investigation, reading project docs, pattern analysis |
| @vision | google/gemini-2.5-flash | Multimodal frame judging — T-pose detection, animation clip verification |
| @compact | deepseek/deepseek-v4-flash | Summarisation, report generation, lightweight post-processing |

Why DeepSeek? Two reasons. First, it's free — the coding tier runs on DeepSeek's API with no usage limits, which matters when your game-builder agent is reading 800-line files and writing 200-line diffs ten times a session. Second, it's genuinely good at C# and Godot — I've had it write a full lighting module for our Godot fork by reading Unity API docs and adapting patterns. No agent had pulled that off before.
+ +Vision is the exception. DeepSeek can't do multimodal, so the visual gate uses Gemini 2.5 Flash. It's fast (under 2 seconds per frame grid), cheap, and catches the things that matter: T-poses, foot-slide, frozen animations, missing transitions. The vision preflight gate checks the Gemini API key is set before any build work starts — if it's missing, the entire flow hard-stops. Vision is never silently skipped.
+ +The key insight: different work needs different brains. Code writing needs a model that understands language semantics and type systems. Vision judging needs a model that sees pixels and understands motion. Routing decisions need a model that's fast and decisive, not one that overthinks. The role-tier system means you configure this once, at the model level, and every agent that declares model: @coding gets the right brain automatically.
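Tier resolution at dispatch time amounts to a lookup, sketched here in TypeScript. The tier-to-model pairs are taken from the table above; `TIER_MODELS`, `resolveModel`, and the pass-through for concrete model ids are assumptions for illustration, not the engine's actual configuration format.

```typescript
// Tier-to-model map, mirroring the table above. Swapping a model means editing one entry here.
const TIER_MODELS: { [tier: string]: string } = {
  "@coding": "deepseek/deepseek-v4",
  "@planning": "deepseek/deepseek-v4",
  "@fast": "deepseek/deepseek-v4-flash",
  "@research": "deepseek/deepseek-v4",
  "@vision": "google/gemini-2.5-flash",
  "@compact": "deepseek/deepseek-v4-flash",
};

// Resolve an agent's declared model to a concrete model id.
function resolveModel(declared: string, tiers: { [tier: string]: string }): string {
  // Assumption: a value that does not start with "@" is already a concrete model id.
  if (declared.charAt(0) !== "@") return declared;
  const model = tiers[declared];
  if (model === undefined) throw new Error("unknown model tier: " + declared);
  return model;
}
```

Because agents reference only the tier name, pointing `@coding` at a different model changes every coding agent without touching an agent or flow file.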
+
The oracle tools — verify_build, drive_game, game_frames — are the durable assets. About 300 lines of TypeScript each, MIT licensed, reusable in any Pi project. The flow engine composes them; the agents route through them.
A year ago we had a supervisor written in 1,050 lines of hardcoded TypeScript that did one thing: verify agent output compiled and passed tests. We deleted it. The same verification now runs as a composable flow with five gates, live-game testing, and CI integration. Sometimes the best architecture decision is knowing what to delete.