4 June 2026

How Pi Agents Build, Test, and Ship Code with Oracle-Backed Flows

Think of a restaurant kitchen during dinner rush. The head chef doesn't cook every dish. She runs the pass — each plate gets inspected before it leaves. One cook handles sauces, another pastry, another the grill. The expediter calls orders, coordinates timing, makes sure table 4's mains don't arrive before table 2's starters. A dish comes back? It goes straight to the station that messed up, with a ticket explaining exactly what's wrong. That kitchen runs on flows. So does our game engine.

The Kitchen ↔ Flows Analogy

The kitchen = Pi (the agent harness)
The recipe = a flow YAML (the DAG)
The line cooks = agents (each with a station and tools)
The pass = the flow engine (routes finished work)
The head chef's inspection = the five gates
The order ticket = a slash command
"Send it back!" = the fix loop

What Happens When You Type a Slash Command

You type /game-feature add a double-jump with cooldown and hit enter. The ticket hits the kitchen. What follows is not one agent doing everything — it's a brigade running their stations.

[Figure: the flow pipeline. Context → Plan → Implement → verify-heavy gates (G1 Build, G2 Tests, G3 Behaviour, G4 Feel, G5 Visual) → Judge (honest verdict). Most compute is spent checking, not writing. All green ⇒ done; any fail ⇒ report. Reflexion: fix and retry ≤ 3.]
A real failure loops back to implement with gate evidence (bounded to three tries); anything green falls through to the judge.

The Five Gates: What the Head Chef Checks

In a kitchen, the head chef doesn't trust — she verifies. Every plate hits the pass and gets inspected. Our flows have the same instinct. Each gate is a sub-agent with one job, one tool, and absolute veto power.

In the Kitchen

Check the base. Is the protein cooked through? If the chicken is raw, the whole plate stops here. Nothing else matters.

In the Flow

G1 · Build Runs dotnet build. PASS/FAIL with file:line errors. Won't compile? Nothing proceeds.

In the Kitchen

Taste the sauce. Seasoning right? Acid balanced? The dish might look perfect but taste flat.

In the Flow

G2 · Tests Runs dotnet test. Parses which assertions broke. Code that compiles but gets the logic wrong is caught here.

In the Kitchen

Does it work? Pick it up. Does the sauce hold? Does the plating survive the walk to table 6?

In the Flow

G3 · Behaviour Sends {"jump":true} to the LIVE game. Samples the player body 30 times at 50ms intervals. Did the character actually jump? Did the double-jump fire? This is the ground-truth oracle, and it is what makes game dev fundamentally different from web dev.
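One way to picture the behaviour check is as a count of liftoffs in the sampled heights: one for a jump, two for a double-jump. The sketch below is an illustration under assumptions, not the real prober. The `BodySample` shape, the `countLiftoffs` and `behaviourVerdict` names, and the 0.05 noise floor are all hypothetical, since the actual drive_game payload isn't shown here.

```typescript
// Hypothetical sample shape; the real drive_game payload is not shown in this post.
interface BodySample {
  tMs: number; // milliseconds since the input was sent
  y: number;   // vertical position of the player body
}

// Counts transitions from "not rising" to "rising" between consecutive samples.
// A plain jump yields 1, a jump followed by a mid-air jump yields 2.
function countLiftoffs(samples: BodySample[], minRise = 0.05): number {
  let liftoffs = 0;
  let rising = false;
  let prev: BodySample | undefined;
  for (const s of samples) {
    if (prev !== undefined) {
      const nowRising = s.y - prev.y > minRise; // ignore jitter below the noise floor
      if (nowRising && !rising) liftoffs++;
      rising = nowRising;
    }
    prev = s;
  }
  return liftoffs;
}

// Structured PASS/FAIL with evidence, in the spirit of the oracle gates.
function behaviourVerdict(samples: BodySample[], expectedLiftoffs: number) {
  const seen = countLiftoffs(samples);
  return {
    verdict: seen === expectedLiftoffs ? "PASS" : "FAIL",
    evidence: `expected ${expectedLiftoffs} liftoff(s), observed ${seen} in ${samples.length} samples`,
  };
}
```

Thirty samples at 50ms cover a 1.5 second window. The evidence string is what would travel back to the builder if the gate went red.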

In the Kitchen

How does it feel? The steak is cooked but chewy. The sauce is seasoned but gloopy. Edible ≠ good.

In the Flow

G4 · Feel Measures apex height, airtime, liftoff latency, rise/fall asymmetry, landing settle. Numeric thresholds. A jump that works but takes 400ms to lift off fails. Behaviour says it happened. Feel says it felt good.
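To make "numeric thresholds" concrete, here is a minimal sketch of three of those metrics computed from a sampled trajectory. The `FeelSample` shape, the function names, and the threshold values passed in are assumptions for illustration; the real feel-judge thresholds aren't given in this post.

```typescript
// Hypothetical shapes; the real feel-judge data model is not shown in this post.
interface FeelSample { tMs: number; y: number; onFloor: boolean }
interface FeelMetrics { apexHeight: number; airtimeMs: number; liftoffLatencyMs: number }
interface FeelThresholds { minApexHeight: number; minAirtimeMs: number; maxLiftoffLatencyMs: number }

// Measures the first jump in the trace. tMs is assumed to be time since the input was sent,
// so the timestamp of the first airborne sample is the liftoff latency.
function measureFeel(samples: FeelSample[]): FeelMetrics | null {
  let groundY: number | null = null;
  let liftoffMs: number | null = null;
  let landMs: number | null = null;
  let maxY = -Infinity;
  let lastMs = 0;
  for (const s of samples) {
    if (groundY === null) groundY = s.y; // first sample defines ground level
    lastMs = s.tMs;
    if (liftoffMs === null) {
      if (!s.onFloor) { liftoffMs = s.tMs; maxY = s.y; }
      continue;
    }
    if (landMs === null) {
      if (s.onFloor) landMs = s.tMs;
      else maxY = Math.max(maxY, s.y);
    }
  }
  if (groundY === null || liftoffMs === null) return null; // never left the floor
  return {
    apexHeight: maxY - groundY,
    airtimeMs: (landMs ?? lastMs) - liftoffMs, // if no landing was seen, measure to the last sample
    liftoffLatencyMs: liftoffMs,
  };
}

// Returns the list of failed checks; an empty list means the gate is green.
function judgeFeel(m: FeelMetrics, t: FeelThresholds): string[] {
  const failures: string[] = [];
  if (m.apexHeight < t.minApexHeight) failures.push(`apex ${m.apexHeight} < ${t.minApexHeight}`);
  if (m.airtimeMs < t.minAirtimeMs) failures.push(`airtime ${m.airtimeMs}ms < ${t.minAirtimeMs}ms`);
  if (m.liftoffLatencyMs > t.maxLiftoffLatencyMs) {
    failures.push(`liftoff latency ${m.liftoffLatencyMs}ms > ${t.maxLiftoffLatencyMs}ms`);
  }
  return failures;
}
```

Rise/fall asymmetry and landing settle would follow the same pattern: derive a number from the samples, compare it to a threshold, and report the comparison as evidence.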

In the Kitchen

How does it look? Is the garnish wilting? Sauce smeared? Does it match the menu photo?

In the Flow

G5 · Visual Captures 8 frames at 100ms intervals, grids them, feeds to gemini-2.5-flash. Checks: T-pose? Foot-slide? Frozen animation? Wrong clip? Missing transitions?

The Loop

Any red gate → evidence sent back to the cook → fix → re-enter the inspection line. Three chances max, then the head chef escalates to a human. This is the same instinct that makes a good kitchen work: catch it early, send it back with a clear note, give them a chance to fix it, but don't let the same dish circle the pass forever.
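The loop itself is small enough to sketch. This is a minimal illustration, assuming gates run in order and stop at the first red one (as the build gate description suggests). `GateCheck`, `runWithFixLoop`, and the result shapes are hypothetical names, not the flow engine's API.

```typescript
// Hypothetical shapes; the flow engine's real types are not shown in this post.
interface GateResult { gate: string; pass: boolean; evidence: string }
type GateCheck = () => GateResult;
interface LoopOutcome { status: "green" | "escalate"; fixesUsed: number; lastFailure?: GateResult }

// Runs gates in order and returns the first red one, or null if all are green.
function firstFailure(gates: GateCheck[]): GateResult | null {
  for (const gate of gates) {
    const result = gate();
    if (!result.pass) return result;
  }
  return null;
}

// Sends gate evidence back to the fixer, re-runs the ladder, and gives up after maxFixes.
function runWithFixLoop(
  gates: GateCheck[],
  fix: (failure: GateResult) => void,
  maxFixes = 3,
): LoopOutcome {
  let fixesUsed = 0;
  while (true) {
    const failure = firstFailure(gates);
    if (failure === null) return { status: "green", fixesUsed };
    if (fixesUsed >= maxFixes) return { status: "escalate", fixesUsed, lastFailure: failure };
    fix(failure); // the cook gets the ticket with the evidence attached
    fixesUsed++;
  }
}
```

The bound lives in one place, so "don't let the same dish circle the pass forever" is a property of the loop, not something each agent has to remember.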

Composability: Adding a New Station

A kitchen doesn't redesign the whole line when it adds a new dish. It adds a station. Same in flows. We started with three gates: build, test, vision. Behaviour and feel came later, each a single-file extension. Gates aren't hardcoded. They're sub-agents declared in YAML. Want a linting gate? Add a sub-agent with a linter. Security scan? Same pattern. Asset bundle size check? Write the tool, declare the agent, wire it in.

Self-Improving Kitchen

Agents can extend the flow at runtime. If the behaviour gate keeps failing because the game window isn't focused, an agent notices the pattern and inserts a pre-condition gate that checks window focus. The flow engine handles routing; the agents handle decisions. This is what makes flows fundamentally different from a script — the pipeline isn't fixed at compile time. It's a graph that agents read, understand, and modify while they run.


The CI Loop: The Dish That Came Back After It Left

Gates inspect plates at the pass. But what about after the plate leaves the kitchen? What about the customer who finds a hair in their soup after it's been served?

Most coding agents don't care. They write code, push, walk away. A human discovers the broken CI build an hour later. That's the equivalent of a cook plating a dish, sending it out, and never checking if the diner is still alive.

We closed this loop with three tools: ci_status, ci_logs and ci_wait. Together they are the waiter who brings the plate back.

The agent pushes, calls ci_wait. If CI fails, it reads ci_logs, fixes the exact error, pushes again. DeepSeek V4 parses compiler errors the way a cook reads a ticket: "missing import" = forgot the salt, "type mismatch" = wrong pan size, "module not found" = ingredient not in stock. Pattern-matched and fixed in seconds.

Real Example

Adding a health check endpoint to a Go service. Agent wrote the handler and test, pushed. CI failed — the test imported a package that didn't exist on the runner. Agent read ci_logs, saw go: module not found, added the missing go.mod replace directive, pushed again. CI passed. PR opened. 4 minutes. $0.06.

Three safeguards prevent the kitchen grinding to a halt: retry limit (3, same dish doesn't circle forever), diff budget (retries only touch files already on the ticket), and hallucination detection (if the cook claims the customer loved it without actually asking the waiter, they get corrected).
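A rough sketch of how these three checks could be expressed. The shapes and function names (`RetryAttempt`, `checkRetry`, `claimIsGrounded`) are hypothetical, and reading "hallucination detection" as "a pass claim must be backed by a CI tool call" is my interpretation for this example, not a description of the extension's internals.

```typescript
// Hypothetical shapes; the tinqs-ci extension's real data model is not shown in this post.
interface RetryAttempt {
  attempt: number;        // 1-based retry number
  changedFiles: string[]; // files this retry wants to touch
}
interface GuardDecision { allowed: boolean; reason: string }

// Retry limit and diff budget: a retry may only touch files already on the ticket.
function checkRetry(retry: RetryAttempt, ticketFiles: string[], maxRetries = 3): GuardDecision {
  if (retry.attempt > maxRetries) {
    return { allowed: false, reason: `retry limit of ${maxRetries} reached` };
  }
  const onTicket = new Set(ticketFiles);
  const strays = retry.changedFiles.filter((f) => !onTicket.has(f));
  if (strays.length > 0) {
    return { allowed: false, reason: `diff budget exceeded: ${strays.join(", ")}` };
  }
  return { allowed: true, reason: "within retry limit and diff budget" };
}

// A "CI passed" claim only counts if a CI tool was actually called in this session.
function claimIsGrounded(claimsCiPassed: boolean, toolCalls: string[]): boolean {
  if (!claimsCiPassed) return true; // nothing to verify
  return toolCalls.some((name) => name === "ci_wait" || name === "ci_status");
}
```

Both checks are pure functions over data the harness already has (the attempt counter, the diff, the tool-call log), so they can run without another model call.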

The Numbers

Over three weeks of running the orchestrator:

The 26% retry rate matches what you'd see from a junior developer. The difference: the agent fixes it in 30 seconds.

The Architecture

| Layer | What | How |
|---|---|---|
| Flow engine | pi-flows orchestrator | Composes agents, gates and decision points |
| Oracle gates | verify_build, drive_game, game_frames | Return structured PASS/FAIL with evidence |
| Sub-agents | G1 build · G2 tests · G3 behaviour · G4 feel · G5 visual | Role-split, each with its own toolset |
| CI loop | tinqs-ci extension | ci_status, ci_logs, ci_wait: polls Gitea Actions, reads logs, retries |
| Decision | Agent-loop Reflexion | Self-reflect on failures, retry (≤3) or escalate |
| Visualization | FlowDashboard | Real-time pipeline state |

A Real Flow in Action: Fixing 19 Tests After a Crash

This morning, a machine crash cut off a flow mid-stream. Nineteen tests were left red — contracts written, implementation half-done. The task: finish the interrupted jump & locomotion animation work and make them all green.

I typed one slash command into Pi:

/game-feature Finish the leftover jump & locomotion animation work — make the 19 FAILING tests GREEN. They are existing RED contracts written by an earlier animation flow that a machine crash cut off mid-stream; the contracts are already written, so IMPLEMENT to satisfy them (do not rewrite the contracts).

What happened next was fully autonomous. Here's the flow, verbatim — this is the exact YAML that runs in production:

name: game-feature
description: Build a PLAYABLE game feature and prove it in the LIVE game.
task_required: true

steps:
  # G0: Pre-flight — validate vision CAN run before any build work
  - id: preflight
    agent: vision-preflight
    task: Check GEMINI_API_KEY is set AND game_frames reaches a live instance.
          If EITHER fails, STOP — vision is not optional.

  # Context + plan
  - id: context
    agent: project-context-reader
    blockedBy: [preflight]

  - id: plan
    agent: feature-planner
    blockedBy: [context]

  # TDD: write tests FIRST (different agent than implementer)
  - id: test-author
    agent: test-author
    blockedBy: [plan]

  - id: implement
    agent: game-builder
    blockedBy: [test-author]

  # G1–G5: Oracle gates (build, tests, behaviour, feel, visual)
  - id: build
    agent: build-verifier
  - id: tests
    agent: test-runner
  - id: behavior
    agent: behavioral-prober        # drives the LIVE game via drive_game
  - id: feel
    agent: feel-judge               # apex, airtime, latency, rise/fall
  - id: visual
    agent: animation-vision-judge   # multimodal gemini-2.5-flash

  # Self-recurring fix-loop: bounded loop back to implement with evidence
  - id: fix-loop
    type: agent-loop-decision
    agent: flow-decision
    loop_target: implement
    exit_target: report
    max_iterations: 3

  # Final judge: one honest verdict
  - id: report
    agent: game-judge

Eighteen steps, seven custom agents, five oracle gates, and one judge. The whole thing runs as a slash command.

Here's what actually happened. The vision-preflight agent fired first — checked that GEMINI_API_KEY was set and that game_frames could reach the live game instance. Both passed in under a second. Without this gate, the rest of the flow would be meaningless — we'd do all the build work only to discover the vision judge can't run. So we check first.

The project-context-reader ingested PlayerController.cs, PlayerAnimController.cs, PlayerAnimationLogic.cs, the test files, and the manifest. The feature-planner decomposed the 19 failures into four fix groups: (1) vegetation manifest — 146 items with broken prefabPath, (2) animation controller — crouch parameter not plumbed through, (3) jump physics — coyote time, variable height, air control all unimplemented, (4) animation tree — state machine missing entirely.

Then the game-builder agent went to work. It read the test failure messages, traced each one to the source, and started implementing. Coyote time: a 100ms grace period after IsOnFloor() becomes false. Variable jump height: scale velocity by key hold duration, 3.5 at tap, 6.5 at 300ms hold. Air control: reduce horizontal velocity by 40% when airborne. Jump phases: minimum 0.15s duration on jump_start before transitioning to airborne. Landing timer: wait full jump_land length + one frame, not length - blend. Animation tree: state machine with jump_start → jump → jump_land states, 0.1s blend transitions.
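The jump rules above are compact enough to write down. The actual implementation is C# in Godot and isn't reproduced here; what follows is a TypeScript sketch of the same arithmetic using the numbers from the paragraph. The linear interpolation between tap and full hold is an assumption (the post only gives the two endpoints), and the function names are illustrative.

```typescript
// Tuning values from the description above; the curve between the endpoints is assumed linear.
const COYOTE_MS = 100;
const TAP_VELOCITY = 3.5;
const FULL_HOLD_VELOCITY = 6.5;
const FULL_HOLD_MS = 300;
const AIR_CONTROL_FACTOR = 0.6; // a 40% reduction while airborne

// Coyote time: a jump is still allowed shortly after the floor contact is lost.
function canJump(onFloor: boolean, nowMs: number, lastOnFloorMs: number): boolean {
  return onFloor || nowMs - lastOnFloorMs <= COYOTE_MS;
}

// Variable jump height: scale launch velocity by how long the key was held, clamped to [0, 300ms].
function jumpVelocity(holdMs: number): number {
  const t = Math.min(Math.max(holdMs / FULL_HOLD_MS, 0), 1);
  return TAP_VELOCITY + (FULL_HOLD_VELOCITY - TAP_VELOCITY) * t;
}

// Air control: horizontal input counts for less while airborne.
function horizontalVelocity(inputVelocity: number, onFloor: boolean): number {
  return onFloor ? inputVelocity : inputVelocity * AIR_CONTROL_FACTOR;
}
```

Under the linear assumption, a 150ms hold lands halfway between the endpoints at 5.0, and any hold past 300ms is capped at 6.5.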

The build-verifier compiled it. Test-runner ran the suite. Behavioral-prober sent {"jump": true} to the live game and sampled the player body 30 times. Feel-judge measured apex height, airtime, and liftoff latency against thresholds. Animation-vision-judge grabbed 8 frames at 100ms intervals, composed them into a grid, and had gemini-2.5-flash check for T-poses, foot-slide, frozen frames, and missing transitions.

Any red gate → evidence fed back to the game-builder → fix → re-enter the gate ladder. Bounded to 3 retries per the max_iterations in the loop decision. Any green gate → falls through to the next. All green → the game-judge produces the final honest verdict.

This isn't a demo. It's running right now, as I write this, in a Pi session on my machine. The flow is a file at .pi/flows/flows/game-feature.yaml. I trigger it with a slash command. It dispatches sub-agents, runs them through oracle gates, loops on failures, and reports a verdict. That's it.

The Flow-as-Command Pattern

Every flow registers as a slash command. .pi/flows/flows/game-feature.yaml becomes /game-feature. Type it in Pi, describe what you want, hit enter. The flow architect dispatches the DAG, the dashboard shows agent cards with live status, and you watch it happen — or walk away and check the result later.
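The file-to-command mapping can be sketched in a few lines. This is an illustration of the naming rule described above, not Pi's registration code; `commandForFlow` and `parseInvocation` are hypothetical helpers.

```typescript
// Derives the slash command from a flow file path: strip the directory and the YAML extension.
function commandForFlow(flowPath: string): string {
  const file = flowPath.split("/").pop() ?? "";
  return "/" + file.replace(/\.ya?ml$/, "");
}

// Splits what the user typed into the command and the free-text task that follows it.
function parseInvocation(line: string): { command: string; task: string } | null {
  const match = /^(\/[\w-]+)(?:\s+([\s\S]*))?$/.exec(line.trim());
  if (match === null) return null; // not a slash command
  return { command: match[1] ?? "", task: (match[2] ?? "").trim() };
}
```

For example, `.pi/flows/flows/game-feature.yaml` maps to `/game-feature`, and everything typed after the command becomes the task text.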

This is the pattern that makes flows different from scripts. Flows are not hardcoded pipelines you invoke from the terminal. They're slash commands you type in conversation. You describe what you want in natural language, the flow wires it through the agents, and the agents route through the gates. The YAML is the skeleton; the conversation is the context.

A few flows I use daily:

The slash command is the interface. The flow is the implementation. The oracle gates are the safety net.

How Agents Communicate (It's Not Chat)

A common question: are the agents constantly talking to each other? The answer is no — and that's deliberate. Agents don't chat. They pass structured results through the flow engine bus.

Each agent runs in an isolated session with scoped tools and file access. When agent A finishes, it calls finish({ summary: "...", artifacts: "...", files: "..." }). The flow engine records the result. Agent B receives exactly what it needs via template variables — ${{result.A.summary}}, ${{result.A.artifacts}}, ${{result.A.files}} — wired through the inputs: block in the flow YAML.
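A minimal sketch of what that template wiring might look like on the engine side. The `${{result.<step>.<field>}}` syntax is taken from the paragraph above; the `resolveTemplate` function, the `StepResults` type, and the choice to throw on an unknown reference are assumptions for illustration.

```typescript
// Results recorded by the engine, keyed by step id, then by field name (summary, artifacts, files).
type StepResults = Record<string, Record<string, string>>;

// Replaces each ${{result.<step>.<field>}} placeholder with the recorded value.
// Throwing on a missing value is a design choice in this sketch: an unwired input fails loudly.
function resolveTemplate(template: string, results: StepResults): string {
  const pattern = /\$\{\{\s*result\.([\w-]+)\.(\w+)\s*\}\}/g;
  return template.replace(pattern, (match: string, step: string, field: string) => {
    const value = results[step]?.[field];
    if (value === undefined) throw new Error(`Unwired input: ${match}`);
    return value;
  });
}
```

Because the receiving agent only ever sees the resolved string, it has no handle on the upstream agent's session, only on the fields the flow chose to pass along.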

This is not agent-to-agent chatter. It's a publish/subscribe bus where the flow engine is the broker. Agents never directly invoke each other. They never read each other's raw output unless the flow explicitly wires it. The DAG's blockedBy edges define who waits for whom; the inputs: block defines what data flows across the edge.
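The "who waits for whom" part is an ordinary dependency sort. Below is a small sketch that turns blockedBy edges into a valid run order; `FlowStep` and `executionOrder` are hypothetical names, and a real engine would presumably dispatch all ready steps in parallel instead of flattening them into a list.

```typescript
// Hypothetical step shape mirroring the id/blockedBy fields in the flow YAML.
interface FlowStep { id: string; blockedBy?: string[] }

// Repeatedly picks the steps whose dependencies are all done.
// Throws if nothing is ready, which means a cycle or a blockedBy pointing at an unknown step.
function executionOrder(steps: FlowStep[]): string[] {
  const done = new Set<string>();
  const order: string[] = [];
  let remaining = steps.slice();
  while (remaining.length > 0) {
    const ready = remaining.filter((s) => (s.blockedBy ?? []).every((dep) => done.has(dep)));
    if (ready.length === 0) throw new Error("cycle or unknown dependency in flow");
    for (const s of ready) {
      done.add(s.id);
      order.push(s.id);
    }
    remaining = remaining.filter((s) => !done.has(s.id));
  }
  return order;
}
```

Fed the first five steps of a flow like game-feature in any order, this returns preflight, context, plan, test-author, implement.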

Why not let agents talk freely? Because unstructured chatter is the fastest path to hallucination cascades. Agent A confidently states something wrong, agent B builds on it, agent C compounds it. By the time a human notices, you have three agents collectively wrong about a file that doesn't exist. Structured result-passing with typed outputs (verdict: pass, findings: ["missing import", "type mismatch"]) keeps each agent's output machine-readable and verifiable by the gates.

Pi itself is designed for solo interactive work — you ask, it does, you review. The orchestration layer I wrote on top inverts that pattern. Pi becomes the agent harness; the flow engine becomes the conductor. Agents don't talk to each other. They talk to the engine. The engine talks to the gates. The gates talk to the live game. That's the architecture.

The Setup: Extensions, Agents, and 15–20 Flows

"How did you set this up?" is the question I get most often. Here's the honest answer: there's no dashboard with drag-and-drop. You write three kinds of files.

Extensions are TypeScript tools that agents call. Each is about 300 lines, MIT licensed:

| Extension | What agents call it for |
|---|---|
| verify_build | Compile the game + sim, return file:line errors |
| drive_game | Send input to the live game, sample player body |
| game_frames | Capture screenshot sequences for vision judging |
| ci_status | Check Gitea Actions pipeline state for a branch |
| ci_logs | Fetch full build log from the most recent failed run |
| ci_wait | Poll every 15 seconds until the pipeline finishes |
| gen_image | Generate brand/marketing images via fal.ai flux-2-pro |
| agent_catalog | List available agents with their tools, inputs, outputs |

Agents are Markdown files with YAML frontmatter. Each declares its role, model tier, tools, inputs, and outputs:

---
name: game-builder
description: Implements game features in C# (Godot)
model: @coding
tools: read, write, edit, bash, verify_build, drive_game
inputs: [context, plan, build_fail, behaviour_fail, feel_fail, visual_fail]
outputs: [summary, files]
---
You are a game developer. Task: ${{task}}
Context: ${{input.context}}

Flows are YAML DAGs that wire agents together. I have about 15–20 flows running across different domains:

The setup is not a product you install. It's a stack: Pi as the agent harness, custom extensions as the tool layer, markdown agents as the role layer, YAML flows as the orchestration layer. The whole thing lives in .pi/flows/. Version-controlled. CI-tested. Slash-command invoked.

Structure vs. Freestyle: The Skeleton and the Muscle

"Do you define the process with these trees, or do the agents freestyle a bit?" Both — and knowing which is which is the whole game.

The skeleton is rigid. The flow YAML defines exactly which agents run, in what order, with what dependencies (blockedBy), what inputs they receive, and which gates they must pass. The DAG is not negotiable. An agent cannot decide to skip the build gate because it feels confident. The build gate runs. Period.

The muscle is autonomous. Inside its step, an agent has full agency. The game-builder decides which files to read, which approach to take, which code to write. It discovers project structure with grep and find. It runs the test suite to understand failures. It writes the fix and verifies it compiles. No human tells it "edit line 247 of PlayerController.cs." The agent figures that out.

Think of it like a company: the org chart (DAG) defines reporting lines and handoff points. The people (agents) do the actual work their own way. The compliance department (gates) checks everything before it ships. The CEO (judge) signs off.

This balance is why the system works at all. Too much structure → agents can't adapt to unexpected situations. Too much freestyle → agents hallucinate, skip checks, ship broken code. The skeleton guarantees the right things happen in the right order. The muscle handles the messy reality of actual code.

And when a flow's skeleton is wrong? The meta-flow improves it. It reads flow performance data, identifies bottlenecks ("the feel gate keeps failing because the game-builder doesn't know the jump velocity threshold"), edits the YAML to wire that threshold into the builder's inputs, and commits the change. Flows that improve flows. That's the endgame.

Model Strategy: DeepSeek for Code, Gemini for Vision

"Which DeepSeek model?" The short answer: DeepSeek V4 for coding-heavy agents, DeepSeek V4 Flash for fast routing decisions. The long answer: model selection is not one-size-fits-all.

Flows use role-based model tiers — each agent declares a tier (@coding, @planning, @research, @fast, @compact, @vision), and the engine resolves it to a concrete model at dispatch time. This means you can swap models globally without touching any agent or flow file.

| Tier | Model | Used for |
|---|---|---|
| @coding | deepseek/deepseek-v4 | Reading, writing, editing code: the game-builder, fixer, test-author |
| @planning | deepseek/deepseek-v4 | Flow architect, feature planner: decomposing tasks, designing DAGs |
| @fast | deepseek/deepseek-v4-flash | Routing decisions: gate pass/fail, fork choices, loop exit checks |
| @research | deepseek/deepseek-v4 | Codebase investigation, reading project docs, pattern analysis |
| @vision | google/gemini-2.5-flash | Multimodal frame judging: T-pose detection, animation clip verification |
| @compact | deepseek/deepseek-v4-flash | Summarisation, report generation, lightweight post-processing |
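Resolving a tier at dispatch time amounts to a lookup. Here is a minimal sketch using the tier names and model IDs from the table; the `resolveModel` function and the rule that non-tier strings pass through unchanged are assumptions, not the engine's documented behaviour.

```typescript
// Tier table copied from this post; in a real setup it would live in engine configuration.
const TIER_MODELS: Record<string, string> = {
  "@coding": "deepseek/deepseek-v4",
  "@planning": "deepseek/deepseek-v4",
  "@fast": "deepseek/deepseek-v4-flash",
  "@research": "deepseek/deepseek-v4",
  "@vision": "google/gemini-2.5-flash",
  "@compact": "deepseek/deepseek-v4-flash",
};

// Maps an agent's declared model to a concrete model ID.
function resolveModel(declared: string, tiers: Record<string, string> = TIER_MODELS): string {
  if (!declared.startsWith("@")) return declared; // assumed: concrete model IDs pass through
  const model = tiers[declared];
  if (model === undefined) throw new Error(`Unknown model tier: ${declared}`);
  return model;
}
```

Swapping a model globally then means changing one entry in the table: every agent declaring that tier picks up the new model on its next dispatch, with no edits to agent or flow files.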

Why DeepSeek? Two reasons. First, it's free — the coding tier runs on DeepSeek's API with no usage limits, which matters when your game-builder agent is reading 800-line files and writing 200-line diffs ten times a session. Second, it's genuinely good at C# and Godot — I've had it write a full lighting module for our Godot fork by reading Unity API docs and adapting patterns. No agent had pulled that off before.

Vision is the exception. DeepSeek can't do multimodal, so the visual gate uses Gemini 2.5 Flash. It's fast (under 2 seconds per frame grid), cheap, and catches the things that matter: T-poses, foot-slide, frozen animations, missing transitions. The vision preflight gate checks the Gemini API key is set before any build work starts — if it's missing, the entire flow hard-stops. Vision is never silently skipped.

The key insight: different work needs different brains. Code writing needs a model that understands language semantics and type systems. Vision judging needs a model that sees pixels and understands motion. Routing decisions need a model that's fast and decisive, not one that overthinks. The role-tier system means you configure this once, at the model level, and every agent that declares model: @coding gets the right brain automatically.


The oracle tools — verify_build, drive_game, game_frames — are the durable assets. About 300 lines of TypeScript each, MIT licensed, reusable in any Pi project. The flow engine composes them; the agents route through them.

A year ago we had a supervisor written in 1,050 lines of hardcoded TypeScript that did one thing: verify agent output compiled and passed tests. We deleted it. The same verification now runs as a composable flow with five gates, live-game testing, and CI integration. Sometimes the best architecture decision is knowing what to delete.

The flow-native brain runs on our Pi fork inside Tinqs Studio.