4 June 2026

How Pi Agents Build, Test, and Ship Code with Oracle-Backed Flows

Think of a restaurant kitchen during dinner rush. The head chef doesn't cook every dish. She runs the pass — each plate gets inspected before it leaves. One cook handles sauces, another pastry, another the grill. The expediter calls orders, coordinates timing, makes sure table 4's mains don't arrive before table 2's starters. A dish comes back? It goes straight to the station that messed up, with a ticket explaining exactly what's wrong. That kitchen runs on flows. So does our game engine.

The Kitchen ↔ Flows Analogy

The kitchen = Pi (the agent harness). The recipe = a flow YAML (the DAG). The line cooks = agents (each with a station and tools). The pass = the flow engine (routes finished work). The head chef's inspection = the five gates. The order ticket = a slash command. "Send it back!" = the fix loop.

What Happens When You Type a Slash Command

You type /game-feature add a double-jump with cooldown and hit enter. The ticket hits the kitchen. What follows is not one agent doing everything — it's a brigade running their stations.

[Figure: flow pipeline. Context → Plan → Implement → verify-heavy gates (G1 Build, G2 Tests, G3 Behaviour, G4 Feel, G5 Visual; most compute is spent checking, not writing). All green ⇒ done; any fail ⇒ report. Judge: honest verdict. Reflexion: fix and retry, at most 3 times.]
A real failure loops back to implement with gate evidence (bounded to three tries); anything green falls through to the judge.

The Five Gates: What the Head Chef Checks

In a kitchen, the head chef doesn't trust — she verifies. Every plate hits the pass and gets inspected. Our flows have the same instinct. Each gate is a sub-agent with one job, one tool, and absolute veto power.

In the Kitchen

Check the base. Is the protein cooked through? If the chicken is raw, the whole plate stops here. Nothing else matters.

In the Flow

G1 · Build Runs dotnet build. PASS/FAIL with file:line errors. Won't compile? Nothing proceeds.
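To make "PASS/FAIL with file:line errors" concrete, here is a minimal TypeScript sketch that turns raw dotnet build output into a structured verdict. The BuildVerdict shape and the parseBuildOutput name are assumptions for illustration, not the actual verify_build code; only the MSBuild-style diagnostic line format is standard.

```typescript
// One compiler error, reduced to what a fixer agent needs.
interface BuildDiagnostic {
  file: string;
  line: number;
  column: number;
  code: string;
  message: string;
}

interface BuildVerdict {
  verdict: "PASS" | "FAIL";
  errors: BuildDiagnostic[];
}

// Matches MSBuild-style lines such as:
//   Player.cs(42,17): error CS1002: ; expected [/src/Game.csproj]
const DIAGNOSTIC =
  /^\s*(.+?)\((\d+),(\d+)\):\s+error\s+(\w+):\s+(.*?)(?:\s+\[[^\]]*\])?\s*$/;

// The exit code decides PASS/FAIL; the parsed lines are the evidence.
function parseBuildOutput(output: string, exitCode: number): BuildVerdict {
  const errors: BuildDiagnostic[] = [];
  const seen = new Set<string>();
  for (const raw of output.split(/\r?\n/)) {
    const m = DIAGNOSTIC.exec(raw);
    if (!m) continue;
    const [, file, line, column, code, message] = m;
    const key = `${file}:${line}:${column}:${code}`;
    if (seen.has(key)) continue; // the build summary often repeats each error
    seen.add(key);
    errors.push({ file, line: Number(line), column: Number(column), code, message });
  }
  return { verdict: exitCode === 0 ? "PASS" : "FAIL", errors };
}
```

Keying the verdict on the exit code rather than on the parsed lines means a failure the regex does not recognise still stops the plate at the pass.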

In the Kitchen

Taste the sauce. Seasoning right? Acid balanced? The dish might look perfect but taste flat.

In the Flow

G2 · Tests Runs dotnet test. Parses which assertions broke. Code that compiles but gets the logic wrong is caught here.

In the Kitchen

Does it work? Pick it up. Does the sauce hold? Does the plating survive the walk to table 6?

In the Flow

G3 · Behaviour Sends {"jump":true} to the LIVE game. Samples the player body 30 times at 50ms intervals. Did the character actually jump? Did the double-jump fire? This is the ground-truth oracle, and it is what makes game dev fundamentally different from web dev.
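One way to picture the behaviour probe is as two pieces: a sampler that polls the player body, and a pure function that counts upward impulses in the samples. The sketch below assumes a hypothetical BodySample shape and injected read and sleep functions; the real drive_game tool may expose something quite different.

```typescript
// Hypothetical snapshot of the player body; field names are assumptions.
interface BodySample {
  tMs: number;       // time since the input was sent
  y: number;         // vertical position
  vy: number;        // vertical velocity, positive = up
  onGround: boolean;
}

// Poll the game `count` times, `intervalMs` apart (30 x 50 ms in the post).
async function sampleBody(
  read: () => Promise<BodySample>,
  sleep: (ms: number) => Promise<void>,
  count = 30,
  intervalMs = 50,
): Promise<BodySample[]> {
  const samples: BodySample[] = [];
  for (let i = 0; i < count; i++) {
    samples.push(await read());
    if (i < count - 1) await sleep(intervalMs);
  }
  return samples;
}

// Count upward impulses: leaving the ground while moving up, or a sudden gain
// in vertical velocity while already airborne (gravity alone only lowers vy).
function countJumpImpulses(samples: BodySample[], minAirKick = 1.0): number {
  let impulses = 0;
  for (let i = 1; i < samples.length; i++) {
    const prev = samples[i - 1];
    const cur = samples[i];
    const leftGround = prev.onGround && !cur.onGround && cur.vy > 0;
    const airKick = !prev.onGround && !cur.onGround && cur.vy - prev.vy >= minAirKick;
    if (leftGround || airKick) impulses++;
  }
  return impulses;
}

// A double-jump feature expects two impulses; a plain jump expects one.
function judgeJumps(samples: BodySample[], expected: number) {
  const observed = countJumpImpulses(samples);
  return {
    verdict: observed === expected ? ("PASS" as const) : ("FAIL" as const),
    evidence: `expected ${expected} impulse(s), observed ${observed}`,
  };
}
```

The minAirKick threshold is an illustrative value; at a 50 ms sampling interval it would need tuning against the game's real gravity and jump velocity.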

In the Kitchen

How does it feel? The steak is cooked but chewy. The sauce is seasoned but gloopy. Edible ≠ good.

In the Flow

G4 · Feel Measures apex height, airtime, liftoff latency, rise/fall asymmetry, landing settle. Numeric thresholds. A jump that works but takes 400ms to lift off fails. Behaviour says it happened. Feel says it felt good.
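The feel metrics can all be derived from the same kind of position samples. The following is a sketch under stated assumptions: the FeelSample shape, the function names, and every threshold value are hypothetical, chosen only to show how "numeric thresholds" might be applied.

```typescript
interface FeelSample { tMs: number; y: number; onGround: boolean; }

interface FeelMetrics {
  liftoffLatencyMs: number; // input sent -> first airborne sample
  apexHeight: number;       // highest point above the takeoff height
  airtimeMs: number;        // first airborne sample -> landing
  riseFallRatio: number;    // time rising / time falling
}

interface FeelThresholds {
  maxLiftoffLatencyMs: number;
  minApexHeight: number;
  minAirtimeMs: number;
}

// Returns null when the samples do not contain a complete jump.
function measureJump(samples: FeelSample[], inputAtMs: number): FeelMetrics | null {
  const liftIdx = samples.findIndex((s) => s.tMs >= inputAtMs && !s.onGround);
  if (liftIdx <= 0) return null; // need one grounded sample before liftoff
  const landOffset = samples.slice(liftIdx).findIndex((s) => s.onGround);
  if (landOffset < 0) return null; // never landed inside the sample window
  const landIdx = liftIdx + landOffset;

  let apexIdx = liftIdx;
  for (let i = liftIdx; i < landIdx; i++) {
    if (samples[i].y > samples[apexIdx].y) apexIdx = i;
  }
  const liftT = samples[liftIdx].tMs;
  const apexT = samples[apexIdx].tMs;
  const landT = samples[landIdx].tMs;
  const fallMs = landT - apexT;
  return {
    liftoffLatencyMs: liftT - inputAtMs,
    apexHeight: samples[apexIdx].y - samples[liftIdx - 1].y,
    airtimeMs: landT - liftT,
    riseFallRatio: fallMs > 0 ? (apexT - liftT) / fallMs : Infinity,
  };
}

function judgeFeel(m: FeelMetrics, t: FeelThresholds) {
  const failures: string[] = [];
  if (m.liftoffLatencyMs > t.maxLiftoffLatencyMs) {
    failures.push(`liftoff latency ${m.liftoffLatencyMs}ms > ${t.maxLiftoffLatencyMs}ms`);
  }
  if (m.apexHeight < t.minApexHeight) failures.push(`apex ${m.apexHeight} < ${t.minApexHeight}`);
  if (m.airtimeMs < t.minAirtimeMs) failures.push(`airtime ${m.airtimeMs}ms < ${t.minAirtimeMs}ms`);
  return { verdict: failures.length === 0 ? ("PASS" as const) : ("FAIL" as const), failures };
}
```

With 50 ms sampling, each timing metric is only accurate to about one interval, so thresholds tighter than that would need a faster probe.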

In the Kitchen

How does it look? Is the garnish wilting? Sauce smeared? Does it match the menu photo?

In the Flow

G5 · Visual Captures 8 frames at 100ms intervals, grids them, feeds to gemini-2.5-flash. Checks: T-pose? Foot-slide? Frozen animation? Wrong clip? Missing transitions?
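Gridding the frames is mostly layout arithmetic: each capture gets a timestamp and a cell, in reading order, so a vision model sees time flow left to right and top to bottom. This sketch only plans the contact sheet; the column count, cell size, and all names are assumptions, and the actual capture and model call are left out.

```typescript
interface Cell {
  frame: number;       // 0-based frame index
  captureAtMs: number; // when to grab this frame, relative to the input
  x: number;           // top-left corner of the cell on the sheet
  y: number;
}

interface ContactSheet {
  width: number;
  height: number;
  cells: Cell[];
}

// Plan a contact sheet for `frameCount` frames captured `intervalMs` apart
// (8 frames at 100 ms in the post), laid out row by row.
function planContactSheet(
  frameCount = 8,
  intervalMs = 100,
  columns = 4,
  cellWidth = 320,
  cellHeight = 180,
): ContactSheet {
  if (frameCount <= 0 || columns <= 0) {
    throw new Error("frameCount and columns must be positive");
  }
  const rows = Math.ceil(frameCount / columns);
  const cells: Cell[] = Array.from({ length: frameCount }, (_, i) => ({
    frame: i,
    captureAtMs: i * intervalMs,
    x: (i % columns) * cellWidth,
    y: Math.floor(i / columns) * cellHeight,
  }));
  return {
    width: Math.min(columns, frameCount) * cellWidth,
    height: rows * cellHeight,
    cells,
  };
}
```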

The Loop

Any red gate → evidence sent back to the cook → fix → re-enter the inspection line. Three chances max, then the head chef escalates to a human. This is the same instinct that makes a good kitchen work: catch it early, send it back with a clear note, give them a chance to fix it, but don't let the same dish circle the pass forever.
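The loop described above can be sketched as a small bounded retry. This version is synchronous to keep it short (real gates would be asynchronous), and it reads "three chances" as one initial attempt plus up to three send-backs; both that reading and all the names here are assumptions.

```typescript
interface GateResult {
  gate: string;
  pass: boolean;
  evidence: string; // what gets written on the ticket when a plate is sent back
}

type Gate = () => GateResult;

interface LoopOutcome {
  verdict: "PASS" | "ESCALATE";
  attempts: number;
  lastFailure: GateResult | null;
}

// Run gates in order and stop at the first red one: nothing downstream
// matters if an earlier gate has already failed.
function firstFailure(gates: Gate[]): GateResult | null {
  for (const gate of gates) {
    const result = gate();
    if (!result.pass) return result;
  }
  return null;
}

// `implement` receives null on the first pass and the failing gate's
// evidence on every retry.
function runWithFixLoop(
  gates: Gate[],
  implement: (evidence: GateResult | null) => void,
  maxFixes = 3,
): LoopOutcome {
  let failure: GateResult | null = null;
  for (let attempt = 0; attempt <= maxFixes; attempt++) {
    implement(failure);
    failure = firstFailure(gates);
    if (failure === null) {
      return { verdict: "PASS", attempts: attempt + 1, lastFailure: null };
    }
  }
  // Out of retries: hand the last evidence to a human instead of looping on.
  return { verdict: "ESCALATE", attempts: maxFixes + 1, lastFailure: failure };
}
```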

Composability: Adding a New Station

A kitchen doesn't redesign the whole line when they add a new dish. They add a station. Same in flows. Started with three gates — build, test, vision. Behaviour and feel came later, each a single-file extension. Gates aren't hardcoded. They're sub-agents declared in YAML. Want a linting gate? Add a sub-agent with a linter. Security scan? Same pattern. Asset bundle size check? Write the tool, declare the agent, wire it in.

Self-Improving Kitchen

Agents can extend the flow at runtime. If the behaviour gate keeps failing because the game window isn't focused, an agent notices the pattern and inserts a pre-condition gate that checks window focus. The flow engine handles routing; the agents handle decisions. This is what makes flows fundamentally different from a script — the pipeline isn't fixed at compile time. It's a graph that agents read, understand, and modify while they run.


The CI Loop: The Dish That Came Back After It Left

Gates inspect plates at the pass. But what about after the plate leaves the kitchen? What about the customer who finds a hair in their soup after it's been served?

Most coding agents don't care. They write code, push, walk away. A human discovers the broken CI build an hour later. That's the equivalent of a cook plating a dish, sending it out, and never checking if the diner is still alive.

We closed this loop with three tools, ci_status, ci_logs, and ci_wait: the waiter who brings the plate back.

The agent pushes, calls ci_wait. If CI fails, it reads ci_logs, fixes the exact error, pushes again. DeepSeek V4 parses compiler errors the way a cook reads a ticket: "missing import" = forgot the salt, "type mismatch" = wrong pan size, "module not found" = ingredient not in stock. Pattern-matched and fixed in seconds.

Real Example

Adding a health check endpoint to a Go service. Agent wrote the handler and test, pushed. CI failed — the test imported a package that didn't exist on the runner. Agent read ci_logs, saw go: module not found, added the missing go.mod replace directive, pushed again. CI passed. PR opened. 4 minutes. $0.06.

Three safeguards prevent the kitchen grinding to a halt: retry limit (3, same dish doesn't circle forever), diff budget (retries only touch files already on the ticket), and hallucination detection (if the cook claims the customer loved it without actually asking the waiter, they get corrected).
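The three safeguards can be expressed as a single check that runs before each retry is accepted. This is a minimal sketch: the RetryAttempt shape, the CiStatus values, and the function name are hypothetical, not the tinqs-ci extension's actual types.

```typescript
type CiStatus = "success" | "failure" | "pending"; // assumed status values

interface RetryAttempt {
  attempt: number;             // 1-based retry counter
  changedFiles: string[];      // files this retry touched
  claimedCiPassed: boolean;    // what the agent said happened
  observedCi: CiStatus | null; // what ci_status returned, null if never called
}

// Returns a list of violations; an empty list means the retry may proceed.
function checkSafeguards(
  ticketFiles: string[],
  retry: RetryAttempt,
  maxRetries = 3,
): string[] {
  const violations: string[] = [];

  // 1. Retry limit: the same dish does not circle forever.
  if (retry.attempt > maxRetries) {
    violations.push(`retry limit: attempt ${retry.attempt} exceeds ${maxRetries}`);
  }

  // 2. Diff budget: retries only touch files already on the ticket.
  const allowed = new Set(ticketFiles);
  const stray = retry.changedFiles.filter((file) => !allowed.has(file));
  if (stray.length > 0) {
    violations.push(`diff budget: ${stray.join(", ")} not on the ticket`);
  }

  // 3. Hallucination detection: a claimed pass must match an observed pass.
  if (retry.claimedCiPassed && retry.observedCi !== "success") {
    violations.push(
      `unverified claim: agent reported green but ci_status says ${retry.observedCi ?? "not checked"}`,
    );
  }
  return violations;
}
```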

The Numbers

Over three weeks of running the orchestrator:

The 26% retry rate matches what you'd see from a junior developer. The difference: the agent fixes it in 30 seconds.

The Architecture

| Layer | What | How |
| --- | --- | --- |
| Flow engine | pi-flows orchestrator | Composes agents, gates and decision points |
| Oracle gates | verify_build, drive_game, game_frames | Return structured PASS/FAIL with evidence |
| Sub-agents | G1 build · G2 tests · G3 behaviour · G4 feel · G5 visual | Role-split, each with its own toolset |
| CI loop | tinqs-ci extension | ci_status, ci_logs, ci_wait: polls Gitea Actions, reads logs, retries |
| Decision | Agent-loop Reflexion | Self-reflect on failures, retry (≤3) or escalate |
| Visualization | FlowDashboard | Real-time pipeline state |

Three Kitchens, One Morning

This morning, I ran three flows. Each is a different kitchen, a different brigade, a different dish. Here's what actually happened — real flow logs, real verdicts, nothing staged.

Flow 1 · 4 June, 18:32

/deep-implement — "Build the tinqs-gitea-read extension: list_org_repos, read_repo_file, list_repo_dir, search_repos." Nine steps, 14 minutes. Verdict: PASS. 31/31 vitest tests green, zero new TypeScript errors, session-level caching, path traversal protection. Every execute() body fully wired — no stubs, no placeholders. Like a saucier who doesn't just list ingredients but actually makes the sauce.

Flow 2 · 4 June, 19:04

/game-feature — "Make the player jump." Build: PASS. Tests: PASS. Behaviour/Feel/Visual: NOT RUN — no live game instance was reachable. The flow didn't silently skip the visual gate. It hard-stopped and reported honestly: "FAIL — the feature has not been verified in-game." This is the kitchen saying: "The dish is cooked, but nobody tasted it. I'm not sending it out."

Flow 3 · 4 June, 19:49

/cto-infra — "Synthesize cost, stability, and VCS research into an AWS architecture decision." Four research streams fed into one CTO agent. Output: 14 requirements mapped to specific decisions, cost-vs-stability tradeoffs resolved with dollar figures, EC2+EBS over Fargate+EFS, RDS Multi-AZ mandatory, S3+CloudFront for LFS. Like an executive chef reading four menu proposals, reconciling them into one service, and pricing every plate.


Dinner Rush Recovery: The Crash That Interrupted Service

Earlier today, a machine crash cut off a flow mid-stream — the kitchen lost power during dinner rush. Nineteen tests were left red. Contracts written, implementation half-done. Half-cooked dishes on every station.

I typed one slash command — the expediter reassembled the brigade:

/game-feature Finish the leftover jump & locomotion animation work — make the 19 FAILING tests GREEN.

What happened next: the team picked up exactly where the crash left off. Here's the recipe — the exact YAML that runs in production:

name: game-feature
description: Build a PLAYABLE game feature and prove it in the LIVE game.
task_required: true

steps:
  # G0: Pre-flight — validate vision CAN run before any build work
  - id: preflight
    agent: vision-preflight
    task: Check GEMINI_API_KEY is set AND game_frames reaches a live instance.
          If EITHER fails, STOP — vision is not optional.

  # Context + plan
  - id: context
    agent: project-context-reader
    blockedBy: [preflight]

  - id: plan
    agent: feature-planner
    blockedBy: [context]

  # TDD: write tests FIRST (different agent than implementer)
  - id: test-author
    agent: test-author
    blockedBy: [plan]

  - id: implement
    agent: game-builder
    blockedBy: [test-author]

  # G1–G5: Oracle gates (build, tests, behaviour, feel, visual)
  - id: build
    agent: build-verifier
  - id: tests
    agent: test-runner
  - id: behavior
    agent: behavioral-prober       # drives LIVE game via drive_game
  - id: feel
    agent: feel-judge              # apex, airtime, latency, rise/fall
  - id: visual
    agent: animation-vision-judge  # multimodal gemini-2.5-flash

  # Self-recurring fix-loop: bounded loop back to implement with evidence
  - id: fix-loop
    type: agent-loop-decision
    agent: flow-decision
    loop_target: implement
    exit_target: report
    max_iterations: 3

  # Final judge: one honest verdict
  - id: report
    agent: game-judge

Eighteen steps, seven cooks, five inspection points, one head chef. Triggered by a single order ticket.
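The blockedBy fields in the recipe are what make it a DAG: a step becomes runnable once everything it is blocked by has finished. As a sketch of how an engine might derive an order from those fields, here is Kahn's topological sort in TypeScript; the Step type and executionOrder name are assumptions, and the real flow engine may schedule differently (for example, running independent steps in parallel).

```typescript
interface Step {
  id: string;
  blockedBy?: string[];
}

// Kahn's algorithm: repeatedly release steps whose dependencies are all done.
function executionOrder(steps: Step[]): string[] {
  const ids = new Set(steps.map((s) => s.id));
  const unmet = new Map<string, number>();        // step -> unmet dependency count
  const dependents = new Map<string, string[]>(); // step -> steps waiting on it

  for (const step of steps) {
    const deps = step.blockedBy ?? [];
    unmet.set(step.id, deps.length);
    for (const dep of deps) {
      if (!ids.has(dep)) {
        throw new Error(`step "${step.id}" is blocked by unknown step "${dep}"`);
      }
      dependents.set(dep, [...(dependents.get(dep) ?? []), step.id]);
    }
  }

  const ready = steps.filter((s) => unmet.get(s.id) === 0).map((s) => s.id);
  const order: string[] = [];
  while (ready.length > 0) {
    const id = ready.shift() as string;
    order.push(id);
    for (const next of dependents.get(id) ?? []) {
      const left = (unmet.get(next) ?? 0) - 1;
      unmet.set(next, left);
      if (left === 0) ready.push(next);
    }
  }

  // Any step never released is part of a dependency cycle.
  if (order.length !== steps.length) {
    throw new Error("cycle detected in blockedBy");
  }
  return order;
}
```

Rejecting unknown dependencies and cycles up front means a typo in the YAML fails before any agent is dispatched, not halfway through service.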

Here's how the brigade actually worked. The vision-preflight agent — the chef who checks the gas is on before anyone starts cooking — verified GEMINI_API_KEY was set and game_frames could reach the live game. Both green in under a second. Without this, the whole kitchen would prep for an hour only to discover the oven doesn't work.

The project-context-reader — the commis who reads the entire recipe book — ingested PlayerController.cs, PlayerAnimController.cs, PlayerAnimationLogic.cs, the test files, the manifest. The feature-planner — the sous-chef who breaks down the order into station tasks — decomposed 19 failures into four fix groups: vegetation manifest (146 broken prefabPath items), animation controller (crouch parameter not plumbed), jump physics (coyote time, variable height, air control — all missing), and animation tree (entire state machine absent).

Then the game-builder — the line cook at the hot station — read each test failure like a dish ticket, traced it to the source, and started cooking. Coyote time: 100ms grace period after feet leave the ground. Variable jump height: velocity scaled by hold duration, tap gives 3.5, full hold gives 6.5. Air control: horizontal speed cut 40% while airborne. Jump phases: minimum 0.15s on jump_start before transitioning up. Landing timer: wait the full animation length, not length-minus-blend. Animation tree: jump_start → jump → jump_land states with 0.1s blends.

Then the inspection line: build-verifier compiled. Test-runner ran the suite. Behavioral-prober sent {"jump":true} to the live game and sampled the player body. Feel-judge measured apex, airtime, liftoff latency. Animation-vision-judge captured 8 frames, gridded them, had gemini-2.5-flash scan for T-poses and foot-slide.

Anything red → ticket back to the cook with the specific failure → fix → re-enter the line. Bounded to 3 returns. Anything green → falls through. All green → game-judge gives the final verdict.

Not a Demo

This flow is a file at .pi/flows/flows/game-feature.yaml. I trigger it by typing /game-feature in Pi. It dispatches agents, runs gates, loops on failures, reports a verdict. There is no dashboard with drag-and-drop. There is a YAML file and a slash command. That's the whole product.


The Menu: Flows Are Slash Commands

Every flow becomes a slash command: the menu you read to the expediter. The file .pi/flows/flows/game-feature.yaml becomes /game-feature. You don't invoke a pipeline from a terminal. You order a dish in conversation.

"Add wall-running" is not a CLI flag. It's natural language. The flow reads it, wires it through the agents, routes it through the gates. The YAML is the recipe. The conversation is the context.

The menu I call from daily:

The Pass: How Agents Hand Off Work

In a real kitchen, cooks don't shout instructions across the room. They place finished plates on the pass. The expediter reads the ticket, checks the plate, routes it to the next station or to the dining room. Nobody yells. Nobody grabs someone else's pan.

Flows work the same way. Agents never talk to each other directly. When the game-builder finishes, it doesn't ping the test-runner. It calls finish({ summary: "...", artifacts: "...", files: "..." }) — placing its work on the pass. The flow engine — the expediter — records it and routes it. The next agent receives exactly the inputs wired in the YAML: ${{result.game-builder.summary}}, ${{result.game-builder.files}}.

What People Expect

Agents chatting freely, PM-slack style: "Hey test-runner, I just pushed some code, can you check it? Also the jump feels off, maybe tune the velocity?"

What Actually Happens

Agent A → finish({verdict: "pass", findings: ["coyote_time=100ms"]}) → engine records → Agent B receives ${{result.A.findings}} via inputs: block. No chatter. Structured handoff.

Why? Because unstructured chatter is how hallucination cascades start. Agent A confidently states something wrong. Agent B builds on it. Agent C compounds it. Three agents later, they're collectively wrong about a file that doesn't exist, and nobody can trace where the error came from. The pass — structured result-passing with typed outputs — makes every handoff auditable, verifiable, and debuggable.
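The handoff can be pictured as plain string substitution over recorded results. Below is a minimal sketch of resolving ${{result.<agent>.<field>}} placeholders; the Results type and the resolveInputs name are assumptions, and the actual engine's resolver is not shown in this post.

```typescript
// Recorded outputs, keyed by agent id and then by output field.
type Results = Record<string, Record<string, string>>;

// Matches ${{result.game-builder.summary}} (agent ids may contain hyphens).
const PLACEHOLDER = /\$\{\{\s*result\.([A-Za-z0-9_-]+)\.([A-Za-z0-9_]+)\s*\}\}/g;

// Replace every placeholder with the recorded value, and fail loudly if the
// value was never recorded, so a missing plate is never silently skipped.
function resolveInputs(template: string, results: Results): string {
  return template.replace(PLACEHOLDER, (_match: string, agent: string, field: string) => {
    const value = results[agent]?.[field];
    if (value === undefined) {
      throw new Error(`no recorded output result.${agent}.${field}`);
    }
    return value;
  });
}
```

Throwing on a missing value, instead of substituting an empty string, is a design choice in this sketch: it keeps the next agent from building on an input that was never produced.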

Pi itself is built for solo interactive work: you ask, it does, you review. The orchestration layer I wrote on top inverts that. Pi becomes the kitchen. The flow engine becomes the expediter. Agents become line cooks who place plates on the pass, never shouting across the room.

The Setup: Extensions, Agents, and 15–20 Flows

"How did you set this up?" is the question I get most often. Here's the honest answer: there's no dashboard with drag-and-drop. You write three kinds of files.

Extensions are TypeScript tools that agents call. Each is about 300 lines, MIT licensed:

| Extension | What agents call it for |
| --- | --- |
| verify_build | Compile the game + sim, return file:line errors |
| drive_game | Send input to the live game, sample player body |
| game_frames | Capture screenshot sequences for vision judging |
| ci_status | Check Gitea Actions pipeline state for a branch |
| ci_logs | Fetch full build log from the most recent failed run |
| ci_wait | Poll every 15 seconds until the pipeline finishes |
| gen_image | Generate brand/marketing images via fal.ai flux-2-pro |
| agent_catalog | List available agents with their tools, inputs, outputs |

Agents are Markdown files with YAML frontmatter. Each declares its role, model tier, tools, inputs, and outputs:

---
name: game-builder
description: Implements game features in C# (Godot)
model: @coding
tools: read, write, edit, bash, verify_build, drive_game
inputs: [context, plan, build_fail, behaviour_fail, feel_fail, visual_fail]
outputs: [summary, files]
---
You are a game developer. Task: ${{task}}
Context: ${{input.context}}

Flows are YAML DAGs that wire agents together. I have about 15–20 flows running across different domains:

The setup is not a product you install. It's a stack: Pi as the agent harness, custom extensions as the tool layer, markdown agents as the role layer, YAML flows as the orchestration layer. The whole thing lives in .pi/flows/. Version-controlled. CI-tested. Slash-command invoked.

The Recipe vs. The Technique

"Do you define the process with these trees, or do the agents freestyle?" Both. The recipe says what to make and in what order. The technique is how each cook executes their station.

The Recipe (Rigid)

The flow YAML is the recipe. It says: first the prep cook dices onions, then the saucier makes the base, then the grill cook sears the protein. After every station, the plate hits the pass for inspection. This order is not negotiable. A cook cannot skip the inspection because they feel confident. The inspection runs. Period.

The Technique (Autonomous)

Inside their station, a cook has full agency. How they dice the onions — brunoise or rough chop — is their call. Which pan they use, how they adjust the heat, whether they taste midway. The game-builder decides which files to read, which approach to take. Nobody tells it "edit line 247." It figures that out with grep, find, and reading code.

This balance is everything. Too much recipe → agents can't handle surprises. Too much freestyle → agents hallucinate, skip checks, ship broken code. The recipe guarantees the right things happen in the right order — preflight before build, build before test, test before ship. The technique handles the messy, unpredictable reality of actual code.

The Meta-Kitchen

And when a recipe is wrong? Another flow improves it. A meta-flow reads performance data, spots bottlenecks — "the feel gate keeps failing because the cook doesn't know the jump velocity threshold" — edits the YAML to wire that threshold into the builder's inputs, and commits the change. Flows that edit flows. The kitchen that renovates itself between services.


Picking the Right Knife: Model Strategy

You don't use a paring knife to butcher a cow. You don't use a cleaver to supreme an orange. Different work needs different blades. Flows use role-based model tiers — each agent declares the blade it needs, and the engine hands it the right one at dispatch time.

| Tier | The Knife | What It Cuts |
| --- | --- | --- |
| @coding | DeepSeek V4 | Chef's knife: your workhorse. Reads 800-line files, writes 200-line diffs. Game-builder, fixer, test-author. Free. |
| @planning | DeepSeek V4 | Boning knife: precision decomposition. Breaks tasks into steps, designs DAGs. Flow architect, feature planner. |
| @fast | DeepSeek V4 Flash | Paring knife: quick, decisive cuts. Gate pass/fail, fork choices, loop exits. No overthinking. |
| @research | DeepSeek V4 | Fillet knife: flexible, follows contours. Reads codebase, traces patterns, finds what matters. |
| @vision | Gemini 2.5 Flash | The inspector's eyes: the only knife that sees. Multimodal frame judging: T-poses, foot-slide, frozen anims. |
| @compact | DeepSeek V4 Flash | Kitchen shears: lightweight, versatile. Summaries, verdicts, post-processing. Fast and cheap. |

Why DeepSeek?

Two reasons. It's free — no usage limits, which matters when your game-builder reads 800-line files and writes 200-line diffs ten times a session. It's genuinely good at C# and Godot — I've had it write a full lighting module for our Godot fork by reading Unity API docs and adapting patterns. No agent had pulled that off before. DeepSeek can't do multimodal, so vision goes to Gemini — but for everything else, it's the chef's knife you reach for 90% of the time.

The point of the knife rack: you configure this once. Every agent declares model: @coding and gets DeepSeek V4 automatically. Swap models globally without touching any flow or agent file. The right blade, every time, no thinking required.
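Configure-once tiers amount to a single lookup table consulted at dispatch time. The sketch below uses the tier names and models from the table above; the resolveModel function, the MODEL_TIERS constant, and the rule that a non-tier name passes through unchanged are assumptions for illustration.

```typescript
// One place to map role tiers to models; swap a value here and every agent
// declaring that tier follows.
const MODEL_TIERS: Record<string, string> = {
  "@coding": "DeepSeek V4",
  "@planning": "DeepSeek V4",
  "@fast": "DeepSeek V4 Flash",
  "@research": "DeepSeek V4",
  "@vision": "Gemini 2.5 Flash",
  "@compact": "DeepSeek V4 Flash",
};

// Resolve an agent's declared `model:` value at dispatch time.
function resolveModel(
  declared: string,
  tiers: Record<string, string> = MODEL_TIERS,
): string {
  // Assumption: anything not starting with "@" is a literal model name.
  if (!declared.startsWith("@")) return declared;
  const model = tiers[declared];
  if (model === undefined) {
    throw new Error(`unknown model tier ${declared}`);
  }
  return model;
}
```

Because agents only ever name a tier, changing the blade for every coding agent is a one-line edit to the table, with no flow or agent file touched.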


The oracle tools — verify_build, drive_game, game_frames — are the durable assets. About 300 lines of TypeScript each, MIT licensed, reusable in any Pi project. The flow engine composes them; the agents route through them.

A year ago we had a supervisor written in 1,050 lines of hardcoded TypeScript that did one thing: verify agent output compiled and passed tests. We deleted it. The same verification now runs as a composable flow with five gates, live-game testing, and CI integration. Sometimes the best architecture decision is knowing what to delete.

The flow-native brain runs on our Pi fork inside Tinqs Studio.