diff --git a/pi-flow-native-brain.html b/pi-flow-native-brain.html
index eb3e6cf..3353c91 100644
--- a/pi-flow-native-brain.html
+++ b/pi-flow-native-brain.html
@@ -448,14 +448,34 @@
-

A Real Flow in Action: Fixing 19 Tests After a Crash

-

This morning, a machine crash cut off a flow mid-stream. Nineteen tests were left red — contracts written, implementation half-done. The task: finish the interrupted jump & locomotion animation work and make them all green.

+

Three Kitchens, One Morning

+

This morning, I ran three flows. Each is a different kitchen, a different brigade, a different dish. Here's what actually happened — real flow logs, real verdicts, nothing staged.

-

I typed one slash command into Pi:

+
+ Flow 1 · 4 June, 18:32 +

/deep-implement — "Build the tinqs-gitea-read extension: list_org_repos, read_repo_file, list_repo_dir, search_repos." Nine steps, 14 minutes. Verdict: PASS. 31/31 vitest tests green, zero new TypeScript errors, session-level caching, path traversal protection. Every execute() body fully wired — no stubs, no placeholders. Like a saucier who doesn't just list ingredients but actually makes the sauce.

+
-
/game-feature Finish the leftover jump & locomotion animation work — make the 19 FAILING tests GREEN. They are existing RED contracts written by an earlier animation flow that a machine crash cut off mid-stream; the contracts are already written, so IMPLEMENT to satisfy them (do not rewrite the contracts).
+
+ Flow 2 · 4 June, 19:04 +

/game-feature — "Make the player jump." Build: PASS. Tests: PASS. Behaviour/Feel/Visual: NOT RUN — no live game instance was reachable. The flow didn't silently skip the visual gate. It hard-stopped and reported honestly: "FAIL — the feature has not been verified in-game." This is the kitchen saying: "The dish is cooked, but nobody tasted it. I'm not sending it out."

+
-

What happened next was fully autonomous. Here's the flow, verbatim — this is the exact YAML that runs in production:

+
+ Flow 3 · 4 June, 19:49 +

/cto-infra — "Synthesize cost, stability, and VCS research into an AWS architecture decision." Four research streams fed into one CTO agent. Output: 14 requirements mapped to specific decisions, cost-vs-stability tradeoffs resolved with dollar figures, EC2+EBS over Fargate+EFS, RDS Multi-AZ mandatory, S3+CloudFront for LFS. Like an executive chef reading four menu proposals, reconciling them into one service, and pricing every plate.

+
+ +
+ +

Dinner Rush Recovery: The Crash That Interrupted Service

+

Earlier today, a machine crash cut off a flow mid-stream — the kitchen lost power during dinner rush. Nineteen tests were left red. Contracts written, implementation half-done. Half-cooked dishes on every station.

+ +

I typed one slash command — the expediter reassembled the brigade:

+ +
/game-feature Finish the leftover jump & locomotion animation work — make the 19 FAILING tests GREEN.
+ +

What happened next: the team picked up exactly where the crash left off. Here's the recipe — the exact YAML that runs in production:

name: game-feature
 description: Build a PLAYABLE game feature and prove it in the LIVE game.
@@ -505,45 +525,58 @@ steps:
   - id: report
     agent: game-judge
-

Eighteen steps, seven custom agents, five oracle gates, and one judge. The whole thing runs as a slash command.

+

Eighteen steps, seven cooks, five inspection points, one head chef. Triggered by a single order ticket.

-

Here's what actually happened. The vision-preflight agent fired first — checked that GEMINI_API_KEY was set and that game_frames could reach the live game instance. Both passed in under a second. Without this gate, the rest of the flow would be meaningless — we'd do all the build work only to discover the vision judge can't run. So we check first.

+

Here's how the brigade actually worked. The vision-preflight agent — the chef who checks the gas is on before anyone starts cooking — verified GEMINI_API_KEY was set and game_frames could reach the live game. Both green in under a second. Without this, the whole kitchen would prep for an hour only to discover the oven doesn't work.
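
A rough sketch of what a preflight gate like this might look like in TypeScript. The function name, the result shape, and the injected `canReachGame` probe are hypothetical; only `GEMINI_API_KEY` and `game_frames` come from the flow described above.

```typescript
// Hypothetical preflight check. Only GEMINI_API_KEY and game_frames are
// names from the article; everything else is an illustrative assumption.
interface PreflightResult {
  ok: boolean;
  failures: string[];
}

function visionPreflight(
  env: { [key: string]: string | undefined },
  canReachGame: () => boolean
): PreflightResult {
  const failures: string[] = [];
  // Gate 1: the vision judge cannot run without an API key.
  if (!env["GEMINI_API_KEY"]) {
    failures.push("GEMINI_API_KEY is not set");
  }
  // Gate 2: frame capture needs a reachable live game instance.
  if (!canReachGame()) {
    failures.push("game_frames cannot reach the live game");
  }
  // Any failure means the flow should hard-stop before build work begins.
  return { ok: failures.length === 0, failures };
}
```

Passing the environment and the reachability probe in as arguments keeps the check testable without a live game.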

-

The project-context-reader ingested PlayerController.cs, PlayerAnimController.cs, PlayerAnimationLogic.cs, the test files, and the manifest. The feature-planner decomposed the 19 failures into four fix groups: (1) vegetation manifest — 146 items with broken prefabPath, (2) animation controller — crouch parameter not plumbed through, (3) jump physics — coyote time, variable height, air control all unimplemented, (4) animation tree — state machine missing entirely.

+

The project-context-reader — the commis who reads the entire recipe book — ingested PlayerController.cs, PlayerAnimController.cs, PlayerAnimationLogic.cs, the test files, the manifest. The feature-planner — the sous-chef who breaks down the order into station tasks — decomposed 19 failures into four fix groups: vegetation manifest (146 broken prefabPath items), animation controller (crouch parameter not plumbed), jump physics (coyote time, variable height, air control — all missing), and animation tree (entire state machine absent).

-

Then the game-builder agent went to work. It read the test failure messages, traced each one to the source, and started implementing. Coyote time: a 100ms grace period after IsOnFloor() becomes false. Variable jump height: scale velocity by key hold duration, 3.5 at tap, 6.5 at 300ms hold. Air control: reduce horizontal velocity by 40% when airborne. Jump phases: minimum 0.15s duration on jump_start before transitioning to airborne. Landing timer: wait full jump_land length + one frame, not length - blend. Animation tree: state machine with jump_start → jump → jump_land states, 0.1s blend transitions.

+

Then the game-builder, the line cook at the hot station, read each test failure like a dish ticket, traced it to the source, and started cooking. Coyote time: a 100ms grace period after the feet leave the ground. Variable jump height: velocity scaled by hold duration, 3.5 on a tap and 6.5 at a full 300ms hold. Air control: horizontal speed cut 40% while airborne. Jump phases: a minimum of 0.15s on jump_start before transitioning to airborne. Landing timer: wait the full jump_land length plus one frame, not length minus blend. Animation tree: jump_start → jump → jump_land states with 0.1s blends.
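
The jump rules above can be sketched as three small pure functions. This is an illustration in TypeScript, not the project's C# code: the function names are hypothetical, the linear interpolation between tap and full hold is an assumption, and the 300ms full-hold duration is taken from the earlier draft of this paragraph.

```typescript
// Illustrative constants taken from the numbers quoted in the text.
const COYOTE_TIME_S = 0.1;      // grace period after leaving the floor
const TAP_VELOCITY = 3.5;       // jump velocity on a quick tap
const FULL_HOLD_VELOCITY = 6.5; // jump velocity at a full hold
const FULL_HOLD_S = 0.3;        // hold duration that counts as "full"
const AIR_CONTROL_FACTOR = 0.6; // horizontal speed cut by 40% in the air

// Coyote time: a jump is still allowed shortly after the feet leave the ground.
function canJump(secondsSinceGrounded: number): boolean {
  return secondsSinceGrounded <= COYOTE_TIME_S;
}

// Variable jump height: interpolate linearly (an assumption) between
// the tap velocity and the full-hold velocity, clamped at both ends.
function jumpVelocity(holdSeconds: number): number {
  const t = Math.min(Math.max(holdSeconds / FULL_HOLD_S, 0), 1);
  return TAP_VELOCITY + (FULL_HOLD_VELOCITY - TAP_VELOCITY) * t;
}

// Air control: reduce horizontal speed while airborne.
function horizontalSpeed(groundSpeed: number, airborne: boolean): number {
  return airborne ? groundSpeed * AIR_CONTROL_FACTOR : groundSpeed;
}
```

Keeping these as pure functions of time and state is what lets a test suite pin each number down without running the game.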

-

The build-verifier compiled it. Test-runner ran the suite. Behavioral-prober sent {"jump": true} to the live game and sampled the player body 30 times. Feel-judge measured apex height, airtime, and liftoff latency against thresholds. Animation-vision-judge grabbed 8 frames at 100ms intervals, composed them into a grid, and had gemini-2.5-flash check for T-poses, foot-slide, frozen frames, and missing transitions.

+

Then the inspection line: build-verifier compiled. Test-runner ran the suite. Behavioral-prober sent {"jump": true} to the live game and sampled the player body 30 times. Feel-judge measured apex height, airtime, and liftoff latency against thresholds. Animation-vision-judge captured 8 frames at 100ms intervals, composed them into a grid, and had gemini-2.5-flash scan for T-poses, foot-slide, frozen frames, and missing transitions.

-

Any red gate → evidence fed back to the game-builder → fix → re-enter the gate ladder. Bounded to 3 retries per the max_iterations in the loop decision. Any green gate → falls through to the next. All green → the game-judge produces the final honest verdict.

+

Anything red → ticket back to the cook with the specific failure → fix → re-enter the line. Bounded to 3 returns. Anything green → falls through. All green → game-judge gives the final verdict.
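
The retry rule can be sketched as a bounded loop. This is a minimal TypeScript illustration, not the flow engine's code: `Gate`, `runGateLadder`, and the `fix` callback are hypothetical names, and the bound of 3 is passed in as `maxIterations`.

```typescript
// Hypothetical types for illustration; the real engine's shapes are not shown in the article.
interface Gate {
  name: string;
  check: () => boolean; // true = green, false = red
}

interface LadderResult {
  verdict: "PASS" | "FAIL";
  attempts: number;          // how many times the builder was sent back
  failedGate: string | null; // the gate still red when retries ran out
}

function runGateLadder(
  gates: Gate[],
  fix: (failedGate: string) => void,
  maxIterations: number
): LadderResult {
  let attempts = 0;
  while (true) {
    // Walk the ladder in order and stop at the first red gate.
    let failed: string | null = null;
    for (const gate of gates) {
      if (!gate.check()) {
        failed = gate.name;
        break;
      }
    }
    // All green: fall through to the judge.
    if (failed === null) {
      return { verdict: "PASS", attempts, failedGate: null };
    }
    // Retries exhausted: report the failure honestly instead of looping forever.
    if (attempts >= maxIterations) {
      return { verdict: "FAIL", attempts, failedGate: failed };
    }
    // Send the ticket back to the builder, then re-enter the ladder from the top.
    attempts++;
    fix(failed);
  }
}
```

Re-entering from the first gate after every fix is a design choice in this sketch: a fix for one gate can break an earlier one, so nothing is assumed to stay green.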

-

This isn't a demo. It's running right now, as I write this, in a Pi session on my machine. The flow is a file at .pi/flows/flows/game-feature.yaml. I trigger it with a slash command. It dispatches sub-agents, runs them through oracle gates, loops on failures, and reports a verdict. That's it.

+
+ Not a Demo +

This flow is a file at .pi/flows/flows/game-feature.yaml. I trigger it by typing /game-feature in Pi. It dispatches agents, runs gates, loops on failures, reports a verdict. There is no dashboard with drag-and-drop. There is a YAML file and a slash command. That's the whole product.

+
-

The Flow-as-Command Pattern

-

Every flow registers as a slash command. .pi/flows/flows/game-feature.yaml becomes /game-feature. Type it in Pi, describe what you want, hit enter. The flow architect dispatches the DAG, the dashboard shows agent cards with live status, and you watch it happen — or walk away and check the result later.

+
-

This is the pattern that makes flows different from scripts. Flows are not hardcoded pipelines you invoke from the terminal. They're slash commands you type in conversation. You describe what you want in natural language, the flow wires it through the agents, and the agents route through the gates. The YAML is the skeleton; the conversation is the context.

+

The Menu: Flows Are Slash Commands

+

Every flow becomes a slash command: the menu you read to the expediter. .pi/flows/flows/game-feature.yaml becomes /game-feature. You don't invoke a pipeline from a terminal. You order a dish in conversation.
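
The naming convention is simple enough to sketch in a few lines of TypeScript. The helper name `slashCommandFor` is hypothetical, and how Pi actually registers commands is not shown here; this only illustrates the file-name-to-command mapping.

```typescript
// Hypothetical helper: derive the slash command from a flow file path.
// Assumes the command is the file's base name without its .yaml/.yml extension.
function slashCommandFor(flowPath: string): string {
  const fileName = flowPath.split("/").pop() || "";
  return "/" + fileName.replace(/\.ya?ml$/, "");
}
```

Under that assumption, `slashCommandFor(".pi/flows/flows/game-feature.yaml")` yields `/game-feature`.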

-

A few flows I use daily:

+

"Add wall-running" is not a CLI flag. It's natural language. The flow reads it, wires it through the agents, routes it through the gates. The YAML is the recipe. The conversation is the context.

+ +

The menu I order from daily:

-

The slash command is the interface. The flow is the implementation. The oracle gates are the safety net.

+

The Pass: How Agents Hand Off Work

+

In a real kitchen, cooks don't shout instructions across the room. They place finished plates on the pass. The expediter reads the ticket, checks the plate, routes it to the next station or to the dining room. Nobody yells. Nobody grabs someone else's pan.

-

How Agents Communicate (It's Not Chat)

-

A common question: are the agents constantly talking to each other? The answer is no — and that's deliberate. Agents don't chat. They pass structured results through the flow engine bus.

+

Flows work the same way. Agents never talk to each other directly. When the game-builder finishes, it doesn't ping the test-runner. It calls finish({ summary: "...", artifacts: "...", files: "..." }) — placing its work on the pass. The flow engine — the expediter — records it and routes it. The next agent receives exactly the inputs wired in the YAML: ${{result.game-builder.summary}}, ${{result.game-builder.files}}.
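
One way to picture the wiring is a small template resolver. This is a sketch in TypeScript under stated assumptions: `resolveTemplate` and the `ResultStore` shape are hypothetical, and the real engine's handling of `${{result.<step>.<field>}}` may differ.

```typescript
// Hypothetical shapes: each finished step stores a flat map of string fields.
type StepResult = { [field: string]: string };
type ResultStore = { [stepId: string]: StepResult };

// Replace every ${{result.<step>.<field>}} reference with the recorded value.
// An unknown step or field is an error, so a miswired input fails loudly
// instead of reaching the next agent as an empty string.
function resolveTemplate(template: string, results: ResultStore): string {
  const pattern = /\$\{\{\s*result\.([\w-]+)\.(\w+)\s*\}\}/g;
  return template.replace(
    pattern,
    (match: string, stepId: string, field: string): string => {
      const step = results[stepId];
      if (!step || !Object.prototype.hasOwnProperty.call(step, field)) {
        throw new Error("Unwired reference: " + match);
      }
      return step[field];
    }
  );
}
```

The next agent sees only the resolved text, never the other agent's session, which is the point of the pass.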

-

Each agent runs in an isolated session with scoped tools and file access. When agent A finishes, it calls finish({ summary: "...", artifacts: "...", files: "..." }). The flow engine records the result. Agent B receives exactly what it needs via template variables — ${{result.A.summary}}, ${{result.A.artifacts}}, ${{result.A.files}} — wired through the inputs: block in the flow YAML.

+
+
+ What People Expect +

Agents chatting freely, like a project Slack channel: "Hey test-runner, I just pushed some code, can you check it? Also the jump feels off, maybe tune the velocity?"

+
+
+ What Actually Happens +

Agent A → finish({verdict: "pass", findings: ["coyote_time=100ms"]}) → engine records → Agent B receives ${{result.A.findings}} via inputs: block. No chatter. Structured handoff.

+
+
-

This is not agent-to-agent chatter. It's a publish/subscribe bus where the flow engine is the broker. Agents never directly invoke each other. They never read each other's raw output unless the flow explicitly wires it. The DAG's blockedBy edges define who waits for whom; the inputs: block defines what data flows across the edge.

+

Why? Because unstructured chatter is how hallucination cascades start. Agent A confidently states something wrong. Agent B builds on it. Agent C compounds it. Three agents later, they're collectively wrong about a file that doesn't exist, and nobody can trace where the error came from. The pass — structured result-passing with typed outputs — makes every handoff auditable, verifiable, and debuggable.
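
To make "typed outputs" concrete, here is a minimal sketch of validating a handoff before it is recorded. It is written in TypeScript and assumes a payload with only `verdict` and `findings`, as in the example above; `parseFinish` is a hypothetical name, not the engine's API.

```typescript
// Assumed payload shape, based on the finish({verdict, findings}) example in the text.
interface FinishPayload {
  verdict: "pass" | "fail";
  findings: string[];
}

// Reject anything that is not exactly the expected shape, so a gate
// never has to interpret free-form prose from an agent.
function parseFinish(raw: unknown): FinishPayload {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("finish payload must be an object");
  }
  const candidate = raw as { verdict?: unknown; findings?: unknown };
  const verdict = candidate.verdict;
  if (verdict !== "pass" && verdict !== "fail") {
    throw new Error("verdict must be 'pass' or 'fail'");
  }
  const findings = candidate.findings;
  if (!Array.isArray(findings)) {
    throw new Error("findings must be an array");
  }
  for (const finding of findings) {
    if (typeof finding !== "string") {
      throw new Error("each finding must be a string");
    }
  }
  return { verdict: verdict as "pass" | "fail", findings: findings as string[] };
}
```

A rejected payload is itself a traceable event: the audit trail shows which agent produced something malformed, and no downstream agent ever builds on it.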

-

Why not let agents talk freely? Because unstructured chatter is the fastest path to hallucination cascades. Agent A confidently states something wrong, agent B builds on it, agent C compounds it. By the time a human notices, you have three agents collectively wrong about a file that doesn't exist. Structured result-passing with typed outputs (verdict: pass, findings: ["missing import", "type mismatch"]) keeps each agent's output machine-readable and verifiable by the gates.

- -

Pi itself is designed for solo interactive work — you ask, it does, you review. The orchestration layer I wrote on top inverts that pattern. Pi becomes the agent harness; the flow engine becomes the conductor. Agents don't talk to each other. They talk to the engine. The engine talks to the gates. The gates talk to the live game. That's the architecture.

+

Pi itself is built for solo interactive work: you ask, it does, you review. The orchestration layer I wrote on top inverts that. Pi becomes the kitchen. The flow engine becomes the expediter. Agents become line cooks who place plates on the pass, never shouting across the room.

The Setup: Extensions, Agents, and 15–20 Flows

"How did you set this up?" is the question I get most often. Here's the honest answer: there's no dashboard with drag-and-drop. You write three kinds of files.

@@ -594,49 +627,58 @@ Context: ${{input.context}}

The setup is not a product you install. It's a stack: Pi as the agent harness, custom extensions as the tool layer, markdown agents as the role layer, YAML flows as the orchestration layer. The whole thing lives in .pi/flows/. Version-controlled. CI-tested. Slash-command invoked.

-

Structure vs. Freestyle: The Skeleton and the Muscle

-

"Do you define the process with these trees, or do the agents freestyle a bit?" Both — and knowing which is which is the whole game.

+

The Recipe vs. The Technique

+

"Do you define the process with these trees, or do the agents freestyle?" Both. The recipe says what to make and in what order. The technique is how each cook executes their station.

-

The skeleton is rigid. The flow YAML defines exactly which agents run, in what order, with what dependencies (blockedBy), what inputs they receive, and which gates they must pass. The DAG is not negotiable. An agent cannot decide to skip the build gate because it feels confident. The build gate runs. Period.

+
+
+ The Recipe (Rigid) +

The flow YAML is the recipe. It says: first the prep cook dices onions, then the saucier makes the base, then the grill cook sears the protein. After every station, the plate hits the pass for inspection. This order is not negotiable. A cook cannot skip the inspection because they feel confident. The inspection runs. Period.

+
+
+ The Technique (Autonomous) +

Inside their station, a cook has full agency. How they dice the onions — brunoise or rough chop — is their call. Which pan they use, how they adjust the heat, whether they taste midway. The game-builder decides which files to read, which approach to take. Nobody tells it "edit line 247." It figures that out with grep, find, and reading code.

+
+
-

The muscle is autonomous. Inside its step, an agent has full agency. The game-builder decides which files to read, which approach to take, which code to write. It discovers project structure with grep and find. It runs the test suite to understand failures. It writes the fix and verifies it compiles. No human tells it "edit line 247 of PlayerController.cs." The agent figures that out.

+

This balance is everything. Too much recipe → agents can't handle surprises. Too much freestyle → agents hallucinate, skip checks, ship broken code. The recipe guarantees the right things happen in the right order — preflight before build, build before test, test before ship. The technique handles the messy, unpredictable reality of actual code.
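
The ordering guarantee (preflight before build, build before test) is what a dependency graph gives you. As a sketch, assuming each step lists its prerequisites in a `blockedBy` array, the run order can be derived like this in TypeScript. `executionOrder` and the `Step` shape are hypothetical, and the real scheduler may run ready steps in parallel.

```typescript
// Hypothetical step shape: an id plus the ids of the steps it waits for.
interface Step {
  id: string;
  blockedBy: string[];
}

// Repeatedly pick every step whose prerequisites are all done.
// If nothing is ready while steps remain, the flow has a cycle
// or references a step that does not exist.
function executionOrder(steps: Step[]): string[] {
  const done: string[] = [];
  let remaining = steps.slice();
  while (remaining.length > 0) {
    const ready = remaining.filter((step) =>
      step.blockedBy.every((dep) => done.indexOf(dep) !== -1)
    );
    if (ready.length === 0) {
      throw new Error("Cycle or missing dependency in flow");
    }
    for (const step of ready) {
      done.push(step.id);
    }
    remaining = remaining.filter((step) => done.indexOf(step.id) === -1);
  }
  return done;
}
```

The order comes from the declared edges, not from the order steps appear in the file or from any agent's opinion, which is why a cook cannot skip the inspection.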

-

Think of it like a company: the org chart (DAG) defines reporting lines and handoff points. The people (agents) do the actual work their own way. The compliance department (gates) checks everything before it ships. The CEO (judge) signs off.

+
+ The Meta-Kitchen +

And when a recipe is wrong? Another flow improves it. A meta-flow reads performance data, spots bottlenecks — "the feel gate keeps failing because the cook doesn't know the jump velocity threshold" — edits the YAML to wire that threshold into the builder's inputs, and commits the change. Flows that edit flows. The kitchen that renovates itself between services.

+
-

This balance is why the system works at all. Too much structure → agents can't adapt to unexpected situations. Too much freestyle → agents hallucinate, skip checks, ship broken code. The skeleton guarantees the right things happen in the right order. The muscle handles the messy reality of actual code.

+
-

And when a flow's skeleton is wrong? The meta-flow improves it. It reads flow performance data, identifies bottlenecks ("the feel gate keeps failing because the game-builder doesn't know the jump velocity threshold"), edits the YAML to wire that threshold into the builder's inputs, and commits the change. Flows that improve flows. That's the endgame.

- -

Model Strategy: DeepSeek for Code, Gemini for Vision

-

"Which DeepSeek model?" The short answer: DeepSeek V4 for coding-heavy agents, DeepSeek V4 Flash for fast routing decisions. The long answer: model selection is not one-size-fits-all.

- -

Flows use role-based model tiers — each agent declares a tier (@coding, @planning, @research, @fast, @compact, @vision), and the engine resolves it to a concrete model at dispatch time. This means you can swap models globally without touching any agent or flow file.

+

Picking the Right Knife: Model Strategy

+

You don't use a paring knife to butcher a cow. You don't use a cleaver to supreme an orange. Different work needs different blades. Flows use role-based model tiers — each agent declares the blade it needs, and the engine hands it the right one at dispatch time.

- Tier | Model | Used for
- @coding | deepseek/deepseek-v4 | Reading, writing, editing code: the game-builder, fixer, test-author
- @planning | deepseek/deepseek-v4 | Flow architect, feature planner: decomposing tasks, designing DAGs
- @fast | deepseek/deepseek-v4-flash | Routing decisions: gate pass/fail, fork choices, loop exit checks
- @research | deepseek/deepseek-v4 | Codebase investigation, reading project docs, pattern analysis
- @vision | google/gemini-2.5-flash | Multimodal frame judging: T-pose detection, animation clip verification
- @compact | deepseek/deepseek-v4-flash | Summarisation, report generation, lightweight post-processing
+ Tier | The Knife | What It Cuts
+ @coding | DeepSeek V4 | Chef's knife: your workhorse. Reads 800-line files, writes 200-line diffs. Game-builder, fixer, test-author. Free.
+ @planning | DeepSeek V4 | Boning knife: precision decomposition. Breaks tasks into steps, designs DAGs. Flow architect, feature planner.
+ @fast | DeepSeek V4 Flash | Paring knife: quick, decisive cuts. Gate pass/fail, fork choices, loop exits. No overthinking.
+ @research | DeepSeek V4 | Fillet knife: flexible, follows contours. Reads codebase, traces patterns, finds what matters.
+ @vision | Gemini 2.5 Flash | The inspector's eyes: the only knife that sees. Multimodal frame judging: T-poses, foot-slide, frozen anims.
+ @compact | DeepSeek V4 Flash | Kitchen shears: lightweight, versatile. Summaries, verdicts, post-processing. Fast and cheap.
-

Why DeepSeek? Two reasons. First, it's free — the coding tier runs on DeepSeek's API with no usage limits, which matters when your game-builder agent is reading 800-line files and writing 200-line diffs ten times a session. Second, it's genuinely good at C# and Godot — I've had it write a full lighting module for our Godot fork by reading Unity API docs and adapting patterns. No agent had pulled that off before.

+
+ Why DeepSeek? +

Two reasons. It's free — no usage limits, which matters when your game-builder reads 800-line files and writes 200-line diffs ten times a session. It's genuinely good at C# and Godot — I've had it write a full lighting module for our Godot fork by reading Unity API docs and adapting patterns. No agent had pulled that off before. DeepSeek can't do multimodal, so vision goes to Gemini — but for everything else, it's the chef's knife you reach for 90% of the time.

+
-

Vision is the exception. DeepSeek can't do multimodal, so the visual gate uses Gemini 2.5 Flash. It's fast (under 2 seconds per frame grid), cheap, and catches the things that matter: T-poses, foot-slide, frozen animations, missing transitions. The vision preflight gate checks the Gemini API key is set before any build work starts — if it's missing, the entire flow hard-stops. Vision is never silently skipped.

+

The point of the knife rack: you configure this once. Every agent declares model: @coding and gets DeepSeek V4 automatically. Swap models globally without touching any flow or agent file. The right blade, every time, no thinking required.
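
A sketch of how tier resolution could work, in TypeScript. The model identifiers are the ones listed in the earlier version of the table; `resolveModel`, the `TierMap` type, and the rule that a non-tier name passes through unchanged are assumptions for illustration.

```typescript
type TierMap = { [tier: string]: string };

// Tier-to-model mapping, using the identifiers from the earlier version of the table.
const MODEL_TIERS: TierMap = {
  "@coding": "deepseek/deepseek-v4",
  "@planning": "deepseek/deepseek-v4",
  "@fast": "deepseek/deepseek-v4-flash",
  "@research": "deepseek/deepseek-v4",
  "@vision": "google/gemini-2.5-flash",
  "@compact": "deepseek/deepseek-v4-flash",
};

// Resolve an agent's declared model at dispatch time.
// Assumption: anything not starting with "@" is already a concrete model id.
function resolveModel(declared: string, tiers: TierMap = MODEL_TIERS): string {
  if (declared.charAt(0) !== "@") {
    return declared;
  }
  const model = tiers[declared];
  if (!model) {
    throw new Error("Unknown model tier: " + declared);
  }
  return model;
}
```

Because agents only ever name a tier, swapping the model behind `@coding` means editing one entry in the map and nothing else.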

-

The key insight: different work needs different brains. Code writing needs a model that understands language semantics and type systems. Vision judging needs a model that sees pixels and understands motion. Routing decisions need a model that's fast and decisive, not one that overthinks. The role-tier system means you configure this once, at the model level, and every agent that declares model: @coding gets the right brain automatically.

- -
+

The oracle tools — verify_build, drive_game, game_frames — are the durable assets. About 300 lines of TypeScript each, MIT licensed, reusable in any Pi project. The flow engine composes them; the agents route through them.