refactor: hand-authored pi-flow post + consolidated agents.md

- pi-flow-native-brain is now hand-authored HTML (direct SVGs + styled
  tables; no build.js passthrough needed)
- Card hardcoded in _index_template.html above {{CARDS}}
- Removed posts/pi-flow-native-brain.md (build.js no longer touches it)
- New agents.md: consolidated agent guide for the blog repo
  (blog architecture, build pipeline, adding posts, styling rules,
  writing guide, deploy instructions, skills reference)
- Removed old skills/blog.md (content migrated to agents.md)
2026-06-03 02:47:06 +01:00
parent d39e9b9534
commit bf42a76cf9
6 changed files with 247 additions and 470 deletions
+7
@@ -120,6 +120,13 @@
<!-- BLOG LIST -->
<div class="blog-list">
<!-- hand-authored HTML posts (not from build.js) -->
<a href="pi-flow-native-brain" class="blog-card">
<span class="blog-card__date">3 June 2026</span>
<h2 class="blog-card__title">How Pi Agents Build, Test, and Ship Game Code with Oracle-Backed Flows</h2>
<p class="blog-card__excerpt">When we ask Pi to build a game feature, it doesn't just write code. It compiles, runs tests, drives the live game, measures feel, fixes CI failures, and ships a green PR — all through composable oracle-backed flows.</p>
<span class="blog-card__read">Read &rarr;</span>
</a>
{{CARDS}}
</div>
+164
@@ -0,0 +1,164 @@
# agents.md — Tinqs Blog Agent Guide
This file teaches AI agents (Pi, Cursor, Claude Code) how to work with the Tinqs blog repo. Read it before making any changes.
## Blog architecture
```
tinqs-ltd/blog/
├── _template.html # Post shell — wraps a single blog post
├── _index_template.html # Listing shell — blog index page
├── build.js # Zero-dep Node script: posts/*.md + templates → *.html
├── posts/ # Markdown posts with YAML frontmatter
│ ├── agent-harness.md
│ ├── agentic-workflow.md
│ └── ...
├── *.html # Generated output (never hand-edit regular posts)
├── pi-flow-native-brain.html # Hand-authored HTML post (SVGs + tables)
├── agents.md # This file
└── skills/ # Reusable skill playbooks
```
### Two kinds of posts
1. **Regular posts** — Markdown in `posts/*.md`, built via `node build.js` into `*.html`. Always edit the `.md`, never the `.html`.
2. **Hand-authored HTML posts**: `pi-flow-native-brain.html` is the only one. It contains inline SVGs and styled tables that build.js can't emit. Edit the HTML directly and never create a `.md` for it. Cards for hand-authored posts are hardcoded in `_index_template.html`.
### Build pipeline
```bash
node build.js # reads posts/*.md → generates *.html + index.html
```
`build.js` has zero npm dependencies — pure Node.js built-ins. It handles:
- YAML frontmatter parsing
- Minimal markdown → HTML conversion (headings, bold/italic, inline code, fenced code blocks, lists, figures, links, hr)
- `<!--raw-->` / `<!--/raw-->` blocks for raw HTML passthrough
- Lead paragraph separation (first paragraph after frontmatter → `.post__lead`)
- Date formatting
- Index page generation (newest-first sorted cards)
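The frontmatter and lead-paragraph steps above can be sketched roughly as follows. This is a minimal illustration, not the actual `build.js` code: the `parsePost` function and `ParsedPost` type are hypothetical names, and the sketch assumes flat `key: "value"` frontmatter like the example in this guide.

```typescript
// Hypothetical sketch only; build.js may implement these steps differently.
interface ParsedPost {
  meta: Record<string, string>; // frontmatter keys and values
  lead: string;                 // first paragraph after the frontmatter
  body: string;                 // everything after the first blank line
}

function parsePost(source: string): ParsedPost {
  // Frontmatter is the block between the opening and closing `---` lines.
  const match = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?([\s\S]*)$/.exec(source);
  if (!match) {
    throw new Error("missing YAML frontmatter");
  }

  const meta: Record<string, string> = {};
  for (const line of match[1].split(/\r?\n/)) {
    const sep = line.indexOf(":");
    if (sep === -1) continue; // skip blank or malformed lines
    const key = line.slice(0, sep).trim();
    // Split on the first colon only, so URLs in values survive intact,
    // then strip one pair of surrounding double quotes if present.
    const value = line.slice(sep + 1).trim().replace(/^"(.*)"$/, "$1");
    meta[key] = value;
  }

  // The first paragraph (up to the first blank line) becomes the lead.
  const content = match[2].trim();
  const split = content.search(/\r?\n\s*\r?\n/);
  const lead = split === -1 ? content : content.slice(0, split).trim();
  const body = split === -1 ? "" : content.slice(split).trim();
  return { meta, lead, body };
}
```

A real YAML parser handles far more (nesting, multi-line strings, comments); the flat subset is enough for the frontmatter fields listed in this guide.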
## How to add a post
### Regular markdown post
Create `posts/<slug>.md`:
```yaml
---
title: "Post Title — with optional subtitle"
slug: url-friendly-slug
date: "2026-06-03"
description: "Full meta description for SEO (150-160 chars ideal)."
og_description: "Shorter OG/Twitter description (optional)."
og_image: "https://www.tinqs.com/img/og-cover.jpg"
excerpt: "Card text shown on the blog index page."
author: "Ozan Bozkurt"
author_initials: "OB"
author_role: "CTO & Developer, Tinqs"
---
First paragraph becomes the lead. Keep it punchy — two sentences max.
Everything after the first blank line is the post body. Use standard markdown.
```
Then:
```bash
node build.js # generates <slug>.html + rebuilds index.html
git add posts/<slug>.md <slug>.html index.html
git commit -m "post: <title>"
```
### Hand-authored HTML post
Copy `pi-flow-native-brain.html` as a template. Keep the `<style>` block and nav/footer wrapper. Key rules:
- Always add a card to `_index_template.html` so it appears on the listing page
- Never create a corresponding `.md` in `posts/`: build.js would regenerate `<slug>.html` from it and overwrite the hand-authored file
- Use the same class structure: `.post`, `.post__title`, `.post__body`, etc.
### Adding a card for a hand-authored HTML post
In `_index_template.html`, add before `{{CARDS}}`:
```html
<a href="your-slug" class="blog-card">
<span class="blog-card__date">3 June 2026</span>
<h2 class="blog-card__title">Your Title</h2>
<p class="blog-card__excerpt">Card excerpt text.</p>
<span class="blog-card__read">Read &rarr;</span>
</a>
```
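If you would rather generate the card markup than type it by hand, a small helper along these lines could produce the snippet above. It is a sketch under assumptions: `formatDate`, `escapeHtml`, `renderCard`, and the `Card` type are illustrative names that do not exist in the repo, and the output still has to be pasted into `_index_template.html` manually.

```typescript
// Hypothetical helper; nothing in the repo provides these functions.
const MONTHS = [
  "January", "February", "March", "April", "May", "June",
  "July", "August", "September", "October", "November", "December",
];

// "2026-06-03" -> "3 June 2026". Parsed by hand to avoid timezone shifts.
function formatDate(iso: string): string {
  const m = /^(\d{4})-(\d{2})-(\d{2})$/.exec(iso);
  if (!m) throw new Error(`not an ISO date: ${iso}`);
  const month = Number(m[2]);
  if (month < 1 || month > 12) throw new Error(`bad month in: ${iso}`);
  return `${Number(m[3])} ${MONTHS[month - 1]} ${m[1]}`;
}

// Escape text before placing it inside HTML elements or attributes.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

interface Card {
  slug: string;
  date: string; // ISO format, e.g. "2026-06-03"
  title: string;
  excerpt: string;
}

// Builds the same structure as the hand-written card markup.
function renderCard(card: Card): string {
  return [
    `<a href="${escapeHtml(card.slug)}" class="blog-card">`,
    `<span class="blog-card__date">${formatDate(card.date)}</span>`,
    `<h2 class="blog-card__title">${escapeHtml(card.title)}</h2>`,
    `<p class="blog-card__excerpt">${escapeHtml(card.excerpt)}</p>`,
    `<span class="blog-card__read">Read &rarr;</span>`,
    `</a>`,
  ].join("\n");
}
```

Escaping the title and excerpt matters here because the card is pasted into a template as raw HTML, so a stray `<` or `&` in a title would otherwise break the listing page.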
## Styling
### Three layers (cascade order)
1. `../style.css` — external, served by Git Studio. Nav, footer, base typography, `--c-accent: #c9935a`. Never edit.
2. `<style>` in `_template.html` — post-page overrides (inline, at end of `<head>`)
3. `<style>` in `_index_template.html` — index-page overrides
The inline `<style>` blocks come AFTER the `../style.css` link, so same-specificity rules win by cascade order. No `!important` needed.
### Adding a style rule
1. Open `_template.html` (or `_index_template.html` for listing-only styles)
2. Find the `<style>` block at end of `<head>` (marked with `/* ── Team guide aesthetic ── */`)
3. Add your rule using the existing palette:
- Amber `#c9935a` (brand anchor), gold `#f59e0b` (emphasis)
- Blue `#38bdf8` (links, pills), purple `#a855f7` (h3, hover)
- Dark `#0a0e14` (code bg), border `#2a3340`
4. `node build.js` to regenerate
5. Verify: `grep "your-selector" *.html`
### Never
- Edit `../style.css` (outside this repo)
- Hand-edit generated `*.html` (build.js clobbers them)
- Restyle `.nav`, `.footer`, or mobile menu (belongs to parent site)
- Introduce new colours without a strong reason
- Add external font loads, CDN deps, or `@import`
## Post structure (writing guide)
Good technical posts follow this pattern:
1. **Lead paragraph** — what this is about, in one or two punchy sentences
2. **The hook** — why it matters, what problem it solves
3. **How it works** — concrete examples, code, metrics
4. **What we learned** — insights, surprises, trade-offs
5. **Closing** — what's next, internal links to related posts
Voice: direct, concrete, no marketing fluff. Show numbers. Show code. Tell stories.
### SEO checklist
- Title under 60 characters
- Description 150-160 characters
- `og_image` set (falls back to `/img/og-cover.jpg`)
- Meaningful excerpt for index card
- Internal links where relevant (`[other post](other-slug)`)
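The numeric items in this checklist are easy to check mechanically. The sketch below is one possible pre-commit check, not something `build.js` is known to do; `seoWarnings` and the `Frontmatter` type are hypothetical names, and the field names mirror the frontmatter example earlier in this guide.

```typescript
// Hypothetical lint helper for the SEO checklist; not part of build.js.
interface Frontmatter {
  title: string;
  description: string;
  og_image?: string;
  excerpt?: string;
}

function seoWarnings(fm: Frontmatter): string[] {
  const warnings: string[] = [];

  // Checklist: title under 60 characters.
  if (fm.title.length >= 60) {
    warnings.push(`title is ${fm.title.length} chars (want under 60)`);
  }

  // Checklist: description between 150 and 160 characters.
  const len = fm.description.length;
  if (len < 150 || len > 160) {
    warnings.push(`description is ${len} chars (want 150-160)`);
  }

  // Checklist: og_image set, otherwise the default cover is used.
  if (!fm.og_image) {
    warnings.push("og_image not set; falls back to /img/og-cover.jpg");
  }

  // Checklist: meaningful excerpt for the index card.
  if (!fm.excerpt || fm.excerpt.trim() === "") {
    warnings.push("excerpt is empty; the index card will have no text");
  }

  return warnings;
}
```

Returning warnings rather than throwing keeps the check advisory: the description range is described above as ideal, not mandatory, so a post slightly outside it can still ship.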
### Conventions
- Slugs: kebab-case matching filename: `my-post.md` → slug `my-post`
- Dates: ISO format `2026-06-03`
- Canonical URLs: `https://www.tinqs.com/blog/<slug>`
- Em dashes: `---` in markdown renders as `&mdash;`, `--` as `&ndash;`
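The slug and dash conventions can be illustrated with a few small functions. This is a sketch under assumptions, not the code in `build.js`: `isKebabSlug`, `slugFromFilename`, and `convertDashes` are hypothetical names, and the dash converter is deliberately naive.

```typescript
// Hypothetical helpers illustrating the conventions; not taken from build.js.

// Kebab-case: lowercase letters and digits, separated by single hyphens.
function isKebabSlug(slug: string): boolean {
  return /^[a-z0-9]+(?:-[a-z0-9]+)*$/.test(slug);
}

// "my-post.md" -> "my-post"
function slugFromFilename(filename: string): string {
  return filename.replace(/\.md$/, "");
}

// Replace the longer sequence first. If `--` were handled first, every
// `---` would turn into an en dash followed by a stray hyphen.
// Caveat: a global replace like this also rewrites `--` inside code,
// raw HTML, and CSS custom properties such as `--c-accent`, so a real
// converter needs to skip those regions.
function convertDashes(text: string): string {
  return text.replace(/---/g, "&mdash;").replace(/--/g, "&ndash;");
}
```

The caveat in `convertDashes` is worth keeping in mind for inline `style` attributes: `var(--c-accent)` run through a naive converter comes out as `var(&ndash;c-accent)`, which no browser will resolve.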
## Deploy
```bash
git add -A
git commit -m "post: <description>"
git push origin main
```
Git Studio serves this repo directly. A push to main is a deploy. No build step on the server — static HTML files.
## Skills reference
The `skills/` directory contains reusable agent playbooks for game dev workflows. These are NOT blog-specific — they're for game development agents using Tinqs Studio.
- **Image Generation** (`skills/image-generation.md`) — fal.ai Flux API, 4-layer prompt pattern, model comparison
- **Concept Art Pipeline** (`skills/concept-art-pipeline.md`) — Full 2D concept art → 3D model workflow
- **Sora 2 Video** (`skills/sora2-video.md`) — Trailer clips with OpenAI Sora 2
- **Tripo 3D** (`skills/tripo-browser-workflow.md`) — Text-to-3D and image-to-3D via Tripo Studio
Skills are markdown playbooks — drop them into any agent's skills directory to teach it a workflow.
+7 -7
@@ -120,6 +120,13 @@
<!-- BLOG LIST -->
<div class="blog-list">
<!-- hand-authored HTML posts (not from build.js) -->
<a href="pi-flow-native-brain" class="blog-card">
<span class="blog-card__date">3 June 2026</span>
<h2 class="blog-card__title">How Pi Agents Build, Test, and Ship Game Code with Oracle-Backed Flows</h2>
<p class="blog-card__excerpt">When we ask Pi to build a game feature, it doesn't just write code. It compiles, runs tests, drives the live game, measures feel, fixes CI failures, and ships a green PR — all through composable oracle-backed flows.</p>
<span class="blog-card__read">Read &rarr;</span>
</a>
<a href="blog-visual-upgrade" class="blog-card">
<span class="blog-card__date">3 June 2026</span>
@@ -128,13 +135,6 @@
<span class="blog-card__read">Read &rarr;</span>
</a>
<a href="pi-flow-native-brain" class="blog-card">
<span class="blog-card__date">3 June 2026</span>
<h2 class="blog-card__title">Pi's Flow-Native Brain: Retiring the Supervisor, Teaching Agents to Fix Their Own Builds</h2>
<p class="blog-card__excerpt">Two changes made Pi genuinely autonomous: we deleted the hardcoded supervisor and replaced it with composable oracle-backed flows, and we taught agents to watch CI, read failure logs, and fix their own broken builds.</p>
<span class="blog-card__read">Read &rarr;</span>
</a>
<a href="cloud-harness" class="blog-card">
<span class="blog-card__date">26 May 2026</span>
<h2 class="blog-card__title">Building a Cloud Agent Harness with DeepSeek V4 and Pi</h2>
+69 -147
@@ -4,27 +4,27 @@
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Pi's Flow-Native Brain: Retiring the Supervisor, Teaching Agents to Fix Their Own Builds — Tinqs Blog</title>
<meta name="description" content="We deleted 1,050 lines of hardcoded supervisor logic, replaced it with oracle-backed pi-flows, and gave agents the tools to watch CI and fix their own broken builds.">
<title>How Pi Agents Build, Test, and Ship Game Code with Oracle-Backed Flows — Tinqs Blog</title>
<meta name="description" content="We use Pi flows with oracle-backed gates to make agents compile, test, drive the live game, measure feel, fix CI failures, and ship green PRs — all autonomously.">
<meta name="robots" content="index, follow">
<link rel="canonical" href="https://www.tinqs.com/blog/pi-flow-native-brain">
<meta property="og:type" content="article">
<meta property="og:url" content="https://www.tinqs.com/blog/pi-flow-native-brain">
<meta property="og:title" content="Pi's Flow-Native Brain: Retiring the Supervisor, Teaching Agents to Fix Their Own Builds">
<meta property="og:description" content="Pi's supervisor is gone — replaced by oracle-backed flows and CI-integrating agents that fix their own builds.">
<meta property="og:title" content="How Pi Agents Build, Test, and Ship Game Code with Oracle-Backed Flows">
<meta property="og:description" content="Pi flows + oracle-backed gates: agents that compile, test, drive the game, measure feel, fix CI, and ship green PRs.">
<meta property="og:image" content="https://www.tinqs.com/img/og-cover.jpg">
<meta name="twitter:card" content="summary_large_image">
<meta name="twitter:title" content="Pi's Flow-Native Brain: Retiring the Supervisor, Teaching Agents to Fix Their Own Builds">
<meta name="twitter:description" content="Pi's supervisor is gone — replaced by oracle-backed flows and CI-integrating agents that fix their own builds.">
<meta name="twitter:title" content="How Pi Agents Build, Test, and Ship Game Code with Oracle-Backed Flows">
<meta name="twitter:description" content="Pi flows + oracle-backed gates: agents that compile, test, drive the game, measure feel, fix CI, and ship green PRs.">
<meta name="twitter:image" content="https://www.tinqs.com/img/og-cover.jpg">
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "BlogPosting",
"headline": "Pi's Flow-Native Brain: Retiring the Supervisor, Teaching Agents to Fix Their Own Builds",
"headline": "How Pi Agents Build, Test, and Ship Game Code with Oracle-Backed Flows",
"datePublished": "2026-06-03",
"author": {
"@type": "Person",
@@ -35,16 +35,10 @@
"name": "Tinqs Limited",
"url": "https://www.tinqs.com"
},
"description": "We deleted 1,050 lines of hardcoded supervisor logic, replaced it with oracle-backed pi-flows, and gave agents the tools to watch CI and fix their own broken builds."
"description": "We use Pi flows with oracle-backed gates to make agents compile, test, drive the live game, measure feel, fix CI failures, and ship green PRs — all autonomously."
}
</script>
<!-- PostHog (EU) -->
<script>
!function(t,e){var o,n,p,r;e.__SV||(window.posthog=e,e._i=[],e.init=function(i,s,a){function g(t,e){var o=e.split(".");2==o.length&&(t=t[o[0]],e=o[1]),t[e]=function(){t.push([e].concat(Array.prototype.slice.call(arguments,0)))}}(p=t.createElement("script")).type="text/javascript",p.crossOrigin="anonymous",p.async=!0,p.src=s.api_host.replace(".i.posthog.com","-assets.i.posthog.com")+"/static/array.js",(r=t.getElementsByTagName("script")[0]).parentNode.insertBefore(p,r);var u=e;for(void 0!==a?u=e[a]=[]:a="posthog",u.people=u.people||[],u.toString=function(t){var e="posthog";return"posthog"!==a&&(e+="."+a),t||(e+=" (stub)"),e},u.people.toString=function(){return u.toString(1)+".people (stub)"},o="init capture register register_once register_for_session unregister unregister_for_session getFeatureFlag getFeatureFlagPayload isFeatureEnabled reloadFeatureFlags updateEarlyAccessFeatureEnrollment getEarlyAccessFeatures on onFeatureFlags onSessionId getSurveys getActiveMatchingSurveys renderSurvey canRenderSurvey getNextSurveyStep identify setPersonProperties group resetGroups setPersonPropertiesForFlags resetPersonPropertiesForFlags setGroupPropertiesForFlags resetGroupPropertiesForFlags reset get_distinct_id getGroups get_session_id get_session_replay_url alias set_config startSessionRecording stopSessionRecording sessionRecordingStarted captureException loadToolbar get_property getSessionProperty createPersonProfile opt_in_capturing opt_out_capturing has_opted_in_capturing has_opted_out_capturing clear_opt_in_out_capturing debug".split(" "),n=0;n<o.length;n++)g(u,o[n]);e._i.push([i,s,a])},e.__SV=1)}(document,window.posthog||[]);
posthog.init('phc_teG6p5oxf6poQHPThq5AGKzWQNhw4bHW9arLwWAVXm3f',{api_host:'https://eu.i.posthog.com',ui_host:'https://eu.posthog.com',person_profiles:'identified_only',defaults:'2026-01-30'})
</script>
<link rel="icon" type="image/svg+xml" href="/img/favicon.svg">
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
@@ -53,7 +47,6 @@
<style>
/* ── Team guide aesthetic: self-contained overrides ── */
/* ── Gradient title (amber → warm gold, hint of blue) ── */
.post__title {
background: linear-gradient(90deg, #c9935a, #f59e0b 40%, #38bdf8);
-webkit-background-clip: text;
@@ -62,7 +55,6 @@
font-weight: 800;
}
/* ── Date pill ── */
.post__date {
display: inline-block;
font-family: ui-monospace, 'SF Mono', 'Cascadia Code', Consolas, monospace;
@@ -76,14 +68,12 @@
margin-bottom: 16px;
}
/* ── Lead ── */
.post__lead {
color: #9aa7b4;
font-size: 1.08rem;
line-height: 1.7;
}
/* ── H2: left accent bar ── */
.post__body h2 {
font-size: 1.7rem;
margin: 54px 0 6px;
@@ -91,14 +81,12 @@
border-left: 4px solid #c9935a;
}
/* ── H3: purple secondary accent ── */
.post__body h3 {
color: #a855f7;
font-size: 1.18rem;
margin: 30px 0 4px;
}
/* ── Inline code ── */
.post__body code {
font-family: ui-monospace, 'SF Mono', 'Cascadia Code', Consolas, monospace;
font-size: 0.86em;
@@ -109,7 +97,6 @@
border: 1px solid #2a3340;
}
/* ── Code blocks (dark panel) ── */
.post__body pre {
background: #0a0e14;
border: 1px solid #2a3340;
@@ -123,7 +110,6 @@
color: #e6edf3;
}
/* Reset inline-code double-up inside pre */
.post__body pre code {
background: transparent;
padding: 0;
@@ -133,7 +119,6 @@
border-radius: 0;
}
/* ── Blockquote callout (ready for future use; build.js does not emit blockquote yet) ── */
.post__body blockquote {
background: rgba(245, 158, 11, 0.08);
border: 1px solid rgba(245, 158, 11, 0.25);
@@ -145,7 +130,6 @@
font-size: 0.94rem;
}
/* ── Links ── */
.post__body a {
color: #38bdf8;
}
@@ -154,19 +138,16 @@
color: #a855f7;
}
/* ── Strong ── */
.post__body strong {
color: #f59e0b;
}
/* ── HR ── */
.post__body hr {
border: none;
border-top: 1px solid #2a3340;
margin: 32px 0;
}
/* ── Figures ── */
.post__body figure img {
border-radius: 12px;
border: 1px solid #2a3340;
@@ -178,7 +159,6 @@
margin-top: 6px;
}
/* ── List spacing ── */
.post__body li {
margin: 4px 0;
}
@@ -186,7 +166,6 @@
</head>
<body>
<!-- NAV -->
<nav class="nav nav--scrolled" id="nav">
<a href="/" class="nav__logo" aria-label="Tinqs home">
<span class="nav__wordmark">TINQS</span>
@@ -204,7 +183,6 @@
</button>
</nav>
<!-- MOBILE MENU -->
<div class="mobile-menu" id="mobileMenu">
<a href="/#game" class="mobile-menu__link">Games</a>
<a href="/#tech" class="mobile-menu__link">Technology</a>
@@ -214,33 +192,17 @@
<a href="/press" class="mobile-menu__link">Press</a>
</div>
<!-- POST -->
<article class="post">
<a href="/blog/" class="post__back">&larr; All Posts</a>
<span class="post__date">3 June 2026</span>
<h1 class="post__title">Pi's Flow-Native Brain: Retiring the Supervisor, Teaching Agents to Fix Their Own Builds</h1>
<p class="post__lead">Two changes this week made Pi genuinely autonomous. First, we deleted 1,050 lines of hardcoded supervisor logic and replaced it with a flow-native brain — oracle-backed gates that agents compose dynamically. Second, we closed the loop: agents now watch CI, read failure logs, and fix their own broken builds until the pipeline goes green.</p>
<h1 class="post__title">How Pi Agents Build, Test, and Ship Game Code with Oracle-Backed Flows</h1>
<p class="post__lead">When we ask Pi to build a feature for Ariki — say, "add a double-jump with a cooldown indicator" — five things happen. The agent writes the code. A build gate compiles it. A test gate runs the test suite. A behaviour gate drives the live game and checks the character actually double-jumps. A feel gate measures apex height, airtime, and landing settle. And if CI disagrees with any of it, the agent reads the failure log and fixes it. None of this is magic. It's Pi flows.</p>
<div class="post__body">
<hr>
<h2>Part 1: Retiring the Supervisor</h2>
<h3>What the Supervisor Did</h3>
<p>The <code>.pi/supervisor/</code> directory was the orchestration brain Pi left to us. For every task, it ran a fixed loop:</p>
<p>1. <strong>Contract gate</strong> — skip-to-human if "done" wasn't programmatically verifiable</p>
<p>2. <strong>TDAID phase A</strong> — a test-writer agent writes RED tests, never implementation</p>
<p>3. <strong>TDAID phase B</strong> — a code-writer agent makes them green; on failure, a Reflexion follow-up retries (capped)</p>
<p>4. <strong>Verification gate</strong> — run the build, check tests, pass or fail with a report</p>
<p>It worked. It caught broken builds before they hit CI. It enforced the discipline of "define done before you start." But it had a structural problem: the loop was <strong>hardcoded</strong>. Every decision tree, every gate, every retry policy was baked into TypeScript. To change the workflow, you changed code. To add a new gate — vision QA, linting, asset validation — you added more code to the same monolithic loop.</p>
<p>The supervisor was doing what <code>pi-flows</code> was designed to do, but from the wrong side of the architecture. Flows composes agents, gates, and decision points into pipelines. The supervisor reimplemented that logic in a single file. It was fighting the framework.</p>
<h3>What Replaced It</h3>
<p>The verify-heavy brain now runs <strong>as a pi-flows flow</strong> — a pipeline of oracle-backed gates orchestrated by the flow engine, visualized in FlowDashboard, and composable by agents themselves.</p>
<p>The core pieces:</p>
<ul>
<li><strong>Oracle-backed gates.</strong> The <code>verify_build</code> tool is the canonical gate. It compiles the game and sim, runs tests, and returns a structured PASS/FAIL verdict with file:line errors. Agents route through it; the gate decides whether to proceed.</li>
<li><strong>Agent-loop-decision Reflexion.</strong> Instead of a fixed two-phase TDAID loop, agents self-reflect on build failures. The flow engine gives them the failure report; they decide whether to fix and retry or escalate.</li>
<li><strong>Role-split agents.</strong> G1 build, G2 tests, G3 behaviour (drives the live game), G4 feel (measured game-feel) and G5 visual (animation) are separate sub-agents, each with its own toolset and context, composed by the flow.</li>
</ul>
<p>The result is a pipeline that flows naturally — a plan, an implementation, then a ladder of oracle-backed gates:</p>
<h2>What Happens When You Ask Pi to Build Something</h2>
<p>The flow starts the same way every agent task does: context, then plan, then implement. That's the standard loop. What makes it interesting is what happens <em>after</em> implementation — a ladder of five gates, each run by a specialised sub-agent with its own tools and its own pass/fail authority.</p>
<figure style="margin:28px 0;">
<svg viewBox="0 0 920 350" role="img" aria-label="The verify-heavy flow: context, plan, implement, five gates, a Reflexion loop, and one judge" style="width:100%;height:auto;display:block;background:#0a0e14;border:1px solid #2a3340;border-radius:12px;font-family:'IBM Plex Sans',system-ui,sans-serif;">
<defs>
@@ -275,98 +237,53 @@
<path d="M820,150 C 908,96 716,50 556,61" fill="none" stroke="#f59e0b" stroke-width="1.8" stroke-dasharray="6 5" marker-end="url(#ahA)"/>
<text x="694" y="96" fill="#f59e0b" font-size="12.5">Reflexion · fix &amp; retry ≤ 3</text>
</svg>
<figcaption style="color:#9aa7b4;font-size:0.85rem;margin-top:8px;">A real in-game failure loops back to <em>implement</em> with the gate evidence (bounded to three tries); anything green — or skipped because no live instance is running — falls through to a single honest judge.</figcaption>
<figcaption style="color:#9aa7b4;font-size:0.85rem;margin-top:8px;">A real failure loops back to <em>implement</em> with gate evidence (bounded to three tries); anything green falls through to the judge.</figcaption>
</figure>
<p>It started as three gates — build, test, vision. Gates are cheap to add, so it grew: a feature now also passes a live-game <strong>behaviour</strong> probe and a measured <strong>feel</strong> check before the judge signs off. Critically, the flow is not fixed. Agents can add gates, reorder steps, or branch on conditions. The flow engine handles orchestration; the agents handle decisions.</p>
<h3>What We Deleted</h3>
<p>The commit removes 1,050 lines across 15 files:</p>
<h2>The Five Gates</h2>
<p>Each gate is a sub-agent with one job and one tool.</p>
<p><strong style="color:#f59e0b;">G1 — Build.</strong> Runs <code>dotnet build</code> on the game and sim. Returns PASS/FAIL with file:line errors. If the code doesn't compile, nothing proceeds.</p>
<p><strong style="color:#f59e0b;">G2 — Tests.</strong> Runs <code>dotnet test</code> and parses results. The agent reads which tests broke and fixes assertions, mocks, or test setup.</p>
<p><strong style="color:#f59e0b;">G3 — Behaviour (live game).</strong> This is the one that makes game dev different from web dev. The agent sends input to the running game — <code>{"jump": true}</code> — and samples the player body 30 times at 50ms intervals. It checks: did the character actually jump? Did the double-jump fire? Was there a cooldown? The <code>drive_game</code> tool is the ground-truth oracle for whether a movement feature works in-game, not just in tests.</p>
<p><strong style="color:#f59e0b;">G4 — Feel (measured game-feel).</strong> Behaviour checks whether it worked. Feel checks whether it felt good. The agent measures apex height, airtime, liftoff latency, rise/fall asymmetry, and landing settle. Numeric metrics with thresholds. A jump that technically works but takes 400ms to lift off fails the feel gate.</p>
<p><strong style="color:#f59e0b;">G5 — Visual.</strong> Captures frame sequences from the live game and feeds them to a vision model. Checks: is the animation playing? Is the cooldown indicator visible? Are there visual artifacts?</p>
<p>Anything green falls through to the judge. Anything red loops back to implement with the failure evidence — the agent reads what went wrong, fixes it, and re-enters the gate ladder. Three retries max, then escalation to a human.</p>
<h2>Composability: Gates Are Cheap to Add</h2>
<p>The flow started with three gates — build, test, vision. Behaviour and feel were added later, each as a one-file extension. Gates are not hardcoded. They're sub-agents declared in a flow config. Want a linting gate? Add a sub-agent with a linter tool. Want a security scan? Same pattern. Want a gate that checks asset bundle sizes haven't bloated? Write the tool, declare the sub-agent, wire it into the flow.</p>
<p>Agents themselves can extend the flow. If a sub-agent notices a pattern of failures — "the last three behaviour checks failed because the game window wasn't focused" — it can insert a pre-condition gate that checks window focus before proceeding. The flow engine handles routing; the agents handle decisions.</p>
<p>This is what makes flows fundamentally different from a script: the pipeline is not fixed at compile time. It's a graph that agents read, understand, and modify at runtime.</p>
<h2>The CI Loop: Agents That Fix Their Own Builds</h2>
<p>Gates handle pre-push verification. But what about after push? What about CI?</p>
<p>Most coding agents don't care if the code compiles on the CI runner. They write, they push, they walk away. A human discovers the broken build an hour later.</p>
<p>We closed this loop with the <code>tinqs-ci</code> extension — three tools that give agents post-push autonomy:</p>
<ul>
<li><code>runner.ts</code> (115 lines) — the main orchestration loop</li>
<li><code>supervisor.ts</code> (119 lines) — the state machine driving sessions</li>
<li><code>gates.ts</code> (75 lines) — hardcoded gate definitions</li>
<li><code>policy.ts</code> (92 lines) — retry limits and decision logic</li>
<li><code>store.ts</code> (54 lines) — session state persistence</li>
<li><code>types.ts</code> (76 lines) — type definitions for the whole system</li>
<li><code>events.ts</code> (47 lines) — inter-process event bus</li>
<li>Plus tests, examples, and documentation</li>
</ul>
<figure style="margin:24px 0;">
<svg viewBox="0 0 920 180" role="img" aria-label="Lines of code: 1,050 deleted versus about 300 kept" style="width:100%;height:auto;display:block;background:#0a0e14;border:1px solid #2a3340;border-radius:12px;font-family:'IBM Plex Sans',system-ui,sans-serif;">
<text x="40" y="34" fill="#9aa7b4" font-size="13">Net change: <tspan fill="#f59e0b" font-weight="600">750 lines</tspan>, + a composable pipeline</text>
<text x="40" y="76" fill="#f0816a" font-size="13">Deleted</text>
<rect x="150" y="58" width="730" height="30" rx="6" fill="#2a1416" stroke="#f0816a" stroke-opacity="0.6"/>
<text x="868" y="78" text-anchor="end" fill="#f3b4a8" font-size="12.5">supervisor/ — 1,050 lines · 15 files</text>
<text x="40" y="136" fill="#34d399" font-size="13">Kept</text>
<rect x="150" y="118" width="209" height="30" rx="6" fill="#0f2a22" stroke="#34d399" stroke-opacity="0.6"/>
<text x="369" y="138" fill="#9fe6c0" font-size="12.5">verify_build — ~300 lines · 1 oracle</text>
</svg>
<figcaption style="color:#9aa7b4;font-size:0.85rem;margin-top:8px;">The whole orchestration loop was deleted; only the build oracle survived — and it became the gate that powers the flow.</figcaption>
</figure>
<p>None of this was bad code. It was just the wrong layer. Flows gives us all of this — orchestration, state, gates, retry policy, event routing — as a framework primitive. We were maintaining a parallel implementation of something the framework already provided.</p>
<p>The durable asset we kept: <code>verify_build</code>, the build oracle. It's now reused as the gate tool that powers the flow pipeline.</p>
<h3>The Bug That Took a Day to Find</h3>
<p>Moving to flows exposed a subtle problem. Flow sub-agents run in their <strong>own extension stack</strong> — the main session's extensions don't reach them. The build-verifier and test-runner agents declared <code>verify_build</code> in their frontmatter, but the tool was never actually in their toolset.</p>
<p>The symptom was confusing: agents reported "oracle not available" and routed to fail/report, silently skipping the test gate entirely. A false green — the build passed, tests never ran, and the pipeline reported success.</p>
<p>The fix was a single pattern: emit <code>flow:register-tool</code> with the full tool definition at extension activation, and re-announce on <code>flow:rediscover</code>. The flow engine collects these into <code>getExtensionTools()</code> and hands them to every sub-agent that declares the tool. Three lines of orchestration, a day of debugging.</p>
<p>Verified live: <code>game-check</code> now routes <code>context → build → build-gate(pass) → tests → tests-gate(pass) → vision</code>. Every gate fires. No false greens.</p>
<h3>Why This Architecture Wins</h3>
<p><strong>Composability.</strong> Agents can add gates without touching framework code. Want a linting gate? Add a sub-agent with a linter tool. Want a security scan? Same pattern. The flow engine handles routing; you just declare the gate.</p>
<p><strong>Reusability.</strong> The <code>verify_build</code> oracle that powered the old supervisor now powers the flow gates. Same tool, same interface, different orchestration. No rewrite needed.</p>
<p><strong>Observability.</strong> FlowDashboard visualizes the entire pipeline. You can see which gates passed, which failed, and where the agent decided to retry. The old supervisor logged to stdout.</p>
<p><strong>Self-modification.</strong> Agents can read the flow graph, understand where they are in the pipeline, and decide what to do next. The supervisor's decision tree was opaque to the agents it was supervising. Flows makes the pipeline itself part of the agent's context.</p>
<hr>
<h2>Part 2: Agents That Fix Their Own Builds</h2>
<p>Most coding agents have a dirty secret: they don't care if the code compiles. They write, they push, they walk away. The human discovers the broken build an hour later. The flow-native brain handles verification inside the pipeline — but what about after push? What about CI?</p>
<h3>The Gap</h3>
<p>Every agent demo looks the same. The AI writes code, commits, pushes. The presenter says "and now we have a pull request!" Cut. End of demo.</p>
<p>What happens next? The CI pipeline runs. Tests fail. Linting screams. The build breaks because someone forgot an import. A human opens the PR, reads the red badge, clicks into the logs, finds the error, fixes it, pushes again. The agent did 90% of the work but left the last 10% — the most tedious part — for a person.</p>
<p>We wanted agents that finish the job.</p>
<h3>The tinqs-ci Extension</h3>
<p>Our <a href="https://tinqs.com/tinqs/pi" style="color: var(&ndash;c-accent-l);">Pi fork</a> has a <code>tinqs-ci</code> extension — a single TypeScript file, about 200 lines — that gives the agent three tools:</p>
<ul>
<li><strong>ci_status</strong> — checks the current pipeline state for a branch (pending, running, success, failure)</li>
<li><strong>ci_status</strong> — checks pipeline state for a branch</li>
<li><strong>ci_logs</strong> — fetches the full build log from the most recent failed run</li>
<li><strong>ci_wait</strong> — polls the pipeline every 15 seconds until it finishes, then returns the result</li>
<li><strong>ci_wait</strong> — polls every 15 seconds until the pipeline finishes</li>
</ul>
<p>These are Gitea Actions API calls under the hood. The agent authenticates with the same PAT it uses for git push. No extra credentials, no special CI service account.</p>
<h3>The Loop</h3>
<p>Here's what a Pi task looks like end to end:</p>
<pre><code>Agent receives task brief
→ reads codebase, plans approach
→ writes code
→ runs local tests (bash tool)
→ commits and pushes branch
→ calls ci_wait
→ CI passes → opens PR via Gitea API
→ CI fails → calls ci_logs
→ reads error output
→ fixes the issue
→ pushes again
→ calls ci_wait again
→ repeats until green (max 3 retries)</code></pre>
<p>The key is that <code>ci_logs</code> returns the raw build output — compiler errors, test failures, lint violations — as plain text in the agent's context. DeepSeek V4 is surprisingly good at reading build logs. It parses a Go compiler error, identifies the file and line, and fixes it. It reads a test assertion failure, understands what the test expected, and corrects the implementation.</p>
<p>Three retries is the hard limit. If the agent can't fix it in three rounds, it opens the PR anyway with a comment explaining what failed and why. A human takes over from there. In practice, most failures resolve on the first retry — it's usually a missing import or a type mismatch.</p>
<h3>A Real Run</h3>
<p>Last week. The task: add a health check endpoint to a Go service.</p>
<ul>
<li><strong>Turn 1:</strong> Agent reads the codebase, writes the handler and test, pushes. CI fails — the test imports a package that doesn't exist on the runner.</li>
<li><strong>Turn 2:</strong> Agent reads <code>ci_logs</code>, sees the <code>go: module not found</code> error, adds the missing <code>go.mod</code> replace directive, pushes. CI passes.</li>
<li><strong>Turn 3:</strong> Agent opens PR with passing checks.</li>
</ul>
<p>Total time: 4 minutes. Total cost: $0.06. No human touched the keyboard.</p>
<p>Without the CI extension, this would have been a PR with a red badge and a Slack message saying "hey, the agent's PR is broken again." Someone would have context-switched, opened the logs, seen the trivial error, fixed it, and lost 20 minutes of flow state.</p>
<h3>Why This Matters More Than You Think</h3>
<p>CI integration isn't a feature. It's the difference between an agent that helps and an agent that creates work.</p>
<p>An agent that pushes broken code is worse than no agent at all. It creates a false sense of progress — "the PR is up!" — while actually adding a task to someone's plate. Every broken PR is an interruption. Every interruption costs 15 minutes of context-switching.</p>
<p>An agent that watches CI and fixes its own builds is genuinely autonomous. You submit a task, you walk away, you come back to a green PR ready for review. The agent handled the mechanical iteration that a human would have done anyway — the fix-push-wait-check cycle that eats hours of developer time every week.</p>
<h3>The Guardrail Problem</h3>
<p>Letting an agent retry its own builds sounds dangerous. What if it enters an infinite loop? What if it starts making increasingly wild changes to get the build to pass?</p>
<p>Three safeguards:</p>
<p><strong>Retry limit.</strong> Three attempts maximum. After that, the agent stops and reports. This is a hard limit in the orchestrator, not a suggestion to the model.</p>
<p><strong>Diff budget.</strong> Each retry can only touch files that were already in the original changeset. The agent can't "fix" a build failure by rewriting the test suite or disabling the linter. If the fix requires touching new files, it fails and escalates.</p>
<p><strong>Hallucination detection.</strong> The guardrail extension monitors every turn. If the agent claims "the build passed" without having called <code>ci_status</code> or <code>ci_wait</code>, it gets corrected. Agents are not allowed to guess the CI result.</p>
<h3>The Numbers</h3>
<p>The agent pushes its branch, calls <code>ci_wait</code>, and if CI fails, reads <code>ci_logs</code>, fixes the issue, pushes again, and polls again. DeepSeek V4 parses compiler errors, identifies files and lines, and fixes them. A missing import, a type mismatch, a module not found — pattern-matched and corrected in seconds.</p>
<p>A real example from last week: adding a health check endpoint to a Go service. Agent wrote the handler and test, pushed. CI failed — the test imported a package that didn't exist on the runner. Agent read <code>ci_logs</code>, saw <code>go: module not found</code>, added the missing <code>go.mod</code> replace directive, pushed again. CI passed. PR opened. <strong>4 minutes. $0.06.</strong></p>
<p>Three safeguards prevent runaway loops: <strong>retry limit</strong> (3, hard-coded in the orchestrator), <strong>diff budget</strong> (retries only touch files already in the changeset), and <strong>hallucination detection</strong> (if the agent claims CI passed without calling <code>ci_status</code>, it gets corrected).</p>
<h2>The Numbers</h2>
<p>Over three weeks of running the orchestrator:</p>
<ul>
<li><strong>87 tasks</strong> completed end-to-end</li>
<li><strong>23 tasks</strong> needed at least one CI retry (26%)</li>
@@ -374,10 +291,11 @@
<li><strong>4 tasks</strong> hit the retry limit and escalated to a human</li>
<li><strong>0 tasks</strong> produced a merged PR that later broke something else</li>
</ul>
<p>The 26% retry rate tells you how often agents push code that doesn't build on the first try. That's not a bad number — it's the same rate you'd see from a junior developer. The difference is the agent fixes it in 30 seconds instead of 20 minutes.</p>
<hr>
<h2>Putting It Together: The Stack</h2>
<p>The flow-native brain and the CI integrator are two sides of the same coin. The flow handles <strong>pre-push verification</strong> — did the code compile? do the tests pass? does the game behave correctly? The CI integrator handles <strong>post-push verification</strong> — did the CI pipeline agree? did anything break on the runner that didn't break locally?</p>
<p>The 26% retry rate matches what you'd see from a junior developer. The difference: the agent fixes it in 30 seconds.</p>
<h2>The Architecture</h2>
<table style="width:100%;border-collapse:collapse;margin:18px 0;font-size:0.92rem;">
<thead>
<tr style="text-align:left;border-bottom:1px solid #2a3340;">
@@ -388,16 +306,21 @@
</thead>
<tbody>
<tr style="border-bottom:1px solid #1c2230;"><td style="padding:9px 12px;color:#e6edf3;vertical-align:top;"><strong style="color:#f59e0b;">Flow engine</strong></td><td style="padding:9px 12px;color:#cdd7e2;vertical-align:top;">pi-flows orchestrator</td><td style="padding:9px 12px;color:#9aa7b4;vertical-align:top;">Composes agents, gates and decision points</td></tr>
<tr style="border-bottom:1px solid #1c2230;"><td style="padding:9px 12px;color:#e6edf3;vertical-align:top;"><strong style="color:#f59e0b;">Gates</strong></td><td style="padding:9px 12px;color:#cdd7e2;vertical-align:top;">verify_build oracle</td><td style="padding:9px 12px;color:#9aa7b4;vertical-align:top;">Compiles, tests, returns PASS/FAIL with file:line errors</td></tr>
<tr style="border-bottom:1px solid #1c2230;"><td style="padding:9px 12px;color:#e6edf3;vertical-align:top;"><strong style="color:#f59e0b;">Oracle gates</strong></td><td style="padding:9px 12px;color:#cdd7e2;vertical-align:top;">verify_build, drive_game, game_frames</td><td style="padding:9px 12px;color:#9aa7b4;vertical-align:top;">Return structured PASS/FAIL with evidence</td></tr>
<tr style="border-bottom:1px solid #1c2230;"><td style="padding:9px 12px;color:#e6edf3;vertical-align:top;"><strong style="color:#f59e0b;">Sub-agents</strong></td><td style="padding:9px 12px;color:#cdd7e2;vertical-align:top;">G1 build · G2 tests · G3 behaviour · G4 feel · G5 visual</td><td style="padding:9px 12px;color:#9aa7b4;vertical-align:top;">Role-split, each with its own toolset</td></tr>
<tr style="border-bottom:1px solid #1c2230;"><td style="padding:9px 12px;color:#e6edf3;vertical-align:top;"><strong style="color:#f59e0b;">CI loop</strong></td><td style="padding:9px 12px;color:#cdd7e2;vertical-align:top;">tinqs-ci extension</td><td style="padding:9px 12px;color:#9aa7b4;vertical-align:top;">ci_status, ci_logs, ci_wait — polls Gitea Actions, reads logs, retries</td></tr>
<tr style="border-bottom:1px solid #1c2230;"><td style="padding:9px 12px;color:#e6edf3;vertical-align:top;"><strong style="color:#f59e0b;">Decision</strong></td><td style="padding:9px 12px;color:#cdd7e2;vertical-align:top;">Agent-loop Reflexion</td><td style="padding:9px 12px;color:#9aa7b4;vertical-align:top;">Self-reflect on failures, retry (≤3) or escalate</td></tr>
<tr><td style="padding:9px 12px;color:#e6edf3;vertical-align:top;"><strong style="color:#f59e0b;">Visualization</strong></td><td style="padding:9px 12px;color:#cdd7e2;vertical-align:top;">FlowDashboard</td><td style="padding:9px 12px;color:#9aa7b4;vertical-align:top;">Real-time pipeline state</td></tr>
</tbody>
</table>
<hr>
<p>The old supervisor was 1,050 lines of code that did one thing well: verify that agent output compiled and passed tests. The new system does the same thing with less code, more flexibility, composable gates, live CI integration, and a bug we'll never hit again. Sometimes the best commit is a deletion. Sometimes it's two.</p>
<p><em>The flow-native brain and CI extension run on our <a href="https://tinqs.com/tinqs/pi" style="color: var(--c-accent-l);">Pi fork</a> inside <a href="https://tinqs.com" style="color: var(--c-accent-l);">Tinqs Studio</a>. The verify_build extension is ~300 lines of TypeScript, the tinqs-ci extension is ~200 lines — both MIT licensed and reusable in any Pi project.</em></p>
<p>The oracle tools — <code>verify_build</code>, <code>drive_game</code>, <code>game_frames</code> — are the durable assets. About 300 lines of TypeScript each, MIT licensed, reusable in any Pi project. The flow engine composes them; the agents route through them.</p>
<p>A year ago we had a supervisor written in 1,050 lines of hardcoded TypeScript that did one thing: verify agent output compiled and passed tests. We deleted it. The same verification now runs as a composable flow with five gates, live-game testing, and CI integration. Sometimes the best architecture decision is knowing what to delete.</p>
<p><em>The flow-native brain runs on our <a href="https://tinqs.com/tinqs/pi">Pi fork</a> inside <a href="https://tinqs.com">Tinqs Studio</a>. The oracle extensions are MIT licensed and reusable in any Pi project.</em></p>
</div>
@@ -410,7 +333,6 @@
</div>
</article>
<!-- FOOTER -->
<footer class="footer">
<div class="footer__inner">
<span class="footer__wordmark">TINQS</span>
-259
@@ -1,259 +0,0 @@
---
title: "Pi's Flow-Native Brain: Retiring the Supervisor, Teaching Agents to Fix Their Own Builds"
slug: pi-flow-native-brain
date: "2026-06-03"
description: "We deleted 1,050 lines of hardcoded supervisor logic, replaced it with oracle-backed pi-flows, and gave agents the tools to watch CI and fix their own broken builds."
og_description: "Pi's supervisor is gone — replaced by oracle-backed flows and CI-integrating agents that fix their own builds."
og_image: "https://www.tinqs.com/img/og-cover.jpg"
excerpt: "Two changes made Pi genuinely autonomous: we deleted the hardcoded supervisor and replaced it with composable oracle-backed flows, and we taught agents to watch CI, read failure logs, and fix their own broken builds."
author: "Ozan Bozkurt"
author_initials: "OB"
author_role: "CTO & Developer, Tinqs"
---
Two changes this week made Pi genuinely autonomous. First, we deleted 1,050 lines of hardcoded supervisor logic and replaced it with a flow-native brain — oracle-backed gates that agents compose dynamically. Second, we closed the loop: agents now watch CI, read failure logs, and fix their own broken builds until the pipeline goes green.
---
## Part 1: Retiring the Supervisor
### What the Supervisor Did
The `.pi/supervisor/` directory was the orchestration brain Pi left to us. For every task, it ran a fixed loop:
1. **Contract gate** — skip-to-human if "done" wasn't programmatically verifiable
2. **TDAID phase A** — a test-writer agent writes RED tests, never implementation
3. **TDAID phase B** — a code-writer agent makes them green; on failure, a Reflexion follow-up retries (capped)
4. **Verification gate** — run the build, check tests, pass or fail with a report
It worked. It caught broken builds before they hit CI. It enforced the discipline of "define done before you start." But it had a structural problem: the loop was **hardcoded**. Every decision tree, every gate, every retry policy was baked into TypeScript. To change the workflow, you changed code. To add a new gate — vision QA, linting, asset validation — you added more code to the same monolithic loop.
The supervisor was doing what `pi-flows` was designed to do, but from the wrong side of the architecture. Flows composes agents, gates, and decision points into pipelines. The supervisor reimplemented that logic in a single file. It was fighting the framework.
### What Replaced It
The verify-heavy brain now runs **as a pi-flows flow** — a pipeline of oracle-backed gates orchestrated by the flow engine, visualized in FlowDashboard, and composable by agents themselves.
The core pieces:
- **Oracle-backed gates.** The `verify_build` tool is the canonical gate. It compiles the game and sim, runs tests, and returns a structured PASS/FAIL verdict with file:line errors. Agents route through it; the gate decides whether to proceed.
- **Agent-loop-decision Reflexion.** Instead of a fixed two-phase TDAID loop, agents self-reflect on build failures. The flow engine gives them the failure report; they decide whether to fix and retry or escalate.
- **Role-split agents.** G1 build, G2 tests, G3 behaviour (drives the live game), G4 feel (measured game-feel) and G5 visual (animation) are separate sub-agents, each with its own toolset and context, composed by the flow.
The result is a pipeline that flows naturally — a plan, an implementation, then a ladder of oracle-backed gates:
<!--raw-->
<figure style="margin:28px 0;">
<svg viewBox="0 0 920 350" role="img" aria-label="The verify-heavy flow: context, plan, implement, five gates, a Reflexion loop, and one judge" style="width:100%;height:auto;display:block;background:#0a0e14;border:1px solid #2a3340;border-radius:12px;font-family:'IBM Plex Sans',system-ui,sans-serif;">
<defs>
<marker id="ah" markerWidth="10" markerHeight="10" refX="7" refY="3.2" orient="auto"><path d="M0,0 L7,3.2 L0,6.4 Z" fill="#5b6b7d"/></marker>
<marker id="ahA" markerWidth="10" markerHeight="10" refX="7" refY="3.2" orient="auto"><path d="M0,0 L7,3.2 L0,6.4 Z" fill="#f59e0b"/></marker>
</defs>
<rect x="40" y="40" width="140" height="46" rx="9" fill="#121821" stroke="#2a3340"/>
<text x="110" y="68" text-anchor="middle" fill="#cdd7e2" font-size="15">Context</text>
<rect x="210" y="40" width="140" height="46" rx="9" fill="#121821" stroke="#2a3340"/>
<text x="280" y="68" text-anchor="middle" fill="#cdd7e2" font-size="15">Plan</text>
<rect x="400" y="40" width="150" height="46" rx="9" fill="#15202e" stroke="#3a4656"/>
<text x="475" y="68" text-anchor="middle" fill="#e6edf3" font-size="15">Implement</text>
<line x1="180" y1="63" x2="206" y2="63" stroke="#5b6b7d" stroke-width="1.6" marker-end="url(#ah)"/>
<line x1="350" y1="63" x2="396" y2="63" stroke="#5b6b7d" stroke-width="1.6" marker-end="url(#ah)"/>
<rect x="40" y="150" width="840" height="82" rx="12" fill="#0c1119" stroke="#2a3340"/>
<text x="56" y="171" fill="#6b7a8d" font-size="11" letter-spacing="1.4">VERIFY-HEAVY GATES — most compute is spent checking, not writing</text>
<rect x="56" y="180" width="148" height="42" rx="8" fill="#10141c" stroke="#38bdf8" stroke-opacity="0.55"/>
<text x="130" y="206" text-anchor="middle" fill="#38bdf8" font-size="13.5">G1 · Build</text>
<rect x="222" y="180" width="148" height="42" rx="8" fill="#10141c" stroke="#34d399" stroke-opacity="0.55"/>
<text x="296" y="206" text-anchor="middle" fill="#9fe6c0" font-size="13.5">G2 · Tests</text>
<rect x="388" y="180" width="148" height="42" rx="8" fill="#10141c" stroke="#a855f7" stroke-opacity="0.55"/>
<text x="462" y="206" text-anchor="middle" fill="#c4a0f7" font-size="13.5">G3 · Behaviour</text>
<rect x="554" y="180" width="148" height="42" rx="8" fill="#10141c" stroke="#f59e0b" stroke-opacity="0.55"/>
<text x="628" y="206" text-anchor="middle" fill="#f5b44b" font-size="13.5">G4 · Feel</text>
<rect x="720" y="180" width="148" height="42" rx="8" fill="#10141c" stroke="#c9935a" stroke-opacity="0.55"/>
<text x="794" y="206" text-anchor="middle" fill="#d9ac7b" font-size="13.5">G5 · Visual</text>
<line x1="475" y1="86" x2="475" y2="148" stroke="#5b6b7d" stroke-width="1.6" marker-end="url(#ah)"/>
<line x1="460" y1="232" x2="460" y2="276" stroke="#5b6b7d" stroke-width="1.6" marker-end="url(#ah)"/>
<text x="472" y="258" fill="#6b7a8d" font-size="11">all green ⇒ done · any fail ⇒ report</text>
<rect x="380" y="278" width="160" height="46" rx="9" fill="#1b1505" stroke="#c9935a"/>
<text x="460" y="306" text-anchor="middle" fill="#f3d6a0" font-size="15">Judge — honest verdict</text>
<path d="M820,150 C 908,96 716,50 556,61" fill="none" stroke="#f59e0b" stroke-width="1.8" stroke-dasharray="6 5" marker-end="url(#ahA)"/>
<text x="694" y="96" fill="#f59e0b" font-size="12.5">Reflexion · fix &amp; retry ≤ 3</text>
</svg>
<figcaption style="color:#9aa7b4;font-size:0.85rem;margin-top:8px;">A real in-game failure loops back to <em>implement</em> with the gate evidence (bounded to three tries); anything green — or skipped because no live instance is running — falls through to a single honest judge.</figcaption>
</figure>
<!--/raw-->
It started as three gates — build, test, vision. Gates are cheap to add, so it grew: a feature now also passes a live-game **behaviour** probe and a measured **feel** check before the judge signs off. Critically, the flow is not fixed. Agents can add gates, reorder steps, or branch on conditions. The flow engine handles orchestration; the agents handle decisions.
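The gate ladder can be pictured as a loop over checks that each return a structured verdict. The following is only a minimal sketch under assumptions: `Verdict`, `Gate`, `LadderReport` and `runLadder` are hypothetical names chosen for illustration, not the pi-flows schema. It shows the behaviour described above: a gate may pass, fail with `file:line` evidence, or be skipped, and the ladder stops at the first failure so the evidence can be handed back.

```typescript
// Hypothetical sketch: these types and names are illustrative, not the pi-flows schema.
type Verdict =
  | { status: "PASS" }
  | { status: "FAIL"; errors: string[] } // e.g. "src/sim.ts:42: type mismatch"
  | { status: "SKIP"; reason: string }; // e.g. no live game instance is running

interface Gate {
  name: string;
  check: () => Verdict;
}

interface LadderReport {
  passed: string[];
  skipped: string[];
  failedAt?: string;
  errors: string[];
}

// Run gates in order; stop at the first FAIL and keep its evidence.
function runLadder(gates: Gate[]): LadderReport {
  const report: LadderReport = { passed: [], skipped: [], errors: [] };
  for (const gate of gates) {
    const verdict = gate.check();
    if (verdict.status === "FAIL") {
      report.failedAt = gate.name;
      report.errors = verdict.errors;
      return report;
    }
    if (verdict.status === "PASS") {
      report.passed.push(gate.name);
    } else {
      report.skipped.push(gate.name);
    }
  }
  return report;
}
```

Keeping skipped gates separate from passed ones is a design choice in this sketch: it lets a judge step tell "green" apart from "not checked".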
### What We Deleted
The commit removes 1,050 lines across 15 files:
- `runner.ts` (115 lines) — the main orchestration loop
- `supervisor.ts` (119 lines) — the state machine driving sessions
- `gates.ts` (75 lines) — hardcoded gate definitions
- `policy.ts` (92 lines) — retry limits and decision logic
- `store.ts` (54 lines) — session state persistence
- `types.ts` (76 lines) — type definitions for the whole system
- `events.ts` (47 lines) — inter-process event bus
- Plus tests, examples, and documentation
<!--raw-->
<figure style="margin:24px 0;">
<svg viewBox="0 0 920 180" role="img" aria-label="Lines of code: 1,050 deleted versus about 300 kept" style="width:100%;height:auto;display:block;background:#0a0e14;border:1px solid #2a3340;border-radius:12px;font-family:'IBM Plex Sans',system-ui,sans-serif;">
<text x="40" y="34" fill="#9aa7b4" font-size="13">Net change: <tspan fill="#f59e0b" font-weight="600">about 750 fewer lines</tspan>, plus a composable pipeline</text>
<text x="40" y="76" fill="#f0816a" font-size="13">Deleted</text>
<rect x="150" y="58" width="730" height="30" rx="6" fill="#2a1416" stroke="#f0816a" stroke-opacity="0.6"/>
<text x="868" y="78" text-anchor="end" fill="#f3b4a8" font-size="12.5">supervisor/ — 1,050 lines · 15 files</text>
<text x="40" y="136" fill="#34d399" font-size="13">Kept</text>
<rect x="150" y="118" width="209" height="30" rx="6" fill="#0f2a22" stroke="#34d399" stroke-opacity="0.6"/>
<text x="369" y="138" fill="#9fe6c0" font-size="12.5">verify_build — ~300 lines · 1 oracle</text>
</svg>
<figcaption style="color:#9aa7b4;font-size:0.85rem;margin-top:8px;">The whole orchestration loop was deleted; only the build oracle survived — and it became the gate that powers the flow.</figcaption>
</figure>
<!--/raw-->
None of this was bad code. It was just the wrong layer. Flows gives us all of this — orchestration, state, gates, retry policy, event routing — as a framework primitive. We were maintaining a parallel implementation of something the framework already provided.
The durable asset we kept: `verify_build`, the build oracle. It's now reused as the gate tool that powers the flow pipeline.
### The Bug That Took a Day to Find
Moving to flows exposed a subtle problem. Flow sub-agents run in their **own extension stack** — the main session's extensions don't reach them. The build-verifier and test-runner agents declared `verify_build` in their frontmatter, but the tool was never actually in their toolset.
The symptom was confusing: agents reported "oracle not available" and routed to fail/report, silently skipping the test gate entirely. A false green — the build passed, tests never ran, and the pipeline reported success.
The fix was a single pattern: emit `flow:register-tool` with the full tool definition at extension activation, and re-announce on `flow:rediscover`. The flow engine collects these into `getExtensionTools()` and hands them to every sub-agent that declares the tool. Three lines of orchestration, a day of debugging.
Verified live: `game-check` now routes `context → build → build-gate(pass) → tests → tests-gate(pass) → vision`. Every gate fires. No false greens.
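The register-and-rediscover pattern behind that fix can be sketched roughly as follows. This is an illustration under assumptions, not the real pi-flows API: `TinyBus`, `ToolDef`, `createRegistry` and `activate` are hypothetical stand-ins, and only the event names `flow:register-tool` and `flow:rediscover` and the `getExtensionTools` name come from the description above.

```typescript
// Hypothetical sketch: TinyBus, ToolDef, createRegistry and activate are stand-ins,
// not the real pi-flows API. Only the event names and getExtensionTools come from the post.
interface ToolDef {
  name: string;
  description: string;
}

type Handler = (payload?: ToolDef) => void;

class TinyBus {
  private handlers = new Map<string, Handler[]>();

  on(event: string, handler: Handler): void {
    const list = this.handlers.get(event) ?? [];
    list.push(handler);
    this.handlers.set(event, list);
  }

  emit(event: string, payload?: ToolDef): void {
    for (const handler of this.handlers.get(event) ?? []) {
      handler(payload);
    }
  }
}

// Engine side: collect announced tools so sub-agents that declare them can receive them.
function createRegistry(bus: TinyBus) {
  const tools = new Map<string, ToolDef>();
  bus.on("flow:register-tool", (tool) => {
    if (tool) tools.set(tool.name, tool);
  });
  return {
    getExtensionTools(declared: string[]): ToolDef[] {
      const found: ToolDef[] = [];
      for (const name of declared) {
        const tool = tools.get(name);
        if (tool) found.push(tool);
      }
      return found;
    },
  };
}

// Extension side: announce at activation, and again whenever the engine asks.
function activate(bus: TinyBus, tool: ToolDef): void {
  const announce = () => bus.emit("flow:register-tool", tool);
  bus.on("flow:rediscover", announce);
  announce();
}
```

In this sketch the re-announce matters when an extension activates before anything is listening: the first announcement is lost, and a later `flow:rediscover` gives the extension a second chance to register.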
### Why This Architecture Wins
**Composability.** Agents can add gates without touching framework code. Want a linting gate? Add a sub-agent with a linter tool. Want a security scan? Same pattern. The flow engine handles routing; you just declare the gate.
**Reusability.** The `verify_build` oracle that powered the old supervisor now powers the flow gates. Same tool, same interface, different orchestration. No rewrite needed.
**Observability.** FlowDashboard visualizes the entire pipeline. You can see which gates passed, which failed, and where the agent decided to retry. The old supervisor logged to stdout.
**Self-modification.** Agents can read the flow graph, understand where they are in the pipeline, and decide what to do next. The supervisor's decision tree was opaque to the agents it was supervising. Flows makes the pipeline itself part of the agent's context.
---
## Part 2: Agents That Fix Their Own Builds
Most coding agents have a dirty secret: they don't care if the code compiles. They write, they push, they walk away. The human discovers the broken build an hour later. The flow-native brain handles verification inside the pipeline — but what about after push? What about CI?
### The Gap
Every agent demo looks the same. The AI writes code, commits, pushes. The presenter says "and now we have a pull request!" Cut. End of demo.
What happens next? The CI pipeline runs. Tests fail. Linting screams. The build breaks because someone forgot an import. A human opens the PR, reads the red badge, clicks into the logs, finds the error, fixes it, pushes again. The agent did 90% of the work but left the last 10% — the most tedious part — for a person.
We wanted agents that finish the job.
### The tinqs-ci Extension
Our [Pi fork](https://tinqs.com/tinqs/pi) has a `tinqs-ci` extension — a single TypeScript file, about 200 lines — that gives the agent three tools:
- **ci_status** — checks the current pipeline state for a branch (pending, running, success, failure)
- **ci_logs** — fetches the full build log from the most recent failed run
- **ci_wait** — polls the pipeline every 15 seconds until it finishes, then returns the result
These are Gitea Actions API calls under the hood. The agent authenticates with the same PAT it uses for git push. No extra credentials, no special CI service account.
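As a rough picture of what a `ci_wait`-style tool does, here is a minimal polling sketch. It is not the `tinqs-ci` source: `fetchStatus`, `sleep`, `maxPolls` and the `"timeout"` result are assumptions introduced for illustration, and the real extension would make the status call against the Gitea API. Only the 15-second interval and the four pipeline states come from the description above.

```typescript
// Hypothetical sketch: the names below are illustrative, not the tinqs-ci source.
type CiState = "pending" | "running" | "success" | "failure";

interface WaitDeps {
  // Stand-in for ci_status: returns the pipeline state for a branch.
  fetchStatus: (branch: string) => Promise<CiState>;
  // Injected so the loop can be exercised without real delays.
  sleep: (ms: number) => Promise<void>;
}

// Poll until the pipeline reaches a terminal state, or give up after maxPolls checks.
async function ciWait(
  branch: string,
  deps: WaitDeps,
  intervalMs: number = 15000,
  maxPolls: number = 240,
): Promise<CiState | "timeout"> {
  for (let poll = 0; poll < maxPolls; poll++) {
    const state = await deps.fetchStatus(branch);
    if (state === "success" || state === "failure") {
      return state;
    }
    await deps.sleep(intervalMs);
  }
  return "timeout";
}
```

Injecting `fetchStatus` and `sleep` keeps the loop testable offline. The upper bound on polls is an assumption added here so that a stuck pipeline cannot block the agent forever.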
### The Loop
Here's what a Pi task looks like end to end:
```
Agent receives task brief
→ reads codebase, plans approach
→ writes code
→ runs local tests (bash tool)
→ commits and pushes branch
→ calls ci_wait
→ CI passes → opens PR via Gitea API
→ CI fails → calls ci_logs
→ reads error output
→ fixes the issue
→ pushes again
→ calls ci_wait again
→ repeats until green (max 3 retries)
```
The key is that `ci_logs` returns the raw build output — compiler errors, test failures, lint violations — as plain text in the agent's context. DeepSeek V4 is surprisingly good at reading build logs. It parses a Go compiler error, identifies the file and line, and fixes it. It reads a test assertion failure, understands what the test expected, and corrects the implementation.
Three retries is the hard limit. If the agent can't fix it in three rounds, it opens the PR anyway with a comment explaining what failed and why. A human takes over from there. In practice, most failures resolve on the first retry — it's usually a missing import or a type mismatch.
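The push, wait, fix cycle with its hard cap can be sketched as a small loop. This is a sketch under assumptions rather than Pi's orchestrator: every name in `LoopDeps`, plus `runCiLoop` and `LoopOutcome`, is hypothetical, and the agent's actual fix step is reduced to a callback.

```typescript
// Hypothetical sketch: every name here is illustrative, not Pi's orchestrator API.
type CiResult = "success" | "failure";

interface LoopDeps {
  push: () => Promise<void>; // git push of the branch
  ciWait: () => Promise<CiResult>; // stand-in for the ci_wait tool
  ciLogs: () => Promise<string>; // stand-in for the ci_logs tool
  fix: (log: string) => Promise<void>; // the agent's attempt at a fix
  openPr: (comment?: string) => Promise<void>;
}

interface LoopOutcome {
  green: boolean;
  retries: number;
}

async function runCiLoop(deps: LoopDeps, maxRetries: number = 3): Promise<LoopOutcome> {
  await deps.push();
  let retries = 0;
  let lastLog = "";
  while (true) {
    if ((await deps.ciWait()) === "success") {
      await deps.openPr();
      return { green: true, retries };
    }
    lastLog = await deps.ciLogs();
    if (retries === maxRetries) {
      break; // retry budget spent
    }
    retries++;
    await deps.fix(lastLog);
    await deps.push();
  }
  // Open the PR anyway, with the failure attached, and hand over to a human.
  await deps.openPr(`CI still failing after ${retries} retries:\n${lastLog}`);
  return { green: false, retries };
}
```

The cap lives in the loop itself rather than in a prompt, so in this sketch the model cannot talk its way into a fourth retry.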
### A Real Run
Last week. The task: add a health check endpoint to a Go service.
- **Turn 1:** Agent reads the codebase, writes the handler and test, pushes. CI fails — the test imports a package that doesn't exist on the runner.
- **Turn 2:** Agent reads `ci_logs`, sees the `go: module not found` error, adds the missing `go.mod` replace directive, pushes. CI passes.
- **Turn 3:** Agent opens PR with passing checks.
Total time: 4 minutes. Total cost: $0.06. No human touched the keyboard.
Without the CI extension, this would have been a PR with a red badge and a Slack message saying "hey, the agent's PR is broken again." Someone would have context-switched, opened the logs, seen the trivial error, fixed it, and lost 20 minutes of flow state.
### Why This Matters More Than You Think
CI integration isn't a feature. It's the difference between an agent that helps and an agent that creates work.
An agent that pushes broken code is worse than no agent at all. It creates a false sense of progress — "the PR is up!" — while actually adding a task to someone's plate. Every broken PR is an interruption. Every interruption costs 15 minutes of context-switching.
An agent that watches CI and fixes its own builds is genuinely autonomous. You submit a task, you walk away, you come back to a green PR ready for review. The agent handled the mechanical iteration that a human would have done anyway — the fix-push-wait-check cycle that eats hours of developer time every week.
### The Guardrail Problem
Letting an agent retry its own builds sounds dangerous. What if it enters an infinite loop? What if it starts making increasingly wild changes to get the build to pass?
Three safeguards:
**Retry limit.** Three attempts maximum. After that, the agent stops and reports. This is a hard limit in the orchestrator, not a suggestion to the model.
**Diff budget.** Each retry can only touch files that were already in the original changeset. The agent can't "fix" a build failure by rewriting the test suite or disabling the linter. If the fix requires touching new files, it fails and escalates.
**Hallucination detection.** The guardrail extension monitors every turn. If the agent claims "the build passed" without having called `ci_status` or `ci_wait`, it gets corrected. Agents are not allowed to guess the CI result.
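Of the three safeguards, the diff budget is the easiest to picture in code. The following is a minimal sketch assuming the orchestrator can list the files in the original changeset and the files a retry touched. `checkDiffBudget` and the file names are hypothetical, not the guardrail extension's API.

```typescript
// Hypothetical sketch: the function and file names are illustrative,
// not the guardrail extension's API.
function checkDiffBudget(
  originalFiles: readonly string[],
  retryFiles: readonly string[],
): { ok: boolean; outOfBudget: string[] } {
  const allowed = new Set(originalFiles);
  // Any file the retry touches that was not in the original changeset breaks the budget.
  const outOfBudget = retryFiles.filter((file) => !allowed.has(file));
  return { ok: outOfBudget.length === 0, outOfBudget };
}

const originalChangeset = ["health/handler.go", "health/handler_test.go", "go.mod"];
console.log(checkDiffBudget(originalChangeset, ["go.mod"]));
console.log(checkDiffBudget(originalChangeset, ["go.mod", ".golangci.yml"]));
```

The first call stays within budget. The second is rejected because it touches a linter config that was not part of the original changeset, which is the kind of "fix" this safeguard is meant to stop.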
### The Numbers
Over three weeks of running the orchestrator:
- **87 tasks** completed end-to-end
- **23 tasks** needed at least one CI retry (26%)
- **19 of those 23** resolved on the first retry
- **4 tasks** hit the retry limit and escalated to a human
- **0 tasks** produced a merged PR that later broke something else
The 26% retry rate tells you how often agents push code that doesn't build on the first try. That's not a bad number — it's the same rate you'd see from a junior developer. The difference is the agent fixes it in 30 seconds instead of 20 minutes.
---
## Putting It Together: The Stack
The flow-native brain and the CI integrator are two sides of the same coin. The flow handles **pre-push verification** — did the code compile? do the tests pass? does the game behave correctly? The CI integrator handles **post-push verification** — did the CI pipeline agree? did anything break on the runner that didn't break locally?
<!--raw-->
<table style="width:100%;border-collapse:collapse;margin:18px 0;font-size:0.92rem;">
<thead>
<tr style="text-align:left;border-bottom:1px solid #2a3340;">
<th style="padding:10px 12px;color:#c9935a;font-weight:600;">Layer</th>
<th style="padding:10px 12px;color:#c9935a;font-weight:600;">What</th>
<th style="padding:10px 12px;color:#c9935a;font-weight:600;">How</th>
</tr>
</thead>
<tbody>
<tr style="border-bottom:1px solid #1c2230;"><td style="padding:9px 12px;color:#e6edf3;vertical-align:top;"><strong style="color:#f59e0b;">Flow engine</strong></td><td style="padding:9px 12px;color:#cdd7e2;vertical-align:top;">pi-flows orchestrator</td><td style="padding:9px 12px;color:#9aa7b4;vertical-align:top;">Composes agents, gates and decision points</td></tr>
<tr style="border-bottom:1px solid #1c2230;"><td style="padding:9px 12px;color:#e6edf3;vertical-align:top;"><strong style="color:#f59e0b;">Gates</strong></td><td style="padding:9px 12px;color:#cdd7e2;vertical-align:top;">verify_build oracle</td><td style="padding:9px 12px;color:#9aa7b4;vertical-align:top;">Compiles, tests, returns PASS/FAIL with file:line errors</td></tr>
<tr style="border-bottom:1px solid #1c2230;"><td style="padding:9px 12px;color:#e6edf3;vertical-align:top;"><strong style="color:#f59e0b;">Sub-agents</strong></td><td style="padding:9px 12px;color:#cdd7e2;vertical-align:top;">G1 build · G2 tests · G3 behaviour · G4 feel · G5 visual</td><td style="padding:9px 12px;color:#9aa7b4;vertical-align:top;">Role-split, each with its own toolset</td></tr>
<tr style="border-bottom:1px solid #1c2230;"><td style="padding:9px 12px;color:#e6edf3;vertical-align:top;"><strong style="color:#f59e0b;">CI loop</strong></td><td style="padding:9px 12px;color:#cdd7e2;vertical-align:top;">tinqs-ci extension</td><td style="padding:9px 12px;color:#9aa7b4;vertical-align:top;">ci_status, ci_logs, ci_wait — polls Gitea Actions, reads logs, retries</td></tr>
<tr style="border-bottom:1px solid #1c2230;"><td style="padding:9px 12px;color:#e6edf3;vertical-align:top;"><strong style="color:#f59e0b;">Decision</strong></td><td style="padding:9px 12px;color:#cdd7e2;vertical-align:top;">Agent-loop Reflexion</td><td style="padding:9px 12px;color:#9aa7b4;vertical-align:top;">Self-reflect on failures, retry (≤3) or escalate</td></tr>
<tr><td style="padding:9px 12px;color:#e6edf3;vertical-align:top;"><strong style="color:#f59e0b;">Visualization</strong></td><td style="padding:9px 12px;color:#cdd7e2;vertical-align:top;">FlowDashboard</td><td style="padding:9px 12px;color:#9aa7b4;vertical-align:top;">Real-time pipeline state</td></tr>
</tbody>
</table>
<!--/raw-->
---
The old supervisor was 1,050 lines of code that did one thing well: verify that agent output compiled and passed tests. The new system does the same thing with less code, more flexibility, composable gates, live CI integration, and a bug we'll never hit again. Sometimes the best commit is a deletion. Sometimes it's two.
*The flow-native brain and CI extension run on our [Pi fork](https://tinqs.com/tinqs/pi) inside [Tinqs Studio](https://tinqs.com). The verify_build extension is ~300 lines of TypeScript, the tinqs-ci extension is ~200 lines — both MIT licensed and reusable in any Pi project.*
-57
@@ -1,57 +0,0 @@
# Skill: Blog Authoring
Write and publish markdown blog posts with YAML frontmatter. This skill teaches an AI agent how to create well-structured blog posts for a static site built from markdown.
## Post Format
Create a markdown file in `posts/<slug>.md` with this frontmatter:
```yaml
---
title: "Your Post Title"
slug: your-post-slug
date: "2026-05-22"
description: "Full meta description for SEO (150-160 chars ideal)."
og_description: "Shorter OG/Twitter description."
og_image: "https://your-domain.com/img/og-cover.jpg"
excerpt: "Card text shown on the blog index page."
author: "Author Name"
author_initials: "AN"
author_role: "Role, Company"
---
```
## Writing Guidelines
- **First paragraph** becomes the lead (displayed prominently below the title, separate from the body)
- **Everything after the first blank line** is the post body
- Use standard markdown: `## Headings`, `**bold**`, `*italic*`, `[links](url)`, `- lists`, fenced code blocks
- Images on their own line become `<figure>` elements with captions
- Use `---` for section breaks
- Em dashes: an inline `---` inside a sentence renders as &mdash; (a `---` on its own line is a section break)
## Structure
A good technical blog post follows this pattern:
1. **Lead paragraph** --- what this post is about, in one sentence
2. **The Problem** --- what pain point or question motivated this work
3. **The Approach** --- what you built or decided, and why
4. **Technical Details** --- how it works, with code/diagrams
5. **What We Learned** --- insights, surprises, trade-offs
6. **Closing** --- what's next, or an invitation to the reader
## SEO Checklist
- [ ] Title under 60 characters
- [ ] Description 150-160 characters
- [ ] og_image set
- [ ] Meaningful excerpt for index card
- [ ] Internal links where relevant
## Conventions
- Slugs are kebab-case, matching the filename: `my-post.md` -> slug `my-post`
- Dates are ISO format: `2026-05-22`
- Canonical URLs: `https://your-domain.com/blog/<slug>`
- Don't edit generated HTML --- edit the markdown, then rebuild