# tinqs/ci

CI toolchain for Tinqs Studio — composite Gitea Actions and a Lambda dispatcher that orchestrates ephemeral Spot runners.

**This repo must stay public.** act_runner clones action repos with go-git and sends no credentials, so an action hosted in a private repo cannot be resolved. All other tinqs repos are private.

## Architecture

```
Push → Gitea webhook → Lambda (tinqs-ci-dispatch) → EC2 Spot → act_runner → job → self-terminate
```

Runners are ephemeral: each job gets its own Spot instance, which terminates itself when the job completes. Private repo clones are authenticated via `git config url.insteadOf`, injected in the runner user-data.

### Key design decisions

- **Ephemeral Spot instances** (not Fargate, not persistent runners) — cheapest, cleanest, no state to manage.
- **`--ephemeral` on `act_runner register`** — runner exits after one job, triggering `shutdown -h now` → instance terminates. Without this, runners pile up as zombies.
- **No local action cache** — act_runner uses go-git internally which ignores `~/.gitconfig`. The `url.insteadOf` trick only works for the git binary (used by checkout action).
- **`tinqs.com`** — Gitea's ROOT_URL is `tinqs.com`. The old `git.tinqs.com` subdomain is retired.
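
As an illustration of the `url.insteadOf` rewrite mentioned above, here is a minimal Go sketch that builds the `git config` command a runner could execute at boot. The helper name `insteadOfCommand` is hypothetical, and passing the token as the URL username is an assumption about how the credential is embedded; the real user-data may do this differently.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// insteadOfCommand (hypothetical helper) returns a shell command that makes
// the git binary rewrite plain Gitea URLs into token-authenticated ones.
// Assumption: the token is sent as the HTTP basic-auth username.
func insteadOfCommand(giteaURL, token string) (string, error) {
	base, err := url.Parse(strings.TrimRight(giteaURL, "/"))
	if err != nil {
		return "", err
	}
	if base.Scheme == "" || base.Host == "" {
		return "", fmt.Errorf("gitea URL %q needs a scheme and host", giteaURL)
	}
	authed := *base // shallow copy so base stays credential-free
	authed.User = url.User(token)
	return fmt.Sprintf("git config --global url.%q.insteadOf %q",
		authed.String()+"/", base.String()+"/"), nil
}

func main() {
	cmd, err := insteadOfCommand("https://tinqs.com", "EXAMPLE_TOKEN")
	if err != nil {
		panic(err)
	}
	fmt.Println(cmd)
}
```

This only affects the git binary. Clones that act_runner performs through go-git are not rewritten, which is why this repo has to stay public.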

## Actions

| Action | What it does |
|--------|-------------|
| `tinqs/ci/checkout@v1` | Clone a repo from tinqs.com (sparse checkout, depth control, token auth) |
| `tinqs/ci/setup-go@v1` | Install Go (skips if pre-baked in AMI) |
| `tinqs/ci/setup-node@v1` | Install Node.js + pnpm (skips if pre-baked) |
| `tinqs/ci/setup-aws@v1` | Install AWS CLI + optional ECR login |

```yaml
steps:
  - uses: tinqs/ci/checkout@v1
    with:
      sparse: 'cmd/tstudio'
  - uses: tinqs/ci/setup-go@v1
  - uses: tinqs/ci/setup-aws@v1
    with:
      ecr-login: 'true'
```

## Dispatcher (Lambda)

`orchestrator/dispatch/main.go` — receives Gitea webhooks, evaluates workflow triggers (branch + path filters), launches Spot instances with the right label.

| Label | Instance | Use |
|-------|----------|-----|
| `go` | t3.small | Go builds (tstudio, proxy, docgen) |
| `docker` | t3.medium | Docker image builds (platform, bot) |
| `deploy` | t3.micro | S3 sync, ECS update |
| `node` | t3.medium | Frontend builds |
| `godot` | t3.medium | Game exports (future) |
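
The table above maps naturally onto a lookup like the one below. The map name `labelToSpot` comes from this README, but its value type and the `instanceTypeFor` helper are assumptions made for illustration; the real map may carry more than an instance type.

```go
package main

import "fmt"

// labelToSpot mirrors the table above. Assumption: values are plain
// EC2 instance type strings.
var labelToSpot = map[string]string{
	"go":     "t3.small",
	"docker": "t3.medium",
	"deploy": "t3.micro",
	"node":   "t3.medium",
	"godot":  "t3.medium",
}

// instanceTypeFor (hypothetical helper) resolves a workflow's runs-on label
// and rejects unknown labels instead of silently picking a default size.
func instanceTypeFor(label string) (string, error) {
	instanceType, ok := labelToSpot[label]
	if !ok {
		return "", fmt.Errorf("no Spot instance type for label %q", label)
	}
	return instanceType, nil
}

func main() {
	instanceType, err := instanceTypeFor("go")
	if err != nil {
		panic(err)
	}
	fmt.Println(instanceType)
}
```

Failing on an unknown label means a typo in `runs-on` surfaces in the dispatch logs rather than launching an instance of the wrong size.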

Runner user-data flow: boot → git auth config → act_runner register (ephemeral) → daemon → job → exit → shutdown → terminate.
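
A rough sketch of how that flow could be rendered into a user-data script. Everything here is illustrative: the `runnerParams` type, the template, and the `trap` that forces a shutdown on any exit are assumptions, and the `act_runner register` flags other than `--ephemeral` should be checked against the act_runner version baked into the AMI.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// runnerParams holds the per-job values injected into the script (hypothetical).
type runnerParams struct {
	GiteaURL    string
	GitAuthCmd  string // e.g. a git config url.insteadOf command
	RunnerToken string
	Label       string
}

// userDataScript follows boot -> git auth -> register -> daemon -> shutdown.
// The EXIT trap is an assumption: it powers the instance off even when an
// earlier step fails, so a broken boot cannot leave a zombie behind.
const userDataScript = `#!/bin/bash
exec >> /var/log/tinqs-ci.log 2>&1
trap 'shutdown -h now' EXIT
{{.GitAuthCmd}}
act_runner register --no-interactive --ephemeral \
  --instance {{.GiteaURL}} --token {{.RunnerToken}} --labels {{.Label}}
act_runner daemon
`

var userDataTmpl = template.Must(template.New("userdata").Parse(userDataScript))

// renderUserData fills the template for one job.
func renderUserData(p runnerParams) (string, error) {
	var sb strings.Builder
	if err := userDataTmpl.Execute(&sb, p); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	script, err := renderUserData(runnerParams{
		GiteaURL:    "https://tinqs.com",
		GitAuthCmd:  "true # git auth command goes here",
		RunnerToken: "EXAMPLE_TOKEN",
		Label:       "go",
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(script)
}
```

Note that `text/template` performs no shell escaping, so this sketch assumes every parameter is a trusted value taken from the Lambda's own configuration.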

## Runner Images

Dockerfiles in `images/` — lean, purpose-built. Push to ECR with `images/build-all.sh v1`.

| Image | Contents |
|-------|----------|
| `base` | Alpine + git + AWS CLI + SSH |
| `go` | base + Go 1.26 |
| `node` | base + Node 22 + pnpm |
| `docker` | docker:dind + Go + AWS CLI |
| `deploy` | base only (lightest) |
| `godot` | base + headless Godot 4.6 |

## Deploying the dispatcher

The dispatcher Lambda cannot deploy itself through CI, so deploy it manually:

```bash
cd orchestrator/dispatch

# Build
GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -o bootstrap -ldflags "-s -w" .

# Zip
# Windows:
powershell -Command "Compress-Archive -Path bootstrap -DestinationPath function.zip -Force"
# Mac/Linux:
zip -j function.zip bootstrap

# Deploy
aws lambda update-function-code --region eu-west-1 \
  --function-name tinqs-ci-dispatch \
  --zip-file fileb://function.zip

# Trigger a test build
# Push any change to cmd/tstudio/ in tinqs/studio
```

## Lambda env vars

Configured in AWS console, not in code:

| Var | Purpose |
|-----|---------|
| `GITEA_URL` | `https://tinqs.com` |
| `GITEA_TOKEN` | API token — used for fetching workflows AND runner git auth |
| `RUNNER_TOKEN` | act_runner registration token (from Gitea admin → Runners) |
| `RUNNER_AMI` | Pre-baked AMI ID (Go, Node, Docker, act_runner installed) |
| `SUBNET` | VPC subnet for Spot instances |
| `SECURITY_GROUP` | SG allowing outbound HTTPS |
| `DDB_TABLE` | DynamoDB table for run tracking (`tinqs-ci-runs`) |
| `INSTANCE_PROFILE` | IAM role for runner instances (S3, ECR, ECS access) |

## Monitoring

```bash
# Zombie check (should return an empty list except during active builds)
aws ec2 describe-instances --region eu-west-1 \
  --filters "Name=tag:tinqs-ci,Values=true" "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].InstanceId'

# Lambda dispatch logs (use MSYS_NO_PATHCONV=1 on Windows/Git Bash)
MSYS_NO_PATHCONV=1 aws logs tail '/aws/lambda/tinqs-ci-dispatch' --region eu-west-1

# Build logs
TOKEN='<your-gitea-token>'
curl -s "https://tinqs.com/api/v1/repos/tinqs/studio/actions/jobs/<JOB_ID>/logs" \
  -H "Authorization: token $TOKEN"

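# Runner instance logs (while instance is alive)
# send-command only returns a CommandId; read the output with get-command-invocation
aws ssm send-command --region eu-west-1 --instance-ids <ID> \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["cat /var/log/tinqs-ci.log"]'
aws ssm get-command-invocation --region eu-west-1 \
  --command-id <COMMAND_ID> --instance-id <ID> \
  --query StandardOutputContent --output text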

# Stale DynamoDB runs
aws dynamodb scan --region eu-west-1 --table-name tinqs-ci-runs \
  --filter-expression "#s = :r" \
  --expression-attribute-names '{"#s":"status"}' \
  --expression-attribute-values '{":r":{"S":"running"}}' \
  --query Count
```

Full debug guide: `tinqs/docs/.cursor/skills/ci-pipeline-discipline/SKILL.md`

## Contributing

### Adding a new composite action

1. Create `<action-name>/action.yml` with `using: composite` and `shell: bash`
2. Keep it simple — no Node.js runtime, just bash
3. Add a `<action-name>/README.md` with inputs/outputs
4. Add to the Actions table in this README
5. Push to main. Workflows reference actions as `@v1`, which resolves to the main branch

### Modifying the dispatcher

1. Edit `orchestrator/dispatch/main.go`
2. Build: `go build .` (catches compile errors)
3. Deploy manually (see "Deploying the dispatcher" above)
4. Verify: push a change to `tinqs/studio` and watch the pipeline

### Adding a new runner label

1. Add entry to `labelToSpot` map in `main.go`
2. Create `images/<label>/Dockerfile` if needed
3. Build and push image: `cd images && ./build-all.sh v1`
4. Deploy updated Lambda
5. Add `runs-on: <label>` to the workflow that needs it

### Updating the AMI

1. Launch a t3.small from the current AMI (`RUNNER_AMI` env var)
2. SSH in, install/update tools
3. Create AMI: `aws ec2 create-image --instance-id <ID> --name tinqs-ci-runner-v<N>`
4. Update `RUNNER_AMI` Lambda env var
5. Terminate the build instance

## Incidents

- **25 May 2026**: 18 zombie runners DDoS-ing Gitea. Root cause: no `--ephemeral` on registration + no git auth after repos went private. Full post-mortem: `tinqs/internal/incidents/ci-zombie-runners-2026-05-25.md`