# tinqs/ci

CI toolchain for Tinqs Studio: composite Gitea Actions and a Lambda dispatcher that orchestrates ephemeral Spot runners.

**This repo must stay public.** act_runner (go-git) clones action repos without auth. All other tinqs repos are private.

## Architecture

```
Push → Gitea webhook → Lambda (tinqs-ci-dispatch) → EC2 Spot → act_runner → job → self-terminate
```

Runners are ephemeral: one Spot instance per job, which self-terminates on completion. Private repo clones are authenticated via `git config url.insteadOf`, injected in the runner user-data.

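As a rough illustration of the `url.insteadOf` injection, the sketch below prints a gitconfig fragment that rewrites plain HTTPS URLs for both `tinqs.com` and `git.tinqs.com` to token-authenticated ones. The function name `render_git_auth_config` and the `ci:<token>@` credential form are assumptions made for this example; the real user-data may differ.

```shell
#!/usr/bin/env bash
# Sketch only: emit url.<base>.insteadOf rules for both Gitea hostnames.
# The "ci:<token>@" credential form is an assumption, not taken from the repo.
render_git_auth_config() {
  local token="$1" host
  for host in tinqs.com git.tinqs.com; do
    # Any URL starting with https://<host>/ is rewritten to the authenticated base
    printf '[url "https://ci:%s@%s/"]\n' "$token" "$host"
    printf '\tinsteadOf = https://%s/\n' "$host"
  done
}

# In user-data the fragment could be appended to the runner user's config:
#   render_git_auth_config "$GITEA_TOKEN" >> "$HOME/.gitconfig"
```

Only the git binary reads this file. act_runner's go-git clones ignore it, which is why this repo has to stay public.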
### Key design decisions

- **Ephemeral Spot instances** (not Fargate, not persistent runners): cheapest, cleanest, no state to manage.
- **`--ephemeral` on `act_runner register`**: the runner exits after one job, triggering `shutdown -h now`, which terminates the instance. Without this flag, runners pile up as zombies.
- **No local action cache**: act_runner uses go-git internally, which ignores `~/.gitconfig`. The `url.insteadOf` trick only works for the git binary (used by the checkout action).
- **`git.tinqs.com` vs `tinqs.com`**: Gitea's ROOT_URL is `git.tinqs.com`. Runner git auth must cover both hostnames.

## Actions

| Action | What it does |
|--------|-------------|
| `tinqs/ci/checkout@v1` | Clone a repo from tinqs.com (sparse checkout, depth control, token auth) |
| `tinqs/ci/setup-go@v1` | Install Go (skips if pre-baked in AMI) |
| `tinqs/ci/setup-node@v1` | Install Node.js + pnpm (skips if pre-baked) |
| `tinqs/ci/setup-aws@v1` | Install AWS CLI + optional ECR login |

```yaml
steps:
  - uses: tinqs/ci/checkout@v1
    with:
      sparse: 'cmd/tstudio'
  - uses: tinqs/ci/setup-go@v1
  - uses: tinqs/ci/setup-aws@v1
    with:
      ecr-login: 'true'
```

## Dispatcher (Lambda)

`orchestrator/dispatch/main.go` receives Gitea webhooks, evaluates workflow triggers (branch + path filters), and launches Spot instances with the right label.

| Label | Instance | Use |
|-------|----------|-----|
| `go` | t3.small | Go builds (tstudio, proxy, docgen) |
| `docker` | t3.medium | Docker image builds (platform, bot) |
| `deploy` | t3.micro | S3 sync, ECS update |
| `node` | t3.medium | Frontend builds |
| `godot` | t3.medium | Game exports (future) |

Runner user-data flow: boot → git auth config → act_runner register (ephemeral) → daemon → job → exit → shutdown → terminate.

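The user-data flow could look roughly like this. It is a sketch, not the script the dispatcher actually injects: the `run` and `runner_lifecycle` helpers and the `DRY_RUN` switch are hypothetical, and it assumes `GITEA_URL` and `RUNNER_TOKEN` are available on the instance.

```shell
#!/usr/bin/env bash
# Sketch of the runner lifecycle. With DRY_RUN=1 each command is printed
# instead of executed (note: this also prints the registration token).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# runner_lifecycle LABEL: register as ephemeral, run one job, power off.
runner_lifecycle() {
  local label="$1"
  # (git auth config would be written here, before registration)
  run act_runner register --no-interactive \
    --instance "${GITEA_URL}" --token "${RUNNER_TOKEN}" \
    --labels "${label}" --ephemeral
  # An ephemeral runner's daemon returns after its single job
  run act_runner daemon
  # No `set -e` here, so shutdown still runs if an earlier step failed.
  # The instance is terminated only if its shutdown behavior is "terminate".
  run shutdown -h now
}
```

Running `runner_lifecycle go` with `DRY_RUN=1` prints the three commands in order without touching the machine.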
## Runner Images

Dockerfiles live in `images/`: lean and purpose-built. Push them to ECR with `images/build-all.sh v1`.

| Image | Contents |
|-------|----------|
| `base` | Alpine + git + AWS CLI + SSH |
| `go` | base + Go 1.26 |
| `node` | base + Node 22 + pnpm |
| `docker` | docker:dind + Go + AWS CLI |
| `deploy` | base only (lightest) |
| `godot` | base + headless Godot 4.6 |

## Deploying the dispatcher

The dispatcher Lambda cannot deploy itself through CI, so deploy it manually:

```bash
cd orchestrator/dispatch

# Build
GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -o bootstrap -ldflags "-s -w" .

# Zip (run only the line for your OS)
# Windows:
powershell -Command "Compress-Archive -Path bootstrap -DestinationPath function.zip -Force"
# Mac/Linux:
zip -j function.zip bootstrap

# Deploy
aws lambda update-function-code --region eu-west-1 \
  --function-name tinqs-ci-dispatch \
  --zip-file fileb://function.zip

# Trigger a test build
# Push any change to cmd/tstudio/ in tinqs/studio
```

## Lambda env vars

Configured in the AWS console, not in code:

| Var | Purpose |
|-----|---------|
| `GITEA_URL` | Gitea base URL (`https://tinqs.com`) |
| `GITEA_TOKEN` | API token, used both for fetching workflows and for runner git auth |
| `RUNNER_TOKEN` | act_runner registration token (from Gitea admin → Runners) |
| `RUNNER_AMI` | Pre-baked AMI ID (Go, Node, Docker, act_runner installed) |
| `SUBNET` | VPC subnet for Spot instances |
| `SECURITY_GROUP` | SG allowing outbound HTTPS |
| `DDB_TABLE` | DynamoDB table for run tracking (`tinqs-ci-runs`) |
| `INSTANCE_PROFILE` | IAM role for runner instances (S3, ECR, ECS access) |

## Monitoring

```bash
# Zombie check (should be 0 except during active builds)
aws ec2 describe-instances --region eu-west-1 \
  --filters "Name=tag:tinqs-ci,Values=true" "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].InstanceId'

# Lambda dispatch logs (use MSYS_NO_PATHCONV=1 on Windows/Git Bash)
MSYS_NO_PATHCONV=1 aws logs tail '/aws/lambda/tinqs-ci-dispatch' --region eu-west-1

# Build logs
TOKEN=<your-gitea-token>
curl -s "https://tinqs.com/api/v1/repos/tinqs/studio/actions/jobs/<JOB_ID>/logs" \
  -H "Authorization: token $TOKEN"

# Runner instance logs (while instance is alive)
aws ssm send-command --region eu-west-1 --instance-ids <ID> \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["cat /var/log/tinqs-ci.log"]'

# Stale DynamoDB runs
aws dynamodb scan --region eu-west-1 --table-name tinqs-ci-runs \
  --filter-expression "#s = :r" \
  --expression-attribute-names '{"#s":"status"}' \
  --expression-attribute-values '{":r":{"S":"running"}}' \
  --query Count
```

Full debug guide: `tinqs/docs/.cursor/skills/ci-pipeline-discipline/SKILL.md`

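The zombie check can be wrapped into a pass/fail probe for a cron job or alarm. A minimal sketch, assuming the same `tinqs-ci` tag and `eu-west-1` region as the zombie check command; `zombie_count` and `check_zombies` are illustrative names, not part of this repo.

```shell
#!/usr/bin/env bash
# Sketch only: count running instances tagged tinqs-ci=true.
zombie_count() {
  aws ec2 describe-instances --region eu-west-1 \
    --filters "Name=tag:tinqs-ci,Values=true" "Name=instance-state-name,Values=running" \
    --query 'length(Reservations[].Instances[])' --output text
}

# check_zombies MAX: succeed when at most MAX tagged runners are running.
# Returns 1 when the limit is exceeded and 2 when the AWS call itself fails.
check_zombies() {
  local max="${1:-0}" count
  count="$(zombie_count)" || return 2
  if [ "$count" -gt "$max" ]; then
    echo "WARN: ${count} tinqs-ci runners running (max ${max})" >&2
    return 1
  fi
  echo "OK: ${count} tinqs-ci runners running"
}
```

Outside active builds, `check_zombies 0` should succeed; a higher limit allows for builds in flight.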
## Contributing

### Adding a new composite action

1. Create `<action-name>/action.yml` with `using: composite` and `shell: bash` on each `run` step
2. Keep it simple: no Node.js runtime, just bash
3. Add a `<action-name>/README.md` documenting inputs and outputs
4. Add it to the Actions table in this README
5. Push to main: actions resolve via `@v1` (main branch)

### Modifying the dispatcher

1. Edit `orchestrator/dispatch/main.go`
2. Build: `go build .` (catches compile errors)
3. Deploy manually (see "Deploying the dispatcher" above)
4. Verify: push a change to `tinqs/studio` and watch the pipeline

### Adding a new runner label

1. Add an entry to the `labelToSpot` map in `main.go`
2. Create `images/<label>/Dockerfile` if needed
3. Build and push the image: `cd images && ./build-all.sh v1`
4. Deploy the updated Lambda
5. Add `runs-on: <label>` to the workflow that needs it

### Updating the AMI

1. Launch a t3.small from the current AMI (`RUNNER_AMI` env var)
2. SSH in, install/update tools
3. Create AMI: `aws ec2 create-image --instance-id <ID> --name tinqs-ci-runner-v<N>`
4. Update `RUNNER_AMI` Lambda env var
5. Terminate the build instance

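Step 3 can be scripted so the new AMI ID is captured and the image is ready before `RUNNER_AMI` is updated. A sketch under assumptions: `bake_ami` is a hypothetical helper, and the region is taken to be `eu-west-1` as elsewhere in this README.

```shell
#!/usr/bin/env bash
# Sketch only: bake_ami INSTANCE_ID VERSION prints the new AMI ID on stdout.
# The name follows the tinqs-ci-runner-v<N> scheme from the steps above.
bake_ami() {
  local instance_id="$1" version="$2" ami_id
  ami_id="$(aws ec2 create-image --region eu-west-1 \
    --instance-id "$instance_id" \
    --name "tinqs-ci-runner-v${version}" \
    --query ImageId --output text)" || return 1
  # Block until the image is usable before pointing the Lambda at it
  aws ec2 wait image-available --region eu-west-1 --image-ids "$ami_id" >&2 || return 1
  printf '%s\n' "$ami_id"
}
```

`create-image` reboots the instance by default, so finish any tool installs before calling it.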
## Incidents

- **25 May 2026**: 18 zombie runners DDoS-ing Gitea. Root cause: no `--ephemeral` on registration + no git auth after repos went private. Full post-mortem: `tinqs/internal/incidents/ci-zombie-runners-2026-05-25.md`