docs: convert ci docs to the in-repo wiki/ standard + fix stale ECS facts
Adopt the team wiki convention (in-repo wiki/ folder, plain markdown) used in tinqs/studio. Convert DEVOPS.md + PLAN.md and the heavy parts of README.md into cross-linked wiki pages: Home, Architecture, DevOps-Reference, Operations, Roadmap. Root README slimmed to a repo intro pointing at wiki/.

Corrects stale topology while converting:
- ECS cluster tinqs-git / EFS tinqs-git-repos retired 2026-06-05; platform now the standalone EC2 box tinqs-prod-gitea (ALB tinqs-git, ECR image, RDS).
- Records this session's fixes: deploy-label dry-run route, runner-name collisions, arikigame IAM bucket, and template deploy repointed ECS→EC2/SSM.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -1,32 +1,14 @@
|
||||
# tinqs/ci
|
||||
|
||||
CI toolchain for Tinqs Studio — composite Gitea Actions and a Lambda dispatcher that orchestrates ephemeral Spot runners.
|
||||
CI toolchain for Tinqs Studio — composite Gitea Actions and a Lambda dispatcher that orchestrates ephemeral EC2 Spot runners.
|
||||
|
||||
**This repo must stay public.** act_runner (go-git) clones action repos without auth. All other tinqs repos are private.
|
||||
|
||||
## Architecture
|
||||
> ⚠️ **This repo must stay public.** `act_runner` (go-git) clones action repos without auth; every other tinqs repo is private. If this repo goes private, every `uses: tinqs/ci/...` step breaks.
|
||||
|
||||
```
Push → Gitea webhook → Lambda (tinqs-ci-dispatch) → EC2 Spot → act_runner → job → self-terminate
```
|
||||
|
||||
Runners are ephemeral: one Spot instance per job, self-terminates on completion. Private repo clones are authenticated via `git config url.insteadOf` injected in the runner user-data.
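As a rough illustration of the `url.insteadOf` injection, the sketch below builds the git config key that rewrites anonymous `https://tinqs.com/` URLs into token-authenticated ones. The helper names, the `ci` username, and the `GITEA_TOKEN` variable on the runner are assumptions made for this example; the real user-data may differ.

```bash
# Sketch only: helper names and the "ci" username are hypothetical.

# Print the config key url.<authenticated-base>.insteadOf
# $1 = username, $2 = token, $3 = host
insteadof_key() {
  printf 'url.https://%s:%s@%s/.insteadOf' "$1" "$2" "$3"
}

# Rewrite https://<host>/ to the authenticated base for the git binary.
# $1 = token, $2 = host
configure_git_auth() {
  git config --global "$(insteadof_key ci "$1" "$2")" "https://$2/"
}

# Usage on a runner (assumes GITEA_TOKEN is exported by user-data):
#   configure_git_auth "$GITEA_TOKEN" tinqs.com
```

This only affects clones made by the `git` binary, which is why it helps the checkout action but not act_runner's own go-git fetches.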
|
||||
|
||||
### Key design decisions
|
||||
|
||||
- **Ephemeral Spot instances** (not Fargate, not persistent runners) — cheapest, cleanest, no state to manage.
|
||||
- **`--ephemeral` on `act_runner register`** — runner exits after one job, triggering `shutdown -h now` → instance terminates. Without this, runners pile up as zombies.
|
||||
- **No local action cache** — act_runner uses go-git internally which ignores `~/.gitconfig`. The `url.insteadOf` trick only works for the git binary (used by checkout action).
|
||||
- **`tinqs.com`** — Gitea's ROOT_URL is `tinqs.com`. The old `git.tinqs.com` subdomain is retired.
|
||||
|
||||
## Actions
|
||||
|
||||
| Action | What it does |
|--------|-------------|
| `tinqs/ci/checkout@v1` | Clone a repo from tinqs.com (sparse checkout, depth control, token auth) |
| `tinqs/ci/setup-go@v1` | Install Go (skips if pre-baked in AMI) |
| `tinqs/ci/setup-node@v1` | Install Node.js + pnpm (skips if pre-baked) |
| `tinqs/ci/setup-aws@v1` | Install AWS CLI + optional ECR login |
|
||||
## Using the actions
|
||||
|
||||
```yaml
|
||||
steps:
|
||||
@@ -39,137 +21,24 @@ steps:
|
||||
ecr-login: 'true'
|
||||
```
|
||||
|
||||
## Dispatcher (Lambda)
|
||||
| Action | What it does |
|--------|-------------|
| `tinqs/ci/checkout@v1` | Clone a repo from tinqs.com (sparse, depth, token auth) |
| `tinqs/ci/setup-go@v1` | Install Go (skips if pre-baked in AMI) |
| `tinqs/ci/setup-node@v1` | Install Node.js + pnpm (skips if pre-baked) |
| `tinqs/ci/setup-aws@v1` | Install AWS CLI + optional ECR login |
|
||||
|
||||
`orchestrator/dispatch/main.go` — receives Gitea webhooks, evaluates workflow triggers (branch + path filters), launches Spot instances with the right label.
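The trigger evaluation can be pictured with a small shell sketch. This is not the dispatcher's Go code: `glob_match` and `should_trigger` are hypothetical helpers, each filter is reduced to a single glob, and shell `case` globbing (where `*` also matches `/`) only approximates whatever matching rules the dispatcher actually applies.

```bash
# Sketch only: simplified branch + path filter check, not the real main.go.

# Succeed when value $1 matches glob $2.
# case patterns are not pathname-expanded, so the glob never touches the disk.
glob_match() {
  case "$1" in
    $2) return 0 ;;
    *)  return 1 ;;
  esac
}

# $1 = pushed branch, $2 = branch glob, $3 = path glob, rest = changed files.
# Succeed when the branch matches and at least one changed file matches.
should_trigger() {
  glob_match "$1" "$2" || return 1
  path_glob="$3"
  shift 3
  for changed in "$@"; do
    glob_match "$changed" "$path_glob" && return 0
  done
  return 1
}
```

For example, `should_trigger main main 'cmd/tstudio/*' README.md cmd/tstudio/main.go` succeeds, while the same push on a `dev` branch, or a push touching only `README.md`, does not.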
|
||||
## Layout
|
||||
|
||||
| Label | Instance | Use |
|-------|----------|-----|
| `go` | t3.small | Go builds (tstudio, proxy, docgen) |
| `docker` | t3.medium | Docker image builds (platform, bot) |
| `deploy` | t3.micro | S3 sync, ECS update |
| `node` | t3.medium | Frontend builds |
| `godot` | t3.medium | Game exports (future) |
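For scripting around this table (for example, sanity-checking a workflow's `runs-on` value), the label-to-instance mapping can be mirrored as a shell lookup. The function name `instance_for_label` is hypothetical; the authoritative mapping lives in the dispatcher's Go code, so treat this as an illustrative copy that could drift.

```bash
# Sketch only: mirrors the label table above; not read by the dispatcher.
instance_for_label() {
  case "$1" in
    go)     echo t3.small ;;
    docker) echo t3.medium ;;
    deploy) echo t3.micro ;;
    node)   echo t3.medium ;;
    godot)  echo t3.medium ;;
    *)      echo "unknown label: $1" >&2; return 1 ;;
  esac
}
```

An unknown label returns a non-zero status, so a calling script can fail early instead of launching nothing.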
|
||||
- `checkout/`, `setup-go/`, `setup-node/`, `setup-aws/` — composite actions
- `orchestrator/dispatch/` — the dispatcher Lambda (`main.go`)
- `images/` — runner image Dockerfiles
|
||||
|
||||
Runner user-data flow: boot → git auth config → act_runner register (ephemeral) → daemon → job → exit → shutdown → terminate.
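The flow above might look roughly like the following in user-data. This is a dry-run sketch rather than the repo's actual script: the `run` wrapper, the runner name format, the `ci` username, and the way `GITEA_URL`, `GITEA_TOKEN`, and `RUNNER_TOKEN` reach the instance are all assumptions. It also assumes the instance is launched with shutdown behavior set to terminate, so that `shutdown -h now` ends the instance instead of just stopping it.

```bash
# Sketch only. With DRY_RUN unset or 1, commands are printed, not executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# $1 = runner label (go, docker, deploy, ...)
runner_lifecycle() {
  # 1. git auth for private repo clones made by the git binary
  run git config --global \
    "url.https://ci:${GITEA_TOKEN}@tinqs.com/.insteadOf" "https://tinqs.com/"
  # 2. register as a one-shot runner; a timestamped name avoids collisions
  run act_runner register --no-interactive --ephemeral \
    --instance "$GITEA_URL" --token "$RUNNER_TOKEN" \
    --name "ci-$1-$(date +%s)" --labels "$1"
  # 3. run the daemon; with --ephemeral it exits after one job
  run act_runner daemon
  # 4. power off, which terminates the instance under the assumption above
  run shutdown -h now
}
```

Calling `runner_lifecycle go` with the three variables set prints the four commands in order, which is a cheap way to review the sequence before baking it into an AMI.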
|
||||
## 📖 Full docs → [`wiki/`](wiki/README.md)
|
||||
|
||||
## Runner Images
|
||||
The team wiki lives in **[`wiki/`](wiki/README.md)** (plain markdown, rendered by Gitea):
|
||||
|
||||
Dockerfiles in `images/` — lean, purpose-built. Push to ECR with `images/build-all.sh v1`.
|
||||
|
||||
| Image | Contents |
|-------|----------|
| `base` | Alpine + git + AWS CLI + SSH |
| `go` | base + Go 1.26 |
| `node` | base + Node 22 + pnpm |
| `docker` | docker:dind + Go + AWS CLI |
| `deploy` | base only (lightest) |
| `godot` | base + headless Godot 4.6 |
|
||||
|
||||
## Deploying the dispatcher
|
||||
|
||||
The dispatcher Lambda can't CI itself — deploy manually:
|
||||
|
||||
```bash
cd orchestrator/dispatch

# Build
GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -o bootstrap -ldflags "-s -w" .

# Zip
# Windows:
powershell -Command "Compress-Archive -Path bootstrap -DestinationPath function.zip -Force"
# Mac/Linux:
zip -j function.zip bootstrap

# Deploy
aws lambda update-function-code --region eu-west-1 \
  --function-name tinqs-ci-dispatch \
  --zip-file fileb://function.zip

# Trigger a test build
# Push any change to cmd/tstudio/ in tinqs/studio
```
|
||||
|
||||
## Lambda env vars
|
||||
|
||||
Configured in AWS console, not in code:
|
||||
|
||||
| Var | Purpose |
|-----|---------|
| `GITEA_URL` | `https://tinqs.com` |
| `GITEA_TOKEN` | API token — used for fetching workflows AND runner git auth |
| `RUNNER_TOKEN` | act_runner registration token (from Gitea admin → Runners) |
| `RUNNER_AMI` | Pre-baked AMI ID (Go, Node, Docker, act_runner installed) |
| `SUBNET` | VPC subnet for Spot instances |
| `SECURITY_GROUP` | SG allowing outbound HTTPS |
| `DDB_TABLE` | DynamoDB table for run tracking (`tinqs-ci-runs`) |
| `INSTANCE_PROFILE` | IAM role for runner instances (S3, ECR, ECS access) |
|
||||
|
||||
## Monitoring
|
||||
|
||||
```bash
# Zombie check (should be 0 except during active builds)
aws ec2 describe-instances --region eu-west-1 \
  --filters "Name=tag:tinqs-ci,Values=true" "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].InstanceId'

# Lambda dispatch logs (use MSYS_NO_PATHCONV=1 on Windows/Git Bash)
MSYS_NO_PATHCONV=1 aws logs tail '/aws/lambda/tinqs-ci-dispatch' --region eu-west-1

# Build logs
TOKEN=<your-gitea-token>
curl -s "https://tinqs.com/api/v1/repos/tinqs/studio/actions/jobs/<JOB_ID>/logs" \
  -H "Authorization: token $TOKEN"

# Runner instance logs (while instance is alive)
aws ssm send-command --region eu-west-1 --instance-ids <ID> \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["cat /var/log/tinqs-ci.log"]'

# Stale DynamoDB runs
aws dynamodb scan --region eu-west-1 --table-name tinqs-ci-runs \
  --filter-expression "#s = :r" \
  --expression-attribute-names '{"#s":"status"}' \
  --expression-attribute-values '{":r":{"S":"running"}}' \
  --query Count
```
|
||||
|
||||
Full debug guide: `tinqs/docs/.cursor/skills/ci-pipeline-discipline/SKILL.md`
|
||||
|
||||
## Contributing
|
||||
|
||||
### Adding a new composite action
|
||||
|
||||
1. Create `<action-name>/action.yml` with `using: composite` and `shell: bash`
2. Keep it simple — no Node.js runtime, just bash
3. Add a `<action-name>/README.md` with inputs/outputs
4. Add to the Actions table in this README
5. Push to main — actions resolve via `@v1` (main branch)
|
||||
|
||||
### Modifying the dispatcher
|
||||
|
||||
1. Edit `orchestrator/dispatch/main.go`
2. Build: `go build .` (catches compile errors)
3. Deploy manually (see Deploying above)
4. Verify: push a change to `tinqs/studio` and watch the pipeline
|
||||
|
||||
### Adding a new runner label
|
||||
|
||||
1. Add entry to `labelToSpot` map in `main.go`
2. Create `images/<label>/Dockerfile` if needed
3. Build and push image: `cd images && ./build-all.sh v1`
4. Deploy updated Lambda
5. Add `runs-on: <label>` to the workflow that needs it
|
||||
|
||||
### Updating the AMI
|
||||
|
||||
1. Launch a t3.small from the current AMI (`RUNNER_AMI` env var)
2. SSH in, install/update tools
3. Create AMI: `aws ec2 create-image --instance-id <ID> --name tinqs-ci-runner-v<N>`
4. Update `RUNNER_AMI` Lambda env var
5. Terminate the build instance
|
||||
|
||||
## Incidents
|
||||
|
||||
- **25 May 2026**: 18 zombie runners DDoS-ing Gitea. Root cause: no `--ephemeral` on registration + no git auth after repos went private. Full post-mortem: `tinqs/internal/incidents/ci-zombie-runners-2026-05-25.md`
|
||||
- [Architecture](wiki/Architecture.md) — design, dispatcher, labels, runner lifecycle
- [DevOps Reference](wiki/DevOps-Reference.md) — AWS resources, webhook flow, Spot lifecycle, env vars, cost
- [Operations](wiki/Operations.md) — deploy the dispatcher, template deploy, rotate tokens, AMI, monitoring, incidents
- [Roadmap](wiki/Roadmap.md) — done / next
|
||||
|
||||