docs: convert ci docs to the in-repo wiki/ standard + fix stale ECS facts

Adopt the team wiki convention (in-repo wiki/ folder, plain markdown) used in tinqs/studio. Convert DEVOPS.md + PLAN.md and the heavy parts of README.md into cross-linked wiki pages: Home, Architecture, DevOps-Reference, Operations, Roadmap. Root README slimmed to a repo intro pointing at wiki/.

Corrects stale topology while converting:
- ECS cluster tinqs-git / EFS tinqs-git-repos retired 2026-06-05; platform now the standalone EC2 box tinqs-prod-gitea (ALB tinqs-git, ECR image, RDS).
- Records this session's fixes: deploy-label dry-run route, runner-name collisions, arikigame IAM bucket, and template deploy repointed ECS→EC2/SSM.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -1,115 +0,0 @@
# DevOps Reference

## AWS Resources (eu-west-1)

| Resource | Name/ID | Purpose |
|----------|---------|---------|
| Lambda | `tinqs-ci-dispatch` | Webhook handler + Spot launcher |
| DynamoDB | `tinqs-ci-runs` | Run tracking (repo, run_id, instance_id, status) |
| AMI | `tinqs-ci-runner-v2` (ami-00a129385002e4de9) | Pre-baked runner (Go, Node, Docker, act_runner) |
| Security Group | sg-030bf74b43d3faac7 | Runner SG (outbound HTTPS) |
| Subnet | subnet-04b5aeec9bfc4ec2c | Default VPC subnet |
| Instance Profile | tinqs-ci-runner | IAM role (S3, ECR, ECS, SSM) |
| CloudWatch | /aws/lambda/tinqs-ci-dispatch | Dispatcher logs |
| ECS Cluster | tinqs-git | Platform (Gitea) — NOT for CI runners |
| EFS | tinqs-git-repos (fs-03f3fb4859ceb12a3) | Gitea repo storage — NOT for CI |

## Deleted resources (26 May 2026)

| Resource | Why deleted |
|----------|-------------|
| Lambda `tinqs-ci-exec` | Never successfully ran a build. Deploy jobs go through Spot now. |
| CloudWatch `/aws/lambda/tinqs-ci-exec` | Log group for deleted Lambda |
| CloudWatch `/ecs/tinqs-runner` | From Fargate era, no longer used |

## Webhook flow

```
Gitea (tinqs.com)
  └─ per-repo webhook on push
       └─ POST https://<api-gw>/dispatch
            └─ Lambda tinqs-ci-dispatch
                 ├─ Fetch .gitea/workflows/*.yml via Gitea API
                 ├─ Evaluate triggers (branch + path filters)
                 ├─ For each matched workflow:
                 │    ├─ Read runs-on label
                 │    └─ RunInstances (Spot, ephemeral)
                 └─ Track in DynamoDB
```

## Spot instance lifecycle

```
1. Lambda calls RunInstances (Spot, InstanceInitiatedShutdownBehavior=terminate)
2. User-data runs:
   a. Configure git auth (url.insteadOf with GITEA_TOKEN)
   b. act_runner register --ephemeral --labels <label>:host
   c. act_runner daemon (blocks until job completes)
   d. EXIT trap fires → shutdown -h now → instance terminates
3. DynamoDB record: running → completed (or timeout after 30 min cleanup)
```

## Cleanup cron

The dispatcher Lambda also handles cleanup when invoked with empty body or `{"action":"cleanup"}`. Should be triggered by EventBridge every 5 minutes.

- Scans DynamoDB for runs older than 30 min with status=running
- Terminates matching EC2 instances
- Sweeps for orphan instances (tagged tinqs-ci, running > 30 min)

## Cost

| Component | Estimated monthly cost |
|-----------|----------------------|
| Spot instances (t3.small, ~10 min/build, ~5 builds/day) | ~$1-2 |
| Lambda (< 1000 invocations/month) | ~$0 (free tier) |
| DynamoDB (< 1 GB, low RCU/WCU) | ~$0 (free tier) |
| CloudWatch logs | ~$0.50 |
| **Total CI** | **~$2-3/month** |

## Common operations

### Rotate GITEA_TOKEN

1. Generate new token in Gitea: Settings → Applications → Generate Token
2. Update Lambda env: `aws lambda update-function-configuration --function-name tinqs-ci-dispatch --environment ...`
3. Old token is burned into running instances — they'll die within 30 min

### Rotate RUNNER_TOKEN

1. Gitea admin → Actions → Runners → Create new registration token
2. Update Lambda env var
3. Running instances keep their existing registration until they die

### Build a new AMI

```bash
# Launch from current AMI
aws ec2 run-instances --image-id ami-00a129385002e4de9 \
  --instance-type t3.small --key-name <your-key> \
  --region eu-west-1 --query 'Instances[0].InstanceId'

# SSH in, update tools
ssh ec2-user@<ip>
sudo yum update -y
# Install/update Go, Node, Docker, act_runner as needed

# Create new AMI
aws ec2 create-image --instance-id <id> --name tinqs-ci-runner-v3

# Update Lambda
aws lambda update-function-configuration --function-name tinqs-ci-dispatch \
  --environment "Variables={...,RUNNER_AMI=ami-NEW,...}"

# Terminate build instance
aws ec2 terminate-instances --instance-id <id>
```

### Add CI to a new repo

1. Create `.gitea/workflows/<name>.yml` in the repo
2. Add per-repo webhook in Gitea: Settings → Webhooks → Add Webhook
   - URL: Lambda API Gateway URL
   - Events: Push
   - Content type: application/json
3. Push a change that matches the workflow trigger
@@ -1,30 +0,0 @@
# tinqs/ci — Status

## Done

- [x] Composite actions: checkout, setup-go, setup-node, setup-aws
- [x] Lambda dispatcher with Spot instance routing
- [x] Ephemeral runners (one job, self-terminate)
- [x] Git auth for private repos (url.insteadOf)
- [x] Local action cache (pre-clone to bare repo, instant resolution)
- [x] DynamoDB run tracking + cleanup cron
- [x] Runner image Dockerfiles: base, go, node, docker, deploy, godot
- [x] Zombie runner incident resolved (25 May 2026)

## Next

| Priority | Task | Impact |
|----------|------|--------|
| P1 | Pre-warm Go module + build cache in AMI | -30s build time |
| P1 | Automate AMI build (Packer or script) | Repeatable, no manual SSH |
| P2 | Internal DNS for git clones | Faster than public HTTPS |
| P2 | CloudWatch agent on runner AMI | Persistent logs after instance death |
| P3 | `tinqs/ci/deploy-ecs` action | ECS update-service wrapper |
| P3 | `tinqs/ci/deploy-s3` action | S3 sync + CloudFront invalidation |
| P3 | `tinqs/ci/notify` action | Post build status to Lobster GChat |

## Deleted (stale)

- `tinqs-ci-exec` Lambda — never successfully ran a build, removed 26 May
- `/ecs/tinqs-runner` CloudWatch log group — from Fargate era, removed 26 May
- Fargate runner service — scaled to 0, cluster still exists for tinqs-git ECS
@@ -1,32 +1,14 @@
 # tinqs/ci
 
-CI toolchain for Tinqs Studio — composite Gitea Actions and a Lambda dispatcher that orchestrates ephemeral Spot runners.
+CI toolchain for Tinqs Studio — composite Gitea Actions and a Lambda dispatcher that orchestrates ephemeral EC2 Spot runners.
 
-**This repo must stay public.** act_runner (go-git) clones action repos without auth. All other tinqs repos are private.
+> ⚠️ **This repo must stay public.** `act_runner` (go-git) clones action repos without auth; every other tinqs repo is private. If this repo goes private, every `uses: tinqs/ci/...` step breaks.
 
-## Architecture
-
 ```
 Push → Gitea webhook → Lambda (tinqs-ci-dispatch) → EC2 Spot → act_runner → job → self-terminate
 ```
 
-Runners are ephemeral: one Spot instance per job, self-terminates on completion. Private repo clones are authenticated via `git config url.insteadOf` injected in the runner user-data.
+## Using the actions
 
-### Key design decisions
-
-- **Ephemeral Spot instances** (not Fargate, not persistent runners) — cheapest, cleanest, no state to manage.
-- **`--ephemeral` on `act_runner register`** — runner exits after one job, triggering `shutdown -h now` → instance terminates. Without this, runners pile up as zombies.
-- **No local action cache** — act_runner uses go-git internally which ignores `~/.gitconfig`. The `url.insteadOf` trick only works for the git binary (used by checkout action).
-- **`tinqs.com`** — Gitea's ROOT_URL is `tinqs.com`. The old `git.tinqs.com` subdomain is retired.
-
-## Actions
-
-| Action | What it does |
-|--------|-------------|
-| `tinqs/ci/checkout@v1` | Clone a repo from tinqs.com (sparse checkout, depth control, token auth) |
-| `tinqs/ci/setup-go@v1` | Install Go (skips if pre-baked in AMI) |
-| `tinqs/ci/setup-node@v1` | Install Node.js + pnpm (skips if pre-baked) |
-| `tinqs/ci/setup-aws@v1` | Install AWS CLI + optional ECR login |
-
 ```yaml
 steps:
@@ -39,137 +21,24 @@ steps:
       ecr-login: 'true'
 ```
 
-## Dispatcher (Lambda)
+| Action | What it does |
+|--------|-------------|
+| `tinqs/ci/checkout@v1` | Clone a repo from tinqs.com (sparse, depth, token auth) |
+| `tinqs/ci/setup-go@v1` | Install Go (skips if pre-baked in AMI) |
+| `tinqs/ci/setup-node@v1` | Install Node.js + pnpm (skips if pre-baked) |
+| `tinqs/ci/setup-aws@v1` | Install AWS CLI + optional ECR login |
 
-`orchestrator/dispatch/main.go` — receives Gitea webhooks, evaluates workflow triggers (branch + path filters), launches Spot instances with the right label.
+## Layout
 
-| Label | Instance | Use |
-|-------|----------|-----|
-| `go` | t3.small | Go builds (tstudio, proxy, docgen) |
-| `docker` | t3.medium | Docker image builds (platform, bot) |
-| `deploy` | t3.micro | S3 sync, ECS update |
-| `node` | t3.medium | Frontend builds |
-| `godot` | t3.medium | Game exports (future) |
+- `checkout/`, `setup-go/`, `setup-node/`, `setup-aws/` — composite actions
+- `orchestrator/dispatch/` — the dispatcher Lambda (`main.go`)
+- `images/` — runner image Dockerfiles
 
-Runner user-data flow: boot → git auth config → act_runner register (ephemeral) → daemon → job → exit → shutdown → terminate.
+## 📖 Full docs → [`wiki/`](wiki/README.md)
 
-## Runner Images
+The team wiki lives in **[`wiki/`](wiki/README.md)** (plain markdown, rendered by Gitea):
 
-Dockerfiles in `images/` — lean, purpose-built. Push to ECR with `images/build-all.sh v1`.
-
-| Image | Contents |
-|-------|----------|
-| `base` | Alpine + git + AWS CLI + SSH |
-| `go` | base + Go 1.26 |
-| `node` | base + Node 22 + pnpm |
-| `docker` | docker:dind + Go + AWS CLI |
-| `deploy` | base only (lightest) |
-| `godot` | base + headless Godot 4.6 |
-
-## Deploying the dispatcher
-
-The dispatcher Lambda can't CI itself — deploy manually:
-
-```bash
-cd orchestrator/dispatch
-
-# Build
-GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -o bootstrap -ldflags "-s -w" .
-
-# Zip
-# Windows:
-powershell -Command "Compress-Archive -Path bootstrap -DestinationPath function.zip -Force"
-# Mac/Linux:
-zip -j function.zip bootstrap
-
-# Deploy
-aws lambda update-function-code --region eu-west-1 \
-  --function-name tinqs-ci-dispatch \
-  --zip-file fileb://function.zip
-
-# Trigger a test build
-# Push any change to cmd/tstudio/ in tinqs/studio
-```
-
-## Lambda env vars
-
-Configured in AWS console, not in code:
-
-| Var | Purpose |
-|-----|---------|
-| `GITEA_URL` | `https://tinqs.com` |
-| `GITEA_TOKEN` | API token — used for fetching workflows AND runner git auth |
-| `RUNNER_TOKEN` | act_runner registration token (from Gitea admin → Runners) |
-| `RUNNER_AMI` | Pre-baked AMI ID (Go, Node, Docker, act_runner installed) |
-| `SUBNET` | VPC subnet for Spot instances |
-| `SECURITY_GROUP` | SG allowing outbound HTTPS |
-| `DDB_TABLE` | DynamoDB table for run tracking (`tinqs-ci-runs`) |
-| `INSTANCE_PROFILE` | IAM role for runner instances (S3, ECR, ECS access) |
-
-## Monitoring
-
-```bash
-# Zombie check (should be 0 except during active builds)
-aws ec2 describe-instances --region eu-west-1 \
-  --filters "Name=tag:tinqs-ci,Values=true" "Name=instance-state-name,Values=running" \
-  --query 'Reservations[].Instances[].InstanceId'
-
-# Lambda dispatch logs (use MSYS_NO_PATHCONV=1 on Windows/Git Bash)
-MSYS_NO_PATHCONV=1 aws logs tail '/aws/lambda/tinqs-ci-dispatch' --region eu-west-1
-
-# Build logs
-TOKEN=<your-gitea-token>
-curl -s "https://tinqs.com/api/v1/repos/tinqs/studio/actions/jobs/<JOB_ID>/logs" \
-  -H "Authorization: token $TOKEN"
-
-# Runner instance logs (while instance is alive)
-aws ssm send-command --region eu-west-1 --instance-ids <ID> \
-  --document-name "AWS-RunShellScript" \
-  --parameters 'commands=["cat /var/log/tinqs-ci.log"]'
-
-# Stale DynamoDB runs
-aws dynamodb scan --region eu-west-1 --table-name tinqs-ci-runs \
-  --filter-expression "#s = :r" \
-  --expression-attribute-names '{"#s":"status"}' \
-  --expression-attribute-values '{":r":{"S":"running"}}' \
-  --query Count
-```
-
-Full debug guide: `tinqs/docs/.cursor/skills/ci-pipeline-discipline/SKILL.md`
-
-## Contributing
-
-### Adding a new composite action
-
-1. Create `<action-name>/action.yml` with `using: composite` and `shell: bash`
-2. Keep it simple — no Node.js runtime, just bash
-3. Add a `<action-name>/README.md` with inputs/outputs
-4. Add to the Actions table in this README
-5. Push to main — actions resolve via `@v1` (main branch)
-
-### Modifying the dispatcher
-
-1. Edit `orchestrator/dispatch/main.go`
-2. Build: `go build .` (catches compile errors)
-3. Deploy manually (see Deploying above)
-4. Verify: push a change to `tinqs/studio` and watch the pipeline
-
-### Adding a new runner label
-
-1. Add entry to `labelToSpot` map in `main.go`
-2. Create `images/<label>/Dockerfile` if needed
-3. Build and push image: `cd images && ./build-all.sh v1`
-4. Deploy updated Lambda
-5. Add `runs-on: <label>` to the workflow that needs it
-
-### Updating the AMI
-
-1. Launch a t3.small from the current AMI (`RUNNER_AMI` env var)
-2. SSH in, install/update tools
-3. Create AMI: `aws ec2 create-image --instance-id <ID> --name tinqs-ci-runner-v<N>`
-4. Update `RUNNER_AMI` Lambda env var
-5. Terminate the build instance
-
-## Incidents
-
-- **25 May 2026**: 18 zombie runners DDoS-ing Gitea. Root cause: no `--ephemeral` on registration + no git auth after repos went private. Full post-mortem: `tinqs/internal/incidents/ci-zombie-runners-2026-05-25.md`
+- [Architecture](wiki/Architecture.md) — design, dispatcher, labels, runner lifecycle
+- [DevOps Reference](wiki/DevOps-Reference.md) — AWS resources, webhook flow, Spot lifecycle, env vars, cost
+- [Operations](wiki/Operations.md) — deploy the dispatcher, template deploy, rotate tokens, AMI, monitoring, incidents
+- [Roadmap](wiki/Roadmap.md) — done / next
@@ -0,0 +1,80 @@
# Architecture

[← Home](README.md) · [DevOps Reference](DevOps-Reference.md) · [Operations](Operations.md) · [Roadmap](Roadmap.md)

```
Push → Gitea webhook → Lambda (tinqs-ci-dispatch) → EC2 Spot → act_runner → job → self-terminate
```

Runners are **ephemeral**: one Spot instance per job, self-terminating on completion. Private-repo clones are authenticated via `git config url.insteadOf` injected in the runner user-data.

## Key design decisions

- **Ephemeral Spot instances** (not Fargate, not persistent runners) — cheapest, cleanest, no state to manage.
- **`--ephemeral` on `act_runner register`** — the runner exits after one job, triggering `shutdown -h now` → the instance terminates. Without this, runners pile up as zombies (see the 25 May 2026 incident in [Operations](Operations.md)).
- **No local action cache** — `act_runner` uses go-git internally, which ignores `~/.gitconfig`. The `url.insteadOf` trick only works for the `git` binary (used by the `checkout` action), so action repos are cloned fresh each run. This is why `tinqs/ci` must stay public.
- **`tinqs.com`** — Gitea's `ROOT_URL` is `tinqs.com`. The old `git.tinqs.com` subdomain is retired.

## Composite actions

Bash-only composite actions (no Node.js runtime). Resolve via `@v1` (the main branch).

| Action | What it does |
|--------|-------------|
| `tinqs/ci/checkout@v1` | Clone a repo from tinqs.com (sparse checkout, depth control, token auth) |
| `tinqs/ci/setup-go@v1` | Install Go (skips if pre-baked in the AMI) |
| `tinqs/ci/setup-node@v1` | Install Node.js + pnpm (skips if pre-baked) |
| `tinqs/ci/setup-aws@v1` | Install AWS CLI + optional ECR login |

```yaml
steps:
  - uses: tinqs/ci/checkout@v1
    with:
      sparse: 'cmd/tstudio'
  - uses: tinqs/ci/setup-go@v1
  - uses: tinqs/ci/setup-aws@v1
    with:
      ecr-login: 'true'
```

## Dispatcher (Lambda)

`orchestrator/dispatch/main.go` receives Gitea push webhooks, fetches `.gitea/workflows/*.yml` via the Gitea API, evaluates triggers (branch + path filters), reads each matched workflow's `runs-on` label, and launches a Spot instance with that label. Run state is tracked in DynamoDB.

Routing by label (`labelToSpot` map in `main.go`):

| Label | Instance | Use |
|-------|----------|-----|
| `go` | t3.small | Go builds (tstudio, proxy, docgen) |
| `docker` | t3.medium | Docker image builds (platform, bot) |
| `deploy` | t3.micro | S3 sync, CloudFront invalidation, SSM template deploy |
| `node` | t3.medium | Frontend builds |
| `godot` | t3.medium | Game exports (future) |

`runs-on: host` is skipped by the dispatcher (it's for a standing registered runner, not Spot).

> **Fixed 2026-06-07:** `deploy`-labelled jobs used to route to a separate executor Lambda (`tinqs-ci-exec`) that was deleted 26 May, so they silently hit a `[DRY RUN] Would invoke executor` no-op and never ran. They now fall through to the normal Spot path like every other label. A second bug — runner names derived from `runID[:12]` collided across same-commit deploys — was also fixed (names now use the full sanitised runID).
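
The runner-name collision can be illustrated with a small shell sketch. The dispatcher itself is Go and its real sanitisation rule is not shown in this page, so the run-ID shape (`<sha>/<workflow file>`), both helper names, and the character whitelist below are assumptions chosen only to show why a 12-character prefix is not unique.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: why truncating a run ID to 12 characters can collide.
# The run-ID shape and both helpers are assumptions, not the dispatcher's code.

# Old behaviour (as described): keep only the first 12 characters.
short_name() {
  printf '%s' "$1" | cut -c1-12
}

# New behaviour (as described): keep the whole ID, replacing every character
# that is not a letter, digit or hyphen with a hyphen.
full_name() {
  printf '%s' "$1" | tr -c 'A-Za-z0-9-' '-'
}

# Two hypothetical runs triggered by the same commit but different workflows.
run_a='3f9c2ab17d44e0/deploy-templates.yml'
run_b='3f9c2ab17d44e0/deploy-docs.yml'
```

With these inputs, `short_name` returns the same prefix for both runs, while `full_name` keeps the workflow part and so yields two distinct names.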

## Runner lifecycle (user-data)

```
boot → git auth config (url.insteadOf with GITEA_TOKEN)
     → act_runner register --ephemeral --labels <label>:host
     → act_runner daemon (blocks until job completes)
     → EXIT trap → shutdown -h now → instance terminates
```
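
As a rough illustration of that lifecycle, the sketch below renders a user-data script for a given label. It is not the dispatcher's real template: the function name, the runner-name argument, and the assumption that `GITEA_URL`, `GITEA_TOKEN` and `RUNNER_TOKEN` are already set in the instance environment are all hypothetical, as is the token-in-URL form of the `insteadOf` rewrite.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a user-data script matching the lifecycle above.
# Not the dispatcher's real template: names and the URL form are assumptions.

render_user_data() {
  local label="$1" runner_name="$2"
  cat <<EOF
#!/bin/bash
# Power off on any exit; with InstanceInitiatedShutdownBehavior=terminate
# the shutdown terminates the Spot instance.
trap 'shutdown -h now' EXIT

# Let the git binary authenticate clones of private repos.
git config --global url."https://\${GITEA_TOKEN}@tinqs.com/".insteadOf "https://tinqs.com/"

# Register for exactly one job, then run it.
act_runner register --no-interactive --ephemeral \\
  --instance "\${GITEA_URL}" --token "\${RUNNER_TOKEN}" \\
  --name "${runner_name}" --labels "${label}:host"
act_runner daemon
EOF
}
```

Calling `render_user_data go ci-example` prints the script with the label and name filled in; the `\${...}` references are left for the instance to expand at boot.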

## Runner images

Dockerfiles in `images/` — lean, purpose-built. Push to ECR with `images/build-all.sh v1`.

| Image | Contents |
|-------|----------|
| `base` | Alpine + git + AWS CLI + SSH |
| `go` | base + Go |
| `node` | base + Node + pnpm |
| `docker` | docker:dind + Go + AWS CLI |
| `deploy` | base only (lightest) |
| `godot` | base + headless Godot |

> Note: the live Spot runners boot from a **pre-baked AMI** (`RUNNER_AMI`, with Go/Node/Docker/act_runner installed), not these container images. The images exist for purpose-built runner variants; the AMI is the fast path.
@@ -0,0 +1,109 @@
# DevOps Reference

[← Home](README.md) · [Architecture](Architecture.md) · [Operations](Operations.md) · [Roadmap](Roadmap.md)

## AWS resources (eu-west-1)

| Resource | Name/ID | Purpose |
|----------|---------|---------|
| Lambda | `tinqs-ci-dispatch` | Webhook handler + Spot launcher |
| DynamoDB | `tinqs-ci-runs` | Run tracking (repo, run_id, instance_id, status) |
| AMI | `tinqs-ci-runner-v2` (ami-00a129385002e4de9) | Pre-baked runner (Go, Node, Docker, act_runner) |
| Security Group | sg-030bf74b43d3faac7 | Runner SG (outbound HTTPS) |
| Subnet | subnet-04b5aeec9bfc4ec2c | Default VPC subnet |
| Instance Profile | `tinqs-ci-runner` → role `tinqs-git-task` | Runner IAM role (S3, ECR, SSM) |
| CloudWatch | /aws/lambda/tinqs-ci-dispatch | Dispatcher logs |
| API Gateway | `q4ohxovfr8…/webhook` | Receives the per-repo Gitea push webhook |

### Platform host (NOT CI — context)

| Resource | Name/ID | Purpose |
|----------|---------|---------|
| EC2 | `tinqs-prod-gitea` (i-0d085288f467083e0, t3.medium) | Runs tinqs.com as a single `docker` Gitea container |
| ALB | `tinqs-git` | Fronts the platform |
| ECR | `tinqs-git:latest` | Platform image (built by `build.yml` → CodeBuild) |
| RDS | `tinqs-prod` (PostgreSQL) | Platform DB |

The platform mounts host `/data`; `GITEA_CUSTOM=/data/gitea`, so **custom templates live at `/data/gitea/templates/`**. Template-only changes deploy here via SSM — see [Operations](Operations.md).

### Retired resources

| Resource | When / why |
|----------|------------|
| ECS Cluster `tinqs-git` | Deleted **2026-06-05** — platform moved to the `tinqs-prod-gitea` EC2 box |
| EFS `tinqs-git-repos` | Retired in the 2026-06-05 EC2 migration (repos now on instance `/data`) |
| Lambda `tinqs-ci-exec` | Deleted **26 May 2026** — never ran a build; deploy jobs go through Spot now |
| CloudWatch `/aws/lambda/tinqs-ci-exec`, `/ecs/tinqs-runner` | Log groups for the above / the Fargate era |
| Fargate runner service | Scaled to 0 then removed |

## Webhook flow

```
Gitea (tinqs.com)
  └─ per-repo webhook on push
       └─ POST https://<api-gw>/webhook
            └─ Lambda tinqs-ci-dispatch
                 ├─ Fetch .gitea/workflows/*.yml via Gitea API
                 ├─ Evaluate triggers (branch + path filters)
                 ├─ For each matched workflow:
                 │    ├─ Read runs-on label
                 │    └─ RunInstances (Spot, ephemeral)   [host → skipped]
                 └─ Track in DynamoDB
```

## Spot instance lifecycle

```
1. Lambda calls RunInstances (Spot, InstanceInitiatedShutdownBehavior=terminate)
2. User-data runs:
   a. Configure git auth (url.insteadOf with GITEA_TOKEN)
   b. act_runner register --ephemeral --labels <label>:host
   c. act_runner daemon (blocks until job completes)
   d. EXIT trap fires → shutdown -h now → instance terminates
3. DynamoDB record: running → completed (or timeout after 30 min cleanup)
```

Offline runners listed in Gitea admin → Actions → Runners are **normal** — they're spent ephemeral registrations, not a fault.

## Cleanup cron

The dispatcher Lambda also handles cleanup when invoked with an empty body or `{"action":"cleanup"}`. Triggered by EventBridge every 5 minutes.

- Scans DynamoDB for runs older than 30 min with `status=running`
- Terminates matching EC2 instances
- Sweeps for orphan instances (tagged `tinqs-ci`, running > 30 min)
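
To run the sweep by hand instead of waiting for the next scheduled tick, something like the following should work. It is a sketch that assumes AWS CLI v2 and that a direct invoke with this payload reaches the same code path as the EventBridge rule; the function name and default output path are made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch: trigger the cleanup path manually. Assumes AWS CLI v2 and that a
# direct invoke with this payload is handled like the scheduled invocation.

run_cleanup() {
  # Hypothetical default location for the Lambda's response body.
  local out_file="${1:-/tmp/tinqs-ci-cleanup.json}"
  aws lambda invoke --region eu-west-1 \
    --function-name tinqs-ci-dispatch \
    --cli-binary-format raw-in-base64-out \
    --payload '{"action":"cleanup"}' \
    "$out_file"
}
```

Usage: `run_cleanup && cat /tmp/tinqs-ci-cleanup.json`. The `--cli-binary-format raw-in-base64-out` flag lets CLI v2 accept the payload as plain JSON.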

## Lambda env vars

Configured in the AWS console, not in code:

| Var | Purpose |
|-----|---------|
| `GITEA_URL` | `https://tinqs.com` |
| `GITEA_TOKEN` | API token — fetches workflows AND provides runner git auth |
| `RUNNER_TOKEN` | act_runner registration token (Gitea admin → Runners) |
| `RUNNER_AMI` | Pre-baked AMI ID |
| `SUBNET` | VPC subnet for Spot instances |
| `SECURITY_GROUP` | SG allowing outbound HTTPS |
| `DDB_TABLE` | DynamoDB run-tracking table (`tinqs-ci-runs`) |
| `INSTANCE_PROFILE` | IAM instance profile for runners |

## Runner IAM role (`tinqs-git-task`)

Inline policies of note:

- `tinqs-ci-s3` — R/W on `tinqs-cli-releases`, `arikigame-com-website`, `docs.tinqs.com` *(corrected 2026-06-07: was the non-existent `arikigame.com`, which broke the arikigame deploy)*
- `tinqs-git-s3` — R/W on `tinqs-git-lfs`, `tinqs-git-preview`
- `tinqs-ci-deploy` — ECR push, CloudFront `CreateInvalidation`, (legacy ECS update)
- `tinqs-ci-ssm-deploy` — `ec2:DescribeInstances` + `ssm:SendCommand` **scoped to the `tinqs-prod-gitea` instance** (added 2026-06-07 for template deploys)
- `ssm-exec` — Session Manager channels · `ec2-self-terminate` — terminate own `tinqs-ci`-tagged instance

## Cost

| Component | Estimated monthly cost |
|-----------|----------------------|
| Spot instances (t3.small, ~10 min/build, ~5 builds/day) | ~$1–2 |
| Lambda (< 1000 invocations/month) | ~$0 (free tier) |
| DynamoDB (< 1 GB, low RCU/WCU) | ~$0 (free tier) |
| CloudWatch logs | ~$0.50 |
| **Total CI** | **~$2–3/month** |
@@ -0,0 +1,105 @@
# Operations

[← Home](README.md) · [Architecture](Architecture.md) · [DevOps Reference](DevOps-Reference.md) · [Roadmap](Roadmap.md)

## Deploy the dispatcher

The dispatcher Lambda can't CI itself — deploy manually:

```bash
cd orchestrator/dispatch
GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -o bootstrap -ldflags "-s -w" .
# Windows: powershell -Command "Compress-Archive -Path bootstrap -DestinationPath function.zip -Force"
# Mac/Linux: zip -j function.zip bootstrap
aws lambda update-function-code --region eu-west-1 \
  --function-name tinqs-ci-dispatch --zip-file fileb://function.zip
# Verify: push a change to cmd/tstudio/ in tinqs/studio and watch the pipeline
```

## Deploy templates to prod (no rebuild)

Template-only changes don't need a platform rebuild. `tinqs/studio/.gitea/workflows/deploy-templates.yml` (label `deploy`) tars `templates/` → `s3://tinqs-git-lfs/custom-templates.tar.gz`, then over **SSM** tells `tinqs-prod-gitea` to pull + extract into `/data/gitea/templates/` and `docker restart gitea`.

> Repointed to SSM/EC2 on **2026-06-07**. It previously ran `aws ecs update-service --cluster tinqs-git`, which failed with `ClusterNotFoundException` after the cluster was deleted on 06-05 — that's why the repo Wiki tab and theme CSS never went live. The runner role gained a scoped `ssm:SendCommand` (prod-gitea only).

Manual one-off (admin creds):

```bash
tar czf /tmp/custom-templates.tar.gz -C templates .
aws s3 cp /tmp/custom-templates.tar.gz s3://tinqs-git-lfs/custom-templates.tar.gz --region eu-west-1
IID=$(aws ec2 describe-instances --region eu-west-1 \
  --filters "Name=tag:Name,Values=tinqs-prod-gitea" "Name=instance-state-name,Values=running" \
  --query "Reservations[0].Instances[0].InstanceId" --output text)
aws ssm send-command --region eu-west-1 --instance-ids "$IID" \
  --document-name AWS-RunShellScript \
  --parameters 'commands=["aws s3 cp s3://tinqs-git-lfs/custom-templates.tar.gz /tmp/ct.tar.gz --region eu-west-1","tar xzf /tmp/ct.tar.gz -C /data/gitea/templates","docker restart gitea"]'
```
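
`aws ssm send-command` returns as soon as the command is queued, so the one-off above does not by itself confirm that the extract and restart succeeded. A possible wrapper that waits and reports the final status is sketched below; the function name and argument order are hypothetical, and it is not part of the `deploy-templates.yml` workflow.

```bash
#!/usr/bin/env bash
# Sketch: run a command list over SSM and block until it finishes.
# Function name and argument order are assumptions for illustration.

ssm_run_and_wait() {
  local instance_id="$1" commands_json="$2" command_id
  command_id="$(aws ssm send-command --region eu-west-1 \
    --instance-ids "$instance_id" \
    --document-name AWS-RunShellScript \
    --parameters "commands=$commands_json" \
    --query 'Command.CommandId' --output text)" || return 1

  # The waiter polls until the invocation completes; it exits non-zero on
  # failure or timeout, which is ignored here so the status is still printed.
  aws ssm wait command-executed --region eu-west-1 \
    --command-id "$command_id" --instance-id "$instance_id" || true

  # Print the final status so a failed extract or restart is visible.
  aws ssm get-command-invocation --region eu-west-1 \
    --command-id "$command_id" --instance-id "$instance_id" \
    --query 'Status' --output text
}
```

Usage: `ssm_run_and_wait "$IID" '["docker restart gitea"]'` prints the invocation status once the command has finished.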
|
||||||
|
|
||||||
|
> Note: a template change does **not** bump the platform version string in the footer (that tracks the Go binary build). Unchanged footer ≠ failed deploy.
|
||||||
|
|
||||||
|
## Rotate `GITEA_TOKEN`
|
||||||
|
|
||||||
|
1. Generate a new token in Gitea: Settings → Applications → Generate Token
|
||||||
|
2. `aws lambda update-function-configuration --function-name tinqs-ci-dispatch --environment ...`
|
||||||
|
3. Old token is burned into running instances — they die within 30 min
|
||||||
|
|
||||||
|
## Rotate `RUNNER_TOKEN`
|
||||||
|
|
||||||
|
1. Gitea admin → Actions → Runners → Create new registration token
|
||||||
|
2. Update the Lambda env var
|
||||||
|
3. Running instances keep their existing registration until they die
|
||||||
|
|
||||||
|
## Build a new AMI
|
||||||
|
|
||||||
|
```bash
|
||||||
|
aws ec2 run-instances --image-id ami-00a129385002e4de9 \
|
||||||
|
--instance-type t3.small --key-name <your-key> --region eu-west-1 \
|
||||||
|
--query 'Instances[0].InstanceId'
|
||||||
|
# SSH in, update tools (Go, Node, Docker, act_runner), then:
|
||||||
|
aws ec2 create-image --instance-id <id> --name tinqs-ci-runner-v3
|
||||||
|
aws lambda update-function-configuration --function-name tinqs-ci-dispatch \
|
||||||
|
--environment "Variables={...,RUNNER_AMI=ami-NEW,...}"
|
||||||
|
aws ec2 terminate-instances --instance-id <id>
|
||||||
|
```

## Add CI to a new repo

1. Create `.gitea/workflows/<name>.yml` in the repo
2. Add a per-repo webhook in Gitea: Settings → Webhooks → Add Webhook
   - URL: the dispatcher API Gateway URL · Events: Push · Content type: `application/json`
3. Push a change matching the workflow trigger
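
For step 1, a starting workflow can be scaffolded from the shell. This is a sketch, not a file taken from any tinqs repo: the `go` label, the `main` branch trigger, and the assumption that `setup-go` works without inputs are all placeholders to adjust, and `write_workflow` is a hypothetical helper. The `runs-on` value has to be a label the dispatcher actually routes.

```bash
# Sketch only: label, branch, and action inputs are assumptions.

# Write a minimal Gitea Actions workflow into a repo checkout.
# Usage: write_workflow <repo-dir> [workflow-name]
write_workflow() {
  local repo_dir="$1" name="${2:-build}"
  mkdir -p "$repo_dir/.gitea/workflows"
  cat > "$repo_dir/.gitea/workflows/$name.yml" <<'YAML'
name: build
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: go            # assumed label; must be one the dispatcher routes
    steps:
      - uses: tinqs/ci/checkout@v1
      - uses: tinqs/ci/setup-go@v1
      - run: go build ./...
YAML
}
```

Running `write_workflow . build` from the repo root creates `.gitea/workflows/build.yml`; committing and pushing it to the trigger branch then exercises step 3.
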

## Monitoring

```bash
# Zombie check (should be 0 except during active builds)
aws ec2 describe-instances --region eu-west-1 \
  --filters "Name=tag:tinqs-ci,Values=true" "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].InstanceId'

# Dispatcher logs (MSYS_NO_PATHCONV=1 on Windows/Git Bash; or use PowerShell)
MSYS_NO_PATHCONV=1 aws logs tail '/aws/lambda/tinqs-ci-dispatch' --region eu-west-1

# Build/job logs
curl -s "https://tinqs.com/api/v1/repos/tinqs/studio/actions/jobs/<JOB_ID>/logs" \
  -H "Authorization: token <gitea-token>"

# Stale DynamoDB runs
aws dynamodb scan --region eu-west-1 --table-name tinqs-ci-runs \
  --filter-expression "#s = :r" \
  --expression-attribute-names '{"#s":"status"}' \
  --expression-attribute-values '{":r":{"S":"running"}}' --query Count
```
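
The zombie check above prints instance IDs for a person to read. If the same check is wanted from cron or another script, it can be wrapped so the exit status carries the answer. A minimal sketch, assuming the `tinqs-ci=true` tag filter shown above; `check_zombies` and its threshold argument are illustrative names.

```bash
# Sketch only: check_zombies is a hypothetical helper.

# Succeed when the number of running CI instances is at or below a threshold.
# Usage: check_zombies [max-allowed]   (default 0)
check_zombies() {
  local max="${1:-0}" count
  count=$(aws ec2 describe-instances --region eu-west-1 \
    --filters "Name=tag:tinqs-ci,Values=true" "Name=instance-state-name,Values=running" \
    --query 'length(Reservations[].Instances[])' --output text) || return 2
  echo "running CI instances: $count"
  [ "$count" -le "$max" ]
}
```

Because a non-zero count is normal during active builds, a caller would likely pass a small allowance (for example `check_zombies 2`) or run the check at a quiet time.
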

## Contributing

**New composite action:** create `<name>/action.yml` (`using: composite`, `shell: bash`), keep it bash-only, add a `<name>/README.md`, list it in [Architecture](Architecture.md), push to main (resolves via `@v1`).

**Modify the dispatcher:** edit `orchestrator/dispatch/main.go`, `go build .` to catch errors, deploy manually (above), verify with a push to `tinqs/studio`.

**New runner label:** add to `labelToSpot` in `main.go`, create `images/<label>/Dockerfile` if needed, build/push (`cd images && ./build-all.sh v1`), deploy the Lambda, add `runs-on: <label>` to the consuming workflow.

## Incidents

- **25 May 2026:** 18 zombie runners DDoS-ing Gitea. Root cause: no `--ephemeral` on registration + no git auth after repos went private. Fix: `--ephemeral` + `url.insteadOf` git auth in user-data.
- **07 Jun 2026:** all `runs-on: deploy` jobs were silently dry-running (dead `tinqs-ci-exec` route), plus an arikigame IAM bucket mismatch and a template deploy still pointing at the deleted ECS cluster. All fixed; see [Architecture](Architecture.md) and the template-deploy note above.
@@ -0,0 +1,31 @@
# tinqs/ci: CI Toolchain

> **📖 This is the team wiki.** Standard: the in-repo **`wiki/`** folder is the home for team/architecture docs in every repo (distinct from `.agents/` = agent operating context, and `docs/` = public product docs at tinqs.com/docs). Plain markdown, rendered by Gitea: no separate wiki repo, no build. Cross-link with `[Title](Page-Name.md)`.

**The CI system for Tinqs Studio: composite Gitea Actions + a Lambda dispatcher that launches ephemeral EC2 Spot runners, one per job.** Status baked in: ✅ live · 🔨 built · 📋 planned. Last updated 2026-06-07.

> ⚠️ **This repo must stay public.** `act_runner` (go-git) clones action repos without auth; every other tinqs repo is private. If `tinqs/ci` goes private, every workflow that does `uses: tinqs/ci/...` breaks.

```
Push → Gitea webhook → Lambda (tinqs-ci-dispatch) → EC2 Spot → act_runner → job → self-terminate
```

## Pages

- [Architecture](Architecture.md): design decisions, the dispatcher, runner labels & images, runner lifecycle
- [DevOps Reference](DevOps-Reference.md): AWS resources, webhook flow, Spot lifecycle, cleanup cron, cost, Lambda env vars
- [Operations](Operations.md): deploy the dispatcher, rotate tokens, build an AMI, add CI to a repo, monitoring, incidents
- [Roadmap](Roadmap.md): what's done, what's next

## Key facts

| | |
|---|---|
| **Runners** | Ephemeral EC2 Spot, one per job, self-terminate (`--ephemeral` + `shutdown -h now`) |
| **Dispatcher** | `tinqs-ci-dispatch` Lambda (`orchestrator/dispatch/main.go`), Go, `provided.al2023` |
| **Routing** | Workflow `runs-on` label → Spot instance type (see [Architecture](Architecture.md)) |
| **Auth** | `GITEA_TOKEN` injected into runner user-data via `git config url.insteadOf` |
| **Region** | eu-west-1 |
| **Cost** | ~$2–3/month |
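
The **Auth** row relies on git's `url.<base>.insteadOf` rewrite: any clone URL starting with the plain host prefix is transparently replaced by one that carries credentials. The sketch below shows the mechanism only. The exact credential form the dispatcher writes into user-data is not shown in this page, so the `<user>:<token>` shape, the default host `tinqs.com`, and the helper name `configure_git_auth` are assumptions.

```bash
# Sketch only: credential format and helper name are assumptions.

# Rewrite https://<host>/ to a credential-bearing URL for every git clone/fetch.
# Usage: configure_git_auth <credentials> [host] [config-file]
configure_git_auth() {
  local credentials="$1" host="${2:-tinqs.com}" file="${3:-$HOME/.gitconfig}"
  git config --file "$file" \
    "url.https://${credentials}@${host}/.insteadOf" "https://${host}/"
}
```

After `configure_git_auth "<user>:<token>"`, a plain `git clone https://tinqs.com/tinqs/studio.git` would be sent with the credentials attached, which is why workflows do not need to embed a token in their clone URLs.
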

> **2026-06-05: platform moved off ECS.** tinqs.com now runs as a single `docker` container on the standalone EC2 box **`tinqs-prod-gitea`** (behind ALB `tinqs-git`, image from ECR `tinqs-git:latest`, state on RDS `tinqs-prod` + local `/data`). The old ECS cluster `tinqs-git` and EFS `tinqs-git-repos` were retired. Any workflow that still referenced ECS (e.g. template deploy) was repointed at the EC2 host via SSM; see [Operations](Operations.md).
@@ -0,0 +1,33 @@
# Roadmap

[← Home](README.md) · [Architecture](Architecture.md) · [DevOps Reference](DevOps-Reference.md) · [Operations](Operations.md)

## Done

- [x] Composite actions: `checkout`, `setup-go`, `setup-node`, `setup-aws`
- [x] Lambda dispatcher with Spot instance routing by `runs-on` label
- [x] Ephemeral runners (one job, self-terminate)
- [x] Git auth for private repos (`url.insteadOf`)
- [x] DynamoDB run tracking + cleanup cron
- [x] Runner image Dockerfiles: base, go, node, docker, deploy, godot
- [x] Zombie runner incident resolved (25 May 2026)
- [x] `deploy`-label jobs routed through Spot (was dead-Lambda dry-run) (07 Jun 2026)
- [x] Unique Spot runner names per dispatch (07 Jun 2026)
- [x] Template deploy repointed off deleted ECS → EC2 via SSM (07 Jun 2026)

## Next

| Priority | Task | Impact |
|----------|------|--------|
| P1 | Pre-warm Go module + build cache in the AMI | −30s build time |
| P1 | Automate AMI build (Packer or script) | Repeatable, no manual SSH |
| P2 | Internal DNS for git clones | Faster than public HTTPS |
| P2 | CloudWatch agent on the runner AMI | Persistent logs after instance death |
| P3 | `tinqs/ci/deploy-s3` action | S3 sync + CloudFront invalidation wrapper |
| P3 | `tinqs/ci/deploy-ssm` action | Reusable SSM-to-prod deploy (generalise the template-deploy step) |
| P3 | `tinqs/ci/notify` action | Post build status to GChat |

## Watch / cleanup

- **Repo size:** `tinqs/studio` now commits the arikigame site assets (~75 MB) as regular files because the CI `checkout` does no `git lfs pull`. If this grows, add `git lfs pull` to the checkout action, then LFS-track `web/arikigame/public/img/**`.
- **DEVOPS doc drift:** keep this wiki current when AWS topology changes (the ECS→EC2 move went unnoticed in docs for two days and broke deploys).