ci/wiki/Operations.md
ozan 33f967e42e docs: convert ci docs to the in-repo wiki/ standard + fix stale ECS facts
Adopt the team wiki convention (in-repo wiki/ folder, plain markdown) used in
tinqs/studio. Convert DEVOPS.md + PLAN.md and the heavy parts of README.md
into cross-linked wiki pages: Home, Architecture, DevOps-Reference,
Operations, Roadmap. Root README slimmed to a repo intro pointing at wiki/.

Corrects stale topology while converting:
- ECS cluster tinqs-git / EFS tinqs-git-repos retired 2026-06-05; platform now
  the standalone EC2 box tinqs-prod-gitea (ALB tinqs-git, ECR image, RDS).
- Records this session's fixes: deploy-label dry-run route, runner-name
  collisions, arikigame IAM bucket, and template deploy repointed ECS→EC2/SSM.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 20:43:05 +01:00


# Operations
[← Home](README.md) · [Architecture](Architecture.md) · [DevOps Reference](DevOps-Reference.md) · [Roadmap](Roadmap.md)
## Deploy the dispatcher
The dispatcher Lambda can't CI itself — deploy manually:
```bash
cd orchestrator/dispatch
GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -o bootstrap -ldflags "-s -w" .
# Windows: powershell -Command "Compress-Archive -Path bootstrap -DestinationPath function.zip -Force"
# Mac/Linux: zip -j function.zip bootstrap
aws lambda update-function-code --region eu-west-1 \
--function-name tinqs-ci-dispatch --zip-file fileb://function.zip
# Verify: push a change to cmd/tstudio/ in tinqs/studio and watch the pipeline
```
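`update-function-code` returns before the new code is necessarily live. As a rough sketch, the upload can be wrapped so the shell blocks until Lambda reports the update as finished. `deploy_dispatcher` is a hypothetical helper name, not something that exists in the repo, and it assumes `function.zip` was built as shown above.

```bash
# Hypothetical helper: upload the zip, then wait for the update to settle.
deploy_dispatcher() {
  local zip="${1:-function.zip}"
  if [ ! -f "$zip" ]; then
    echo "missing $zip; build and zip the bootstrap binary first" >&2
    return 1
  fi
  # Prints the SHA-256 that Lambda computed for the uploaded package.
  aws lambda update-function-code --region eu-west-1 \
    --function-name tinqs-ci-dispatch --zip-file "fileb://$zip" \
    --query CodeSha256 --output text || return 1
  # Blocks until the update is reported successful; exits non-zero on
  # failure or timeout.
  aws lambda wait function-updated --region eu-west-1 \
    --function-name tinqs-ci-dispatch
}

# Usage, from orchestrator/dispatch after building:
#   deploy_dispatcher function.zip
```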
## Deploy templates to prod (no rebuild)
Template-only changes don't need a platform rebuild. `tinqs/studio/.gitea/workflows/deploy-templates.yml` (label `deploy`) tars `templates/` and uploads it to `s3://tinqs-git-lfs/custom-templates.tar.gz`, then uses **SSM** to tell `tinqs-prod-gitea` to pull the archive, extract it into `/data/gitea/templates/`, and run `docker restart gitea`.
> Repointed to SSM/EC2 on **2026-06-07**. It previously ran `aws ecs update-service --cluster tinqs-git`, which failed with `ClusterNotFoundException` once the cluster was deleted on 2026-06-05; that is why the repo Wiki tab and theme CSS never went live. The runner role gained a scoped `ssm:SendCommand` permission (prod-gitea only).
Manual one-off (admin creds):
```bash
tar czf /tmp/custom-templates.tar.gz -C templates .
aws s3 cp /tmp/custom-templates.tar.gz s3://tinqs-git-lfs/custom-templates.tar.gz --region eu-west-1
IID=$(aws ec2 describe-instances --region eu-west-1 \
--filters "Name=tag:Name,Values=tinqs-prod-gitea" "Name=instance-state-name,Values=running" \
--query "Reservations[0].Instances[0].InstanceId" --output text)
aws ssm send-command --region eu-west-1 --instance-ids "$IID" \
--document-name AWS-RunShellScript \
--parameters 'commands=["aws s3 cp s3://tinqs-git-lfs/custom-templates.tar.gz /tmp/ct.tar.gz --region eu-west-1","tar xzf /tmp/ct.tar.gz -C /data/gitea/templates","docker restart gitea"]'
```
> Note: a template change does **not** bump the platform version string in the footer (that tracks the Go binary build). Unchanged footer ≠ failed deploy.
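`send-command` only queues the script; on its own it does not say whether the extract or the restart worked. A minimal sketch of a follow-up check, assuming the command ID was captured by adding `--query Command.CommandId --output text` to the `send-command` call. The names `ssm_status` and `ssm_succeeded` are made up for illustration.

```bash
# Hypothetical helpers: wait for an SSM command, then report how it ended.
ssm_status() {
  local command_id="$1" instance_id="$2"
  # The waiter exits non-zero on failure or timeout, so its result is
  # ignored here and the final status is read explicitly instead.
  aws ssm wait command-executed --region eu-west-1 \
    --command-id "$command_id" --instance-id "$instance_id" 2>/dev/null || true
  aws ssm get-command-invocation --region eu-west-1 \
    --command-id "$command_id" --instance-id "$instance_id" \
    --query Status --output text
}

ssm_succeeded() {
  [ "$(ssm_status "$1" "$2")" = "Success" ]
}

# Usage (CID from send-command, IID from describe-instances):
#   ssm_succeeded "$CID" "$IID" && echo "templates deployed"
```

Any status other than `Success` (for example `Failed` or `TimedOut`) makes `ssm_succeeded` return non-zero, which is probably a better deploy signal than the footer version string.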
## Rotate `GITEA_TOKEN`
1. Generate a new token in Gitea: Settings → Applications → Generate Token
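2. Update the Lambda: `aws lambda update-function-configuration --function-name tinqs-ci-dispatch --environment ...` (`--environment` replaces the entire variable map, so pass every variable, not only the new token)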
3. Old token is burned into running instances — they die within 30 min
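
One way to avoid retyping every variable in step 2 is to read the current map, swap a single value, and write the result back. The sketch below does that with `sed` so it has no extra dependencies; `set_env_value` and `rotate_lambda_env` are hypothetical helpers. It assumes the key already exists and that the new value contains none of `|`, `&`, `\` or `"`. If `jq` is available, a proper JSON merge would be more robust.

```bash
# Hypothetical helper: replace the value of one key in JSON read from stdin.
set_env_value() {
  local key="$1" value="$2"
  sed -E "s|(\"${key}\"[[:space:]]*:[[:space:]]*\")[^\"]*|\1${value}|"
}

# Hypothetical helper: rotate one Lambda env var, keeping all the others.
rotate_lambda_env() {
  local fn="$1" key="$2" value="$3" env
  # Fetch the current variables, already wrapped as {"Variables": {...}}.
  env=$(aws lambda get-function-configuration --region eu-west-1 \
    --function-name "$fn" \
    --query '{Variables: Environment.Variables}' --output json) || return 1
  if ! printf '%s\n' "$env" | grep -q "\"${key}\""; then
    echo "variable $key not found on $fn" >&2
    return 1
  fi
  env=$(printf '%s\n' "$env" | set_env_value "$key" "$value")
  aws lambda update-function-configuration --region eu-west-1 \
    --function-name "$fn" --environment "$env" \
    --query LastUpdateStatus --output text
}

# Usage:
#   rotate_lambda_env tinqs-ci-dispatch GITEA_TOKEN "<new-token>"
```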
## Rotate `RUNNER_TOKEN`
1. Gitea admin → Actions → Runners → Create new registration token
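2. Update the `RUNNER_TOKEN` env var on the `tinqs-ci-dispatch` Lambda (same `update-function-configuration` call as for `GITEA_TOKEN`)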
3. Running instances keep their existing registration until they die
## Build a new AMI
```bash
# Launch a temporary builder instance
aws ec2 run-instances --image-id ami-00a129385002e4de9 \
  --instance-type t3.small --key-name <your-key> --region eu-west-1 \
  --query 'Instances[0].InstanceId' --output text
# SSH in, update tools (Go, Node, Docker, act_runner), then:
aws ec2 create-image --region eu-west-1 --instance-id <id> --name tinqs-ci-runner-v3
# Point the dispatcher at the new AMI
aws lambda update-function-configuration --region eu-west-1 \
  --function-name tinqs-ci-dispatch \
  --environment "Variables={...,RUNNER_AMI=ami-NEW,...}"
# Clean up the builder (note the plural --instance-ids)
aws ec2 terminate-instances --region eu-west-1 --instance-ids <id>
```
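`create-image` returns an image ID immediately, while the AMI itself stays `pending` for a while and cannot be launched until it is `available`. A small sketch that captures the ID and waits before it is used as `RUNNER_AMI`; `bake_ami` is a hypothetical helper name.

```bash
# Hypothetical helper: create an AMI from an instance and wait until usable.
bake_ami() {
  local instance_id="$1" name="$2" ami
  ami=$(aws ec2 create-image --region eu-west-1 \
    --instance-id "$instance_id" --name "$name" \
    --query ImageId --output text) || return 1
  # Blocks until the image state is "available"; non-zero exit on failure.
  aws ec2 wait image-available --region eu-west-1 --image-ids "$ami" || return 1
  printf '%s\n' "$ami"
}

# Usage:
#   NEW_AMI=$(bake_ami <id> tinqs-ci-runner-v3) && echo "RUNNER_AMI=$NEW_AMI"
```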
## Add CI to a new repo
1. Create `.gitea/workflows/<name>.yml` in the repo
2. Add a per-repo webhook in Gitea: Settings → Webhooks → Add Webhook
   - URL: the dispatcher API Gateway URL · Events: Push · Content type: `application/json`
3. Push a change matching the workflow trigger
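
Step 2 can also be scripted against the Gitea API instead of the web UI. This is a sketch under a few assumptions: the instance lives at `https://tinqs.com`, a `GITEA_TOKEN` environment variable holds a token allowed to manage the repo's webhooks, and the dispatcher URL is passed in by the caller. `webhook_payload` and `create_push_webhook` are hypothetical names, and the URL is not JSON-escaped.

```bash
# Hypothetical helper: JSON body for a push-only Gitea webhook.
webhook_payload() {
  printf '{"type":"gitea","active":true,"events":["push"],"config":{"url":"%s","content_type":"json"}}' "$1"
}

# Hypothetical helper: register the webhook on owner/repo.
create_push_webhook() {
  local repo="$1" dispatcher_url="$2"
  curl -fsS -X POST "https://tinqs.com/api/v1/repos/${repo}/hooks" \
    -H "Authorization: token ${GITEA_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "$(webhook_payload "$dispatcher_url")"
}

# Usage:
#   create_push_webhook tinqs/<repo> "<dispatcher API Gateway URL>"
```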
## Monitoring
```bash
# Zombie check (should be 0 except during active builds)
aws ec2 describe-instances --region eu-west-1 \
--filters "Name=tag:tinqs-ci,Values=true" "Name=instance-state-name,Values=running" \
--query 'Reservations[].Instances[].InstanceId'
# Dispatcher logs (MSYS_NO_PATHCONV=1 on Windows/Git Bash; or use PowerShell)
MSYS_NO_PATHCONV=1 aws logs tail '/aws/lambda/tinqs-ci-dispatch' --region eu-west-1
# Build/job logs
curl -s "https://tinqs.com/api/v1/repos/tinqs/studio/actions/jobs/<JOB_ID>/logs" \
-H "Authorization: token <gitea-token>"
# Stale DynamoDB runs
aws dynamodb scan --region eu-west-1 --table-name tinqs-ci-runs \
--filter-expression "#s = :r" \
--expression-attribute-names '{"#s":"status"}' \
--expression-attribute-values '{":r":{"S":"running"}}' --query Count
```
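The zombie check above can be turned into a pass/fail probe for a cron job or an alert. A minimal sketch, assuming a tolerated number of running instances is passed as the first argument (default 0); `count_ids` and `zombie_check` are hypothetical names.

```bash
# Hypothetical helper: count EC2 instance IDs in whitespace-separated input.
count_ids() {
  tr -s '[:space:]' '\n' | grep -c '^i-' || true
}

# Hypothetical helper: succeed only if at most $1 tinqs-ci runners are up.
zombie_check() {
  local max="${1:-0}" n
  n=$(aws ec2 describe-instances --region eu-west-1 \
    --filters "Name=tag:tinqs-ci,Values=true" "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].InstanceId' --output text | count_ids)
  echo "running tinqs-ci instances: $n"
  [ "$n" -le "$max" ]
}

# Usage:
#   zombie_check 0 || echo "possible zombie runners"
```

Run it outside active builds, since a build in progress legitimately keeps instances running.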
## Contributing
**New composite action:** create `<name>/action.yml` (`using: composite`, `shell: bash`), keep it bash-only, add a `<name>/README.md`, list it in [Architecture](Architecture.md), push to main (resolves via `@v1`).
**Modify the dispatcher:** edit `orchestrator/dispatch/main.go`, `go build .` to catch errors, deploy manually (above), verify with a push to `tinqs/studio`.
**New runner label:** add to `labelToSpot` in `main.go`, create `images/<label>/Dockerfile` if needed, build/push (`cd images && ./build-all.sh v1`), deploy the Lambda, add `runs-on: <label>` to the consuming workflow.
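
For the composite-action steps, a scaffold along these lines could create the two required files. It is only a sketch: `scaffold_action` is a hypothetical helper, and the generated step is a placeholder to be replaced with the real bash logic.

```bash
# Hypothetical helper: create <name>/action.yml and <name>/README.md.
scaffold_action() {
  local name="$1"
  if [ -z "$name" ]; then
    echo "usage: scaffold_action <name>" >&2
    return 1
  fi
  if [ -e "$name" ]; then
    echo "$name already exists" >&2
    return 1
  fi
  mkdir -p "$name" || return 1
  # Minimal composite action with a single bash step.
  cat > "$name/action.yml" <<EOF
name: $name
description: TODO describe what $name does
runs:
  using: composite
  steps:
    - name: Run $name
      shell: bash
      run: echo "TODO implement $name"
EOF
  printf '# %s\n\nTODO: document inputs, outputs, and usage.\n' "$name" \
    > "$name/README.md"
}

# Usage, from the repo root:
#   scaffold_action my-action
```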
## Incidents
- **25 May 2026** — 18 zombie runners DDoS-ing Gitea. Root cause: no `--ephemeral` on registration + no git auth after repos went private. Fix: `--ephemeral` + `url.insteadOf` git auth in user-data.
- **07 Jun 2026** — all `runs-on: deploy` jobs silently dry-running (dead `tinqs-ci-exec` route) + arikigame IAM bucket mismatch + template deploy pointing at the deleted ECS cluster. All fixed; see [Architecture](Architecture.md) and the template-deploy note above.