ci/wiki/Operations.md
ozan 33f967e42e docs: convert ci docs to the in-repo wiki/ standard + fix stale ECS facts
Adopt the team wiki convention (in-repo wiki/ folder, plain markdown) used in
tinqs/studio. Convert DEVOPS.md + PLAN.md and the heavy parts of README.md
into cross-linked wiki pages: Home, Architecture, DevOps-Reference,
Operations, Roadmap. Root README slimmed to a repo intro pointing at wiki/.

Corrects stale topology while converting:
- ECS cluster tinqs-git / EFS tinqs-git-repos retired 2026-06-05; platform now
  the standalone EC2 box tinqs-prod-gitea (ALB tinqs-git, ECR image, RDS).
- Records this session's fixes: deploy-label dry-run route, runner-name
  collisions, arikigame IAM bucket, and template deploy repointed ECS→EC2/SSM.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 20:43:05 +01:00


Operations

← Home · Architecture · DevOps Reference · Roadmap

Deploy the dispatcher

The dispatcher Lambda is what launches CI, so it cannot be deployed through CI. Deploy it manually:

cd orchestrator/dispatch
GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -o bootstrap -ldflags "-s -w" .
# Windows:  powershell -Command "Compress-Archive -Path bootstrap -DestinationPath function.zip -Force"
# Mac/Linux: zip -j function.zip bootstrap
aws lambda update-function-code --region eu-west-1 \
  --function-name tinqs-ci-dispatch --zip-file fileb://function.zip
# Verify: push a change to cmd/tstudio/ in tinqs/studio and watch the pipeline
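Before pushing a test change, it can help to confirm that Lambda actually finished applying the new code. The sketch below assumes the function name and region used above; `check_dispatcher_deploy` is a hypothetical helper name, not something in the repo.

```shell
# Sketch only: check_dispatcher_deploy is a hypothetical helper.
# It waits for the in-flight update, then checks the reported status.
check_dispatcher_deploy() {
  local fn="${1:-tinqs-ci-dispatch}" region="${2:-eu-west-1}" state
  # The waiter polls until the function update is no longer in progress.
  aws lambda wait function-updated --region "$region" --function-name "$fn" || return 1
  state=$(aws lambda get-function-configuration --region "$region" \
    --function-name "$fn" --query LastUpdateStatus --output text) || return 1
  echo "LastUpdateStatus: $state"
  # Succeed only when Lambda reports the update as applied.
  [ "$state" = "Successful" ]
}

# Usage (after update-function-code):
#   check_dispatcher_deploy && echo "dispatcher updated"
```

A non-zero exit means the update is still failing or was rejected, in which case the push-based verification would be exercising the old code.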

Deploy templates to prod (no rebuild)

Template-only changes don't need a platform rebuild. tinqs/studio/.gitea/workflows/deploy-templates.yml (label deploy) tars templates/ and uploads the archive to s3://tinqs-git-lfs/custom-templates.tar.gz, then over SSM tells tinqs-prod-gitea to pull it, extract it into /data/gitea/templates/, and run docker restart gitea.

Repointed to SSM/EC2 on 2026-06-07. It previously ran aws ecs update-service --cluster tinqs-git, which failed with ClusterNotFoundException after the cluster was deleted on 2026-06-05. That is why the repo Wiki tab and theme CSS never went live. The runner role gained a scoped ssm:SendCommand permission (prod-gitea only).

Manual one-off (admin creds):

tar czf /tmp/custom-templates.tar.gz -C templates .
aws s3 cp /tmp/custom-templates.tar.gz s3://tinqs-git-lfs/custom-templates.tar.gz --region eu-west-1
IID=$(aws ec2 describe-instances --region eu-west-1 \
  --filters "Name=tag:Name,Values=tinqs-prod-gitea" "Name=instance-state-name,Values=running" \
  --query "Reservations[0].Instances[0].InstanceId" --output text)
aws ssm send-command --region eu-west-1 --instance-ids "$IID" \
  --document-name AWS-RunShellScript \
  --parameters 'commands=["aws s3 cp s3://tinqs-git-lfs/custom-templates.tar.gz /tmp/ct.tar.gz --region eu-west-1","tar xzf /tmp/ct.tar.gz -C /data/gitea/templates","docker restart gitea"]'

Note: a template change does not bump the platform version string in the footer (that tracks the Go binary build). Unchanged footer ≠ failed deploy.
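Since the footer cannot confirm a template deploy, one option is to ask SSM how the command ended. This is a minimal sketch: `ssm_command_status` is a hypothetical helper, and it assumes the command ID was captured from `send-command` (for example by adding `--query Command.CommandId --output text` to the call above).

```shell
# Sketch only: ssm_command_status is a hypothetical helper.
# Prints the final SSM status (Success, Failed, TimedOut, ...) for one instance.
ssm_command_status() {
  local cmd_id="$1" iid="$2" region="${3:-eu-west-1}"
  # The waiter exits non-zero on failure or timeout; ignore that here and
  # report the actual status from get-command-invocation instead.
  aws ssm wait command-executed --region "$region" \
    --command-id "$cmd_id" --instance-id "$iid" 2>/dev/null || true
  aws ssm get-command-invocation --region "$region" \
    --command-id "$cmd_id" --instance-id "$iid" \
    --query Status --output text
}

# Usage:
#   [ "$(ssm_command_status "$CMD_ID" "$IID")" = "Success" ] || echo "template deploy failed"
```

Anything other than `Success` suggests one of the three remote steps (S3 pull, extract, restart) did not complete.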

Rotate GITEA_TOKEN

  1. Generate a new token in Gitea: Settings → Applications → Generate Token
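  2. Update the Lambda: aws lambda update-function-configuration --function-name tinqs-ci-dispatch --environment ... (--environment replaces the whole variable set, so pass every existing variable, not only the token)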
  3. Old token is burned into running instances — they die within 30 min
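Because `update-function-configuration --environment` overwrites all variables at once, a token rotation can be sketched as read, merge, write back. The helper names `merge_env_var` and `set_lambda_env_var` are hypothetical, and the sketch assumes `jq` is installed; the same approach would apply to any single variable on the dispatcher.

```shell
# Sketch only: both helpers are hypothetical and require jq.
# merge_env_var reads the current Variables object on stdin and prints the
# --environment payload with NAME set to VALUE, keeping all other variables.
merge_env_var() {
  jq -c --arg k "$1" --arg v "$2" '{Variables: ((. // {}) + {($k): $v})}'
}

set_lambda_env_var() {
  local name="$1" value="$2" fn="${3:-tinqs-ci-dispatch}" region="${4:-eu-west-1}" payload
  payload=$(aws lambda get-function-configuration --region "$region" \
    --function-name "$fn" --query 'Environment.Variables' --output json |
    merge_env_var "$name" "$value")
  # Refuse to continue with an empty payload, which would drop every variable.
  [ -n "$payload" ] || return 1
  # Output is discarded because the response echoes the variable values.
  aws lambda update-function-configuration --region "$region" \
    --function-name "$fn" --environment "$payload" >/dev/null
}

# Usage:
#   set_lambda_env_var GITEA_TOKEN "$NEW_TOKEN"
```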

Rotate RUNNER_TOKEN

  1. Gitea admin → Actions → Runners → Create new registration token
  2. Update the Lambda env var (same update-function-configuration call as for GITEA_TOKEN)
  3. Running instances keep their existing registration until they die

Build a new AMI

aws ec2 run-instances --image-id ami-00a129385002e4de9 \
  --instance-type t3.small --key-name <your-key> --region eu-west-1 \
  --query 'Instances[0].InstanceId' --output text
# SSH in, update tools (Go, Node, Docker, act_runner), then:
aws ec2 create-image --region eu-west-1 --instance-id <id> --name tinqs-ci-runner-v3
# --environment replaces the whole variable set: keep the existing variables alongside RUNNER_AMI
aws lambda update-function-configuration --region eu-west-1 --function-name tinqs-ci-dispatch \
  --environment "Variables={...,RUNNER_AMI=ami-NEW,...}"
aws ec2 terminate-instances --region eu-west-1 --instance-ids <id>
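`create-image` returns before the AMI is usable, so it may be worth waiting for the `available` state before pointing the dispatcher at it. A minimal sketch, with `bake_runner_ami` as a hypothetical helper name:

```shell
# Sketch only: bake_runner_ami is a hypothetical helper.
# Creates the image, waits until it is available, and prints the new AMI ID.
bake_runner_ami() {
  local instance_id="$1" name="$2" region="${3:-eu-west-1}" ami
  ami=$(aws ec2 create-image --region "$region" --instance-id "$instance_id" \
    --name "$name" --query ImageId --output text) || return 1
  # Launching from an AMI that is still pending fails, so block until ready.
  aws ec2 wait image-available --region "$region" --image-ids "$ami" || return 1
  echo "$ami"
}

# Usage:
#   NEW_AMI=$(bake_runner_ami <id> tinqs-ci-runner-v3)
```

The printed ID is the value that would go into `RUNNER_AMI`.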

Add CI to a new repo

  1. Create .gitea/workflows/<name>.yml in the repo
  2. Add a per-repo webhook in Gitea: Settings → Webhooks → Add Webhook
    • URL: the dispatcher API Gateway URL · Events: Push · Content type: application/json
  3. Push a change matching the workflow trigger
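Step 2 can also be scripted against the Gitea API instead of the web UI. This is a sketch under assumptions: `add_ci_webhook` is a hypothetical helper, the base URL is taken from the job-log example on this page, and the dispatcher URL and token are supplied by the caller.

```shell
# Sketch only: add_ci_webhook is a hypothetical helper.
# Creates a push webhook with JSON content type via the Gitea repo hooks API.
add_ci_webhook() {
  local owner="$1" repo="$2" hook_url="$3" token="$4" base="${5:-https://tinqs.com}"
  curl -fsS -X POST "$base/api/v1/repos/$owner/$repo/hooks" \
    -H "Authorization: token $token" \
    -H "Content-Type: application/json" \
    -d "{\"type\":\"gitea\",\"active\":true,\"events\":[\"push\"],\"config\":{\"url\":\"$hook_url\",\"content_type\":\"json\"}}"
}

# Usage:
#   add_ci_webhook tinqs <repo> "<dispatcher API Gateway URL>" "<gitea-token>"
```

The token needs permission to manage the repo's webhooks; the UI route in step 2 stays the reference procedure.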

Monitoring

# Zombie check (should be 0 except during active builds)
aws ec2 describe-instances --region eu-west-1 \
  --filters "Name=tag:tinqs-ci,Values=true" "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].InstanceId'

# Dispatcher logs (MSYS_NO_PATHCONV=1 on Windows/Git Bash; or use PowerShell)
MSYS_NO_PATHCONV=1 aws logs tail '/aws/lambda/tinqs-ci-dispatch' --region eu-west-1

# Build/job logs
curl -s "https://tinqs.com/api/v1/repos/tinqs/studio/actions/jobs/<JOB_ID>/logs" \
  -H "Authorization: token <gitea-token>"

# Stale DynamoDB runs
aws dynamodb scan --region eu-west-1 --table-name tinqs-ci-runs \
  --filter-expression "#s = :r" \
  --expression-attribute-names '{"#s":"status"}' \
  --expression-attribute-values '{":r":{"S":"running"}}' --query Count
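The zombie check above cannot tell an active build from a stuck runner. One way to narrow it down is to list only runners older than a threshold. In this sketch `stale_runners` is a hypothetical helper, the 30-minute default is borrowed from the token-rotation note rather than a documented limit, and GNU `date -d` is assumed (it will not work with the stock macOS `date`).

```shell
# Sketch only: stale_runners is a hypothetical helper; needs GNU date.
# Prints "<instance-id> <age>m" for running CI instances older than N minutes.
stale_runners() {
  local max_min="${1:-30}" region="${2:-eu-west-1}" now id launched age
  now=$(date -u +%s)
  aws ec2 describe-instances --region "$region" \
    --filters "Name=tag:tinqs-ci,Values=true" "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].[InstanceId,LaunchTime]' --output text |
  while read -r id launched; do
    [ -n "$id" ] || continue
    # Age in whole minutes since the instance launched.
    age=$(( (now - $(date -u -d "$launched" +%s)) / 60 ))
    if [ "$age" -gt "$max_min" ]; then echo "$id ${age}m"; fi
  done
  return 0
}

# Usage:
#   stale_runners 30
```

Empty output means no running runner has exceeded the threshold.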

Contributing

New composite action: create <name>/action.yml (using: composite, shell: bash), keep it bash-only, add a <name>/README.md, list it in Architecture, push to main (resolves via @v1).

Modify the dispatcher: edit orchestrator/dispatch/main.go, go build . to catch errors, deploy manually (above), verify with a push to tinqs/studio.

New runner label: add to labelToSpot in main.go, create images/<label>/Dockerfile if needed, build/push (cd images && ./build-all.sh v1), deploy the Lambda, add runs-on: <label> to the consuming workflow.

Incidents

  • 25 May 2026 — 18 zombie runners DDoS-ing Gitea. Root cause: no --ephemeral on registration + no git auth after repos went private. Fix: --ephemeral + url.insteadOf git auth in user-data.
  • 07 Jun 2026 — all runs-on: deploy jobs silently dry-running (dead tinqs-ci-exec route) + arikigame IAM bucket mismatch + template deploy pointing at the deleted ECS cluster. All fixed; see Architecture and the template-deploy note above.