ci/wiki/DevOps-Reference.md
ozan 33f967e42e docs: convert ci docs to the in-repo wiki/ standard + fix stale ECS facts
Adopt the team wiki convention (in-repo wiki/ folder, plain markdown) used in
tinqs/studio. Convert DEVOPS.md + PLAN.md and the heavy parts of README.md
into cross-linked wiki pages: Home, Architecture, DevOps-Reference,
Operations, Roadmap. Root README slimmed to a repo intro pointing at wiki/.

Corrects stale topology while converting:
- ECS cluster tinqs-git / EFS tinqs-git-repos retired 2026-06-05; platform now
  the standalone EC2 box tinqs-prod-gitea (ALB tinqs-git, ECR image, RDS).
- Records this session's fixes: deploy-label dry-run route, runner-name
  collisions, arikigame IAM bucket, and template deploy repointed ECS→EC2/SSM.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 20:43:05 +01:00

# DevOps Reference
[← Home](README.md) · [Architecture](Architecture.md) · [Operations](Operations.md) · [Roadmap](Roadmap.md)
## AWS resources (eu-west-1)
| Resource | Name/ID | Purpose |
|----------|---------|---------|
| Lambda | `tinqs-ci-dispatch` | Webhook handler + Spot launcher |
| DynamoDB | `tinqs-ci-runs` | Run tracking (repo, run_id, instance_id, status) |
| AMI | `tinqs-ci-runner-v2` (ami-00a129385002e4de9) | Pre-baked runner (Go, Node, Docker, act_runner) |
| Security Group | sg-030bf74b43d3faac7 | Runner SG (outbound HTTPS) |
| Subnet | subnet-04b5aeec9bfc4ec2c | Default VPC subnet |
| Instance Profile | `tinqs-ci-runner` → role `tinqs-git-task` | Runner IAM role (S3, ECR, SSM) |
| CloudWatch | /aws/lambda/tinqs-ci-dispatch | Dispatcher logs |
| API Gateway | `q4ohxovfr8…/webhook` | Receives the per-repo Gitea push webhook |
### Platform host (NOT CI — context)
| Resource | Name/ID | Purpose |
|----------|---------|---------|
| EC2 | `tinqs-prod-gitea` (i-0d085288f467083e0, t3.medium) | Runs tinqs.com as a single `docker` Gitea container |
| ALB | `tinqs-git` | Fronts the platform |
| ECR | `tinqs-git:latest` | Platform image (built by `build.yml` → CodeBuild) |
| RDS | `tinqs-prod` (PostgreSQL) | Platform DB |
The platform mounts host `/data`; `GITEA_CUSTOM=/data/gitea`, so **custom templates live at `/data/gitea/templates/`**. Template-only changes deploy here via SSM — see [Operations](Operations.md).
### Retired resources
| Resource | When / why |
|----------|------------|
| ECS Cluster `tinqs-git` | Deleted **2026-06-05** — platform moved to the `tinqs-prod-gitea` EC2 box |
| EFS `tinqs-git-repos` | Retired in the 2026-06-05 EC2 migration (repos now on instance `/data`) |
| Lambda `tinqs-ci-exec` | Deleted **2026-05-26** — never ran a build; deploy jobs go through Spot now |
| CloudWatch `/aws/lambda/tinqs-ci-exec`, `/ecs/tinqs-runner` | Log groups for the above / the Fargate era |
| Fargate runner service | Scaled to 0 then removed |
## Webhook flow
```
Gitea (tinqs.com)
  └─ per-repo webhook on push
       └─ POST https://<api-gw>/webhook
            └─ Lambda tinqs-ci-dispatch
                 ├─ Fetch .gitea/workflows/*.yml via Gitea API
                 ├─ Evaluate triggers (branch + path filters)
                 ├─ For each matched workflow:
                 │    ├─ Read runs-on label
                 │    └─ RunInstances (Spot, ephemeral)  [host → skipped]
                 └─ Track in DynamoDB
```
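The trigger evaluation and label routing in the flow above can be sketched roughly as follows. This is an illustration only, not the dispatcher's actual code: the function names, the already-parsed workflow shape, and the `spot` label are hypothetical, and it reads `[host → skipped]` as "workflows whose `runs-on` label is `host` never launch a Spot instance". It uses `fnmatch`, whose `*` also matches `/`, so the glob semantics may differ from what Gitea Actions applies.

```python
from fnmatch import fnmatch


def matches_push(trigger: dict, branch: str, changed_paths: list) -> bool:
    """Return True if a push to `branch` touching `changed_paths` passes the filters."""
    branches = trigger.get("branches")
    paths = trigger.get("paths")
    # A missing filter means "no restriction" for that dimension.
    if branches and not any(fnmatch(branch, pat) for pat in branches):
        return False
    if paths and not any(fnmatch(p, pat) for p in changed_paths for pat in paths):
        return False
    return True


def plan_launches(workflows: dict, branch: str, changed_paths: list) -> list:
    """Map parsed workflows to (workflow_name, label) pairs that need a Spot runner.

    `workflows` is a hypothetical pre-parsed shape:
    {"build.yml": {"push": {"branches": [...], "paths": [...]}, "runs_on": "spot"}}
    """
    launches = []
    for name, wf in workflows.items():
        push = wf.get("push")
        if push is None or not matches_push(push, branch, changed_paths):
            continue
        label = wf["runs_on"]
        if label == "host":
            # Assumed meaning of "[host → skipped]" in the diagram.
            continue
        launches.append((name, label))
    return launches
```

For example, a push to `main` that touches `src/a/b.go` would select a workflow filtered on `branches: [main]` and `paths: [src/**]`, while a `host`-labelled workflow with the same filters would be left alone.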
## Spot instance lifecycle
```
1. Lambda calls RunInstances (Spot, InstanceInitiatedShutdownBehavior=terminate)
2. User-data runs:
   a. Configure git auth (url.insteadOf with GITEA_TOKEN)
   b. act_runner register --ephemeral --labels <label>:host
   c. act_runner daemon (blocks until job completes)
   d. EXIT trap fires → shutdown -h now → instance terminates
3. DynamoDB record: running → completed (or timeout after 30 min cleanup)
```
Offline runners listed in Gitea admin → Actions → Runners are **normal** — they're spent ephemeral registrations, not a fault.
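A minimal sketch of how the launch request and user-data for this lifecycle might be assembled, assuming a Python dispatcher that passes the result to boto3's `ec2.run_instances(**params)`. The function names are hypothetical, the `Name` tag key is a guess (the source only says instances are tagged `tinqs-ci`), and the token-in-URL form of `url.insteadOf` plus the `--no-interactive`, `--instance` and `--token` flags are assumptions about how the documented steps are wired up.

```python
def render_user_data(gitea_url: str, gitea_token: str, runner_token: str, label: str) -> str:
    """Build the boot script: git auth, ephemeral registration, daemon, self-shutdown."""
    host = gitea_url.removeprefix("https://")
    lines = [
        "#!/bin/bash",
        # Step d: whatever happens below, power off on exit so the instance terminates.
        "trap 'shutdown -h now' EXIT",
        # Step a: rewrite clone URLs to carry the token (assumed URL form).
        f'git config --global url."https://{gitea_token}@{host}/".insteadOf "https://{host}/"',
        # Step b: one-shot registration with the workflow's label.
        "act_runner register --no-interactive --ephemeral "
        f"--instance {gitea_url} --token {runner_token} --labels {label}:host",
        # Step c: blocks until the job completes.
        "act_runner daemon",
    ]
    return "\n".join(lines) + "\n"


def build_run_instances_params(ami: str, subnet: str, security_group: str,
                               instance_profile: str, user_data: str,
                               instance_type: str = "t3.small") -> dict:
    """Keyword arguments for EC2 RunInstances requesting one ephemeral Spot runner."""
    return {
        "ImageId": ami,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "SubnetId": subnet,
        "SecurityGroupIds": [security_group],
        "IamInstanceProfile": {"Name": instance_profile},
        # With this set, an OS-level shutdown terminates the instance instead of stopping it.
        "InstanceInitiatedShutdownBehavior": "terminate",
        "InstanceMarketOptions": {
            "MarketType": "spot",
            "SpotOptions": {"SpotInstanceType": "one-time"},
        },
        "UserData": user_data,  # boto3 base64-encodes this for you
        "TagSpecifications": [{
            "ResourceType": "instance",
            "Tags": [{"Key": "Name", "Value": "tinqs-ci"}],  # tag key is an assumption
        }],
    }
```

Keeping the parameter construction separate from the API call makes the request easy to inspect in a unit test without AWS credentials.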
## Cleanup cron
The dispatcher Lambda also handles cleanup when invoked with an empty body or `{"action":"cleanup"}`. EventBridge triggers this every 5 minutes. Each cleanup pass:
- Scans DynamoDB for runs older than 30 min with `status=running`
- Terminates matching EC2 instances
- Sweeps for orphan instances (tagged `tinqs-ci`, running > 30 min)
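The routing and the staleness check described above could look something like the sketch below. It is not the dispatcher's real code: the event shape (an API Gateway style `body` string), the `started_at` field, and both function names are assumptions made for illustration. The source only documents `repo`, `run_id`, `instance_id` and `status` on the table.

```python
import json
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=30)


def is_cleanup_event(event: dict) -> bool:
    """True for an empty body or a JSON body of {"action": "cleanup"}."""
    body = event.get("body")
    if not body:
        return True
    try:
        payload = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        return False
    return isinstance(payload, dict) and payload.get("action") == "cleanup"


def stale_instance_ids(runs: list, now: datetime) -> list:
    """Instance IDs of runs still marked running after more than 30 minutes."""
    return [
        r["instance_id"]
        for r in runs
        if r["status"] == "running" and now - r["started_at"] > STALE_AFTER
    ]
```

The returned IDs would then be handed to an EC2 terminate call. The orphan sweep would apply the same 30-minute threshold to instance launch times rather than to DynamoDB rows.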
## Lambda env vars
Configured in the AWS console, not in code:
| Var | Purpose |
|-----|---------|
| `GITEA_URL` | `https://tinqs.com` |
| `GITEA_TOKEN` | API token — fetches workflows AND provides runner git auth |
| `RUNNER_TOKEN` | act_runner registration token (Gitea admin → Runners) |
| `RUNNER_AMI` | Pre-baked AMI ID |
| `SUBNET` | VPC subnet for Spot instances |
| `SECURITY_GROUP` | SG allowing outbound HTTPS |
| `DDB_TABLE` | DynamoDB run-tracking table (`tinqs-ci-runs`) |
| `INSTANCE_PROFILE` | IAM instance profile for runners |
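Because these values live only in the console, a handler may want to fail fast when one is missing. A minimal sketch, assuming a Python handler; `DispatchConfig` and `load_config` are hypothetical names, not part of the deployed code:

```python
import os
from dataclasses import dataclass

REQUIRED_VARS = (
    "GITEA_URL", "GITEA_TOKEN", "RUNNER_TOKEN", "RUNNER_AMI",
    "SUBNET", "SECURITY_GROUP", "DDB_TABLE", "INSTANCE_PROFILE",
)


@dataclass(frozen=True)
class DispatchConfig:
    gitea_url: str
    gitea_token: str
    runner_token: str
    runner_ami: str
    subnet: str
    security_group: str
    ddb_table: str
    instance_profile: str


def load_config(env=os.environ) -> DispatchConfig:
    """Read every documented variable, raising one error that names all missing ones."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing Lambda env vars: " + ", ".join(missing))
    return DispatchConfig(**{name.lower(): env[name] for name in REQUIRED_VARS})
```

Passing `env` as a parameter keeps the loader testable with a plain dictionary.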
## Runner IAM role (`tinqs-git-task`)
Inline policies of note:
- `tinqs-ci-s3`: R/W on `tinqs-cli-releases`, `arikigame-com-website`, `docs.tinqs.com` *(corrected 2026-06-07: was the non-existent `arikigame.com`, which broke the arikigame deploy)*
- `tinqs-git-s3`: R/W on `tinqs-git-lfs`, `tinqs-git-preview`
- `tinqs-ci-deploy`: ECR push, CloudFront `CreateInvalidation`, (legacy ECS update)
- `tinqs-ci-ssm-deploy`: `ec2:DescribeInstances` + `ssm:SendCommand` **scoped to the `tinqs-prod-gitea` instance** (added 2026-06-07 for template deploys)
- `ssm-exec`: Session Manager channels
- `ec2-self-terminate`: terminate own `tinqs-ci`-tagged instance
## Cost
| Component | Estimated monthly cost |
|-----------|----------------------|
| Spot instances (t3.small, ~10 min/build, ~5 builds/day) | ~$1-2 |
| Lambda (< 1000 invocations/month) | ~$0 (free tier) |
| DynamoDB (< 1 GB, low RCU/WCU) | ~$0 (free tier) |
| CloudWatch logs | ~$0.50 |
| **Total CI** | **~$2-3/month** |
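The usage assumptions in the Spot row convert to instance-hours as sketched below. The hourly rate is deliberately a parameter: the source does not state one, Spot prices vary over time, and storage or data-transfer charges are not modelled here. Both function names are hypothetical.

```python
def monthly_spot_hours(minutes_per_build: float = 10,
                       builds_per_day: float = 5,
                       days_per_month: int = 30) -> float:
    """Instance-hours per month implied by the table's usage assumptions."""
    return minutes_per_build * builds_per_day * days_per_month / 60


def monthly_spot_cost(hourly_rate_usd: float, **usage) -> float:
    """Compute-only cost; look up the current t3.small Spot rate for eu-west-1."""
    return monthly_spot_hours(**usage) * hourly_rate_usd
```

With the defaults this comes to 25 instance-hours per month, so the compute line scales linearly with build count and duration.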