Deployment Overview
Date: 2026-05-14
Scope: How code becomes production at Rōvn: environments, CI/CD pipeline, migrations, marketing-site deploys, observability, incident response, rollback, secrets rotation.
Posture: LIVE for production deploy pipeline (S3 → CodeBuild → ECR → ECS Fargate). PARTIAL for fully isolated staging environment.
1. Environments
| Environment | Domains | Purpose | Status |
|---|---|---|---|
| Production | rovn.to, passport.rovn.to, app.rovn.to, passport.rovn.to/mcp | The live product surface: workers, facilities, partners | LIVE |
| Investor portal | (separate Cloudflare Pages project, gated) | Diligence document distribution | LIVE |
| Staging | TBD (pre-design-partner) | Mirror of prod for pre-launch validation | PARTIAL |
| Local dev | localhost:8000 + docker-compose | Engineer workstations | LIVE |
The production deploy pipeline is the same one used for the investor portal and marketing site. Staging is intentionally PARTIAL today: pre-launch, the design-partner pilots run on production with feature flags gating who sees what. A fully isolated staging environment lights up when Pilot tier flips from "design partner" to "GA pilot."
2. Production deploy pipeline (S3 → CodeBuild → ECR → ECS)
This is the path documented in memory log reference_rovn_deploy_mechanic.md and verified end-to-end on each production deploy (current production revision: rovn-passport-api:288).
Engineer pushes commit to main branch
│
▼
GitHub Actions: build, lint, unit tests
│
▼
Source zip uploaded to S3 (CodeBuild source bucket)
│
▼
CodeBuild: Docker image build
- multi-stage Dockerfile
- python:3.11-slim base
- non-root user
- dependency cache layer
│
▼
ECR: push image tagged prod-YYYYMMDDhhmmss[-suffix]
│
▼
ECS task definition: register new revision
- image ARN bumped to new ECR tag
- secret refs unchanged
- env vars unchanged
- cpu / memory unchanged unless explicit
│
▼
ECS service: update-service to new task def revision
│
▼
Rolling deploy across AZs (multi-AZ)
- new tasks come up
- ALB health checks must pass
- old tasks drained gracefully
│
▼
Health check on /health passes → traffic shifts
│
▼
Old task definition retained (rollback target)
Critical operational rule (per memory log): force-new-deployment alone is a no-op. Every prod deploy registers a new task definition revision. The deploy mechanic was verified end-to-end most recently on the 2026-05-27 production deploy: rovn-passport-api:288 / prod-202605270526-ai-competitive-fix (git tag prod-ai-competitive-fix-2026-05-27).
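The rule above can be sketched as a pure function that derives the next task definition from the current one and changes only the image. This is a minimal illustration, not the actual deploy script (which this overview does not show): the dict keys follow the ECS task-definition shape (containerDefinitions, image, secrets, environment), while the function names, container name, account ID, and repository URI are hypothetical placeholders.

```python
import copy
from datetime import datetime, timezone


def make_image_tag(now: datetime, suffix: str = "") -> str:
    """Build a tag in the prod-YYYYMMDDhhmmss[-suffix] form."""
    tag = "prod-" + now.strftime("%Y%m%d%H%M%S")
    return f"{tag}-{suffix}" if suffix else tag


def next_task_definition(current: dict, container_name: str, image_uri: str) -> dict:
    """Return a copy of `current` with one container's image bumped.

    Secret refs, env vars, cpu, and memory are carried over unchanged.
    """
    new_def = copy.deepcopy(current)
    for container in new_def["containerDefinitions"]:
        if container["name"] == container_name:
            container["image"] = image_uri
    return new_def


# Hypothetical current revision; every value here is a placeholder.
REPO = "123456789012.dkr.ecr.us-east-2.amazonaws.com/example"
current = {
    "family": "rovn-passport-api",
    "cpu": "1024",
    "memory": "2048",
    "containerDefinitions": [{
        "name": "api",
        "image": f"{REPO}:prod-old",
        "secrets": [{"name": "DB_URL", "valueFrom": "arn:aws:secretsmanager:example"}],
        "environment": [{"name": "LOG_FORMAT", "value": "json"}],
    }],
}

tag = make_image_tag(datetime(2026, 5, 27, 5, 26, 0, tzinfo=timezone.utc), "example")
new_def = next_task_definition(current, "api", f"{REPO}:{tag}")
print(tag)
```

The resulting dict would be registered as a new revision and the service then pointed at that revision; those AWS calls are omitted here so the sketch stays self-contained.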
3. Database migrations
- Migration tooling: custom Alembic-style runner (apply_migrations.py in rovn-platform/migrations/).
- Schema state today: 89+ migration files numbered sequentially (plus the 2026_04_14_audit_log_harden.sql hotfix). The full list is in the migrations folder.
- Forward-only. No DOWN migrations in production. Reversal happens by writing a new forward migration.
- Idempotent. Every migration is wrapped to be safely re-runnable (e.g., CREATE TABLE IF NOT EXISTS, ALTER TABLE ... ADD COLUMN IF NOT EXISTS).
- Order rule: migrations run before the new ECS task definition takes traffic. Deploy script blocks on migration completion + integrity check.
- Pre-deploy snapshot: every migration deploy is preceded by a manual RDS snapshot (named with the migration filename and timestamp). Snapshot retention is 90 days for these manual snapshots.
- PHI columns. Migrations that touch PHI columns require two-engineer review per repo CODEOWNERS rule.
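The forward-only, idempotent, ordered behavior described in this list can be sketched as follows. This is not the contents of apply_migrations.py, which this overview does not show. SQLite stands in for the production database only to keep the example self-contained, and the schema_migrations ledger table, the function signature, and the migration filenames are all hypothetical.

```python
import sqlite3


def apply_migrations(
    conn: sqlite3.Connection, migrations: dict[str, list[str]]
) -> list[str]:
    """Apply pending migrations in filename order; return the names applied.

    `migrations` maps a filename to its SQL statements. A ledger table records
    what has already run, so a second call applies nothing.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)"
    )
    done = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}
    applied = []
    for name in sorted(migrations):  # sequential numbering gives the order
        if name in done:
            continue
        with conn:  # commits the ledger row once the statements succeed
            for statement in migrations[name]:
                conn.execute(statement)
            conn.execute(
                "INSERT INTO schema_migrations (name) VALUES (?)", (name,)
            )
        applied.append(name)
    return applied


conn = sqlite3.connect(":memory:")
migrations = {
    "0002_add_index.sql": [
        "CREATE INDEX IF NOT EXISTS idx_example_name ON example (name)"
    ],
    "0001_create_example.sql": [
        "CREATE TABLE IF NOT EXISTS example (id INTEGER PRIMARY KEY, name TEXT)"
    ],
}
first_run = apply_migrations(conn, migrations)
second_run = apply_migrations(conn, migrations)
print(first_run, second_run)
```

There is no "down" path in this sketch: undoing 0002 would mean adding a 0003 file, which matches the forward-only rule.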
4. Marketing site and investor portal
- Marketing site (rovn.to): Cloudflare Pages, project rovn-design. Build on push to main branch. Zero PHI surface.
- App route shell (rovn.to/login, /signup/*, /nurse, /hospital, /facility, etc.): also served from the rovn-design Cloudflare Pages project per the 2026-05-11 unified-domain note. Routes call the FastAPI backend at passport.rovn.to for data.
- Investor portal: separate Cloudflare Pages project, separate domain, gated. Distributes diligence docs.
- DNS: Cloudflare-managed for rovn.to and app.rovn.to. Apex rovn.to and passport.rovn.to resolve to AWS-side services: orange-cloud (proxied) for marketing, gray-cloud (DNS-only) for passport.rovn.to, so PHI traffic does not route via the Cloudflare edge.
5. Monitoring and observability
| Layer | Tool | What we watch |
|---|---|---|
| Application errors | Sentry | Unhandled exceptions, deploy regressions |
| Logs | CloudWatch Logs (structured JSON) | Per-request logs, PHI scrubbed before write |
| Metrics | CloudWatch Metrics | ECS service health, RDS CPU / connections, ALB 5xx, request RPS |
| Alarms | CloudWatch Alarms → Slack + PagerDuty | 5xx > 1% / 5 min, RDS CPU > 85% sustained, ECS running tasks < desired |
| Distributed trace | AWS X-Ray | 10% sample steady-state; 100% on /admin/* and /audit/* |
| Synthetic | CloudWatch Synthetics canary | Hits /health every 30s from us-east-2 |
| Compliance | Drata | SOC 2 evidence collection, control drift |
| Cost | AWS Budgets + Anthropic API dashboards | Per-tenant token + infra spend |
P0 alarms wake the on-call rotation. P1 alarms are Slack-only during business hours.
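Two of the thresholds in the table can be written down as small decision functions. This is only an illustration of the intended rules: in practice CloudWatch Alarms and X-Ray sampling rules are configured in AWS rather than in application code, and the function and constant names below are hypothetical.

```python
import random

ALWAYS_TRACE_PREFIXES = ("/admin/", "/audit/")  # 100% sampled
STEADY_STATE_RATE = 0.10                        # 10% sampled everywhere else


def sample_rate(path: str) -> float:
    """Return the trace sampling rate for a request path."""
    if path.startswith(ALWAYS_TRACE_PREFIXES):
        return 1.0
    return STEADY_STATE_RATE


def should_trace(path: str, rng: random.Random) -> bool:
    """Decide whether to trace this request, using the caller's RNG."""
    return rng.random() < sample_rate(path)


def breaches_5xx_threshold(total_requests: int, errors_5xx: int) -> bool:
    """True when 5xx responses exceed 1% of requests in a 5-minute window."""
    # Integer comparison avoids floating-point edge cases at exactly 1%.
    return errors_5xx * 100 > total_requests


print(sample_rate("/admin/users"), sample_rate("/health"))
print(breaches_5xx_threshold(1000, 11))
```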
6. Incident response
- On-call rotation: Giles (primary) · Christian (backup) · engineering on-call (extended-hours tertiary). PagerDuty schedules and overrides documented in RUNBOOK.md.
- P0 definition: customer-impacting outage, data-integrity event, suspected PHI exposure, suspected security incident. Escalate within 5 minutes.
- P1 definition: degraded but not down. Ack within 15 minutes during business hours.
- Post-mortem discipline. Every P0 gets a written post-mortem within 24 hours, distributed to the team and (when relevant) to affected design partners. Drafts use a blameless template.
- Status page: TARGET. Pre-launch, communication runs through direct partner contact.
7. Rollback
Two rollback paths exist:
- ECS rollback (most common). Update ECS service to the previous task definition revision. Image ARN reverts to the prior prod-* tag. Traffic shifts back on health-check pass. RTO ~3 minutes from decision to traffic.
- Database migration reversal. Because migrations are forward-only, a "rollback" is a new forward migration that undoes the prior change. For non-PHI columns, this is the standard path. For PHI columns, additional review is required.
Operational rule: never roll back the ECS task without first confirming whether a migration ran in the prior deploy. If a migration changed the schema in a way that the prior image cannot read, the rollback is a forward migration plus ECS rollback. The deploy script logs each migration in audit_log, which is the source of truth.
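The operational rule can be read as a small decision procedure: check what the deploy migrated before touching ECS. The sketch below assumes a hypothetical DeployRecord assembled from the audit_log entries; the field names and step wording are illustrative, and judging whether the prior image can read the new schema remains a human call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployRecord:
    revision: int                    # task definition revision being rolled back
    previous_revision: int           # retained rollback target
    migrations: tuple[str, ...]      # migrations logged for this deploy
    schema_readable_by_prior: bool   # can the prior image read the new schema?


def plan_rollback(deploy: DeployRecord) -> list[str]:
    """Return the ordered rollback steps for a deploy."""
    steps = []
    if deploy.migrations and not deploy.schema_readable_by_prior:
        # Forward-only: undo the schema change with a new migration first.
        steps.append(
            "write and apply a forward migration undoing: "
            + ", ".join(deploy.migrations)
        )
    steps.append(
        f"update ECS service to task definition revision {deploy.previous_revision}"
    )
    return steps


code_only = DeployRecord(288, 287, (), True)
with_schema_change = DeployRecord(288, 287, ("0090_example.sql",), False)
print(plan_rollback(code_only))
print(plan_rollback(with_schema_change))
```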
8. Secrets rotation
- Store: AWS Secrets Manager only. No secrets in source. No secrets in env-var task-definition fields (refs to Secrets Manager ARNs only).
- Rotation cadence:
  - Database master keys: 90 days
  - Anthropic API key: 90 days
  - Persona, Checkr, WorkOS, Stripe API keys: 90 days
  - JWT signing keys: 180 days, rolling (old key remains valid for in-flight tokens)
  - MCP server outbound + inbound tokens: 90 days
- Drift monitoring: Drata watches for new IAM principals granted secretsmanager:GetSecretValue; deviations from the allow-list page ops.
- Audit: every rotation event writes to audit_log and Slack-notifies the security channel.
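The cadence list lends itself to a simple due-date check. The sketch below is illustrative only: the key names in ROTATION_DAYS and the function names are hypothetical, and it assumes last-rotation dates are available from wherever rotation events are recorded.

```python
from datetime import date, timedelta

# Cadences from the list above; the dictionary keys are hypothetical labels.
ROTATION_DAYS = {
    "database_master": 90,
    "anthropic_api": 90,
    "vendor_api": 90,       # Persona, Checkr, WorkOS, Stripe
    "jwt_signing": 180,
    "mcp_tokens": 90,
}


def next_rotation(kind: str, last_rotated: date) -> date:
    """Date by which a secret of this kind should be rotated again."""
    return last_rotated + timedelta(days=ROTATION_DAYS[kind])


def overdue(last_rotated: dict[str, date], today: date) -> list[str]:
    """Secret kinds whose rotation date has already passed, sorted by name."""
    return sorted(
        kind for kind, when in last_rotated.items()
        if next_rotation(kind, when) < today
    )


last = {"anthropic_api": date(2026, 1, 1), "jwt_signing": date(2026, 1, 1)}
print(overdue(last, today=date(2026, 5, 14)))
```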
9. CodeBuild + Docker hygiene
- Image base: python:3.11-slim pinned by SHA digest.
- Non-root user: all containers run as a non-root UID.
- Read-only root FS in production task def (writable tmpfs mount only).
- Image scanning: ECR image scan on push (basic) + scheduled Snyk scan in CI.
- Dependency lockfile: uv.lock / requirements.lock committed; CI fails on lockfile drift.
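Two of these rules (non-root user, read-only root filesystem) are visible in the task definition itself, so they can be linted before a revision is registered. The sketch below checks the ECS container-definition fields readonlyRootFilesystem and user; the function name and the idea of running this as a pre-deploy check are assumptions, not something this overview states.

```python
def hygiene_violations(container: dict) -> list[str]:
    """Return hygiene problems found in one ECS container definition."""
    problems = []
    if not container.get("readonlyRootFilesystem", False):
        problems.append("root filesystem is writable")
    # `user` may be "uid", "uid:gid", "name", or "name:group"; keep the user part.
    user = str(container.get("user", "")).split(":")[0]
    if user in ("", "0", "root"):
        problems.append("container runs as root or has no explicit user")
    return problems


print(hygiene_violations({"readonlyRootFilesystem": True, "user": "1000:1000"}))
print(hygiene_violations({"user": "root"}))
```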
10. Deploy authorization
Per memory log reference_rovn_deploy_auth.md:
- IAM user claude-deploy is the only programmatic identity (besides break-glass humans) authorized to ECR push and ECS service update.
- Scope: PowerUser + IAMFull + ArtifactSync. Trimmed to deploy-only at the SCP level.
- Protocol: preview → confirm → execute → verify → log on every prod write.
- Audit chain captures every deploy event.
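The preview → confirm → execute → verify → log protocol can be sketched as a wrapper that refuses to execute without confirmation and always logs the outcome. The function and its callback parameters are hypothetical; how each step is actually implemented is not described in this overview.

```python
from typing import Callable


def run_prod_write(
    preview: Callable[[], str],
    confirm: Callable[[str], bool],
    execute: Callable[[], None],
    verify: Callable[[], bool],
    log: Callable[[str], None],
) -> bool:
    """Run one production write through the five protocol steps in order."""
    plan = preview()
    if not confirm(plan):
        log(f"aborted: {plan}")
        return False
    execute()
    ok = verify()
    log(f"{'verified' if ok else 'FAILED verification'}: {plan}")
    return ok


events: list[str] = []
result = run_prod_write(
    preview=lambda: "update service to new revision",
    confirm=lambda plan: True,
    execute=lambda: events.append("executed"),
    verify=lambda: True,
    log=events.append,
)
print(result, events)
```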
11. What this overview does not claim
- We do not claim a fully isolated staging environment today: PARTIAL (feature-flagged production for pre-launch).
- We do not claim a public-facing status page is live: TARGET.
- We do not claim automated cross-region failover. DR is multi-AZ active + cross-region cold standby; failover is documented but manual.
- We do not claim fully automated dependency-update PRs (Renovate / Dependabot). Coverage is partial today; full coverage is on the post-close roadmap.
End of overview.