Runbook
Date: 2026-05-14
Status: Drafted; not yet exercised against a paying-customer incident (no paying production traffic).
The full operational runbook will be expanded once platform engineering embedded ops and the on-call rotation are formalized post-close. This document captures today's deploy, rollback, and incident triage posture.
1. Deploy
Standard deploy path
- Local commit to `main` (or feature branch → PR merge)
- CI uploads source ZIP to S3 source bucket
- CodeBuild runs `buildspec.yml` → builds Docker image → tags `prod-<datetime>-<feature>` → pushes to ECR
- New ECS task definition revision registered with new image tag
- ECS service update with new task def revision (blue/green)
- Health check passes → traffic shifts
- Old task definition retained for rollback
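The tag format used in the CodeBuild step can be illustrated with a small helper. This is a sketch, not the contents of `buildspec.yml`: the function name `build_image_tag` is hypothetical, the 14-digit `YYYYMMDDHHMMSS` layout is inferred from the verified tag `prod-20260510220219-codex-design-system-v2`, and defaulting to UTC is an assumption because the build's timezone is not stated here.

```python
from datetime import datetime, timezone
from typing import Optional


def build_image_tag(feature: str, built_at: Optional[datetime] = None) -> str:
    """Return an image tag shaped like prod-<datetime>-<feature>."""
    # Assumption: timestamp layout is YYYYMMDDHHMMSS; UTC "now" is used
    # when the caller does not pass an explicit build time.
    moment = built_at or datetime.now(timezone.utc)
    return f"prod-{moment.strftime('%Y%m%d%H%M%S')}-{feature}"


tag = build_image_tag("codex-design-system-v2", datetime(2026, 5, 10, 22, 2, 19))
print(tag)  # prod-20260510220219-codex-design-system-v2
```

Because every build produces a unique tag, an existing task definition never picks up a new image implicitly; a new revision has to reference the new tag.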
Critical rule (from reference_rovn_deploy_mechanic.md)
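force-new-deployment alone is a no-op for shipping new code: it restarts tasks on the current task definition, which still references the previous image tag. Always register a new task definition revision after CodeBuild. Verified pattern: task def :86 → prod-20260510220219-codex-design-system-v2.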
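One way to picture the "register a new revision" step is as a transformation of the task definition that ECS returns from a describe call. The sketch below is illustrative only: `next_revision_payload`, the container name `app`, and the registry URI are hypothetical, and the actual deploy tooling may do this differently. It only builds the payload; submitting it (for example through boto3's `register_task_definition(**payload)`) is left to the caller.

```python
import copy

# Fields ECS includes when describing a task definition but does not
# accept as input when registering a new revision.
READ_ONLY_FIELDS = (
    "taskDefinitionArn", "revision", "status", "requiresAttributes",
    "compatibilities", "registeredAt", "registeredBy", "deregisteredAt",
)


def next_revision_payload(task_def: dict, container_name: str, image_uri: str) -> dict:
    """Copy a described task definition and point one container at a new image."""
    payload = copy.deepcopy(task_def)  # never mutate the caller's copy
    for field in READ_ONLY_FIELDS:
        payload.pop(field, None)
    matched = False
    for container in payload["containerDefinitions"]:
        if container["name"] == container_name:
            container["image"] = image_uri
            matched = True
    if not matched:
        raise ValueError(f"no container named {container_name!r}")
    return payload


# Hypothetical example input, trimmed to the fields that matter here.
current = {
    "family": "api-task",
    "revision": 85,
    "status": "ACTIVE",
    "containerDefinitions": [{"name": "app", "image": "registry.example/app:prod-old"}],
}
payload = next_revision_payload(
    current, "app", "registry.example/app:prod-20260510220219-codex-design-system-v2"
)
print(payload["containerDefinitions"][0]["image"])
```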
2. Rollback
- Identify last known-good task definition revision in ECS
- `aws ecs update-service --task-definition <known-good-revision>` (do not use `force-new-deployment` on its own)
- Watch CloudWatch alarms + Sentry error rate for 10 minutes
- Confirm `/health` returning 200 from new tasks
- Drain old tasks once new tasks are stable
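The rollback command above is abbreviated; in practice `aws ecs update-service` also needs the cluster and service. A minimal sketch of assembling the full argument list follows. The cluster, service, and family names are placeholders, and revision 85 is only an example of a "known-good" revision, not a recorded one.

```python
import shlex
from typing import List


def rollback_command(cluster: str, service: str, family: str, revision: int) -> List[str]:
    """Build the update-service call that pins the service to a known-good revision."""
    # Deliberately no --force-new-deployment: the explicit revision drives the rollback.
    return [
        "aws", "ecs", "update-service",
        "--cluster", cluster,
        "--service", service,
        "--task-definition", f"{family}:{revision}",
    ]


# Hypothetical names; substitute the real cluster, service, and family.
cmd = rollback_command("prod-cluster", "api-service", "api-task", 85)
print(shlex.join(cmd))
```

Returning a list rather than a string keeps the command safe to pass to `subprocess.run` without shell quoting concerns.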
3. Health
- ECS task health: `/health` endpoint (200 = healthy)
- Database: RDS CloudWatch alarms on CPU, connections, replication lag
- Sentry smoke: admin-only `/debug/sentry` route raises a controlled exception to verify wiring
- Synthetic monitor: target, not yet in place; add Datadog / CloudWatch Synthetics on key surfaces
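The `/health` contract (200 = healthy, anything else unhealthy) can be checked with a few lines of standard-library code. This is a sketch under assumptions: `probe_health` is a hypothetical helper, the base URL is supplied by the caller, and the `opener` parameter exists only so the check can be exercised without a network.

```python
import urllib.error
import urllib.request


def probe_health(base_url: str, opener=urllib.request.urlopen, timeout: float = 5.0) -> bool:
    """Return True only when GET <base_url>/health answers with HTTP 200."""
    url = base_url.rstrip("/") + "/health"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection failures, timeouts, and non-2xx responses raised by
        # urlopen as HTTPError all count as unhealthy.
        return False
```

During a rollback, this could be called in a loop against the new tasks for the 10-minute watch window.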
4. Incident severity matrix
| Severity | Definition | Response time | Owner |
|---|---|---|---|
| Sev 1 | Customer-impacting outage or PHI exposure suspected | Immediate (<15 min) | Founder rotation, engineering on-call |
| Sev 2 | Degraded service or partial functionality loss | <2 hours | Engineering on-call |
| Sev 3 | Internal-only or workaround available | <24 hours | Engineering daytime |
| Sev 4 | Cosmetic / minor | Next sprint | Engineering backlog |
Founder rotation per INCIDENT_RESPONSE.md: Giles primary, Christian backup, engineering on-call as extended-hours tertiary.
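The response-time column can be expressed as data so that tooling computes a concrete deadline from the detection time. A minimal sketch follows; the names `RESPONSE_WINDOWS` and `response_deadline` are hypothetical, and Sev 4 is modeled as having no clock deadline because "next sprint" is not a fixed duration.

```python
from datetime import datetime, timedelta
from typing import Optional

# Response windows taken from the severity matrix.
RESPONSE_WINDOWS = {
    "Sev 1": timedelta(minutes=15),
    "Sev 2": timedelta(hours=2),
    "Sev 3": timedelta(hours=24),
    "Sev 4": None,  # "Next sprint": no fixed clock deadline
}


def response_deadline(severity: str, detected_at: datetime) -> Optional[datetime]:
    """Return the latest acceptable first-response time, or None for Sev 4."""
    window = RESPONSE_WINDOWS[severity]
    return None if window is None else detected_at + window


print(response_deadline("Sev 1", datetime(2026, 5, 14, 9, 0)))  # 2026-05-14 09:15:00
```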
5. Critical playbooks
5.1: Source authority API down
- Verify with vendor status page (NPDB / DEA / Nursys / state board)
- If vendor side: queue requests via `reverify_scheduler.py` retry queue
- If our side: check adapter exception in Sentry + CloudWatch
- Notify customers with in-flight requests via the `monitoring_actions` surface
- Post-mortem within 24h
5.2: Database connection saturation
- CloudWatch alarm fires on RDS connections > 80% of cap
- Check ECS task count vs DB pool config
- Scale RDS instance or apply connection pool tuning
- Add Postgres slow-query log inspection
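The "ECS task count vs DB pool config" check is simple arithmetic: worst-case connections are tasks multiplied by pool size per task, compared against 80% of the RDS connection cap. The sketch below uses illustrative numbers only; the real pool size and connection cap are not stated in this runbook, and `connection_budget` is a hypothetical helper.

```python
def connection_budget(task_count: int, pool_size_per_task: int,
                      max_connections: int, alarm_fraction: float = 0.8) -> dict:
    """Compare worst-case pooled connections with the alarm threshold."""
    worst_case = task_count * pool_size_per_task
    alarm_at = max_connections * alarm_fraction
    return {
        "worst_case": worst_case,
        "alarm_at": alarm_at,
        "over_alarm": worst_case > alarm_at,  # alarm fires above 80% of cap
    }


# Illustrative numbers: 8 tasks x 25 connections against a cap of 200.
print(connection_budget(task_count=8, pool_size_per_task=25, max_connections=200))
```

If `over_alarm` is true at the current task count, autoscaling alone can trip the alarm, which points to pool tuning rather than query problems.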
5.3: Anthropic API outage
- `ai_gateway.py` detects executor failures
- Fall back to Bedrock route (target capability, not yet shipped end-to-end)
- If both unavailable: queue executor calls; surface "verification in progress" on affected workflows
- Notify design-partner customers per BAA terms
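The fallback order in this playbook (primary route, then Bedrock, then queue) can be sketched as a small routing function. This does not describe the actual internals of `ai_gateway.py`: `route_executor_call`, the callables, and the in-memory queue are all hypothetical stand-ins, and the broad `except Exception` is a simplification where a real gateway would likely catch specific provider errors.

```python
from collections import deque
from typing import Any, Callable, Deque, Tuple


def route_executor_call(payload: Any,
                        primary: Callable[[Any], Any],
                        fallback: Callable[[Any], Any],
                        pending: Deque[Any]) -> Tuple[str, Any]:
    """Try the primary route, then the fallback; queue the call if both fail."""
    for name, call in (("primary", primary), ("fallback", fallback)):
        try:
            return name, call(payload)
        except Exception:
            # Simplification: any failure moves on to the next route.
            continue
    pending.append(payload)  # retried later by whatever drains the queue
    return "queued", "verification in progress"


def failing_route(payload):
    raise RuntimeError("provider unavailable")


pending_calls: Deque[Any] = deque()
print(route_executor_call({"id": 1}, failing_route, lambda p: "ok", pending_calls))
print(route_executor_call({"id": 2}, failing_route, failing_route, pending_calls))
```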
5.4: Cognito / WorkOS auth outage
- Auth surfaces will 5xx; cached sessions remain valid for token TTL
- Hospital admins use `admin_auth` fallback path
- Notify customers per SLA (TARGET, formal SLA post-Series A)
5.5: S3 Object Lock alarm
- Object Lock is set to COMPLIANCE mode + 7-year retention
- No remediation needed for a "tampering attempt": Object Lock will block it
- CloudTrail event review: who attempted what?
- Treat as Sev 1 if attempt originates from inside our account
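The CloudTrail review step ("who attempted what?") and the Sev 1 rule can be sketched against a single CloudTrail record. The fields read here (`userIdentity.accountId`, `userIdentity.arn`, `eventName`, `errorCode`) are standard CloudTrail record fields; the function name, the sample record, the account ID, and the "needs review" label for external attempts are assumptions, since this runbook only defines the internal case.

```python
def triage_object_lock_event(event: dict, our_account_id: str) -> dict:
    """Summarize who attempted what, and flag attempts from inside our account."""
    identity = event.get("userIdentity", {})
    internal = identity.get("accountId") == our_account_id
    return {
        "who": identity.get("arn", "unknown"),
        "what": event.get("eventName", "unknown"),
        "error": event.get("errorCode"),
        # Runbook rule: Sev 1 when the attempt originates inside our account.
        # "needs review" for everything else is an assumption of this sketch.
        "severity": "Sev 1" if internal else "needs review",
    }


# Hypothetical record, trimmed to the fields used above.
sample = {
    "eventName": "DeleteObject",
    "errorCode": "AccessDenied",
    "userIdentity": {
        "accountId": "111122223333",
        "arn": "arn:aws:iam::111122223333:user/example",
    },
}
print(triage_object_lock_event(sample, our_account_id="111122223333"))
```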
5.6: Sentry / observability outage
- Sentry down ≠ production down
- CloudWatch metrics remain authoritative
- No customer impact
6. Secrets handling
- All vendor secrets in AWS Secrets Manager
- IAM least-privilege per ECS task role
- Rotation cadence: 90 days for vendor API keys, immediate on suspected exposure
- Platform engineering partner (under NDA): secrets rotation playbook to be formalized in Month 3 of pilot ops
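The rotation cadence reduces to a date comparison: rotate when 90 days have elapsed, or immediately on suspected exposure. A minimal sketch, with `rotation_due` as a hypothetical helper; where the last-rotated date comes from is not specified here.

```python
from datetime import date, timedelta

ROTATION_INTERVAL = timedelta(days=90)  # cadence for vendor API keys


def rotation_due(last_rotated: date, today: date, suspected_exposure: bool = False) -> bool:
    """Return True when a vendor key should be rotated now."""
    if suspected_exposure:
        return True  # immediate rotation regardless of key age
    return today - last_rotated >= ROTATION_INTERVAL


today = date(2026, 5, 14)
print(rotation_due(today - timedelta(days=90), today))
print(rotation_due(today - timedelta(days=10), today, suspected_exposure=True))
```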
7. Backups
- RDS automated daily snapshots, 30-day retention; point-in-time recovery (PITR) enabled
- S3 audit bucket: Object Lock COMPLIANCE 7-year, no need for separate backup
- S3 source-receipt bucket: versioning enabled; lifecycle policy moves to Glacier after 90 days
- Application code: GitHub + ECR image history
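For reference, the two S3 policies described above can be written in the dictionary shapes that boto3 accepts for `put_object_lock_configuration` and `put_bucket_lifecycle_configuration`. This is a sketch under assumptions: the rule ID and the empty prefix filter are placeholders, and whether the 7-year retention is set as a bucket default (as modeled here) or per object is not stated in this runbook.

```python
# Audit bucket: COMPLIANCE mode with a 7-year default retention.
# Assumption: retention is applied as the bucket default.
OBJECT_LOCK_CONFIGURATION = {
    "ObjectLockEnabled": "Enabled",
    "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Years": 7}},
}

# Source-receipt bucket: move objects to Glacier after 90 days.
# The rule ID and the empty prefix (whole bucket) are placeholders.
LIFECYCLE_CONFIGURATION = {
    "Rules": [
        {
            "ID": "receipts-to-glacier",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }
    ]
}

print(OBJECT_LOCK_CONFIGURATION["Rule"]["DefaultRetention"])
print(LIFECYCLE_CONFIGURATION["Rules"][0]["Transitions"])
```

Because versioning is enabled on the source-receipt bucket, a `Transitions` rule covers current object versions only; noncurrent versions would need their own rule if they should also move to Glacier.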
8. On-call
Current posture (pre-Series A):
- Founder rotation (Giles / Christian)
- Engineering on-call for production fires
- CloudWatch alarms + Sentry email + PagerDuty (planned post-Series A formalization)
Target posture (post-Series A):
- Formal PagerDuty rotation
- Tiered escalation
- SLO-driven alerting (not just threshold alarms)
9. Post-mortem template
| Field | Description |
|---|---|
| Incident date | When did the incident start/end |
| Impact | Which customers, which surfaces, what duration |
| Root cause | Technical cause |
| Detection | How we found out |
| Response | Timeline of actions |
| Resolution | What fixed it |
| Lessons | What we learned |
| Action items | Owner + due date |
Blameless format. No naming individuals in lessons section.
10. Communication
- Internal: founder channel for Sev 1 / Sev 2
- Customer: per BAA terms (60-day notification window for breaches; faster for service incidents per individual customer SLA)
- Investor: post-incident summary in monthly update if Sev 1 or Sev 2
End of runbook.