Rōvn · Investor Room

Runbook

Diligence notice: Working state of Rōvn as of 2026-06-24 · Pre-launch by design · See 09 for receipts

Date: 2026-05-14
Status: Drafted; not yet exercised against a paying-customer incident (no paying production traffic).

The full operational runbook will be expanded as embedded platform-engineering ops and the on-call rotation are formalized post-close. This document captures today's deploy, rollback, and incident-triage posture.


1. Deploy

Standard deploy path

  1. Local commit to main (or feature branch → PR merge)
  2. CI uploads source ZIP to S3 source bucket
  3. CodeBuild runs buildspec.yml → builds Docker image → tags prod-<datetime>-<feature> → pushes to ECR
  4. New ECS task definition revision registered with new image tag
  5. ECS service update with new task def revision (blue/green)
  6. Health check passes → traffic shifts
  7. Old task definition retained for rollback
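
The image tag in step 3 follows the pattern prod-<datetime>-<feature>. A minimal sketch of building and parsing such a tag is shown here. The 14-digit timestamp layout is inferred from the example tag in this runbook; the function names and the slug rules are assumptions for illustration, not the contents of buildspec.yml.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the example tag in this runbook:
# prod-<YYYYMMDDHHMMSS>-<feature-slug>
TAG_PATTERN = re.compile(r"^prod-(\d{14})-([a-z0-9][a-z0-9-]*)$")


def build_image_tag(feature: str, now: datetime) -> str:
    """Build a tag from a free-text feature name and a build time."""
    # Lowercase and collapse anything that is not a letter or digit into "-".
    slug = re.sub(r"[^a-z0-9]+", "-", feature.lower()).strip("-")
    if not slug:
        raise ValueError("feature name must contain a letter or digit")
    return f"prod-{now.strftime('%Y%m%d%H%M%S')}-{slug}"


def parse_image_tag(tag: str) -> tuple[datetime, str]:
    """Split a tag back into its build time and feature slug."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not a prod image tag: {tag!r}")
    built_at = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return built_at, match.group(2)


tag = build_image_tag("Codex design system v2", datetime(2026, 5, 10, 22, 2, 19))
print(tag)  # prod-20260510220219-codex-design-system-v2
```

Because every build gets a unique tag, a deployed task definition always identifies exactly one image, which is what makes step 7 (retaining old revisions for rollback) meaningful.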

Critical rule (from reference_rovn_deploy_mechanic.md)

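force-new-deployment alone is a no-op for shipping new code: it restarts tasks on the current task definition, which still references the previous image tag. Always register a new task definition revision after CodeBuild. Verified pattern: task def revision :86 with image tag prod-20260510220219-codex-design-system-v2.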
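
One way to enforce this rule is to derive the input for the next revision from the current task definition and refuse to proceed if the image has not changed. The sketch below is pure dictionary manipulation. The container name, account ID, and repository in the example are hypothetical, and the helper name is an assumption.

```python
import copy

# Fields that ECS DescribeTaskDefinition returns but that
# RegisterTaskDefinition does not accept as input.
READ_ONLY_FIELDS = (
    "taskDefinitionArn", "revision", "status", "requiresAttributes",
    "compatibilities", "registeredAt", "registeredBy", "deregisteredAt",
)


def next_revision_input(task_def: dict, container_name: str, image_uri: str) -> dict:
    """Return a copy of task_def ready to register with a new image."""
    new_def = copy.deepcopy(task_def)  # never mutate the caller's dict
    for field in READ_ONLY_FIELDS:
        new_def.pop(field, None)
    matches = [c for c in new_def["containerDefinitions"] if c["name"] == container_name]
    if not matches:
        raise KeyError(f"no container named {container_name!r}")
    if matches[0]["image"] == image_uri:
        # Registering the same image again would deploy nothing new.
        raise ValueError("image unchanged; build and push a new tag first")
    matches[0]["image"] = image_uri
    return new_def


# Hypothetical values for illustration only.
REPO = "123456789012.dkr.ecr.us-east-1.amazonaws.com/example-app"
current = {
    "family": "example-app",
    "revision": 85,
    "taskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:task-definition/example-app:85",
    "containerDefinitions": [{"name": "web", "image": f"{REPO}:prod-20260501000000-old"}],
}
payload = next_revision_input(current, "web", f"{REPO}:prod-20260510220219-codex-design-system-v2")
```

With boto3, a payload of this shape could be passed to `register_task_definition(**payload)` on an ECS client, and the ARN of the new revision then passed to `update_service`.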


2. Rollback

  1. Identify last known-good task definition revision in ECS
  2. aws ecs update-service --cluster <cluster> --service <service> --task-definition <known-good-revision> (do not rely on --force-new-deployment on its own)
  3. Watch CloudWatch alarms + Sentry error rate for 10 minutes
  4. Confirm /health returning 200 from new tasks
  5. Drain old tasks once new tasks are stable
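
Step 1 can be sketched as a small selection function over task definition ARNs, such as those returned by listing a task definition family. Which revisions count as bad is a human judgment during the incident; the function name and the ARNs below are hypothetical.

```python
def revision_of(arn: str) -> int:
    """Extract the trailing revision number from a task definition ARN."""
    return int(arn.rsplit(":", 1)[1])


def pick_rollback_target(arns, current_arn, known_bad=frozenset()):
    """Return the newest revision older than the current one that is not marked bad."""
    current = revision_of(current_arn)
    candidates = [
        arn for arn in arns
        if revision_of(arn) < current and revision_of(arn) not in known_bad
    ]
    if not candidates:
        raise LookupError("no earlier revision available to roll back to")
    return max(candidates, key=revision_of)


# Hypothetical ARNs for illustration only.
PREFIX = "arn:aws:ecs:us-east-1:123456789012:task-definition/example-app"
revisions = [f"{PREFIX}:{n}" for n in (83, 84, 85, 86)]

target = pick_rollback_target(revisions, f"{PREFIX}:86", known_bad={85})
print(target)  # ...task-definition/example-app:84
```

The returned ARN is what would be supplied as the --task-definition value in step 2.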

3. Health

  • ECS task health: /health endpoint (200 = healthy)
  • Database: RDS CloudWatch alarms on CPU, connections, replication lag
  • Sentry smoke: admin-only /debug/sentry route raises a controlled exception to verify wiring
  • Synthetic monitor: target (not yet in place); add Datadog or CloudWatch Synthetics checks on key surfaces
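
Until a synthetic monitor exists, the /health check can be exercised with a small poller. This is a sketch using only the standard library; the streak length, attempt count, and delay are assumed values, not settings taken from the ECS service configuration.

```python
import time
import urllib.request
from typing import Callable


def http_health_probe(url: str, timeout: float = 3.0) -> bool:
    """Return True only when the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError (non-2xx), and timeouts are all OSError subclasses.
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    required_successes: int = 3,
    max_attempts: int = 20,
    delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until `required_successes` consecutive probes pass, or give up."""
    streak = 0
    for attempt in range(max_attempts):
        if probe():
            streak += 1
            if streak >= required_successes:
                return True
        else:
            streak = 0  # a failure resets the streak
        if attempt < max_attempts - 1:
            sleep(delay_s)
    return False


# Usage against a real service (hypothetical URL):
# ok = wait_until_healthy(lambda: http_health_probe("https://example.invalid/health"))
```

Requiring a streak rather than a single 200 avoids declaring success on a task that passes once and then crashes during warm-up.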

4. Incident severity matrix

| Severity | Definition | Response time | Owner |
|---|---|---|---|
| Sev 1 | Customer-impacting outage or PHI exposure suspected | Immediate (<15 min) | Founder rotation, engineering on-call |
| Sev 2 | Degraded service or partial functionality loss | <2 hours | Engineering on-call |
| Sev 3 | Internal-only or workaround available | <24 hours | Engineering daytime |
| Sev 4 | Cosmetic / minor | Next sprint | Engineering backlog |

Founder rotation per INCIDENT_RESPONSE.md: Giles primary, Christian backup, engineering on-call as extended-hours tertiary.
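
The matrix can be encoded as data so that tooling computes a response deadline from a detection time. This is a minimal sketch; the structure and names are assumptions, and Sev 4 has no clock deadline because its response time is "next sprint".

```python
from datetime import datetime, timedelta
from typing import Optional

# Response windows copied from the severity matrix.
RESPONSE_WINDOWS = {
    "Sev 1": timedelta(minutes=15),
    "Sev 2": timedelta(hours=2),
    "Sev 3": timedelta(hours=24),
    "Sev 4": None,  # next sprint: no clock deadline
}

OWNERS = {
    "Sev 1": "Founder rotation, engineering on-call",
    "Sev 2": "Engineering on-call",
    "Sev 3": "Engineering daytime",
    "Sev 4": "Engineering backlog",
}


def response_deadline(severity: str, detected_at: datetime) -> Optional[datetime]:
    """Return the latest acceptable first-response time, or None for Sev 4."""
    window = RESPONSE_WINDOWS[severity]
    return None if window is None else detected_at + window


detected = datetime(2026, 5, 14, 9, 0)
print(response_deadline("Sev 1", detected))  # 2026-05-14 09:15:00
```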


5. Critical playbooks

5.1: Source authority API down

  1. Verify with vendor status page (NPDB / DEA / Nursys / state board)
  2. If vendor side: queue requests via reverify_scheduler.py retry queue
  3. If our side: check adapter exception in Sentry + CloudWatch
  4. Notify customers with in-flight requests via the monitoring_actions surface
  5. Post-mortem within 24h
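
Step 2 queues requests for retry while the vendor is down. The internals of reverify_scheduler.py are not shown in this runbook, so the following is only a generic sketch of a retry queue with capped exponential backoff; the class, the base delay, and the cap are all assumptions.

```python
import heapq
from dataclasses import dataclass, field


def backoff_delay(attempt: int, base_s: float = 30.0, cap_s: float = 3600.0) -> float:
    """Delay before retry number `attempt` (0-based), doubling up to a cap."""
    return min(cap_s, base_s * 2 ** attempt)


@dataclass(order=True)
class QueuedRequest:
    due_at: float                            # only field used for ordering
    request_id: str = field(compare=False)
    attempt: int = field(compare=False, default=0)


class RetryQueue:
    """Min-heap of requests ordered by the time they become due."""

    def __init__(self) -> None:
        self._heap: list[QueuedRequest] = []

    def push(self, request_id: str, now: float, attempt: int = 0) -> None:
        due_at = now + backoff_delay(attempt)
        heapq.heappush(self._heap, QueuedRequest(due_at, request_id, attempt))

    def pop_due(self, now: float) -> list[QueuedRequest]:
        """Remove and return every request whose retry time has arrived."""
        due = []
        while self._heap and self._heap[0].due_at <= now:
            due.append(heapq.heappop(self._heap))
        return due


queue = RetryQueue()
queue.push("npdb-check-1", now=0.0)             # due at t=30
queue.push("dea-check-7", now=0.0, attempt=2)   # due at t=120
```

A failed retry would be pushed back with attempt + 1, so a prolonged vendor outage produces progressively sparser calls. Adding random jitter to the delay is a common refinement when many requests fail at the same moment.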

5.2: Database connection saturation

  1. CloudWatch alarm fires on RDS connections > 80% of cap
  2. Check ECS task count vs DB pool config
  3. Scale RDS instance or apply connection pool tuning
  4. Inspect the Postgres slow-query log for long-running queries that are holding connections
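
The check in step 2 is simple arithmetic: worst-case connections are task count times per-task pool capacity, compared against the 80% alarm line. The sketch below uses hypothetical numbers throughout (8 tasks, a pool of 10 with 5 overflow, a 200-connection cap); none are actual production settings.

```python
def connection_budget(
    task_count: int,
    pool_size: int,
    max_overflow: int,
    db_max_connections: int,
    alarm_percent: int = 80,
) -> dict:
    """Compare worst-case pool usage across all tasks with the alarm threshold."""
    per_task = pool_size + max_overflow
    if per_task <= 0:
        raise ValueError("each task must be allowed at least one connection")
    worst_case = task_count * per_task
    # Integer arithmetic avoids floating-point surprises near the threshold.
    alarm_at = db_max_connections * alarm_percent // 100
    return {
        "worst_case_connections": worst_case,
        "alarm_at": alarm_at,
        "headroom": alarm_at - worst_case,
        "max_tasks_before_alarm": alarm_at // per_task,
        "would_alarm": worst_case > alarm_at,
    }


# Hypothetical numbers for illustration only.
budget = connection_budget(task_count=8, pool_size=10, max_overflow=5, db_max_connections=200)
print(budget["worst_case_connections"], budget["alarm_at"], budget["max_tasks_before_alarm"])
# 120 160 10
```

If ECS autoscaling can exceed max_tasks_before_alarm, either the per-task pool or the RDS connection cap needs to change before the next scale-out.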

5.3: Anthropic API outage

  1. ai_gateway.py detects executor failures
  2. Fall back to the Bedrock route (target capability; not yet shipped end-to-end)
  3. If both unavailable: queue executor calls; surface "verification in progress" on affected workflows
  4. Notify design-partner customers per BAA terms
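
The fallback order in this playbook can be sketched as a small dispatcher: try the primary route, then the fallback if one is configured, and otherwise queue the call and report "verification in progress". This is not the code in ai_gateway.py; the function, exception, and return shape are assumptions.

```python
from collections import deque
from typing import Callable, Optional


class ExecutorUnavailable(Exception):
    """Raised by a route when its provider cannot serve the call."""


def run_with_fallback(
    job: str,
    primary: Callable[[str], str],
    fallback: Optional[Callable[[str], str]],
    pending: deque,
) -> dict:
    """Try each configured route in order; queue the job if all fail."""
    for name, route in (("primary", primary), ("fallback", fallback)):
        if route is None:
            continue  # e.g. the fallback route is not shipped yet
        try:
            return {"status": "done", "route": name, "result": route(job)}
        except ExecutorUnavailable:
            continue
    pending.append(job)  # replayed once a provider recovers
    return {"status": "verification in progress", "route": None, "result": None}


def down(job: str) -> str:
    raise ExecutorUnavailable("provider outage")


def healthy(job: str) -> str:
    return f"processed:{job}"


pending: deque = deque()
first = run_with_fallback("job-1", down, healthy, pending)
second = run_with_fallback("job-2", down, None, pending)
```

Passing None for the fallback models today's state, where the Bedrock route is a target capability, without changing the caller once that route ships.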

5.4: Cognito / WorkOS auth outage

  1. Auth surfaces will 5xx; cached sessions remain valid for token TTL
  2. Hospital admins use admin_auth fallback path
  3. Notify customers per SLA (target; formal SLA to follow post-Series A)

5.5: S3 Object Lock alarm

  1. Object Lock is set to COMPLIANCE mode + 7-year retention
  2. No remediation is needed for a "tampering attempt"; Object Lock will block it
  3. CloudTrail event review: who attempted what?
  4. Treat as Sev 1 if attempt originates from inside our account
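
Steps 3 and 4 can be sketched as a triage function over a CloudTrail event record. It reads the standard userIdentity, eventName, and errorCode fields. The account IDs and role ARN below are hypothetical, and the runbook does not assign a severity to attempts from outside the account, so the sketch leaves that case unclassified.

```python
from typing import Optional


def triage_object_lock_event(event: dict, our_account_id: str) -> dict:
    """Summarize who attempted what, and flag attempts from our own account."""
    identity = event.get("userIdentity", {})
    internal = identity.get("accountId") == our_account_id
    severity: Optional[str] = "Sev 1" if internal else None  # external: not specified
    return {
        "severity": severity,
        "who": identity.get("arn", "unknown"),
        "what": event.get("eventName", "unknown"),
        # CloudTrail sets errorCode when the request was rejected.
        "blocked": event.get("errorCode") is not None,
    }


# Hypothetical record for illustration only.
sample_event = {
    "eventName": "DeleteObject",
    "errorCode": "AccessDenied",
    "userIdentity": {
        "accountId": "123456789012",
        "arn": "arn:aws:sts::123456789012:assumed-role/example-role/session",
    },
}

print(triage_object_lock_event(sample_event, "123456789012")["severity"])  # Sev 1
```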

5.6: Sentry / observability outage

  1. Sentry down ≠ production down
  2. CloudWatch metrics remain authoritative
  3. No customer impact

6. Secrets handling

  • All vendor secrets in AWS Secrets Manager
  • IAM least-privilege per ECS task role
  • Rotation cadence: 90 days for vendor API keys, immediate on suspected exposure
  • Platform engineering partner (under NDA): secrets rotation playbook to be formalized in Month 3 of pilot ops
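
The rotation cadence reduces to a date check: a vendor key is due after 90 days, or immediately on suspected exposure. This is a minimal sketch with assumed function names; it does not call Secrets Manager.

```python
from datetime import date, timedelta

ROTATION_INTERVAL = timedelta(days=90)  # cadence for vendor API keys


def next_rotation(last_rotated: date) -> date:
    """Date by which the key should be rotated under the 90-day cadence."""
    return last_rotated + ROTATION_INTERVAL


def rotation_due(last_rotated: date, today: date, suspected_exposure: bool = False) -> bool:
    """True when the key is past its cadence or exposure is suspected."""
    return suspected_exposure or today >= next_rotation(last_rotated)


last = date(2026, 5, 14)
print(next_rotation(last))  # 2026-08-12
```

A scheduled job could run this check over the last-rotated date of each secret and open a ticket for any key that is due.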

7. Backups

  • RDS automated daily snapshots, 30-day retention; point-in-time recovery (PITR) enabled
  • S3 audit bucket: Object Lock in COMPLIANCE mode with 7-year retention; no separate backup needed
  • S3 source-receipt bucket: versioning enabled; lifecycle policy moves to Glacier after 90 days
  • Application code: GitHub + ECR image history

8. On-call

Current posture (pre-Series A):

  • Founder rotation (Giles / Christian)
  • Engineering on-call for production fires
  • CloudWatch alarms + Sentry email + PagerDuty (planned post-Series A formalization)

Target posture (post-Series A):

  • Formal PagerDuty rotation
  • Tiered escalation
  • SLO-driven alerting (not just threshold alarms)


9. Post-mortem template

| Field | Description |
|---|---|
| Incident date | When the incident started and ended |
| Impact | Which customers, which surfaces, what duration |
| Root cause | Technical cause |
| Detection | How we found out |
| Response | Timeline of actions |
| Resolution | What fixed it |
| Lessons | What we learned |
| Action items | Owner + due date |

Blameless format: do not name individuals in the Lessons section.


10. Communication

  • Internal: founder channel for Sev 1 / Sev 2
  • Customer: per BAA terms (60-day notification window for breaches; faster for service incidents per individual customer SLA)
  • Investor: post-incident summary in monthly update if Sev 1 or Sev 2

End of runbook.
