Runbook

Date: 2026-05-14
Status: Drafted; not yet exercised against a paying-customer incident (no paying production traffic).

The full operational runbook will be expanded once embedded platform-engineering ops and the on-call rotation are formalized post-close. This document captures today's deploy, rollback, and incident-triage posture.

1. Deploy

Standard deploy path

Local commit to main (or feature branch → PR merge)
CI uploads source ZIP to S3 source bucket
CodeBuild runs buildspec.yml → builds Docker image → tags prod-<datetime>-<feature> → pushes to ECR
New ECS task definition revision registered with new image tag
ECS service update with new task def revision (blue/green)
Health check passes → traffic shifts
Old task definition retained for rollback
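
The tag format in the CodeBuild step can be sketched as follows. This is a minimal illustration, not the contents of buildspec.yml: the function name build_image_tag is hypothetical, and the 14-digit YYYYMMDDHHMMSS stamp (taken in UTC here) is an assumption inferred from the example tag prod-20260510220219-codex-design-system-v2.

```python
import re
from datetime import datetime, timezone
from typing import Optional


def build_image_tag(feature: str, now: Optional[datetime] = None) -> str:
    """Return a tag shaped like prod-<datetime>-<feature> (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    # Assumption: <datetime> is YYYYMMDDHHMMSS, inferred from the example tag.
    stamp = now.strftime("%Y%m%d%H%M%S")
    # Reduce the feature name to lowercase letters, digits, and hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", feature.lower()).strip("-")
    if not slug:
        raise ValueError("feature must contain at least one letter or digit")
    return f"prod-{stamp}-{slug}"
```

Because every build gets a unique tag, a task definition always pins one exact image, which is what makes retaining the old revision a usable rollback target.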

Critical rule (from `reference_rovn_deploy_mechanic.md`)

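force-new-deployment alone is a no-op for shipping new code: it restarts tasks on the current task definition revision, which still references the previous image tag. Always register a new task definition revision that points at the new tag after CodeBuild completes. Verified pattern: task def :86 → prod-20260510220219-codex-design-system-v2.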
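
A minimal sketch of the register-then-update pattern, using boto3. The cluster, service, and family arguments are placeholders, and the sketch assumes a single application container per task; the real deploy tooling may differ. The read-only field list reflects keys that describe_task_definition returns but register_task_definition does not accept.

```python
from typing import Any

# Keys returned by describe_task_definition that cannot be passed back
# to register_task_definition.
READ_ONLY_FIELDS = {
    "taskDefinitionArn", "revision", "status", "requiresAttributes",
    "compatibilities", "registeredAt", "registeredBy", "deregisteredAt",
}


def next_revision_kwargs(task_def: dict[str, Any], image: str) -> dict[str, Any]:
    """Copy a task definition, pointing each container at the new image.

    Assumption: the task runs one application container, so replacing the
    image on every container definition is safe.
    """
    kwargs = {k: v for k, v in task_def.items() if k not in READ_ONLY_FIELDS}
    kwargs["containerDefinitions"] = [
        {**container, "image": image}
        for container in task_def["containerDefinitions"]
    ]
    return kwargs


def deploy(cluster: str, service: str, family: str, image: str) -> str:
    """Register a new revision for `image`, then point the service at it."""
    import boto3  # imported here so the helper above works without the AWS SDK

    ecs = boto3.client("ecs")
    current = ecs.describe_task_definition(taskDefinition=family)["taskDefinition"]
    registered = ecs.register_task_definition(**next_revision_kwargs(current, image))
    new_arn = registered["taskDefinition"]["taskDefinitionArn"]
    # Passing the new revision is what changes the running image.
    ecs.update_service(cluster=cluster, service=service, taskDefinition=new_arn)
    return new_arn
```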

2. Rollback

Identify last known-good task definition revision in ECS
Run aws ecs update-service --cluster <cluster> --service <service> --task-definition <known-good-revision> (do not use force-new-deployment on its own)
Watch CloudWatch alarms + Sentry error rate for 10 minutes
Confirm /health returning 200 from new tasks
Drain old tasks once new tasks are stable
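
The 10-minute watch and the /health confirmation could be scripted roughly as below. This is a sketch under assumptions: the 30-second interval and the function names are illustrative, and probing a service URL only shows that whichever task answered is healthy, so it complements rather than replaces the CloudWatch and Sentry checks.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """True when GET `url` answers 200; network errors and non-200 count as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return False


def watch(
    check: Callable[[], bool],
    window_s: float = 600,   # the 10-minute observation window
    interval_s: float = 30,  # assumption: probe every 30 seconds
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run `check` across the window; return False on the first failure."""
    checks = max(1, int(window_s // interval_s))
    for i in range(checks):
        if not check():
            return False
        if i < checks - 1:
            sleep(interval_s)
    return True
```

A call such as watch(lambda: http_ok(health_url)) returns True only if every probe in the window succeeded, where health_url is the placeholder for the service's /health address.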

3. Health

ECS task health: /health endpoint (200 = healthy)
Database: RDS CloudWatch alarms on CPU, connections, replication lag
Sentry smoke: admin-only /debug/sentry route raises a controlled exception to verify wiring
Synthetic monitor: target, not yet in place; add Datadog or CloudWatch Synthetics on key surfaces

4. Incident severity matrix

Severity	Definition	Response time	Owner
Sev 1	Customer-impacting outage or PHI exposure suspected	Immediate (<15 min)	Founder rotation, engineering on-call
Sev 2	Degraded service or partial functionality loss	<2 hours	Engineering on-call
Sev 3	Internal-only or workaround available	<24 hours	Engineering daytime
Sev 4	Cosmetic / minor	Next sprint	Engineering backlog

Founder rotation per INCIDENT_RESPONSE.md: Giles primary, Christian backup, engineering on-call as extended-hours tertiary.
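
The response times in the matrix can be expressed as data so that a respond-by deadline is computed rather than remembered. A minimal sketch; the names Severity and respond_by are illustrative and not taken from INCIDENT_RESPONSE.md.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


# Response windows from the severity matrix. Sev 4 is "next sprint",
# which has no fixed clock deadline, so it maps to None.
RESPONSE_WINDOW = {
    Severity.SEV1: timedelta(minutes=15),
    Severity.SEV2: timedelta(hours=2),
    Severity.SEV3: timedelta(hours=24),
    Severity.SEV4: None,
}


def respond_by(severity: Severity, opened_at: datetime) -> Optional[datetime]:
    """Return the latest acceptable first-response time, or None for Sev 4."""
    window = RESPONSE_WINDOW[severity]
    return None if window is None else opened_at + window
```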

5. Critical playbooks

5.1: Source authority API down

Verify with vendor status page (NPDB / DEA / Nursys / state board)
If vendor side: queue requests via reverify_scheduler.py retry queue
If our side: check adapter exception in Sentry + CloudWatch
Notify customers with in-flight requests via the monitoring_actions surface
Post-mortem within 24h
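
For the vendor-side case, a retry queue with capped exponential backoff is one common shape. The sketch below is illustrative only: it does not reproduce reverify_scheduler.py, and the class names, the 60-second base delay, and the one-hour cap are all assumptions.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedRequest:
    due_at: float                              # only field used for ordering
    request_id: str = field(compare=False)
    attempts: int = field(default=0, compare=False)


class RetryQueue:
    """Hypothetical retry queue: earliest-due request first."""

    def __init__(self, base_delay_s: float = 60.0, max_delay_s: float = 3600.0):
        self.base_delay_s = base_delay_s
        self.max_delay_s = max_delay_s
        self._heap: list[QueuedRequest] = []

    def delay_for(self, attempts: int) -> float:
        # Double the delay per prior attempt, capped at max_delay_s.
        return min(self.base_delay_s * (2 ** attempts), self.max_delay_s)

    def push(self, request_id: str, now: float, attempts: int = 0) -> None:
        due_at = now + self.delay_for(attempts)
        heapq.heappush(self._heap, QueuedRequest(due_at, request_id, attempts))

    def pop_due(self, now: float) -> list[QueuedRequest]:
        """Remove and return every request whose retry time has arrived."""
        due = []
        while self._heap and self._heap[0].due_at <= now:
            due.append(heapq.heappop(self._heap))
        return due
```

A request that fails again would be pushed back with attempts + 1, so a long vendor outage backs off instead of hammering the source authority.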

5.2: Database connection saturation

CloudWatch alarm fires on RDS connections > 80% of cap
Check ECS task count vs DB pool config
Scale RDS instance or apply connection pool tuning
Inspect the Postgres slow-query log
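
The task-count versus pool-config check is simple arithmetic: worst-case connections are the number of tasks times the connections each task may open. A sketch with hypothetical parameter names; the 0.8 fraction mirrors the 80% alarm threshold.

```python
def connection_budget(
    task_count: int,
    pool_size_per_task: int,
    max_overflow_per_task: int,
    db_max_connections: int,
    alarm_fraction: float = 0.8,
) -> dict:
    """Compare worst-case pooled connections against the alarm threshold."""
    # Each task can open its base pool plus any configured overflow.
    worst_case = task_count * (pool_size_per_task + max_overflow_per_task)
    alarm_threshold = db_max_connections * alarm_fraction
    return {
        "worst_case": worst_case,
        "alarm_threshold": alarm_threshold,
        "over_alarm": worst_case > alarm_threshold,
    }
```

For example, 8 tasks with a pool of 10 and an overflow of 5 can open 120 connections; against a cap of 100 the alarm threshold is 80, so pool tuning or a larger instance is needed before scaling tasks further.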

5.3: Anthropic API outage

ai_gateway.py detects executor failures
Fall back to Bedrock route (target capability; not yet shipped end-to-end)
If both unavailable: queue executor calls; surface "verification in progress" on affected workflows
Notify design-partner customers per BAA terms
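
The primary, fallback, then queue order can be sketched as below. This does not reflect the internals of ai_gateway.py; the function name, the route callables, and the pending queue are hypothetical stand-ins, and the Bedrock route itself is a target capability rather than something shipped.

```python
from collections import deque
from typing import Callable, Deque, Optional, Sequence, Tuple

Route = Tuple[str, Callable[[str], str]]


def run_with_fallback(
    prompt: str,
    routes: Sequence[Route],
    pending: Deque[str],
) -> Tuple[Optional[str], str]:
    """Try each route in order; queue the call if every route fails.

    Returns (result, route_name) on success, or (None, "queued") when the
    caller should show "verification in progress".
    """
    for name, call in routes:
        try:
            return call(prompt), name
        except Exception:
            # Broad catch is deliberate in this sketch: any executor
            # failure moves on to the next route.
            continue
    pending.append(prompt)
    return None, "queued"
```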

5.4: Cognito / WorkOS auth outage

Auth surfaces will 5xx; cached sessions remain valid for token TTL
Hospital admins use admin_auth fallback path
Notify customers per SLA (target; formal SLA post-Series A)

5.5: S3 Object Lock alarm

Object Lock is set to COMPLIANCE mode + 7-year retention
No remediation needed for a "tampering attempt"; Object Lock will block it
CloudTrail event review: who attempted what?
Treat as Sev 1 if attempt originates from inside our account
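
The inside-our-account test can be applied to a CloudTrail record by comparing the caller's account with ours. A sketch, assuming the record carries userIdentity.accountId (present for most identity types); the function name and the "needs triage" label for every other case are illustrative, since the playbook only defines the Sev 1 condition.

```python
def classify_object_lock_event(record: dict, our_account_id: str) -> str:
    """Return "Sev 1" when the blocked attempt came from our own account."""
    caller_account = record.get("userIdentity", {}).get("accountId")
    if caller_account == our_account_id:
        return "Sev 1"
    # Assumption: anything else (external or unidentified caller) still
    # gets a human review, but the severity is decided during triage.
    return "needs triage"
```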

5.6: Sentry / observability outage

Sentry down ≠ production down
CloudWatch metrics remain authoritative
No customer impact

6. Secrets handling

All vendor secrets in AWS Secrets Manager
IAM least-privilege per ECS task role
Rotation cadence: 90 days for vendor API keys, immediate on suspected exposure
Platform engineering partner (under NDA): secrets rotation playbook to be formalized in Month 3 of pilot ops
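
The rotation cadence reduces to a small check: a vendor key is due when it is 90 days old or whenever exposure is suspected. A minimal sketch with a hypothetical function name; the last-rotated timestamp would come from wherever rotation dates are tracked.

```python
from datetime import datetime, timedelta, timezone

ROTATION_WINDOW = timedelta(days=90)  # cadence for vendor API keys


def rotation_due(
    last_rotated: datetime,
    now: datetime,
    suspected_exposure: bool = False,
) -> bool:
    """True when the key should be rotated now."""
    # Suspected exposure overrides the calendar: rotate immediately.
    return suspected_exposure or (now - last_rotated) >= ROTATION_WINDOW
```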

7. Backups

RDS automated daily snapshots, 30-day retention; point-in-time recovery (PITR) enabled
S3 audit bucket: Object Lock in COMPLIANCE mode with 7-year retention; no separate backup needed
S3 source-receipt bucket: versioning enabled; lifecycle policy moves to Glacier after 90 days
Application code: GitHub + ECR image history

8. On-call

Current posture (pre-Series A):
Founder rotation (Giles / Christian)
Engineering on-call for production fires
CloudWatch alarms + Sentry email + PagerDuty (planned post-Series A formalization)

Target posture (post-Series A):
Formal PagerDuty rotation
Tiered escalation
SLO-driven alerting (not just threshold alarms)

9. Post-mortem template

Field	Description
Incident date	When did the incident start/end
Impact	Which customers, which surfaces, what duration
Root cause	Technical cause
Detection	How we found out
Response	Timeline of actions
Resolution	What fixed it
Lessons	What we learned
Action items	Owner + due date

Blameless format. No naming individuals in lessons section.

10. Communication

Internal: founder channel for Sev 1 / Sev 2
Customer: per BAA terms (60-day notification window for breaches; faster for service incidents per individual customer SLA)
Investor: post-incident summary in monthly update if Sev 1 or Sev 2

End of runbook.
