Business Continuity & Disaster Recovery Plan
Owner: Security Officer · Approved by leadership · Version 1.0 · Effective 27 May 2026 · Next review 27 May 2027 · Tested nightly via dr-tests/
1. Purpose
This plan describes how Glassbreak maintains continuity of service during disruption and recovers from disaster events. It states our recovery objectives, the continuity strategy that delivers them, the criteria for activating recovery procedures, and the cadence at which the plan is tested.
2. Scope
This plan covers the production Glassbreak service, including all surfaces under glassbreak.io, glassbreak.cloud, glass-break.com, the per-vertical PostgreSQL databases, the per-vertical object storage, the audit log, and the supporting observability and CI/CD pipelines.
3. Recovery objectives
- RTO (Recovery Time Objective): 15 minutes for cross-vertical failover of the primary glassbreak.io surface. Failover is automated via Fastly health-checked routing; no human action is required for a single-vertical failure.
- RPO (Recovery Point Objective): 5 minutes for data replicated across verticals via the HMAC-signed sync transport.
- RPO for quorum-recoverable secrets: zero in a single-vertical failure. The Shamir-share model means that provided the quorum threshold can still be assembled on the surviving vertical, the secret is recoverable without data loss.
- Backup retention: 35 days rolling for per-vertical encrypted backups.
- Backup integrity verification: daily via DR scenario 22 (a real round-trip restore exercising the entire backup chain).
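The zero-RPO objective for quorum-recoverable secrets rests on threshold secret sharing. The following is a minimal sketch of Shamir's scheme, not the production implementation: the field prime, the function names, and the 3-of-5 split in the usage note are illustrative assumptions, and the plan does not state how shares are distributed between verticals.

```python
import secrets

# Assumed field: the Mersenne prime 2**127 - 1. Secrets must be smaller than this.
PRIME = 2**127 - 1


def split_secret(secret: int, threshold: int, shares: int) -> list[tuple[int, int]]:
    """Split `secret` into `shares` points; any `threshold` of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret must fit in the field")
    if not 1 <= threshold <= shares:
        raise ValueError("need 1 <= threshold <= shares")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x: int) -> int:
        result = 0
        for c in reversed(coeffs):  # Horner's rule
            result = (result * x + c) % PRIME
        return result

    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def recover_secret(points: list[tuple[int, int]]) -> int:
    """Lagrange-interpolate the polynomial at x = 0 to recover the secret."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

With a hypothetical 3-of-5 split, `recover_secret` returns the original value from any three shares, which is why a single-vertical failure loses nothing as long as the surviving vertical can still assemble the threshold.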
4. Continuity strategy
4.1 Multi-cloud architecture
The platform runs across two independent compute stacks served through three domains. Either stack can serve traffic alone if the other is unavailable. Detailed topology is in docs/architecture.md.
- Stack A: AWS Lambda Function URLs in us-east-1 with Neon Postgres in aws-us-east-1 and AWS S3 for static and blob storage.
- Stack B: Scaleway Functions in fr-par with Scaleway Managed Database in fr-par and Scaleway Object Storage for static and blob storage.
- Routing: Fastly fronts the primary surface with health-checked failover. Direct-backup surfaces (glassbreak.cloud, glass-break.com) bypass Fastly entirely, surviving Fastly outages.
4.2 Cross-vertical synchronisation
- State-changing writes propagate over HTTPS with HMAC signatures between the verticals; no shared queue or shared database.
- HMAC keys rotate on a documented schedule (docs/operator-jwt-per-vertical.md).
- Sync messages are idempotent and tolerate out-of-order or duplicate delivery (DR scenarios 11, 12, 17).
- Asymmetric partitions are detected and handled with documented convergence semantics (DR scenarios 14, 15).
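The properties listed above (HMAC-signed transport, idempotency, tolerance of duplicate and out-of-order delivery) can be illustrated with a small receiver. This is a sketch under stated assumptions: the message fields (`id`, `key`, `version`, `value`), the SHA-256 digest, and the highest-version-wins rule are hypothetical and are not taken from the Glassbreak sync protocol, whose convergence semantics are documented elsewhere.

```python
import hashlib
import hmac
import json


def sign(key: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw message body (digest choice is an assumption)."""
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def make_message(key: bytes, msg_id: str, record: str, version: int, value: str):
    """Build a (body, signature) pair; the field names are hypothetical."""
    body = json.dumps(
        {"id": msg_id, "key": record, "version": version, "value": value}
    ).encode()
    return body, sign(key, body)


class SyncReceiver:
    def __init__(self, key: bytes) -> None:
        self._key = key
        self._seen: set[str] = set()                  # message ids already processed
        self.state: dict[str, tuple[int, str]] = {}   # record -> (version, value)

    def receive(self, body: bytes, signature: str) -> str:
        # Constant-time comparison; reject anything that fails verification.
        if not hmac.compare_digest(sign(self._key, body), signature):
            return "rejected"
        msg = json.loads(body)
        if msg["id"] in self._seen:
            return "duplicate"          # redelivery is a no-op
        self._seen.add(msg["id"])
        current = self.state.get(msg["key"])
        if current is not None and current[0] >= msg["version"]:
            return "stale"              # arrived after a newer write; ignore
        self.state[msg["key"]] = (msg["version"], msg["value"])
        return "applied"
```

Delivering version 2 before version 1, or delivering the same message twice, leaves the receiver in the same final state, which is the behaviour the out-of-order and duplicate-delivery scenarios are meant to exercise.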
4.3 Backup
- Each vertical takes encrypted database backups on a cloud-provider-managed schedule (Neon point-in-time recovery up to 30 days; Scaleway managed-database backups daily with 7-day retention plus weekly + monthly long-term snapshots).
- Object storage relies on cloud-provider versioning and cross-region durability guarantees.
- Backup integrity is verified nightly via DR scenario 22 — a real round-trip restore that re-runs application-level queries against the restored copy.
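The idea of a round-trip restore check can be sketched as follows. SQLite's in-memory backup stands in for the provider-managed Postgres backups here purely so the example is self-contained; the function names, the manifest approach, and the example queries are assumptions and do not describe what DR scenario 22 actually runs.

```python
import sqlite3


def record_manifest(conn: sqlite3.Connection, queries: list[str]) -> dict[str, list]:
    """Capture the expected result of each application-level query at backup time."""
    return {q: conn.execute(q).fetchall() for q in queries}


def restore_copy(source: sqlite3.Connection) -> sqlite3.Connection:
    """Stand-in for 'take a backup, then restore it somewhere else'."""
    restored = sqlite3.connect(":memory:")
    source.backup(restored)  # sqlite3.Connection.backup, Python 3.7+
    return restored


def verify_restore(restored: sqlite3.Connection, manifest: dict[str, list]) -> bool:
    """Re-run every query against the restored copy and compare with the manifest."""
    return all(restored.execute(q).fetchall() == rows for q, rows in manifest.items())
```

The point of re-running queries, rather than only checking that a backup file exists, is that a restore which silently drops or alters rows fails the comparison.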
4.4 Configuration and source-code recovery
- All infrastructure is declared in OpenTofu and stored in the GitHub repository. The cloud account state can be reconstructed by running tofu apply against a fresh provider credential.
- Source code is mirrored across GitHub (primary) and developer workstations.
- Secret material is held in approved secret stores; the rotation runbook for each secret class is documented in docs/.
5. Activation
5.1 Detection
- Automated detection: when Fastly health checks fail for a backend, automatic failover triggers within seconds.
- Observability detection: SLO breaches (5xx rate, p95 latency, apex probe) page the on-call responder.
- Heartbeat detection: a 30-minute smoke-test heartbeat detects sustained outages on any production surface and opens a deduplicated GitHub issue.
- Manual detection: workforce or customer reports.
5.2 Activation triggers
This plan is activated when any one of the following is true:
- Both verticals are unavailable for more than 5 minutes.
- Any single vertical is unavailable and cross-vertical failover has failed to restore service.
- Data corruption is suspected or confirmed on a production database.
- A region-wide event affects either AWS us-east-1 or Scaleway fr-par.
- The Incident Commander declares activation under the Incident Response Policy.
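The triggers above are an any-of rule, which can be written down as a small predicate. This is an illustrative sketch only: the `Status` fields and function names are hypothetical and do not correspond to an existing Glassbreak tool.

```python
from dataclasses import dataclass
from datetime import timedelta

# "Both verticals are unavailable for more than 5 minutes."
BOTH_DOWN_THRESHOLD = timedelta(minutes=5)


@dataclass(frozen=True)
class Status:
    both_verticals_down_for: timedelta = timedelta(0)
    single_vertical_down: bool = False
    failover_restored_service: bool = True
    data_corruption_suspected: bool = False
    region_wide_event: bool = False        # AWS us-east-1 or Scaleway fr-par
    commander_declared: bool = False


def activation_reasons(s: Status) -> list[str]:
    """Return every trigger that currently holds; an empty list means no activation."""
    reasons = []
    if s.both_verticals_down_for > BOTH_DOWN_THRESHOLD:
        reasons.append("both verticals unavailable for more than 5 minutes")
    if s.single_vertical_down and not s.failover_restored_service:
        reasons.append("single vertical down and failover did not restore service")
    if s.data_corruption_suspected:
        reasons.append("data corruption suspected or confirmed")
    if s.region_wide_event:
        reasons.append("region-wide event")
    if s.commander_declared:
        reasons.append("Incident Commander declared activation")
    return reasons


def should_activate(s: Status) -> bool:
    return bool(activation_reasons(s))
```

Note that a single-vertical outage with successful failover returns no reasons, matching the plan's position that such failures are handled automatically without activation.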
5.3 Activation procedure
- The Incident Commander declares activation and names the Technical Lead and Communications Lead.
- The Technical Lead opens the runbook for the failure mode (vertical-down, region-down, data-corruption, etc.).
- The Communications Lead updates the public status page within 1 hour and notifies affected customers within the SLA set in the IR policy.
- The Scribe maintains the timeline.
6. Recovery procedures by scenario
6.1 Single-vertical compute outage
Automatic. Fastly fails over to the surviving vertical within health-check latency. Customers on direct-backup surfaces (glassbreak.cloud if Scaleway is alive, or glass-break.com if AWS is alive) are unaffected. Cross-vertical sync resumes when the failed vertical recovers. Tested in DR scenario 2.
6.2 Single-vertical database outage
The vertical's compute remains up but cannot serve reads or writes. Fastly routes traffic away. The surviving vertical continues to serve. On database recovery, the failed vertical replays sync messages from the survivor. Tested in DR scenario 6.
6.3 Both verticals unavailable
The hard case. Activation criteria met. Steps:
- Public status page updated immediately.
- Customers notified within 1 hour.
- If the cause is a shared dependency, mitigation focuses on that dependency.
- If the cause is independent failures, recovery is per-vertical in parallel, with the faster-recovered vertical resuming service first.
- The future glassbreak.dev hard-disconnect path (Phase 8 of the architecture target) is the mitigation for concurrent AWS + Scaleway outages.
6.4 Data corruption on a single vertical
- Isolate the affected database from sync to prevent propagation.
- Restore from the most recent verified backup (point-in-time if Neon; daily snapshot if Scaleway).
- Replay missing sync messages from the surviving vertical.
- Verify quorum-recoverable secrets via the in-product diagnostic before declaring recovery.
6.5 Data corruption affecting both verticals
Extremely unlikely given independent storage. If it occurs:
- Take both surfaces read-only.
- Determine the source of corruption (application bug, sync bug, malicious action).
- Restore the most recent verified backups on both verticals.
- Reconcile manually; replay validated user actions from audit log entries that survived corruption.
- Communicate transparently with affected customers.
6.6 Loss of Fastly (edge / CDN)
Customers on glassbreak.io may experience degraded routing until Fastly is restored. Direct-backup surfaces are unaffected. Customers can be directed to the backup surfaces via the public status page. Tested in DR scenario 5.
6.7 Loss of a DNS authority
Each domain has dual DNS authority. Loss of a single authority does not make any domain unresolvable. No activation is required; document the event in the incident register and replace the failed authority on Glassbreak's normal schedule.
6.8 Loss of a cryptographic signing key
Per the per-vertical JWT runbook (docs/operator-jwt-per-vertical.md): the compromised kid is dropped from the verify map across all verticals, token_version is bumped to invalidate outstanding tokens, the operator re-issues sessions, and the incident is logged. Tested in DR scenario 7.
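The revocation steps in this procedure can be sketched in a few lines. The authoritative steps live in docs/operator-jwt-per-vertical.md; the `Vertical` class, the per-vertical `token_version` counter, and the function names below are assumptions made for illustration, and signature verification itself is omitted.

```python
from dataclasses import dataclass


@dataclass
class Vertical:
    verify_keys: dict[str, bytes]   # kid -> verification key (the "verify map")
    token_version: int = 1          # assumed to be one counter per vertical


def revoke_signing_key(verticals: list[Vertical], kid: str) -> None:
    """Drop the kid everywhere and bump token_version to invalidate old tokens."""
    for v in verticals:
        v.verify_keys.pop(kid, None)   # no error if a vertical never had the kid
        v.token_version += 1


def token_accepted(vertical: Vertical, kid: str, token_version: int) -> bool:
    """A token passes only if its kid is still trusted and its version is current."""
    return kid in vertical.verify_keys and token_version == vertical.token_version
```

After revocation, tokens signed with the dropped kid fail on the first condition and tokens issued before the bump fail on the second, so sessions must be re-issued under a surviving key.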
6.9 Loss of a sub-processor
- Stripe — payment processing degrades; in-app billing UI surfaces an outage banner; subscription state remains consistent until Stripe recovers.
- Postmark / SES (email) — email-dependent flows (verification, password reset) are queued; users informed via the status page.
- Twilio (SMS) — SMS-dependent flows are queued; users informed via the status page.
- Grafana Cloud — telemetry ingestion stops; smoke tests continue locally; alerting is degraded.
Each sub-processor failure has a runbook entry in docs/observability.md.
7. Testing
- Continuous: 22 DR scenarios in dr-tests/ run in CI nightly against a real two-vertical Postgres + Hono topology. Coverage spans sync convergence, vertical kill, TURN failover, secret quorum recovery, edge bypass, DB recovery, JWT rotation grace, refresh-token reuse detection, CGN call, outbox storm, out-of-order / duplicate sync, HMAC tamper, cross-vertical conflict, asymmetric partition, audit chain integrity, cron idempotency, rate-limit resilience, secrets rotation, migration safety, replica lag, and backup integrity.
- Quarterly: a tabletop walk-through of one scenario by the on-call responder, exercising the communications path end-to-end without impacting production.
- Annual: a full activation drill including the public status-page update flow (in test mode) and customer-notification template review.
8. Roles
- Incident Commander — declares activation, coordinates response, declares all-clear.
- Technical Lead — drives recovery procedures.
- Communications Lead — owns customer and supervisory-authority communications.
- Scribe — maintains the activation timeline for the post-mortem.
9. Communication during activation
- Public status page updated within 1 hour of activation; then at least every 2 hours until resolution.
- Direct customer email within 24 hours for SEV-1; within 72 hours for personal-data breaches per the DPA.
- A summary post-mortem published within 14 days for any activation event lasting more than 1 hour.
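The deadlines above can be computed mechanically from the activation timestamp. This is a sketch: the function names are hypothetical, and two readings are assumptions rather than statements from the plan, namely that the 72-hour breach clock starts at activation and that the 14-day post-mortem window starts at resolution.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def communication_deadlines(
    activated_at: datetime, *, sev1: bool, personal_data_breach: bool
) -> dict[str, datetime]:
    """Latest permitted time for each communication that applies to this event."""
    deadlines = {"first_status_update": activated_at + timedelta(hours=1)}
    if sev1:
        deadlines["customer_email"] = activated_at + timedelta(hours=24)
    if personal_data_breach:
        # Assumption: the 72-hour clock is measured from activation.
        deadlines["breach_notification"] = activated_at + timedelta(hours=72)
    return deadlines


def next_status_update_due(last_update_at: datetime) -> datetime:
    """Status page is refreshed at least every 2 hours until resolution."""
    return last_update_at + timedelta(hours=2)


def post_mortem_due(activated_at: datetime, resolved_at: datetime) -> Optional[datetime]:
    """Post-mortem is required only for events lasting more than 1 hour."""
    if resolved_at - activated_at > timedelta(hours=1):
        # Assumption: the 14 days are counted from resolution.
        return resolved_at + timedelta(days=14)
    return None
```

For an activation at 12:00 UTC on 1 June 2026 that is resolved two hours later, the first status update is due by 13:00 the same day and the post-mortem by 14:00 UTC on 15 June.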
10. Records
Activation events are recorded in the incident register. Drill outcomes (quarterly and annual) are recorded in the same register tagged as drills. Records are retained for at least 5 years.
11. Review
This plan is reviewed at least annually and after every activation event or material change to the architecture. The next scheduled review is 27 May 2027.
12. Related documents
- Information Security Policy
- Incident Response Policy
- Data Processing Agreement
- docs/architecture.md: full topology
- docs/observability.md: alert runbooks
- docs/operator-jwt-per-vertical.md: JWT rotation runbook
- dr-tests/: the 22 DR scenarios that exercise this plan
Counter-signed PDF copy available on request to compliance@glassbreak.io.