Business Continuity & Disaster Recovery Plan

Owner: Security Officer · Approved by leadership · Version 1.0 · Effective 27 May 2026 · Next review 27 May 2027 · Tested nightly via dr-tests/

1. Purpose

This plan describes how Glassbreak maintains continuity of service during disruption and recovers from disaster events. It states our recovery objectives, the continuity strategy that delivers them, the criteria for activating recovery procedures, and the cadence at which the plan is tested.

2. Scope

This plan covers the production Glassbreak service, including all surfaces under glassbreak.io, glassbreak.cloud, glass-break.com, the in-box PostgreSQL database on each box, the per-box object storage, the audit log, and the supporting observability and CI/CD pipelines.

3. Recovery objectives

RTO (Recovery Time Objective): 15 minutes for cross-box failover of the primary glassbreak.io surface. Failover is automated (Patroni promotes the standby; Fastly health-checked routing shifts traffic); no human action is required for a single-box failure.
RPO (Recovery Point Objective): near-zero for data streamed to the hot standby via native PostgreSQL streaming replication (Patroni), bounded by replication lag.
RPO for quorum-recoverable secrets: zero in a single-box failure. Under the Shamir-share model, a secret is recoverable without data loss as long as the quorum threshold of shares can still be assembled on the surviving box.
Recovery model: a continuously-replicated hot standby with automatic failover is the primary recovery path. Off-box encrypted backup snapshots with defined retention are being rolled out.
Disaster-recovery verification: nightly via the DR test suite in CI (scenario 22 runs a real round-trip restore).
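
The quorum property behind the zero RPO for secrets can be illustrated with a minimal Shamir secret-sharing sketch. This is not Glassbreak's implementation: the field modulus, the function names split_secret and recover_secret, and the 2-of-3 split in the usage note are assumptions chosen for illustration, and the code is not hardened for production use.

```python
import secrets

# Assumed field modulus for this sketch: the Mersenne prime 2**127 - 1.
PRIME = 2**127 - 1


def split_secret(secret: int, threshold: int, share_count: int) -> list[tuple[int, int]]:
    """Split `secret` into `share_count` shares; any `threshold` of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret must fit in the field")
    if not 1 <= threshold <= share_count:
        raise ValueError("need 1 <= threshold <= share_count")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coefficients = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, share_count + 1):
        y = 0
        for coefficient in reversed(coefficients):  # Horner's rule
            y = (y * x + coefficient) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares: list[tuple[int, int]]) -> int:
    """Lagrange-interpolate the polynomial at x = 0 from the given shares."""
    secret = 0
    for i, (x_i, y_i) in enumerate(shares):
        numerator, denominator = 1, 1
        for j, (x_j, _) in enumerate(shares):
            if i != j:
                numerator = (numerator * -x_j) % PRIME
                denominator = (denominator * (x_i - x_j)) % PRIME
        # pow(d, -1, p) is the modular inverse (Python 3.8+).
        secret = (secret + y_i * numerator * pow(denominator, -1, PRIME)) % PRIME
    return secret
```

With a hypothetical 2-of-3 split, recover_secret(split_secret(s, 2, 3)[:2]) returns s, which is why losing the shares held on one box costs nothing as long as the surviving box can still assemble the threshold.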

4. Continuity strategy

4.1 Multi-cloud architecture

The platform runs across two always-on VMs served through three domains. Either box can serve traffic alone if the other is unavailable. Detailed topology is in docs/architecture.md.

Box A: an always-on AWS VM in us-east-1 running a Docker Compose stack (Caddy + Node API + in-box PostgreSQL) with block storage for static assets and blob storage.
Box B: an always-on Scaleway VM in fr-par running the same stack (Caddy + Node API + in-box PostgreSQL) with Scaleway block/object storage.
Routing: Fastly fronts the primary surface with health-checked failover. Direct surfaces (glassbreak.cloud, glass-break.com) bypass Fastly entirely, surviving Fastly outages.

4.2 Cross-box replication

State-changing writes go to the current primary and stream to the hot standby via native PostgreSQL streaming replication managed by Patroni + etcd (a small etcd arbiter provides the quorum tiebreaker). The AWS (us-east-1) box currently runs the primary.
Replication traffic runs over a Rosenpass-secured post-quantum WireGuard mesh.
On primary failure, Patroni promotes the standby automatically; Caddy forwards writes to the new primary (DR scenarios exercise the promotion path).
Replica lag is monitored; recovery of a failed box re-syncs it from the current primary before it rejoins as a standby.
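
Because the RPO is bounded by replication lag, a lag check is the natural health signal for the standby. The sketch below computes byte lag from two PostgreSQL LSN strings (the text form is two hexadecimal numbers separated by a slash, the high and low 32 bits of the WAL position). The function names and the 16 MiB threshold are assumptions for illustration; the plan does not state the threshold actually used.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/3000060' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(primary_lsn: str, standby_replay_lsn: str) -> int:
    """Bytes of WAL the standby has not yet replayed (never negative)."""
    return max(0, parse_lsn(primary_lsn) - parse_lsn(standby_replay_lsn))


def standby_within_lag_budget(
    primary_lsn: str,
    standby_replay_lsn: str,
    max_lag_bytes: int = 16 * 1024 * 1024,  # hypothetical budget, not from the plan
) -> bool:
    """True when the standby is close enough to the primary to honor the RPO."""
    return replication_lag_bytes(primary_lsn, standby_replay_lsn) <= max_lag_bytes
```

In practice the two LSN values would come from the primary and the standby respectively; how they are collected is outside this sketch.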

4.3 Backup

The primary recovery path is the continuously-replicated hot standby with automatic failover. Off-box encrypted backup snapshots with defined retention are being rolled out.
Object storage relies on cloud-provider versioning and cross-region durability guarantees.
Disaster-recovery scenarios are verified nightly in CI (DR scenario 22 runs a real round-trip restore that re-runs application-level queries against the restored copy).

4.4 Configuration and source-code recovery

All infrastructure is declared in OpenTofu and stored in the GitHub repository. The cloud account state can be reconstructed by running tofu apply with a fresh provider credential.
Source code is mirrored across GitHub (primary) and developer workstations.
Secret material is held in approved secret stores; the rotation runbook for each secret class is documented in docs/.

5. Activation

5.1 Detection

Automated detection: when Fastly health checks fail for a backend, automatic failover is triggered within seconds.
Observability detection: SLO breaches (5xx rate, p95 latency, apex probe) page the on-call responder.
Heartbeat detection: a 30-minute smoke-test heartbeat detects sustained outages on any production surface and opens a deduplicated GitHub issue.
Manual detection: workforce or customer reports.
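
The observability path above pages on SLO breaches in the 5xx rate and p95 latency. A minimal sketch of that evaluation over a window of requests follows. The Request type, the function names, and both thresholds (1% errors, 500 ms p95) are assumptions for illustration; the plan does not state the real SLO values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code
    latency_ms: float  # end-to-end latency in milliseconds


def error_rate(requests: list[Request]) -> float:
    """Fraction of requests in the window that returned a 5xx status."""
    if not requests:
        return 0.0
    return sum(1 for r in requests if 500 <= r.status <= 599) / len(requests)


def p95_latency_ms(requests: list[Request]) -> float:
    """95th-percentile latency using the nearest-rank method."""
    if not requests:
        return 0.0
    ordered = sorted(r.latency_ms for r in requests)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def should_page(
    requests: list[Request],
    max_error_rate: float = 0.01,  # hypothetical SLO threshold
    max_p95_ms: float = 500.0,     # hypothetical SLO threshold
) -> bool:
    """Page the on-call responder when either SLO is breached."""
    return error_rate(requests) > max_error_rate or p95_latency_ms(requests) > max_p95_ms
```

The nearest-rank percentile is used here because it needs no interpolation and always returns an observed latency; a production monitoring stack would normally compute this from histograms instead.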

5.2 Activation triggers

This plan is activated when any one of the following is true:

Both boxes are unavailable for more than 5 minutes.
Any single box is unavailable and cross-box failover has failed to restore service.
Data corruption is suspected or confirmed on a production database.
A region-wide event affects either AWS us-east-1 or Scaleway fr-par.
The Incident Commander declares activation under the Incident Response Policy.
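
The triggers above combine with a logical OR. As a sketch, they can be encoded as a single predicate that also reports which trigger fired. The PlatformStatus fields and function names are hypothetical; only the five conditions and the 5-minute threshold come from this plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlatformStatus:
    both_boxes_down_minutes: float = 0.0
    single_box_down: bool = False
    failover_restored_service: bool = True
    corruption_suspected: bool = False
    region_wide_event: bool = False
    commander_declared: bool = False


def activation_reasons(status: PlatformStatus) -> list[str]:
    """Return every activation trigger from section 5.2 that currently holds."""
    reasons = []
    if status.both_boxes_down_minutes > 5:
        reasons.append("both boxes unavailable for more than 5 minutes")
    if status.single_box_down and not status.failover_restored_service:
        reasons.append("single box unavailable and failover did not restore service")
    if status.corruption_suspected:
        reasons.append("data corruption suspected or confirmed")
    if status.region_wide_event:
        reasons.append("region-wide event in AWS us-east-1 or Scaleway fr-par")
    if status.commander_declared:
        reasons.append("Incident Commander declared activation")
    return reasons


def plan_is_activated(status: PlatformStatus) -> bool:
    """The plan activates when any one trigger is true."""
    return bool(activation_reasons(status))
```

Returning the list of reasons rather than a bare boolean gives the Scribe a ready-made first timeline entry.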

5.3 Activation procedure

The Incident Commander declares activation and names the Technical Lead and Communications Lead.
The Technical Lead opens the runbook for the failure mode (vertical-down, region-down, data-corruption, etc.).
The Communications Lead updates the public status page within 1 hour and notifies affected customers within the SLA set in the IR policy.
The Scribe maintains the timeline.

6. Recovery procedures by scenario

6.1 Single-box compute outage

Automatic. Fastly fails over to the surviving box within health-check latency; if the failed box held the primary, Patroni promotes the standby. Customers on direct surfaces (glassbreak.cloud if the Scaleway box is alive, or glass-break.com if the AWS box is alive) are unaffected. Streaming replication resumes when the failed box recovers. Tested in DR scenario 2.

6.2 Single-box database outage

The box's compute remains up but its PostgreSQL cannot serve reads or writes. If it held the primary, Patroni fails the writer over to the standby; Fastly routes traffic away. The surviving box continues to serve. On recovery, the failed box re-syncs from the current primary via streaming replication. Tested in DR scenario 6.

6.3 Both boxes unavailable

This is the hard case, and the activation criteria in section 5.2 are met. Steps:

Public status page updated immediately.
Customers notified within 1 hour.
If the cause is a shared dependency, mitigation focuses on that dependency.
If the cause is independent failures, recovery is per-box in parallel, with the faster-recovered box resuming service first.
The planned third box (Microsoft Azure) is the mitigation for concurrent AWS + Scaleway outages.

6.4 Data corruption on a single box

Isolate the affected PostgreSQL from replication to prevent propagation.
Fail the writer over to the healthy box; recover the affected box from the current primary, or from an off-box snapshot taken before the corruption.
Re-sync the recovered box from the primary via streaming replication.
Verify quorum-recoverable secrets via the in-product diagnostic before declaring recovery.

6.5 Data corruption affecting both boxes

This is extremely unlikely, but because the two boxes form one replicated cluster, logical corruption can propagate. If it occurs:

Take both surfaces read-only.
Determine the source of corruption (application bug, replication issue, malicious action).
Restore from the most recent verified off-box snapshot predating the corruption.
Reconcile manually; replay validated user actions from audit log entries that survived corruption.
Communicate transparently with affected customers.

6.6 Loss of Fastly (edge / CDN)

Customers on glassbreak.io may experience degraded routing until Fastly is restored. Direct surfaces are unaffected. Customers can be directed to the direct surfaces via the public status page. Tested in DR scenario 5.
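
The relationship between component failures and customer-facing surfaces in sections 6.1 and 6.6 can be summarized as a small lookup. The sketch below is a simplification: the function name and the three status labels are assumptions, and "degraded" stands in for the degraded routing described above when Fastly is unavailable.

```python
def surface_status(aws_up: bool, scaleway_up: bool, fastly_up: bool) -> dict[str, str]:
    """Map box and edge health to the expected state of each public surface."""
    if not (aws_up or scaleway_up):
        primary = "down"      # no box left to serve traffic
    elif fastly_up:
        primary = "ok"        # Fastly routes to whichever box is healthy
    else:
        primary = "degraded"  # Fastly outage: routing for glassbreak.io is impaired
    return {
        "glassbreak.io": primary,
        # Direct surfaces bypass Fastly and depend only on their own box.
        "glassbreak.cloud": "ok" if scaleway_up else "down",
        "glass-break.com": "ok" if aws_up else "down",
    }
```

For example, surface_status(aws_up=True, scaleway_up=True, fastly_up=False) reports both direct surfaces as "ok", which is the basis for redirecting customers to them via the status page.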

6.7 Loss of a DNS authority

Each domain has dual DNS authority, so the loss of a single authority does not make any domain unresolvable. No activation is required; document the event in the incident register and replace the failed authority on Glassbreak's normal schedule.

6.8 Loss of a cryptographic signing key

Follow the per-vertical JWT runbook (docs/operator-jwt-per-vertical.md): the compromised kid is dropped from the verify map across all verticals, token_version is bumped to invalidate outstanding tokens, the operator re-issues sessions, and the incident is logged. Tested in DR scenario 7.
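
The first two runbook steps can be sketched as follows. This models only the bookkeeping (dropping a kid and bumping token_version) and performs no signature verification. The Vertical and AuthState types, the function names, and the choice of a single global token_version are assumptions for illustration; the runbook in docs/operator-jwt-per-vertical.md is authoritative.

```python
from dataclasses import dataclass, field


@dataclass
class Vertical:
    name: str
    # Verify map: key id (kid) -> public key material.
    verify_keys: dict[str, bytes] = field(default_factory=dict)


@dataclass
class AuthState:
    verticals: list[Vertical]
    token_version: int = 1  # assumed global; tokens carrying an older version are rejected


def revoke_kid(state: AuthState, compromised_kid: str) -> None:
    """Drop the kid from every vertical's verify map and invalidate outstanding tokens."""
    for vertical in state.verticals:
        vertical.verify_keys.pop(compromised_kid, None)
    state.token_version += 1


def token_is_acceptable(state: AuthState, vertical: Vertical, kid: str, token_version: int) -> bool:
    """Pre-check before signature verification: known kid and current token_version."""
    return kid in vertical.verify_keys and token_version == state.token_version
```

Bumping token_version in addition to dropping the kid matters because tokens signed by other, uncompromised keys may have been minted during the compromise window; the bump forces every session to be re-issued.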

6.9 Loss of a sub-processor

Stripe — payment processing degrades; in-app billing UI surfaces an outage banner; subscription state remains consistent until Stripe recovers.
Amazon SES / Scaleway TEM (email) — email-dependent flows (verification, password reset) are queued; users informed via the status page.
Twilio (SMS) — SMS-dependent flows are queued; users informed via the status page.
Grafana Cloud — telemetry ingestion stops; smoke tests continue locally; alerting is degraded.

Each sub-processor failure has a runbook entry in docs/observability.md.

7. Testing

Continuous: 22 DR scenarios in dr-tests/ run in CI nightly against a real two-box PostgreSQL topology. Coverage spans replication convergence, box kill, primary-failover promotion, TURN failover, secret quorum recovery, edge bypass, DB recovery, JWT rotation grace, refresh-token reuse detection, CGN call, audit chain integrity, cron idempotency, rate-limit resilience, secrets rotation, migration safety, replica lag, and backup integrity.
Quarterly: a tabletop walk-through of one scenario by the on-call responder, exercising the communications path end-to-end without impacting production.
Annual: a full activation drill including the public status-page update flow (in test mode) and customer-notification template review.

8. Roles

Incident Commander — declares activation, coordinates response, declares all-clear.
Technical Lead — drives recovery procedures.
Communications Lead — owns customer and supervisory-authority communications.
Scribe — maintains the activation timeline for the post-mortem.

9. Communication during activation

Public status page updated within 1 hour of activation; then at least every 2 hours until resolution.
Direct customer email within 24 hours for SEV-1; within 72 hours for personal-data breaches per the DPA.
A summary post-mortem published within 14 days for any activation event lasting more than 1 hour.
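
The deadlines above are simple offsets and can be computed mechanically once activation is declared. In the sketch below, the function names and dictionary keys are hypothetical, and two reference points are assumptions because this plan does not fix them: the 24-hour and 72-hour windows are measured from activation, and the 14-day post-mortem window is measured from resolution.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def communication_deadlines(
    activated_at: datetime, sev1: bool, personal_data_breach: bool
) -> dict[str, datetime]:
    """Latest permitted time for each communication in section 9."""
    deadlines = {"first_status_page_update": activated_at + timedelta(hours=1)}
    if sev1:
        deadlines["customer_email"] = activated_at + timedelta(hours=24)
    if personal_data_breach:
        # Assumption: clock starts at activation; the DPA governs the real start.
        deadlines["breach_notification"] = activated_at + timedelta(hours=72)
    return deadlines


def next_status_update_due(last_update_at: datetime) -> datetime:
    """Status page must be refreshed at least every 2 hours until resolution."""
    return last_update_at + timedelta(hours=2)


def postmortem_due(activated_at: datetime, resolved_at: datetime) -> Optional[datetime]:
    """Post-mortem is required only for events lasting more than 1 hour."""
    if resolved_at - activated_at > timedelta(hours=1):
        return resolved_at + timedelta(days=14)  # assumption: counted from resolution
    return None
```

Passing timezone-aware datetimes (for example datetime.now(timezone.utc)) avoids ambiguity when responders are spread across regions.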

10. Records

Activation events are recorded in the incident register. Drill outcomes (quarterly and annual) are recorded in the same register tagged as drills. Records are retained for at least 5 years.

11. Review

This plan is reviewed at least annually and after every activation event or material change to the architecture. The next scheduled review is 27 May 2027.

12. Related documents

Information Security Policy
Incident Response Policy
Data Processing Agreement
docs/architecture.md — full topology
docs/observability.md — alert runbooks
docs/operator-jwt-per-vertical.md — JWT rotation runbook
dr-tests/ — the 22 DR scenarios that exercise this plan

Counter-signed PDF copy available on request to compliance@glassbreak.io.