# Disaster Recovery Plan

## Overview
This document defines the disaster recovery (DR) procedures for HeliosDB-Lite, ensuring rapid recovery from catastrophic failures while minimizing data loss.
## Recovery Objectives
| Metric | Target | Description |
|---|---|---|
| RTO | < 5 minutes | Time to restore service |
| RPO | < 1 minute | Maximum acceptable data loss |
| MTTR | < 15 minutes | Mean time to recovery |
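These targets are easiest to enforce when drill results are checked against them automatically. A minimal sketch (the thresholds mirror the table above; the function and metric names are illustrative, not part of HeliosDB):

```python
# Recovery targets from the table above, expressed in seconds.
TARGETS = {"rto": 5 * 60, "rpo": 60, "mttr": 15 * 60}

def check_targets(measured):
    """Return, per metric, whether the measured value meets its target."""
    return {metric: measured[metric] <= limit for metric, limit in TARGETS.items()}

# Example: a drill that restored service in 4 min with 30 s of data loss
# and a 10-minute mean time to recovery.
result = check_targets({"rto": 240, "rpo": 30, "mttr": 600})
```

Feeding each drill's measured numbers through a check like this turns the table into a pass/fail gate rather than aspirational documentation.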
## Disaster Scenarios

### Tier 1: Component Failure
| Scenario | RTO | RPO | Recovery Method |
|---|---|---|---|
| Single disk failure | 0 | 0 | RAID rebuild |
| Node failure | < 1 min | 0 | Automatic failover |
| Network partition | < 30 sec | 0 | Re-routing |
### Tier 2: Service Disruption
| Scenario | RTO | RPO | Recovery Method |
|---|---|---|---|
| Database corruption | < 5 min | < 1 min | Branch restore |
| Ransomware | < 30 min | < 1 hour | Clean rebuild + backup |
| Accidental deletion | < 2 min | 0 | Time-travel recovery |
### Tier 3: Site Failure
| Scenario | RTO | RPO | Recovery Method |
|---|---|---|---|
| Data center outage | < 15 min | < 5 min | Cross-region failover |
| Regional disaster | < 1 hour | < 30 min | DR site activation |
| Complete data loss | < 4 hours | < 24 hours | Backup restoration |
## HeliosDB-Specific Recovery Features

### Branch-Based Recovery

HeliosDB's branching model enables near-instant recovery:

```sql
-- Create a recovery point (automatic or manual)
CREATE BRANCH recovery_point_20260124;

-- Restore to the recovery point
CHECKOUT BRANCH 'recovery_point_20260124';

-- Or use time-travel for precise recovery
SELECT * FROM orders AS OF '2026-01-24T12:00:00Z';

-- Create a new branch from a point in time
CREATE BRANCH recovered_data AS OF '2026-01-24T12:00:00Z';
```
### Time-Travel Recovery

```sql
-- View data at a specific point in time
SELECT * FROM users AS OF '2026-01-24T11:59:00Z';

-- Compare before/after corruption
SELECT
    before.id,
    before.balance AS old_balance,
    after.balance AS new_balance
FROM users AS OF '2026-01-24T11:59:00Z' before
JOIN users after ON before.id = after.id
WHERE before.balance != after.balance;

-- Restore specific rows
INSERT INTO users
SELECT * FROM users AS OF '2026-01-24T11:59:00Z'
WHERE id IN (SELECT id FROM corrupted_ids);
```
### WAL-Based Recovery

```bash
# Continuous WAL archiving
heliosdb-wal archive \
  --source /var/lib/heliosdb/wal \
  --destination s3://backup-bucket/wal

# Point-in-time recovery using WAL
heliosdb-restore \
  --base-backup /backups/base_20260123 \
  --wal-archive s3://backup-bucket/wal \
  --target-time "2026-01-24T12:00:00Z"
```
## Backup Strategy

### Backup Types
| Type | Frequency | Retention | Method |
|---|---|---|---|
| Continuous WAL | Real-time | 7 days | Streaming to S3 |
| Incremental | Hourly | 7 days | Branch snapshot |
| Full backup | Daily | 30 days | Complete export |
| Archive | Monthly | 1 year | Compressed archive |
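The tiers complement each other: snapshot-based backups alone cannot meet the < 1 minute RPO target, because anything written since the last run is lost. A small sketch makes that explicit (frequencies follow the table; the 10-second WAL flush interval is an assumption for illustration):

```python
# Worst-case RPO when a given backup tier is the newest usable recovery
# source. Frequencies follow the table above; continuous WAL streaming is
# modeled by its flush interval, assumed here to be 10 seconds.
FREQUENCY_SECONDS = {
    "continuous_wal": 10,    # assumed streaming flush interval
    "incremental": 3600,     # hourly branch snapshots
    "full": 86400,           # daily full backups
}

def worst_case_rpo(tier):
    """Data written just after a backup run is lost until the next run."""
    return FREQUENCY_SECONDS[tier]

# Only the WAL tier meets the < 1 minute RPO target on its own.
meets_rpo = {tier: worst_case_rpo(tier) < 60 for tier in FREQUENCY_SECONDS}
```

This is why WAL streaming is the first hop in the architecture below, with snapshots serving as fallbacks for deeper recovery.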
### Backup Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                       Backup Architecture                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Primary ──WAL──▶ WAL Archive (S3) ──▶ DR Region                │
│     │                                                           │
│     ├──hourly──▶ Incremental Backup ──▶ Hot Storage             │
│     │                                                           │
│     ├──daily───▶ Full Backup ────────▶ Warm Storage             │
│     │                                                           │
│     └──monthly─▶ Archive ────────────▶ Cold Storage (Glacier)   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
### Backup Verification

```bash
# Automated backup verification (daily)
heliosdb-backup verify \
  --backup /backups/latest \
  --checksum \
  --test-restore

# Monthly restore drill
heliosdb-restore \
  --backup /backups/monthly_20260101 \
  --target /tmp/restore_test \
  --verify-queries /tests/verification_queries.sql
```
## Recovery Procedures

### Procedure 1: Single Node Recovery

**Trigger:** Node unresponsive, health check failing

**Steps:**

1. Verify the node has truly failed (not a network issue)
2. Automatic failover to a replica (< 30 seconds)
3. Promote the replica to primary
4. Spin up a new replica from backup
5. Verify replication is synchronized

```bash
# Manual failover if automatic failover fails
heliosdb-ha failover --force --promote replica-2

# Verify the new primary
heliosdb-cli status --cluster

# Rebuild the failed node
heliosdb-node rebuild \
  --from-backup latest \
  --join-cluster production
```
### Procedure 2: Database Corruption Recovery

**Trigger:** Data integrity check failure, application errors

**Steps:**

1. Stop writes to prevent further damage
2. Identify the corruption scope
3. Use time-travel to find a clean point
4. Restore from the clean point
5. Replay valid transactions

```sql
-- Step 1: Stop writes
ALTER SYSTEM SET default_transaction_read_only = on;

-- Step 2: Identify corruption
SELECT heliosdb_check_integrity('public.orders');

-- Step 3: Find a clean point
SELECT timestamp
FROM heliosdb_branch_history
WHERE integrity_check = 'passed'
ORDER BY timestamp DESC LIMIT 1;

-- Step 4: Restore (creates a new branch)
CREATE BRANCH recovery FROM 'main' AS OF '2026-01-24T11:00:00Z';
CHECKOUT BRANCH 'recovery';

-- Step 5: Merge the clean branch back (after verification)
MERGE BRANCH 'recovery' INTO 'main';
```
### Procedure 3: Complete Site Failover

**Trigger:** Primary data center unavailable

**Steps:**

1. Confirm the primary site has failed
2. Activate the DR site
3. Update DNS / load balancer
4. Verify DR site functionality
5. Notify stakeholders

```bash
# Step 1: Verify primary failure
heliosdb-monitor check-site primary --timeout 60

# Step 2: Activate DR
heliosdb-dr activate --site dr-west --confirm

# Step 3: Update routing
heliosdb-dns update --record db.example.com --target dr-west.example.com

# Step 4: Verify
heliosdb-cli --host dr-west.example.com status

# Step 5: Send notifications
heliosdb-notify send --template dr-activation --recipients ops-team
```
### Procedure 4: Ransomware Recovery

**Trigger:** Ransomware detection, encrypted files

**Steps:**

1. Isolate affected systems immediately
2. Preserve evidence for investigation
3. Identify a clean backup (taken before the infection)
4. Rebuild from the clean backup
5. Restore data and verify integrity
6. Strengthen security controls

```bash
# Step 1: Network isolation
# (run from the console: these rules drop ALL traffic, including SSH)
iptables -A INPUT -j DROP
iptables -A OUTPUT -j DROP

# Step 2: Evidence preservation
heliosdb-forensics capture --full-system --output /secure/evidence

# Step 3: Identify a clean backup
heliosdb-backup list --before "2026-01-20" --verify-clean

# Step 4: Clean rebuild (on an isolated network)
heliosdb-restore --backup /secure/clean_backup_20260119 --new-server

# Step 5: Verify and reconnect (after security review)
```
## DR Site Configuration

### Active-Passive Setup

```toml
# Primary site configuration
[replication]
mode = "streaming"
primary = true
archive_command = "heliosdb-wal archive %p s3://backup/wal/%f"

[replication.standby]
host = "dr-west.internal"
port = 5432
sync_mode = "async"  # or "sync" for zero RPO
```
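The `sync_mode` choice is the core RPO/latency trade-off: synchronous replication guarantees zero data loss on failover but adds a standby round-trip to every commit. A rough model (the latency figures are illustrative assumptions, not measurements):

```python
# Rough model of the sync_mode trade-off above: synchronous replication
# buys RPO = 0 at the cost of one standby round-trip on every commit.
# The latency figures are illustrative, not measurements.
def commit_latency_ms(local_fsync_ms, standby_rtt_ms, sync):
    """Commit cost: local fsync, plus the standby round-trip if synchronous."""
    return local_fsync_ms + (standby_rtt_ms if sync else 0.0)

async_ms = commit_latency_ms(2.0, 40.0, sync=False)  # fast commits, RPO > 0
sync_ms = commit_latency_ms(2.0, 40.0, sync=True)    # slower commits, RPO = 0
```

With a cross-region round trip of tens of milliseconds, `"async"` is the usual default; `"sync"` is worth the latency only for workloads where losing even the last few transactions is unacceptable.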
### Active-Active Setup (Multi-Region)

```toml
# Multi-region configuration
[cluster]
name = "production"
mode = "multi-primary"

[[cluster.nodes]]
name = "us-east"
host = "db-east.example.com"
region = "us-east-1"
priority = 1

[[cluster.nodes]]
name = "us-west"
host = "db-west.example.com"
region = "us-west-2"
priority = 2

[cluster.conflict_resolution]
strategy = "last-write-wins"
vector_clock = true
```
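In this configuration, vector clocks determine whether two replicas' writes are causally ordered or truly concurrent; only concurrent writes fall back to last-write-wins. A minimal sketch of that logic (the `Version` shape and node names are illustrative, not HeliosDB internals):

```python
# Sketch of the conflict-resolution strategy configured above: vector
# clocks detect concurrent writes, and last-write-wins breaks the tie.
from dataclasses import dataclass

@dataclass
class Version:
    value: str
    clock: dict       # node name -> logical counter
    wall_time: float  # used only to break concurrent-write ties

def dominates(a, b):
    """True if vector clock a is causally at or ahead of b on every node."""
    return all(a.get(n, 0) >= b.get(n, 0) for n in set(a) | set(b))

def resolve(local, remote):
    if dominates(local.clock, remote.clock):
        return local   # local already incorporates the remote write
    if dominates(remote.clock, local.clock):
        return remote  # remote is causally newer
    # Concurrent writes: fall back to last-write-wins on wall-clock time.
    return max(local, remote, key=lambda v: v.wall_time)
```

Note the usual caveat with last-write-wins: for genuinely concurrent updates, one side's write is silently discarded, so it suits data where the newest value is authoritative (session state, caches) more than financial records.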
## Testing Schedule
| Test Type | Frequency | Duration | Scope |
|---|---|---|---|
| Backup verification | Daily | Automated | All backups |
| Component failover | Weekly | 15 min | Individual nodes |
| Site failover | Monthly | 2 hours | Full DR drill |
| Full DR simulation | Quarterly | 4 hours | Complete scenario |
### Test Checklist
- [ ] Backup integrity verified
- [ ] Recovery scripts executed successfully
- [ ] RTO/RPO targets met
- [ ] All applications reconnected
- [ ] Data integrity confirmed
- [ ] Performance acceptable
- [ ] Documentation updated
## Monitoring & Alerting

### DR Health Checks

```yaml
# Monitoring configuration
alerts:
  - name: replication_lag
    condition: lag > 60s
    severity: warning
  - name: replication_lag_critical
    condition: lag > 300s
    severity: critical
  - name: backup_age
    condition: last_backup > 24h
    severity: critical
  - name: dr_site_health
    condition: dr_site_unreachable
    severity: critical
```
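To show how these rules behave, here is a sketch of evaluating them against a metrics snapshot (thresholds mirror the YAML above; the metric field names are illustrative):

```python
# Sketch of how the alert rules above could be evaluated against a metrics
# snapshot. Thresholds mirror the YAML; metric field names are illustrative.
ALERTS = [
    ("replication_lag", lambda m: m["lag_s"] > 60, "warning"),
    ("replication_lag_critical", lambda m: m["lag_s"] > 300, "critical"),
    ("backup_age", lambda m: m["last_backup_h"] > 24, "critical"),
    ("dr_site_health", lambda m: not m["dr_reachable"], "critical"),
]

def evaluate(metrics):
    """Return (name, severity) for every alert whose condition fires."""
    return [(name, sev) for name, cond, sev in ALERTS if cond(metrics)]

# A replica lagging 2 minutes trips the warning but not the critical alert.
fired = evaluate({"lag_s": 120, "last_backup_h": 6, "dr_reachable": True})
```

The two-tier lag thresholds give operators a 4-minute window to intervene before the situation is critical, which lines up with the < 5 minute RTO target.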
## Contacts

### Escalation Path
| Level | Contact | When |
|---|---|---|
| L1 | On-call engineer | Initial response |
| L2 | Database team lead | Escalation > 15 min |
| L3 | VP Engineering | Site-wide failure |
| L4 | Executive team | Customer impact > 1 hour |
### Emergency Contacts
- Primary On-Call: [PagerDuty rotation]
- Database Team: db-team@heliosdb.io
- Emergency Line: +1-xxx-xxx-xxxx