
Disaster Recovery Plan

Overview

This document defines the disaster recovery (DR) procedures for HeliosDB-Lite, ensuring rapid recovery from catastrophic failures while minimizing data loss.

Recovery Objectives

Metric  Target         Description
RTO     < 5 minutes    Time to restore service
RPO     < 1 minute     Maximum acceptable data loss
MTTR    < 15 minutes   Mean time to recovery
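These targets can be checked mechanically after each drill or incident. A minimal sketch (the function name and inputs are illustrative, not part of HeliosDB): RTO is the downtime window, RPO is the gap between the last committed write and the outage.

```python
from datetime import datetime, timedelta

# Recovery targets from the table above, treated as hard limits.
RTO_LIMIT = timedelta(minutes=5)   # time to restore service
RPO_LIMIT = timedelta(minutes=1)   # maximum acceptable data loss

def evaluate_drill(outage_start: datetime, service_restored: datetime,
                   last_committed_write: datetime) -> dict:
    """Check a drill (or real incident) against the RTO/RPO targets."""
    rto = service_restored - outage_start       # downtime window
    rpo = outage_start - last_committed_write   # data-loss window
    return {
        "rto": rto,
        "rpo": rpo,
        "rto_met": rto <= RTO_LIMIT,
        "rpo_met": rpo <= RPO_LIMIT,
    }

result = evaluate_drill(
    outage_start=datetime(2026, 1, 24, 12, 0, 0),
    service_restored=datetime(2026, 1, 24, 12, 3, 30),
    last_committed_write=datetime(2026, 1, 24, 11, 59, 20),
)
print(result["rto_met"], result["rpo_met"])  # True True
```

Recording these two numbers for every drill (see Testing Schedule below) is what makes the "RTO/RPO targets met" checklist item auditable.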

Disaster Scenarios

Tier 1: Component Failure

Scenario             RTO       RPO  Recovery Method
Single disk failure  0         0    RAID rebuild
Node failure         < 1 min   0    Automatic failover
Network partition    < 30 sec  0    Re-routing

Tier 2: Service Disruption

Scenario             RTO       RPO       Recovery Method
Database corruption  < 5 min   < 1 min   Branch restore
Ransomware           < 30 min  < 1 hour  Clean rebuild + backup
Accidental deletion  < 2 min   0         Time-travel recovery

Tier 3: Site Failure

Scenario            RTO        RPO         Recovery Method
Data center outage  < 15 min   < 5 min     Cross-region failover
Regional disaster   < 1 hour   < 30 min    DR site activation
Complete data loss  < 4 hours  < 24 hours  Backup restoration

HeliosDB-Specific Recovery Features

Branch-Based Recovery

HeliosDB's unique branching model enables instant recovery:

-- Create recovery point (automatic or manual)
CREATE BRANCH recovery_point_20260124;

-- Restore to recovery point
CHECKOUT BRANCH 'recovery_point_20260124';

-- Or use time-travel for precise recovery
SELECT * FROM orders AS OF '2026-01-24T12:00:00Z';

-- Create new branch from point-in-time
CREATE BRANCH recovered_data AS OF '2026-01-24T12:00:00Z';

Time-Travel Recovery

-- View data at specific point in time
SELECT * FROM users AS OF '2026-01-24T11:59:00Z';

-- Compare before/after corruption
SELECT
    before.id, before.balance AS old_balance, after.balance AS new_balance
FROM users AS OF '2026-01-24T11:59:00Z' before
JOIN users after ON before.id = after.id
WHERE before.balance != after.balance;

-- Restore specific rows (assumes the rows were deleted;
-- remove any surviving corrupted copies first to avoid duplicates)
INSERT INTO users
SELECT * FROM users AS OF '2026-01-24T11:59:00Z'
WHERE id IN (SELECT id FROM corrupted_ids);

WAL-Based Recovery

# Continuous WAL archiving
heliosdb-wal archive \
    --source /var/lib/heliosdb/wal \
    --destination s3://backup-bucket/wal

# Point-in-time recovery using WAL
heliosdb-restore \
    --base-backup /backups/base_20260123 \
    --wal-archive s3://backup-bucket/wal \
    --target-time "2026-01-24T12:00:00Z"

Backup Strategy

Backup Types

Type            Frequency  Retention  Method
Continuous WAL  Real-time  7 days     Streaming to S3
Incremental     Hourly     7 days     Branch snapshot
Full backup     Daily      30 days    Complete export
Archive         Monthly    1 year     Compressed archive
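The retention column above implies a pruning policy: a backup becomes eligible for deletion once it outlives the window for its type. A minimal sketch of that rule (the type names and helper are illustrative):

```python
from datetime import datetime, timedelta

# Retention windows from the table above.
RETENTION = {
    "wal": timedelta(days=7),
    "incremental": timedelta(days=7),
    "full": timedelta(days=30),
    "archive": timedelta(days=365),
}

def expired(backup_type: str, created: datetime, now: datetime) -> bool:
    """A backup is prunable once it is older than its retention window."""
    return now - created > RETENTION[backup_type]

now = datetime(2026, 1, 24)
print(expired("full", datetime(2025, 12, 1), now))     # True: 54 days > 30
print(expired("archive", datetime(2025, 12, 1), now))  # False: within 1 year
```

Whatever tool enforces this, pruning should run only after the daily verification job confirms a newer backup restores cleanly.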

Backup Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     Backup Architecture                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Primary ──WAL──▶ WAL Archive (S3) ──▶ DR Region              │
│     │                                                          │
│     ├──hourly──▶ Incremental Backup ──▶ Hot Storage           │
│     │                                                          │
│     ├──daily───▶ Full Backup ────────▶ Warm Storage           │
│     │                                                          │
│     └──monthly─▶ Archive ────────────▶ Cold Storage (Glacier) │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Backup Verification

# Automated backup verification (daily)
heliosdb-backup verify \
    --backup /backups/latest \
    --checksum \
    --test-restore

# Monthly restore drill
heliosdb-restore \
    --backup /backups/monthly_20260101 \
    --target /tmp/restore_test \
    --verify-queries /tests/verification_queries.sql

Recovery Procedures

Procedure 1: Single Node Recovery

Trigger: Node unresponsive, health check failing

Steps:
  1. Verify the node is truly failed (not a network issue)
  2. Automatic failover to replica (< 30 seconds)
  3. Promote the replica to primary
  4. Spin up a new replica from backup
  5. Verify replication is synchronized

# Manual failover if automatic fails
heliosdb-ha failover --force --promote replica-2

# Verify new primary
heliosdb-cli status --cluster

# Rebuild failed node
heliosdb-node rebuild \
    --from-backup latest \
    --join-cluster production

Procedure 2: Database Corruption Recovery

Trigger: Data integrity check failure, application errors

Steps:
  1. Stop writes to prevent further damage
  2. Identify the corruption scope
  3. Use time-travel to find a clean point
  4. Restore from the clean point
  5. Replay valid transactions

-- Step 1: Stop writes
ALTER SYSTEM SET default_transaction_read_only = on;

-- Step 2: Identify corruption
SELECT heliosdb_check_integrity('public.orders');

-- Step 3: Find clean point
SELECT timestamp
FROM heliosdb_branch_history
WHERE integrity_check = 'passed'
ORDER BY timestamp DESC LIMIT 1;

-- Step 4: Restore (creates new branch)
CREATE BRANCH recovery FROM 'main' AS OF '2026-01-24T11:00:00Z';
CHECKOUT BRANCH 'recovery';

-- Step 5: Merge clean branch back
-- (After verification)
MERGE BRANCH 'recovery' INTO 'main';

Procedure 3: Complete Site Failover

Trigger: Primary data center unavailable

Steps:
  1. Confirm primary site failure
  2. Activate DR site
  3. Update DNS/load balancer
  4. Verify DR site functionality
  5. Notify stakeholders

# Step 1: Verify primary failure
heliosdb-monitor check-site primary --timeout 60

# Step 2: Activate DR
heliosdb-dr activate --site dr-west --confirm

# Step 3: Update routing
heliosdb-dns update --record db.example.com --target dr-west.example.com

# Step 4: Verify
heliosdb-cli --host dr-west.example.com status

# Step 5: Send notifications
heliosdb-notify send --template dr-activation --recipients ops-team

Procedure 4: Ransomware Recovery

Trigger: Ransomware detection, encrypted files

Steps:
  1. Isolate affected systems immediately
  2. Preserve evidence for investigation
  3. Identify a clean backup (before infection)
  4. Rebuild from the clean backup
  5. Restore data, verify integrity
  6. Strengthen security controls

# Step 1: Network isolation
# WARNING: these rules drop ALL traffic, including your own SSH session.
# Run them from a console or out-of-band management interface.
iptables -A INPUT -j DROP
iptables -A OUTPUT -j DROP

# Step 2: Evidence preservation
heliosdb-forensics capture --full-system --output /secure/evidence

# Step 3: Identify clean backup
heliosdb-backup list --before "2026-01-20" --verify-clean

# Step 4: Clean rebuild
# (On isolated network)
heliosdb-restore --backup /secure/clean_backup_20260119 --new-server

# Step 5: Verify and reconnect
# (After security review)

DR Site Configuration

Active-Passive Setup

# Primary site configuration
[replication]
mode = "streaming"
primary = true
archive_command = "heliosdb-wal archive %p s3://backup/wal/%f"

[replication.standby]
host = "dr-west.internal"
port = 5432
sync_mode = "async"  # or "sync" for zero RPO
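The `sync_mode` choice directly sets the achievable RPO: with synchronous replication every commit is already on the standby when the primary fails, while with async the worst-case data loss equals the replication lag at the moment of failure. A small sketch of that relationship (function name is illustrative):

```python
from datetime import timedelta

def worst_case_rpo(sync_mode: str, replication_lag: timedelta) -> timedelta:
    """Estimate the data-loss window if the primary fails right now.

    Synchronous replication gives a zero window; asynchronous replication
    is bounded by the current standby lag.
    """
    return timedelta(0) if sync_mode == "sync" else replication_lag

print(worst_case_rpo("async", timedelta(seconds=45)))  # 0:00:45
print(worst_case_rpo("sync", timedelta(seconds=45)))   # 0:00:00
```

This is why the replication-lag alerts below fire well under the one-minute RPO target: lag is the RPO while running async.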

Active-Active Setup (Multi-Region)

# Multi-region configuration
[cluster]
name = "production"
mode = "multi-primary"

[[cluster.nodes]]
name = "us-east"
host = "db-east.example.com"
region = "us-east-1"
priority = 1

[[cluster.nodes]]
name = "us-west"
host = "db-west.example.com"
region = "us-west-2"
priority = 2

[cluster.conflict_resolution]
strategy = "last-write-wins"
vector_clock = true
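With `last-write-wins` plus vector clocks, the usual semantics are: if one version causally dominates the other (its clock is ≥ on every node and > on at least one), it wins outright; wall-clock last-write-wins is only the tie-break for truly concurrent updates. A minimal sketch of that rule, assuming versions carry a `vc` map and a commit timestamp `ts` (these field names are illustrative, not HeliosDB's wire format):

```python
from datetime import datetime

def dominates(vc_a: dict, vc_b: dict) -> bool:
    """vc_a causally follows vc_b: >= on every node, > on at least one."""
    nodes = set(vc_a) | set(vc_b)
    at_least = all(vc_a.get(n, 0) >= vc_b.get(n, 0) for n in nodes)
    strictly = any(vc_a.get(n, 0) > vc_b.get(n, 0) for n in nodes)
    return at_least and strictly

def resolve(a: dict, b: dict) -> dict:
    """Prefer the causally newer version; fall back to last-write-wins
    on wall-clock time only for concurrent updates."""
    if dominates(a["vc"], b["vc"]):
        return a
    if dominates(b["vc"], a["vc"]):
        return b
    return a if a["ts"] >= b["ts"] else b  # concurrent: LWW tie-break

east = {"value": "alice@new.example", "vc": {"us-east": 2, "us-west": 1},
        "ts": datetime(2026, 1, 24, 12, 0, 1)}
west = {"value": "alice@old.example", "vc": {"us-east": 1, "us-west": 1},
        "ts": datetime(2026, 1, 24, 12, 0, 5)}
print(resolve(east, west)["value"])  # east wins: it causally dominates
```

Note that the causally dominant version wins here even though its wall-clock timestamp is older; relying on timestamps alone would silently lose the east write.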

Testing Schedule

Test Type            Frequency  Duration   Scope
Backup verification  Daily      Automated  All backups
Component failover   Weekly     15 min     Individual nodes
Site failover        Monthly    2 hours    Full DR drill
Full DR simulation   Quarterly  4 hours    Complete scenario

Test Checklist

  • [ ] Backup integrity verified
  • [ ] Recovery scripts executed successfully
  • [ ] RTO/RPO targets met
  • [ ] All applications reconnected
  • [ ] Data integrity confirmed
  • [ ] Performance acceptable
  • [ ] Documentation updated

Monitoring & Alerting

DR Health Checks

# Monitoring configuration
alerts:
  - name: replication_lag
    condition: lag > 60s
    severity: warning

  - name: replication_lag_critical
    condition: lag > 300s
    severity: critical

  - name: backup_age
    condition: last_backup > 24h
    severity: critical

  - name: dr_site_health
    condition: dr_site_unreachable
    severity: critical
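The rules above are simple threshold checks, so the evaluation logic can be sketched directly (the metric names and evaluator are illustrative, not the monitoring system's actual API). Rules are ordered most-severe first so the critical lag alert shadows the warning:

```python
from datetime import timedelta

# Thresholds mirroring the alert rules above.
ALERTS = [
    ("replication_lag_critical",
     lambda m: m["lag"] > timedelta(seconds=300), "critical"),
    ("replication_lag",
     lambda m: m["lag"] > timedelta(seconds=60), "warning"),
    ("backup_age",
     lambda m: m["last_backup_age"] > timedelta(hours=24), "critical"),
    ("dr_site_health",
     lambda m: not m["dr_site_reachable"], "critical"),
]

def fire(metrics: dict) -> list:
    """Return (name, severity) for every rule whose condition holds."""
    return [(name, sev) for name, cond, sev in ALERTS if cond(metrics)]

metrics = {"lag": timedelta(seconds=90),
           "last_backup_age": timedelta(hours=2),
           "dr_site_reachable": True}
print(fire(metrics))  # [('replication_lag', 'warning')]
```

A 90-second lag trips only the warning; past five minutes the critical rule fires as well and should page L1 immediately.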

Contacts

Escalation Path

Level  Contact             When
L1     On-call engineer    Initial response
L2     Database team lead  Unresolved after 15 min
L3     VP Engineering      Site-wide failure
L4     Executive team      Customer impact > 1 hour

Emergency Contacts

  • Primary On-Call: [PagerDuty rotation]
  • Database Team: db-team@heliosdb.io
  • Emergency Line: +1-xxx-xxx-xxxx