
Disaster Recovery Plan

Overview

This document defines the disaster recovery (DR) procedures for HeliosDB-Lite, ensuring rapid recovery from catastrophic failures while minimizing data loss.

Recovery Objectives

Metric  Target         Description
RTO     < 5 minutes    Time to restore service
RPO     < 1 minute     Maximum acceptable data loss
MTTR    < 15 minutes   Mean time to recovery
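These targets can be checked mechanically after each drill or incident. A minimal sketch (the function name and inputs are illustrative, not part of HeliosDB): RTO is the downtime window, RPO is the gap between the last committed write and the outage.

```python
from datetime import datetime, timedelta

# Recovery targets from the table above, treated as hard limits.
RTO_LIMIT = timedelta(minutes=5)   # time to restore service
RPO_LIMIT = timedelta(minutes=1)   # maximum acceptable data loss

def evaluate_drill(outage_start: datetime, service_restored: datetime,
                   last_committed_write: datetime) -> dict:
    """Check a drill (or real incident) against the RTO/RPO targets."""
    rto = service_restored - outage_start       # downtime window
    rpo = outage_start - last_committed_write   # data-loss window
    return {
        "rto": rto,
        "rpo": rpo,
        "rto_met": rto <= RTO_LIMIT,
        "rpo_met": rpo <= RPO_LIMIT,
    }

result = evaluate_drill(
    outage_start=datetime(2026, 1, 24, 12, 0, 0),
    service_restored=datetime(2026, 1, 24, 12, 3, 30),
    last_committed_write=datetime(2026, 1, 24, 11, 59, 20),
)
print(result["rto_met"], result["rpo_met"])  # True True
```

Recording these two numbers for every drill (see Testing Schedule below) is what makes the "RTO/RPO targets met" checklist item auditable.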

Disaster Scenarios

Tier 1: Component Failure

Scenario             RTO       RPO  Recovery Method
Single disk failure  0         0    RAID rebuild
Node failure         < 1 min   0    Automatic failover
Network partition    < 30 sec  0    Re-routing

Tier 2: Service Disruption

Scenario             RTO       RPO       Recovery Method
Database corruption  < 5 min   < 1 min   Branch restore
Ransomware           < 30 min  < 1 hour  Clean rebuild + backup
Accidental deletion  < 2 min   0         Time-travel recovery

Tier 3: Site Failure

Scenario            RTO        RPO         Recovery Method
Data center outage  < 15 min   < 5 min     Cross-region failover
Regional disaster   < 1 hour   < 30 min    DR site activation
Complete data loss  < 4 hours  < 24 hours  Backup restoration

HeliosDB-Specific Recovery Features

Branch-Based Recovery

HeliosDB's unique branching model enables instant recovery:

-- Create recovery point (automatic or manual)
CREATE BRANCH recovery_point_20260124;

-- Restore to recovery point
CHECKOUT BRANCH 'recovery_point_20260124';

-- Or use time-travel for precise recovery
SELECT * FROM orders AS OF '2026-01-24T12:00:00Z';

-- Create new branch from point-in-time
CREATE BRANCH recovered_data AS OF '2026-01-24T12:00:00Z';

Time-Travel Recovery

-- View data at specific point in time
SELECT * FROM users AS OF '2026-01-24T11:59:00Z';

-- Compare before/after corruption
SELECT
    before.id, before.balance AS old_balance, after.balance AS new_balance
FROM users AS OF '2026-01-24T11:59:00Z' before
JOIN users after ON before.id = after.id
WHERE before.balance != after.balance;

-- Restore specific rows (assumes the rows were deleted;
-- remove any surviving corrupted copies first to avoid duplicates)
INSERT INTO users
SELECT * FROM users AS OF '2026-01-24T11:59:00Z'
WHERE id IN (SELECT id FROM corrupted_ids);

WAL-Based Recovery

# Continuous WAL archiving
heliosdb-wal archive \
    --source /var/lib/heliosdb/wal \
    --destination s3://backup-bucket/wal

# Point-in-time recovery using WAL
heliosdb-restore \
    --base-backup /backups/base_20260123 \
    --wal-archive s3://backup-bucket/wal \
    --target-time "2026-01-24T12:00:00Z"

Backup Strategy

Backup Types

Type            Frequency  Retention  Method
Continuous WAL  Real-time  7 days     Streaming to S3
Incremental     Hourly     7 days     Branch snapshot
Full backup     Daily      30 days    Complete export
Archive         Monthly    1 year     Compressed archive
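The retention column above implies a pruning policy: a backup becomes eligible for deletion once it outlives the window for its type. A minimal sketch of that rule (the type names and helper are illustrative):

```python
from datetime import datetime, timedelta

# Retention windows from the table above.
RETENTION = {
    "wal": timedelta(days=7),
    "incremental": timedelta(days=7),
    "full": timedelta(days=30),
    "archive": timedelta(days=365),
}

def expired(backup_type: str, created: datetime, now: datetime) -> bool:
    """A backup is prunable once it is older than its retention window."""
    return now - created > RETENTION[backup_type]

now = datetime(2026, 1, 24)
print(expired("full", datetime(2025, 12, 1), now))     # True: 54 days > 30
print(expired("archive", datetime(2025, 12, 1), now))  # False: within 1 year
```

Whatever tool enforces this, pruning should run only after the daily verification job confirms a newer backup restores cleanly.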

Backup Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     Backup Architecture                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Primary ──WAL──▶ WAL Archive (S3) ──▶ DR Region              │
│     │                                                          │
│     ├──hourly──▶ Incremental Backup ──▶ Hot Storage           │
│     │                                                          │
│     ├──daily───▶ Full Backup ────────▶ Warm Storage           │
│     │                                                          │
│     └──monthly─▶ Archive ────────────▶ Cold Storage (Glacier) │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Backup Verification

# Automated backup verification (daily)
heliosdb-backup verify \
    --backup /backups/latest \
    --checksum \
    --test-restore

# Monthly restore drill
heliosdb-restore \
    --backup /backups/monthly_20260101 \
    --target /tmp/restore_test \
    --verify-queries /tests/verification_queries.sql

Recovery Procedures

Procedure 1: Single Node Recovery

Trigger: Node unresponsive, health check failing

Steps:
  1. Verify the node is truly failed (not a network issue)
  2. Automatic failover to replica (< 30 seconds)
  3. Promote the replica to primary
  4. Spin up a new replica from backup
  5. Verify replication is synchronized

# Manual failover if automatic fails
heliosdb-ha failover --force --promote replica-2

# Verify new primary
heliosdb-cli status --cluster

# Rebuild failed node
heliosdb-node rebuild \
    --from-backup latest \
    --join-cluster production

Procedure 2: Database Corruption Recovery

Trigger: Data integrity check failure, application errors

Steps:
  1. Stop writes to prevent further damage
  2. Identify the corruption scope
  3. Use time-travel to find a clean point
  4. Restore from the clean point
  5. Replay valid transactions

-- Step 1: Stop writes
ALTER SYSTEM SET default_transaction_read_only = on;

-- Step 2: Identify corruption
SELECT heliosdb_check_integrity('public.orders');

-- Step 3: Find clean point
SELECT timestamp
FROM heliosdb_branch_history
WHERE integrity_check = 'passed'
ORDER BY timestamp DESC LIMIT 1;

-- Step 4: Restore (creates new branch)
CREATE BRANCH recovery FROM 'main' AS OF '2026-01-24T11:00:00Z';
CHECKOUT BRANCH 'recovery';

-- Step 5: Merge clean branch back
-- (After verification)
MERGE BRANCH 'recovery' INTO 'main';

Procedure 3: Complete Site Failover

Trigger: Primary data center unavailable

Steps:
  1. Confirm primary site failure
  2. Activate DR site
  3. Update DNS/load balancer
  4. Verify DR site functionality
  5. Notify stakeholders

# Step 1: Verify primary failure
heliosdb-monitor check-site primary --timeout 60

# Step 2: Activate DR
heliosdb-dr activate --site dr-west --confirm

# Step 3: Update routing
heliosdb-dns update --record db.example.com --target dr-west.example.com

# Step 4: Verify
heliosdb-cli --host dr-west.example.com status

# Step 5: Send notifications
heliosdb-notify send --template dr-activation --recipients ops-team

Procedure 4: Ransomware Recovery

Trigger: Ransomware detection, encrypted files

Steps:
  1. Isolate affected systems immediately
  2. Preserve evidence for investigation
  3. Identify a clean backup (before infection)
  4. Rebuild from the clean backup
  5. Restore data, verify integrity
  6. Strengthen security controls

# Step 1: Network isolation
# WARNING: these rules drop ALL traffic, including your own SSH session.
# Run them from a console or out-of-band management interface.
iptables -A INPUT -j DROP
iptables -A OUTPUT -j DROP

# Step 2: Evidence preservation
heliosdb-forensics capture --full-system --output /secure/evidence

# Step 3: Identify clean backup
heliosdb-backup list --before "2026-01-20" --verify-clean

# Step 4: Clean rebuild
# (On isolated network)
heliosdb-restore --backup /secure/clean_backup_20260119 --new-server

# Step 5: Verify and reconnect
# (After security review)

DR Site Configuration

Active-Passive Setup

# Primary site configuration
[replication]
mode = "streaming"
primary = true
archive_command = "heliosdb-wal archive %p s3://backup/wal/%f"

[replication.standby]
host = "dr-west.internal"
port = 5432
sync_mode = "async"  # or "sync" for zero RPO
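The `sync_mode` choice directly sets the achievable RPO: with synchronous replication every commit is already on the standby when the primary fails, while with async the worst-case data loss equals the replication lag at the moment of failure. A small sketch of that relationship (function name is illustrative):

```python
from datetime import timedelta

def worst_case_rpo(sync_mode: str, replication_lag: timedelta) -> timedelta:
    """Estimate the data-loss window if the primary fails right now.

    Synchronous replication gives a zero window; asynchronous replication
    is bounded by the current standby lag.
    """
    return timedelta(0) if sync_mode == "sync" else replication_lag

print(worst_case_rpo("async", timedelta(seconds=45)))  # 0:00:45
print(worst_case_rpo("sync", timedelta(seconds=45)))   # 0:00:00
```

This is why the replication-lag alerts below fire well under the one-minute RPO target: lag is the RPO while running async.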

Active-Active Setup (Multi-Region)

# Multi-region configuration
[cluster]
name = "production"
mode = "multi-primary"

[[cluster.nodes]]
name = "us-east"
host = "db-east.example.com"
region = "us-east-1"
priority = 1

[[cluster.nodes]]
name = "us-west"
host = "db-west.example.com"
region = "us-west-2"
priority = 2

[cluster.conflict_resolution]
strategy = "last-write-wins"
vector_clock = true
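With `last-write-wins` plus vector clocks, the usual semantics are: if one version causally dominates the other (its clock is ≥ on every node and > on at least one), it wins outright; wall-clock last-write-wins is only the tie-break for truly concurrent updates. A minimal sketch of that rule, assuming versions carry a `vc` map and a commit timestamp `ts` (these field names are illustrative, not HeliosDB's wire format):

```python
from datetime import datetime

def dominates(vc_a: dict, vc_b: dict) -> bool:
    """vc_a causally follows vc_b: >= on every node, > on at least one."""
    nodes = set(vc_a) | set(vc_b)
    at_least = all(vc_a.get(n, 0) >= vc_b.get(n, 0) for n in nodes)
    strictly = any(vc_a.get(n, 0) > vc_b.get(n, 0) for n in nodes)
    return at_least and strictly

def resolve(a: dict, b: dict) -> dict:
    """Prefer the causally newer version; fall back to last-write-wins
    on wall-clock time only for concurrent updates."""
    if dominates(a["vc"], b["vc"]):
        return a
    if dominates(b["vc"], a["vc"]):
        return b
    return a if a["ts"] >= b["ts"] else b  # concurrent: LWW tie-break

east = {"value": "alice@new.example", "vc": {"us-east": 2, "us-west": 1},
        "ts": datetime(2026, 1, 24, 12, 0, 1)}
west = {"value": "alice@old.example", "vc": {"us-east": 1, "us-west": 1},
        "ts": datetime(2026, 1, 24, 12, 0, 5)}
print(resolve(east, west)["value"])  # east wins: it causally dominates
```

Note that the causally dominant version wins here even though its wall-clock timestamp is older; relying on timestamps alone would silently lose the east write.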

Testing Schedule

Test Type            Frequency  Duration   Scope
Backup verification  Daily      Automated  All backups
Component failover   Weekly     15 min     Individual nodes
Site failover        Monthly    2 hours    Full DR drill
Full DR simulation   Quarterly  4 hours    Complete scenario

Test Checklist

  • [ ] Backup integrity verified
  • [ ] Recovery scripts executed successfully
  • [ ] RTO/RPO targets met
  • [ ] All applications reconnected
  • [ ] Data integrity confirmed
  • [ ] Performance acceptable
  • [ ] Documentation updated

Monitoring & Alerting

DR Health Checks

# Monitoring configuration
alerts:
  - name: replication_lag
    condition: lag > 60s
    severity: warning

  - name: replication_lag_critical
    condition: lag > 300s
    severity: critical

  - name: backup_age
    condition: last_backup > 24h
    severity: critical

  - name: dr_site_health
    condition: dr_site_unreachable
    severity: critical
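The rules above are simple threshold checks, so the evaluation logic can be sketched directly (the metric names and evaluator are illustrative, not the monitoring system's actual API). Rules are ordered most-severe first so the critical lag alert shadows the warning:

```python
from datetime import timedelta

# Thresholds mirroring the alert rules above.
ALERTS = [
    ("replication_lag_critical",
     lambda m: m["lag"] > timedelta(seconds=300), "critical"),
    ("replication_lag",
     lambda m: m["lag"] > timedelta(seconds=60), "warning"),
    ("backup_age",
     lambda m: m["last_backup_age"] > timedelta(hours=24), "critical"),
    ("dr_site_health",
     lambda m: not m["dr_site_reachable"], "critical"),
]

def fire(metrics: dict) -> list:
    """Return (name, severity) for every rule whose condition holds."""
    return [(name, sev) for name, cond, sev in ALERTS if cond(metrics)]

metrics = {"lag": timedelta(seconds=90),
           "last_backup_age": timedelta(hours=2),
           "dr_site_reachable": True}
print(fire(metrics))  # [('replication_lag', 'warning')]
```

A 90-second lag trips only the warning; past five minutes the critical rule fires as well and should page L1 immediately.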

Contacts

Escalation Path

Level  Contact             When
L1     On-call engineer    Initial response
L2     Database team lead  Unresolved after 15 min
L3     VP Engineering      Site-wide failure
L4     Executive team      Customer impact > 1 hour

Emergency Contacts

  • Primary On-Call: [PagerDuty rotation]
  • Database Team: db-team@heliosdb.io
  • Emergency Line: +1-xxx-xxx-xxxx