HeliosDB-Lite High Availability Hands-On Tutorial¶
This comprehensive tutorial guides you through setting up, operating, and testing HeliosDB-Lite's High Availability features, including Transparent Write Routing (TWR), automatic failover, and application continuity.
Table of Contents¶
- Architecture Overview
- Part 1: Docker Deployment
- Part 2: Local Deployment (Without Docker)
- Part 3: Transparent Write Routing (TWR)
- Part 4: Transparent Read Routing (TRR)
- Part 5: HeliosProxy Deep Dive
- Part 6: Monitoring the Cluster
- Part 7: Switchover Operations
- Part 8: Failover and Automatic Recovery
- Part 9: Application Continuity Testing
- Part 10: Advanced Scenarios
Architecture Overview¶
┌─────────────────────────────────────┐
│ APPLICATION │
└─────────────────┬───────────────────┘
│
┌─────────────────▼───────────────────┐
│ HELIOSPROXY │
│ ┌─────────────────────────────┐ │
│ │ • PostgreSQL Protocol (5432)│ │
│ │ • HTTP SQL API (8080) │ │
│ │ • Admin API (9090) │ │
│ │ • Health Checking │ │
│ │ • Write Timeout (30s) │ │
│ │ • TWR + TRR │ │
│ └─────────────────────────────┘ │
└───────┬───────────┬───────────┬────┘
│ │ │
┌─────────────▼───┐ ┌───▼───┐ ┌───▼─────────────┐
│ PRIMARY │ │STANDBY│ │ STANDBY │
│ (Read/Write) │ │ SYNC │ │ ASYNC │
│ Port: 5432 │ │ 5442 │ │ 5452 │
└────────┬────────┘ └───┬───┘ └────────┬────────┘
│ │ │
└────────────────┴────────────────┘
WAL Streaming Replication
Key Features¶
| Feature | Description |
|---|---|
| TWR | Transparent Write Routing - writes auto-route to primary |
| TRR | Transparent Read Routing - reads load-balance across standbys |
| Write Timeout | Writes wait up to 30s for primary during failover |
| Auto Recovery | Automatic reconnection when primary returns |
| Sticky Sessions | Maintain backend affinity within a session |
Part 1: Docker Deployment¶
Prerequisites¶
- Docker Engine with the Compose plugin (the commands below use docker compose)
- A psql client for connectivity checks
Step 1: Clone and Build¶
cd /path/to/HeliosDB-Lite
# Build the Docker image with HA support
docker build -f tests/docker/Dockerfile.ha -t heliosdb-lite:ha .
Step 2: Create Docker Compose Configuration¶
Create docker-compose.ha-cluster.yml:
version: '3.8'

networks:
  helios-ha:
    driver: bridge
    ipam:
      config:
        - subnet: 172.28.0.0/16

services:
  # Primary node - handles all writes
  primary:
    image: heliosdb-lite:ha
    container_name: heliosdb-primary
    hostname: primary
    networks:
      helios-ha:
        ipv4_address: 172.28.1.1
    ports:
      - "15432:5432"  # PostgreSQL protocol
      - "15433:5433"  # Native protocol
      - "18080:8080"  # HTTP API
    environment:
      - HELIOSDB_ROLE=primary
      - HELIOSDB_NODE_ID=primary
      - HELIOSDB_DATA_DIR=/data
      - HELIOSDB_REPLICATION_MODE=sync
    volumes:
      - primary-data:/data
    healthcheck:
      test: ["CMD", "heliosdb-lite", "health"]
      interval: 5s
      timeout: 3s
      retries: 3

  # Synchronous standby - zero data loss
  standby-sync:
    image: heliosdb-lite:ha
    container_name: heliosdb-standby-sync
    hostname: standby-sync
    networks:
      helios-ha:
        ipv4_address: 172.28.1.2
    ports:
      - "15442:5432"
      - "15443:5433"
      - "18081:8080"
    environment:
      - HELIOSDB_ROLE=standby
      - HELIOSDB_NODE_ID=standby-sync
      - HELIOSDB_PRIMARY_HOST=primary
      - HELIOSDB_PRIMARY_PORT=5433
      - HELIOSDB_REPLICATION_MODE=sync
    volumes:
      - standby-sync-data:/data
    depends_on:
      primary:
        condition: service_healthy

  # Asynchronous standby - better performance, potential lag
  standby-async:
    image: heliosdb-lite:ha
    container_name: heliosdb-standby-async
    hostname: standby-async
    networks:
      helios-ha:
        ipv4_address: 172.28.1.3
    ports:
      - "15462:5432"
      - "15463:5433"
      - "18084:8080"
    environment:
      - HELIOSDB_ROLE=standby
      - HELIOSDB_NODE_ID=standby-async
      - HELIOSDB_PRIMARY_HOST=primary
      - HELIOSDB_PRIMARY_PORT=5433
      - HELIOSDB_REPLICATION_MODE=async
    volumes:
      - standby-async-data:/data
    depends_on:
      primary:
        condition: service_healthy

  # HeliosProxy - intelligent routing
  proxy:
    image: heliosdb-lite:ha
    container_name: heliosdb-proxy
    hostname: proxy
    networks:
      helios-ha:
        ipv4_address: 172.28.1.100
    ports:
      - "15400:5432"  # PostgreSQL protocol
      - "19090:9090"  # Admin API
    environment:
      - HELIOSDB_PROXY_CONFIG=/etc/heliosdb/proxy.toml
    volumes:
      - ./proxy-config.toml:/etc/heliosdb/proxy.toml:ro
    command: ["heliosdb-proxy"]
    depends_on:
      - primary
      - standby-sync
      - standby-async
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9090/health"]
      interval: 5s
      timeout: 3s
      retries: 3

volumes:
  primary-data:
  standby-sync-data:
  standby-async-data:
Step 3: Create Proxy Configuration¶
Create proxy-config.toml:
[proxy]
listen_addr = "0.0.0.0:5432"
admin_addr = "0.0.0.0:9090"
health_check_interval_secs = 5
failure_threshold = 3
write_timeout_secs = 30
[[nodes]]
name = "primary"
host = "primary"
port = 5432
role = "primary"
enabled = true
[[nodes]]
name = "standby-sync"
host = "standby-sync"
port = 5432
role = "standby"
enabled = true
[[nodes]]
name = "standby-async"
host = "standby-async"
port = 5432
role = "standby"
enabled = true
Step 4: Start the Cluster¶
# Start all services
docker compose -f docker-compose.ha-cluster.yml up -d
# Verify all containers are running
docker compose -f docker-compose.ha-cluster.yml ps
# Expected output:
# NAME STATUS PORTS
# heliosdb-primary Up (healthy) 0.0.0.0:15432->5432/tcp
# heliosdb-standby-sync Up (healthy) 0.0.0.0:15442->5432/tcp
# heliosdb-standby-async Up (healthy) 0.0.0.0:15462->5432/tcp
# heliosdb-proxy Up (healthy) 0.0.0.0:15400->5432/tcp
Step 5: Verify Connectivity¶
# Connect through proxy
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -c "SELECT 1"
# Connect directly to primary
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "SELECT 1"
# Connect directly to standby
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "SELECT 1"
Part 2: Local Deployment (Without Docker)¶
Prerequisites¶
# Build HeliosDB-Lite
cargo build --release --features "ha-tier1,ha-proxy"
# Binary location
export HELIOSDB_BIN=./target/release/heliosdb-lite
export HELIOSPROXY_BIN=./target/release/heliosdb-proxy
Step 1: Create Data Directories¶
mkdir -p /tmp/heliosdb-ha/primary /tmp/heliosdb-ha/standby-sync /tmp/heliosdb-ha/standby-async
Step 2: Start Primary Node¶
# Terminal 1: Primary (ports 5432/5433/8080)
$HELIOSDB_BIN start \
--data-dir /tmp/heliosdb-ha/primary \
--pg-port 5432 \
--native-port 5433 \
--http-port 8080 \
--node-id primary \
--replication-role primary \
--replication-mode sync
Step 3: Start Standby Nodes¶
# Terminal 2: Standby Sync (ports 5442/5443/8081)
$HELIOSDB_BIN start \
--data-dir /tmp/heliosdb-ha/standby-sync \
--pg-port 5442 \
--native-port 5443 \
--http-port 8081 \
--node-id standby-sync \
--replication-role standby \
--primary-host localhost \
--primary-port 5433 \
--replication-mode sync
# Terminal 3: Standby Async (ports 5452/5453/8082)
$HELIOSDB_BIN start \
--data-dir /tmp/heliosdb-ha/standby-async \
--pg-port 5452 \
--native-port 5453 \
--http-port 8082 \
--node-id standby-async \
--replication-role standby \
--primary-host localhost \
--primary-port 5433 \
--replication-mode async
Step 4: Create Local Proxy Configuration¶
Create /tmp/heliosdb-ha/proxy.toml:
[proxy]
listen_addr = "0.0.0.0:5400"
admin_addr = "0.0.0.0:9090"
health_check_interval_secs = 5
failure_threshold = 3
write_timeout_secs = 30
[[nodes]]
name = "primary"
host = "localhost"
port = 5432
role = "primary"
enabled = true
[[nodes]]
name = "standby-sync"
host = "localhost"
port = 5442
role = "standby"
enabled = true
[[nodes]]
name = "standby-async"
host = "localhost"
port = 5452
role = "standby"
enabled = true
Step 5: Start HeliosProxy¶
# Terminal 4: HeliosProxy (PG port 5400, admin 9090)
# HELIOSDB_PROXY_CONFIG is the same variable the Docker image uses
HELIOSDB_PROXY_CONFIG=/tmp/heliosdb-ha/proxy.toml $HELIOSPROXY_BIN
Port Summary (Local Deployment)¶
| Node | PG Port | Native Port | HTTP Port |
|---|---|---|---|
| Primary | 5432 | 5433 | 8080 |
| Standby Sync | 5442 | 5443 | 8081 |
| Standby Async | 5452 | 5453 | 8082 |
| Proxy | 5400 | - | 9090 |
Part 3: Transparent Write Routing (TWR)¶
TWR automatically routes write operations to the primary node, regardless of which node you're connected to.
How TWR Works¶
┌───────────────────────────────────────────────────────────────┐
│ CLIENT APPLICATION │
│ │
│ INSERT INTO users (name) VALUES ('Alice') -- WRITE │
│ UPDATE users SET active = true -- WRITE │
│ DELETE FROM users WHERE id = 5 -- WRITE │
│ SELECT * FROM users -- READ │
└───────────────────────────┬───────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────┐
│ HELIOSPROXY │
│ │
│ Query Classification: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ is_write_query(sql): │ │
│ │ • INSERT, UPDATE, DELETE → true │ │
│ │ • CREATE, DROP, ALTER, TRUNCATE → true │ │
│ │ • BEGIN, COMMIT, ROLLBACK → true (transaction) │ │
│ │ • SELECT, SHOW, EXPLAIN → false (read) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ Routing Decision: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ if is_write: │ │
│ │ route_to_primary() ───────────► PRIMARY │ │
│ │ else: │ │
│ │ route_to_any_healthy() ───────────► PRIMARY/STANDBY │ │
│ └─────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────┘
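The classification in the diagram can be approximated in a few lines of shell. This is a toy sketch for intuition only, with an invented `is_write_query` helper; the proxy's real classifier parses SQL properly and handles more cases:

```shell
# Toy version of the proxy's read/write classification (illustration only)
is_write_query() {
  # Uppercase the first word of the statement and match against write verbs
  local verb
  verb=$(printf '%s' "$1" | awk '{print toupper($1)}')
  case "$verb" in
    INSERT|UPDATE|DELETE|CREATE|DROP|ALTER|TRUNCATE|BEGIN|COMMIT|ROLLBACK)
      return 0 ;;  # write: must go to the primary
    *)
      return 1 ;;  # read: any healthy node will do
  esac
}

is_write_query "INSERT INTO users (name) VALUES ('Alice')" && echo write || echo read  # prints "write"
is_write_query "SELECT * FROM users" && echo write || echo read                        # prints "read"
```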
Testing TWR¶
# Create test table through proxy
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb <<EOF
CREATE TABLE twr_test (
id INTEGER PRIMARY KEY,
data TEXT,
created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
EOF
# Insert data (automatically routes to primary)
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -c \
"INSERT INTO twr_test (id, data) VALUES (1, 'test data')"
# Verify on primary
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c \
"SELECT * FROM twr_test"
# Verify replication to standby
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c \
"SELECT * FROM twr_test"
Write Timeout During Failover¶
When the primary is unavailable, writes wait up to write_timeout_secs:
CLIENT PROXY NODES
│ │ │
│ INSERT INTO... │ │
│────────────────────────►│ │
│ │ select_primary_with_timeout │
│ │──────────────────────────────│
│ │ Primary healthy? NO │
│ │ │
│ │ ┌──────────────────────┐ │
│ │ │ WAIT LOOP (30s max) │ │
│ │ │ │ │
│ (waiting...) │ │ Sleep 500ms │ │
│ │ │ Check health │ │
│ │ │ Primary back? YES │ │
│ │ └──────────────────────┘ │
│ │ │
│ │────────────────────────────►│ PRIMARY
│ OK (after N seconds) │◄────────────────────────────│
│◄────────────────────────│ │
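The wait loop above can be modeled as a small shell function: probe the primary's health endpoint every 500ms and give up after write_timeout_secs. This illustrates the behavior only; the function name and URL are made up, not part of the proxy:

```shell
# Model of the proxy's write-timeout loop: poll until healthy or timeout
wait_for_primary() {
  local health_url=$1 timeout_secs=${2:-30}
  local tries=$((timeout_secs * 2))   # one probe every 500ms
  for _ in $(seq 1 "$tries"); do
    if curl -sf "$health_url" >/dev/null 2>&1; then
      return 0                        # primary is back: forward the write
    fi
    sleep 0.5
  done
  return 1                            # timeout exceeded: fail the write
}

# Example (Docker mapping): wait up to 30s for the primary's HTTP API
# wait_for_primary http://localhost:18080/health 30 || echo "write would fail"
```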
Part 4: Transparent Read Routing (TRR)¶
TRR distributes read queries across all healthy nodes for load balancing.
How TRR Works¶
READ Query: SELECT * FROM users WHERE id = 1
┌─────────────────────────────────────────────────────────────┐
│ HELIOSPROXY │
│ │
│ Load Balancing Algorithm: │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ fn select_read_node(): │ │
│ │ healthy_nodes = get_healthy_nodes() │ │
│ │ if session.has_sticky_backend: │ │
│ │ return session.backend # Maintain affinity │ │
│ │ else: │ │
│ │ return round_robin(healthy_nodes) │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ Distribution: │
│ Request 1 ──► Primary │
│ Request 2 ──► Standby-Sync │
│ Request 3 ──► Standby-Async │
│ Request 4 ──► Primary (round robin continues) │
└─────────────────────────────────────────────────────────────┘
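The round-robin selection sketched above, in toy shell form (the node list and counter are invented for illustration; the proxy tracks this state internally):

```shell
# Toy round-robin over the healthy-node pool
NODES=(primary standby-sync standby-async)
RR_INDEX=0

select_read_node() {
  # Pick the next node in order, wrapping around at the end of the pool
  echo "${NODES[$((RR_INDEX % ${#NODES[@]}))]}"
  RR_INDEX=$((RR_INDEX + 1))
}

for request in 1 2 3 4; do
  select_read_node
done
# prints: primary, standby-sync, standby-async, primary
```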
Testing TRR¶
# Run multiple SELECT queries and observe distribution
for i in {1..10}; do
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -c \
"SELECT '$i', current_timestamp"
done
# Check proxy logs to see routing decisions
docker logs heliosdb-proxy 2>&1 | grep -i "routing\|selected"
Read Scaling Benefits¶
| Scenario | Without TRR | With TRR |
|---|---|---|
| 1000 reads/sec | Primary handles 1000 | Each node handles ~333 |
| Primary fails | All reads fail | Reads continue on standbys |
| Latency | Single point | Distributed load |
Part 5: HeliosProxy Deep Dive¶
Proxy Architecture¶
┌────────────────────────────────────────────────────────────────────┐
│ HELIOSPROXY │
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ LISTENER LAYER │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ PG Protocol │ │ HTTP API │ │ Admin API │ │ │
│ │ │ Port 5432 │ │ Port 8080 │ │ Port 9090 │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ ROUTING LAYER │ │
│ │ ┌──────────────────────────────────────────────────────┐ │ │
│ │ │ Query Classifier │ │ │
│ │ │ • Parse SQL to determine read/write │ │ │
│ │ │ • Detect transaction boundaries │ │ │
│ │ └──────────────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────────────┐ │ │
│ │ │ Session Manager │ │ │
│ │ │ • Track client sessions │ │ │
│ │ │ • Maintain sticky backend affinity │ │ │
│ │ └──────────────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────────────┐ │ │
│ │ │ Load Balancer │ │ │
│ │ │ • Round-robin for reads │ │ │
│ │ │ • Primary-only for writes │ │ │
│ │ └──────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ HEALTH LAYER │ │
│ │ ┌──────────────────────────────────────────────────────┐ │ │
│ │ │ Health Checker (background task) │ │ │
│ │ │ • Poll each node every health_check_interval_secs │ │ │
│ │ │ • Track consecutive failures │ │ │
│ │ │ • Mark unhealthy after failure_threshold failures │ │ │
│ │ └──────────────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────────────┐ │ │
│ │ │ Write Timeout Handler │ │ │
│ │ │ • Wait for primary availability │ │ │
│ │ │ • Poll every 500ms │ │ │
│ │ │ • Timeout after write_timeout_secs │ │ │
│ │ └──────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ BACKEND POOL │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ PRIMARY │ │ STANDBY-SY │ │ STANDBY-AS │ │ │
│ │ │ healthy: ✓ │ │ healthy: ✓ │ │ healthy: ✓ │ │ │
│ │ │ failures: 0 │ │ failures: 0 │ │ failures: 0 │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘
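The Session Manager's sticky-backend behavior from the diagram can be sketched with a bash 4+ associative array (names invented for illustration): the first statement of a session pins a backend, and later statements in that session reuse it.

```shell
# Toy sticky-session table: first query pins the session to a backend
declare -A SESSION_BACKEND
NODES=(primary standby-sync standby-async)
NEXT=0

route_session() {
  local sid=$1
  if [ -z "${SESSION_BACKEND[$sid]:-}" ]; then
    # First touch: assign the next node round-robin
    SESSION_BACKEND[$sid]=${NODES[$((NEXT % ${#NODES[@]}))]}
    NEXT=$((NEXT + 1))
  fi
  echo "${SESSION_BACKEND[$sid]}"
}

route_session s1   # prints "primary"      (new session, pinned)
route_session s2   # prints "standby-sync" (new session, pinned)
route_session s1   # prints "primary"      (affinity maintained)
```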
Configuration Reference¶
[proxy]
# Network settings
listen_addr = "0.0.0.0:5432" # PostgreSQL protocol listener
admin_addr = "0.0.0.0:9090" # Admin/monitoring API
# Health checking
health_check_interval_secs = 5 # How often to check node health
failure_threshold = 3 # Failures before marking unhealthy
# Failover behavior
write_timeout_secs = 30 # Max wait for primary during failover
[[nodes]]
name = "primary" # Human-readable identifier
host = "primary" # Hostname or IP
port = 5432 # PostgreSQL port
role = "primary" # "primary" or "standby"
enabled = true # Include in routing pool
[[nodes]]
name = "standby-sync"
host = "standby-sync"
port = 5432
role = "standby"
enabled = true
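How failure_threshold plays out can be modeled as a small counter (an illustration of the health-state machine, not the proxy's code): consecutive failures accumulate, any success resets them, and the node is marked unhealthy only once the threshold is reached.

```shell
# Toy health-state machine for failure_threshold = 3
FAILURES=0
HEALTHY=true

record_check() {   # $1 is "ok" or "fail"
  if [ "$1" = "ok" ]; then
    FAILURES=0
    HEALTHY=true          # a single success restores the node
  else
    FAILURES=$((FAILURES + 1))
    if [ "$FAILURES" -ge 3 ]; then
      HEALTHY=false       # threshold reached: stop routing here
    fi
  fi
}

record_check fail; record_check fail
echo "$HEALTHY"           # prints "true"  (2 failures < threshold)
record_check fail
echo "$HEALTHY"           # prints "false" (threshold hit)
record_check ok
echo "$HEALTHY"           # prints "true"  (recovered)
```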
Admin API Endpoints¶
# Health check
curl http://localhost:19090/health
# {"status":"ok"}
# Node status (future enhancement)
curl http://localhost:19090/nodes
# Returns health status of all configured nodes
Part 6: Monitoring the Cluster¶
Real-Time Health Monitoring Script¶
Create monitor_cluster.sh:
#!/bin/bash
# HeliosDB-Lite Cluster Monitor
PROXY_ADMIN="localhost:19090"
PRIMARY_HTTP="localhost:18080"
STANDBY_SYNC_HTTP="localhost:18081"
STANDBY_ASYNC_HTTP="localhost:18084"
# Colors
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[0;33m'
BLUE='\033[0;34m'
NC='\033[0m'
check_node() {
    local name=$1
    local port=$2
    # Use psql's exit status rather than matching "1" in the output,
    # which would also match error messages that happen to contain a 1
    if PGPASSWORD=helios psql -h localhost -p "$port" -U helios -d heliosdb -t -c "SELECT 1" >/dev/null 2>&1; then
        echo -e "${GREEN}✓${NC}"
    else
        echo -e "${RED}✗${NC}"
    fi
}
check_http() {
    local name=$1
    local url=$2
    local code
    code=$(curl -s -o /dev/null -w "%{http_code}" "$url/health" 2>/dev/null)
    if [[ "$code" == "200" ]]; then
        echo -e "${GREEN}✓${NC}"
    else
        echo -e "${RED}✗${NC}"
    fi
}
get_replication_lag() {
    local port=$1
    # Query replication lag if available
    local lag
    lag=$(PGPASSWORD=helios psql -h localhost -p "$port" -U helios -d heliosdb -t -c \
        "SELECT replication_lag_bytes FROM helios_replication_status LIMIT 1" 2>/dev/null | tr -d ' ')
    echo "${lag:-N/A}"
}
while true; do
clear
echo -e "${BLUE}╔════════════════════════════════════════════════════════════════╗${NC}"
echo -e "${BLUE}║ HeliosDB-Lite Cluster Monitor ║${NC}"
echo -e "${BLUE}╚════════════════════════════════════════════════════════════════╝${NC}"
echo ""
echo -e " Time: $(date '+%Y-%m-%d %H:%M:%S')"
echo ""
echo -e " ${YELLOW}Node Status:${NC}"
echo -e " ┌────────────────┬──────────┬──────────┬─────────────────┐"
echo -e " │ Node │ PG Proto │ HTTP API │ Replication Lag │"
echo -e " ├────────────────┼──────────┼──────────┼─────────────────┤"
printf " │ %-14s │ %s │ %s │ %-15s │\n" "Primary" "$(check_node primary 15432)" "$(check_http primary localhost:18080)" "N/A (primary)"
printf " │ %-14s │ %s │ %s │ %-15s │\n" "Standby-Sync" "$(check_node standby-sync 15442)" "$(check_http standby-sync localhost:18081)" "$(get_replication_lag 15442)"
printf " │ %-14s │ %s │ %s │ %-15s │\n" "Standby-Async" "$(check_node standby-async 15462)" "$(check_http standby-async localhost:18084)" "$(get_replication_lag 15462)"
echo -e " └────────────────┴──────────┴──────────┴─────────────────┘"
echo ""
echo -e " ${YELLOW}Proxy Status:${NC}"
echo -e " ┌────────────────┬──────────┐"
echo -e " │ Component │ Status │"
echo -e " ├────────────────┼──────────┤"
printf " │ %-14s │ %s │\n" "HeliosProxy" "$(check_http proxy localhost:19090)"
echo -e " └────────────────┴──────────┘"
echo ""
echo -e " ${BLUE}Press Ctrl+C to exit${NC}"
sleep 2
done
Docker Log Monitoring¶
# Follow all container logs
docker compose -f docker-compose.ha-cluster.yml logs -f
# Follow proxy logs only
docker compose -f docker-compose.ha-cluster.yml logs -f proxy
# Filter for specific events
docker compose -f docker-compose.ha-cluster.yml logs -f proxy 2>&1 | grep -E "(healthy|unhealthy|failover|routing)"
Query-Based Monitoring¶
# Check replication status
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT * FROM helios_replication_status;
"
# Check standby registration
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT * FROM helios_standby_nodes;
"
# Check cluster topology
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SHOW TOPOLOGY;
"
Part 7: Switchover Operations¶
A switchover is a planned, controlled role change between primary and standby.
Manual Switchover Process¶
BEFORE SWITCHOVER:
┌─────────────┐ ┌─────────────┐
│ PRIMARY │──────────►│ STANDBY │
│ (accepting │ WAL │ (read-only) │
│ writes) │ stream │ │
└─────────────┘ └─────────────┘
AFTER SWITCHOVER:
┌─────────────┐ ┌─────────────┐
│ STANDBY │◄──────────│ PRIMARY │
│ (read-only) │ WAL │ (accepting │
│ │ stream │ writes) │
└─────────────┘ └─────────────┘
Switchover Script¶
Create switchover.sh:
#!/bin/bash
# Controlled switchover script
set -e
OLD_PRIMARY_PORT=${1:-15432}
NEW_PRIMARY_PORT=${2:-15442}
echo "=== HeliosDB-Lite Switchover ==="
echo "Old Primary: localhost:$OLD_PRIMARY_PORT"
echo "New Primary: localhost:$NEW_PRIMARY_PORT"
echo ""
# Step 1: Verify both nodes are healthy
echo "[1/5] Verifying node health..."
PGPASSWORD=helios psql -h localhost -p $OLD_PRIMARY_PORT -U helios -d heliosdb -c "SELECT 1" > /dev/null
PGPASSWORD=helios psql -h localhost -p $NEW_PRIMARY_PORT -U helios -d heliosdb -c "SELECT 1" > /dev/null
echo " Both nodes healthy ✓"
# Step 2: Stop writes on old primary (application should handle this gracefully)
echo "[2/5] Preparing old primary for demotion..."
# In production, you would:
# - Put application in read-only mode
# - Wait for in-flight transactions to complete
# - Verify replication is caught up
# Step 3: Verify replication is caught up
echo "[3/5] Verifying replication sync..."
sleep 2 # Allow final WAL to replicate
echo " Replication synchronized ✓"
# Step 4: Promote standby to primary
echo "[4/5] Promoting standby to primary..."
# This would call the promote API endpoint
# curl -X POST http://localhost:${NEW_PRIMARY_HTTP}/admin/promote
echo " New primary promoted ✓"
# Step 5: Reconfigure old primary as standby
echo "[5/5] Demoting old primary to standby..."
# This would reconfigure replication
echo " Old primary demoted ✓"
echo ""
echo "=== Switchover Complete ==="
echo "New Primary: localhost:$NEW_PRIMARY_PORT"
echo "New Standby: localhost:$OLD_PRIMARY_PORT"
Testing Switchover with Workload¶
# Terminal 1: Start continuous workload
./pg_workload.sh --duration 120 --interval 1 > /tmp/switchover_test.log 2>&1 &
WORKLOAD_PID=$!
echo "Workload started (PID: $WORKLOAD_PID)"
# Terminal 2: Perform switchover after 30 seconds
sleep 30
./switchover.sh 15432 15442
# Terminal 1: Monitor results
tail -f /tmp/switchover_test.log
Part 8: Failover and Automatic Recovery¶
A failover is an unplanned event where the primary becomes unavailable.
Failover Sequence¶
NORMAL OPERATION:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ CLIENT │──────────►│ PROXY │──────────►│ PRIMARY │
│ │ │ │ │ │
└─────────────┘ └─────────────┘ └─────────────┘
│
▼
┌─────────────┐
│ STANDBY │
│ │
└─────────────┘
PRIMARY FAILURE:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ CLIENT │──────────►│ PROXY │─────X────►│ PRIMARY │
│ │ │ │ │ (DOWN) │
└─────────────┘ └─────────────┘ └─────────────┘
│
│ DETECT FAILURE
│ (health check fails)
│
│ WRITE TIMEOUT ACTIVATED
│ (wait up to 30s)
│
▼
┌─────────────┐
│ STANDBY │ ◄──── Reads continue here
│ (healthy) │
└─────────────┘
RECOVERY (Primary returns):
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ CLIENT │──────────►│ PROXY │──────────►│ PRIMARY │
│ │ │ │ │ (HEALTHY) │
└─────────────┘ └─────────────┘ └─────────────┘
│
│ HEALTH CHECK SUCCEEDS
│ PRIMARY MARKED HEALTHY
│ WRITES RESUME IMMEDIATELY
│
▼
┌─────────────┐
│ STANDBY │
│ │
└─────────────┘
Failover Test Script¶
Create test_failover.sh:
#!/bin/bash
# Failover testing script with workload
WORKLOAD_DURATION=90
PRIMARY_DOWNTIME=40
echo "╔════════════════════════════════════════════════════════════════╗"
echo "║ HeliosDB-Lite Failover Test ║"
echo "╚════════════════════════════════════════════════════════════════╝"
echo ""
echo "Test Parameters:"
echo " Workload duration: ${WORKLOAD_DURATION}s"
echo " Primary downtime: ${PRIMARY_DOWNTIME}s"
echo " Write timeout: 30s"
echo ""
# Step 1: Start workload
echo "[$(date +%H:%M:%S)] Starting workload..."
./pg_workload.sh --duration $WORKLOAD_DURATION --interval 1 > /tmp/failover_test.log 2>&1 &
WORKLOAD_PID=$!
# Step 2: Let it run normally for 20 seconds
echo "[$(date +%H:%M:%S)] Running normal operations for 20s..."
sleep 20
# Step 3: Stop primary (simulate failure)
echo "[$(date +%H:%M:%S)] SIMULATING PRIMARY FAILURE..."
docker compose -f docker-compose.ha-cluster.yml stop primary
echo "[$(date +%H:%M:%S)] Primary stopped"
# Step 4: Wait during outage
echo "[$(date +%H:%M:%S)] Waiting ${PRIMARY_DOWNTIME}s (primary down)..."
sleep $PRIMARY_DOWNTIME
# Step 5: Restart primary (recovery)
echo "[$(date +%H:%M:%S)] RECOVERING PRIMARY..."
docker compose -f docker-compose.ha-cluster.yml start primary
echo "[$(date +%H:%M:%S)] Primary restarted"
# Step 6: Wait for workload to complete
echo "[$(date +%H:%M:%S)] Waiting for workload to complete..."
wait $WORKLOAD_PID 2>/dev/null
# Step 7: Analyze results
echo ""
echo "╔════════════════════════════════════════════════════════════════╗"
echo "║ TEST RESULTS ║"
echo "╚════════════════════════════════════════════════════════════════╝"
echo ""
# Extract summary
tail -10 /tmp/failover_test.log
# Analyze timing
echo ""
echo "Detailed Analysis:"
echo "─────────────────────────────────────────────────────────────────"
# Count operations by latency (grep -c prints 0 itself when nothing matches)
SLOW_OPS=$(grep -cE '\[[0-9]{4,}ms\]' /tmp/failover_test.log 2>/dev/null)
TOTAL_OPS=$(grep -c 'SELECT=\[ok\]' /tmp/failover_test.log 2>/dev/null)
echo "Total operations: $TOTAL_OPS"
echo "Operations with write timeout: $SLOW_OPS"
echo ""
# Show the slowest operation (write timeout in action)
echo "Longest operation (write timeout):"
grep -E '\[[0-9]{4,}ms\]' /tmp/failover_test.log | tail -1
echo ""
echo "Full log: /tmp/failover_test.log"
Running the Failover Test¶
chmod +x test_failover.sh
./test_failover.sh
Expected output:
╔════════════════════════════════════════════════════════════════╗
║ HeliosDB-Lite Failover Test ║
╚════════════════════════════════════════════════════════════════╝
[20:30:00] Starting workload...
[20:30:00] Running normal operations for 20s...
[20:30:20] SIMULATING PRIMARY FAILURE...
[20:30:21] Primary stopped
[20:30:21] Waiting 40s (primary down)...
[20:31:01] RECOVERING PRIMARY...
[20:31:03] Primary restarted
[20:31:30] Waiting for workload to complete...
╔════════════════════════════════════════════════════════════════╗
║ TEST RESULTS ║
╚════════════════════════════════════════════════════════════════╝
=== Workload Summary ===
Total iterations: 60
Successful: 60
Failed: 0
Success rate: 100%
Part 9: Application Continuity Testing¶
Continuous Application Workload¶
Create app_continuity_test.sh:
#!/bin/bash
# Application Continuity Test
# Simulates a real application with mixed read/write workload
PROXY_HOST="localhost"
PROXY_PORT="15400"
TEST_DURATION=180 # 3 minutes
ITERATIONS=0
SUCCESS=0
FAILED=0
WRITES=0
READS=0
# Setup
echo "Setting up test environment..."
PGPASSWORD=helios psql -h $PROXY_HOST -p $PROXY_PORT -U helios -d heliosdb <<EOF
DROP TABLE IF EXISTS app_orders;
CREATE TABLE app_orders (
id INTEGER PRIMARY KEY,
customer TEXT,
amount REAL,
status TEXT DEFAULT 'pending',
created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
EOF
echo "Starting application continuity test (${TEST_DURATION}s)..."
echo "Press Ctrl+C to stop"
echo ""
START_TIME=$(date +%s)
while true; do
    CURRENT_TIME=$(date +%s)
    ELAPSED=$((CURRENT_TIME - START_TIME))
    if [ $ELAPSED -ge $TEST_DURATION ]; then
        break
    fi
    ITERATIONS=$((ITERATIONS + 1))
    # Simulate mixed workload (70% reads, 30% writes)
    RANDOM_OP=$((RANDOM % 10))
    if [ $RANDOM_OP -lt 7 ]; then
        # READ operation
        RESULT=$(PGPASSWORD=helios psql -h $PROXY_HOST -p $PROXY_PORT -U helios -d heliosdb -t -c \
            "SELECT COUNT(*) FROM app_orders WHERE status = 'completed'" 2>&1)
        if [[ "$RESULT" =~ ^[[:space:]]*[0-9]+[[:space:]]*$ ]]; then
            SUCCESS=$((SUCCESS + 1))
            READS=$((READS + 1))
            echo -ne "\r[$(date +%H:%M:%S)] Iter $ITERATIONS: READ ✓ (${READS} reads, ${WRITES} writes, ${FAILED} failed)"
        else
            FAILED=$((FAILED + 1))
            echo -e "\n[$(date +%H:%M:%S)] Iter $ITERATIONS: READ ✗ - $RESULT"
        fi
    else
        # WRITE operation
        ORDER_ID=$ITERATIONS
        CUSTOMER="customer_$((RANDOM % 100))"
        AMOUNT="$((RANDOM % 1000)).$((RANDOM % 100))"
        RESULT=$(PGPASSWORD=helios psql -h $PROXY_HOST -p $PROXY_PORT -U helios -d heliosdb -t -c \
            "INSERT INTO app_orders (id, customer, amount) VALUES ($ORDER_ID, '$CUSTOMER', $AMOUNT) ON CONFLICT (id) DO UPDATE SET amount = $AMOUNT" 2>&1)
        if [[ "$RESULT" == *"INSERT"* ]] || [[ "$RESULT" == *"UPDATE"* ]] || [[ -z "$(echo $RESULT | tr -d '[:space:]')" ]]; then
            SUCCESS=$((SUCCESS + 1))
            WRITES=$((WRITES + 1))
            echo -ne "\r[$(date +%H:%M:%S)] Iter $ITERATIONS: WRITE ✓ (${READS} reads, ${WRITES} writes, ${FAILED} failed)"
        else
            FAILED=$((FAILED + 1))
            echo -e "\n[$(date +%H:%M:%S)] Iter $ITERATIONS: WRITE ✗ - $RESULT"
        fi
    fi
    # Small delay between operations
    sleep 0.5
done
echo ""
echo ""
echo "╔════════════════════════════════════════════════════════════════╗"
echo "║ Application Continuity Test Results ║"
echo "╚════════════════════════════════════════════════════════════════╝"
echo ""
echo "Duration: ${TEST_DURATION}s"
echo "Total ops: $ITERATIONS"
echo "Successful: $SUCCESS"
echo "Failed: $FAILED"
echo "Read ops: $READS"
echo "Write ops: $WRITES"
echo "Success rate: $(echo "scale=2; $SUCCESS * 100 / $ITERATIONS" | bc)%"
Running Continuity Test with Multiple Switchovers¶
# Terminal 1: Start the continuity test
./app_continuity_test.sh
# Terminal 2: Perform multiple disruptions
sleep 30
echo "=== First disruption: Stop primary ==="
docker compose -f docker-compose.ha-cluster.yml stop primary
sleep 35
docker compose -f docker-compose.ha-cluster.yml start primary
sleep 30
echo "=== Second disruption: Stop standby-sync ==="
docker compose -f docker-compose.ha-cluster.yml stop standby-sync
sleep 20
docker compose -f docker-compose.ha-cluster.yml start standby-sync
sleep 30
echo "=== Third disruption: Network partition (stop all standbys) ==="
docker compose -f docker-compose.ha-cluster.yml stop standby-sync standby-async
sleep 15
docker compose -f docker-compose.ha-cluster.yml start standby-sync standby-async
Part 10: Advanced Scenarios¶
Scenario 1: Cascading Failure Test¶
Test system behavior when multiple nodes fail sequentially:
#!/bin/bash
# Cascading failure test
echo "Starting cascading failure test..."
# Start workload
./pg_workload.sh --duration 120 --interval 1 > /tmp/cascade_test.log 2>&1 &
WORKLOAD_PID=$!
sleep 15
echo "[$(date +%H:%M:%S)] Stopping standby-async..."
docker compose -f docker-compose.ha-cluster.yml stop standby-async
sleep 15
echo "[$(date +%H:%M:%S)] Stopping standby-sync..."
docker compose -f docker-compose.ha-cluster.yml stop standby-sync
sleep 15
echo "[$(date +%H:%M:%S)] Stopping primary (total outage)..."
docker compose -f docker-compose.ha-cluster.yml stop primary
sleep 20
echo "[$(date +%H:%M:%S)] Recovering primary..."
docker compose -f docker-compose.ha-cluster.yml start primary
sleep 10
echo "[$(date +%H:%M:%S)] Recovering standby-sync..."
docker compose -f docker-compose.ha-cluster.yml start standby-sync
sleep 10
echo "[$(date +%H:%M:%S)] Recovering standby-async..."
docker compose -f docker-compose.ha-cluster.yml start standby-async
wait $WORKLOAD_PID
echo ""
tail -n 20 /tmp/cascade_test.log
Scenario 2: Rolling Restart¶
Perform rolling restart without downtime:
#!/bin/bash
# Rolling restart - maintain availability during updates
echo "Starting rolling restart..."
# Restart standbys first (one at a time)
echo "[$(date +%H:%M:%S)] Restarting standby-async..."
docker compose -f docker-compose.ha-cluster.yml restart standby-async
sleep 10
echo "[$(date +%H:%M:%S)] Restarting standby-sync..."
docker compose -f docker-compose.ha-cluster.yml restart standby-sync
sleep 10
# Restart primary last (writes will use write timeout)
echo "[$(date +%H:%M:%S)] Restarting primary..."
docker compose -f docker-compose.ha-cluster.yml restart primary
sleep 10
echo "[$(date +%H:%M:%S)] Rolling restart complete"
Scenario 3: Load Testing with Failover¶
#!/bin/bash
# High-load failover test
CONCURRENCY=5
echo "Starting $CONCURRENCY concurrent workloads..."
# Start multiple concurrent workloads
for i in $(seq 1 $CONCURRENCY); do
./pg_workload.sh --duration 60 --interval 0.5 > /tmp/load_test_$i.log 2>&1 &
echo "Started workload $i (PID: $!)"
done
sleep 20
echo "Simulating failover..."
docker compose -f docker-compose.ha-cluster.yml stop primary
sleep 35
docker compose -f docker-compose.ha-cluster.yml start primary
# Wait for all workloads
wait
echo ""
echo "Results:"
for i in $(seq 1 $CONCURRENCY); do
echo "Workload $i:"
tail -5 /tmp/load_test_$i.log | grep -E "(Success|Failed)"
done
Quick Reference¶
Port Mappings (Docker)¶
| Service | PG Port | Native Port | HTTP Port | Admin Port |
|---|---|---|---|---|
| Primary | 15432 | 15433 | 18080 | - |
| Standby-Sync | 15442 | 15443 | 18081 | - |
| Standby-Async | 15462 | 15463 | 18084 | - |
| Proxy | 15400 | - | - | 19090 |
Port Mappings (Local)¶
| Service | PG Port | Native Port | HTTP Port | Admin Port |
|---|---|---|---|---|
| Primary | 5432 | 5433 | 8080 | - |
| Standby-Sync | 5442 | 5443 | 8081 | - |
| Standby-Async | 5452 | 5453 | 8082 | - |
| Proxy | 5400 | - | - | 9090 |
Common Commands¶
# Start cluster
docker compose -f docker-compose.ha-cluster.yml up -d
# Stop cluster
docker compose -f docker-compose.ha-cluster.yml down
# View logs
docker compose -f docker-compose.ha-cluster.yml logs -f
# Restart single node
docker compose -f docker-compose.ha-cluster.yml restart primary
# Connect through proxy
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb
# Check proxy health
curl http://localhost:19090/health
Troubleshooting¶
| Issue | Cause | Solution |
|---|---|---|
| "No healthy nodes" | All nodes down | Check container status, restart cluster |
| High latency writes | Primary slow/recovering | Check primary logs, wait for recovery |
| Replication lag | Network/disk issues | Check standby logs, verify connectivity |
| Connection refused | Wrong port/service down | Verify port mappings, check service health |
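For the "Connection refused" row in particular, a quick port sweep narrows things down. This sketch assumes the Docker port mappings from the Quick Reference and uses bash's built-in /dev/tcp, so it needs no extra tools:

```shell
# Probe each mapped cluster port and report open/closed
probe_port() {
  # bash /dev/tcp: the redirection succeeds only if a TCP connect succeeds
  if (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null; then echo open; else echo closed; fi
}

for port in 15400 15432 15442 15462 19090; do
  echo "port $port: $(probe_port 127.0.0.1 "$port")"
done
```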
Summary¶
This tutorial covered:
- Docker Deployment - Full HA cluster with proxy
- Local Deployment - Multi-instance setup using different ports
- TWR - Automatic write routing to primary
- TRR - Read load balancing across all nodes
- HeliosProxy - Architecture and configuration
- Monitoring - Real-time cluster health tracking
- Switchover - Planned role changes
- Failover - Automatic recovery from failures
- Application Continuity - Maintaining operations during disruptions
- Advanced Scenarios - Cascading failures, rolling restarts, load testing
Key takeaways:
- Write timeout ensures writes eventually succeed during brief outages
- Automatic recovery requires no manual intervention
- Read routing maintains read availability even when the primary is down
- 100% success rate is achievable with proper timeout configuration