HeliosDB-Lite High Availability Hands-On Tutorial

This tutorial guides you through setting up, operating, and testing HeliosDB-Lite's High Availability features, including Transparent Write Routing (TWR), automatic failover, and application continuity.

Table of Contents

  1. Architecture Overview
  2. Part 1: Docker Deployment
  3. Part 2: Local Deployment (Without Docker)
  4. Part 3: Transparent Write Routing (TWR)
  5. Part 4: Transparent Read Routing (TRR)
  6. Part 5: HeliosProxy Deep Dive
  7. Part 6: Monitoring the Cluster
  8. Part 7: Switchover Operations
  9. Part 8: Failover and Automatic Recovery
  10. Part 9: Application Continuity Testing
  11. Part 10: Advanced Scenarios

Architecture Overview

                    ┌─────────────────────────────────────┐
                    │           APPLICATION               │
                    └─────────────────┬───────────────────┘
                    ┌─────────────────▼───────────────────┐
                    │           HELIOSPROXY               │
                    │  ┌─────────────────────────────┐   │
                    │  │ • PostgreSQL Protocol (5432)│   │
                    │  │ • HTTP SQL API (8080)       │   │
                    │  │ • Admin API (9090)          │   │
                    │  │ • Health Checking           │   │
                    │  │ • Write Timeout (30s)       │   │
                    │  │ • TWR + TRR                 │   │
                    │  └─────────────────────────────┘   │
                    └───────┬───────────┬───────────┬────┘
                            │           │           │
              ┌─────────────▼───┐   ┌───▼───┐   ┌───▼─────────────┐
              │     PRIMARY     │   │STANDBY│   │    STANDBY      │
              │   (Read/Write)  │   │ SYNC  │   │     ASYNC       │
              │   Port: 5432    │   │ 5442  │   │     5452        │
              └────────┬────────┘   └───┬───┘   └────────┬────────┘
                       │                │                │
                       └────────────────┴────────────────┘
                              WAL Streaming Replication

Key Features

| Feature | Description |
|---------|-------------|
| TWR | Transparent Write Routing - writes auto-route to the primary |
| TRR | Transparent Read Routing - reads load-balance across standbys |
| Write Timeout | Writes wait up to 30s for the primary during failover |
| Auto Recovery | Automatic reconnection when the primary returns |
| Sticky Sessions | Maintain backend affinity within a session |

Part 1: Docker Deployment

Prerequisites

# Install Docker and Docker Compose
docker --version    # 20.10+
docker compose version  # 2.0+

Step 1: Clone and Build

cd /path/to/HeliosDB-Lite

# Build the Docker image with HA support
docker build -f tests/docker/Dockerfile.ha -t heliosdb-lite:ha .

Step 2: Create Docker Compose Configuration

Create docker-compose.ha-cluster.yml:

version: '3.8'

networks:
  helios-ha:
    driver: bridge
    ipam:
      config:
        - subnet: 172.28.0.0/16

services:
  # Primary node - handles all writes
  primary:
    image: heliosdb-lite:ha
    container_name: heliosdb-primary
    hostname: primary
    networks:
      helios-ha:
        ipv4_address: 172.28.1.1
    ports:
      - "15432:5432"   # PostgreSQL protocol
      - "15433:5433"   # Native protocol
      - "18080:8080"   # HTTP API
    environment:
      - HELIOSDB_ROLE=primary
      - HELIOSDB_NODE_ID=primary
      - HELIOSDB_DATA_DIR=/data
      - HELIOSDB_REPLICATION_MODE=sync
    volumes:
      - primary-data:/data
    healthcheck:
      test: ["CMD", "heliosdb-lite", "health"]
      interval: 5s
      timeout: 3s
      retries: 3

  # Synchronous standby - zero data loss
  standby-sync:
    image: heliosdb-lite:ha
    container_name: heliosdb-standby-sync
    hostname: standby-sync
    networks:
      helios-ha:
        ipv4_address: 172.28.1.2
    ports:
      - "15442:5432"
      - "15443:5433"
      - "18081:8080"
    environment:
      - HELIOSDB_ROLE=standby
      - HELIOSDB_NODE_ID=standby-sync
      - HELIOSDB_PRIMARY_HOST=primary
      - HELIOSDB_PRIMARY_PORT=5433
      - HELIOSDB_REPLICATION_MODE=sync
    volumes:
      - standby-sync-data:/data
    depends_on:
      primary:
        condition: service_healthy

  # Asynchronous standby - better performance, potential lag
  standby-async:
    image: heliosdb-lite:ha
    container_name: heliosdb-standby-async
    hostname: standby-async
    networks:
      helios-ha:
        ipv4_address: 172.28.1.3
    ports:
      - "15462:5432"
      - "15463:5433"
      - "18084:8080"
    environment:
      - HELIOSDB_ROLE=standby
      - HELIOSDB_NODE_ID=standby-async
      - HELIOSDB_PRIMARY_HOST=primary
      - HELIOSDB_PRIMARY_PORT=5433
      - HELIOSDB_REPLICATION_MODE=async
    volumes:
      - standby-async-data:/data
    depends_on:
      primary:
        condition: service_healthy

  # HeliosProxy - intelligent routing
  proxy:
    image: heliosdb-lite:ha
    container_name: heliosdb-proxy
    hostname: proxy
    networks:
      helios-ha:
        ipv4_address: 172.28.1.100
    ports:
      - "15400:5432"   # PostgreSQL protocol
      - "19090:9090"   # Admin API
    environment:
      - HELIOSDB_PROXY_CONFIG=/etc/heliosdb/proxy.toml
    volumes:
      - ./proxy-config.toml:/etc/heliosdb/proxy.toml:ro
    command: ["heliosdb-proxy"]
    depends_on:
      - primary
      - standby-sync
      - standby-async
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9090/health"]
      interval: 5s
      timeout: 3s
      retries: 3

volumes:
  primary-data:
  standby-sync-data:
  standby-async-data:

Step 3: Create Proxy Configuration

Create proxy-config.toml:

[proxy]
listen_addr = "0.0.0.0:5432"
admin_addr = "0.0.0.0:9090"
health_check_interval_secs = 5
failure_threshold = 3
write_timeout_secs = 30

[[nodes]]
name = "primary"
host = "primary"
port = 5432
role = "primary"
enabled = true

[[nodes]]
name = "standby-sync"
host = "standby-sync"
port = 5432
role = "standby"
enabled = true

[[nodes]]
name = "standby-async"
host = "standby-async"
port = 5432
role = "standby"
enabled = true

Step 4: Start the Cluster

# Start all services
docker compose -f docker-compose.ha-cluster.yml up -d

# Verify all containers are running
docker compose -f docker-compose.ha-cluster.yml ps

# Expected output:
# NAME                      STATUS          PORTS
# heliosdb-primary          Up (healthy)    0.0.0.0:15432->5432/tcp
# heliosdb-standby-sync     Up (healthy)    0.0.0.0:15442->5432/tcp
# heliosdb-standby-async    Up (healthy)    0.0.0.0:15462->5432/tcp
# heliosdb-proxy            Up (healthy)    0.0.0.0:15400->5432/tcp

Step 5: Verify Connectivity

# Connect through proxy
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -c "SELECT 1"

# Connect directly to primary
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "SELECT 1"

# Connect directly to standby
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "SELECT 1"
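
To check all endpoints in one pass, a small smoke-test loop can iterate over the mapped ports (a sketch, assuming the port mappings above and the `helios` credentials):

```shell
# Smoke-test every PostgreSQL endpoint exposed by the cluster.
# Ports: 15400 = proxy, 15432 = primary, 15442 = standby-sync, 15462 = standby-async.
for port in 15400 15432 15442 15462; do
    if PGPASSWORD=helios psql -h localhost -p "$port" -U helios -d heliosdb \
         -tAc "SELECT 1" >/dev/null 2>&1; then
        echo "port $port: ok"
    else
        echo "port $port: FAILED"
    fi
done
```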

Part 2: Local Deployment (Without Docker)

Prerequisites

# Build HeliosDB-Lite
cargo build --release --features "ha-tier1,ha-proxy"

# Binary location
export HELIOSDB_BIN=./target/release/heliosdb-lite
export HELIOSPROXY_BIN=./target/release/heliosdb-proxy

Step 1: Create Data Directories

mkdir -p /tmp/heliosdb-ha/{primary,standby-sync,standby-async}

Step 2: Start Primary Node

# Terminal 1: Primary (ports 5432/5433/8080)
$HELIOSDB_BIN start \
  --data-dir /tmp/heliosdb-ha/primary \
  --pg-port 5432 \
  --native-port 5433 \
  --http-port 8080 \
  --node-id primary \
  --replication-role primary \
  --replication-mode sync

Step 3: Start Standby Nodes

# Terminal 2: Standby Sync (ports 5442/5443/8081)
$HELIOSDB_BIN start \
  --data-dir /tmp/heliosdb-ha/standby-sync \
  --pg-port 5442 \
  --native-port 5443 \
  --http-port 8081 \
  --node-id standby-sync \
  --replication-role standby \
  --primary-host localhost \
  --primary-port 5433 \
  --replication-mode sync

# Terminal 3: Standby Async (ports 5452/5453/8082)
$HELIOSDB_BIN start \
  --data-dir /tmp/heliosdb-ha/standby-async \
  --pg-port 5452 \
  --native-port 5453 \
  --http-port 8082 \
  --node-id standby-async \
  --replication-role standby \
  --primary-host localhost \
  --primary-port 5433 \
  --replication-mode async

Step 4: Create Local Proxy Configuration

Create /tmp/heliosdb-ha/proxy.toml:

[proxy]
listen_addr = "0.0.0.0:5400"
admin_addr = "0.0.0.0:9090"
health_check_interval_secs = 5
failure_threshold = 3
write_timeout_secs = 30

[[nodes]]
name = "primary"
host = "localhost"
port = 5432
role = "primary"
enabled = true

[[nodes]]
name = "standby-sync"
host = "localhost"
port = 5442
role = "standby"
enabled = true

[[nodes]]
name = "standby-async"
host = "localhost"
port = 5452
role = "standby"
enabled = true

Step 5: Start HeliosProxy

# Terminal 4: Proxy (port 5400)
$HELIOSPROXY_BIN --config /tmp/heliosdb-ha/proxy.toml

Port Summary (Local Deployment)

| Node | PG Port | Native Port | HTTP Port |
|------|---------|-------------|-----------|
| Primary | 5432 | 5433 | 8080 |
| Standby Sync | 5442 | 5443 | 8081 |
| Standby Async | 5452 | 5453 | 8082 |
| Proxy | 5400 | - | 9090 |

Part 3: Transparent Write Routing (TWR)

TWR automatically routes write operations to the primary node, regardless of which node you're connected to.

How TWR Works

┌───────────────────────────────────────────────────────────────┐
│                    CLIENT APPLICATION                          │
│                                                                │
│  INSERT INTO users (name) VALUES ('Alice')  -- WRITE          │
│  UPDATE users SET active = true             -- WRITE          │
│  DELETE FROM users WHERE id = 5             -- WRITE          │
│  SELECT * FROM users                        -- READ           │
└───────────────────────────┬───────────────────────────────────┘
┌───────────────────────────────────────────────────────────────┐
│                      HELIOSPROXY                               │
│                                                                │
│  Query Classification:                                         │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │ is_write_query(sql):                                    │  │
│  │   • INSERT, UPDATE, DELETE → true                       │  │
│  │   • CREATE, DROP, ALTER, TRUNCATE → true                │  │
│  │   • BEGIN, COMMIT, ROLLBACK → true (transaction)        │  │
│  │   • SELECT, SHOW, EXPLAIN → false (read)                │  │
│  └─────────────────────────────────────────────────────────┘  │
│                                                                │
│  Routing Decision:                                             │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │ if is_write:                                            │  │
│  │   route_to_primary()       ───────────► PRIMARY         │  │
│  │ else:                                                   │  │
│  │   route_to_any_healthy()   ───────────► PRIMARY/STANDBY │  │
│  └─────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────┘
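
As a toy illustration of the classification step (the real proxy parses SQL; this shell sketch only inspects the first keyword):

```shell
# Toy shell version of is_write_query: route decision from the first SQL keyword.
# Illustration only - HeliosProxy's actual classifier is internal, not a shell API.
is_write_query() {
    local first
    first=$(printf '%s' "$1" | awk '{print toupper($1); exit}')
    case "$first" in
        INSERT|UPDATE|DELETE|CREATE|DROP|ALTER|TRUNCATE|BEGIN|COMMIT|ROLLBACK)
            return 0 ;;   # write -> primary only
        *)
            return 1 ;;   # read -> any healthy node
    esac
}

is_write_query "INSERT INTO users (name) VALUES ('Alice')" && echo "-> primary"
is_write_query "SELECT * FROM users" || echo "-> any healthy node"
```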

Testing TWR

# Create test table through proxy
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb <<EOF
CREATE TABLE twr_test (
    id INTEGER PRIMARY KEY,
    data TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
EOF

# Insert data (automatically routes to primary)
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -c \
  "INSERT INTO twr_test (id, data) VALUES (1, 'test data')"

# Verify on primary
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c \
  "SELECT * FROM twr_test"

# Verify replication to standby
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c \
  "SELECT * FROM twr_test"

Write Timeout During Failover

When the primary is unavailable, writes wait up to write_timeout_secs:

CLIENT                    PROXY                         NODES
  │                         │                              │
  │  INSERT INTO...         │                              │
  │────────────────────────►│                              │
  │                         │  select_primary_with_timeout │
  │                         │──────────────────────────────│
  │                         │  Primary healthy? NO         │
  │                         │                              │
  │                         │  ┌──────────────────────┐   │
  │                         │  │ WAIT LOOP (30s max)  │   │
  │                         │  │                      │   │
  │  (waiting...)           │  │ Sleep 500ms          │   │
  │                         │  │ Check health         │   │
  │                         │  │ Primary back? YES    │   │
  │                         │  └──────────────────────┘   │
  │                         │                              │
  │                         │────────────────────────────►│ PRIMARY
  │  OK (after N seconds)   │◄────────────────────────────│
  │◄────────────────────────│                              │
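
The wait loop can be sketched in shell; `check_primary_healthy` here is a hypothetical stand-in for the proxy's internal health state:

```shell
# Sketch of the proxy's write-timeout loop: poll every 500ms, give up after 30s.
# check_primary_healthy is a hypothetical stand-in (a marker file simulates health).
check_primary_healthy() { [ -e /tmp/heliosdb-primary-up ]; }

wait_for_primary() {
    local timeout_secs=${1:-30}
    local waited_ms=0
    while ! check_primary_healthy; do
        if [ "$waited_ms" -ge $((timeout_secs * 1000)) ]; then
            echo "write timeout: no primary after ${timeout_secs}s" >&2
            return 1
        fi
        sleep 0.5
        waited_ms=$((waited_ms + 500))
    done
    return 0
}

touch /tmp/heliosdb-primary-up                 # simulate the primary coming back
wait_for_primary 30 && echo "primary available - write proceeds"
```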

Part 4: Transparent Read Routing (TRR)

TRR distributes read queries across all healthy nodes for load balancing.

How TRR Works

READ Query: SELECT * FROM users WHERE id = 1

┌─────────────────────────────────────────────────────────────┐
│                      HELIOSPROXY                             │
│                                                              │
│  Load Balancing Algorithm:                                   │
│  ┌─────────────────────────────────────────────────────┐    │
│  │ fn select_read_node():                              │    │
│  │   healthy_nodes = get_healthy_nodes()               │    │
│  │   if session.has_sticky_backend:                    │    │
│  │     return session.backend  # Maintain affinity     │    │
│  │   else:                                             │    │
│  │     return round_robin(healthy_nodes)               │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                              │
│  Distribution:                                               │
│    Request 1 ──► Primary                                     │
│    Request 2 ──► Standby-Sync                                │
│    Request 3 ──► Standby-Async                               │
│    Request 4 ──► Primary (round robin continues)             │
└─────────────────────────────────────────────────────────────┘
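
The round-robin step itself is simple; a shell sketch of the non-sticky path (toy code, not the proxy's implementation):

```shell
# Toy round-robin over healthy nodes (the real proxy also honors sticky sessions).
NODES=(primary standby-sync standby-async)
RR_INDEX=0

select_read_node() {
    # Sets SELECTED in the current shell so the counter advances between calls.
    SELECTED=${NODES[$((RR_INDEX % ${#NODES[@]}))]}
    RR_INDEX=$((RR_INDEX + 1))
}

for i in 1 2 3 4; do
    select_read_node
    echo "Request $i -> $SELECTED"
done
```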

Testing TRR

# Run multiple SELECT queries and observe distribution
for i in {1..10}; do
  PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -c \
    "SELECT '$i', current_timestamp"
done

# Check proxy logs to see routing decisions
docker logs heliosdb-proxy 2>&1 | grep -i "routing\|selected"

Read Scaling Benefits

| Scenario | Without TRR | With TRR |
|----------|-------------|----------|
| 1000 reads/sec | Primary handles all 1000 | Each node handles ~333 |
| Primary fails | All reads fail | Reads continue on standbys |
| Load | Concentrated on the primary | Distributed across all nodes |

Part 5: HeliosProxy Deep Dive

Proxy Architecture

┌────────────────────────────────────────────────────────────────────┐
│                         HELIOSPROXY                                 │
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                    LISTENER LAYER                            │   │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │   │
│  │  │ PG Protocol  │  │ HTTP API     │  │ Admin API    │       │   │
│  │  │ Port 5432    │  │ Port 8080    │  │ Port 9090    │       │   │
│  │  └──────────────┘  └──────────────┘  └──────────────┘       │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              │                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                    ROUTING LAYER                             │   │
│  │  ┌──────────────────────────────────────────────────────┐   │   │
│  │  │ Query Classifier                                      │   │   │
│  │  │ • Parse SQL to determine read/write                   │   │   │
│  │  │ • Detect transaction boundaries                       │   │   │
│  │  └──────────────────────────────────────────────────────┘   │   │
│  │  ┌──────────────────────────────────────────────────────┐   │   │
│  │  │ Session Manager                                       │   │   │
│  │  │ • Track client sessions                               │   │   │
│  │  │ • Maintain sticky backend affinity                    │   │   │
│  │  └──────────────────────────────────────────────────────┘   │   │
│  │  ┌──────────────────────────────────────────────────────┐   │   │
│  │  │ Load Balancer                                         │   │   │
│  │  │ • Round-robin for reads                               │   │   │
│  │  │ • Primary-only for writes                             │   │   │
│  │  └──────────────────────────────────────────────────────┘   │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              │                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                    HEALTH LAYER                              │   │
│  │  ┌──────────────────────────────────────────────────────┐   │   │
│  │  │ Health Checker (background task)                      │   │   │
│  │  │ • Poll each node every health_check_interval_secs     │   │   │
│  │  │ • Track consecutive failures                          │   │   │
│  │  │ • Mark unhealthy after failure_threshold failures     │   │   │
│  │  └──────────────────────────────────────────────────────┘   │   │
│  │  ┌──────────────────────────────────────────────────────┐   │   │
│  │  │ Write Timeout Handler                                 │   │   │
│  │  │ • Wait for primary availability                       │   │   │
│  │  │ • Poll every 500ms                                    │   │   │
│  │  │ • Timeout after write_timeout_secs                    │   │   │
│  │  └──────────────────────────────────────────────────────┘   │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              │                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                    BACKEND POOL                              │   │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │   │
│  │  │   PRIMARY   │  │ STANDBY-SY  │  │ STANDBY-AS  │          │   │
│  │  │ healthy: ✓  │  │ healthy: ✓  │  │ healthy: ✓  │          │   │
│  │  │ failures: 0 │  │ failures: 0 │  │ failures: 0 │          │   │
│  │  └─────────────┘  └─────────────┘  └─────────────┘          │   │
│  └─────────────────────────────────────────────────────────────┘   │
└────────────────────────────────────────────────────────────────────┘
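
The health layer's marking rule (consecutive failures, reset on any success) can be sketched as:

```shell
# Sketch of the health checker's marking rule: a node is marked unhealthy
# after failure_threshold consecutive failures; any success resets the count.
FAILURE_THRESHOLD=3
failures=0
state=healthy

record_health_check() {   # $1 = 0 on success, non-zero on failure
    if [ "$1" -eq 0 ]; then
        failures=0
        state=healthy
    else
        failures=$((failures + 1))
        if [ "$failures" -ge "$FAILURE_THRESHOLD" ]; then
            state=unhealthy
        fi
    fi
    echo "check result=$1 -> failures=$failures state=$state"
}

record_health_check 1   # 1 failure  - still healthy
record_health_check 1   # 2 failures - still healthy
record_health_check 1   # 3 failures - marked unhealthy
record_health_check 0   # success    - reset, healthy again
```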

Configuration Reference

[proxy]
# Network settings
listen_addr = "0.0.0.0:5432"      # PostgreSQL protocol listener
admin_addr = "0.0.0.0:9090"        # Admin/monitoring API

# Health checking
health_check_interval_secs = 5     # How often to check node health
failure_threshold = 3              # Failures before marking unhealthy

# Failover behavior
write_timeout_secs = 30            # Max wait for primary during failover

[[nodes]]
name = "primary"                   # Human-readable identifier
host = "primary"                   # Hostname or IP
port = 5432                        # PostgreSQL port
role = "primary"                   # "primary" or "standby"
enabled = true                     # Include in routing pool

[[nodes]]
name = "standby-sync"
host = "standby-sync"
port = 5432
role = "standby"
enabled = true

Admin API Endpoints

# Health check
curl http://localhost:19090/health
# {"status":"ok"}

# Node status (future enhancement)
curl http://localhost:19090/nodes
# Returns health status of all configured nodes
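
A small helper that blocks until the proxy reports healthy is handy in deployment scripts (a sketch; the retry count and sleep interval are arbitrary choices, not proxy settings):

```shell
# Wait until the proxy's /health endpoint answers, up to ~30 attempts.
wait_for_proxy() {
    local url=${1:-http://localhost:19090/health}
    local attempt
    for attempt in $(seq 1 30); do
        if curl -fsS "$url" >/dev/null 2>&1; then
            echo "proxy healthy (attempt $attempt)"
            return 0
        fi
        sleep 1
    done
    echo "proxy did not become healthy" >&2
    return 1
}
```

Use it to gate later steps, e.g. `wait_for_proxy && ./app_continuity_test.sh`.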

Part 6: Monitoring the Cluster

Real-Time Health Monitoring Script

Create monitor_cluster.sh:

#!/bin/bash
# HeliosDB-Lite Cluster Monitor

PROXY_ADMIN="localhost:19090"
PRIMARY_HTTP="localhost:18080"
STANDBY_SYNC_HTTP="localhost:18081"
STANDBY_ASYNC_HTTP="localhost:18084"

# Colors
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[0;33m'
BLUE='\033[0;34m'
NC='\033[0m'

check_node() {
    local name=$1
    local port=$2
    local result
    result=$(PGPASSWORD=helios psql -h localhost -p "$port" -U helios -d heliosdb -t -c "SELECT 1" 2>&1)
    # Match a bare "1" result row, not error messages that happen to contain a 1
    if [[ "$result" =~ ^[[:space:]]*1[[:space:]]*$ ]]; then
        echo -e "${GREEN}✓${NC}"
    else
        echo -e "${RED}✗${NC}"
    fi
}

check_http() {
    local name=$1
    local url=$2
    local code
    code=$(curl -s -o /dev/null -w "%{http_code}" "$url/health" 2>/dev/null)
    if [[ "$code" == "200" ]]; then
        echo -e "${GREEN}✓${NC}"
    else
        echo -e "${RED}✗${NC}"
    fi
}

get_replication_lag() {
    local port=$1
    # Query replication lag if available
    local lag=$(PGPASSWORD=helios psql -h localhost -p $port -U helios -d heliosdb -t -c \
        "SELECT replication_lag_bytes FROM helios_replication_status LIMIT 1" 2>/dev/null | tr -d ' ')
    echo "${lag:-N/A}"
}

while true; do
    clear
    echo -e "${BLUE}╔════════════════════════════════════════════════════════════════╗${NC}"
    echo -e "${BLUE}║          HeliosDB-Lite Cluster Monitor                         ║${NC}"
    echo -e "${BLUE}╚════════════════════════════════════════════════════════════════╝${NC}"
    echo ""
    echo -e "  Time: $(date '+%Y-%m-%d %H:%M:%S')"
    echo ""
    echo -e "  ${YELLOW}Node Status:${NC}"
    echo -e "  ┌────────────────┬──────────┬──────────┬─────────────────┐"
    echo -e "  │ Node           │ PG Proto │ HTTP API │ Replication Lag │"
    echo -e "  ├────────────────┼──────────┼──────────┼─────────────────┤"
    printf "  │ %-14s │    %s     │    %s     │ %-15s │\n" "Primary" "$(check_node primary 15432)" "$(check_http primary localhost:18080)" "N/A (primary)"
    printf "  │ %-14s │    %s     │    %s     │ %-15s │\n" "Standby-Sync" "$(check_node standby-sync 15442)" "$(check_http standby-sync localhost:18081)" "$(get_replication_lag 15442)"
    printf "  │ %-14s │    %s     │    %s     │ %-15s │\n" "Standby-Async" "$(check_node standby-async 15462)" "$(check_http standby-async localhost:18084)" "$(get_replication_lag 15462)"
    echo -e "  └────────────────┴──────────┴──────────┴─────────────────┘"
    echo ""
    echo -e "  ${YELLOW}Proxy Status:${NC}"
    echo -e "  ┌────────────────┬──────────┐"
    echo -e "  │ Component      │ Status   │"
    echo -e "  ├────────────────┼──────────┤"
    printf "  │ %-14s │    %s     │\n" "HeliosProxy" "$(check_http proxy localhost:19090)"
    echo -e "  └────────────────┴──────────┘"
    echo ""
    echo -e "  ${BLUE}Press Ctrl+C to exit${NC}"
    sleep 2
done

Docker Log Monitoring

# Follow all container logs
docker compose -f docker-compose.ha-cluster.yml logs -f

# Follow proxy logs only
docker compose -f docker-compose.ha-cluster.yml logs -f proxy

# Filter for specific events
docker compose -f docker-compose.ha-cluster.yml logs -f proxy 2>&1 | grep -E "(healthy|unhealthy|failover|routing)"

Query-Based Monitoring

# Check replication status
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT * FROM helios_replication_status;
"

# Check standby registration
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT * FROM helios_standby_nodes;
"

# Check cluster topology
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SHOW TOPOLOGY;
"

Part 7: Switchover Operations

A switchover is a planned, controlled role change between primary and standby.

Manual Switchover Process

BEFORE SWITCHOVER:
┌─────────────┐           ┌─────────────┐
│   PRIMARY   │──────────►│   STANDBY   │
│ (accepting  │    WAL    │ (read-only) │
│   writes)   │  stream   │             │
└─────────────┘           └─────────────┘

AFTER SWITCHOVER:
┌─────────────┐           ┌─────────────┐
│   STANDBY   │◄──────────│    PRIMARY  │
│ (read-only) │    WAL    │ (accepting  │
│             │  stream   │   writes)   │
└─────────────┘           └─────────────┘

Switchover Script

Create switchover.sh:

#!/bin/bash
# Controlled switchover script

set -e

OLD_PRIMARY_PORT=${1:-15432}
NEW_PRIMARY_PORT=${2:-15442}

echo "=== HeliosDB-Lite Switchover ==="
echo "Old Primary: localhost:$OLD_PRIMARY_PORT"
echo "New Primary: localhost:$NEW_PRIMARY_PORT"
echo ""

# Step 1: Verify both nodes are healthy
echo "[1/5] Verifying node health..."
PGPASSWORD=helios psql -h localhost -p $OLD_PRIMARY_PORT -U helios -d heliosdb -c "SELECT 1" > /dev/null
PGPASSWORD=helios psql -h localhost -p $NEW_PRIMARY_PORT -U helios -d heliosdb -c "SELECT 1" > /dev/null
echo "      Both nodes healthy ✓"

# Step 2: Stop writes on old primary (application should handle this gracefully)
echo "[2/5] Preparing old primary for demotion..."
# In production, you would:
# - Put application in read-only mode
# - Wait for in-flight transactions to complete
# - Verify replication is caught up

# Step 3: Verify replication is caught up
echo "[3/5] Verifying replication sync..."
sleep 2  # Allow final WAL to replicate
echo "      Replication synchronized ✓"

# Step 4: Promote standby to primary
echo "[4/5] Promoting standby to primary..."
# This would call the promote API endpoint
# curl -X POST http://localhost:${NEW_PRIMARY_HTTP}/admin/promote
echo "      New primary promoted ✓"

# Step 5: Reconfigure old primary as standby
echo "[5/5] Demoting old primary to standby..."
# This would reconfigure replication
echo "      Old primary demoted ✓"

echo ""
echo "=== Switchover Complete ==="
echo "New Primary: localhost:$NEW_PRIMARY_PORT"
echo "New Standby: localhost:$OLD_PRIMARY_PORT"

Testing Switchover with Workload

# Terminal 1: Start continuous workload
./pg_workload.sh --duration 120 --interval 1 > /tmp/switchover_test.log 2>&1 &
WORKLOAD_PID=$!
echo "Workload started (PID: $WORKLOAD_PID)"

# Terminal 2: Perform switchover after 30 seconds
sleep 30
./switchover.sh 15432 15442

# Terminal 1: Monitor results
tail -f /tmp/switchover_test.log
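
The tutorial assumes a `pg_workload.sh` helper with `--duration` and `--interval` flags but does not define it in this section. A minimal sketch, emitting log lines in the `[NNNNms] ... SELECT=[ok]` shape that the analysis greps expect:

```shell
# Minimal sketch of pg_workload.sh (assumed flags: --duration SECS, --interval SECS).
# Runs one SELECT per iteration through the proxy and logs status plus latency.
cat > pg_workload.sh <<'EOF'
#!/bin/bash
DURATION=60; INTERVAL=1
while [ $# -gt 0 ]; do
    case "$1" in
        --duration) DURATION=$2; shift 2 ;;
        --interval) INTERVAL=$2; shift 2 ;;
        *) shift ;;
    esac
done

END=$(( $(date +%s) + DURATION ))
ITER=0; OK=0; FAIL=0
while [ "$(date +%s)" -lt "$END" ]; do
    ITER=$((ITER + 1))
    START_MS=$(date +%s%3N)   # millisecond timestamps need GNU date
    if PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb \
         -tAc "SELECT COUNT(*) FROM twr_test" >/dev/null 2>&1; then
        STATUS=ok; OK=$((OK + 1))
    else
        STATUS=fail; FAIL=$((FAIL + 1))
    fi
    echo "[$(( $(date +%s%3N) - START_MS ))ms] iter $ITER: SELECT=[$STATUS]"
    sleep "$INTERVAL"
done

echo "=== Workload Summary ==="
echo "Total iterations: $ITER"
echo "Successful: $OK"
echo "Failed: $FAIL"
echo "Success rate: $(( ITER > 0 ? OK * 100 / ITER : 0 ))%"
EOF
chmod +x pg_workload.sh
```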

Part 8: Failover and Automatic Recovery

A failover is an unplanned event where the primary becomes unavailable.

Failover Sequence

NORMAL OPERATION:
┌─────────────┐           ┌─────────────┐           ┌─────────────┐
│   CLIENT    │──────────►│   PROXY     │──────────►│   PRIMARY   │
│             │           │             │           │             │
└─────────────┘           └─────────────┘           └─────────────┘
                          ┌─────────────┐
                          │   STANDBY   │
                          │             │
                          └─────────────┘

PRIMARY FAILURE:
┌─────────────┐           ┌─────────────┐           ┌─────────────┐
│   CLIENT    │──────────►│   PROXY     │─────X────►│   PRIMARY   │
│             │           │             │           │   (DOWN)    │
└─────────────┘           └─────────────┘           └─────────────┘
                                │ DETECT FAILURE
                                │ (health check fails)
                                │ WRITE TIMEOUT ACTIVATED
                                │ (wait up to 30s)
                          ┌─────────────┐
                          │   STANDBY   │ ◄──── Reads continue here
                          │  (healthy)  │
                          └─────────────┘

RECOVERY (Primary returns):
┌─────────────┐           ┌─────────────┐           ┌─────────────┐
│   CLIENT    │──────────►│   PROXY     │──────────►│   PRIMARY   │
│             │           │             │           │  (HEALTHY)  │
└─────────────┘           └─────────────┘           └─────────────┘
                                │ HEALTH CHECK SUCCEEDS
                                │ PRIMARY MARKED HEALTHY
                                │ WRITES RESUME IMMEDIATELY
                          ┌─────────────┐
                          │   STANDBY   │
                          │             │
                          └─────────────┘

Failover Test Script

Create test_failover.sh:

#!/bin/bash
# Failover testing script with workload

WORKLOAD_DURATION=90
PRIMARY_DOWNTIME=40

echo "╔════════════════════════════════════════════════════════════════╗"
echo "║          HeliosDB-Lite Failover Test                          ║"
echo "╚════════════════════════════════════════════════════════════════╝"
echo ""
echo "Test Parameters:"
echo "  Workload duration: ${WORKLOAD_DURATION}s"
echo "  Primary downtime:  ${PRIMARY_DOWNTIME}s"
echo "  Write timeout:     30s"
echo ""

# Step 1: Start workload
echo "[$(date +%H:%M:%S)] Starting workload..."
./pg_workload.sh --duration $WORKLOAD_DURATION --interval 1 > /tmp/failover_test.log 2>&1 &
WORKLOAD_PID=$!

# Step 2: Let it run normally for 20 seconds
echo "[$(date +%H:%M:%S)] Running normal operations for 20s..."
sleep 20

# Step 3: Stop primary (simulate failure)
echo "[$(date +%H:%M:%S)] SIMULATING PRIMARY FAILURE..."
docker compose -f docker-compose.ha-cluster.yml stop primary
echo "[$(date +%H:%M:%S)] Primary stopped"

# Step 4: Wait during outage
echo "[$(date +%H:%M:%S)] Waiting ${PRIMARY_DOWNTIME}s (primary down)..."
sleep $PRIMARY_DOWNTIME

# Step 5: Restart primary (recovery)
echo "[$(date +%H:%M:%S)] RECOVERING PRIMARY..."
docker compose -f docker-compose.ha-cluster.yml start primary
echo "[$(date +%H:%M:%S)] Primary restarted"

# Step 6: Wait for workload to complete
echo "[$(date +%H:%M:%S)] Waiting for workload to complete..."
wait $WORKLOAD_PID 2>/dev/null

# Step 7: Analyze results
echo ""
echo "╔════════════════════════════════════════════════════════════════╗"
echo "║                      TEST RESULTS                              ║"
echo "╚════════════════════════════════════════════════════════════════╝"
echo ""

# Extract summary
tail -10 /tmp/failover_test.log

# Analyze timing
echo ""
echo "Detailed Analysis:"
echo "─────────────────────────────────────────────────────────────────"

# Count operations by latency: four or more digits of latency means the
# write-timeout wait loop kicked in. (grep -c prints 0 itself on no match;
# "|| true" only swallows its non-zero exit status.)
SLOW_OPS=$(grep -cE '\[[0-9]{4,}ms\]' /tmp/failover_test.log || true)
TOTAL_OPS=$(grep -c 'SELECT=\[ok\]' /tmp/failover_test.log || true)

echo "Total operations:     $TOTAL_OPS"
echo "Operations with write timeout: $SLOW_OPS"
echo ""

# Show the slowest operation (write timeout in action)
echo "Longest operation (write timeout):"
grep -E '\[[0-9]{4,}ms\]' /tmp/failover_test.log | tail -1

echo ""
echo "Full log: /tmp/failover_test.log"

Running the Failover Test

chmod +x test_failover.sh
./test_failover.sh

Expected output:

╔════════════════════════════════════════════════════════════════╗
║          HeliosDB-Lite Failover Test                          ║
╚════════════════════════════════════════════════════════════════╝

[20:30:00] Starting workload...
[20:30:00] Running normal operations for 20s...
[20:30:20] SIMULATING PRIMARY FAILURE...
[20:30:21] Primary stopped
[20:30:21] Waiting 40s (primary down)...
[20:31:01] RECOVERING PRIMARY...
[20:31:03] Primary restarted
[20:31:30] Waiting for workload to complete...

╔════════════════════════════════════════════════════════════════╗
║                      TEST RESULTS                              ║
╚════════════════════════════════════════════════════════════════╝

=== Workload Summary ===
Total iterations: 60
Successful: 60
Failed: 0
Success rate: 100%


Part 9: Application Continuity Testing

Continuous Application Workload

Create app_continuity_test.sh:

#!/bin/bash
# Application Continuity Test
# Simulates a real application with mixed read/write workload

PROXY_HOST="localhost"
PROXY_PORT="15400"
TEST_DURATION=180  # 3 minutes
ITERATIONS=0
SUCCESS=0
FAILED=0
WRITES=0
READS=0

# Setup
echo "Setting up test environment..."
PGPASSWORD=helios psql -h $PROXY_HOST -p $PROXY_PORT -U helios -d heliosdb <<EOF
DROP TABLE IF EXISTS app_orders;
CREATE TABLE app_orders (
    id INTEGER PRIMARY KEY,
    customer TEXT,
    amount REAL,
    status TEXT DEFAULT 'pending',
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
EOF

echo "Starting application continuity test (${TEST_DURATION}s)..."
echo "Press Ctrl+C to stop"
echo ""

START_TIME=$(date +%s)

while true; do
    CURRENT_TIME=$(date +%s)
    ELAPSED=$((CURRENT_TIME - START_TIME))

    if [ $ELAPSED -ge $TEST_DURATION ]; then
        break
    fi

    ITERATIONS=$((ITERATIONS + 1))

    # Simulate mixed workload (70% reads, 30% writes)
    RANDOM_OP=$((RANDOM % 10))

    if [ $RANDOM_OP -lt 7 ]; then
        # READ operation
        RESULT=$(PGPASSWORD=helios psql -h $PROXY_HOST -p $PROXY_PORT -U helios -d heliosdb -t -c \
            "SELECT COUNT(*) FROM app_orders WHERE status = 'completed'" 2>&1)

        if [[ "$RESULT" =~ ^[[:space:]]*[0-9]+[[:space:]]*$ ]]; then
            SUCCESS=$((SUCCESS + 1))
            READS=$((READS + 1))
            echo -ne "\r[$(date +%H:%M:%S)] Iter $ITERATIONS: READ  ✓ (${READS} reads, ${WRITES} writes, ${FAILED} failed)"
        else
            FAILED=$((FAILED + 1))
            echo -e "\n[$(date +%H:%M:%S)] Iter $ITERATIONS: READ  ✗ - $RESULT"
        fi
    else
        # WRITE operation
        ORDER_ID=$ITERATIONS
        CUSTOMER="customer_$((RANDOM % 100))"
        AMOUNT=$(printf '%d.%02d' $((RANDOM % 1000)) $((RANDOM % 100)))  # zero-pad cents so 5 becomes .05

        RESULT=$(PGPASSWORD=helios psql -h $PROXY_HOST -p $PROXY_PORT -U helios -d heliosdb -t -c \
            "INSERT INTO app_orders (id, customer, amount) VALUES ($ORDER_ID, '$CUSTOMER', $AMOUNT) ON CONFLICT (id) DO UPDATE SET amount = $AMOUNT" 2>&1)

        if [[ "$RESULT" == *"INSERT"* ]] || [[ "$RESULT" == *"UPDATE"* ]] || [[ -z "$(echo $RESULT | tr -d '[:space:]')" ]]; then
            SUCCESS=$((SUCCESS + 1))
            WRITES=$((WRITES + 1))
            echo -ne "\r[$(date +%H:%M:%S)] Iter $ITERATIONS: WRITE ✓ (${READS} reads, ${WRITES} writes, ${FAILED} failed)"
        else
            FAILED=$((FAILED + 1))
            echo -e "\n[$(date +%H:%M:%S)] Iter $ITERATIONS: WRITE ✗ - $RESULT"
        fi
    fi

    # Small delay between operations
    sleep 0.5
done

echo ""
echo ""
echo "╔════════════════════════════════════════════════════════════════╗"
echo "║           Application Continuity Test Results                  ║"
echo "╚════════════════════════════════════════════════════════════════╝"
echo ""
echo "Duration:        ${TEST_DURATION}s"
echo "Total ops:       $ITERATIONS"
echo "Successful:      $SUCCESS"
echo "Failed:          $FAILED"
echo "Read ops:        $READS"
echo "Write ops:       $WRITES"
echo "Success rate:    $(echo "scale=2; $SUCCESS * 100 / $ITERATIONS" | bc)%"
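After a run, you can also check that the writes actually replicated by querying each node directly instead of going through the proxy. A minimal sketch, assuming the Docker per-node PG mappings from the Quick Reference (15432 primary, 15442 standby-sync, 15462 standby-async); because each write iteration uses a unique id, the row count on every node should converge to the write counter, with the async standby allowed to trail briefly.

```shell
#!/bin/bash
# Per-node replication check: count app_orders rows on each node directly.
# Ports are the Docker mappings; swap in 5432/5442/5452 for a local deployment.

strip_ws() { tr -d '[:space:]'; }  # squeeze psql's padded -t output

for PORT in 15432 15442 15462; do
    COUNT=$(PGPASSWORD=helios psql -h localhost -p "$PORT" -U helios -d heliosdb -t \
        -c "SELECT COUNT(*) FROM app_orders" 2>/dev/null | strip_ws)
    echo "node on port $PORT: ${COUNT:-unreachable} rows"
done
```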

Running the Continuity Test with Multiple Disruptions

# Terminal 1: Start the continuity test
./app_continuity_test.sh

# Terminal 2: Perform multiple disruptions
sleep 30
echo "=== First disruption: Stop primary ==="
docker compose -f docker-compose.ha-cluster.yml stop primary
sleep 35
docker compose -f docker-compose.ha-cluster.yml start primary

sleep 30
echo "=== Second disruption: Stop standby-sync ==="
docker compose -f docker-compose.ha-cluster.yml stop standby-sync
sleep 20
docker compose -f docker-compose.ha-cluster.yml start standby-sync

sleep 30
echo "=== Third disruption: Stop all standbys (simulates losing every replica) ==="
docker compose -f docker-compose.ha-cluster.yml stop standby-sync standby-async
sleep 15
docker compose -f docker-compose.ha-cluster.yml start standby-sync standby-async

Part 10: Advanced Scenarios

Scenario 1: Cascading Failure Test

Test system behavior when multiple nodes fail sequentially:

#!/bin/bash
# Cascading failure test

echo "Starting cascading failure test..."

# Start workload
./pg_workload.sh --duration 120 --interval 1 > /tmp/cascade_test.log 2>&1 &
WORKLOAD_PID=$!

sleep 15
echo "[$(date +%H:%M:%S)] Stopping standby-async..."
docker compose -f docker-compose.ha-cluster.yml stop standby-async

sleep 15
echo "[$(date +%H:%M:%S)] Stopping standby-sync..."
docker compose -f docker-compose.ha-cluster.yml stop standby-sync

sleep 15
echo "[$(date +%H:%M:%S)] Stopping primary (total outage)..."
docker compose -f docker-compose.ha-cluster.yml stop primary

sleep 20
echo "[$(date +%H:%M:%S)] Recovering primary..."
docker compose -f docker-compose.ha-cluster.yml start primary

sleep 10
echo "[$(date +%H:%M:%S)] Recovering standby-sync..."
docker compose -f docker-compose.ha-cluster.yml start standby-sync

sleep 10
echo "[$(date +%H:%M:%S)] Recovering standby-async..."
docker compose -f docker-compose.ha-cluster.yml start standby-async

wait $WORKLOAD_PID
echo ""
tail -20 /tmp/cascade_test.log

Scenario 2: Rolling Restart

Perform rolling restart without downtime:

#!/bin/bash
# Rolling restart - maintain availability during updates

echo "Starting rolling restart..."

# Restart standbys first (one at a time)
echo "[$(date +%H:%M:%S)] Restarting standby-async..."
docker compose -f docker-compose.ha-cluster.yml restart standby-async
sleep 10

echo "[$(date +%H:%M:%S)] Restarting standby-sync..."
docker compose -f docker-compose.ha-cluster.yml restart standby-sync
sleep 10

# Restart primary last (writes will use write timeout)
echo "[$(date +%H:%M:%S)] Restarting primary..."
docker compose -f docker-compose.ha-cluster.yml restart primary
sleep 10

echo "[$(date +%H:%M:%S)] Rolling restart complete"
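The fixed `sleep 10` between restarts is a guess: a node that takes longer to come back leaves the cluster briefly under-replicated. A health-gated variant is sketched below. It assumes the nodes answer HTTP on the ports from the Docker mapping table and that the node HTTP API exposes a `/health` endpoint the way the proxy admin API does; adjust the path if your build differs.

```shell
#!/bin/bash
# Rolling restart, gated on health checks instead of fixed sleeps.
# Assumes each node's HTTP port (Docker mappings) answers /health when ready.

wait_http() {
    # Poll http://localhost:$1/health until it responds, up to $2 seconds
    local port="$1" timeout="${2:-60}" elapsed=0
    until curl -sf -m 2 "http://localhost:${port}/health" > /dev/null 2>&1; do
        sleep 2
        elapsed=$((elapsed + 2))
        if [ "$elapsed" -ge "$timeout" ]; then return 1; fi
    done
}

restart_node() {
    local service="$1" port="$2"
    echo "[$(date +%H:%M:%S)] Restarting $service..."
    docker compose -f docker-compose.ha-cluster.yml restart "$service"
    wait_http "$port" 60 || { echo "$service not healthy after 60s" >&2; return 1; }
}

# Standbys first, primary last; abort the rollout if any node fails its check
if [ -f docker-compose.ha-cluster.yml ]; then
    restart_node standby-async 18084 &&
    restart_node standby-sync  18081 &&
    restart_node primary       18080 &&
    echo "[$(date +%H:%M:%S)] Rolling restart complete"
fi
```

Aborting on the first unhealthy node is the important design choice: it keeps you from taking down a second replica while the first is still recovering.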

Scenario 3: Load Testing with Failover

#!/bin/bash
# High-load failover test

CONCURRENCY=5

echo "Starting $CONCURRENCY concurrent workloads..."

# Start multiple concurrent workloads
for i in $(seq 1 $CONCURRENCY); do
    ./pg_workload.sh --duration 60 --interval 0.5 > /tmp/load_test_$i.log 2>&1 &
    echo "Started workload $i (PID: $!)"
done

sleep 20
echo "Simulating failover..."
docker compose -f docker-compose.ha-cluster.yml stop primary

sleep 35
docker compose -f docker-compose.ha-cluster.yml start primary

# Wait for all workloads
wait

echo ""
echo "Results:"
for i in $(seq 1 $CONCURRENCY); do
    echo "Workload $i:"
    tail -5 /tmp/load_test_$i.log | grep -E "(Success|Failed)"
done
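With several logs to read, it helps to collapse the per-workload summaries into one aggregate. A sketch, assuming each log ends with the `Successful: N` / `Failed: N` summary lines shown in the Part 8 output; adjust the field names if your pg_workload.sh prints a different summary.

```shell
#!/bin/bash
# Aggregate per-workload summaries across /tmp/load_test_*.log.
# Assumes the "Successful: N" / "Failed: N" summary format from Part 8.

sum_field() {
    # Sum the numeric second column of lines whose first field matches $1
    awk -v f="$1" '$1 == f { s += $2 } END { print s + 0 }'
}

TOTAL_OK=$(cat /tmp/load_test_*.log 2>/dev/null | sum_field "Successful:")
TOTAL_FAIL=$(cat /tmp/load_test_*.log 2>/dev/null | sum_field "Failed:")

echo "Aggregate successful: $TOTAL_OK"
echo "Aggregate failed:     $TOTAL_FAIL"
```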

Quick Reference

Port Mappings (Docker)

Service         PG Port   Native Port   HTTP Port   Admin Port
──────────────────────────────────────────────────────────────
Primary         15432     15433         18080       -
Standby-Sync    15442     15443         18081       -
Standby-Async   15462     15463         18084       -
Proxy           15400     -             -           19090

Port Mappings (Local)

Service         PG Port   Native Port   HTTP Port   Admin Port
──────────────────────────────────────────────────────────────
Primary         5432      5433          8080        -
Standby-Sync    5442      5443          8081        -
Standby-Async   5452      5453          8082        -
Proxy           5400      -             -           9090

Common Commands

# Start cluster
docker compose -f docker-compose.ha-cluster.yml up -d

# Stop cluster
docker compose -f docker-compose.ha-cluster.yml down

# View logs
docker compose -f docker-compose.ha-cluster.yml logs -f

# Restart single node
docker compose -f docker-compose.ha-cluster.yml restart primary

# Connect through proxy
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb

# Check proxy health
curl http://localhost:19090/health

Troubleshooting

Issue                  Cause                      Solution
──────────────────────────────────────────────────────────────────────────────────
"No healthy nodes"     All nodes down             Check container status, restart cluster
High latency writes    Primary slow/recovering    Check primary logs, wait for recovery
Replication lag        Network/disk issues        Check standby logs, verify connectivity
Connection refused     Wrong port/service down    Verify port mappings, check service health
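A quick first diagnostic step is to probe every endpoint and see which row of the table you are in. A sketch using the Docker mappings from the Quick Reference; it assumes the node HTTP ports answer `/health` the way the proxy admin port does.

```shell
#!/bin/bash
# Probe each HTTP endpoint (Docker mappings) and report up/down.
# Assumes the node HTTP APIs expose /health like the proxy admin API.

probe() {
    # $1 = label, $2 = port; 2-second cap so dead nodes fail fast
    if curl -sf -m 2 "http://localhost:${2}/health" > /dev/null 2>&1; then
        echo "$1 (port $2): up"
    else
        echo "$1 (port $2): DOWN"
    fi
}

probe "primary HTTP"       18080
probe "standby-sync HTTP"  18081
probe "standby-async HTTP" 18084
probe "proxy admin"        19090
```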

Summary

This tutorial covered:

  1. Docker Deployment - Full HA cluster with proxy
  2. Local Deployment - Multi-instance setup using different ports
  3. TWR - Automatic write routing to primary
  4. TRR - Read load balancing across all nodes
  5. HeliosProxy - Architecture and configuration
  6. Monitoring - Real-time cluster health tracking
  7. Switchover - Planned role changes
  8. Failover - Automatic recovery from failures
  9. Application Continuity - Maintaining operations during disruptions
  10. Advanced Scenarios - Cascading failures, rolling restarts, load testing

Key takeaways:

  - The write timeout ensures writes eventually succeed during brief outages
  - Automatic recovery requires no manual intervention
  - Read routing maintains read availability even when the primary is down
  - A 100% success rate is achievable with proper timeout configuration