Skip to content

HeliosDB-Lite High Availability Interactive Tutorial

This hands-on tutorial walks you through HeliosDB-Lite's High Availability features with interactive checkpoints where you can pause, observe, and verify cluster behavior in real-time.


Table of Contents

  1. Features & Benefits Overview
  2. Prerequisites
  3. Part 1: Starting the Cluster
  4. Part 2: Understanding Replication Modes
  5. Part 3: Data Replication Verification
  6. Part 4: Transparent Write Routing (TWR)
  7. Part 5: Failover & Recovery
  8. Part 6: Automatic Reconnection
  9. Part 7: Advanced Scenarios
  10. Quick Reference

Features & Benefits Overview

What is HeliosDB-Lite HA?

HeliosDB-Lite provides enterprise-grade High Availability through WAL (Write-Ahead Log) streaming replication with multiple sync modes, automatic failover detection, and intelligent connection routing.

Key Features

Feature Description Benefit
WAL Streaming Replication Continuous log-based replication from primary to standbys Near-zero data loss, real-time sync
Multiple Sync Modes Sync, Semi-sync, and Async modes Balance between durability and performance
Transparent Write Routing (TWR) Writes auto-route to primary from any node Simplified application connectivity
Automatic Reconnection Standbys auto-reconnect with exponential backoff Self-healing cluster
Split-Brain Protection Observer nodes and fencing tokens prevent dual-primary Data consistency guaranteed
HeliosProxy Intelligent connection router with health monitoring Zero-downtime failover for applications

Replication Mode Comparison

┌─────────────────────────────────────────────────────────────────────────────┐
│                        REPLICATION MODE SPECTRUM                             │
│                                                                              │
│   ◄─────────────────────────────────────────────────────────────────────►   │
│   PERFORMANCE                                                   DURABILITY   │
│                                                                              │
│   ┌─────────────┐      ┌─────────────┐      ┌─────────────┐                │
│   │    ASYNC    │      │  SEMI-SYNC  │      │    SYNC     │                │
│   │             │      │             │      │             │                │
│   │ Fire-and-   │      │ Wait for 1  │      │ Wait for 1+ │                │
│   │ forget      │      │ standby ACK │      │ standby ACK │                │
│   │             │      │ (flush_lsn) │      │ (flush_lsn) │                │
│   │ Fastest     │      │ Balanced    │      │ Safest      │                │
│   │ May lose    │      │ Low risk    │      │ Zero loss*  │                │
│   │ recent data │      │             │      │             │                │
│   └─────────────┘      └─────────────┘      └─────────────┘                │
│                                                                              │
│   *Zero loss when at least one sync standby is available                    │
└─────────────────────────────────────────────────────────────────────────────┘

Architecture

                    ┌─────────────────────────────────────┐
                    │           APPLICATION               │
                    └─────────────────┬───────────────────┘
                    ┌─────────────────▼───────────────────┐
                    │           HELIOSPROXY               │
                    │  • Automatic primary detection      │
                    │  • Health-based routing             │
                    │  • Write timeout (waits for primary)│
                    └───────┬───────────┬───────────┬────┘
                            │           │           │
              ┌─────────────▼───┐   ┌───▼───────┐  ┌▼─────────────┐
              │     PRIMARY     │   │  STANDBY  │  │   STANDBY    │
              │   (Read/Write)  │   │   SYNC    │  │    ASYNC     │
              │                 │   │           │  │              │
              │   Port: 15432   │   │   15442   │  │    15462     │
              └────────┬────────┘   └─────┬─────┘  └──────┬───────┘
                       │                  │               │
                       └──────────────────┴───────────────┘
                              WAL Streaming Replication

Prerequisites

Before starting, ensure you have:

# Docker and Docker Compose
docker --version         # 20.10+
docker compose version   # 2.0+

# psql client (for verification)
psql --version          # Any version

# curl (for API checks)
curl --version

# jq (optional, for JSON formatting)
jq --version

Part 1: Starting the Cluster

Step 1.1: Navigate to the Docker Test Directory

cd /home/app/HeliosDB-Lite/tests/docker

Step 1.2: Build and Start the Cluster

# Build with fresh binaries (recommended)
docker compose -f docker-compose.ha-cluster.yml build --no-cache

# Start all services
docker compose -f docker-compose.ha-cluster.yml up -d

Wait approximately 30-45 seconds for all nodes to start and establish replication.


CHECKPOINT 1: Verify Cluster Startup

Open a new terminal and run:

cd /home/app/HeliosDB-Lite/tests/docker
./ha_interactive_tutorial.sh checkpoint1

Or manually verify:

# Check all containers are running
docker compose -f docker-compose.ha-cluster.yml ps

# Expected output (all should show "Up" or "healthy"):
# NAME                        STATUS
# heliosdb-primary            Up (healthy)
# heliosdb-standby-sync       Up (healthy)
# heliosdb-standby-semisync   Up (healthy)
# heliosdb-standby-async      Up (healthy)
# heliosdb-observer           Up
# heliosdb-proxy              Up (healthy)
# Verify each node is accessible
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "SELECT 'Primary OK' as status"
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "SELECT 'Standby-Sync OK' as status"
PGPASSWORD=helios psql -h localhost -p 15452 -U helios -d heliosdb -c "SELECT 'Standby-Semi OK' as status"
PGPASSWORD=helios psql -h localhost -p 15462 -U helios -d heliosdb -c "SELECT 'Standby-Async OK' as status"

What you should see: - All 6 containers running - Each node responds to queries - Primary accepts reads and writes, standbys are read-only

Take a moment to: 1. Run the monitor script: ./monitor_cluster.sh 2. Observe all nodes showing "UP" 3. Note the latency values for each node

Press Enter to continue when ready...


Part 2: Understanding Replication Modes

The Cluster Configuration

Our test cluster has 4 database nodes with different sync modes:

Node Port Sync Mode Durability Performance
Primary 15432 N/A Source Fastest
Standby-Sync 15442 sync Guaranteed Slower writes
Standby-Semi 15452 semi-sync High Balanced
Standby-Async 15462 async Eventual Fast writes

Step 2.1: Check Replication Status on Primary

# Connect to primary and check standbys
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT * FROM pg_replication_standbys;
"

Expected output:

 node_id          | sync_mode | state     | flush_lsn | apply_lsn | lag_ms
------------------+-----------+-----------+-----------+-----------+--------
 standby-sync     | sync      | streaming | 1000      | 1000      | 0
 standby-semisync | semi-sync | streaming | 1000      | 1000      | 5
 standby-async    | async     | streaming | 995       | 990       | 15

Step 2.2: Check Primary Status from a Standby

# Connect to standby-sync and check primary connection
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "
SELECT * FROM pg_replication_primary;
"

CHECKPOINT 2: Verify Replication Status

Run the verification script:

./ha_interactive_tutorial.sh checkpoint2

Or manually verify:

# Check primary's view of standbys
echo "=== PRIMARY'S VIEW OF STANDBYS ==="
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT
    node_id,
    sync_mode,
    state,
    flush_lsn,
    apply_lsn,
    lag_bytes,
    lag_ms
FROM pg_replication_standbys;
"

# Check each standby's view of primary
for port in 15442 15452 15462; do
    echo ""
    echo "=== STANDBY ON PORT $port ==="
    PGPASSWORD=helios psql -h localhost -p $port -U helios -d heliosdb -c "
    SELECT
        node_id as primary_id,
        state,
        primary_lsn,
        local_lsn,
        lag_bytes
    FROM pg_replication_primary;
    "
done

What you should see: - Primary shows 3-4 connected standbys - Each standby shows connection to primary - state should be streaming for all - lag_ms should be low (< 100ms typically)

Take a moment to: 1. Compare flush_lsn and apply_lsn values 2. Note that sync standby has lowest lag 3. Async standby may show slightly higher lag

Press Enter to continue when ready...


Part 3: Data Replication Verification

Step 3.1: Create a Test Table on Primary

PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
CREATE TABLE IF NOT EXISTS replication_test (
    id SERIAL PRIMARY KEY,
    message TEXT,
    sync_mode TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
"

Step 3.2: Insert Data and Watch it Replicate

# Insert a row
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('Hello from primary!', 'sync mode test');
"

CHECKPOINT 3: Verify Data Replication

Run the verification script:

./ha_interactive_tutorial.sh checkpoint3

Or manually verify:

echo "=== DATA ON PRIMARY (port 15432) ==="
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT * FROM replication_test ORDER BY id;
"

echo ""
echo "=== DATA ON STANDBY-SYNC (port 15442) ==="
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "
SELECT * FROM replication_test ORDER BY id;
"

echo ""
echo "=== DATA ON STANDBY-SEMISYNC (port 15452) ==="
PGPASSWORD=helios psql -h localhost -p 15452 -U helios -d heliosdb -c "
SELECT * FROM replication_test ORDER BY id;
"

echo ""
echo "=== DATA ON STANDBY-ASYNC (port 15462) ==="
PGPASSWORD=helios psql -h localhost -p 15462 -U helios -d heliosdb -c "
SELECT * FROM replication_test ORDER BY id;
"

What you should see: - Identical data on ALL nodes (primary + 3 standbys) - Same row count, same values - Timestamps are identical (replicated, not re-generated)

Take a moment to: 1. Insert more rows and repeat the check 2. Note how fast replication occurs (typically < 100ms) 3. Try inserting from different connections simultaneously

Press Enter to continue when ready...


Part 4: Transparent Write Routing (TWR)

TWR allows you to connect to ANY node (including standbys) and have writes automatically routed to the primary.

Step 4.1: Connect to a Standby and Try to Write

Without TWR, writing to a standby would fail. With TWR enabled on sync/semi-sync standbys, writes are transparently forwarded.

# Connect to STANDBY-SYNC and insert data
# This should work because TWR forwards to primary!
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('Written via standby-sync with TWR!', 'TWR test');
"

Step 4.2: Verify the Write Went to Primary

# Check data on primary
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT id, message, sync_mode FROM replication_test WHERE message LIKE '%TWR%';
"

CHECKPOINT 4: Verify TWR Functionality

Run the verification script:

./ha_interactive_tutorial.sh checkpoint4

Or manually verify:

echo "=== TWR TEST: Writing to each node ==="

# Write via standby-sync
echo "1. Writing via standby-sync (port 15442)..."
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('TWR via standby-sync', 'sync');
" 2>&1

# Write via standby-semisync
echo "2. Writing via standby-semisync (port 15452)..."
PGPASSWORD=helios psql -h localhost -p 15452 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('TWR via standby-semisync', 'semi-sync');
" 2>&1

# Write via standby-async (should fail - async doesn't support TWR)
echo "3. Writing via standby-async (port 15462)..."
PGPASSWORD=helios psql -h localhost -p 15462 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('TWR via standby-async', 'async');
" 2>&1

echo ""
echo "=== FINAL DATA CHECK ==="
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT id, message, sync_mode FROM replication_test ORDER BY id;
"

What you should see: - Writes via sync and semi-sync standbys: SUCCESS (TWR forwarded to primary) - Write via async standby: REJECTED (async mode doesn't support TWR) - All successful writes appear on primary and replicate to all standbys

Take a moment to: 1. Notice the latency difference when writing via standby vs direct to primary 2. Understand why async doesn't support TWR (data consistency) 3. Consider use cases for TWR (simplified application configuration)

Press Enter to continue when ready...


Part 5: Failover & Recovery

This section demonstrates what happens when the primary becomes unavailable.

Step 5.1: Start a Background Workload

In one terminal, start a continuous workload:

./pg_workload.sh --duration 120 --interval 2 | tee /tmp/failover_test.log

Step 5.2: Simulate Primary Failure

In another terminal, stop the primary:

# Stop the primary (simulate crash)
docker compose -f docker-compose.ha-cluster.yml stop primary

Step 5.3: Observe Workload Behavior

Watch the workload terminal. You should see: - Writes pause briefly (write timeout kicks in) - Reads continue (routed to standbys) - After ~30 seconds, write operations timeout if primary doesn't return


CHECKPOINT 5: Verify Failover Behavior

Run the verification script:

./ha_interactive_tutorial.sh checkpoint5

Or manually verify (while primary is stopped):

echo "=== CLUSTER STATUS WITH PRIMARY DOWN ==="
echo ""

echo "1. Primary (port 15432) - should fail:"
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "SELECT 1" 2>&1 | head -3

echo ""
echo "2. Standby-sync (port 15442) - should work (read-only):"
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "SELECT COUNT(*) as row_count FROM replication_test" 2>&1

echo ""
echo "3. Via Proxy (port 15400) - READ should work:"
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -c "SELECT COUNT(*) as row_count FROM replication_test" 2>&1

echo ""
echo "4. Via Proxy (port 15400) - WRITE will timeout:"
echo "   (This will wait up to 30s for primary to return)"
timeout 35 bash -c 'PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -c "INSERT INTO replication_test (message) VALUES ('\''failover test'\'')"' 2>&1 || echo "   Write timed out (expected)"

What you should see: - Primary connection: REFUSED (container stopped) - Standby connections: WORKING (read-only) - Proxy reads: WORKING (routed to standbys) - Proxy writes: TIMEOUT after ~30 seconds (waiting for primary)

Take a moment to: 1. Check the workload log for timing patterns 2. Notice reads continue while writes pause 3. Understand the write timeout behavior

Press Enter to continue when ready...


Part 6: Automatic Reconnection

Step 6.1: Restart the Primary

# Restart the primary
docker compose -f docker-compose.ha-cluster.yml start primary

Step 6.2: Watch Standbys Reconnect

Monitor the standby logs to see reconnection:

# Follow standby-sync logs
docker logs -f heliosdb-standby-sync 2>&1 | grep -E "(connect|reconnect|streaming|primary)"

You should see messages like:

INFO Attempting reconnection to primary (attempt 1)...
INFO Connected to primary, starting streaming...
INFO Streaming replication established


CHECKPOINT 6: Verify Recovery and Reconnection

Run the verification script:

./ha_interactive_tutorial.sh checkpoint6

Or manually verify (after primary restart):

echo "=== WAITING FOR PRIMARY TO START ==="
sleep 10

echo ""
echo "1. Primary (port 15432) - should be back:"
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "SELECT 'Primary RECOVERED' as status" 2>&1

echo ""
echo "2. Replication status from primary:"
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT
    node_id,
    state,
    flush_lsn,
    lag_ms
FROM pg_replication_standbys;
"

echo ""
echo "3. Test write capability restored:"
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('Post-recovery write', 'recovery test')
RETURNING id, message;
"

echo ""
echo "4. Verify replication still works (check standby-sync):"
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "
SELECT id, message FROM replication_test WHERE message = 'Post-recovery write';
"

What you should see: - Primary: RECOVERED and accepting connections - All standbys: state = streaming (reconnected) - Writes working: New data inserted successfully - Replication working: Data appears on standbys

Take a moment to: 1. Check reconnection attempts in standby logs 2. Note the exponential backoff (5s → 10s → 20s → ...) 3. Observe how quickly full functionality is restored

Press Enter to continue when ready...


Part 7: Advanced Scenarios

Scenario 7.1: Batch Durability Test

Test that rapid commits in sync mode are preserved after failover:

# Insert 100 rows rapidly
for i in $(seq 1 100); do
    PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c \
        "INSERT INTO replication_test (message) VALUES ('Batch row $i')" > /dev/null
done
echo "Inserted 100 rows"

# Check count on primary
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c \
    "SELECT COUNT(*) FROM replication_test WHERE message LIKE 'Batch row%'"

# Stop primary
docker compose -f docker-compose.ha-cluster.yml stop primary

# Check count on standby-sync (should match!)
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c \
    "SELECT COUNT(*) FROM replication_test WHERE message LIKE 'Batch row%'"

# Restart primary
docker compose -f docker-compose.ha-cluster.yml start primary

Expected: All 100 rows are preserved on the sync standby.

Scenario 7.2: Read Scaling Test

Observe how reads are distributed across nodes through the proxy:

# Run 10 reads through proxy and check routing
for i in $(seq 1 10); do
    PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -c \
        "SELECT '$i' as query_num, current_timestamp" 2>/dev/null
done

# Check proxy logs for routing decisions
docker logs heliosdb-proxy 2>&1 | tail -20 | grep -E "(routing|selected|backend)"

Scenario 7.3: Connection Through Proxy During Failover

# Start continuous reads through proxy in background
(for i in $(seq 1 60); do
    PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -t -c \
        "SELECT COUNT(*) FROM replication_test" 2>&1 | tr -d ' '
    sleep 1
done) &
READER_PID=$!

# Wait, then stop primary
sleep 10
docker compose -f docker-compose.ha-cluster.yml stop primary

# Wait during outage
sleep 20

# Restart primary
docker compose -f docker-compose.ha-cluster.yml start primary

# Wait for reads to complete
wait $READER_PID

# All reads should have succeeded (routed to standbys)
echo "Reader completed - check output above for continuous counts"

CHECKPOINT 7: Final Verification

Run the final verification:

./ha_interactive_tutorial.sh checkpoint7

Or manually verify:

echo "╔══════════════════════════════════════════════════════════════╗"
echo "║           FINAL CLUSTER VERIFICATION                         ║"
echo "╚══════════════════════════════════════════════════════════════╝"
echo ""

echo "1. Cluster Node Status:"
echo "   ─────────────────────────────────────────"
for port in 15432 15442 15452 15462; do
    result=$(PGPASSWORD=helios psql -h localhost -p $port -U helios -d heliosdb -t -c "SELECT 1" 2>&1)
    if [[ "$result" == *"1"* ]]; then
        echo "   Port $port: UP"
    else
        echo "   Port $port: DOWN"
    fi
done

echo ""
echo "2. Proxy Status:"
curl -s http://localhost:19090/health 2>/dev/null && echo "   Proxy: HEALTHY" || echo "   Proxy: UNHEALTHY"

echo ""
echo "3. Replication Status (from primary):"
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT
    node_id,
    sync_mode,
    state,
    lag_ms
FROM pg_replication_standbys
ORDER BY node_id;
"

echo ""
echo "4. Data Consistency Check:"
PRIMARY_COUNT=$(PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -t -c "SELECT COUNT(*) FROM replication_test" 2>/dev/null | tr -d ' ')
SYNC_COUNT=$(PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -t -c "SELECT COUNT(*) FROM replication_test" 2>/dev/null | tr -d ' ')
echo "   Primary count:      $PRIMARY_COUNT"
echo "   Standby-sync count: $SYNC_COUNT"
if [ "$PRIMARY_COUNT" == "$SYNC_COUNT" ]; then
    echo "   Status: CONSISTENT"
else
    echo "   Status: INCONSISTENT (may need to wait for replication)"
fi

echo ""
echo "╔══════════════════════════════════════════════════════════════╗"
echo "║           TUTORIAL COMPLETE!                                 ║"
echo "╚══════════════════════════════════════════════════════════════╝"

Quick Reference

Port Mappings

Service PostgreSQL Native HTTP Admin
Primary 15432 15433 18080 -
Standby-Sync 15442 15443 18081 -
Standby-Semi 15452 15453 18082 -
Standby-Async 15462 15463 18084 -
Observer - - 18083 -
Proxy 15400 - - 19090

Common Commands

# Start cluster
docker compose -f docker-compose.ha-cluster.yml up -d

# Stop cluster
docker compose -f docker-compose.ha-cluster.yml down

# View logs (all)
docker compose -f docker-compose.ha-cluster.yml logs -f

# View logs (specific node)
docker logs -f heliosdb-standby-sync

# Restart specific node
docker compose -f docker-compose.ha-cluster.yml restart standby-sync

# Connect via proxy
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb

# Connect directly to primary
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb

# Run monitor
./monitor_cluster.sh

# Run interactive tutorial
./ha_interactive_tutorial.sh

Key System Views

View Purpose
pg_replication_status Node's role and configuration
pg_replication_standbys Connected standbys (primary only)
pg_replication_primary Primary connection info (standby only)
pg_replication_metrics Performance counters

Cleanup

# Stop and remove all containers and volumes
docker compose -f docker-compose.ha-cluster.yml down -v

Summary: Key Takeaways

  1. Sync Mode Guarantees Zero Data Loss: When writing in sync mode, the transaction is only committed after at least one sync standby acknowledges receipt.

  2. TWR Simplifies Application Design: Applications can connect to any sync/semi-sync node and writes are automatically forwarded to the primary.

  3. Automatic Reconnection Provides Self-Healing: Standbys automatically reconnect to the primary with exponential backoff after connection loss.

  4. The Proxy Provides Seamless Failover: Applications connecting through HeliosProxy experience minimal disruption during primary outages - reads continue immediately, writes wait for primary recovery.

  5. Multiple Sync Modes Offer Flexibility: Choose between performance (async), balance (semi-sync), and durability (sync) based on your application's needs.


Next Steps