HeliosDB-Lite High Availability Interactive Tutorial¶

This hands-on tutorial walks you through HeliosDB-Lite's High Availability features with interactive checkpoints where you can pause, observe, and verify cluster behavior in real-time.

Table of Contents¶

Features & Benefits Overview
Prerequisites
Part 1: Starting the Cluster
Part 2: Understanding Replication Modes
Part 3: Data Replication Verification
Part 4: Transparent Write Routing (TWR)
Part 5: Failover & Recovery
Part 6: Automatic Reconnection
Part 7: Advanced Scenarios
Quick Reference

Features & Benefits Overview¶

What is HeliosDB-Lite HA?¶

HeliosDB-Lite provides enterprise-grade High Availability through WAL (Write-Ahead Log) streaming replication with multiple sync modes, automatic failover detection, and intelligent connection routing.

Key Features¶

Feature	Description	Benefit
WAL Streaming Replication	Continuous log-based replication from primary to standbys	Near-zero data loss, real-time sync
Multiple Sync Modes	Sync, Semi-sync, and Async modes	Balance between durability and performance
Transparent Write Routing (TWR)	Writes auto-route to primary from any node	Simplified application connectivity
Automatic Reconnection	Standbys auto-reconnect with exponential backoff	Self-healing cluster
Split-Brain Protection	Observer nodes and fencing tokens prevent dual-primary	Data consistency guaranteed
HeliosProxy	Intelligent connection router with health monitoring	Zero-downtime failover for applications

Replication Mode Comparison¶

┌─────────────────────────────────────────────────────────────────────────────┐
│                        REPLICATION MODE SPECTRUM                             │
│                                                                              │
│   ◄─────────────────────────────────────────────────────────────────────►   │
│   PERFORMANCE                                                   DURABILITY   │
│                                                                              │
│   ┌─────────────┐      ┌─────────────┐      ┌─────────────┐                │
│   │    ASYNC    │      │  SEMI-SYNC  │      │    SYNC     │                │
│   │             │      │             │      │             │                │
│   │ Fire-and-   │      │ Wait for 1  │      │ Wait for 1+ │                │
│   │ forget      │      │ standby ACK │      │ standby ACK │                │
│   │             │      │ (flush_lsn) │      │ (flush_lsn) │                │
│   │ Fastest     │      │ Balanced    │      │ Safest      │                │
│   │ May lose    │      │ Low risk    │      │ Zero loss*  │                │
│   │ recent data │      │             │      │             │                │
│   └─────────────┘      └─────────────┘      └─────────────┘                │
│                                                                              │
│   *Zero loss when at least one sync standby is available                    │
└─────────────────────────────────────────────────────────────────────────────┘

Architecture¶

                    ┌─────────────────────────────────────┐
                    │           APPLICATION               │
                    └─────────────────┬───────────────────┘
                                      │
                    ┌─────────────────▼───────────────────┐
                    │           HELIOSPROXY               │
                    │  • Automatic primary detection      │
                    │  • Health-based routing             │
                    │  • Write timeout (waits for primary)│
                    └───────┬───────────┬───────────┬────┘
                            │           │           │
              ┌─────────────▼───┐   ┌───▼───────┐  ┌▼─────────────┐
              │     PRIMARY     │   │  STANDBY  │  │   STANDBY    │
              │   (Read/Write)  │   │   SYNC    │  │    ASYNC     │
              │                 │   │           │  │              │
              │   Port: 15432   │   │   15442   │  │    15462     │
              └────────┬────────┘   └─────┬─────┘  └──────┬───────┘
                       │                  │               │
                       └──────────────────┴───────────────┘
                              WAL Streaming Replication

Prerequisites¶

Before starting, ensure you have:

# Docker and Docker Compose
docker --version         # 20.10+
docker compose version   # 2.0+

# psql client (for verification)
psql --version          # Any version

# curl (for API checks)
curl --version

# jq (optional, for JSON formatting)
jq --version

Part 1: Starting the Cluster¶

Step 1.1: Navigate to the Docker Test Directory¶

cd /home/app/HeliosDB-Lite/tests/docker

Step 1.2: Build and Start the Cluster¶

# Build with fresh binaries (recommended)
docker compose -f docker-compose.ha-cluster.yml build --no-cache

# Start all services
docker compose -f docker-compose.ha-cluster.yml up -d

Wait approximately 30-45 seconds for all nodes to start and establish replication.

CHECKPOINT 1: Verify Cluster Startup¶

Open a new terminal and run:

cd /home/app/HeliosDB-Lite/tests/docker
./ha_interactive_tutorial.sh checkpoint1

Or manually verify:

# Check all containers are running
docker compose -f docker-compose.ha-cluster.yml ps

# Expected output (all should show "Up" or "healthy"):
# NAME                        STATUS
# heliosdb-primary            Up (healthy)
# heliosdb-standby-sync       Up (healthy)
# heliosdb-standby-semisync   Up (healthy)
# heliosdb-standby-async      Up (healthy)
# heliosdb-observer           Up
# heliosdb-proxy              Up (healthy)

# Verify each node is accessible
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "SELECT 'Primary OK' as status"
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "SELECT 'Standby-Sync OK' as status"
PGPASSWORD=helios psql -h localhost -p 15452 -U helios -d heliosdb -c "SELECT 'Standby-Semi OK' as status"
PGPASSWORD=helios psql -h localhost -p 15462 -U helios -d heliosdb -c "SELECT 'Standby-Async OK' as status"

What you should see: - All 6 containers running - Each node responds to queries - Primary accepts reads and writes, standbys are read-only

Take a moment to: 1. Run the monitor script: ./monitor_cluster.sh 2. Observe all nodes showing "UP" 3. Note the latency values for each node

Press Enter to continue when ready...

Part 2: Understanding Replication Modes¶

The Cluster Configuration¶

Our test cluster has 4 database nodes with different sync modes:

Node	Port	Sync Mode	Durability	Performance
Primary	15432	N/A	Source	Fastest
Standby-Sync	15442	`sync`	Guaranteed	Slower writes
Standby-Semi	15452	`semi-sync`	High	Balanced
Standby-Async	15462	`async`	Eventual	Fast writes

Step 2.1: Check Replication Status on Primary¶

# Connect to primary and check standbys
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT * FROM pg_replication_standbys;
"

Expected output:

 node_id          | sync_mode | state     | flush_lsn | apply_lsn | lag_ms
------------------+-----------+-----------+-----------+-----------+--------
 standby-sync     | sync      | streaming | 1000      | 1000      | 0
 standby-semisync | semi-sync | streaming | 1000      | 1000      | 5
 standby-async    | async     | streaming | 995       | 990       | 15

Step 2.2: Check Primary Status from a Standby¶

# Connect to standby-sync and check primary connection
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "
SELECT * FROM pg_replication_primary;
"

CHECKPOINT 2: Verify Replication Status¶

Run the verification script:

./ha_interactive_tutorial.sh checkpoint2

Or manually verify:

# Check primary's view of standbys
echo "=== PRIMARY'S VIEW OF STANDBYS ==="
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT
    node_id,
    sync_mode,
    state,
    flush_lsn,
    apply_lsn,
    lag_bytes,
    lag_ms
FROM pg_replication_standbys;
"

# Check each standby's view of primary
for port in 15442 15452 15462; do
    echo ""
    echo "=== STANDBY ON PORT $port ==="
    PGPASSWORD=helios psql -h localhost -p $port -U helios -d heliosdb -c "
    SELECT
        node_id as primary_id,
        state,
        primary_lsn,
        local_lsn,
        lag_bytes
    FROM pg_replication_primary;
    "
done

What you should see: - Primary shows 3-4 connected standbys - Each standby shows connection to primary - state should be streaming for all - lag_ms should be low (< 100ms typically)

Take a moment to: 1. Compare flush_lsn and apply_lsn values 2. Note that sync standby has lowest lag 3. Async standby may show slightly higher lag

Press Enter to continue when ready...

Part 3: Data Replication Verification¶

Step 3.1: Create a Test Table on Primary¶

PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
CREATE TABLE IF NOT EXISTS replication_test (
    id SERIAL PRIMARY KEY,
    message TEXT,
    sync_mode TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
"

Step 3.2: Insert Data and Watch it Replicate¶

# Insert a row
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('Hello from primary!', 'sync mode test');
"

CHECKPOINT 3: Verify Data Replication¶

Run the verification script:

./ha_interactive_tutorial.sh checkpoint3

Or manually verify:

echo "=== DATA ON PRIMARY (port 15432) ==="
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT * FROM replication_test ORDER BY id;
"

echo ""
echo "=== DATA ON STANDBY-SYNC (port 15442) ==="
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "
SELECT * FROM replication_test ORDER BY id;
"

echo ""
echo "=== DATA ON STANDBY-SEMISYNC (port 15452) ==="
PGPASSWORD=helios psql -h localhost -p 15452 -U helios -d heliosdb -c "
SELECT * FROM replication_test ORDER BY id;
"

echo ""
echo "=== DATA ON STANDBY-ASYNC (port 15462) ==="
PGPASSWORD=helios psql -h localhost -p 15462 -U helios -d heliosdb -c "
SELECT * FROM replication_test ORDER BY id;
"

What you should see: - Identical data on ALL nodes (primary + 3 standbys) - Same row count, same values - Timestamps are identical (replicated, not re-generated)

Take a moment to: 1. Insert more rows and repeat the check 2. Note how fast replication occurs (typically < 100ms) 3. Try inserting from different connections simultaneously

Press Enter to continue when ready...

Part 4: Transparent Write Routing (TWR)¶

TWR allows you to connect to ANY node (including standbys) and have writes automatically routed to the primary.

Step 4.1: Connect to a Standby and Try to Write¶

Without TWR, writing to a standby would fail. With TWR enabled on sync/semi-sync standbys, writes are transparently forwarded.

# Connect to STANDBY-SYNC and insert data
# This should work because TWR forwards to primary!
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('Written via standby-sync with TWR!', 'TWR test');
"

Step 4.2: Verify the Write Went to Primary¶

# Check data on primary
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT id, message, sync_mode FROM replication_test WHERE message LIKE '%TWR%';
"

CHECKPOINT 4: Verify TWR Functionality¶

Run the verification script:

./ha_interactive_tutorial.sh checkpoint4

Or manually verify:

echo "=== TWR TEST: Writing to each node ==="

# Write via standby-sync
echo "1. Writing via standby-sync (port 15442)..."
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('TWR via standby-sync', 'sync');
" 2>&1

# Write via standby-semisync
echo "2. Writing via standby-semisync (port 15452)..."
PGPASSWORD=helios psql -h localhost -p 15452 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('TWR via standby-semisync', 'semi-sync');
" 2>&1

# Write via standby-async (should fail - async doesn't support TWR)
echo "3. Writing via standby-async (port 15462)..."
PGPASSWORD=helios psql -h localhost -p 15462 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('TWR via standby-async', 'async');
" 2>&1

echo ""
echo "=== FINAL DATA CHECK ==="
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT id, message, sync_mode FROM replication_test ORDER BY id;
"

What you should see: - Writes via sync and semi-sync standbys: SUCCESS (TWR forwarded to primary) - Write via async standby: REJECTED (async mode doesn't support TWR) - All successful writes appear on primary and replicate to all standbys

Take a moment to: 1. Notice the latency difference when writing via standby vs direct to primary 2. Understand why async doesn't support TWR (data consistency) 3. Consider use cases for TWR (simplified application configuration)

Press Enter to continue when ready...

Part 5: Failover & Recovery¶

This section demonstrates what happens when the primary becomes unavailable.

Step 5.1: Start a Background Workload¶

In one terminal, start a continuous workload:

./pg_workload.sh --duration 120 --interval 2 | tee /tmp/failover_test.log

Step 5.2: Simulate Primary Failure¶

In another terminal, stop the primary:

# Stop the primary (simulate crash)
docker compose -f docker-compose.ha-cluster.yml stop primary

Step 5.3: Observe Workload Behavior¶

Watch the workload terminal. You should see: - Writes pause briefly (write timeout kicks in) - Reads continue (routed to standbys) - After ~30 seconds, write operations timeout if primary doesn't return

CHECKPOINT 5: Verify Failover Behavior¶

Run the verification script:

./ha_interactive_tutorial.sh checkpoint5

Or manually verify (while primary is stopped):

echo "=== CLUSTER STATUS WITH PRIMARY DOWN ==="
echo ""

echo "1. Primary (port 15432) - should fail:"
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "SELECT 1" 2>&1 | head -3

echo ""
echo "2. Standby-sync (port 15442) - should work (read-only):"
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "SELECT COUNT(*) as row_count FROM replication_test" 2>&1

echo ""
echo "3. Via Proxy (port 15400) - READ should work:"
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -c "SELECT COUNT(*) as row_count FROM replication_test" 2>&1

echo ""
echo "4. Via Proxy (port 15400) - WRITE will timeout:"
echo "   (This will wait up to 30s for primary to return)"
timeout 35 bash -c 'PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -c "INSERT INTO replication_test (message) VALUES ('\''failover test'\'')"' 2>&1 || echo "   Write timed out (expected)"

What you should see: - Primary connection: REFUSED (container stopped) - Standby connections: WORKING (read-only) - Proxy reads: WORKING (routed to standbys) - Proxy writes: TIMEOUT after ~30 seconds (waiting for primary)

Take a moment to: 1. Check the workload log for timing patterns 2. Notice reads continue while writes pause 3. Understand the write timeout behavior

Press Enter to continue when ready...

Part 6: Automatic Reconnection¶

Step 6.1: Restart the Primary¶

# Restart the primary
docker compose -f docker-compose.ha-cluster.yml start primary

Step 6.2: Watch Standbys Reconnect¶

Monitor the standby logs to see reconnection:

# Follow standby-sync logs
docker logs -f heliosdb-standby-sync 2>&1 | grep -E "(connect|reconnect|streaming|primary)"

You should see messages like:

INFO Attempting reconnection to primary (attempt 1)...
INFO Connected to primary, starting streaming...
INFO Streaming replication established

CHECKPOINT 6: Verify Recovery and Reconnection¶

Run the verification script:

./ha_interactive_tutorial.sh checkpoint6

Or manually verify (after primary restart):

echo "=== WAITING FOR PRIMARY TO START ==="
sleep 10

echo ""
echo "1. Primary (port 15432) - should be back:"
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "SELECT 'Primary RECOVERED' as status" 2>&1

echo ""
echo "2. Replication status from primary:"
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT
    node_id,
    state,
    flush_lsn,
    lag_ms
FROM pg_replication_standbys;
"

echo ""
echo "3. Test write capability restored:"
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('Post-recovery write', 'recovery test')
RETURNING id, message;
"

echo ""
echo "4. Verify replication still works (check standby-sync):"
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "
SELECT id, message FROM replication_test WHERE message = 'Post-recovery write';
"

What you should see: - Primary: RECOVERED and accepting connections - All standbys: state = streaming (reconnected) - Writes working: New data inserted successfully - Replication working: Data appears on standbys

Take a moment to: 1. Check reconnection attempts in standby logs 2. Note the exponential backoff (5s → 10s → 20s → ...) 3. Observe how quickly full functionality is restored

Press Enter to continue when ready...

Part 7: Advanced Scenarios¶

Scenario 7.1: Batch Durability Test¶

Test that rapid commits in sync mode are preserved after failover:

# Insert 100 rows rapidly
for i in $(seq 1 100); do
    PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c \
        "INSERT INTO replication_test (message) VALUES ('Batch row $i')" > /dev/null
done
echo "Inserted 100 rows"

# Check count on primary
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c \
    "SELECT COUNT(*) FROM replication_test WHERE message LIKE 'Batch row%'"

# Stop primary
docker compose -f docker-compose.ha-cluster.yml stop primary

# Check count on standby-sync (should match!)
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c \
    "SELECT COUNT(*) FROM replication_test WHERE message LIKE 'Batch row%'"

# Restart primary
docker compose -f docker-compose.ha-cluster.yml start primary

Expected: All 100 rows are preserved on the sync standby.

Scenario 7.2: Read Scaling Test¶

Observe how reads are distributed across nodes through the proxy:

# Run 10 reads through proxy and check routing
for i in $(seq 1 10); do
    PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -c \
        "SELECT '$i' as query_num, current_timestamp" 2>/dev/null
done

# Check proxy logs for routing decisions
docker logs heliosdb-proxy 2>&1 | tail -20 | grep -E "(routing|selected|backend)"

Scenario 7.3: Connection Through Proxy During Failover¶

# Start continuous reads through proxy in background
(for i in $(seq 1 60); do
    PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -t -c \
        "SELECT COUNT(*) FROM replication_test" 2>&1 | tr -d ' '
    sleep 1
done) &
READER_PID=$!

# Wait, then stop primary
sleep 10
docker compose -f docker-compose.ha-cluster.yml stop primary

# Wait during outage
sleep 20

# Restart primary
docker compose -f docker-compose.ha-cluster.yml start primary

# Wait for reads to complete
wait $READER_PID

# All reads should have succeeded (routed to standbys)
echo "Reader completed - check output above for continuous counts"

CHECKPOINT 7: Final Verification¶

Run the final verification:

./ha_interactive_tutorial.sh checkpoint7

Or manually verify:

echo "╔══════════════════════════════════════════════════════════════╗"
echo "║           FINAL CLUSTER VERIFICATION                         ║"
echo "╚══════════════════════════════════════════════════════════════╝"
echo ""

echo "1. Cluster Node Status:"
echo "   ─────────────────────────────────────────"
for port in 15432 15442 15452 15462; do
    result=$(PGPASSWORD=helios psql -h localhost -p $port -U helios -d heliosdb -t -c "SELECT 1" 2>&1)
    if [[ "$result" == *"1"* ]]; then
        echo "   Port $port: UP"
    else
        echo "   Port $port: DOWN"
    fi
done

echo ""
echo "2. Proxy Status:"
curl -s http://localhost:19090/health 2>/dev/null && echo "   Proxy: HEALTHY" || echo "   Proxy: UNHEALTHY"

echo ""
echo "3. Replication Status (from primary):"
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT
    node_id,
    sync_mode,
    state,
    lag_ms
FROM pg_replication_standbys
ORDER BY node_id;
"

echo ""
echo "4. Data Consistency Check:"
PRIMARY_COUNT=$(PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -t -c "SELECT COUNT(*) FROM replication_test" 2>/dev/null | tr -d ' ')
SYNC_COUNT=$(PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -t -c "SELECT COUNT(*) FROM replication_test" 2>/dev/null | tr -d ' ')
echo "   Primary count:      $PRIMARY_COUNT"
echo "   Standby-sync count: $SYNC_COUNT"
if [ "$PRIMARY_COUNT" == "$SYNC_COUNT" ]; then
    echo "   Status: CONSISTENT"
else
    echo "   Status: INCONSISTENT (may need to wait for replication)"
fi

echo ""
echo "╔══════════════════════════════════════════════════════════════╗"
echo "║           TUTORIAL COMPLETE!                                 ║"
echo "╚══════════════════════════════════════════════════════════════╝"

Quick Reference¶

Port Mappings¶

Service	PostgreSQL	Native	HTTP	Admin
Primary	15432	15433	18080	-
Standby-Sync	15442	15443	18081	-
Standby-Semi	15452	15453	18082	-
Standby-Async	15462	15463	18084	-
Observer	-	-	18083	-
Proxy	15400	-	-	19090

Common Commands¶

# Start cluster
docker compose -f docker-compose.ha-cluster.yml up -d

# Stop cluster
docker compose -f docker-compose.ha-cluster.yml down

# View logs (all)
docker compose -f docker-compose.ha-cluster.yml logs -f

# View logs (specific node)
docker logs -f heliosdb-standby-sync

# Restart specific node
docker compose -f docker-compose.ha-cluster.yml restart standby-sync

# Connect via proxy
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb

# Connect directly to primary
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb

# Run monitor
./monitor_cluster.sh

# Run interactive tutorial
./ha_interactive_tutorial.sh

Key System Views¶

View	Purpose
`pg_replication_status`	Node's role and configuration
`pg_replication_standbys`	Connected standbys (primary only)
`pg_replication_primary`	Primary connection info (standby only)
`pg_replication_metrics`	Performance counters

Cleanup¶

# Stop and remove all containers and volumes
docker compose -f docker-compose.ha-cluster.yml down -v

Summary: Key Takeaways¶

Sync Mode Guarantees Zero Data Loss: When writing in sync mode, the transaction is only committed after at least one sync standby acknowledges receipt.
TWR Simplifies Application Design: Applications can connect to any sync/semi-sync node and writes are automatically forwarded to the primary.
Automatic Reconnection Provides Self-Healing: Standbys automatically reconnect to the primary with exponential backoff after connection loss.
The Proxy Provides Seamless Failover: Applications connecting through HeliosProxy experience minimal disruption during primary outages - reads continue immediately, writes wait for primary recovery.
Multiple Sync Modes Offer Flexibility: Choose between performance (async), balance (semi-sync), and durability (sync) based on your application's needs.

Next Steps¶

Review the High Availability Feature Documentation
Explore the HA Hands-On Tutorial for more scenarios
Check the HA Implementation Plan for roadmap details