HeliosDB-Lite High Availability Interactive Tutorial¶
This hands-on tutorial walks you through HeliosDB-Lite's High Availability features with interactive checkpoints where you can pause, observe, and verify cluster behavior in real-time.
Table of Contents¶
- Features & Benefits Overview
- Prerequisites
- Part 1: Starting the Cluster
- Part 2: Understanding Replication Modes
- Part 3: Data Replication Verification
- Part 4: Transparent Write Routing (TWR)
- Part 5: Failover & Recovery
- Part 6: Automatic Reconnection
- Part 7: Advanced Scenarios
- Quick Reference
Features & Benefits Overview¶
What is HeliosDB-Lite HA?¶
HeliosDB-Lite provides enterprise-grade High Availability through WAL (Write-Ahead Log) streaming replication with multiple sync modes, automatic failover detection, and intelligent connection routing.
Key Features¶
| Feature | Description | Benefit |
|---|---|---|
| WAL Streaming Replication | Continuous log-based replication from primary to standbys | Near-zero data loss, real-time sync |
| Multiple Sync Modes | Sync, Semi-sync, and Async modes | Balance between durability and performance |
| Transparent Write Routing (TWR) | Writes auto-route to primary from any node | Simplified application connectivity |
| Automatic Reconnection | Standbys auto-reconnect with exponential backoff | Self-healing cluster |
| Split-Brain Protection | Observer nodes and fencing tokens prevent dual-primary | Data consistency guaranteed |
| HeliosProxy | Intelligent connection router with health monitoring | Zero-downtime failover for applications |
Replication Mode Comparison¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ REPLICATION MODE SPECTRUM │
│ │
│ ◄─────────────────────────────────────────────────────────────────────► │
│ PERFORMANCE DURABILITY │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ ASYNC │ │ SEMI-SYNC │ │ SYNC │ │
│ │ │ │ │ │ │ │
│ │ Fire-and- │ │ Wait for 1 │ │ Wait for 1+ │ │
│ │ forget │ │ standby ACK │ │ standby ACK │ │
│ │ │ │ (flush_lsn) │ │ (flush_lsn) │ │
│ │ Fastest │ │ Balanced │ │ Safest │ │
│ │ May lose │ │ Low risk │ │ Zero loss* │ │
│ │ recent data │ │ │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ *Zero loss when at least one sync standby is available │
└─────────────────────────────────────────────────────────────────────────────┘
Architecture¶
┌─────────────────────────────────────┐
│ APPLICATION │
└─────────────────┬───────────────────┘
│
┌─────────────────▼───────────────────┐
│ HELIOSPROXY │
│ • Automatic primary detection │
│ • Health-based routing │
│ • Write timeout (waits for primary)│
└───────┬───────────┬───────────┬────┘
│ │ │
┌─────────────▼───┐ ┌───▼───────┐ ┌▼─────────────┐
│ PRIMARY │ │ STANDBY │ │ STANDBY │
│ (Read/Write) │ │ SYNC │ │ ASYNC │
│ │ │ │ │ │
│ Port: 15432 │ │ 15442 │ │ 15462 │
└────────┬────────┘ └─────┬─────┘ └──────┬───────┘
│ │ │
└──────────────────┴───────────────┘
WAL Streaming Replication
Prerequisites¶
Before starting, ensure you have:
# Docker and Docker Compose
docker --version # 20.10+
docker compose version # 2.0+
# psql client (for verification)
psql --version # Any version
# curl (for API checks)
curl --version
# jq (optional, for JSON formatting)
jq --version
Part 1: Starting the Cluster¶
Step 1.1: Navigate to the Docker Test Directory¶
Step 1.2: Build and Start the Cluster¶
# Build with fresh binaries (recommended)
docker compose -f docker-compose.ha-cluster.yml build --no-cache
# Start all services
docker compose -f docker-compose.ha-cluster.yml up -d
Wait approximately 30-45 seconds for all nodes to start and establish replication.
CHECKPOINT 1: Verify Cluster Startup¶
Open a new terminal and run:
Or manually verify:
# Check all containers are running
docker compose -f docker-compose.ha-cluster.yml ps
# Expected output (all should show "Up" or "healthy"):
# NAME STATUS
# heliosdb-primary Up (healthy)
# heliosdb-standby-sync Up (healthy)
# heliosdb-standby-semisync Up (healthy)
# heliosdb-standby-async Up (healthy)
# heliosdb-observer Up
# heliosdb-proxy Up (healthy)
# Verify each node is accessible
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "SELECT 'Primary OK' as status"
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "SELECT 'Standby-Sync OK' as status"
PGPASSWORD=helios psql -h localhost -p 15452 -U helios -d heliosdb -c "SELECT 'Standby-Semi OK' as status"
PGPASSWORD=helios psql -h localhost -p 15462 -U helios -d heliosdb -c "SELECT 'Standby-Async OK' as status"
What you should see: - All 6 containers running - Each node responds to queries - Primary accepts reads and writes, standbys are read-only
Take a moment to:
1. Run the monitor script: ./monitor_cluster.sh
2. Observe all nodes showing "UP"
3. Note the latency values for each node
Press Enter to continue when ready...
Part 2: Understanding Replication Modes¶
The Cluster Configuration¶
Our test cluster has 4 database nodes with different sync modes:
| Node | Port | Sync Mode | Durability | Performance |
|---|---|---|---|---|
| Primary | 15432 | N/A | Source | Fastest |
| Standby-Sync | 15442 | sync |
Guaranteed | Slower writes |
| Standby-Semi | 15452 | semi-sync |
High | Balanced |
| Standby-Async | 15462 | async |
Eventual | Fast writes |
Step 2.1: Check Replication Status on Primary¶
# Connect to primary and check standbys
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT * FROM pg_replication_standbys;
"
Expected output:
node_id | sync_mode | state | flush_lsn | apply_lsn | lag_ms
------------------+-----------+-----------+-----------+-----------+--------
standby-sync | sync | streaming | 1000 | 1000 | 0
standby-semisync | semi-sync | streaming | 1000 | 1000 | 5
standby-async | async | streaming | 995 | 990 | 15
Step 2.2: Check Primary Status from a Standby¶
# Connect to standby-sync and check primary connection
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "
SELECT * FROM pg_replication_primary;
"
CHECKPOINT 2: Verify Replication Status¶
Run the verification script:
Or manually verify:
# Check primary's view of standbys
echo "=== PRIMARY'S VIEW OF STANDBYS ==="
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT
node_id,
sync_mode,
state,
flush_lsn,
apply_lsn,
lag_bytes,
lag_ms
FROM pg_replication_standbys;
"
# Check each standby's view of primary
for port in 15442 15452 15462; do
echo ""
echo "=== STANDBY ON PORT $port ==="
PGPASSWORD=helios psql -h localhost -p $port -U helios -d heliosdb -c "
SELECT
node_id as primary_id,
state,
primary_lsn,
local_lsn,
lag_bytes
FROM pg_replication_primary;
"
done
What you should see:
- Primary shows 3-4 connected standbys
- Each standby shows connection to primary
- state should be streaming for all
- lag_ms should be low (< 100ms typically)
Take a moment to: 1. Compare flush_lsn and apply_lsn values 2. Note that sync standby has lowest lag 3. Async standby may show slightly higher lag
Press Enter to continue when ready...
Part 3: Data Replication Verification¶
Step 3.1: Create a Test Table on Primary¶
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
CREATE TABLE IF NOT EXISTS replication_test (
id SERIAL PRIMARY KEY,
message TEXT,
sync_mode TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
"
Step 3.2: Insert Data and Watch it Replicate¶
# Insert a row
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('Hello from primary!', 'sync mode test');
"
CHECKPOINT 3: Verify Data Replication¶
Run the verification script:
Or manually verify:
echo "=== DATA ON PRIMARY (port 15432) ==="
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT * FROM replication_test ORDER BY id;
"
echo ""
echo "=== DATA ON STANDBY-SYNC (port 15442) ==="
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "
SELECT * FROM replication_test ORDER BY id;
"
echo ""
echo "=== DATA ON STANDBY-SEMISYNC (port 15452) ==="
PGPASSWORD=helios psql -h localhost -p 15452 -U helios -d heliosdb -c "
SELECT * FROM replication_test ORDER BY id;
"
echo ""
echo "=== DATA ON STANDBY-ASYNC (port 15462) ==="
PGPASSWORD=helios psql -h localhost -p 15462 -U helios -d heliosdb -c "
SELECT * FROM replication_test ORDER BY id;
"
What you should see: - Identical data on ALL nodes (primary + 3 standbys) - Same row count, same values - Timestamps are identical (replicated, not re-generated)
Take a moment to: 1. Insert more rows and repeat the check 2. Note how fast replication occurs (typically < 100ms) 3. Try inserting from different connections simultaneously
Press Enter to continue when ready...
Part 4: Transparent Write Routing (TWR)¶
TWR allows you to connect to ANY node (including standbys) and have writes automatically routed to the primary.
Step 4.1: Connect to a Standby and Try to Write¶
Without TWR, writing to a standby would fail. With TWR enabled on sync/semi-sync standbys, writes are transparently forwarded.
# Connect to STANDBY-SYNC and insert data
# This should work because TWR forwards to primary!
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('Written via standby-sync with TWR!', 'TWR test');
"
Step 4.2: Verify the Write Went to Primary¶
# Check data on primary
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT id, message, sync_mode FROM replication_test WHERE message LIKE '%TWR%';
"
CHECKPOINT 4: Verify TWR Functionality¶
Run the verification script:
Or manually verify:
echo "=== TWR TEST: Writing to each node ==="
# Write via standby-sync
echo "1. Writing via standby-sync (port 15442)..."
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('TWR via standby-sync', 'sync');
" 2>&1
# Write via standby-semisync
echo "2. Writing via standby-semisync (port 15452)..."
PGPASSWORD=helios psql -h localhost -p 15452 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('TWR via standby-semisync', 'semi-sync');
" 2>&1
# Write via standby-async (should fail - async doesn't support TWR)
echo "3. Writing via standby-async (port 15462)..."
PGPASSWORD=helios psql -h localhost -p 15462 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('TWR via standby-async', 'async');
" 2>&1
echo ""
echo "=== FINAL DATA CHECK ==="
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT id, message, sync_mode FROM replication_test ORDER BY id;
"
What you should see: - Writes via sync and semi-sync standbys: SUCCESS (TWR forwarded to primary) - Write via async standby: REJECTED (async mode doesn't support TWR) - All successful writes appear on primary and replicate to all standbys
Take a moment to: 1. Notice the latency difference when writing via standby vs direct to primary 2. Understand why async doesn't support TWR (data consistency) 3. Consider use cases for TWR (simplified application configuration)
Press Enter to continue when ready...
Part 5: Failover & Recovery¶
This section demonstrates what happens when the primary becomes unavailable.
Step 5.1: Start a Background Workload¶
In one terminal, start a continuous workload:
Step 5.2: Simulate Primary Failure¶
In another terminal, stop the primary:
Step 5.3: Observe Workload Behavior¶
Watch the workload terminal. You should see: - Writes pause briefly (write timeout kicks in) - Reads continue (routed to standbys) - After ~30 seconds, write operations timeout if primary doesn't return
CHECKPOINT 5: Verify Failover Behavior¶
Run the verification script:
Or manually verify (while primary is stopped):
echo "=== CLUSTER STATUS WITH PRIMARY DOWN ==="
echo ""
echo "1. Primary (port 15432) - should fail:"
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "SELECT 1" 2>&1 | head -3
echo ""
echo "2. Standby-sync (port 15442) - should work (read-only):"
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "SELECT COUNT(*) as row_count FROM replication_test" 2>&1
echo ""
echo "3. Via Proxy (port 15400) - READ should work:"
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -c "SELECT COUNT(*) as row_count FROM replication_test" 2>&1
echo ""
echo "4. Via Proxy (port 15400) - WRITE will timeout:"
echo " (This will wait up to 30s for primary to return)"
timeout 35 bash -c 'PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -c "INSERT INTO replication_test (message) VALUES ('\''failover test'\'')"' 2>&1 || echo " Write timed out (expected)"
What you should see: - Primary connection: REFUSED (container stopped) - Standby connections: WORKING (read-only) - Proxy reads: WORKING (routed to standbys) - Proxy writes: TIMEOUT after ~30 seconds (waiting for primary)
Take a moment to: 1. Check the workload log for timing patterns 2. Notice reads continue while writes pause 3. Understand the write timeout behavior
Press Enter to continue when ready...
Part 6: Automatic Reconnection¶
Step 6.1: Restart the Primary¶
Step 6.2: Watch Standbys Reconnect¶
Monitor the standby logs to see reconnection:
# Follow standby-sync logs
docker logs -f heliosdb-standby-sync 2>&1 | grep -E "(connect|reconnect|streaming|primary)"
You should see messages like:
INFO Attempting reconnection to primary (attempt 1)...
INFO Connected to primary, starting streaming...
INFO Streaming replication established
CHECKPOINT 6: Verify Recovery and Reconnection¶
Run the verification script:
Or manually verify (after primary restart):
echo "=== WAITING FOR PRIMARY TO START ==="
sleep 10
echo ""
echo "1. Primary (port 15432) - should be back:"
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "SELECT 'Primary RECOVERED' as status" 2>&1
echo ""
echo "2. Replication status from primary:"
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT
node_id,
state,
flush_lsn,
lag_ms
FROM pg_replication_standbys;
"
echo ""
echo "3. Test write capability restored:"
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
INSERT INTO replication_test (message, sync_mode)
VALUES ('Post-recovery write', 'recovery test')
RETURNING id, message;
"
echo ""
echo "4. Verify replication still works (check standby-sync):"
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c "
SELECT id, message FROM replication_test WHERE message = 'Post-recovery write';
"
What you should see:
- Primary: RECOVERED and accepting connections
- All standbys: state = streaming (reconnected)
- Writes working: New data inserted successfully
- Replication working: Data appears on standbys
Take a moment to: 1. Check reconnection attempts in standby logs 2. Note the exponential backoff (5s → 10s → 20s → ...) 3. Observe how quickly full functionality is restored
Press Enter to continue when ready...
Part 7: Advanced Scenarios¶
Scenario 7.1: Batch Durability Test¶
Test that rapid commits in sync mode are preserved after failover:
# Insert 100 rows rapidly
for i in $(seq 1 100); do
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c \
"INSERT INTO replication_test (message) VALUES ('Batch row $i')" > /dev/null
done
echo "Inserted 100 rows"
# Check count on primary
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c \
"SELECT COUNT(*) FROM replication_test WHERE message LIKE 'Batch row%'"
# Stop primary
docker compose -f docker-compose.ha-cluster.yml stop primary
# Check count on standby-sync (should match!)
PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -c \
"SELECT COUNT(*) FROM replication_test WHERE message LIKE 'Batch row%'"
# Restart primary
docker compose -f docker-compose.ha-cluster.yml start primary
Expected: All 100 rows are preserved on the sync standby.
Scenario 7.2: Read Scaling Test¶
Observe how reads are distributed across nodes through the proxy:
# Run 10 reads through proxy and check routing
for i in $(seq 1 10); do
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -c \
"SELECT '$i' as query_num, current_timestamp" 2>/dev/null
done
# Check proxy logs for routing decisions
docker logs heliosdb-proxy 2>&1 | tail -20 | grep -E "(routing|selected|backend)"
Scenario 7.3: Connection Through Proxy During Failover¶
# Start continuous reads through proxy in background
(for i in $(seq 1 60); do
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb -t -c \
"SELECT COUNT(*) FROM replication_test" 2>&1 | tr -d ' '
sleep 1
done) &
READER_PID=$!
# Wait, then stop primary
sleep 10
docker compose -f docker-compose.ha-cluster.yml stop primary
# Wait during outage
sleep 20
# Restart primary
docker compose -f docker-compose.ha-cluster.yml start primary
# Wait for reads to complete
wait $READER_PID
# All reads should have succeeded (routed to standbys)
echo "Reader completed - check output above for continuous counts"
CHECKPOINT 7: Final Verification¶
Run the final verification:
Or manually verify:
echo "╔══════════════════════════════════════════════════════════════╗"
echo "║ FINAL CLUSTER VERIFICATION ║"
echo "╚══════════════════════════════════════════════════════════════╝"
echo ""
echo "1. Cluster Node Status:"
echo " ─────────────────────────────────────────"
for port in 15432 15442 15452 15462; do
result=$(PGPASSWORD=helios psql -h localhost -p $port -U helios -d heliosdb -t -c "SELECT 1" 2>&1)
if [[ "$result" == *"1"* ]]; then
echo " Port $port: UP"
else
echo " Port $port: DOWN"
fi
done
echo ""
echo "2. Proxy Status:"
curl -s http://localhost:19090/health 2>/dev/null && echo " Proxy: HEALTHY" || echo " Proxy: UNHEALTHY"
echo ""
echo "3. Replication Status (from primary):"
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -c "
SELECT
node_id,
sync_mode,
state,
lag_ms
FROM pg_replication_standbys
ORDER BY node_id;
"
echo ""
echo "4. Data Consistency Check:"
PRIMARY_COUNT=$(PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb -t -c "SELECT COUNT(*) FROM replication_test" 2>/dev/null | tr -d ' ')
SYNC_COUNT=$(PGPASSWORD=helios psql -h localhost -p 15442 -U helios -d heliosdb -t -c "SELECT COUNT(*) FROM replication_test" 2>/dev/null | tr -d ' ')
echo " Primary count: $PRIMARY_COUNT"
echo " Standby-sync count: $SYNC_COUNT"
if [ "$PRIMARY_COUNT" == "$SYNC_COUNT" ]; then
echo " Status: CONSISTENT"
else
echo " Status: INCONSISTENT (may need to wait for replication)"
fi
echo ""
echo "╔══════════════════════════════════════════════════════════════╗"
echo "║ TUTORIAL COMPLETE! ║"
echo "╚══════════════════════════════════════════════════════════════╝"
Quick Reference¶
Port Mappings¶
| Service | PostgreSQL | Native | HTTP | Admin |
|---|---|---|---|---|
| Primary | 15432 | 15433 | 18080 | - |
| Standby-Sync | 15442 | 15443 | 18081 | - |
| Standby-Semi | 15452 | 15453 | 18082 | - |
| Standby-Async | 15462 | 15463 | 18084 | - |
| Observer | - | - | 18083 | - |
| Proxy | 15400 | - | - | 19090 |
Common Commands¶
# Start cluster
docker compose -f docker-compose.ha-cluster.yml up -d
# Stop cluster
docker compose -f docker-compose.ha-cluster.yml down
# View logs (all)
docker compose -f docker-compose.ha-cluster.yml logs -f
# View logs (specific node)
docker logs -f heliosdb-standby-sync
# Restart specific node
docker compose -f docker-compose.ha-cluster.yml restart standby-sync
# Connect via proxy
PGPASSWORD=helios psql -h localhost -p 15400 -U helios -d heliosdb
# Connect directly to primary
PGPASSWORD=helios psql -h localhost -p 15432 -U helios -d heliosdb
# Run monitor
./monitor_cluster.sh
# Run interactive tutorial
./ha_interactive_tutorial.sh
Key System Views¶
| View | Purpose |
|---|---|
pg_replication_status |
Node's role and configuration |
pg_replication_standbys |
Connected standbys (primary only) |
pg_replication_primary |
Primary connection info (standby only) |
pg_replication_metrics |
Performance counters |
Cleanup¶
# Stop and remove all containers and volumes
docker compose -f docker-compose.ha-cluster.yml down -v
Summary: Key Takeaways¶
-
Sync Mode Guarantees Zero Data Loss: When writing in sync mode, the transaction is only committed after at least one sync standby acknowledges receipt.
-
TWR Simplifies Application Design: Applications can connect to any sync/semi-sync node and writes are automatically forwarded to the primary.
-
Automatic Reconnection Provides Self-Healing: Standbys automatically reconnect to the primary with exponential backoff after connection loss.
-
The Proxy Provides Seamless Failover: Applications connecting through HeliosProxy experience minimal disruption during primary outages - reads continue immediately, writes wait for primary recovery.
-
Multiple Sync Modes Offer Flexibility: Choose between performance (async), balance (semi-sync), and durability (sync) based on your application's needs.
Next Steps¶
- Review the High Availability Feature Documentation
- Explore the HA Hands-On Tutorial for more scenarios
- Check the HA Implementation Plan for roadmap details