Compression & Storage Quick Start Guide¶
HeliosDB-Lite v3.6+
This guide explains HeliosDB-Lite's compression architecture and per-column storage optimization features.
Compression Architecture¶
HeliosDB-Lite uses a two-layer compression strategy:
1. RocksDB LZ4 Block Compression (automatic, always-on)
   - Applied to all data blocks
   - 3.7-7x compression depending on data patterns
   - Zero configuration required
2. Per-Column Storage Modes (optional, per-column)
   - Dictionary encoding for low-cardinality strings
   - Content-addressed deduplication for large values
   - Columnar batching for analytics workloads
Automatic Compression (Zero Configuration)¶
All data in HeliosDB-Lite is automatically compressed using RocksDB's LZ4 block compression:
use heliosdb_lite::EmbeddedDatabase;
// Create database - compression is automatic
let db = EmbeddedDatabase::new("./mydb.helio")?;
// Create and populate table
db.execute("CREATE TABLE sensors (id INT, reading FLOAT8, location TEXT)")?;
db.execute("INSERT INTO sensors VALUES (1, 23.5, 'Building A, Floor 3, Room 301')")?;
// Query - decompression is automatic
let results = db.query("SELECT * FROM sensors", &[])?;
That's it! No configuration needed for basic compression.
Compression Ratios (10GB Benchmarks)¶
| Data Pattern | Compression Ratio | Example |
|---|---|---|
| Email addresses | 6.95x | 10.5 GB → 1.5 GB |
| Sorted text (regions) | 6.26x | 10.7 GB → 1.7 GB |
| Sequential integers | 3.96x | 10.7 GB → 2.5 GB |
| Low-cardinality strings | 3.96x | 10.7 GB → 2.7 GB |
| Floating-point prices | 3.70x | 10.7 GB → 2.9 GB |
Per-Column Storage Modes¶
For additional optimization, you can specify storage modes per column:
1. Dictionary Encoding¶
Best for columns with few unique values (< 64K distinct):
CREATE TABLE orders (
    id INT PRIMARY KEY,
    status TEXT STORAGE DICTIONARY,  -- 6 values: pending, shipped, etc.
    country TEXT STORAGE DICTIONARY  -- ~200 country codes
);
Benefits:
- 50-95% storage reduction for low-cardinality strings
- 4-byte IDs instead of full strings in row tuples
- Faster equality comparisons
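The mechanism behind these numbers can be sketched in a few lines. This is an illustrative model of dictionary encoding, not HeliosDB-Lite's internal implementation: each distinct string is assigned a 4-byte ID, and rows store the ID instead of the string.

```rust
use std::collections::HashMap;

/// Assign each distinct string a 4-byte ID; rows then store IDs instead of
/// the full string. Returns (dictionary, encoded rows).
fn dictionary_encode(values: &[&str]) -> (Vec<String>, Vec<u32>) {
    let mut dict: Vec<String> = Vec::new();
    let mut ids: HashMap<String, u32> = HashMap::new();
    let mut encoded = Vec::with_capacity(values.len());
    for v in values {
        let id = *ids.entry(v.to_string()).or_insert_with(|| {
            dict.push(v.to_string());
            (dict.len() - 1) as u32
        });
        encoded.push(id);
    }
    (dict, encoded)
}

fn main() {
    let statuses = ["pending", "shipped", "pending", "delivered", "shipped"];
    let (dict, encoded) = dictionary_encode(&statuses);
    // 3 distinct strings -> 3 dictionary entries; each row holds a 4-byte u32.
    assert_eq!(dict.len(), 3);
    assert_eq!(encoded, vec![0, 1, 0, 2, 1]);
}
```

A filter like `status = 'shipped'` can then compare one u32 per row instead of a string, which is also why equality comparisons get faster.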
2. Content-Addressed Storage¶
Best for large values (>1KB) with duplicates:
CREATE TABLE documents (
    id INT PRIMARY KEY,
    content TEXT STORAGE CONTENT_ADDRESSED,   -- Many docs share templates
    thumbnail BYTEA STORAGE CONTENT_ADDRESSED -- Default images
);
Benefits:
- Near-100% deduplication for duplicate large values
- Blake3 hashing for fast lookups
- Only stores each unique value once
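The idea can be sketched as a hash-keyed blob store. This is illustrative only, not the engine's internals, and it substitutes std's `DefaultHasher` for the Blake3 hash the engine actually uses, to keep the example dependency-free:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Toy content-addressed store: values are keyed by a hash of their bytes,
/// so identical values occupy storage exactly once.
struct ContentStore {
    blobs: HashMap<u64, Vec<u8>>,
}

impl ContentStore {
    fn new() -> Self {
        Self { blobs: HashMap::new() }
    }

    /// Returns the value's content address; duplicates reuse the stored blob.
    fn put(&mut self, value: &[u8]) -> u64 {
        let mut h = DefaultHasher::new();
        value.hash(&mut h);
        let addr = h.finish();
        self.blobs.entry(addr).or_insert_with(|| value.to_vec());
        addr
    }
}

fn main() {
    let mut store = ContentStore::new();
    let template = b"Dear customer, your order has shipped.";
    let a = store.put(template);
    let b = store.put(template); // duplicate: nothing new is stored
    store.put(b"something else");
    assert_eq!(a, b);
    assert_eq!(store.blobs.len(), 2); // two unique values, two blobs
}
```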
3. Columnar Storage¶
Best for analytics and aggregations:
CREATE TABLE metrics (
    id INT PRIMARY KEY,
    timestamp INT8 STORAGE COLUMNAR, -- Time-series data
    value FLOAT8 STORAGE COLUMNAR    -- Aggregation targets
);
Benefits:
- 20-50% better compression for numeric sequences
- 2-4x faster aggregations (SUM, AVG, COUNT)
- Efficient column-only scans
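The aggregation speedup comes from memory layout. This standalone sketch models the two layouts in plain Rust (it is not HeliosDB-Lite code): a columnar sum scans one dense array, while a row-store sum drags every other field through the cache.

```rust
/// Row layout: each record carries all fields, so scanning `value`
/// also pulls `id` and `timestamp` through the cache.
#[allow(dead_code)]
struct Row {
    id: i64,
    timestamp: i64,
    value: f64,
}

/// Columnar layout: one contiguous array per column.
fn sum_column(values: &[f64]) -> f64 {
    values.iter().sum()
}

fn main() {
    let rows: Vec<Row> = (0..1000)
        .map(|i| Row { id: i, timestamp: 1_700_000_000 + i, value: i as f64 })
        .collect();
    let row_sum: f64 = rows.iter().map(|r| r.value).sum();

    let value_col: Vec<f64> = (0..1000).map(|i| i as f64).collect();
    assert_eq!(sum_column(&value_col), row_sum); // same answer, less memory traffic
}
```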
Changing Storage Modes (Online Migration)¶
You can change storage modes on existing tables:
-- Migrate existing data to dictionary encoding
ALTER TABLE orders ALTER COLUMN status SET STORAGE DICTIONARY;
-- Migrate to content-addressed storage
ALTER TABLE documents ALTER COLUMN content SET STORAGE CONTENT_ADDRESSED;
-- Migrate to columnar storage
ALTER TABLE metrics ALTER COLUMN value SET STORAGE COLUMNAR;
-- Revert to default
ALTER TABLE orders ALTER COLUMN status SET STORAGE DEFAULT;
Note: Migrations run online, but they rewrite every existing row into the new format, so expect I/O roughly proportional to the table's size.
When to Use Each Mode¶
| Scenario | Recommended | Reason |
|---|---|---|
| Status codes, enums | DICTIONARY | Few unique values |
| Country/region codes | DICTIONARY | ~200 unique values |
| Email templates | CONTENT_ADDRESSED | Many duplicates |
| Default profile images | CONTENT_ADDRESSED | High duplication |
| Time-series timestamps | COLUMNAR | Sequential, aggregations |
| Sensor readings | COLUMNAR | Numeric, analytics |
| Unique IDs (UUIDs) | DEFAULT | High cardinality |
| Primary keys | DEFAULT | Point lookups |
Fast Bulk Loading¶
For maximum ingestion performance, use bulk loading mode:
// Enable bulk load mode
db.execute("SET bulk_load_mode = true")?;
// Load data at 700K-900K rows/sec
let rows = db.direct_bulk_load("table_name", &batches)?;
// Disable bulk load mode
db.execute("SET bulk_load_mode = false")?;
Bulk Loading Performance¶
| Method | Throughput | Use Case |
|---|---|---|
| `direct_bulk_load` | 700K-900K rows/sec | ETL, data warehouse loading |
| Standard INSERT | 30K-50K rows/sec | OLTP operations |
| Bulk Load Mode | 500K-700K rows/sec | Batch processing |
Monitoring Compression¶
Check Storage Usage¶
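Because HeliosDB-Lite is embedded, the physical footprint is simply the database file's size on disk. A minimal way to read it, assuming a single-file database at the path used in the earlier example (`./mydb.helio`); if your deployment stores a directory of files, sum its entries instead:

```rust
use std::fs;

/// Physical (compressed) size of the database file, in megabytes.
fn physical_size_mb(path: &str) -> std::io::Result<f64> {
    Ok(fs::metadata(path)?.len() as f64 / 1_048_576.0)
}

fn main() -> std::io::Result<()> {
    // Stand-in file so the sketch runs anywhere; point this at ./mydb.helio.
    fs::write("demo.helio", vec![0u8; 2 * 1_048_576])?;
    println!("physical size: {:.2} MB", physical_size_mb("demo.helio")?); // prints "physical size: 2.00 MB"
    fs::remove_file("demo.helio")
}
```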
Estimate Compression Ratio¶
Compare logical vs physical size:
- Logical: row count × average row size
- Physical: database file size
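That comparison is a single division, sketched here. The sample numbers echo the benchmark table above; `compression_ratio` is a hypothetical helper, not a HeliosDB-Lite API:

```rust
/// Estimated compression ratio = logical bytes / physical bytes.
fn compression_ratio(row_count: u64, avg_row_bytes: u64, physical_bytes: u64) -> f64 {
    (row_count * avg_row_bytes) as f64 / physical_bytes as f64
}

fn main() {
    // ~10.7 GB of logical data stored in ~2.7 GB on disk is about 3.96x.
    let ratio = compression_ratio(100_000_000, 107, 2_700_000_000);
    assert!((ratio - 3.96).abs() < 0.01);
}
```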
Performance Tips¶
1. Use Appropriate Storage Modes¶
-- Good: Dictionary for enums
status TEXT STORAGE DICTIONARY -- 6 unique values
-- Bad: Dictionary for UUIDs
user_id TEXT STORAGE DICTIONARY -- Millions of unique values
2. Batch Inserts for Better Compression¶
LZ4 compression works better with larger data blocks:
// Good: Batch inserts
db.execute("BEGIN")?;
for batch in data.chunks(1000) {
    db.execute(&insert_batch(batch))?;
}
db.execute("COMMIT")?;
// Also good: direct_bulk_load
db.direct_bulk_load("table", &batches)?;
3. Consider Data Order¶
Sorted data compresses better:
-- Better compression if data is pre-sorted
CREATE TABLE events (
    timestamp INT8 STORAGE COLUMNAR, -- Sorted timestamps compress well
    event_type TEXT STORAGE DICTIONARY
);
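A small sketch of why sorted data compresses well (a generic illustration, not engine internals): consecutive sorted timestamps have tiny, repetitive deltas, and repetition is exactly what LZ4 exploits.

```rust
/// Deltas between consecutive values; for sorted, evenly spaced data
/// this collapses to one repeated number.
fn deltas(sorted: &[i64]) -> Vec<i64> {
    sorted.windows(2).map(|w| w[1] - w[0]).collect()
}

fn main() {
    // One reading per minute, pre-sorted by time.
    let timestamps: Vec<i64> = (0..10).map(|i| 1_700_000_000 + 60 * i).collect();
    let d = deltas(&timestamps);
    assert!(d.iter().all(|&x| x == 60)); // a single repeated value: trivially compressible
}
```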
Troubleshooting¶
Low Compression Ratio¶
Check data patterns:
- Random data (UUIDs, hashes) compresses poorly
- Already-compressed data (images, videos) won't compress further
- Encrypted data has no compressible patterns
Solution: Use DEFAULT storage for incompressible data.
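One way to check a column for compressibility before blaming configuration is to estimate the byte entropy of a sample. This is a generic diagnostic, not a HeliosDB-Lite feature; values near 8 bits/byte (random, encrypted, or already-compressed data) will not compress under LZ4 or anything else.

```rust
/// Shannon entropy of a byte slice, in bits per byte (0.0..=8.0).
fn entropy_bits_per_byte(data: &[u8]) -> f64 {
    let mut counts = [0u64; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let n = data.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

fn main() {
    let repetitive = vec![b'a'; 4096]; // compresses extremely well
    let uniform: Vec<u8> = (0..=255u8).cycle().take(4096).collect(); // looks random
    assert!(entropy_bits_per_byte(&repetitive) < 0.01);
    assert!(entropy_bits_per_byte(&uniform) > 7.9);
}
```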
Dictionary Too Large¶
Symptom: Dictionary encoding is not reducing storage.
Check: Count the column's distinct values, e.g. SELECT COUNT(DISTINCT status) FROM orders;
Solution: If the column has more than 64K unique values, switch it back to DEFAULT.
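If the data has not been loaded yet, the same check can be run over a sample in application code. `suits_dictionary` is a hypothetical helper built on the 64K limit stated above, not part of the HeliosDB-Lite API:

```rust
use std::collections::HashSet;

/// Dictionary encoding pays off only below the 64K distinct-value limit;
/// probe a representative sample of the column to see which side you're on.
fn suits_dictionary(sample: &[&str]) -> bool {
    let distinct: HashSet<&str> = sample.iter().copied().collect();
    distinct.len() < 65_536
}

fn main() {
    assert!(suits_dictionary(&["pending", "shipped", "pending", "delivered"]));

    // UUID-like columns blow past the limit.
    let ids: Vec<String> = (0..70_000).map(|i| format!("user-{}", i)).collect();
    let sample: Vec<&str> = ids.iter().map(String::as_str).collect();
    assert!(!suits_dictionary(&sample));
}
```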
Slow Aggregations¶
Symptom: SUM/AVG queries are slow on large tables.
Solution: Use COLUMNAR storage for the aggregation targets, e.g. ALTER TABLE metrics ALTER COLUMN value SET STORAGE COLUMNAR;
Further Reading¶
- Per-Column Storage Modes Guide - Detailed storage mode documentation
- Bulk Loading Analysis - Performance optimization
- Benchmark Report - 10GB benchmark results
Last Updated: 2026-01-17 Version: HeliosDB-Lite v3.6+