
Compression & Storage Quick Start Guide

HeliosDB-Lite v3.6+

This guide explains HeliosDB-Lite's compression architecture and per-column storage optimization features.


Compression Architecture

HeliosDB-Lite uses a two-layer compression strategy:

  1. RocksDB LZ4 Block Compression (automatic, always-on)
     - Applied to all data blocks
     - 3.7-7x compression depending on data patterns
     - Zero configuration required

  2. Per-Column Storage Modes (optional, per-column)
     - Dictionary encoding for low-cardinality strings
     - Content-addressed deduplication for large values
     - Columnar batching for analytics workloads

Automatic Compression (Zero Configuration)

All data in HeliosDB-Lite is automatically compressed using RocksDB's LZ4 block compression:

use heliosdb_lite::EmbeddedDatabase;

// Create database - compression is automatic
let db = EmbeddedDatabase::new("./mydb.helio")?;

// Create and populate table
db.execute("CREATE TABLE sensors (id INT, reading FLOAT8, location TEXT)")?;
db.execute("INSERT INTO sensors VALUES (1, 23.5, 'Building A, Floor 3, Room 301')")?;

// Query - decompression is automatic
let results = db.query("SELECT * FROM sensors", &[])?;

That's it! No configuration needed for basic compression.

Compression Ratios (10GB Benchmarks)

| Data Pattern            | Compression Ratio | Example             |
|-------------------------|-------------------|---------------------|
| Email addresses         | 6.95x             | 10.5 GB → 1.5 GB    |
| Sorted text (regions)   | 6.26x             | 10.7 GB → 1.7 GB    |
| Sequential integers     | 3.96x             | 10.7 GB → 2.7 GB    |
| Low-cardinality strings | 3.96x             | 10.7 GB → 2.7 GB    |
| Floating-point prices   | 3.70x             | 10.7 GB → 2.9 GB    |

Per-Column Storage Modes

For additional optimization, you can specify storage modes per column:

1. Dictionary Encoding

Best for columns with few unique values (< 64K distinct):

CREATE TABLE orders (
    id INT PRIMARY KEY,
    status TEXT STORAGE DICTIONARY,     -- 6 values: pending, shipped, etc.
    country TEXT STORAGE DICTIONARY     -- ~200 country codes
);

Benefits:
  - 50-95% storage reduction for low-cardinality strings
  - 4-byte IDs instead of full strings in row tuples
  - Faster equality comparisons

2. Content-Addressed Storage

Best for large values (>1KB) with duplicates:

CREATE TABLE documents (
    id INT PRIMARY KEY,
    content TEXT STORAGE CONTENT_ADDRESSED,  -- Many docs share templates
    thumbnail BYTEA STORAGE CONTENT_ADDRESSED -- Default images
);

Benefits:
  - Near-100% deduplication for duplicate large values
  - Blake3 hashing for fast lookups
  - Only stores each unique value once

3. Columnar Storage

Best for analytics and aggregations:

CREATE TABLE metrics (
    id INT PRIMARY KEY,
    timestamp INT8 STORAGE COLUMNAR,     -- Time-series data
    value FLOAT8 STORAGE COLUMNAR        -- Aggregation targets
);

Benefits:
  - 20-50% better compression for numeric sequences
  - 2-4x faster aggregations (SUM, AVG, COUNT)
  - Efficient column-only scans


Changing Storage Modes (Online Migration)

You can change storage modes on existing tables:

-- Migrate existing data to dictionary encoding
ALTER TABLE orders ALTER COLUMN status SET STORAGE DICTIONARY;

-- Migrate to content-addressed storage
ALTER TABLE documents ALTER COLUMN content SET STORAGE CONTENT_ADDRESSED;

-- Migrate to columnar storage
ALTER TABLE metrics ALTER COLUMN value SET STORAGE COLUMNAR;

-- Revert to default
ALTER TABLE orders ALTER COLUMN status SET STORAGE DEFAULT;

Note: Migrations run online; all existing data in the column is rewritten into the new storage format.


When to Use Each Mode

| Scenario               | Recommended       | Reason                   |
|------------------------|-------------------|--------------------------|
| Status codes, enums    | DICTIONARY        | Few unique values        |
| Country/region codes   | DICTIONARY        | ~200 unique values       |
| Email templates        | CONTENT_ADDRESSED | Many duplicates          |
| Default profile images | CONTENT_ADDRESSED | High duplication         |
| Time-series timestamps | COLUMNAR          | Sequential, aggregations |
| Sensor readings        | COLUMNAR          | Numeric, analytics       |
| Unique IDs (UUIDs)     | DEFAULT           | High cardinality         |
| Primary keys           | DEFAULT           | Point lookups            |

Fast Bulk Loading

For maximum ingestion performance, use bulk loading mode:

// Enable bulk load mode
db.execute("SET bulk_load_mode = true")?;

// Load data at 700K-900K rows/sec
let rows = db.direct_bulk_load("table_name", &batches)?;

// Disable bulk load mode
db.execute("SET bulk_load_mode = false")?;

Bulk Loading Performance

| Method           | Throughput         | Use Case                   |
|------------------|--------------------|----------------------------|
| direct_bulk_load | 700K-900K rows/sec | ETL, data warehouse loading |
| Standard INSERT  | 30K-50K rows/sec   | OLTP operations            |
| Bulk Load Mode   | 500K-700K rows/sec | Batch processing           |

Monitoring Compression

Check Storage Usage

-- RocksDB statistics (when available)
SELECT * FROM pg_stat_storage;

Estimate Compression Ratio

Compare logical vs physical size:
  - Logical: row count × average row size
  - Physical: database file size


Performance Tips

1. Use Appropriate Storage Modes

-- Good: Dictionary for enums
status TEXT STORAGE DICTIONARY  -- 6 unique values

-- Bad: Dictionary for UUIDs
user_id TEXT STORAGE DICTIONARY  -- Millions of unique values

2. Batch Inserts for Better Compression

LZ4 compression works better with larger data blocks:

// Good: wrap batched inserts in one transaction
db.execute("BEGIN")?;
for batch in data.chunks(1000) {
    // insert_batch builds a multi-row INSERT for the chunk (helper not shown)
    db.execute(&insert_batch(batch))?;
}
db.execute("COMMIT")?;

// Also good: direct_bulk_load
db.direct_bulk_load("table", &batches)?;

3. Consider Data Order

Sorted data compresses better:

-- Better compression if data is pre-sorted
CREATE TABLE events (
    timestamp INT8 STORAGE COLUMNAR,  -- Sorted timestamps compress well
    event_type TEXT STORAGE DICTIONARY
);

Troubleshooting

Low Compression Ratio

Check data patterns:
  - Random data (UUIDs, hashes) compresses poorly
  - Already-compressed data (images, videos) won't compress further
  - Encrypted data has no compressible patterns

Solution: Use DEFAULT storage for incompressible data.

Dictionary Too Large

Symptom: Dictionary encoding not helping

Check:

SELECT COUNT(DISTINCT column_name) FROM table_name;

Solution: If >64K unique values, switch to DEFAULT.

Slow Aggregations

Symptom: SUM/AVG queries slow on large tables

Solution: Use COLUMNAR storage for aggregation targets:

ALTER TABLE metrics ALTER COLUMN value SET STORAGE COLUMNAR;



Last Updated: 2026-01-17
Version: HeliosDB-Lite v3.6+