Compression & Storage Quick Start Guide¶
HeliosDB-Lite v3.6+
This guide explains HeliosDB-Lite's compression architecture and per-column storage optimization features.
Compression Architecture¶
HeliosDB-Lite uses a two-layer compression strategy:
1. RocksDB LZ4 Block Compression (automatic, always-on)
   - Applied to all data blocks
   - 3.7-7x compression depending on data patterns
   - Zero configuration required
2. Per-Column Storage Modes (optional, per-column)
   - Dictionary encoding for low-cardinality strings
   - Content-addressed deduplication for large values
   - Columnar batching for analytics workloads
Automatic Compression (Zero Configuration)¶
All data in HeliosDB-Lite is automatically compressed using RocksDB's LZ4 block compression:
use heliosdb_lite::EmbeddedDatabase;
// Create database - compression is automatic
let db = EmbeddedDatabase::new("./mydb.helio")?;
// Create and populate table
db.execute("CREATE TABLE sensors (id INT, reading FLOAT8, location TEXT)")?;
db.execute("INSERT INTO sensors VALUES (1, 23.5, 'Building A, Floor 3, Room 301')")?;
// Query - decompression is automatic
let results = db.query("SELECT * FROM sensors", &[])?;
That's it! No configuration needed for basic compression.
Compression Ratios (10GB Benchmarks)¶
| Data Pattern | Compression Ratio | Example |
|---|---|---|
| Email addresses | 6.95x | 10.5 GB → 1.5 GB |
| Sorted text (regions) | 6.26x | 10.7 GB → 1.7 GB |
| Sequential integers | 3.96x | 10.7 GB → 2.5 GB |
| Low-cardinality strings | 3.96x | 10.7 GB → 2.7 GB |
| Floating-point prices | 3.70x | 10.7 GB → 2.9 GB |
Per-Column Storage Modes¶
For additional optimization, you can specify storage modes per column:
1. Dictionary Encoding¶
Best for columns with few unique values (< 64K distinct):
CREATE TABLE orders (
    id INT PRIMARY KEY,
    status TEXT STORAGE DICTIONARY,  -- 6 values: pending, shipped, etc.
    country TEXT STORAGE DICTIONARY  -- ~200 country codes
);
Benefits:
- 50-95% storage reduction for low-cardinality strings
- 4-byte IDs instead of full strings in row tuples
- Faster equality comparisons
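The mechanism behind these numbers can be sketched in a few lines. This is an illustrative model of dictionary encoding, not HeliosDB-Lite's internal implementation: each distinct string is assigned a 4-byte ID, and rows store the ID instead of the string.

```rust
use std::collections::HashMap;

/// Assign each distinct string a 4-byte ID; rows then store IDs instead of
/// the full string. Returns (dictionary, encoded rows).
fn dictionary_encode(values: &[&str]) -> (Vec<String>, Vec<u32>) {
    let mut dict: Vec<String> = Vec::new();
    let mut ids: HashMap<String, u32> = HashMap::new();
    let mut encoded = Vec::with_capacity(values.len());
    for v in values {
        let id = *ids.entry(v.to_string()).or_insert_with(|| {
            dict.push(v.to_string());
            (dict.len() - 1) as u32
        });
        encoded.push(id);
    }
    (dict, encoded)
}

fn main() {
    let statuses = ["pending", "shipped", "pending", "delivered", "shipped"];
    let (dict, encoded) = dictionary_encode(&statuses);
    // 3 distinct strings -> 3 dictionary entries; each row holds a 4-byte u32.
    assert_eq!(dict.len(), 3);
    assert_eq!(encoded, vec![0, 1, 0, 2, 1]);
}
```

A filter like `status = 'shipped'` can then compare one u32 per row instead of a string, which is also why equality comparisons get faster.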
2. Content-Addressed Storage¶
Best for large values (>1KB) with duplicates:
CREATE TABLE documents (
    id INT PRIMARY KEY,
    content TEXT STORAGE CONTENT_ADDRESSED,   -- Many docs share templates
    thumbnail BYTEA STORAGE CONTENT_ADDRESSED -- Default images
);
Benefits:
- Near-100% deduplication for duplicate large values
- Blake3 hashing for fast lookups
- Only stores each unique value once
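The idea can be sketched as a hash-keyed blob store. This is illustrative only, not the engine's internals, and it substitutes std's `DefaultHasher` for the Blake3 hash the engine actually uses, to keep the example dependency-free:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Toy content-addressed store: values are keyed by a hash of their bytes,
/// so identical values occupy storage exactly once.
struct ContentStore {
    blobs: HashMap<u64, Vec<u8>>,
}

impl ContentStore {
    fn new() -> Self {
        Self { blobs: HashMap::new() }
    }

    /// Returns the value's content address; duplicates reuse the stored blob.
    fn put(&mut self, value: &[u8]) -> u64 {
        let mut h = DefaultHasher::new();
        value.hash(&mut h);
        let addr = h.finish();
        self.blobs.entry(addr).or_insert_with(|| value.to_vec());
        addr
    }
}

fn main() {
    let mut store = ContentStore::new();
    let template = b"Dear customer, your order has shipped.";
    let a = store.put(template);
    let b = store.put(template); // duplicate: nothing new is stored
    store.put(b"something else");
    assert_eq!(a, b);
    assert_eq!(store.blobs.len(), 2); // two unique values, two blobs
}
```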
3. Columnar Storage¶
Best for analytics and aggregations:
CREATE TABLE metrics (
    id INT PRIMARY KEY,
    timestamp INT8 STORAGE COLUMNAR, -- Time-series data
    value FLOAT8 STORAGE COLUMNAR    -- Aggregation targets
);
Benefits:
- 20-50% better compression for numeric sequences
- 2-4x faster aggregations (SUM, AVG, COUNT)
- Efficient column-only scans
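The aggregation speedup comes from memory layout. This standalone sketch models the two layouts in plain Rust (it is not HeliosDB-Lite code): a columnar sum scans one dense array, while a row-store sum drags every other field through the cache.

```rust
/// Row layout: each record carries all fields, so scanning `value`
/// also pulls `id` and `timestamp` through the cache.
#[allow(dead_code)]
struct Row {
    id: i64,
    timestamp: i64,
    value: f64,
}

/// Columnar layout: one contiguous array per column.
fn sum_column(values: &[f64]) -> f64 {
    values.iter().sum()
}

fn main() {
    let rows: Vec<Row> = (0..1000)
        .map(|i| Row { id: i, timestamp: 1_700_000_000 + i, value: i as f64 })
        .collect();
    let row_sum: f64 = rows.iter().map(|r| r.value).sum();

    let value_col: Vec<f64> = (0..1000).map(|i| i as f64).collect();
    assert_eq!(sum_column(&value_col), row_sum); // same answer, less memory traffic
}
```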
Changing Storage Modes (Online Migration)¶
You can change storage modes on existing tables:
-- Migrate existing data to dictionary encoding
ALTER TABLE orders ALTER COLUMN status SET STORAGE DICTIONARY;
-- Migrate to content-addressed storage
ALTER TABLE documents ALTER COLUMN content SET STORAGE CONTENT_ADDRESSED;
-- Migrate to columnar storage
ALTER TABLE metrics ALTER COLUMN value SET STORAGE COLUMNAR;
-- Revert to default
ALTER TABLE orders ALTER COLUMN status SET STORAGE DEFAULT;
Note: Migrations run online, but they rewrite every existing row into the new format, so expect I/O roughly proportional to the table's size.
When to Use Each Mode¶
| Scenario | Recommended | Reason |
|---|---|---|
| Status codes, enums | DICTIONARY | Few unique values |
| Country/region codes | DICTIONARY | ~200 unique values |
| Email templates | CONTENT_ADDRESSED | Many duplicates |
| Default profile images | CONTENT_ADDRESSED | High duplication |
| Time-series timestamps | COLUMNAR | Sequential, aggregations |
| Sensor readings | COLUMNAR | Numeric, analytics |
| Unique IDs (UUIDs) | DEFAULT | High cardinality |
| Primary keys | DEFAULT | Point lookups |
Fast Bulk Loading¶
For maximum ingestion performance, use bulk loading mode:
// Enable bulk load mode
db.execute("SET bulk_load_mode = true")?;
// Load data at 700K-900K rows/sec
let rows = db.direct_bulk_load("table_name", &batches)?;
// Disable bulk load mode
db.execute("SET bulk_load_mode = false")?;
Bulk Loading Performance¶
| Method | Throughput | Use Case |
|---|---|---|
| `direct_bulk_load` | 700K-900K rows/sec | ETL, data warehouse loading |
| Standard INSERT | 30K-50K rows/sec | OLTP operations |
| Bulk Load Mode | 500K-700K rows/sec | Batch processing |
Monitoring Compression¶
Check Storage Usage¶
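Because HeliosDB-Lite is embedded, the physical footprint is simply the database file's size on disk. A minimal way to read it, assuming a single-file database at the path used in the earlier example (`./mydb.helio`); if your deployment stores a directory of files, sum its entries instead:

```rust
use std::fs;

/// Physical (compressed) size of the database file, in megabytes.
fn physical_size_mb(path: &str) -> std::io::Result<f64> {
    Ok(fs::metadata(path)?.len() as f64 / 1_048_576.0)
}

fn main() -> std::io::Result<()> {
    // Stand-in file so the sketch runs anywhere; point this at ./mydb.helio.
    fs::write("demo.helio", vec![0u8; 2 * 1_048_576])?;
    println!("physical size: {:.2} MB", physical_size_mb("demo.helio")?); // prints "physical size: 2.00 MB"
    fs::remove_file("demo.helio")
}
```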
Estimate Compression Ratio¶
Compare logical vs physical size:
- Logical: row count × average row size
- Physical: database file size
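That comparison is a single division, sketched here. The sample numbers echo the benchmark table above; `compression_ratio` is a hypothetical helper, not a HeliosDB-Lite API:

```rust
/// Estimated compression ratio = logical bytes / physical bytes.
fn compression_ratio(row_count: u64, avg_row_bytes: u64, physical_bytes: u64) -> f64 {
    (row_count * avg_row_bytes) as f64 / physical_bytes as f64
}

fn main() {
    // ~10.7 GB of logical data stored in ~2.7 GB on disk is about 3.96x.
    let ratio = compression_ratio(100_000_000, 107, 2_700_000_000);
    assert!((ratio - 3.96).abs() < 0.01);
}
```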
Performance Tips¶
1. Use Appropriate Storage Modes¶
-- Good: Dictionary for enums
status TEXT STORAGE DICTIONARY -- 6 unique values
-- Bad: Dictionary for UUIDs
user_id TEXT STORAGE DICTIONARY -- Millions of unique values
2. Batch Inserts for Better Compression¶
LZ4 compression works better with larger data blocks:
// Good: Batch inserts
db.execute("BEGIN")?;
for batch in data.chunks(1000) {
    db.execute(&insert_batch(batch))?;
}
db.execute("COMMIT")?;
// Also good: direct_bulk_load
db.direct_bulk_load("table", &batches)?;
3. Consider Data Order¶
Sorted data compresses better:
-- Better compression if data is pre-sorted
CREATE TABLE events (
    timestamp INT8 STORAGE COLUMNAR, -- Sorted timestamps compress well
    event_type TEXT STORAGE DICTIONARY
);
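A small sketch of why sorted data compresses well (a generic illustration, not engine internals): consecutive sorted timestamps have tiny, repetitive deltas, and repetition is exactly what LZ4 exploits.

```rust
/// Deltas between consecutive values; for sorted, evenly spaced data
/// this collapses to one repeated number.
fn deltas(sorted: &[i64]) -> Vec<i64> {
    sorted.windows(2).map(|w| w[1] - w[0]).collect()
}

fn main() {
    // One reading per minute, pre-sorted by time.
    let timestamps: Vec<i64> = (0..10).map(|i| 1_700_000_000 + 60 * i).collect();
    let d = deltas(&timestamps);
    assert!(d.iter().all(|&x| x == 60)); // a single repeated value: trivially compressible
}
```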
Troubleshooting¶
Low Compression Ratio¶
Check data patterns:
- Random data (UUIDs, hashes) compresses poorly
- Already-compressed data (images, videos) won't compress further
- Encrypted data has no compressible patterns
Solution: Use DEFAULT storage for incompressible data.
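One way to check a column for compressibility before blaming configuration is to estimate the byte entropy of a sample. This is a generic diagnostic, not a HeliosDB-Lite feature; values near 8 bits/byte (random, encrypted, or already-compressed data) will not compress under LZ4 or anything else.

```rust
/// Shannon entropy of a byte slice, in bits per byte (0.0..=8.0).
fn entropy_bits_per_byte(data: &[u8]) -> f64 {
    let mut counts = [0u64; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let n = data.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

fn main() {
    let repetitive = vec![b'a'; 4096]; // compresses extremely well
    let uniform: Vec<u8> = (0..=255u8).cycle().take(4096).collect(); // looks random
    assert!(entropy_bits_per_byte(&repetitive) < 0.01);
    assert!(entropy_bits_per_byte(&uniform) > 7.9);
}
```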
Dictionary Too Large¶
Symptom: Dictionary encoding is not reducing storage.
Check: Count the column's distinct values, e.g. SELECT COUNT(DISTINCT status) FROM orders;
Solution: If the column has more than 64K unique values, switch it back to DEFAULT.
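If the data has not been loaded yet, the same check can be run over a sample in application code. `suits_dictionary` is a hypothetical helper built on the 64K limit stated above, not part of the HeliosDB-Lite API:

```rust
use std::collections::HashSet;

/// Dictionary encoding pays off only below the 64K distinct-value limit;
/// probe a representative sample of the column to see which side you're on.
fn suits_dictionary(sample: &[&str]) -> bool {
    let distinct: HashSet<&str> = sample.iter().copied().collect();
    distinct.len() < 65_536
}

fn main() {
    assert!(suits_dictionary(&["pending", "shipped", "pending", "delivered"]));

    // UUID-like columns blow past the limit.
    let ids: Vec<String> = (0..70_000).map(|i| format!("user-{}", i)).collect();
    let sample: Vec<&str> = ids.iter().map(String::as_str).collect();
    assert!(!suits_dictionary(&sample));
}
```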
Slow Aggregations¶
Symptom: SUM/AVG queries are slow on large tables.
Solution: Use COLUMNAR storage for the aggregation targets, e.g. ALTER TABLE metrics ALTER COLUMN value SET STORAGE COLUMNAR;
Further Reading¶
- Per-Column Storage Modes Guide - Detailed storage mode documentation
- Bulk Loading Analysis - Performance optimization
- Benchmark Report - 10GB benchmark results
Last Updated: 2026-01-17 Version: HeliosDB-Lite v3.6+