Compression Codecs User Guide¶
HeliosDB-Lite v3.0.1 - Complete Compression Reference
This guide provides detailed information about each compression codec in HeliosDB-Lite, including optimal use cases, data characteristics, and estimated compression ratios for 10GB data scenarios.
Table of Contents¶
- Overview
- ALP - Adaptive Lossless Floating-Point
- FSST - Fast Static Symbol Table
- Dictionary Encoding
- RLE - Run-Length Encoding
- Delta Encoding
- Codec Selection Guide
- Performance Comparison
- SQL Configuration
Overview¶
HeliosDB-Lite includes five specialized compression codecs, each optimized for different data patterns:
| Codec | Target Data | Typical Ratio | Speed |
|---|---|---|---|
| ALP | Floating-point numbers | 2-4x | Very Fast |
| FSST | Strings with patterns | 2-3x | Fast |
| Dictionary | Low-cardinality columns | 5-20x | Very Fast |
| RLE | Repetitive/sorted data | 10-100x | Fastest |
| Delta | Sequential numbers | 2-10x | Very Fast |
ALP - Adaptive Lossless Floating-Point¶
Description¶
ALP (Adaptive Lossless Floating-Point) is a state-of-the-art compression algorithm for IEEE 754 floating-point data. Based on ACM SIGMOD 2024 research, it automatically adapts between two strategies:
- ALP Classic: For decimal-origin data (financial, percentages, measurements)
- ALP-RD: For high-precision floats (scientific, ML weights)
Technical Characteristics¶
- Encoding Speed: ~0.5 doubles per CPU cycle
- Decoding Speed: ~2.6 doubles per CPU cycle
- Compression: 100% lossless (zero precision loss)
- Block Size: 1024 values (optimized for CPU cache)
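The ALP Classic idea can be shown with a minimal sketch (an illustration only, not the HeliosDB-Lite implementation; the function names here are made up): scale by successive powers of ten until every value round-trips exactly as an integer, and give up where the real codec would fall back to ALP-RD.

```python
def alp_classic_encode(values, max_exp=10):
    """Toy sketch of ALP Classic: store decimal-origin doubles as scaled
    integers.  Finds the smallest power of ten that makes every value
    round-trip exactly; returns None when no exponent works (real ALP
    would switch to its ALP-RD strategy for such high-precision data)."""
    for e in range(max_exp + 1):
        scale = 10 ** e
        ints = [round(v * scale) for v in values]
        # Lossless check: decoding must reproduce the input exactly.
        if all(n / scale == v for n, v in zip(ints, values)):
            return e, ints
    return None

def alp_classic_decode(e, ints):
    scale = 10 ** e
    return [n / scale for n in ints]

# Two-decimal prices need exponent 2 and become small integers:
e, ints = alp_classic_encode([10.12, 99.95, 1234.56])
```

The small integers then compress well with bit-packing; the round-trip check is what makes the scheme lossless rather than a rounding step.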
Good Use Cases¶
| Use Case | Description | Expected Ratio | 10GB Scenario |
|---|---|---|---|
| Financial Data | Prices: $10.12, $99.95, $1234.56 | 3-4x | 10GB → 2.5-3.3GB |
| Sensor Readings | Temperature: 23.5°C, 24.1°C, 23.8°C | 3-4x | 10GB → 2.5-3.3GB |
| Percentages | Values: 0.25, 0.50, 0.75, 0.33 | 4x | 10GB → 2.5GB |
| Coordinates | GPS: -122.4194, 37.7749 | 2.5-3x | 10GB → 3.3-4GB |
| Measurements | Scientific: 9.81, 3.14159, 2.718 | 2-3x | 10GB → 3.3-5GB |
Example - Price Data (10GB):
CREATE TABLE orders (
id INT PRIMARY KEY,
price FLOAT8, -- ALP: 10GB → ~2.5GB
quantity INT
) WITH (compression = 'alp');
Bad Use Cases¶
| Use Case | Why It's Bad | Expected Ratio | Recommendation |
|---|---|---|---|
| Random doubles | No patterns to exploit | 1.0-1.2x | Don't compress |
| ML weights | High-entropy mantissas; even ALP-RD gains little | 1.2-1.5x | Consider storing as binary |
| Encrypted data | Appears random | ~1.0x | Don't compress |
| Already compressed | No further reduction | ~1.0x | Store raw |
Example - Poor Compression:
Input: Random f64 values from rand::random()
Original: 10GB
Compressed: ~9GB (only 10% savings)
Overhead may not be worth it
10GB Data Estimates¶
| Data Pattern | Compression Ratio | Compressed Size | Space Saved |
|---|---|---|---|
| Financial prices (2 decimals) | 4.0x | 2.5 GB | 7.5 GB (75%) |
| Scientific measurements | 3.0x | 3.3 GB | 6.7 GB (67%) |
| GPS coordinates | 2.5x | 4.0 GB | 6.0 GB (60%) |
| Time-series sensor data | 3.5x | 2.9 GB | 7.1 GB (71%) |
| Random doubles | 1.2x | 8.3 GB | 1.7 GB (17%) |
FSST - Fast Static Symbol Table¶
Description¶
FSST (Fast Static Symbol Table) is a lightweight string compression algorithm that encodes common substrings (1-8 bytes) using a symbol table trained on sample data. It provides random access to individual strings.
Technical Characteristics¶
- Compression Speed: 1-3 GB/sec
- Decompression Speed: 1-3 GB/sec
- Symbol Table Size: ~2-3 KB per column
- Random Access: Yes (individual strings decompressible)
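The symbol-table idea can be sketched in a few lines. This toy version hand-picks the symbols and assumes printable-ASCII input so the low code bytes are free; real FSST trains up to 255 symbols of 1-8 bytes on sampled data and operates on raw bytes:

```python
def fsst_compress(text, symbols):
    """Toy sketch of the FSST idea: replace frequent substrings with
    one-byte codes from a static symbol table (here hand-picked)."""
    table = {sym: bytes([i + 1]) for i, sym in enumerate(symbols)}
    out, i = bytearray(), 0
    while i < len(text):
        for sym, code in table.items():
            if text.startswith(sym, i):
                out += code           # one byte replaces the whole symbol
                i += len(sym)
                break
        else:
            out += text[i].encode("ascii")  # literal byte, stored as-is
            i += 1
    return bytes(out)

def fsst_decompress(data, symbols):
    # Each string decodes independently -- the random-access property.
    return "".join(symbols[b - 1] if b <= len(symbols) else chr(b)
                   for b in data)

emails = "alice@example.com;bob@example.com"
packed = fsst_compress(emails, ["@example.com"])
```

Here the 33-byte input shrinks to 11 bytes because the shared domain collapses to a single code byte, which is the pattern the email/URL/log rows in the table below rely on.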
Good Use Cases¶
| Use Case | Description | Expected Ratio | 10GB Scenario |
|---|---|---|---|
| Email Addresses | user@example.com patterns | 2.5-3x | 10GB → 3.3-4GB |
| URLs | https://example.com/path patterns | 2-3x | 10GB → 3.3-5GB |
| Log Messages | Repetitive log formats | 2.5-3x | 10GB → 3.3-4GB |
| JSON Records | Structured text patterns | 2-2.5x | 10GB → 4-5GB |
| File Paths | /home/user/docs patterns | 2.5-3x | 10GB → 3.3-4GB |
| Request Logs | GET /api/v1/users patterns | 2.5-3.5x | 10GB → 2.9-4GB |
Example - Email Data (10GB):
CREATE TABLE users (
id INT PRIMARY KEY,
email TEXT, -- FSST: 10GB → ~3.5GB
name TEXT
) WITH (compression = 'fsst');
Bad Use Cases¶
| Use Case | Why It's Bad | Expected Ratio | Recommendation |
|---|---|---|---|
| UUIDs | No substring patterns | 1.1-1.3x | Use binary storage |
| Base64 data | Uniform character distribution | 1.0-1.2x | Store raw |
| Hashes (SHA/MD5) | Random character patterns | ~1.0x | Store raw |
| Encrypted text | No compressible patterns | ~1.0x | Don't compress |
| Random strings | No common substrings | 1.0-1.2x | Don't compress |
Example - Poor Compression:
Input: 10GB of UUIDs (550a8400-e29b-41d4-a716-446655440000)
Compressed: ~8.5GB (only 15% savings)
Symbol table overhead may exceed savings
10GB Data Estimates¶
| Data Pattern | Compression Ratio | Compressed Size | Space Saved |
|---|---|---|---|
| Email addresses (common domains) | 3.0x | 3.3 GB | 6.7 GB (67%) |
| URLs (same site) | 2.5x | 4.0 GB | 6.0 GB (60%) |
| Server logs (structured) | 3.0x | 3.3 GB | 6.7 GB (67%) |
| JSON records | 2.0x | 5.0 GB | 5.0 GB (50%) |
| UUIDs | 1.2x | 8.3 GB | 1.7 GB (17%) |
| Random text | 1.1x | 9.1 GB | 0.9 GB (9%) |
Dictionary Encoding¶
Description¶
Dictionary encoding replaces repeated values with compact integer indices into a dictionary of unique values. Ideal for columns with few unique values (low cardinality).
Technical Characteristics¶
- Max Dictionary Size: 65,536 unique values
- Index Width: 1, 2, or 4 bytes (auto-selected)
- Encoding Speed: Very fast (hash lookup)
- Decoding Speed: Very fast (array index)
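A minimal sketch of the encode/decode pair (illustrative names, not a HeliosDB-Lite API): distinct values get dense integer codes via a hash lookup, and decoding is plain array indexing. With at most 256 distinct values every index fits in one byte, which is the 1/2/4-byte auto-selection described above.

```python
def dict_encode(column):
    """Minimal dictionary-encoding sketch: the column is stored as a
    dictionary of unique values plus a list of small indices."""
    dictionary, indices, codes = [], [], {}
    for v in column:
        if v not in codes:              # hash lookup on encode
            codes[v] = len(dictionary)
            dictionary.append(v)
        indices.append(codes[v])
    return dictionary, indices

def dict_decode(dictionary, indices):
    return [dictionary[i] for i in indices]   # array index on decode

col = ["active", "pending", "active", "active", "inactive"]
d, idx = dict_encode(col)
```

For a status column like this, millions of 6-8 byte strings become one-byte indices plus a three-entry dictionary, which is where the 10-50x ratios in the table below come from.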
Good Use Cases¶
| Use Case | Description | Expected Ratio | 10GB Scenario |
|---|---|---|---|
| Status Fields | active/inactive/pending | 10-50x | 10GB → 200MB-1GB |
| Country Codes | US, UK, DE, FR (~200 values) | 5-10x | 10GB → 1-2GB |
| Category Tags | electronics/clothing/food | 8-20x | 10GB → 500MB-1.25GB |
| Boolean-like | yes/no, true/false | 50-100x | 10GB → 100-200MB |
| Day of Week | Mon, Tue, Wed... (7 values) | 15-30x | 10GB → 333-666MB |
| Enum Fields | Predefined value sets | 10-50x | 10GB → 200MB-1GB |
Example - Status Field (10GB):
CREATE TABLE orders (
id INT PRIMARY KEY,
status TEXT, -- Dictionary: 10GB → ~200MB (3 unique values)
product_id INT
) WITH (compression_columns = 'status:dictionary');
Bad Use Cases¶
| Use Case | Why It's Bad | Expected Ratio | Recommendation |
|---|---|---|---|
| High cardinality | >50% unique values | 0.8-1.5x | Use FSST |
| User IDs | Mostly unique values | ~1.0x | Don't compress |
| Timestamps | All different | ~1.0x | Use Delta |
| Free-text fields | High uniqueness | ~1.0x | Use FSST |
| >65,536 unique | Exceeds dictionary limit | Fails | Use FSST |
Example - Poor Compression:
Input: 10GB of unique user IDs (user_12345, user_12346, ...)
Dictionary Size: Would exceed 65,536 limit or be nearly 1:1
Recommendation: Use FSST or store uncompressed
10GB Data Estimates¶
| Data Pattern | Unique Values | Compression Ratio | Compressed Size | Space Saved |
|---|---|---|---|---|
| Status (3 values) | 3 | 50x | 200 MB | 9.8 GB (98%) |
| Country codes | 200 | 8x | 1.25 GB | 8.75 GB (87.5%) |
| Product categories | 500 | 6x | 1.67 GB | 8.33 GB (83.3%) |
| User types | 10 | 20x | 500 MB | 9.5 GB (95%) |
| City names | 10,000 | 4x | 2.5 GB | 7.5 GB (75%) |
| Unique emails | 1,000,000+ | ~1.0x | ~10 GB | ~0 GB (0%) |
RLE - Run-Length Encoding¶
Description¶
Run-Length Encoding compresses sequences of repeated values into (value, count) pairs. Extremely effective for sorted columns or data with long runs of identical values.
Technical Characteristics¶
- Minimum Run Length: 3 (shorter runs stored verbatim)
- Maximum Run Length: ~4.2 billion (2^32) values per entry
- Encoding Speed: Fastest (simple counting)
- Decoding Speed: Fastest (simple expansion)
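The (value, count) scheme can be sketched in a few lines. Note this toy version always emits a pair, whereas the real codec stores runs shorter than 3 verbatim to limit the blow-up on unsorted data:

```python
def rle_encode(values):
    """Minimal run-length encoding sketch: collapse consecutive
    repeats into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return [tuple(r) for r in runs]

def rle_decode(runs):
    return [v for v, n in runs for _ in range(n)]

# Sorted data collapses to a handful of pairs:
runs = rle_encode(["us"] * 4 + ["eu"] * 3)
```

Sorted input produces one pair per distinct value; alternating input degenerates to one pair per value, which is exactly the failure mode shown in the bad-use-cases table below.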
Good Use Cases¶
| Use Case | Description | Expected Ratio | 10GB Scenario |
|---|---|---|---|
| Sorted partition keys | Same value for millions of rows | 100-10,000x | 10GB → 1-100MB |
| Time-bucketed data | Same hour/day for many rows | 50-500x | 10GB → 20-200MB |
| Flag columns (sorted) | 0,0,0,...,1,1,1 | 100-1000x | 10GB → 10-100MB |
| Sparse data | Mostly NULLs or zeros | 50-200x | 10GB → 50-200MB |
| Clustered keys | Same foreign key in batches | 20-100x | 10GB → 100-500MB |
Example - Sorted Partition (10GB):
-- Data sorted by region (4 regions, 2.5GB each)
CREATE TABLE events (
region TEXT, -- RLE: 10GB → ~1MB (only 4 runs!)
event_time TIMESTAMP,
data TEXT
) WITH (compression_columns = 'region:rle');
Extreme Example:
Input: 10GB of "active" status (all same value)
Runs: 1
Compressed: ~20 bytes (value + count)
Ratio: ~500,000,000x
Bad Use Cases¶
| Use Case | Why It's Bad | Expected Ratio | Recommendation |
|---|---|---|---|
| Random/unsorted | No consecutive duplicates | 0.5-1.0x | Use Dictionary |
| High cardinality unsorted | Every value different | ~0.5x (worse!) | Don't use RLE |
| Alternating values | A,B,A,B,A,B... | ~0.3x (worse!) | Use Dictionary |
| UUIDs | All unique | ~0.5x (worse!) | Don't compress |
Example - Poor Compression:
Input: 10GB of alternating true/false values (assuming 2-byte values, i.e. 5 billion of them)
Runs: 5 billion (every value starts a new run)
Compressed: ~40GB at ~8 bytes per run entry (4x LARGER than the input!)
CRITICAL: RLE makes this data WORSE
10GB Data Estimates¶
| Data Pattern | Run Count | Compression Ratio | Compressed Size | Space Saved |
|---|---|---|---|---|
| Sorted partition (4 values) | 4 | 10,000x+ | ~1 MB | 9.999 GB (99.99%) |
| Hourly buckets (8760/year) | 8,760 | 1,000x | 10 MB | 9.99 GB (99.9%) |
| Daily flags (sorted) | 365 | 5,000x | 2 MB | 9.998 GB (99.98%) |
| Clustered FK (1000 groups) | 1,000 | 500x | 20 MB | 9.98 GB (99.8%) |
| Unsorted random | 5 billion | 0.5x | 20 GB | -10 GB (WORSE) |
Delta Encoding¶
Description¶
Delta encoding stores differences between consecutive values instead of absolute values. Uses zigzag + variable-length encoding for compact storage of small deltas.
Technical Characteristics¶
- Supported Types: INT4, INT8 (32/64-bit integers)
- Encoding: Zigzag encoding for signed deltas
- Storage: Variable-length integers (1-10 bytes per delta)
- Decoding: Sequential (requires reading from start)
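The zigzag + varint pipeline can be sketched as a size estimator (a sketch only, not the on-disk format; the function names are illustrative):

```python
def zigzag(n):
    """Map signed 64-bit deltas to unsigned ints so small magnitudes
    stay small: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return (n << 1) ^ (n >> 63)

def varint_len(u):
    # Bytes for an LEB128-style varint: 7 payload bits per byte.
    return max(1, (u.bit_length() + 6) // 7)

def delta_size(values):
    """Keep the first value raw (8 bytes), then zigzag + varint encode
    each consecutive difference and sum the byte cost."""
    deltas = (b - a for a, b in zip(values, values[1:]))
    return 8 + sum(varint_len(zigzag(d)) for d in deltas)

ids = list(range(1, 1001))   # sequential IDs, every delta = 1
```

For 1000 sequential 8-byte integers this comes to 8 + 999 one-byte deltas = 1007 bytes versus 8000 bytes raw, matching the ~8x figure in the table below; full-range random deltas instead need 9-10 bytes each and erase the savings.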
Good Use Cases¶
| Use Case | Description | Expected Ratio | 10GB Scenario |
|---|---|---|---|
| Auto-increment IDs | 1, 2, 3, 4, 5... (delta=1) | 6-8x | 10GB → 1.25-1.67GB |
| Timestamps (ordered) | Regular intervals (delta~1000ms) | 4-8x | 10GB → 1.25-2.5GB |
| Counters | Monotonically increasing | 5-10x | 10GB → 1-2GB |
| Sequence numbers | 100, 101, 102... | 6-8x | 10GB → 1.25-1.67GB |
| Version numbers | 1, 2, 3... with gaps | 3-6x | 10GB → 1.67-3.3GB |
Example - Timestamps (10GB):
CREATE TABLE events (
id INT,
event_time BIGINT, -- Delta: 10GB → ~1.5GB (uniform intervals)
data TEXT
) WITH (compression_columns = 'event_time:delta');
Optimal Case - Sequential IDs:
Input: 10GB of sequential integers (1, 2, 3, 4, ...)
Base: 1, Deltas: [1, 1, 1, 1, ...]
Each delta = 1 byte (varint encoding)
Compressed: ~1.25GB (8x compression)
Bad Use Cases¶
| Use Case | Why It's Bad | Expected Ratio | Recommendation |
|---|---|---|---|
| Random integers | Large deltas need more bytes | 0.8-1.2x | Don't compress |
| Unsorted data | Deltas vary wildly | ~1.0x | Sort first or skip |
| Floating-point | Not supported | N/A | Use ALP |
| Sparse sequences | Large gaps = large deltas | 1.0-1.5x | Use Dictionary |
| Non-sequential | [1000, 5, 999999, 100] | ~1.0x | Don't use Delta |
Example - Poor Compression:
Input: 10GB of random 64-bit integers, e.g. [1000000, 5, 999999, 100, 888888]
Deltas: [-999995, 999994, -999899, 888788]
Deltas are as large as the values themselves; full-range 64-bit deltas
need 9-10 varint bytes, more than the 8 raw bytes they replace
Compressed: ~10GB (no savings, sometimes slightly larger)
10GB Data Estimates¶
| Data Pattern | Average Delta | Compression Ratio | Compressed Size | Space Saved |
|---|---|---|---|---|
| Sequential IDs (delta=1) | 1 | 8x | 1.25 GB | 8.75 GB (87.5%) |
| Timestamps (1s intervals) | 1,000 | 5x | 2 GB | 8 GB (80%) |
| Timestamps (1ms intervals) | 1 | 8x | 1.25 GB | 8.75 GB (87.5%) |
| Version numbers (gaps) | ~100 | 4x | 2.5 GB | 7.5 GB (75%) |
| Random integers | varies | 1.0x | 10 GB | 0 GB (0%) |
Codec Selection Guide¶
Decision Tree¶
Is your data...
FLOATING-POINT (FLOAT4/FLOAT8)?
├─ Yes → Use ALP
│ └─ Expected: 2-4x compression
└─ No ↓
TEXT/VARCHAR?
├─ Yes → Check cardinality
│ ├─ < 65,536 unique AND > 50% repetition → Use Dictionary (5-20x)
│ ├─ Has substring patterns → Use FSST (2-3x)
│ └─ Random/encrypted → Don't compress
└─ No ↓
INTEGER (INT4/INT8)?
├─ Yes → Check pattern
│ ├─ Sequential/sorted → Use Delta (2-10x)
│ ├─ Sorted with long runs → Use RLE (10-10000x)
│ ├─ Low cardinality → Use Dictionary (5-20x)
│ └─ Random → Don't compress
└─ No ↓
BINARY/BLOB?
└─ Don't compress (already efficient or encrypted)
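The tree above can be folded into a small helper. This is a sketch only: the arguments are column statistics assumed to be gathered beforehand, and the parameter names are illustrative, not a HeliosDB-Lite API.

```python
def choose_codec(dtype, unique_ratio=1.0, sequential=False,
                 long_runs=False, has_patterns=False):
    """Mirror of the decision tree: dtype is the SQL type name,
    unique_ratio is unique values / total values."""
    if dtype in ("FLOAT4", "FLOAT8"):
        return "alp"                       # 2-4x
    if dtype in ("TEXT", "VARCHAR"):
        if unique_ratio < 0.5:             # >50% repetition
            return "dictionary"            # 5-20x, needs <65,536 uniques
        return "fsst" if has_patterns else "none"
    if dtype in ("INT4", "INT8"):
        if sequential:
            return "delta"                 # 2-10x
        if long_runs:
            return "rle"                   # 10-10,000x
        if unique_ratio < 0.5:
            return "dictionary"
        return "none"
    return "none"                          # BINARY/BLOB: don't compress
```

A dictionary-size guard (rejecting columns with more than 65,536 distinct values) would be the obvious next refinement.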
Quick Reference Matrix¶
| Data Characteristic | Best Codec | Second Choice | Avoid |
|---|---|---|---|
| Financial prices | ALP | - | RLE |
| Status flags (sorted) | RLE | Dictionary | Delta |
| Status flags (unsorted) | Dictionary | - | RLE |
| Email addresses | FSST | Dictionary | RLE |
| Sequential IDs | Delta | RLE (if sorted) | Dictionary |
| Timestamps (ordered) | Delta | - | RLE |
| Country codes | Dictionary | FSST | RLE |
| UUIDs | - (don't compress) | FSST | RLE, Dictionary |
| Sensor readings | ALP | Delta (if integer) | - |
Performance Comparison¶
Compression Speed (GB/sec)¶
| Codec | Encode Speed | Decode Speed | Notes |
|---|---|---|---|
| RLE | 5-10 GB/s | 5-10 GB/s | Fastest, CPU-bound |
| Dictionary | 2-4 GB/s | 4-8 GB/s | Fast hash lookup |
| Delta | 3-6 GB/s | 4-8 GB/s | Simple arithmetic |
| ALP | 1-2 GB/s | 3-5 GB/s | SIMD accelerated |
| FSST | 1-3 GB/s | 1-3 GB/s | Symbol table lookup |
Memory Overhead¶
| Codec | Per-Column Overhead | Per-Value Overhead |
|---|---|---|
| RLE | 12 bytes (header) | 8 bytes/run |
| Dictionary | Dictionary size + 16 bytes | 1-4 bytes/value |
| Delta | 20 bytes (header) | 1-10 bytes/delta |
| ALP | ~100 bytes (metadata) | Variable |
| FSST | 2-3 KB (symbol table) | Variable |
SQL Configuration¶
CREATE TABLE WITH Clause¶
-- Single codec for entire table
CREATE TABLE measurements (
id INT PRIMARY KEY,
value FLOAT8,
label TEXT
) WITH (compression = 'auto');
-- Per-column codec specification
CREATE TABLE events (
id INT,
status TEXT,
event_time BIGINT,
temperature FLOAT8
) WITH (
compression = 'auto',
compression_level = 6,
compression_columns = 'status:dictionary,event_time:delta,temperature:alp'
);
ALTER TABLE Configuration¶
-- Enable/disable compression
ALTER TABLE events SET COMPRESSION = 'auto';
ALTER TABLE events SET COMPRESSION = 'none';
-- Set compression level (1-9)
ALTER TABLE events SET COMPRESSION_LEVEL = 9;
-- Configure per-column
ALTER TABLE events SET COMPRESSION_COLUMN status = 'dictionary';
ALTER TABLE events SET COMPRESSION_COLUMN event_time = 'delta';
Monitor Compression Statistics¶
-- Overall compression stats
SELECT * FROM heliosdb_compression_stats;
-- Pattern analysis
SELECT * FROM heliosdb_pattern_stats;
-- Recent compression events
SELECT * FROM heliosdb_compression_events;
-- Current configuration
SELECT * FROM heliosdb_config WHERE setting LIKE 'compression%';
Summary: 10GB Compression Estimates¶
| Codec | Best Case | Typical Case | Worst Case |
|---|---|---|---|
| ALP | 4x (2.5 GB) | 3x (3.3 GB) | 1.2x (8.3 GB) |
| FSST | 3x (3.3 GB) | 2.5x (4 GB) | 1.1x (9.1 GB) |
| Dictionary | 50x (200 MB) | 8x (1.25 GB) | 1x (10 GB) |
| RLE | 10000x+ (1 MB) | 100x (100 MB) | 0.5x (20 GB)* |
| Delta | 8x (1.25 GB) | 5x (2 GB) | 1x (10 GB) |
*RLE can make data larger if used incorrectly!
Last Updated: 2026-01-16 Version: HeliosDB-Lite v3.0.1