Compression Codecs User Guide

HeliosDB-Lite v3.0.1 - Complete Compression Reference

This guide provides detailed information about each compression codec in HeliosDB-Lite, including optimal use cases, data characteristics, and estimated compression ratios for 10GB data scenarios.


Table of Contents

  1. Overview
  2. ALP - Adaptive Lossless floating-Point
  3. FSST - Fast Static Symbol Table
  4. Dictionary Encoding
  5. RLE - Run-Length Encoding
  6. Delta Encoding
  7. Codec Selection Guide
  8. Performance Comparison
  9. SQL Configuration

Overview

HeliosDB-Lite includes five specialized compression codecs, each optimized for different data patterns:

| Codec | Target Data | Typical Ratio | Speed |
|---|---|---|---|
| ALP | Floating-point numbers | 2-4x | Very Fast |
| FSST | Strings with patterns | 2-3x | Fast |
| Dictionary | Low-cardinality columns | 5-20x | Very Fast |
| RLE | Repetitive/sorted data | 10-100x | Fastest |
| Delta | Sequential numbers | 2-10x | Very Fast |

ALP - Adaptive Lossless floating-Point

Description

ALP (Adaptive Lossless floating-Point) is a state-of-the-art compression algorithm for IEEE 754 floating-point data. Based on ACM SIGMOD 2024 research, it automatically adapts between two strategies:

  • ALP Classic: For decimal-origin data (financial, percentages, measurements)
  • ALP-RD: For high-precision floats (scientific, ML weights)
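The core idea behind the ALP Classic strategy can be illustrated with a short sketch (this is a conceptual approximation, not HeliosDB's actual implementation): a decimal-origin double is multiplied by a power of ten and, if the result round-trips exactly, stored as a far more compressible integer.

```python
# Conceptual sketch of the ALP Classic idea (not HeliosDB's implementation):
# decimal-origin doubles become integers scaled by a power of ten, but only
# when decoding reproduces the exact IEEE 754 value (losslessness check).
def alp_classic_encode(value: float, max_exponent: int = 10):
    """Return (scaled_int, exponent) if value encodes losslessly, else None."""
    for e in range(max_exponent + 1):
        scaled = round(value * 10**e)
        # Lossless check: integer division back must yield the same double.
        if scaled / 10**e == value:
            return scaled, e
    return None  # no exact decimal form found: fall back to ALP-RD / raw

print(alp_classic_encode(99.95))              # price with 2 decimals -> (9995, 2)
print(alp_classic_encode(3.141592653589793))  # high-precision float -> None
```

Small integers like 9995 compress far better than raw 8-byte doubles (frame-of-reference plus bit-packing in the real algorithm), which is where the 2-4x ratios come from.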

Technical Characteristics

  • Encoding Speed: ~0.5 doubles per CPU cycle
  • Decoding Speed: ~2.6 doubles per CPU cycle
  • Compression: 100% lossless (zero precision loss)
  • Block Size: 1024 values (optimized for CPU cache)

Good Use Cases

| Use Case | Description | Expected Ratio | 10GB Scenario |
|---|---|---|---|
| Financial Data | Prices: $10.12, $99.95, $1234.56 | 3-4x | 10GB → 2.5-3.3GB |
| Sensor Readings | Temperature: 23.5°C, 24.1°C, 23.8°C | 3-4x | 10GB → 2.5-3.3GB |
| Percentages | Values: 0.25, 0.50, 0.75, 0.33 | 4x | 10GB → 2.5GB |
| Coordinates | GPS: -122.4194, 37.7749 | 2.5-3x | 10GB → 3.3-4GB |
| Measurements | Scientific: 9.81, 3.14159, 2.718 | 2-3x | 10GB → 3.3-5GB |

Example - Price Data (10GB):

CREATE TABLE orders (
    id INT PRIMARY KEY,
    price FLOAT8,        -- ALP: 10GB → ~2.5GB
    quantity INT
) WITH (compression = 'alp');

Bad Use Cases

| Use Case | Why It's Bad | Expected Ratio | Recommendation |
|---|---|---|---|
| Random doubles | No patterns to exploit | 1.0-1.2x | Don't compress |
| ML weights | Full precision, random distribution | 1.2-1.5x | Consider storing as binary |
| Encrypted data | Appears random | ~1.0x | Don't compress |
| Already compressed | No further reduction | ~1.0x | Store raw |

Example - Poor Compression:

Input: Random f64 values from rand::random()
Original: 10GB
Compressed: ~9GB (only 10% savings)
Overhead may not be worth it

10GB Data Estimates

| Data Pattern | Compression Ratio | Compressed Size | Space Saved |
|---|---|---|---|
| Financial prices (2 decimals) | 4.0x | 2.5 GB | 7.5 GB (75%) |
| Scientific measurements | 3.0x | 3.3 GB | 6.7 GB (67%) |
| GPS coordinates | 2.5x | 4.0 GB | 6.0 GB (60%) |
| Time-series sensor data | 3.5x | 2.9 GB | 7.1 GB (71%) |
| Random doubles | 1.2x | 8.3 GB | 1.7 GB (17%) |

FSST - Fast Static Symbol Table

Description

FSST (Fast Static Symbol Table) is a lightweight string compression algorithm that encodes common substrings (1-8 bytes) using a symbol table trained on sample data. It provides random access to individual strings.
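A greatly simplified toy version of the symbol-table idea can be sketched as follows. Real FSST trains up to 255 symbols of 1-8 bytes on sampled data and encodes via a fast lookup table; the hard-coded `symbols` dictionary and `fsst_like_encode` name here are purely illustrative.

```python
# Toy sketch of FSST's symbol-table idea (NOT the real algorithm):
# frequent substrings are replaced by a single-byte code; bytes with no
# matching symbol are escaped as a 2-byte (marker, literal) pair.
symbols = {0: b"@example.com", 1: b"user", 2: b".org"}  # hypothetical table

def fsst_like_encode(s: bytes, symbols) -> bytes:
    out = bytearray()
    i = 0
    while i < len(s):
        for code, sym in symbols.items():
            if s.startswith(sym, i):
                out.append(code)        # one byte replaces the whole substring
                i += len(sym)
                break
        else:
            out += bytes([255, s[i]])   # escape marker + literal byte
            i += 1
    return bytes(out)

msg = b"user42@example.com"
enc = fsst_like_encode(msg, symbols)
print(len(msg), len(enc))  # 18 6
```

Because each string is encoded symbol by symbol, any individual string can be decoded on its own, which is how FSST preserves random access.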

Technical Characteristics

  • Compression Speed: 1-3 GB/sec
  • Decompression Speed: 1-3 GB/sec
  • Symbol Table Size: ~2-3 KB per column
  • Random Access: Yes (individual strings decompressible)

Good Use Cases

| Use Case | Description | Expected Ratio | 10GB Scenario |
|---|---|---|---|
| Email Addresses | user@example.com patterns | 2.5-3x | 10GB → 3.3-4GB |
| URLs | https://example.com/path patterns | 2-3x | 10GB → 3.3-5GB |
| Log Messages | Repetitive log formats | 2.5-3x | 10GB → 3.3-4GB |
| JSON Records | Structured text patterns | 2-2.5x | 10GB → 4-5GB |
| File Paths | /home/user/docs patterns | 2.5-3x | 10GB → 3.3-4GB |
| Request Logs | GET /api/v1/users patterns | 2.5-3.5x | 10GB → 2.9-4GB |

Example - Email Data (10GB):

CREATE TABLE users (
    id INT PRIMARY KEY,
    email TEXT,          -- FSST: 10GB → ~3.5GB
    name TEXT
) WITH (compression = 'fsst');

Bad Use Cases

| Use Case | Why It's Bad | Expected Ratio | Recommendation |
|---|---|---|---|
| UUIDs | No substring patterns | 1.1-1.3x | Use binary storage |
| Base64 data | Uniform character distribution | 1.0-1.2x | Store raw |
| Hashes (SHA/MD5) | Random character patterns | ~1.0x | Store raw |
| Encrypted text | No compressible patterns | ~1.0x | Don't compress |
| Random strings | No common substrings | 1.0-1.2x | Don't compress |

Example - Poor Compression:

Input: 10GB of UUIDs (550a8400-e29b-41d4-a716-446655440000)
Compressed: ~8.5GB (only 15% savings)
Symbol table overhead may exceed savings

10GB Data Estimates

| Data Pattern | Compression Ratio | Compressed Size | Space Saved |
|---|---|---|---|
| Email addresses (common domains) | 3.0x | 3.3 GB | 6.7 GB (67%) |
| URLs (same site) | 2.5x | 4.0 GB | 6.0 GB (60%) |
| Server logs (structured) | 3.0x | 3.3 GB | 6.7 GB (67%) |
| JSON records | 2.0x | 5.0 GB | 5.0 GB (50%) |
| UUIDs | 1.2x | 8.3 GB | 1.7 GB (17%) |
| Random text | 1.1x | 9.1 GB | 0.9 GB (9%) |

Dictionary Encoding

Description

Dictionary encoding replaces repeated values with compact integer indices into a dictionary of unique values. Ideal for columns with few unique values (low cardinality).
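The mechanism can be sketched in a few lines of Python (an illustration of the technique, not HeliosDB's on-disk format):

```python
# Dictionary encoding sketch: replace each value with an index into a
# table of unique values. With <= 256 unique values, each index fits in
# 1 byte instead of a full string.
def dict_encode(values):
    dictionary = []   # index -> unique value
    index_of = {}     # unique value -> index
    codes = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index_of[v])
    return dictionary, codes

def dict_decode(dictionary, codes):
    return [dictionary[c] for c in codes]

col = ["active", "pending", "active", "active", "inactive", "pending"]
dictionary, codes = dict_encode(col)
print(dictionary)  # ['active', 'pending', 'inactive']
print(codes)       # [0, 1, 0, 0, 2, 1]
assert dict_decode(dictionary, codes) == col
```

Three unique values means every row costs 1 byte instead of a 6-8 byte string, which is where the large ratios for status-like columns come from.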

Technical Characteristics

  • Max Dictionary Size: 65,536 unique values
  • Index Width: 1, 2, or 4 bytes (auto-selected)
  • Encoding Speed: Very fast (hash lookup)
  • Decoding Speed: Very fast (array index)

Good Use Cases

| Use Case | Description | Expected Ratio | 10GB Scenario |
|---|---|---|---|
| Status Fields | active/inactive/pending | 10-50x | 10GB → 200MB-1GB |
| Country Codes | US, UK, DE, FR (~200 values) | 5-10x | 10GB → 1-2GB |
| Category Tags | electronics/clothing/food | 8-20x | 10GB → 500MB-1.25GB |
| Boolean-like | yes/no, true/false | 50-100x | 10GB → 100-200MB |
| Day of Week | Mon, Tue, Wed... (7 values) | 15-30x | 10GB → 333-666MB |
| Enum Fields | Predefined value sets | 10-50x | 10GB → 200MB-1GB |

Example - Status Field (10GB):

CREATE TABLE orders (
    id INT PRIMARY KEY,
    status TEXT,         -- Dictionary: 10GB → ~200MB (3 unique values)
    product_id INT
) WITH (compression_columns = 'status:dictionary');

Bad Use Cases

| Use Case | Why It's Bad | Expected Ratio | Recommendation |
|---|---|---|---|
| High cardinality | >50% unique values | 0.8-1.5x | Use FSST |
| User IDs | Mostly unique values | ~1.0x | Don't compress |
| Timestamps | All different | ~1.0x | Use Delta |
| Free-text fields | High uniqueness | ~1.0x | Use FSST |
| >65,536 unique | Exceeds dictionary limit | Fails | Use FSST |

Example - Poor Compression:

Input: 10GB of unique user IDs (user_12345, user_12346, ...)
Dictionary Size: Would exceed 65,536 limit or be nearly 1:1
Recommendation: Use FSST or store uncompressed

10GB Data Estimates

| Data Pattern | Unique Values | Compression Ratio | Compressed Size | Space Saved |
|---|---|---|---|---|
| Status (3 values) | 3 | 50x | 200 MB | 9.8 GB (98%) |
| Country codes | 200 | 8x | 1.25 GB | 8.75 GB (87.5%) |
| Product categories | 500 | 6x | 1.67 GB | 8.33 GB (83.3%) |
| User types | 10 | 20x | 500 MB | 9.5 GB (95%) |
| City names | 10,000 | 4x | 2.5 GB | 7.5 GB (75%) |
| Unique emails | 1,000,000+ | ~1.0x | ~10 GB | ~0 GB (0%) |

RLE - Run-Length Encoding

Description

Run-Length Encoding compresses sequences of repeated values into (value, count) pairs. Extremely effective for sorted columns or data with long runs of identical values.
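A minimal version of the technique looks like this (a sketch only; HeliosDB's actual format adds headers and a minimum run length):

```python
# Run-length encoding sketch: collapse consecutive duplicates into
# (value, count) pairs, then expand them back on decode.
def rle_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1              # extend the current run
        else:
            runs.append([v, 1])           # start a new run
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out

data = ["us"] * 4 + ["eu"] * 3 + ["apac"] * 2
runs = rle_encode(data)
print(runs)  # [('us', 4), ('eu', 3), ('apac', 2)]
assert rle_decode(runs) == data
```

Nine input values collapse to three pairs here; a sorted 10GB partition column with four distinct values collapses to four pairs, which is why the ratios below are so extreme.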

Technical Characteristics

  • Minimum Run Length: 3 (shorter runs stored verbatim)
  • Maximum Run Length: 4.2 billion per entry
  • Encoding Speed: Fastest (simple counting)
  • Decoding Speed: Fastest (simple expansion)

Good Use Cases

| Use Case | Description | Expected Ratio | 10GB Scenario |
|---|---|---|---|
| Sorted partition keys | Same value for millions of rows | 100-10,000x | 10GB → 1-100MB |
| Time-bucketed data | Same hour/day for many rows | 50-500x | 10GB → 20-200MB |
| Flag columns (sorted) | 0,0,0,...,1,1,1 | 100-1,000x | 10GB → 10-100MB |
| Sparse data | Mostly NULLs or zeros | 50-200x | 10GB → 50-200MB |
| Clustered keys | Same foreign key in batches | 20-100x | 10GB → 100-500MB |

Example - Sorted Partition (10GB):

-- Data sorted by region (4 regions, 2.5GB each)
CREATE TABLE events (
    region TEXT,         -- RLE: 10GB → ~1MB (only 4 runs!)
    event_time TIMESTAMP,
    data TEXT
) WITH (compression_columns = 'region:rle');

Extreme Example:

Input: 10GB of "active" status (all same value)
Runs: 1
Compressed: ~20 bytes (value + count)
Ratio: ~500,000,000x

Bad Use Cases

| Use Case | Why It's Bad | Expected Ratio | Recommendation |
|---|---|---|---|
| Random/unsorted | No consecutive duplicates | 0.5-1.0x | Use Dictionary |
| High cardinality, unsorted | Every value different | ~0.5x (worse!) | Don't use RLE |
| Alternating values | A,B,A,B,A,B... | ~0.3x (worse!) | Use Dictionary |
| UUIDs | All unique | ~0.5x (worse!) | Don't compress |

Example - Poor Compression:

Input: 10GB of alternating true/false values
Runs: 5 billion (one per value)
Compressed: ~40GB (4x LARGER due to overhead!)
CRITICAL: RLE makes this data WORSE

10GB Data Estimates

| Data Pattern | Run Count | Compression Ratio | Compressed Size | Space Saved |
|---|---|---|---|---|
| Sorted partition (4 values) | 4 | 10,000x+ | ~1 MB | 9.999 GB (99.99%) |
| Hourly buckets (8,760/year) | 8,760 | 1,000x | 10 MB | 9.99 GB (99.9%) |
| Daily flags (sorted) | 365 | 5,000x | 2 MB | 9.998 GB (99.98%) |
| Clustered FK (1,000 groups) | 1,000 | 500x | 20 MB | 9.98 GB (99.8%) |
| Unsorted random | 5 billion | 0.5x | 20 GB | -10 GB (WORSE) |

Delta Encoding

Description

Delta encoding stores differences between consecutive values instead of absolute values. Uses zigzag + variable-length encoding for compact storage of small deltas.
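The zigzag + varint scheme described above can be sketched as follows (an illustration of the standard technique, not HeliosDB's exact byte layout):

```python
# Delta encoding sketch: store the first value, then zigzag + varint-encode
# each difference. Small deltas (e.g. +1) cost a single byte.
def zigzag(n: int) -> int:
    # Map signed -> unsigned: 0,-1,1,-2,2 -> 0,1,2,3,4 (64-bit convention)
    return (n << 1) ^ (n >> 63)

def varint(u: int) -> bytes:
    # Little-endian base-128 varint: 7 data bits per byte, MSB = "more"
    out = bytearray()
    while True:
        byte = u & 0x7F
        u >>= 7
        if u:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def delta_encode(values):
    buf = bytearray(varint(zigzag(values[0])))        # base value
    for prev, cur in zip(values, values[1:]):
        buf += varint(zigzag(cur - prev))             # one varint per delta
    return bytes(buf)

ids = list(range(1, 1001))          # sequential IDs, every delta = 1
encoded = delta_encode(ids)
print(len(encoded))  # 1000 bytes, vs 8000 bytes as raw 64-bit ints (8x)
```

Zigzag interleaving keeps negative deltas small too (-1 maps to 1, not to a huge unsigned value), so slightly non-monotonic sequences still compress well.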

Technical Characteristics

  • Supported Types: INT4, INT8 (32/64-bit integers)
  • Encoding: Zigzag encoding for signed deltas
  • Storage: Variable-length integers (1-10 bytes per delta)
  • Decoding: Sequential (requires reading from start)

Good Use Cases

| Use Case | Description | Expected Ratio | 10GB Scenario |
|---|---|---|---|
| Auto-increment IDs | 1, 2, 3, 4, 5... (delta=1) | 6-8x | 10GB → 1.25-1.67GB |
| Timestamps (ordered) | Regular intervals (delta ~1000ms) | 4-8x | 10GB → 1.25-2.5GB |
| Counters | Monotonically increasing | 5-10x | 10GB → 1-2GB |
| Sequence numbers | 100, 101, 102... | 6-8x | 10GB → 1.25-1.67GB |
| Version numbers | 1, 2, 3... with gaps | 3-6x | 10GB → 1.67-3.3GB |

Example - Timestamps (10GB):

CREATE TABLE events (
    id INT,
    event_time BIGINT,   -- Delta: 10GB → ~1.5GB (uniform intervals)
    data TEXT
) WITH (compression_columns = 'event_time:delta');

Optimal Case - Sequential IDs:

Input: 10GB of sequential integers (1, 2, 3, 4, ...)
Base: 1, Deltas: [1, 1, 1, 1, ...]
Each delta = 1 byte (varint encoding)
Compressed: ~1.25GB (8x compression)

Bad Use Cases

| Use Case | Why It's Bad | Expected Ratio | Recommendation |
|---|---|---|---|
| Random integers | Large deltas need more bytes | 0.8-1.2x | Don't compress |
| Unsorted data | Deltas vary wildly | ~1.0x | Sort first or skip |
| Floating-point | Not supported | N/A | Use ALP |
| Sparse sequences | Large gaps = large deltas | 1.0-1.5x | Use Dictionary |
| Non-sequential | [1000, 5, 999999, 100] | ~1.0x | Don't use Delta |

Example - Poor Compression:

Input: 10GB of random integers [1000000, 5, 999999, 100, 888888]
Deltas: [-999995, 999994, -999899, 888788]
Each delta = 4-5 bytes (large varints)
Compressed: ~10GB (no savings)

10GB Data Estimates

| Data Pattern | Average Delta | Compression Ratio | Compressed Size | Space Saved |
|---|---|---|---|---|
| Sequential IDs (delta=1) | 1 | 8x | 1.25 GB | 8.75 GB (87.5%) |
| Timestamps (1s intervals, ms units) | 1,000 | 5x | 2 GB | 8 GB (80%) |
| Timestamps (1ms intervals, ms units) | 1 | 8x | 1.25 GB | 8.75 GB (87.5%) |
| Version numbers (gaps) | ~100 | 4x | 2.5 GB | 7.5 GB (75%) |
| Random integers | varies | 1.0x | 10 GB | 0 GB (0%) |

Codec Selection Guide

Decision Tree

Is your data...

FLOATING-POINT (FLOAT4/FLOAT8)?
├─ Yes → Use ALP
│   └─ Expected: 2-4x compression
└─ No ↓

TEXT/VARCHAR?
├─ Yes → Check cardinality
│   ├─ < 65,536 unique AND > 50% repetition → Use Dictionary (5-20x)
│   ├─ Has substring patterns → Use FSST (2-3x)
│   └─ Random/encrypted → Don't compress
└─ No ↓

INTEGER (INT4/INT8)?
├─ Yes → Check pattern
│   ├─ Sequential/sorted → Use Delta (2-10x)
│   ├─ Sorted with long runs → Use RLE (10-10000x)
│   ├─ Low cardinality → Use Dictionary (5-20x)
│   └─ Random → Don't compress
└─ No ↓

BINARY/BLOB?
└─ Don't compress (already efficient or encrypted)
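The decision tree above can be expressed as a small helper function. The function name, parameters, and the 50% cardinality threshold are illustrative, not part of HeliosDB's API:

```python
# The codec decision tree as a helper (illustrative names/thresholds):
def pick_codec(dtype, unique_ratio=None, sorted_runs=False, sequential=False):
    if dtype in ("FLOAT4", "FLOAT8"):
        return "alp"                       # 2-4x expected
    if dtype in ("TEXT", "VARCHAR"):
        if unique_ratio is not None and unique_ratio < 0.5:
            return "dictionary"            # low cardinality: 5-20x
        return "fsst"                      # assumes substring patterns exist
    if dtype in ("INT4", "INT8"):
        if sorted_runs:
            return "rle"                   # long runs: 10-10,000x
        if sequential:
            return "delta"                 # small deltas: 2-10x
        if unique_ratio is not None and unique_ratio < 0.5:
            return "dictionary"
        return "none"                      # random integers: don't compress
    return "none"                          # BINARY/BLOB: don't compress

print(pick_codec("FLOAT8"))                    # alp
print(pick_codec("TEXT", unique_ratio=0.001))  # dictionary
print(pick_codec("INT8", sequential=True))     # delta
```

In practice `compression = 'auto'` performs this analysis for you; a helper like this is mainly useful when forcing per-column codecs via `compression_columns`.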

Quick Reference Matrix

| Data Characteristic | Best Codec | Second Choice | Avoid |
|---|---|---|---|
| Financial prices | ALP | - | RLE |
| Status flags (sorted) | RLE | Dictionary | Delta |
| Status flags (unsorted) | Dictionary | - | RLE |
| Email addresses | FSST | Dictionary | RLE |
| Sequential IDs | Delta | RLE (if sorted) | Dictionary |
| Timestamps (ordered) | Delta | - | RLE |
| Country codes | Dictionary | FSST | RLE |
| UUIDs | - (don't compress) | FSST | RLE, Dictionary |
| Sensor readings | ALP | Delta (if integer) | - |

Performance Comparison

Compression Speed (GB/sec)

| Codec | Encode Speed | Decode Speed | Notes |
|---|---|---|---|
| RLE | 5-10 GB/s | 5-10 GB/s | Fastest, CPU-bound |
| Dictionary | 2-4 GB/s | 4-8 GB/s | Fast hash lookup |
| Delta | 3-6 GB/s | 4-8 GB/s | Simple arithmetic |
| ALP | 1-2 GB/s | 3-5 GB/s | SIMD accelerated |
| FSST | 1-3 GB/s | 1-3 GB/s | Symbol table lookup |

Memory Overhead

| Codec | Per-Column Overhead | Per-Value Overhead |
|---|---|---|
| RLE | 12 bytes (header) | 8 bytes/run |
| Dictionary | Dictionary size + 16 bytes | 1-4 bytes/value |
| Delta | 20 bytes (header) | 1-10 bytes/delta |
| ALP | ~100 bytes (metadata) | Variable |
| FSST | 2-3 KB (symbol table) | Variable |

SQL Configuration

CREATE TABLE WITH Clause

-- Single codec for entire table
CREATE TABLE measurements (
    id INT PRIMARY KEY,
    value FLOAT8,
    label TEXT
) WITH (compression = 'auto');

-- Per-column codec specification
CREATE TABLE events (
    id INT,
    status TEXT,
    event_time BIGINT,
    temperature FLOAT8
) WITH (
    compression = 'auto',
    compression_level = 6,
    compression_columns = 'status:dictionary,event_time:delta,temperature:alp'
);

ALTER TABLE Configuration

-- Enable/disable compression
ALTER TABLE events SET COMPRESSION = 'auto';
ALTER TABLE events SET COMPRESSION = 'none';

-- Set compression level (1-9)
ALTER TABLE events SET COMPRESSION_LEVEL = 9;

-- Configure per-column
ALTER TABLE events SET COMPRESSION_COLUMN status = 'dictionary';
ALTER TABLE events SET COMPRESSION_COLUMN event_time = 'delta';

Monitor Compression Statistics

-- Overall compression stats
SELECT * FROM heliosdb_compression_stats;

-- Pattern analysis
SELECT * FROM heliosdb_pattern_stats;

-- Recent compression events
SELECT * FROM heliosdb_compression_events;

-- Current configuration
SELECT * FROM heliosdb_config WHERE setting LIKE 'compression%';

Summary: 10GB Compression Estimates

| Codec | Best Case | Typical Case | Worst Case |
|---|---|---|---|
| ALP | 4x (2.5 GB) | 3x (3.3 GB) | 1.2x (8.3 GB) |
| FSST | 3x (3.3 GB) | 2.5x (4 GB) | 1.1x (9.1 GB) |
| Dictionary | 50x (200 MB) | 8x (1.25 GB) | 1x (10 GB) |
| RLE | 10,000x+ (1 MB) | 100x (100 MB) | 0.5x (20 GB)* |
| Delta | 8x (1.25 GB) | 5x (2 GB) | 1x (10 GB) |

*RLE can make data larger if used incorrectly!


Last Updated: 2026-01-16 Version: HeliosDB-Lite v3.0.1