Architecture

Module Map

flowchart TD
    subgraph api ["Public API — __init__.py"]
        direction LR
        API([turboquant_vllm])
    end

    subgraph core ["Core Quantization"]
        direction LR
        LM["lloyd_max.py\nOptimal scalar codebook"]
        QZ["quantizer.py\nTurboQuantMSE · TurboQuantProd"]
    end

    subgraph wrappers ["Production Wrappers"]
        direction LR
        CP["compressors.py\nTensor shape wrappers"]
        KV["kv_cache.py\nDynamicCache integration"]
    end

    subgraph triton ["Triton Kernels"]
        direction LR
        FA["flash_attention.py\nVanilla FA v2"]
        TQ["flash_attention_tq4_kv.py\nFused TQ4 K+V"]
    end

    subgraph vllm ["vLLM Plugin"]
        BE["tq4_backend.py\nTQ4AttentionBackend"]
    end

    LM --> QZ --> CP --> KV
    KV --> API
    QZ --> triton
    KV --> BE
    BE --> API

Dependency Flow

Strict DAG — no circular dependencies:

lloyd_max → quantizer → compressors → kv_cache → benchmark
                 ↓
            triton/ (fused GPU kernels)
                 ↓
            vllm/ (serving plugin)
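
The "strict DAG" claim can be checked mechanically. The sketch below is illustrative only: it copies the edges from the diagram above, reads each arrow as "is depended on by", and uses the standard-library `graphlib` module. The `DEPENDS_ON` table and the `import_order` helper are hypothetical names, not part of the package.

```python
from graphlib import CycleError, TopologicalSorter

# Edges transcribed from the diagram: module -> modules it depends on.
DEPENDS_ON = {
    "lloyd_max": set(),
    "quantizer": {"lloyd_max"},
    "compressors": {"quantizer"},
    "kv_cache": {"compressors"},
    "benchmark": {"kv_cache"},
    "triton": {"quantizer"},
    "vllm": {"triton"},
}


def import_order(graph):
    """Return a valid import order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(graph).static_order())


order = import_order(DEPENDS_ON)
print(order)

# A deliberately broken graph shows what a circular dependency looks like.
try:
    import_order({"a": {"b"}, "b": {"a"}})
except CycleError as exc:
    print("cycle detected:", exc.args[1])
```

A check like this could run in CI against the real import graph, so that a new circular import fails fast instead of surfacing at runtime.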

Data Flow

Compression (cache.update)

input tensor (batch, heads, seq, head_dim)
    → normalize (extract fp32 norms)
    → rotate (Haar-random orthogonal matrix)
    → quantize (Lloyd-Max codebook lookup → uint8 indices)
    → nibble pack (two 4-bit indices per byte)
    → store (uint8 indices + fp32 norms)
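
The compression steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the package's implementation: `haar_orthogonal` and `compress` are hypothetical names, the codebook is a uniform placeholder grid standing in for the Lloyd-Max codebook, and the nibble layout (even positions in the low nibble) is assumed.

```python
import numpy as np


def haar_orthogonal(dim: int, seed: int = 0) -> np.ndarray:
    """Sample a Haar-distributed orthogonal matrix via QR of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Rescale columns by the sign of diag(R) so the result is Haar-distributed.
    return q * np.sign(np.diag(r))


def compress(x, rotation, codebook):
    """normalize -> rotate -> quantize -> nibble pack (last axis must be even)."""
    x = np.asarray(x, dtype=np.float32)
    norms = np.linalg.norm(x, axis=-1, keepdims=True)   # fp32 norms, one per vector
    unit = x / np.maximum(norms, 1e-12)                 # guard against zero vectors
    rotated = unit @ rotation.astype(np.float32).T
    # Nearest-centroid lookup yields one 4-bit index per element.
    idx = np.abs(rotated[..., None] - codebook).argmin(axis=-1).astype(np.uint8)
    # Assumed layout: even positions in the low nibble, odd positions in the high nibble.
    packed = idx[..., 0::2] | (idx[..., 1::2] << 4)
    return packed, norms


batch, heads, seq, head_dim = 1, 2, 3, 128
x = np.random.default_rng(1).standard_normal((batch, heads, seq, head_dim)).astype(np.float16)
rotation = haar_orthogonal(head_dim)
# Placeholder uniform grid; the real pipeline would use the Lloyd-Max codebook.
codebook = (np.linspace(-3.0, 3.0, 16) / np.sqrt(head_dim)).astype(np.float32)

packed, norms = compress(x, rotation, codebook)
ratio = x.nbytes / (packed.nbytes + norms.nbytes)
print(packed.shape, packed.dtype, norms.dtype, f"{ratio:.2f}")
# (1, 2, 3, 64) uint8 float32 3.76
```

Assuming an fp16 baseline and `head_dim = 128`, each vector shrinks from 256 bytes to 64 packed bytes plus a 4-byte norm, and 256 / 68 ≈ 3.76, which is consistent with the compression figure quoted under Design Decisions.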

Decompression (cache read)

stored (uint8 indices + fp32 norms)
    → nibble unpack
    → centroid lookup (uint8 → float via codebook)
    → inverse rotate
    → scale (multiply by stored norms)
    → output tensor (original dtype)
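
The read path can be sketched the same way. The block below is a hedged illustration rather than the package's code: `decompress` is a hypothetical name, the low nibble is assumed to hold the even-indexed element, and the forward rotation is assumed to have been applied as `unit @ rotation.T`, so multiplying by `rotation` inverts it.

```python
import numpy as np


def decompress(packed, norms, rotation, codebook, out_dtype=np.float16):
    """nibble unpack -> centroid lookup -> inverse rotate -> scale."""
    low = packed & 0x0F                      # even positions (assumed layout)
    high = packed >> 4                       # odd positions
    # Interleave low/high nibbles back into the original element order.
    idx = np.stack([low, high], axis=-1).reshape(*packed.shape[:-1], -1)
    values = codebook[idx]                   # uint8 index -> float centroid
    # For an orthogonal matrix the inverse is the transpose, so this undoes
    # a forward rotation of the form `unit @ rotation.T`.
    unrotated = values @ rotation
    return (unrotated * norms).astype(out_dtype)


# Toy example: identity rotation and a codebook where centroid i equals i.
packed = np.array([[0xF0, 0x21]], dtype=np.uint8)   # indices 0, 15, 1, 2
norms = np.array([[2.0]], dtype=np.float32)
codebook = np.arange(16, dtype=np.float32)
out = decompress(packed, norms, np.eye(4), codebook)
print(out)   # [[ 0. 30.  2.  4.]]
```

With a real rotation and codebook the output only approximates the original tensor; the toy values here are chosen so that each step can be followed by hand.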

Design Decisions

| Decision | Rationale |
| --- | --- |
| MSE-only for drop-in mode | QJL correction invisible to standard `Q @ K.T` attention |
| TQ4 nibble packing over TQ3 | Trivial pack/unpack, 3.76x compression, ~97% quality |
| fp32 norms, not fp16 | fp16 precision loss compounds across 36 layers at 10K+ tokens |
| Non-invasive monkey-patching | Avoids subclassing DynamicCache across transformers versions |
| `@lru_cache` on Lloyd-Max | 64 compressor instances share one codebook computation |
| Incremental dequantization | Only new tokens dequantized per decode step |
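
To illustrate the `@lru_cache` decision, here is a minimal standard-library sketch of a cached Lloyd-Max solver. It assumes a standard normal source, which may differ from the distribution the package targets, and `lloyd_max_codebook` is a hypothetical name and signature.

```python
import math
from functools import lru_cache
from typing import Tuple


def _pdf(x: float) -> float:
    """Standard normal density; evaluates to 0.0 at +/- infinity."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


@lru_cache(maxsize=None)
def lloyd_max_codebook(bits: int, iterations: int = 200) -> Tuple[float, ...]:
    """Approximate Lloyd-Max centroids for a standard normal source."""
    levels = 2 ** bits
    # Start from evenly spaced centroids on [-3, 3].
    centroids = [-3.0 + 6.0 * (i + 0.5) / levels for i in range(levels)]
    for _ in range(iterations):
        # Decision boundaries sit midway between neighbouring centroids.
        edges = [-math.inf]
        edges += [(a + b) / 2.0 for a, b in zip(centroids, centroids[1:])]
        edges.append(math.inf)
        # Each centroid moves to the conditional mean of its interval.
        centroids = [
            (_pdf(lo) - _pdf(hi)) / (_cdf(hi) - _cdf(lo))
            for lo, hi in zip(edges, edges[1:])
        ]
    return tuple(centroids)


# 64 compressor instances asking for the same 4-bit codebook.
codebooks = [lloyd_max_codebook(4) for _ in range(64)]
info = lloyd_max_codebook.cache_info()
print(info.misses, info.hits)   # 1 63
```

Returning a tuple keeps the cached value immutable, so one instance cannot accidentally modify the codebook that the others share.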

For the full architecture deep-dive, see docs/ARCHITECTURE.md.