turboquant-vllm¶
TurboQuant KV cache compression as a drop-in vLLM plugin. 3.76x compression, near-identical output quality, one CLI flag to enable.
First open-source TurboQuant implementation (ICLR 2026) — paper to working vLLM plugin in 72 hours.
Install¶
Quick Start¶
vLLM (zero code changes)¶
The TQ4 attention backend registers automatically via vLLM's plugin system. KV cache pages are compressed to 68 bytes/token/head (vs 256 bytes FP16).
HuggingFace¶
from transformers import DynamicCache
from turboquant_vllm import CompressedDynamicCache
cache = DynamicCache()
compressed = CompressedDynamicCache(cache, head_dim=128, bits=4)
# Pass cache (not the wrapper) to model.generate()
# Compression happens transparently on every cache.update()
Benchmark Results¶
Molmo2-4B (bfloat16, 36 layers) on RTX 4090 — 11K visual tokens from 2fps video + 256 generation tokens:
| Mode | KV Cache | Compression | Output Quality | Overhead |
|---|---|---|---|---|
| FP16 baseline | 1,639 MiB | 1.0x | -- | -- |
| TQ3 (3-bit) | 845 MiB | 1.94x | ~95% cosine similarity | 2.35x |
| TQ4 (full dequant) | 435 MiB | 3.76x | ~97% cosine similarity | 3.36x |
| TQ4 (incremental) | 435 MiB | 3.76x | ~97% cosine, 100+ matching tokens | 1.78x |
How It Works¶
- Random orthogonal rotation maps each KV vector onto coordinates that follow a known Beta distribution
- Lloyd-Max scalar quantization finds optimal centroids for that distribution at 3-4 bits per coordinate
- Nibble packing stores two 4-bit indices per byte for 3.76x compression
- Incremental dequantization only decompresses new tokens each decode step, keeping overhead at 1.78x
Next Steps¶
- vLLM Plugin Guide — serving with TQ4 compression
- Container Deployment — pre-built image with turboquant-vllm baked in
- HuggingFace Guide — direct integration with DynamicCache
- API Reference — auto-generated from docstrings