turboquant_vllm ¶
TurboQuant KV cache compression for consumer GPUs.
Implements Google's TurboQuant algorithm (ICLR 2026) for compressing transformer key-value caches to 3-4 bits per coordinate with near-zero accuracy loss. Designed for benchmarking on consumer hardware (RTX 4090).
Reference: arXiv 2504.19874 — "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate"
Attributes:

| Name | Type | Description |
|---|---|---|
| `CompressedDynamicCache` | | KV cache with real VRAM savings (uint8 + fp32). |
| `TurboQuantKVCache` | | Accuracy-only KV cache wrapper (no VRAM savings). |
| `TurboQuantCompressorMSE` | | Value cache compressor (MSE-optimal). |
| `TurboQuantCompressorV2` | | Key cache compressor (QJL-corrected). |
| `TurboQuantMSE` | | Stage 1 quantizer (rotation + Lloyd-Max). |
| `TurboQuantProd` | | Stage 1 + 2 quantizer (MSE + QJL). |
| `LloydMaxCodebook` | | Precomputed optimal scalar quantizer. |
| `solve_lloyd_max` | `tuple[Tensor, Tensor]` | Factory for Lloyd-Max codebooks (cached). |
Examples:

```python
from turboquant_vllm import TurboQuantKVCache

wrapper = TurboQuantKVCache(cache, head_dim=128, bits=3)
```
See Also

- :mod:`turboquant_vllm.benchmark`: CLI harness for benchmarking.
- :mod:`turboquant_vllm.lloyd_max`: Lloyd-Max codebook solver.
Classes¶
TurboQuantCompressorMSE ¶
Value cache compressor with MSE-optimal reconstruction.
Uses Stage 1 only (TurboQuantMSE) for value vectors. Values appear
in the softmax(scores) @ V multiplication where reconstruction
quality matters but inner-product structure does not.
Attributes:

| Name | Type | Description |
|---|---|---|
| `quantizer` | `TurboQuantMSE` | TurboQuantMSE instance. |
| `bits` | `int` | Bits per coordinate. |
| `head_dim` | `int` | Model head dimension. |
Examples:

Compress and reconstruct value tensors:

```python
comp = TurboQuantCompressorMSE(head_dim=128, bits=3)
compressed = comp.compress(value_states)
reconstructed = comp.decompress(compressed)
```
Initialize the value compressor.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `head_dim` | `int` | Dimension of each attention head. | *required* |
| `bits` | `int` | Bits per coordinate (default 3). | `3` |
| `seed` | `int` | Random seed for reproducibility. | `42` |
Source code in src/turboquant_vllm/compressors.py
Functions¶
compress ¶
```python
compress(values: Tensor) -> CompressedValues
```

Compress value tensors.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `values` | `Tensor` | Value tensor of shape (batch, heads, seq_len, head_dim). | *required* |

Returns:

| Type | Description |
|---|---|
| `CompressedValues` | CompressedValues containing indices and norms. |
Source code in src/turboquant_vllm/compressors.py
decompress ¶
```python
decompress(compressed: CompressedValues) -> Tensor
```

Reconstruct value tensors from compressed representation.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `compressed` | `CompressedValues` | CompressedValues from compress(). | *required* |

Returns:

| Type | Description |
|---|---|
| `Tensor` | Reconstructed value tensor in the original dtype. |
Source code in src/turboquant_vllm/compressors.py
TurboQuantCompressorV2 ¶
Key cache compressor with unbiased attention score estimation.
Uses the full two-stage TurboQuantProd algorithm to compress key vectors while preserving accurate inner product estimation for attention computation (Q·K^T).
Attributes:

| Name | Type | Description |
|---|---|---|
| `quantizer` | `TurboQuantProd` | Two-stage TurboQuantProd instance. |
| `bits` | `int` | Total bit budget per coordinate. |
| `head_dim` | `int` | Model head dimension. |
Examples:

Compress keys and compute attention scores directly:

```python
comp = TurboQuantCompressorV2(head_dim=128, bits=3)
compressed = comp.compress(key_states)
scores = comp.asymmetric_attention_scores(query, compressed)
```
Initialize the key compressor.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `head_dim` | `int` | Dimension of each attention head. | *required* |
| `bits` | `int` | Total bits per coordinate (default 3). | `3` |
| `seed` | `int` | Random seed for reproducibility. | `42` |
Source code in src/turboquant_vllm/compressors.py
Functions¶
compress ¶
```python
compress(keys: Tensor) -> CompressedKeys
```

Compress key tensors.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `keys` | `Tensor` | Key tensor of shape (batch, heads, seq_len, head_dim). | *required* |

Returns:

| Type | Description |
|---|---|
| `CompressedKeys` | CompressedKeys containing all components for attention estimation. |
Source code in src/turboquant_vllm/compressors.py
decompress ¶
```python
decompress(compressed: CompressedKeys) -> Tensor
```

Reconstruct key tensors from compressed representation.

Note: For attention, prefer asymmetric_attention_scores() which
uses the QJL-corrected inner product estimator for better accuracy.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `compressed` | `CompressedKeys` | CompressedKeys from compress(). | *required* |

Returns:

| Type | Description |
|---|---|
| `Tensor` | Reconstructed key tensor in the original dtype. |
Source code in src/turboquant_vllm/compressors.py
asymmetric_attention_scores ¶
```python
asymmetric_attention_scores(query: Tensor, compressed: CompressedKeys) -> Tensor
```

Compute attention scores directly from compressed keys.

Uses the unbiased two-stage inner product estimator rather than decompressing keys and computing standard dot products. This is both more memory-efficient and more accurate.

.. warning:: MEMORY SCALING

    The current implementation expands tensors to
    (batch, heads, q_len, kv_len, dim) for broadcasting,
    allocating roughly five intermediate tensors at that shape.
    At realistic sequence lengths (kv_len=6144, heads=32, dim=128)
    this would use 500+ MB per call, so it is suitable only for
    correctness testing on short sequences.

    TODO: Replace with a chunked or fused Triton kernel for
    production use at real sequence lengths.
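Until such a kernel exists, the chunked fallback the TODO describes can be sketched in a few lines: iterate over kv_len in blocks so only one block's intermediates are live at a time. A NumPy illustration with a plain matmul standing in for the inner-product estimator (the function name, shapes, and chunk size are all illustrative, not the library's API):

```python
import numpy as np

def chunked_scores(q: np.ndarray, k: np.ndarray, chunk: int = 1024) -> np.ndarray:
    """Compute q @ k.T over kv_len in chunks to bound peak memory."""
    kv_len = k.shape[0]
    out = np.empty((q.shape[0], kv_len), dtype=q.dtype)
    for start in range(0, kv_len, chunk):
        stop = min(start + chunk, kv_len)
        # Only this (q_len, chunk) slice of intermediates is live at once
        out[:, start:stop] = q @ k[start:stop].T
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 128))
k = rng.standard_normal((300, 128))
assert np.allclose(chunked_scores(q, k, chunk=64), q @ k.T)
```

The same loop structure applies when the per-block computation is the QJL-corrected estimator instead of a matmul; peak memory then scales with the chunk size rather than with kv_len.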
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
query
|
Tensor
|
Query tensor, shape (batch, heads, q_len, head_dim). |
required |
compressed
|
CompressedKeys
|
CompressedKeys from compress(). |
required |
Returns:
| Type | Description |
|---|---|
Tensor
|
Attention logits, shape (batch, heads, q_len, kv_len). |
Source code in src/turboquant_vllm/compressors.py
CompressedDynamicCache ¶
KV cache with real VRAM savings via compressed index storage.
Stores TurboQuant-compressed representations and dequantizes lazily on each cache read. Only one layer's decompressed tensors are held in memory at a time — previous layers are freed on the next update.
Storage per token per head (head_dim=128):

| Mode | Dtype | Bytes | Compression | Quality |
|---|---|---|---|---|
| FP16 baseline | fp16 | 256 | 1.0x | — |
| TQ3 (3-bit) | uint8 | 132 | 1.94x | ~95% cosine |
| TQ4 (4-bit) | nibble | 68 | 3.76x | ~97% cosine |
At bits=4, indices are nibble-packed (two 4-bit values per
byte), nearly doubling compression over TQ3 with better quality.
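Nibble packing itself is two shifts and a mask; a minimal sketch of how two 4-bit indices can share a byte (NumPy for illustration; the library's actual bit layout may differ):

```python
import numpy as np

def pack_nibbles(idx: np.ndarray) -> np.ndarray:
    """Pack pairs of 4-bit indices (values 0-15) into single bytes."""
    assert idx.shape[-1] % 2 == 0, "nibble packing needs an even last dim"
    lo = idx[..., 0::2].astype(np.uint8)
    hi = idx[..., 1::2].astype(np.uint8)
    return (hi << 4) | lo

def unpack_nibbles(packed: np.ndarray) -> np.ndarray:
    """Invert pack_nibbles: one byte -> two 4-bit indices."""
    lo = packed & 0x0F
    hi = packed >> 4
    out = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.uint8)
    out[..., 0::2] = lo
    out[..., 1::2] = hi
    return out

idx = np.array([[1, 15, 0, 7]], dtype=np.uint8)
packed = pack_nibbles(idx)  # half the bytes of idx
assert np.array_equal(unpack_nibbles(packed), idx)
```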
Float32 norms are required — fp16 causes output degradation at
10K+ token sequences due to accumulated precision loss.
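The byte counts above follow directly from the layout (one index per coordinate plus one fp32 norm per vector); a quick arithmetic check in plain Python:

```python
HEAD_DIM = 128
NORM_BYTES = 4  # one fp32 norm per vector

# FP16 baseline: 2 bytes per coordinate
fp16 = 2 * HEAD_DIM                # 256

# TQ3: one uint8 index per coordinate + fp32 norm
tq3 = 1 * HEAD_DIM + NORM_BYTES    # 132

# TQ4: two 4-bit indices packed per byte + fp32 norm
tq4 = HEAD_DIM // 2 + NORM_BYTES   # 68

print(fp16, tq3, tq4)              # 256 132 68
print(round(fp16 / tq3, 2))        # 1.94
print(round(fp16 / tq4, 2))        # 3.76
```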
Integration strategy: non-invasive method replacement (same pattern
as TurboQuantKVCache). Patches update() and get_seq_length()
on the wrapped DynamicCache.
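The interception pattern is plain attribute patching on the instance; a minimal sketch with a toy cache standing in for HuggingFace's DynamicCache (class names are hypothetical and the compression step is faked with a string tag; the real wrapper also intercepts get_seq_length()):

```python
class ToyCache:
    def __init__(self):
        self.store = []

    def update(self, k, v):
        self.store.append((k, v))
        return k, v

class ToyWrapper:
    """Non-invasive interception: replace bound methods, keep the originals."""
    def __init__(self, cache):
        self.cache = cache
        self._orig_update = cache.update   # saved so restore() can undo the patch
        cache.update = self._update        # patch in the compressing version

    def _update(self, k, v):
        k, v = "compressed:" + k, "compressed:" + v  # stand-in for TurboQuant
        return self._orig_update(k, v)

    def restore(self):
        self.cache.update = self._orig_update

cache = ToyCache()
wrapper = ToyWrapper(cache)
cache.update("K0", "V0")   # intercepted: stored compressed
wrapper.restore()
cache.update("K1", "V1")   # original method: stored as-is
print(cache.store)
```

Because only instance attributes are touched, the model code calling `cache.update(...)` needs no changes, and `restore()` fully unwinds the patch.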
Attributes:

| Name | Type | Description |
|---|---|---|
| `cache` | `Any` | The wrapped DynamicCache instance. |
| `key_compressor` | `TurboQuantCompressorMSE` | Compressor for key tensors. |
| `value_compressor` | `TurboQuantCompressorMSE` | Compressor for value tensors. |
| `bits` | `int` | Quantization bits per coordinate. |
| `head_dim` | `int` | Model head dimension. |
| `enabled` | `bool` | Whether compression is active. |
| `fused_mode` | `bool` | When True, skip decompression on cache reads; the fused TQ4 kernel consumes compressed storage via get_compressed(). |
| `rotation` | `Tensor` | Shared rotation matrix. |
| `centroids` | `Tensor` | Shared codebook. |
Examples:

```python
from transformers import DynamicCache

cache = DynamicCache()
compressed = CompressedDynamicCache(cache, head_dim=128, bits=4)
compressed.vram_bytes()  # 0
```
Initialize the compressed KV cache wrapper.

Sets up compressors, internal storage for compressed representations,
and incremental decompressed buffers. fused_mode starts disabled.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cache` | `Any` | A HuggingFace DynamicCache instance to wrap. | *required* |
| `head_dim` | `int` | Dimension of each attention head. Must be even when `bits=4` (nibble packing). | *required* |
| `bits` | `int` | Quantization bits per coordinate (default 3). Use 4 for nibble-packed storage (3.76x compression). | `3` |
| `seed` | `int` | Random seed for reproducibility. | `42` |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `head_dim` is odd when `bits=4` (nibble packing requires an even dimension). |
Source code in src/turboquant_vllm/kv_cache.py
Attributes¶

rotation property ¶

Shared orthogonal rotation matrix [head_dim, head_dim] fp32.

K and V use the same rotation (same seed).

Returns:

| Type | Description |
|---|---|
| `Tensor` | The rotation matrix from the key compressor's quantizer. |

centroids property ¶

Shared Lloyd-Max codebook [2^bits] fp32.

Returns:

| Type | Description |
|---|---|
| `Tensor` | Centroid values from the key compressor's quantizer. |
Functions¶
get_compressed ¶

Return compressed K and V for a layer (fused kernel API).

Provides the raw nibble-packed indices and norms without dequantization, for use by the fused TQ4 Flash Attention kernel.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `layer_idx` | `int` | Transformer layer index. | *required* |

Returns:

| Type | Description |
|---|---|
| `Tensor` | Compressed key and value components for the layer; indices are uint8, norms are fp32. |
Source code in src/turboquant_vllm/kv_cache.py
disable ¶
enable ¶
restore ¶
Restore original methods on the wrapped cache.
Call this to fully unwrap the cache and remove all TurboQuant interception.
Source code in src/turboquant_vllm/kv_cache.py
vram_bytes ¶

Calculate total VRAM used by compressed storage.

Returns:

| Type | Description |
|---|---|
| `int` | Total bytes across all compressed layers (keys + values). |
Source code in src/turboquant_vllm/kv_cache.py
baseline_vram_bytes ¶

Estimate FP16 VRAM that would be used without compression.

Accounts for nibble-packed indices by doubling the last dimension to recover the original head_dim.

Returns:

| Type | Description |
|---|---|
| `int` | Total bytes if keys and values were stored as FP16 tensors. |
Source code in src/turboquant_vllm/kv_cache.py
compression_stats ¶
Return compression statistics for reporting.
Reports the true head_dim (not the packed index dimension)
and includes a nibble_packed flag.
Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Dict with layer count, sequence length, compressed/baseline sizes in MiB, compression ratio, packing mode, and VRAM savings. |
Source code in src/turboquant_vllm/kv_cache.py
TurboQuantKVCache ¶
```python
TurboQuantKVCache(
    cache: Any,
    head_dim: int,
    bits: int = 3,
    *,
    seed: int = 42,
    compress_keys: bool = True,
    compress_values: bool = True,
)
```
Transparent KV cache compression wrapper (drop-in mode).
Intercepts cache updates to compress key/value tensors before they are stored. Both keys and values use TurboQuantCompressorMSE (full MSE-optimal quantization at the configured bit-width).
This is the "drop-in" approach where standard attention (Q @ K^T) operates on decompressed keys. For the QJL-corrected inner product path (TurboQuantProd), a custom attention kernel would be needed — see TurboQuantCompressorV2.asymmetric_attention_scores().
Attributes:

| Name | Type | Description |
|---|---|---|
| `cache` | `Any` | The wrapped DynamicCache instance. |
| `key_compressor` | `TurboQuantCompressorMSE` | Compressor for key tensors. |
| `value_compressor` | `TurboQuantCompressorMSE` | Compressor for value tensors. |
| `bits` | `int` | Quantization bits per coordinate. |
| `head_dim` | `int` | Model head dimension. |
| `enabled` | `bool` | Whether compression is active. |
Examples:

```python
from transformers import DynamicCache

cache = DynamicCache()
tq = TurboQuantKVCache(cache, head_dim=128, bits=3)
tq.enabled  # True
```
Initialize the TurboQuant KV cache wrapper.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cache` | `Any` | A HuggingFace DynamicCache instance to wrap. | *required* |
| `head_dim` | `int` | Dimension of each attention head. | *required* |
| `bits` | `int` | Quantization bits per coordinate (default 3). | `3` |
| `seed` | `int` | Random seed for reproducibility. | `42` |
| `compress_keys` | `bool` | Whether to compress key tensors. | `True` |
| `compress_values` | `bool` | Whether to compress value tensors. | `True` |
Source code in src/turboquant_vllm/kv_cache.py
LloydMaxCodebook dataclass ¶
Precomputed optimal scalar quantizer for a given dimension and bit-width.
The codebook stores centroids and boundaries computed by the Lloyd-Max algorithm. It maps continuous coordinate values to discrete indices and back via nearest-centroid lookup.
Attributes:

| Name | Type | Description |
|---|---|---|
| `centroids` | `Tensor` | Reconstruction values, shape (2^bits,). |
| `boundaries` | `Tensor` | Partition boundaries, shape (2^bits - 1,). |
| `bits` | `int` | Number of quantization bits. |
| `dim` | `int` | Vector dimension used to compute the codebook. |
Examples:

Round-trip quantize and dequantize a tensor:

```python
codebook = LloydMaxCodebook(centroids, boundaries, bits=3, dim=128)
indices = codebook.quantize(x)
x_hat = codebook.dequantize(indices)
```
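Under the hood, the mapping is a sorted-boundary search plus a table lookup. A self-contained sketch with a hand-picked 2-bit codebook (NumPy's searchsorted standing in for the torch bucket search; the values are illustrative, not Lloyd-Max optimal):

```python
import numpy as np

centroids = np.array([-1.5, -0.5, 0.5, 1.5])  # 2^bits reconstruction values
boundaries = np.array([-1.0, 0.0, 1.0])       # 2^bits - 1 partition edges

def quantize(x):
    # Index of the partition cell containing each value: O(log n) per element
    return np.searchsorted(boundaries, x, side="left")

def dequantize(indices):
    return centroids[indices]

x = np.array([-2.0, -0.7, 0.3, 0.7, 2.0])
idx = quantize(x)
print(idx)              # [0 1 2 2 3]
print(dequantize(idx))  # [-1.5 -0.5  0.5  0.5  1.5]
```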
Functions¶
quantize ¶
Map continuous values to nearest centroid indices.
Uses bucket search on partition boundaries for O(log n) lookup.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `Tensor` | Input tensor of any shape. | *required* |

Returns:

| Type | Description |
|---|---|
| `Tensor` | Integer tensor of same shape with centroid indices in [0, 2^bits - 1]. |
Source code in src/turboquant_vllm/lloyd_max.py
dequantize ¶
Reconstruct continuous values from centroid indices.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `indices` | `Tensor` | Integer tensor of centroid indices. | *required* |

Returns:

| Type | Description |
|---|---|
| `Tensor` | Float tensor of reconstructed values with same shape as indices. |
Source code in src/turboquant_vllm/lloyd_max.py
TurboQuantMSE ¶
Stage 1 quantizer: rotation + Lloyd-Max scalar quantization.
Achieves near-optimal MSE distortion rate for high-dimensional vectors by exploiting the concentrated Beta distribution that emerges after random rotation.
Attributes:

| Name | Type | Description |
|---|---|---|
| `dim` | `int` | Vector dimension. |
| `bits` | `int` | Quantization bit-width. |
| `codebook` | `LloydMaxCodebook` | Precomputed Lloyd-Max codebook. |
| `rotation` | `Tensor` | Orthogonal rotation matrix, shape (dim, dim). |
Examples:

```python
quantizer = TurboQuantMSE(dim=64, bits=4)
indices, norms = quantizer.quantize(torch.randn(8, 64))
reconstructed = quantizer.dequantize(indices, norms)
```
Initialize the MSE quantizer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dim` | `int` | Vector dimension (head dimension of the model). | *required* |
| `bits` | `int` | Quantization bits per coordinate (2-4 typical). | *required* |
| `seed` | `int` | Random seed for the rotation matrix. | `42` |
Source code in src/turboquant_vllm/quantizer.py
Functions¶
quantize ¶
Quantize vectors to centroid indices.
Applies rotation, extracts norms, normalizes to unit sphere, then quantizes each coordinate independently.
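Those four steps can be sketched end to end in NumPy, with the Q factor of a Gaussian matrix supplying the rotation and a uniform grid standing in for the Lloyd-Max codebook (both are simplified stand-ins for the library's actual components):

```python
import numpy as np

rng = np.random.default_rng(42)
dim, bits = 64, 3

# Stand-in orthogonal rotation: Q factor of a random Gaussian matrix
rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))

# Stand-in scalar codebook: uniform grid over the post-rotation coordinate range
levels = np.linspace(-3 / np.sqrt(dim), 3 / np.sqrt(dim), 2**bits)

def quantize(x):
    y = x @ rotation.T                                   # 1. rotate
    norms = np.linalg.norm(y, axis=-1, keepdims=True)    # 2. extract norms
    u = y / norms                                        # 3. normalize to unit sphere
    idx = np.abs(u[..., None] - levels).argmin(axis=-1)  # 4. quantize each coordinate
    return idx, norms

def dequantize(idx, norms):
    return (levels[idx] * norms) @ rotation              # rescale, then un-rotate

x = rng.standard_normal((8, dim))
x_hat = dequantize(*quantize(x))
cos = (x * x_hat).sum(-1) / (
    np.linalg.norm(x, axis=-1) * np.linalg.norm(x_hat, axis=-1)
)
print(cos.min())  # high cosine similarity even at 3 bits per coordinate
```

The rotation is what makes a shared scalar codebook work: after rotation the coordinates of a unit vector concentrate tightly around zero, so one small set of levels fits every coordinate.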
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `Tensor` | Input tensor of shape (..., dim). | *required* |

Returns:

| Type | Description |
|---|---|
| `tuple[Tensor, Tensor]` | Tuple of (indices, norms) where indices is a long tensor of shape (..., dim) and norms is a float tensor of shape (..., 1). |
Source code in src/turboquant_vllm/quantizer.py
dequantize ¶
Reconstruct vectors from centroid indices and norms.
Looks up centroids, applies inverse rotation, and rescales by stored norms.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `indices` | `Tensor` | Long tensor of centroid indices, shape (..., dim). | *required* |
| `norms` | `Tensor` | Float tensor of vector norms, shape (..., 1). | *required* |

Returns:

| Type | Description |
|---|---|
| `Tensor` | Reconstructed float tensor of shape (..., dim). |
Source code in src/turboquant_vllm/quantizer.py
TurboQuantProd ¶
Two-stage quantizer with QJL correction for unbiased inner products.
Allocates (bits-1) bits to Lloyd-Max MSE quantization and 1 bit to Quantized Johnson-Lindenstrauss residual correction. The QJL step eliminates bias in dot-product estimation, which is critical for attention score computation (Q·K^T).
The unbiased estimator is

    <q, k> ≈ <q, k_hat> + ||r|| * sqrt(pi/2) / m * <S @ q, sign(S @ r)>

where k_hat is the Stage 1 (MSE) reconstruction of k, r = k - k_hat is the quantization residual, S is a random Gaussian projection matrix, and m is the number of projection dimensions (qjl_dim).
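The residual term's unbiasedness comes from the Gaussian identity E[s sign(<s, r>)] = sqrt(2/pi) r/||r||, so scaling the sign inner product by ||r|| sqrt(pi/2)/m recovers <q, r> in expectation. A Monte-Carlo check of just that term (NumPy; here r is an arbitrary vector standing in for the residual, and m is made huge purely to shrink sampling noise, far beyond what the library would use):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, m = 32, 200_000

q = rng.standard_normal(dim)
r = rng.standard_normal(dim)       # stand-in for the quantization residual
S = rng.standard_normal((m, dim))  # random Gaussian projection

# sign(S @ r) keeps only 1 bit per projection; the scale factor undoes the
# shrinkage the sign introduces, making the estimate unbiased for <q, r>
qjl_term = (
    np.linalg.norm(r) * np.sqrt(np.pi / 2) / m * np.dot(S @ q, np.sign(S @ r))
)
exact = np.dot(q, r)
print(qjl_term, exact)  # agree up to O(1/sqrt(m)) sampling noise
```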
Attributes:

| Name | Type | Description |
|---|---|---|
| `dim` | `int` | Vector dimension. |
| `bits` | `int` | Total bit budget (bits-1 for MSE, 1 for QJL). |
| `mse_quantizer` | `TurboQuantMSE` | Stage 1 quantizer with (bits-1) bits. |
| `qjl_dim` | `int` | Number of QJL projection dimensions. |
| `qjl_matrix` | `Tensor` | Random Gaussian projection matrix. |
Examples:

```python
quantizer = TurboQuantProd(dim=64, bits=4)
indices, norms, signs, res_norms = quantizer.quantize(torch.randn(8, 64))
scores = quantizer.estimate_inner_product(
    torch.randn(1, 64), indices, norms, signs, res_norms
)
```
Initialize the two-stage quantizer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dim` | `int` | Vector dimension (head dimension of the model). | *required* |
| `bits` | `int` | Total bit budget per coordinate. Must be >= 2 (1 bit for MSE + 1 bit for QJL minimum). | *required* |
| `qjl_dim` | `int \| None` | Number of QJL projection dimensions. Defaults to dim (standard JL dimensionality). | `None` |
| `seed` | `int` | Random seed for rotation and projection matrices. | `42` |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If bits < 2. |
Source code in src/turboquant_vllm/quantizer.py
Functions¶
quantize ¶
Quantize vectors with MSE + QJL correction.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `Tensor` | Input tensor of shape (..., dim). | *required* |

Returns:

| Type | Description |
|---|---|
| `tuple[Tensor, Tensor, Tensor, Tensor]` | Tuple of (indices, norms, qjl_signs, residual_norms): indices are Lloyd-Max centroid indices, shape (..., dim); norms are vector norms, shape (..., 1); qjl_signs are sign bits of projected residuals, shape (..., qjl_dim); residual_norms are norms of quantization residuals, shape (..., 1). |
Source code in src/turboquant_vllm/quantizer.py
dequantize ¶
Reconstruct vectors from compressed representation.
Note: Full reconstruction is approximate. For attention computation,
use estimate_inner_product instead — it's more accurate because
QJL corrects inner-product bias, not reconstruction bias.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `indices` | `Tensor` | Lloyd-Max centroid indices, shape (..., dim). | *required* |
| `norms` | `Tensor` | Vector norms, shape (..., 1). | *required* |
| `qjl_signs` | `Tensor` | QJL sign bits, shape (..., qjl_dim). | *required* |
| `residual_norms` | `Tensor` | Residual norms, shape (..., 1). | *required* |

Returns:

| Type | Description |
|---|---|
| `Tensor` | Approximately reconstructed tensor of shape (..., dim). |
Source code in src/turboquant_vllm/quantizer.py
estimate_inner_product ¶
```python
estimate_inner_product(
    query: Tensor,
    indices: Tensor,
    norms: Tensor,
    qjl_signs: Tensor,
    residual_norms: Tensor,
) -> Tensor
```

Compute unbiased inner product estimate between query and compressed key.

Uses the two-stage estimator

    <q, k> ≈ <q, k_hat> + ||r|| * sqrt(pi/2) / m * <S @ q, signs>
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `Tensor` | Query vectors, shape (..., dim). | *required* |
| `indices` | `Tensor` | Compressed key indices, shape (..., dim). | *required* |
| `norms` | `Tensor` | Key norms, shape (..., 1). | *required* |
| `qjl_signs` | `Tensor` | QJL sign bits for keys, shape (..., qjl_dim). | *required* |
| `residual_norms` | `Tensor` | Key residual norms, shape (..., 1). | *required* |

Returns:

| Type | Description |
|---|---|
| `Tensor` | Inner product estimates, shape matching broadcast of query and key batch dimensions. |
Source code in src/turboquant_vllm/quantizer.py
Functions¶
solve_lloyd_max ¶
```python
solve_lloyd_max(
    d: int,
    bits: int,
    *,
    use_exact: bool = False,
    max_iter: int = 200,
    tol: float = 1e-10,
) -> tuple[Tensor, Tensor]
```
Solve the Lloyd-Max conditions for optimal scalar quantization.
Results are cached by (d, bits, use_exact) so that multi-layer models (e.g., 32 layers × 2 K/V compressors = 64 calls) pay the scipy integration cost only once. Without caching, initialization takes 2+ minutes for models like Molmo2-8B.
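The memoization is the standard cache-by-arguments pattern; a toy sketch of the idea (functools.lru_cache keyed on (d, bits, use_exact), with a counter standing in for the expensive scipy integration; the solver body here is a placeholder, not the real Lloyd-Max iteration):

```python
from functools import lru_cache

calls = {"n": 0}

@lru_cache(maxsize=None)
def solve_toy(d: int, bits: int, use_exact: bool = False):
    calls["n"] += 1  # stands in for minutes of scipy work
    # Placeholder result with the documented lengths: 2^bits and 2^bits - 1
    return tuple(range(2**bits)), tuple(range(2**bits - 1))

# 32 layers x 2 K/V compressors = 64 identical calls, but only one real solve
for _ in range(64):
    centroids, boundaries = solve_toy(128, 3)

print(calls["n"], len(centroids), len(boundaries))  # 1 8 7
```

Because every layer of a given model asks for the same (d, bits, use_exact) triple, the cache turns 64 solves into one.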
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `d` | `int` | Vector dimension (determines the distribution shape). | *required* |
| `bits` | `int` | Number of quantization bits (produces 2^bits centroids). | *required* |
| `use_exact` | `bool` | If True, use exact Beta PDF. If False, use Gaussian approximation (faster, accurate for d >= 64). | `False` |
| `max_iter` | `int` | Maximum Lloyd-Max iterations. | `200` |
| `tol` | `float` | Convergence tolerance on centroid movement. | `1e-10` |
|
Returns:

| Type | Description |
|---|---|
| `tuple[Tensor, Tensor]` | Tuple of (centroids, boundaries) as 1-D tensors. Centroids has length 2^bits, boundaries has length 2^bits - 1. |