turboquant_vllm ¶
TurboQuant KV cache compression for consumer GPUs.
Implements Google's TurboQuant algorithm (ICLR 2026) for compressing transformer key-value caches to 3-4 bits per coordinate with near-zero accuracy loss. Designed for benchmarking on consumer hardware (RTX 4090).
Reference: arXiv 2504.19874 — "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate"
Note

Gemma 3/4 models require transformers>=5.5.0. Since vLLM 0.19 pins transformers<5, users must upgrade manually (pip install 'transformers>=5.5'). A runtime warning is emitted when an older version is detected.
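A quick self-check before loading a Gemma 3/4 model, sketched with importlib.metadata and packaging (this mirrors, but is independent of, the package's own warning logic):

```python
# Sketch: verify the installed transformers version before loading Gemma 3/4.
from importlib.metadata import version
from packaging.version import Version

if Version(version("transformers")) < Version("5.5.0"):
    print("transformers < 5.5.0 detected; upgrade with: pip install 'transformers>=5.5'")
```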
Attributes:

| Name | Type | Description |
|---|---|---|
| CompressedDynamicCache | | KV cache with real VRAM savings (uint8 + fp32). |
| TurboQuantKVCache | | Accuracy-only KV cache wrapper (no VRAM savings). |
| TurboQuantCompressorMSE | | Value cache compressor (MSE-optimal). |
| TurboQuantCompressorV2 | | Key cache compressor (QJL-corrected). |
| TurboQuantMSE | | Stage 1 quantizer (rotation + Lloyd-Max). |
| TurboQuantProd | | Stage 1 + 2 quantizer (MSE + QJL). |
| LloydMaxCodebook | | Precomputed optimal scalar quantizer. |
| solve_lloyd_max | tuple[Tensor, Tensor] | Factory for Lloyd-Max codebooks (cached). |
Examples:
from turboquant_vllm import TurboQuantKVCache
wrapper = TurboQuantKVCache(cache, head_dim=128, bits=3)
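A fuller end-to-end sketch follows; the model name is a placeholder (any causal LM with head_dim=128 works), and passing past_key_values to generate() is standard transformers usage rather than anything specific to this package:

```python
# Sketch: wrap a DynamicCache before generation so new KV entries are quantized.
# The model/tokenizer below are placeholders; head_dim must match the model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache
from turboquant_vllm import TurboQuantKVCache

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model with head_dim=128
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
)

cache = DynamicCache()
with TurboQuantKVCache(cache, head_dim=128, bits=3):
    inputs = tok("The capital of France is", return_tensors="pt").to("cuda")
    out = model.generate(**inputs, past_key_values=cache, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```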
See Also
:mod:turboquant_vllm.benchmark: CLI harness for benchmarking.
:mod:turboquant_vllm.lloyd_max: Lloyd-Max codebook solver.
Classes¶
TurboQuantCompressorMSE ¶
Value cache compressor with MSE-optimal reconstruction.
Uses Stage 1 only (TurboQuantMSE) for value vectors. Values appear
in the softmax(scores) @ V multiplication where reconstruction
quality matters but inner-product structure does not.
Attributes:

| Name | Type | Description |
|---|---|---|
| quantizer | TurboQuantMSE | TurboQuantMSE instance. |
| bits | int | Bits per coordinate. |
| head_dim | int | Model head dimension. |
Examples:
Compress and reconstruct value tensors:
comp = TurboQuantCompressorMSE(head_dim=128, bits=3)
compressed = comp.compress(value_states)
reconstructed = comp.decompress(compressed)
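A quick reconstruction-quality check, assuming only the compress()/decompress() API shown above (random data stands in for real value states):

```python
# Sketch: per-vector cosine similarity between original and reconstructed values.
import torch
import torch.nn.functional as F
from turboquant_vllm import TurboQuantCompressorMSE

values = torch.randn(1, 8, 64, 128, dtype=torch.float16)  # (batch, heads, seq, head_dim)
comp = TurboQuantCompressorMSE(head_dim=128, bits=3)
recon = comp.decompress(comp.compress(values))
cos = F.cosine_similarity(values.float(), recon.float(), dim=-1)
print(f"mean cosine similarity: {cos.mean().item():.4f}")
```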
Initialize the value compressor.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| head_dim | int | Dimension of each attention head. | required |
| bits | int | Bits per coordinate (default 3). | 3 |
| seed | int | Random seed for reproducibility. | 42 |
Source code in src/turboquant_vllm/compressors.py
Functions¶
compress ¶
compress(values: Tensor) -> CompressedValues
Compress value tensors.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| values | Tensor | Value tensor of shape (batch, heads, seq_len, head_dim). | required |

Returns:

| Type | Description |
|---|---|
| CompressedValues | CompressedValues containing indices and norms. |
Source code in src/turboquant_vllm/compressors.py
decompress ¶
decompress(compressed: CompressedValues) -> Tensor
Reconstruct value tensors from compressed representation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| compressed | CompressedValues | CompressedValues from compress(). | required |

Returns:

| Type | Description |
|---|---|
| Tensor | Reconstructed value tensor in the original dtype. |
Source code in src/turboquant_vllm/compressors.py
TurboQuantCompressorV2 ¶
Key cache compressor with unbiased attention score estimation.
Uses the full two-stage TurboQuantProd algorithm to compress key vectors while preserving accurate inner product estimation for attention computation (Q·K^T).
Attributes:

| Name | Type | Description |
|---|---|---|
| quantizer | TurboQuantProd | Two-stage TurboQuantProd instance. |
| bits | int | Total bit budget per coordinate. |
| head_dim | int | Model head dimension. |
Examples:
Compress keys and compute attention scores directly:
comp = TurboQuantCompressorV2(head_dim=128, bits=3)
compressed = comp.compress(key_states)
scores = comp.asymmetric_attention_scores(query, compressed)
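The estimator can be sanity-checked against exact scores on a short sequence. This sketch assumes the returned logits are unscaled Q·K^T estimates, and keeps shapes small because of the memory caveat noted under asymmetric_attention_scores below:

```python
# Sketch: compare QJL-corrected scores with exact Q @ K^T on a short sequence.
import torch
from turboquant_vllm import TurboQuantCompressorV2

q = torch.randn(1, 8, 4, 128)     # (batch, heads, q_len, head_dim)
k = torch.randn(1, 8, 64, 128)    # (batch, heads, kv_len, head_dim)
comp = TurboQuantCompressorV2(head_dim=128, bits=3)
approx = comp.asymmetric_attention_scores(q, comp.compress(k))
exact = q @ k.transpose(-1, -2)
print("max abs error:", (approx - exact).abs().max().item())
```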
Initialize the key compressor.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| head_dim | int | Dimension of each attention head. | required |
| bits | int | Total bits per coordinate (default 3). | 3 |
| seed | int | Random seed for reproducibility. | 42 |
Source code in src/turboquant_vllm/compressors.py
Functions¶
compress ¶
compress(keys: Tensor) -> CompressedKeys
Compress key tensors.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| keys | Tensor | Key tensor of shape (batch, heads, seq_len, head_dim). | required |

Returns:

| Type | Description |
|---|---|
| CompressedKeys | CompressedKeys containing all components for attention estimation. |
Source code in src/turboquant_vllm/compressors.py
decompress ¶
decompress(compressed: CompressedKeys) -> Tensor
Reconstruct key tensors from compressed representation.
Note: For attention, prefer asymmetric_attention_scores() which
uses the QJL-corrected inner product estimator for better accuracy.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| compressed | CompressedKeys | CompressedKeys from compress(). | required |

Returns:

| Type | Description |
|---|---|
| Tensor | Reconstructed key tensor in the original dtype. |
Source code in src/turboquant_vllm/compressors.py
asymmetric_attention_scores ¶
asymmetric_attention_scores(query: Tensor, compressed: CompressedKeys) -> Tensor
Compute attention scores directly from compressed keys.
Uses the unbiased two-stage inner product estimator rather than decompressing keys and computing standard dot products. This is both more memory-efficient and more accurate.
Warning (memory scaling): the current implementation expands tensors to (batch, heads, q_len, kv_len, dim) for broadcasting, allocating roughly five intermediate tensors at that shape. For real sequence lengths (kv_len=6144, heads=32, dim=128) this would use ~500MB+ per call, so it is suitable for correctness testing on short sequences only.

TODO: Replace with a chunked or fused Triton kernel for production use at real sequence lengths.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| query | Tensor | Query tensor, shape (batch, heads, q_len, head_dim). | required |
| compressed | CompressedKeys | CompressedKeys from compress(). | required |

Returns:

| Type | Description |
|---|---|
| Tensor | Attention logits, shape (batch, heads, q_len, kv_len). |
Source code in src/turboquant_vllm/compressors.py
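Until such a kernel exists, peak memory can be capped along the query axis with a wrapper like the hypothetical sketch below (not part of the library). It only bounds the q_len factor, so the kv_len term from the warning above still applies:

```python
# Hypothetical helper: split the query length into chunks before calling
# asymmetric_attention_scores(), concatenating the resulting logits.
import torch

def chunked_scores(comp, query, compressed, chunk_size=128):
    parts = []
    for start in range(0, query.shape[2], chunk_size):
        q_chunk = query[:, :, start:start + chunk_size, :]
        parts.append(comp.asymmetric_attention_scores(q_chunk, compressed))
    return torch.cat(parts, dim=2)
```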
CompressedDynamicCache ¶
CompressedDynamicCache(
cache: Any,
head_dim: int,
bits: int | None = 3,
*,
k_bits: int | None = None,
v_bits: int | None = None,
seed: int = 42,
model_config: Any = None,
)
KV cache with real VRAM savings via compressed index storage.
Stores TurboQuant-compressed representations and dequantizes lazily on each cache read. Only one layer's decompressed tensors are held in memory at a time — previous layers are freed on the next update.
Supports heterogeneous head dimensions for the lazy dequantized
(non-fused) cache-read path via per-head_dim compressors created
lazily on first use. The fused path consumes shared rotation
and centroids for the primary head_dim only, so it must not be
used for models with mixed head dimensions (e.g. Gemma 4: d=256
sliding, d=512 global).
Storage per token per head (head_dim=128):
| Mode | Dtype | Bytes | Compression | Quality |
|---|---|---|---|---|
| FP16 baseline | fp16 | 256 | 1.0x | — |
| TQ3 (3-bit) | uint8 | 132 | 1.94x | ~95% cosine |
| TQ4 (4-bit) | nibble | 68 | 3.76x | ~97% cosine |
At bits=4, indices are nibble-packed (two 4-bit values per
byte), nearly doubling compression over TQ3 with better quality.
Float32 norms are required — fp16 causes output degradation at
10K+ token sequences due to accumulated precision loss.
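The byte counts in the table follow from one fp32 norm per vector plus one index per coordinate, as this check shows:

```python
# Reproduce the per-token, per-head byte counts above for head_dim = 128:
# one fp32 norm per vector; 3-bit indices stored one per uint8,
# 4-bit indices nibble-packed two per byte.
head_dim = 128
fp16_bytes = head_dim * 2        # 256
tq3_bytes = head_dim + 4         # 132
tq4_bytes = head_dim // 2 + 4    # 68
print(fp16_bytes / tq3_bytes)    # ~1.94x
print(fp16_bytes / tq4_bytes)    # ~3.76x
```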
For models with mixed global and sliding window attention layers
(e.g. Gemma-2, Gemma-3), SWA layers automatically bypass compression
via the is_sliding attribute on DynamicSlidingWindowLayer.
Only global attention layers are compressed. Pass model_config
to enable a diagnostic warning when the cache lacks SWA metadata.
Integration strategy: non-invasive method replacement (same pattern
as TurboQuantKVCache). Patches update() and get_seq_length()
on the wrapped DynamicCache. Supports the context manager protocol
for automatic restore() on scope exit, and warns on double-wrap.
Compatible with both transformers 4.x and 5.x lazy_initialization
signatures via try/except fallback in _ensure_layer_initialized.
Attributes:

| Name | Type | Description |
|---|---|---|
| cache | Any | The wrapped DynamicCache instance. |
| key_compressor | TurboQuantCompressorMSE | Compressor for key tensors. |
| value_compressor | TurboQuantCompressorMSE | Compressor for value tensors. |
| bits | int | Quantization bits per coordinate. |
| head_dim | int | Model head dimension. |
| enabled | bool | Whether compression is active. |
| fused_mode | bool | When True, skip decompression on cache reads (fused kernel path). |
| rotation | Tensor | Shared rotation matrix. |
| centroids | Tensor | Shared Lloyd-Max codebook. |
Examples:
from transformers import DynamicCache
cache = DynamicCache()
compressed = CompressedDynamicCache(cache, head_dim=128, bits=4)
compressed.vram_bytes() # 0
Initialize the compressed KV cache wrapper.
Sets up per-head_dim compressors (lazily created via
_get_compressors()), internal storage for compressed
representations, and incremental decompressed buffers.
fused_mode starts disabled. When model_config has
mixed sliding/full attention layer_types, full attention
layers are bypassed (with list padding) to preserve retrieval
quality while allowing get_seq_length to delegate correctly.
Keys and values can use different bit-widths via k_bits and
v_bits. When both are None, bits applies to both
(backward compatible). Any 4-bit component requires even
head_dim for nibble packing.
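For example, a minimal construction sketch with 4-bit keys and 3-bit values:

```python
# Sketch: independent key/value bit-widths. head_dim must be even because
# the 4-bit key component is nibble-packed.
from transformers import DynamicCache
from turboquant_vllm import CompressedDynamicCache

cc = CompressedDynamicCache(DynamicCache(), head_dim=128, k_bits=4, v_bits=3)
```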
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| cache | Any | A HuggingFace DynamicCache instance to wrap. | required |
| head_dim | int | Dimension of each attention head. Must be even when any component uses 4-bit (nibble packing). | required |
| bits | int \| None | Shorthand for both k_bits and v_bits when neither is given. | 3 |
| k_bits | int \| None | Key quantization bits (overrides bits). | None |
| v_bits | int \| None | Value quantization bits (overrides bits). | None |
| seed | int | Random seed for reproducibility. | 42 |
| model_config | Any | Optional model config, used to detect mixed sliding/full attention layers and to emit a diagnostic warning when the cache lacks SWA metadata. | None |
Raises:

| Type | Description |
|---|---|
| ValueError | If no bit-width is specified (all three are None). |
| ValueError | If any 4-bit component has an odd head_dim (nibble packing requires an even head_dim). |
Warns:

| Type | Description |
|---|---|
| UserWarning | If the cache is already wrapped (double-wrap). |
| UserWarning | If model_config is provided but the cache lacks sliding-window (SWA) metadata. |
Source code in src/turboquant_vllm/kv_cache.py
Attributes¶
key_compressor property ¶
key_compressor: TurboQuantCompressorMSE
Primary key compressor (backward compat).
value_compressor property ¶
value_compressor: TurboQuantCompressorMSE
Primary value compressor (backward compat).
rotation property ¶
Shared orthogonal rotation matrix [head_dim, head_dim] fp32.
K and V use the same rotation (same seed).
Returns:

| Type | Description |
|---|---|
| Tensor | The rotation matrix from the key compressor's quantizer. |
centroids property ¶
Shared Lloyd-Max codebook [2^bits] fp32.
Returns:

| Type | Description |
|---|---|
| Tensor | Centroid values from the key compressor's quantizer. |
Functions¶
get_compressed ¶
Return compressed K and V for a layer (fused kernel API).
Provides the raw nibble-packed indices and norms without dequantization, for use by the fused TQ4 Flash Attention kernel.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| layer_idx | int | Transformer layer index. | required |
Returns:

| Type | Description |
|---|---|
| Tensor | Compressed key and value components for the layer; indices are uint8 and norms are fp32. |
Raises:

| Type | Description |
|---|---|
| ValueError | If the layer has no compressed data. |
Source code in src/turboquant_vllm/kv_cache.py
disable ¶
enable ¶
restore ¶
Restore original methods on the wrapped cache.
Call this to fully unwrap the cache and remove all TurboQuant interception.
Source code in src/turboquant_vllm/kv_cache.py
__enter__ ¶
__enter__() -> CompressedDynamicCache
Enter the context manager.
Returns:

| Type | Description |
|---|---|
| CompressedDynamicCache | Self, for use in with statements. |
__exit__ ¶
Exit the context manager, restoring the original cache methods.
Returns:

| Type | Description |
|---|---|
| bool | False — exceptions are never suppressed. |
vram_bytes ¶
Calculate total VRAM used by compressed storage.
SWA-bypassed layers (None entries) are excluded from the total.
Returns:

| Type | Description |
|---|---|
| int | Total bytes across all compressed layers (keys + values). |
Source code in src/turboquant_vllm/kv_cache.py
baseline_vram_bytes ¶
Estimate FP16 VRAM that would be used without compression.
Accounts for nibble-packed indices by doubling the last dimension to recover the original head_dim. SWA-bypassed layers (None entries) are excluded.
Returns:

| Type | Description |
|---|---|
| int | Total bytes if keys and values were stored as FP16 tensors. |
Source code in src/turboquant_vllm/kv_cache.py
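Together with vram_bytes(), this supports a simple savings report, sketched by the hypothetical helper below (not part of the library):

```python
# Hypothetical helper: summarize measured savings for an already-populated
# CompressedDynamicCache wrapper.
def report_savings(cc) -> str:
    used = cc.vram_bytes()
    baseline = cc.baseline_vram_bytes()
    ratio = baseline / max(used, 1)
    return f"{used / 2**20:.1f} MiB vs {baseline / 2**20:.1f} MiB FP16 ({ratio:.2f}x)"
```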
compression_stats ¶
Return compression statistics for reporting.
Reports per-component bit-widths, the true head_dim, compression
ratio, and per-sequence VRAM estimates at representative context
lengths (4K, 16K, 32K tokens). Only counts compressed (non-SWA)
layers. VRAM estimates are per sequence — multiply by batch size
for total memory.
Returns:

| Type | Description |
|---|---|
| dict[str, Any] | Dict with layer count, sequence length, per-component bit-widths, compressed/baseline sizes in MiB, compression ratio, VRAM savings, and per-sequence VRAM estimates at representative context lengths. |
Source code in src/turboquant_vllm/kv_cache.py
TurboQuantKVCache ¶
TurboQuantKVCache(
cache: Any,
head_dim: int,
bits: int = 3,
*,
seed: int = 42,
compress_keys: bool = True,
compress_values: bool = True,
)
Transparent KV cache compression wrapper (drop-in mode).
Intercepts cache updates to compress key/value tensors before they are stored. Both keys and values use TurboQuantCompressorMSE (full MSE-optimal quantization at the configured bit-width).
This is the "drop-in" approach where standard attention (Q @ K^T) operates on decompressed keys. For the QJL-corrected inner product path (TurboQuantProd), a custom attention kernel would be needed — see TurboQuantCompressorV2.asymmetric_attention_scores().
Supports the context manager protocol for automatic restore()
on scope exit, and warns if the cache is already wrapped.
Attributes:

| Name | Type | Description |
|---|---|---|
| cache | Any | The wrapped DynamicCache instance. |
| key_compressor | TurboQuantCompressorMSE | Compressor for key tensors. |
| value_compressor | TurboQuantCompressorMSE | Compressor for value tensors. |
| bits | int | Quantization bits per coordinate. |
| head_dim | int | Model head dimension. |
| enabled | bool | Whether compression is active. |
Examples:
from transformers import DynamicCache
cache = DynamicCache()
tq = TurboQuantKVCache(cache, head_dim=128, bits=3)
tq.enabled # True
Initialize the TurboQuant KV cache wrapper.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| cache | Any | A HuggingFace DynamicCache instance to wrap. | required |
| head_dim | int | Dimension of each attention head. | required |
| bits | int | Quantization bits per coordinate (default 3). | 3 |
| seed | int | Random seed for reproducibility. | 42 |
| compress_keys | bool | Whether to compress key tensors. | True |
| compress_values | bool | Whether to compress value tensors. | True |
Warns:

| Type | Description |
|---|---|
| UserWarning | If the cache is already wrapped (double-wrap). |
Source code in src/turboquant_vllm/kv_cache.py
Functions¶
disable ¶
Disable compression, passing through to original update.
Useful for A/B benchmarking within the same run.
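A rough A/B pattern is sketched below; run_pass and model are placeholders for whatever evaluation you run, and each pass gets a fresh cache so the results are comparable:

```python
# Sketch of an in-process A/B comparison (placeholder helpers throughout).
from transformers import DynamicCache
from turboquant_vllm import TurboQuantKVCache

for label, compress in [("fp16", False), ("turboquant", True)]:
    cache = DynamicCache()
    tq = TurboQuantKVCache(cache, head_dim=128, bits=3)
    if not compress:
        tq.disable()               # pass through to the original update()
    run_pass(model, cache, label)  # placeholder evaluation helper
    tq.restore()                   # unwrap before the next iteration
```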
enable ¶
restore ¶
Restore the original update method on the wrapped cache.
Call this to fully unwrap the cache and remove all TurboQuant interception.
__enter__ ¶
__enter__() -> TurboQuantKVCache
Enter the context manager.
Returns:

| Type | Description |
|---|---|
| TurboQuantKVCache | Self, for use in with statements. |
__exit__ ¶
Exit the context manager, restoring the original cache methods.
Returns:

| Type | Description |
|---|---|
| bool | False — exceptions are never suppressed. |
LloydMaxCodebook dataclass ¶
Precomputed optimal scalar quantizer for a given dimension and bit-width.
The codebook stores centroids and boundaries computed by the Lloyd-Max algorithm. It maps continuous coordinate values to discrete indices and back via nearest-centroid lookup.
Attributes:

| Name | Type | Description |
|---|---|---|
| centroids | Tensor | Reconstruction values, shape (2^bits,). |
| boundaries | Tensor | Partition boundaries, shape (2^bits - 1,). |
| bits | int | Number of quantization bits. |
| dim | int | Vector dimension used to compute the codebook. |
Examples:
Round-trip quantize and dequantize a tensor:
codebook = LloydMaxCodebook(centroids, boundaries, bits=3, dim=128)
indices = codebook.quantize(x)
x_hat = codebook.dequantize(indices)
Functions¶
quantize ¶
Map continuous values to nearest centroid indices.
Uses bucket search on partition boundaries for O(log n) lookup.
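The lookup can be sketched with torch.bucketize, assuming boundaries holds the 2^bits - 1 sorted partition edges between adjacent centroids:

```python
# Sketch of nearest-centroid lookup and reconstruction via boundary search.
import torch

def quantize_sketch(x: torch.Tensor, boundaries: torch.Tensor) -> torch.Tensor:
    # For each value, the number of boundaries below it is its centroid index.
    return torch.bucketize(x, boundaries)

def dequantize_sketch(indices: torch.Tensor, centroids: torch.Tensor) -> torch.Tensor:
    return centroids[indices]
```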
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| x | Tensor | Input tensor of any shape. | required |

Returns:

| Type | Description |
|---|---|
| Tensor | Integer tensor of the same shape with centroid indices in [0, 2^bits - 1]. |
Source code in src/turboquant_vllm/lloyd_max.py
dequantize ¶
Reconstruct continuous values from centroid indices.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| indices | Tensor | Integer tensor of centroid indices. | required |

Returns:

| Type | Description |
|---|---|
| Tensor | Float tensor of reconstructed values with the same shape as indices. |
Source code in src/turboquant_vllm/lloyd_max.py
TurboQuantMSE ¶
Stage 1 quantizer: rotation + Lloyd-Max scalar quantization.
Achieves near-optimal MSE distortion rate for high-dimensional vectors by exploiting the concentrated Beta distribution that emerges after random rotation.
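The concentration effect behind the shared scalar codebook can be checked empirically. The toy sketch below applies fresh random rotations to one fixed, deliberately non-Gaussian vector and inspects a single coordinate of the resulting unit vector; it is independent of the library's actual rotation construction:

```python
# Toy check: a coordinate of a randomly rotated, unit-normalized vector has
# mean ~0 and std ~1/sqrt(dim), whatever the input vector looks like.
import torch

dim, trials = 128, 2000
x = torch.rand(dim) * 10                   # fixed, clearly non-Gaussian vector
coords = []
for _ in range(trials):
    q, r = torch.linalg.qr(torch.randn(dim, dim))
    q = q * torch.sign(torch.diagonal(r))  # sign fix so q is Haar-distributed
    y = q @ x
    coords.append((y / y.norm())[0].item())
coords = torch.tensor(coords)
print(coords.mean().item(), coords.std().item(), (1 / dim) ** 0.5)
```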
Attributes:

| Name | Type | Description |
|---|---|---|
| dim | int | Vector dimension. |
| bits | int | Quantization bit-width. |
| codebook | LloydMaxCodebook | Precomputed Lloyd-Max codebook. |
| rotation | Tensor | Orthogonal rotation matrix, shape (dim, dim). |
Examples:
quantizer = TurboQuantMSE(dim=64, bits=4)
indices, norms = quantizer.quantize(torch.randn(8, 64))
reconstructed = quantizer.dequantize(indices, norms)
Initialize the MSE quantizer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dim | int | Vector dimension (head dimension of the model). | required |
| bits | int | Quantization bits per coordinate (2-4 typical). | required |
| seed | int | Random seed for the rotation matrix. | 42 |
Source code in src/turboquant_vllm/quantizer.py
Functions¶
quantize ¶
Quantize vectors to centroid indices.
Applies rotation, extracts norms, normalizes to unit sphere, then quantizes each coordinate independently.
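Put together, the steps amount to the sketch below (the library's codebook object and rotation handling are used instead in practice):

```python
# Sketch of the Stage 1 pipeline: rotate, take the norm, normalize to the
# unit sphere, then quantize each coordinate against a shared codebook.
import torch

def stage1_quantize(x, rotation, boundaries):
    rotated = x @ rotation                        # (..., dim) @ (dim, dim)
    norms = rotated.norm(dim=-1, keepdim=True)    # (..., 1)
    unit = rotated / norms.clamp_min(1e-12)
    indices = torch.bucketize(unit, boundaries)   # per-coordinate lookup
    return indices, norms

def stage1_dequantize(indices, norms, rotation, centroids):
    unit_hat = centroids[indices]
    return (unit_hat * norms) @ rotation.T        # inverse of an orthogonal rotation
```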
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| x | Tensor | Input tensor of shape (..., dim). | required |

Returns:

| Type | Description |
|---|---|
| tuple[Tensor, Tensor] | Tuple of (indices, norms) where indices is a long tensor of shape (..., dim) and norms is a float tensor of shape (..., 1). |
Raises:

| Type | Description |
|---|---|
| ValueError | If the last dimension of x does not equal dim. |
Source code in src/turboquant_vllm/quantizer.py
dequantize ¶
Reconstruct vectors from centroid indices and norms.
Looks up centroids, applies inverse rotation, and rescales by stored norms.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| indices | Tensor | Long tensor of centroid indices, shape (..., dim). | required |
| norms | Tensor | Float tensor of vector norms, shape (..., 1). | required |

Returns:

| Type | Description |
|---|---|
| Tensor | Reconstructed float tensor of shape (..., dim). |
Raises:

| Type | Description |
|---|---|
| ValueError | If the last dimension of indices does not equal dim. |
Source code in src/turboquant_vllm/quantizer.py
TurboQuantProd ¶
Two-stage quantizer with QJL correction for unbiased inner products.
Allocates (bits-1) bits to Lloyd-Max MSE quantization and 1 bit to Quantized Johnson-Lindenstrauss residual correction. The QJL step eliminates bias in dot-product estimation, which is critical for attention score computation (Q·K^T).
The unbiased estimator is

    <q, k> ≈ <q, k_hat> + ||r|| * sqrt(pi/2) / m * <S@q, sign(S@r)>

where k_hat is the Stage 1 (Lloyd-Max) reconstruction of the key, r is the quantization residual, S is a random Gaussian projection matrix, and m is the number of QJL projection dimensions.
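The correction term can be verified numerically in isolation; the toy check below is independent of the library code:

```python
# Toy check that ||r|| * sqrt(pi/2) / m * <S@q, sign(S@r)> estimates <q, r>
# in expectation for a Gaussian projection S.
import math
import torch

torch.manual_seed(0)
dim, m, trials = 128, 128, 2000
q = torch.randn(dim)
r = torch.randn(dim) * 0.1            # stand-in for a quantization residual
estimates = []
for _ in range(trials):
    s = torch.randn(m, dim)           # fresh Gaussian projection
    est = r.norm() * math.sqrt(math.pi / 2) / m * ((s @ q) @ torch.sign(s @ r))
    estimates.append(est.item())
print(sum(estimates) / trials, torch.dot(q, r).item())  # the two should be close
```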
Attributes:

| Name | Type | Description |
|---|---|---|
| dim | int | Vector dimension. |
| bits | int | Total bit budget (bits-1 for MSE, 1 for QJL). |
| mse_quantizer | TurboQuantMSE | Stage 1 quantizer with (bits-1) bits. |
| qjl_dim | int | Number of QJL projection dimensions. |
| qjl_matrix | Tensor | Random Gaussian projection matrix. |
Examples:
quantizer = TurboQuantProd(dim=64, bits=4)
indices, norms, signs, res_norms = quantizer.quantize(torch.randn(8, 64))
scores = quantizer.estimate_inner_product(
torch.randn(1, 64), indices, norms, signs, res_norms
)
Initialize the two-stage quantizer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dim | int | Vector dimension (head dimension of the model). | required |
| bits | int | Total bit budget per coordinate. Must be >= 2 (1 bit for MSE + 1 bit for QJL minimum). | required |
| qjl_dim | int \| None | Number of QJL projection dimensions. Defaults to dim (standard JL dimensionality). | None |
| seed | int | Random seed for rotation and projection matrices. | 42 |

Raises:

| Type | Description |
|---|---|
| ValueError | If bits < 2. |
Source code in src/turboquant_vllm/quantizer.py
Functions¶
quantize ¶
Quantize vectors with MSE + QJL correction.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| x | Tensor | Input tensor of shape (..., dim). | required |

Returns:

| Type | Description |
|---|---|
| tuple[Tensor, Tensor, Tensor, Tensor] | Tuple of (indices, norms, qjl_signs, residual_norms): indices are Lloyd-Max centroid indices of shape (..., dim); norms are vector norms of shape (..., 1); qjl_signs are sign bits of projected residuals of shape (..., qjl_dim); residual_norms are norms of quantization residuals of shape (..., 1). |
Source code in src/turboquant_vllm/quantizer.py
dequantize ¶
Reconstruct vectors from compressed representation.
Note: Full reconstruction is approximate. For attention computation,
use estimate_inner_product instead — it's more accurate because
QJL corrects inner-product bias, not reconstruction bias.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| indices | Tensor | Lloyd-Max centroid indices, shape (..., dim). | required |
| norms | Tensor | Vector norms, shape (..., 1). | required |
| qjl_signs | Tensor | QJL sign bits, shape (..., qjl_dim). | required |
| residual_norms | Tensor | Residual norms, shape (..., 1). | required |

Returns:

| Type | Description |
|---|---|
| Tensor | Approximately reconstructed tensor of shape (..., dim). |
Source code in src/turboquant_vllm/quantizer.py
estimate_inner_product ¶
estimate_inner_product(
query: Tensor,
indices: Tensor,
norms: Tensor,
qjl_signs: Tensor,
residual_norms: Tensor,
) -> Tensor
Compute unbiased inner product estimate between query and compressed key.
Uses the two-stage estimator

    <q, k> ≈ <q, k_hat> + ||r|| * sqrt(pi/2) / m * <S@q, signs>

where k_hat is the Stage 1 reconstruction of the key, r is its quantization residual, and signs are the stored QJL sign bits.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| query | Tensor | Query vectors, shape (..., dim). | required |
| indices | Tensor | Compressed key indices, shape (..., dim). | required |
| norms | Tensor | Key norms, shape (..., 1). | required |
| qjl_signs | Tensor | QJL sign bits for keys, shape (..., qjl_dim). | required |
| residual_norms | Tensor | Key residual norms, shape (..., 1). | required |
Returns:

| Type | Description |
|---|---|
| Tensor | Inner product estimates, shape matching the broadcast of query and key batch dimensions. |
Raises:

| Type | Description |
|---|---|
| ValueError | If the last dimension of query does not equal dim. |
Source code in src/turboquant_vllm/quantizer.py
Functions¶
solve_lloyd_max ¶
solve_lloyd_max(
d: int,
bits: int,
*,
use_exact: bool = False,
max_iter: int = 200,
tol: float = 1e-10,
) -> tuple[Tensor, Tensor]
Solve the Lloyd-Max conditions for optimal scalar quantization.
Results are cached by (d, bits, use_exact) so that multi-layer models (e.g., 32 layers × 2 K/V compressors = 64 calls) pay the scipy integration cost only once. Without caching, initialization takes 2+ minutes for models like Molmo2-8B.
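Under the Gaussian approximation, each iteration sets centroids to the conditional mean of their cell and boundaries to midpoints between adjacent centroids. A sample-based sketch of that loop is below (the library integrates the PDF with scipy rather than sampling):

```python
# Sample-based sketch of the Lloyd-Max iteration for the N(0, 1/d)
# coordinate distribution (illustrative only).
import torch

def lloyd_max_sketch(d: int, bits: int, iters: int = 100, n: int = 200_000):
    samples = torch.randn(n) / d ** 0.5                    # ~ coordinate distribution
    centroids = torch.linspace(-2 / d ** 0.5, 2 / d ** 0.5, 2 ** bits)
    for _ in range(iters):
        boundaries = (centroids[:-1] + centroids[1:]) / 2  # midpoints define cells
        cells = torch.bucketize(samples, boundaries)
        for j in range(2 ** bits):
            mask = cells == j
            if mask.any():
                centroids[j] = samples[mask].mean()        # cell conditional mean
    return centroids, (centroids[:-1] + centroids[1:]) / 2
```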
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| d | int | Vector dimension (determines the distribution shape). | required |
| bits | int | Number of quantization bits (produces 2^bits centroids). | required |
| use_exact | bool | If True, use the exact Beta PDF. If False, use the Gaussian approximation (faster, accurate for d >= 64). | False |
| max_iter | int | Maximum Lloyd-Max iterations. | 200 |
| tol | float | Convergence tolerance on centroid movement. | 1e-10 |
Returns:

| Type | Description |
|---|---|
| tuple[Tensor, Tensor] | Tuple of (centroids, boundaries) as 1-D tensors. Centroids has length 2^bits, boundaries has length 2^bits - 1. |