turboquant_vllm.kv_cache ¶
TurboQuant-compressed KV cache for HuggingFace transformers.
Two integration modes:

- TurboQuantKVCache: accuracy benchmark only (no VRAM savings). Compresses and then immediately decompresses, storing the lossy FP32 result back into the standard DynamicCache. Measures the quality impact of quantization.
- CompressedDynamicCache: real VRAM savings. Stores uint8 indices + fp32 norms in compressed form and dequantizes lazily on each cache read (one layer at a time). Supports asymmetric K/V bit-widths via the k_bits and v_bits parameters.
Both use non-invasive method replacement: we save a reference to the
original update() method and replace it with a wrapper. This avoids
subclassing DynamicCache, which is fragile across transformers versions.
Both classes support the context manager protocol (with statement)
for automatic restore() on scope exit, and detect double-wrapping.
Usage

```python
# Mode 1: Accuracy benchmark (no VRAM savings)
cache = DynamicCache()
tq_cache = TurboQuantKVCache(cache, head_dim=128, bits=3)

# Mode 2: Real VRAM savings (with context manager)
cache = DynamicCache()
with CompressedDynamicCache(cache, head_dim=128, bits=3) as compressed:
    pass  # cache.update is patched inside the block
# cache.update is restored here
```
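A fuller sketch of mode 2 wired into generation; the checkpoint name and the head_dim lookup are illustrative assumptions, not part of this module:

```python
# Hedged sketch: CompressedDynamicCache around a HuggingFace generate() call.
# The model name below is only an example; head_dim is read from the config.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

from turboquant_vllm.kv_cache import CompressedDynamicCache

model_name = "meta-llama/Llama-3.2-1B"  # illustrative checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
head_dim = getattr(model.config, "head_dim",
                   model.config.hidden_size // model.config.num_attention_heads)

cache = DynamicCache()
with CompressedDynamicCache(cache, head_dim=head_dim, bits=4) as compressed:
    inputs = tok("The capital of France is", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, past_key_values=cache, max_new_tokens=32)
    print(tok.decode(out[0], skip_special_tokens=True))
    print(f"compressed KV storage: {compressed.vram_bytes()} bytes")
# cache.update / get_seq_length are restored here
```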
Examples:

```python
from transformers import DynamicCache

cache = DynamicCache()
tq = TurboQuantKVCache(cache, head_dim=128, bits=3)
```
See Also
turboquant_vllm.compressors: TurboQuantCompressorMSE and CompressedValues.
arXiv 2504.19874, Section 5.2: TurboQuant algorithm reference.
Classes¶
TurboQuantKVCache ¶
TurboQuantKVCache(
cache: Any,
head_dim: int,
bits: int = 3,
*,
seed: int = 42,
compress_keys: bool = True,
compress_values: bool = True,
)
Transparent KV cache compression wrapper (drop-in mode).
Intercepts cache updates to compress key/value tensors before they are stored. Both keys and values use TurboQuantCompressorMSE (full MSE-optimal quantization at the configured bit-width).
This is the "drop-in" approach where standard attention (Q @ K^T) operates on decompressed keys. For the QJL-corrected inner product path (TurboQuantProd), a custom attention kernel would be needed — see TurboQuantCompressorV2.asymmetric_attention_scores().
Supports the context manager protocol for automatic restore()
on scope exit, and warns if the cache is already wrapped.
Attributes:

| Name | Type | Description |
|---|---|---|
| cache | Any | The wrapped DynamicCache instance. |
| key_compressor | TurboQuantCompressorMSE | Compressor for key tensors. |
| value_compressor | TurboQuantCompressorMSE | Compressor for value tensors. |
| bits | int | Quantization bits per coordinate. |
| head_dim | int | Model head dimension. |
| enabled | bool | Whether compression is active. |
Examples:

```python
from transformers import DynamicCache

cache = DynamicCache()
tq = TurboQuantKVCache(cache, head_dim=128, bits=3)
tq.enabled  # True
```
Initialize the TurboQuant KV cache wrapper.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| cache | Any | A HuggingFace DynamicCache instance to wrap. | required |
| head_dim | int | Dimension of each attention head. | required |
| bits | int | Quantization bits per coordinate (default 3). | 3 |
| seed | int | Random seed for reproducibility. | 42 |
| compress_keys | bool | Whether to compress key tensors. | True |
| compress_values | bool | Whether to compress value tensors. | True |
Warns:

| Type | Description |
|---|---|
| UserWarning | If the cache is already wrapped by TurboQuant. |
Functions¶
disable ¶
Disable compression, passing through to original update.
Useful for A/B benchmarking within the same run.
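A sketch of that A/B pattern; `model` and `prompt_ids` are assumed to already exist, and head_dim=128 is illustrative:

```python
# Hedged sketch: baseline vs. compressed generation in one process.
# Assumes `model` and `prompt_ids` are already defined.
from transformers import DynamicCache

from turboquant_vllm.kv_cache import TurboQuantKVCache

# Run A: wrapper present but compression disabled (plain FP16 cache behaviour)
cache_a = DynamicCache()
tq_a = TurboQuantKVCache(cache_a, head_dim=128, bits=3)
tq_a.disable()
out_a = model.generate(**prompt_ids, past_key_values=cache_a, max_new_tokens=64)

# Run B: compression enabled (lossy FP32 written back into the cache)
cache_b = DynamicCache()
tq_b = TurboQuantKVCache(cache_b, head_dim=128, bits=3)
tq_b.enable()  # enabled by default; shown for symmetry
out_b = model.generate(**prompt_ids, past_key_values=cache_b, max_new_tokens=64)
# Compare out_a vs. out_b (e.g. exact-match rate) to quantify the quality impact.
```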
enable ¶
Re-enable compression after a prior disable().
restore ¶
Restore the original update method on the wrapped cache.
Call this to fully unwrap the cache and remove all TurboQuant interception.
__enter__ ¶
__enter__() -> TurboQuantKVCache
Enter the context manager.
Returns:

| Type | Description |
|---|---|
| TurboQuantKVCache | Self, for use in `with` blocks. |
__exit__ ¶
Exit the context manager, restoring the original cache methods.
Returns:

| Type | Description |
|---|---|
| bool | False; exceptions are never suppressed. |
CompressedDynamicCache ¶
CompressedDynamicCache(
cache: Any,
head_dim: int,
bits: int | None = 3,
*,
k_bits: int | None = None,
v_bits: int | None = None,
seed: int = 42,
model_config: Any = None,
)
KV cache with real VRAM savings via compressed index storage.
Stores TurboQuant-compressed representations and dequantizes lazily on each cache read. Only one layer's decompressed tensors are held in memory at a time — previous layers are freed on the next update.
Supports heterogeneous head dimensions for the lazy dequantized
(non-fused) cache-read path via per-head_dim compressors created
lazily on first use. The fused path consumes shared rotation
and centroids for the primary head_dim only, so it must not be
used for models with mixed head dimensions (e.g. Gemma 4: d=256
sliding, d=512 global).
Storage per token per head (head_dim=128):
| Mode | Dtype | Bytes | Compression | Quality |
|---|---|---|---|---|
| FP16 baseline | fp16 | 256 | 1.0x | — |
| TQ3 (3-bit) | uint8 | 132 | 1.94x | ~95% cosine |
| TQ4 (4-bit) | nibble | 68 | 3.76x | ~97% cosine |
At bits=4, indices are nibble-packed (two 4-bit values per
byte), nearly doubling compression over TQ3 with better quality.
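The byte counts above follow from simple arithmetic; a small sketch, assuming one fp32 norm per token per head (which is what the 132- and 68-byte figures imply):

```python
# Hedged sketch: per-token, per-head storage arithmetic (head_dim=128).
import torch

head_dim = 128
fp16_bytes = head_dim * 2        # 256 B baseline
tq3_bytes = head_dim + 4         # 128 B uint8 indices + 4 B fp32 norm = 132 B
tq4_bytes = head_dim // 2 + 4    # 64 B nibble-packed indices + 4 B fp32 norm = 68 B
print(fp16_bytes / tq3_bytes, fp16_bytes / tq4_bytes)   # ~1.94, ~3.76

# Nibble packing: two 4-bit codes per byte.
idx = torch.randint(0, 16, (head_dim,), dtype=torch.uint8)  # 4-bit codes
packed = (idx[0::2] << 4) | idx[1::2]                       # head_dim // 2 bytes
hi, lo = packed >> 4, packed & 0x0F                         # unpack
assert torch.equal(hi, idx[0::2]) and torch.equal(lo, idx[1::2])
```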
Float32 norms are required — fp16 causes output degradation at
10K+ token sequences due to accumulated precision loss.
For models with mixed global and sliding window attention layers
(e.g. Gemma-2, Gemma-3), SWA layers automatically bypass compression
via the is_sliding attribute on DynamicSlidingWindowLayer.
Only global attention layers are compressed. Pass model_config
to enable a diagnostic warning when the cache lacks SWA metadata.
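A sketch of passing the model config for a mixed-attention model; the Gemma-2 checkpoint name is illustrative:

```python
# Hedged sketch: enabling the SWA diagnostic for a mixed-attention model.
from transformers import AutoConfig, DynamicCache

from turboquant_vllm.kv_cache import CompressedDynamicCache

cfg = AutoConfig.from_pretrained("google/gemma-2-9b")  # illustrative checkpoint
head_dim = getattr(cfg, "head_dim", cfg.hidden_size // cfg.num_attention_heads)

cache = DynamicCache()
compressed = CompressedDynamicCache(
    cache,
    head_dim=head_dim,
    bits=4,
    model_config=cfg,  # lets the wrapper warn if the cache lacks SWA metadata
)
```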
Integration strategy: non-invasive method replacement (same pattern
as TurboQuantKVCache). Patches update() and get_seq_length()
on the wrapped DynamicCache. Supports the context manager protocol
for automatic restore() on scope exit, and warns on double-wrap.
Compatible with both transformers 4.x and 5.x lazy_initialization
signatures via try/except fallback in _ensure_layer_initialized.
Attributes:

| Name | Type | Description |
|---|---|---|
| cache | Any | The wrapped DynamicCache instance. |
| key_compressor | TurboQuantCompressorMSE | Compressor for key tensors. |
| value_compressor | TurboQuantCompressorMSE | Compressor for value tensors. |
| bits | int | Quantization bits per coordinate. |
| head_dim | int | Model head dimension. |
| enabled | bool | Whether compression is active. |
| fused_mode | bool | When True, cache reads skip decompression; the fused kernel consumes the compressed tensors via get_compressed(). |
| rotation | Tensor | Shared orthogonal rotation matrix [head_dim, head_dim], fp32. |
| centroids | Tensor | Shared Lloyd-Max codebook [2^bits], fp32. |
Examples:

```python
from transformers import DynamicCache

cache = DynamicCache()
compressed = CompressedDynamicCache(cache, head_dim=128, bits=4)
compressed.vram_bytes()  # 0
```
Initialize the compressed KV cache wrapper.
Sets up per-head_dim compressors (lazily created via
_get_compressors()), internal storage for compressed
representations, and incremental decompressed buffers.
fused_mode starts disabled. When model_config has
mixed sliding/full attention layer_types, full attention
layers are bypassed (with list padding) to preserve retrieval
quality while allowing get_seq_length to delegate correctly.
Keys and values can use different bit-widths via k_bits and
v_bits. When both are None, bits applies to both
(backward compatible). Any 4-bit component requires even
head_dim for nibble packing.
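For example, a sketch of an asymmetric configuration (4-bit keys, 3-bit values):

```python
# Hedged sketch: asymmetric K/V bit-widths.
from transformers import DynamicCache

from turboquant_vllm.kv_cache import CompressedDynamicCache

cache = DynamicCache()
compressed = CompressedDynamicCache(
    cache,
    head_dim=128,   # must be even because k_bits=4 uses nibble packing
    k_bits=4,       # keys: nibble-packed uint8 indices
    v_bits=3,       # values: 3-bit codes stored as uint8
)
```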
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| cache | Any | A HuggingFace DynamicCache instance to wrap. | required |
| head_dim | int | Dimension of each attention head. Must be even when any component uses 4-bit (nibble packing). | required |
| bits | int or None | Shorthand bit-width applied to both keys and values when k_bits and v_bits are None. | 3 |
| k_bits | int or None | Key quantization bits (overrides bits for keys). | None |
| v_bits | int or None | Value quantization bits (overrides bits for values). | None |
| seed | int | Random seed for reproducibility. | 42 |
| model_config | Any | Optional model config (e.g. a Gemma-2/Gemma-3 config) used to detect mixed sliding/global attention layers and enable the SWA-metadata diagnostic. | None |
Raises:

| Type | Description |
|---|---|
| ValueError | If no bit-width is specified (all three are None). |
| ValueError | If any 4-bit component has an odd head_dim (nibble packing requires an even head_dim). |
Warns:

| Type | Description |
|---|---|
| UserWarning | If the cache is already wrapped by TurboQuant. |
| UserWarning | If model_config is provided but the cache lacks SWA metadata for a mixed-attention model. |
Attributes¶
key_compressor property ¶
key_compressor: TurboQuantCompressorMSE
Primary key compressor (backward compatibility).
value_compressor property ¶
value_compressor: TurboQuantCompressorMSE
Primary value compressor (backward compatibility).
rotation property ¶
Shared orthogonal rotation matrix [head_dim, head_dim] fp32.
K and V use the same rotation (same seed).
Returns:

| Type | Description |
|---|---|
| Tensor | The rotation matrix from the key compressor's quantizer. |
centroids property ¶
Shared Lloyd-Max codebook [2^bits] fp32.
Returns:

| Type | Description |
|---|---|
| Tensor | Centroid values from the key compressor's quantizer. |
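A quick inspection sketch, assuming `compressed` is a CompressedDynamicCache built with head_dim=128 and bits=4:

```python
# Hedged sketch: inspecting the shared rotation and codebook.
import torch

Q = compressed.rotation                 # [128, 128] fp32, shared by K and V
eye = torch.eye(Q.shape[0], dtype=Q.dtype, device=Q.device)
print(torch.allclose(Q @ Q.T, eye, atol=1e-4))   # True for an orthogonal rotation

C = compressed.centroids                # [2**4] = 16 Lloyd-Max centroids, fp32
print(C.shape, C.sort().values)
```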
Functions¶
get_compressed ¶
Return compressed K and V for a layer (fused kernel API).
Provides the raw nibble-packed indices and norms without dequantization, for use by the fused TQ4 Flash Attention kernel.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| layer_idx | int | Transformer layer index. | required |
Returns:

| Type | Description |
|---|---|
| Tensor | Compressed key and value indices plus norms for the layer. Indices are uint8 (nibble-packed for 4-bit components); norms are fp32. |
Raises:

| Type | Description |
|---|---|
| ValueError | If no compressed data is available for layer_idx. |
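A sketch of how a fused kernel might consume this API; the tuple layout in the unpacking is an assumption for illustration only:

```python
# Hedged sketch: reading raw compressed storage for a fused TQ4 kernel.
# The 3-tuple unpacking below is an ASSUMED layout, not the documented one.
layer_idx = 0
k_packed, v_packed, norms = compressed.get_compressed(layer_idx)
# k_packed / v_packed: uint8, nibble-packed (two 4-bit codes per byte)
# norms: fp32 per-token norms
# A fused kernel would additionally take compressed.rotation and
# compressed.centroids to dequantize on the fly inside the attention loop.
```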
disable ¶
Disable compression, passing through to the original update.
enable ¶
Re-enable compression after a prior disable().
restore ¶
Restore original methods on the wrapped cache.
Call this to fully unwrap the cache and remove all TurboQuant interception.
__enter__ ¶
__enter__() -> CompressedDynamicCache
Enter the context manager.
Returns:

| Type | Description |
|---|---|
| CompressedDynamicCache | Self, for use in `with` blocks. |
__exit__ ¶
Exit the context manager, restoring the original cache methods.
Returns:

| Type | Description |
|---|---|
| bool | False; exceptions are never suppressed. |
vram_bytes ¶
Calculate total VRAM used by compressed storage.
SWA-bypassed layers (None entries) are excluded from the total.
Returns:

| Type | Description |
|---|---|
| int | Total bytes across all compressed layers (keys + values). |
baseline_vram_bytes ¶
Estimate FP16 VRAM that would be used without compression.
Accounts for nibble-packed indices by doubling the last dimension to recover the original head_dim. SWA-bypassed layers (None entries) are excluded.
Returns:

| Type | Description |
|---|---|
| int | Total bytes if keys and values were stored as FP16 tensors. |
compression_stats ¶
Return compression statistics for reporting.
Reports per-component bit-widths, the true head_dim, compression
ratio, and per-sequence VRAM estimates at representative context
lengths (4K, 16K, 32K tokens). Only counts compressed (non-SWA)
layers. VRAM estimates are per sequence — multiply by batch size
for total memory.
Returns:

| Type | Description |
|---|---|
| dict[str, Any] | Dict with layer count, sequence length, per-component bit-widths, compressed/baseline sizes in MiB, compression ratio, VRAM savings, and per-sequence VRAM estimates at representative context lengths. |
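A reporting sketch, assuming `compressed` wraps a cache already filled by a generation run:

```python
# Hedged sketch: memory reporting after generation.
used = compressed.vram_bytes()               # bytes of compressed K/V storage
baseline = compressed.baseline_vram_bytes()  # bytes if the same K/V were FP16

if used:
    print(f"compressed:    {used / 2**20:.1f} MiB")
    print(f"fp16 baseline: {baseline / 2**20:.1f} MiB")
    print(f"ratio:         {baseline / used:.2f}x")

stats = compressed.compression_stats()       # bit-widths, ratio, per-sequence estimates
for key, value in stats.items():
    print(f"{key}: {value}")
```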