turboquant_vllm.kv_cache ¶
TurboQuant-compressed KV cache for HuggingFace transformers.
Two integration modes:

- TurboQuantKVCache — Accuracy benchmark only (no VRAM savings). Compresses then immediately decompresses, storing the lossy FP32 result back into the standard DynamicCache. Measures the quality impact of quantization.
- CompressedDynamicCache — Real VRAM savings. Stores uint8 indices + fp32 norms in compressed form and dequantizes lazily on each cache read (one layer at a time). Achieves ~2x compression vs the FP16 KV cache.
Both use non-invasive method replacement: we save a reference to the original update() method and replace it with a wrapper. This avoids subclassing DynamicCache, which is fragile across transformers versions.
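The saved-reference pattern can be sketched as follows. This is a minimal stand-in: `DynamicCache` here is a toy class, not the transformers one, and `Wrapper` is illustrative, not the actual implementation.

```python
class DynamicCache:
    """Toy stand-in for transformers.DynamicCache."""
    def update(self, key, value, layer_idx):
        return key, value

class Wrapper:
    """Non-invasive method replacement: save the original bound
    update(), then install our own on the *instance* only."""
    def __init__(self, cache):
        self.cache = cache
        self._orig_update = cache.update      # original bound method
        cache.update = self._update           # instance-level override

    def _update(self, key, value, layer_idx):
        # A real wrapper would compress/decompress here before delegating.
        return self._orig_update(key, value, layer_idx)

cache = DynamicCache()
wrapper = Wrapper(cache)
cache.update("k", "v", 0)   # now routed through Wrapper._update
```

Because only the instance is patched, other DynamicCache objects and the class itself are untouched, and unwrapping is just a matter of putting the saved reference back.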
Usage

```python
# Mode 1: Accuracy benchmark (no VRAM savings)
cache = DynamicCache()
tq_cache = TurboQuantKVCache(cache, head_dim=128, bits=3)

# Mode 2: Real VRAM savings
cache = DynamicCache()
compressed = CompressedDynamicCache(cache, head_dim=128, bits=3)

# In both cases, pass `cache` (not the wrapper) to model.generate()
```
Examples:

```python
from transformers import DynamicCache

cache = DynamicCache()
tq = TurboQuantKVCache(cache, head_dim=128, bits=3)
```
See Also

- turboquant_vllm.compressors: TurboQuantCompressorMSE and CompressedValues.
- arXiv 2504.19874, Section 5.2: TurboQuant algorithm reference.
Classes¶
TurboQuantKVCache ¶
```python
TurboQuantKVCache(
    cache: Any,
    head_dim: int,
    bits: int = 3,
    *,
    seed: int = 42,
    compress_keys: bool = True,
    compress_values: bool = True,
)
```
Transparent KV cache compression wrapper (drop-in mode).
Intercepts cache updates to compress key/value tensors before they are stored. Both keys and values use TurboQuantCompressorMSE (full MSE-optimal quantization at the configured bit-width).
This is the "drop-in" approach where standard attention (Q @ K^T) operates on decompressed keys. For the QJL-corrected inner product path (TurboQuantProd), a custom attention kernel would be needed — see TurboQuantCompressorV2.asymmetric_attention_scores().
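In drop-in mode the score computation itself is unchanged; only the keys have been through a lossy round trip. A sketch with a toy per-vector uniform quantizer standing in for TurboQuantCompressorMSE (the real compressor is different; this only illustrates the data flow):

```python
import torch

# Toy lossy compress -> decompress round trip (NOT the real
# TurboQuantCompressorMSE): per-vector max-abs scaling, then
# uniform quantization to 2**bits - 1 levels.
def roundtrip(x, bits=3):
    levels = 2 ** bits - 1
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8)
    q = torch.round((x / scale + 1) / 2 * levels)   # quantize to [0, levels]
    return (q / levels * 2 - 1) * scale             # dequantize

q = torch.randn(1, 8, 4, 128)   # [batch, heads, seq, head_dim]
k = torch.randn(1, 8, 4, 128)
k_lossy = roundtrip(k)

# Standard attention scores: Q @ K^T runs on the *decompressed* keys,
# so no custom kernel is needed in drop-in mode.
scores = q @ k_lossy.transpose(-1, -2) / 128 ** 0.5
print(scores.shape)  # torch.Size([1, 8, 4, 4])
```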
Attributes:
| Name | Type | Description |
|---|---|---|
| `cache` | `Any` | The wrapped DynamicCache instance. |
| `key_compressor` | `TurboQuantCompressorMSE` | Compressor for key tensors. |
| `value_compressor` | `TurboQuantCompressorMSE` | Compressor for value tensors. |
| `bits` | `int` | Quantization bits per coordinate. |
| `head_dim` | `int` | Model head dimension. |
| `enabled` | `bool` | Whether compression is active. |
Examples:

```python
from transformers import DynamicCache

cache = DynamicCache()
tq = TurboQuantKVCache(cache, head_dim=128, bits=3)
tq.enabled  # True
```
Initialize the TurboQuant KV cache wrapper.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `cache` | `Any` | A HuggingFace DynamicCache instance to wrap. | required |
| `head_dim` | `int` | Dimension of each attention head. | required |
| `bits` | `int` | Quantization bits per coordinate. | `3` |
| `seed` | `int` | Random seed for reproducibility. | `42` |
| `compress_keys` | `bool` | Whether to compress key tensors. | `True` |
| `compress_values` | `bool` | Whether to compress value tensors. | `True` |
Source code in src/turboquant_vllm/kv_cache.py
CompressedDynamicCache ¶
KV cache with real VRAM savings via compressed index storage.
Stores TurboQuant-compressed representations and dequantizes lazily on each cache read. Only one layer's decompressed tensors are held in memory at a time — previous layers are freed on the next update.
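The lazy, one-layer-at-a-time buffering described above can be sketched like this (hypothetical storage layout; the real class keeps compressed K/V per layer, but the buffering policy is the same idea):

```python
class LazyStore:
    """Keep everything compressed; hold at most ONE layer's
    decompressed buffer at a time."""
    def __init__(self, decompress):
        self.compressed = {}        # layer_idx -> compressed payload
        self.buffer = None          # decompressed data for one layer
        self.buffer_layer = None
        self.decompress = decompress

    def read(self, layer_idx):
        if self.buffer_layer != layer_idx:
            # Rebinding drops the previous layer's buffer, freeing it.
            self.buffer = self.decompress(self.compressed[layer_idx])
            self.buffer_layer = layer_idx
        return self.buffer

# Toy payloads and a toy decompressor to show the flow.
store = LazyStore(decompress=lambda payload: payload * 2)
store.compressed = {0: 1, 1: 10}
print(store.read(0), store.read(1))  # 2 20
```

During decoding the model reads layers in order, so each layer's buffer is built once per forward pass and replaced when the next layer is read.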
Storage per token per head (head_dim=128):
| Mode | Dtype | Bytes | Compression | Quality |
|---|---|---|---|---|
| FP16 baseline | fp16 | 256 | 1.0x | — |
| TQ3 (3-bit) | uint8 | 132 | 1.94x | ~95% cosine |
| TQ4 (4-bit) | nibble | 68 | 3.76x | ~97% cosine |
At bits=4, indices are nibble-packed (two 4-bit values per byte), nearly doubling compression over TQ3 with better quality.
Float32 norms are required — fp16 causes output degradation at 10K+ token sequences due to accumulated precision loss.
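The byte counts in the table follow directly from head_dim=128 plus one fp32 norm per vector; a quick check:

```python
HEAD_DIM = 128

def bytes_per_token(bits):
    # Indices: one uint8 per coordinate at 3 bits,
    # nibble-packed (two coordinates per byte) at 4 bits.
    idx = HEAD_DIM if bits == 3 else HEAD_DIM // 2
    return idx + 4  # + one fp32 norm per vector

fp16 = HEAD_DIM * 2                       # 256-byte FP16 baseline
tq3, tq4 = bytes_per_token(3), bytes_per_token(4)
print(tq3, round(fp16 / tq3, 2))          # 132 1.94
print(tq4, round(fp16 / tq4, 2))          # 68 3.76
```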
Integration strategy: non-invasive method replacement (same pattern as TurboQuantKVCache). Patches update() and get_seq_length() on the wrapped DynamicCache.
Attributes:
| Name | Type | Description |
|---|---|---|
| `cache` | `Any` | The wrapped DynamicCache instance. |
| `key_compressor` | `TurboQuantCompressorMSE` | Compressor for key tensors. |
| `value_compressor` | `TurboQuantCompressorMSE` | Compressor for value tensors. |
| `bits` | `int` | Quantization bits per coordinate. |
| `head_dim` | `int` | Model head dimension. |
| `enabled` | `bool` | Whether compression is active. |
| `fused_mode` | `bool` | When True, skip decompression in `update()`. |
| `rotation` | `Tensor` | Shared rotation matrix `[head_dim, head_dim]`. |
| `centroids` | `Tensor` | Shared codebook `[2^bits]`. |
Examples:

```python
from transformers import DynamicCache

cache = DynamicCache()
compressed = CompressedDynamicCache(cache, head_dim=128, bits=4)
compressed.vram_bytes()  # 0
```
Initialize the compressed KV cache wrapper.
Sets up compressors, internal storage for compressed representations,
and incremental decompressed buffers. fused_mode starts disabled.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `cache` | `Any` | A HuggingFace DynamicCache instance to wrap. | required |
| `head_dim` | `int` | Dimension of each attention head. Must be even when `bits=4`. | required |
| `bits` | `int` | Quantization bits per coordinate. Use 4 for nibble-packed storage (3.76x compression). | `3` |
| `seed` | `int` | Random seed for reproducibility. | `42` |
|
Raises:
| Type | Description |
|---|---|
| `ValueError` | If `head_dim` is odd when `bits=4`. |
Source code in src/turboquant_vllm/kv_cache.py
Attributes¶
rotation property ¶
Shared orthogonal rotation matrix [head_dim, head_dim] fp32.
K and V use the same rotation (same seed).
Returns:
| Type | Description |
|---|---|
| `Tensor` | The rotation matrix from the key compressor's quantizer. |
centroids property ¶
Shared Lloyd-Max codebook [2^bits] fp32.
Returns:
| Type | Description |
|---|---|
| `Tensor` | Centroid values from the key compressor's quantizer. |
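The fitting procedure behind such a codebook can be sketched as a 1-D Lloyd-Max (k-means) iteration. This is illustrative only — the actual quantizer construction lives in the compressors module and may differ:

```python
import numpy as np

def lloyd_max(samples, bits=3, iters=50):
    """Fit a 2**bits scalar codebook by alternating nearest-centroid
    assignment and centroid re-estimation (1-D k-means)."""
    rng = np.random.default_rng(0)
    c = np.sort(rng.choice(samples, 2 ** bits, replace=False))
    for _ in range(iters):
        # Assign each sample to its nearest centroid.
        a = np.abs(samples[:, None] - c[None, :]).argmin(axis=1)
        # Move each centroid to the mean of its assigned samples.
        for j in range(len(c)):
            if (a == j).any():
                c[j] = samples[a == j].mean()
        c = np.sort(c)
    return c

samples = np.random.default_rng(42).standard_normal(10_000)
codebook = lloyd_max(samples, bits=3)
print(codebook.shape)  # (8,)
```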
Functions¶
get_compressed ¶
Return compressed K and V for a layer (fused kernel API).
Provides the raw nibble-packed indices and norms without dequantization, for use by the fused TQ4 Flash Attention kernel.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `layer_idx` | `int` | Transformer layer index. | required |
Returns:
| Type | Description |
|---|---|
| `Tensor` | Nibble-packed indices and norms for K and V; indices are uint8, norms fp32. |
Source code in src/turboquant_vllm/kv_cache.py
disable ¶
enable ¶
restore ¶
Restore original methods on the wrapped cache.
Call this to fully unwrap the cache and remove all TurboQuant interception.
Source code in src/turboquant_vllm/kv_cache.py
vram_bytes ¶
Calculate total VRAM used by compressed storage.
Returns:
| Type | Description |
|---|---|
| `int` | Total bytes across all compressed layers (keys + values). |
Source code in src/turboquant_vllm/kv_cache.py
baseline_vram_bytes ¶
Estimate FP16 VRAM that would be used without compression.
Accounts for nibble-packed indices by doubling the last dimension to recover the original head_dim.
Returns:
| Type | Description |
|---|---|
int
|
Total bytes if keys and values were stored as FP16 tensors. |
Source code in src/turboquant_vllm/kv_cache.py
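As a sanity check on the doubling described above, here is a hypothetical helper mirroring that logic (`baseline_bytes` is illustrative, not the actual method):

```python
def baseline_bytes(idx_last_dim, n_vectors, nibble_packed):
    # Recover the true head_dim from the stored index dimension:
    # nibble-packed tensors hold two coordinates per byte.
    head_dim = idx_last_dim * 2 if nibble_packed else idx_last_dim
    return n_vectors * head_dim * 2  # fp16 = 2 bytes per coordinate

print(baseline_bytes(64, 1, True))    # 256 (TQ4: 64 packed bytes -> head_dim 128)
print(baseline_bytes(128, 1, False))  # 256 (TQ3: one uint8 per coordinate)
```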
compression_stats ¶
Return compression statistics for reporting.
Reports the true head_dim (not the packed index dimension)
and includes a nibble_packed flag.
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Dict with layer count, sequence length, compressed/baseline sizes in MiB, compression ratio, packing mode, and VRAM savings. |