HuggingFace Integration

Use turboquant-vllm directly with HuggingFace's DynamicCache for research, benchmarking, or non-vLLM inference.

Install

pip install turboquant-vllm

Usage

from transformers import DynamicCache
from turboquant_vllm import CompressedDynamicCache

cache = DynamicCache()
compressed = CompressedDynamicCache(cache, head_dim=128, bits=4)

# Pass `cache` (the original DynamicCache, not the `compressed` wrapper) to
# model.generate(), for example as past_key_values=cache.
# Compression happens transparently on every cache.update().

Bit Width Options

Bits	Compression	Quality	Use Case
2	~8x	~75% cosine	Experimental
3	1.94x	~95% cosine	Memory-constrained
4	3.76x	~97% cosine	Recommended
5	~1.6x	~99% cosine	Quality-critical

TQ4 (bits=4) is the sweet spot

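With 4-bit indices, two indices are packed into each byte (nibble packing), which gives 3.76x compression at ~97% cosine similarity. With 3-bit indices, each index occupies a full byte because indices are not packed across byte boundaries, so compression drops to 1.94x.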
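The two ratios above can be reproduced with simple byte counting. The sketch below is illustrative only and is not the library's code: it assumes an fp16 baseline and one float32 norm (4 bytes) stored per head vector, an assumption chosen because it matches the 3.76x and 1.94x figures at head_dim=128. The function names are hypothetical, and the other rows of the table are not modelled.

```python
def compression_ratio(head_dim: int, nibble_packed: bool, norm_bytes: int = 4) -> float:
    """fp16 size divided by compressed size for one head vector.

    Assumptions (not confirmed by the library): the baseline is fp16
    (2 bytes per element) and each compressed vector carries one
    float32 norm (norm_bytes=4) next to its indices.
    """
    fp16_bytes = head_dim * 2
    # Nibble packing stores two indices per byte; otherwise one byte per index.
    index_bytes = (head_dim + 1) // 2 if nibble_packed else head_dim
    return fp16_bytes / (index_bytes + norm_bytes)


def pack_nibbles(indices: list[int]) -> bytes:
    """Pack 4-bit indices two per byte, first index in the high nibble."""
    if any(not 0 <= i < 16 for i in indices):
        raise ValueError("4-bit indices must be in the range 0..15")
    # Pad with a zero index when the count is odd so every byte is full.
    padded = indices + [0] * (len(indices) % 2)
    return bytes((hi << 4) | lo for hi, lo in zip(padded[0::2], padded[1::2]))


def unpack_nibbles(packed: bytes, count: int) -> list[int]:
    """Inverse of pack_nibbles; `count` drops the padding index, if any."""
    out: list[int] = []
    for byte in packed:
        out.extend((byte >> 4, byte & 0x0F))
    return out[:count]


print(round(compression_ratio(128, nibble_packed=True), 2))   # 3.76
print(round(compression_ratio(128, nibble_packed=False), 2))  # 1.94
```

Under these assumptions a 128-element head vector shrinks from 256 bytes to 68 bytes when nibble-packed, and to 132 bytes when each index takes a full byte.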

Accuracy-Only Mode

To measure compression quality without VRAM savings:

from transformers import DynamicCache
from turboquant_vllm import TurboQuantKVCache

cache = DynamicCache()
wrapper = TurboQuantKVCache(cache, head_dim=128, bits=4)

# Keys and values are compressed then immediately decompressed
# No VRAM savings, but measures the quality impact of compression
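The quality figures in the table are given as cosine similarity. As a minimal sketch of that metric, the helper below averages the per-vector cosine similarity between original and round-tripped vectors. It uses plain Python lists as stand-ins for key/value tensors; the function names are hypothetical, and how the library computes its own reported numbers is not specified here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Convention chosen here for zero vectors.
    return dot / (norm_a * norm_b)


def mean_cosine(originals: list[list[float]], reconstructed: list[list[float]]) -> float:
    """Average cosine similarity over matching (original, reconstructed) pairs."""
    if len(originals) != len(reconstructed) or not originals:
        raise ValueError("need two non-empty sequences of equal length")
    scores = [cosine_similarity(o, r) for o, r in zip(originals, reconstructed)]
    return sum(scores) / len(scores)


# Toy stand-ins: the second "reconstruction" is slightly perturbed.
originals = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
reconstructed = [[1.0, 2.0, 3.0], [0.4, -1.1, 2.1]]
print(f"mean cosine similarity: {mean_cosine(originals, reconstructed):.4f}")
```

With real tensors, the same comparison would be applied to each head vector before and after the compress/decompress round trip.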

Compression Stats

# `compressed` is the CompressedDynamicCache created in the Usage section
stats = compressed.compression_stats()
# {
#   'num_layers': 36,
#   'seq_len': 1024,
#   'num_heads': 8,
#   'head_dim': 128,
#   'bits': 4,
#   'nibble_packed': True,
#   'compression_ratio': 3.76,
#   'savings_mib': 150.2,
# }

Benchmark CLI

Run A/B comparisons on Molmo2 models:

uv run python -m turboquant_vllm.benchmark \
    --model allenai/Molmo2-4B \
    --bits 4 --compressed \
    --video /path/to/clip.mp4 \
    --max-new-tokens 256
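To repeat the run at several bit widths, the command can be assembled in a small script. This is a sketch that uses only the flags shown above; the helper name is hypothetical, and it assumes `--bits` accepts the 3-bit setting listed in the bit width table.

```python
import shlex


def benchmark_command(model: str, video: str, bits: int, max_new_tokens: int = 256) -> list[str]:
    """Build the benchmark invocation as an argv list (flags copied from the command above)."""
    return [
        "uv", "run", "python", "-m", "turboquant_vllm.benchmark",
        "--model", model,
        "--bits", str(bits),
        "--compressed",
        "--video", video,
        "--max-new-tokens", str(max_new_tokens),
    ]


# Assumption: 3 and 4 are both valid values for --bits.
for bits in (3, 4):
    cmd = benchmark_command("allenai/Molmo2-4B", "/path/to/clip.mp4", bits)
    # Print the shell-quoted command; to execute it instead, use
    # subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems when the video path contains spaces.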