# vLLM Plugin
turboquant-vllm registers as a custom attention backend via vLLM's plugin system. No code changes are needed — just install and pass a CLI flag.
## Install
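Assuming the package is published under the same name as the project (verify against your package index), installation is a standard pip command:

```shell
# Install the plugin alongside an existing vLLM installation.
pip install turboquant-vllm
```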
## Serve
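A minimal launch command, using the `--attention-backend` flag described below; `<your-model>` is a placeholder for any model vLLM can serve:

```shell
# Start an OpenAI-compatible server with the TQ4 attention backend.
vllm serve <your-model> \
    --attention-backend CUSTOM \
    --enforce-eager
```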
The `tq4_backend` entry point registers automatically on import; vLLM reports the selected attention backend in its startup log.
## How It Works
The TQ4 attention backend replaces vLLM's default KV cache page format:
| | FP16 (default) | TQ4 (turboquant-vllm) |
|---|---|---|
| Bytes per token per KV head | 256 | 68 |
| Compression ratio | 1.0x | 3.76x |
| Storage format | float16 | uint8 nibble-packed + fp32 norms |
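The byte counts in the table are consistent with a 128-dimensional KV head, which is an assumption on our part (head size is model-dependent): 128 float16 values occupy 256 bytes, while TQ4 stores two 4-bit codes per byte plus one fp32 norm per row:

```python
HEAD_DIM = 128  # assumed head dimension; the table's byte counts imply it

fp16_bytes = HEAD_DIM * 2      # 2 bytes per float16 value
tq4_bytes = HEAD_DIM // 2 + 4  # nibble-packed codes + one 4-byte fp32 norm

print(fp16_bytes, tq4_bytes, round(fp16_bytes / tq4_bytes, 2))  # 256 68 3.76
```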
On each attention step, the backend:
- Decompresses TQ4 cache blocks back to float16
- Delegates the actual attention computation to FlashAttention
- Decompresses only newly appended tokens incrementally, never the full cache
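As a rough sketch of the codec these steps assume (absmax 4-bit quantization with a per-row fp32 norm; the library's real block layout and kernels will differ), here is a pure-Python pack/unpack for a single 128-dim cache row:

```python
import struct

def tq4_pack(row):
    """Quantize one KV-cache row to 4 bits per value plus an fp32 norm.
    Hypothetical sketch of the TQ4 layout, not the library's actual codec."""
    norm = max(abs(v) for v in row) or 1.0
    # Map [-norm, norm] onto the 16 levels 0..15.
    codes = [min(15, max(0, round(v / norm * 7.5 + 7.5))) for v in row]
    packed = bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))
    return packed + struct.pack("<f", norm)  # 64 + 4 = 68 bytes for 128 dims

def tq4_unpack(blob):
    """Invert tq4_pack; the real backend emits float16 for FlashAttention."""
    packed, (norm,) = blob[:-4], struct.unpack("<f", blob[-4:])
    codes = [c for b in packed for c in (b >> 4, b & 0x0F)]
    return [(c - 7.5) / 7.5 * norm for c in codes]
```

Round-tripping a row through `tq4_pack`/`tq4_unpack` bounds the per-value error by half a quantization step (norm/15); the backend pays this precision loss in exchange for the 3.76x compression shown in the table above.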
## Configuration
The plugin uses sensible defaults. No additional configuration is needed beyond `--attention-backend CUSTOM`.
| vLLM Flag | Recommended | Notes |
|---|---|---|
| `--attention-backend CUSTOM` | Required | Enables TQ4 |
| `--enforce-eager` | Recommended | CUDA graphs not yet validated with TQ4 |
| `--max-model-len` | Model-specific | Unchanged from standard vLLM |
| `--gpu-memory-utilization` | 0.85-0.90 | Unchanged from standard vLLM |
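Putting the recommended flags together into one launch (the model ID and the exact memory fraction are illustrative):

```shell
vllm serve <your-model> \
    --attention-backend CUSTOM \
    --enforce-eager \
    --gpu-memory-utilization 0.90
```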
## Manual Registration
If you need to register the backend programmatically (e.g., in a custom launcher):
## Supported Models
Any model supported by vLLM should work with the TQ4 backend. Validated on:
- Molmo2-4B — 11K visual tokens, video inference
- Molmo2-8B — 6K context, video + text inference