vLLM Plugin

turboquant-vllm registers as a custom attention backend through vLLM's plugin system. No code changes are needed: install the package and pass one CLI flag.

Install

pip install "turboquant-vllm[vllm]"

Then start the server with the custom attention backend:

vllm serve allenai/Molmo2-8B --attention-backend CUSTOM

The tq4_backend entry point registers automatically on import. vLLM will log:

INFO [cuda.py:257] Using AttentionBackendEnum.CUSTOM backend.

The TQ4 attention backend replaces vLLM's default KV cache page format:

On each attention step, the backend:

The plugin needs no configuration beyond --attention-backend CUSTOM. The other flags listed below are standard vLLM options.

vLLM Flag	Recommended	Notes
`--attention-backend CUSTOM`	Required	Enables TQ4
`--enforce-eager`	Recommended	CUDA graphs not yet validated with TQ4
`--max-model-len`	Model-specific	Unchanged from standard vLLM
`--gpu-memory-utilization`	0.85-0.90	Unchanged from standard vLLM
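The flags in the table can be assembled programmatically, for example in a deployment script. This is a minimal sketch, not part of turboquant-vllm: `build_serve_command` is a hypothetical helper, and the `max_model_len=4096` value in the usage line is purely illustrative.

```python
import shlex
from typing import List, Optional


def build_serve_command(
    model: str,
    max_model_len: Optional[int] = None,
    gpu_memory_utilization: float = 0.90,
    enforce_eager: bool = True,
) -> List[str]:
    """Assemble a `vllm serve` argument list from the flags in the table."""
    # --attention-backend CUSTOM is the one required flag; it enables TQ4.
    cmd = ["vllm", "serve", model, "--attention-backend", "CUSTOM"]
    if enforce_eager:
        # Recommended: CUDA graphs are not yet validated with TQ4.
        cmd.append("--enforce-eager")
    if max_model_len is not None:
        # Model-specific; omitted unless the caller sets it.
        cmd += ["--max-model-len", str(max_model_len)]
    cmd += ["--gpu-memory-utilization", str(gpu_memory_utilization)]
    return cmd


if __name__ == "__main__":
    args = build_serve_command("allenai/Molmo2-8B", max_model_len=4096)
    print(shlex.join(args))
```

Returning a list rather than a single string lets the result be passed directly to `subprocess.run` without shell quoting concerns.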

If you need to register the backend programmatically (e.g., in a custom launcher):

from turboquant_vllm.vllm import register_tq4_backend

# Explicit registration, instead of relying on the automatic entry point.
register_tq4_backend()

Any model supported by vLLM should work with the TQ4 backend. Validated on: