Skip to content

Container Deployment

turboquant-vllm ships a Containerfile that bakes the plugin into the official vLLM image. Build once, deploy anywhere — no runtime pip installs.

Build the Image

podman build -t vllm-turboquant -f infra/Containerfile.vllm .

The build installs turboquant-vllm from PyPI and verifies the plugin entry point registers correctly.

Run with TQ4 Compression

podman run --rm \
  --device nvidia.com/gpu=all \
  --shm-size=8g \
  -v vllm-models:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm-turboquant \
  --model allenai/Molmo2-8B \
  --attention-backend CUSTOM \
  --dtype auto \
  --max-model-len 6144 \
  --max-num-batched-tokens 6144 \
  --enforce-eager \
  --gpu-memory-utilization 0.90

No code changes required

The --attention-backend CUSTOM flag is the only difference from a standard vLLM deployment. The plugin registers automatically via vllm.general_plugins.

Quadlet (systemd Integration)

For persistent deployments on Podman, use a Quadlet container file at ~/.config/containers/systemd/vllm-turboquant.container:

[Container]
Image=localhost/vllm-turboquant:latest
ContainerName=vllm-tq
SecurityLabelDisable=true
ShmSize=8g

AddDevice=nvidia.com/gpu=all

Exec=allenai/Molmo2-8B \
    --attention-backend CUSTOM \
    --dtype auto \
    --max-model-len 6144 \
    --max-num-batched-tokens 6144 \
    --enforce-eager \
    --gpu-memory-utilization 0.90 \
    --trust-remote-code

Volume=vllm-models.volume:/root/.cache/huggingface
PublishPort=8000:8000

HealthCmd=bash -c 'echo > /dev/tcp/localhost/8000'
HealthInterval=30s
HealthTimeout=10s
HealthRetries=5
HealthStartPeriod=300s

[Service]
Restart=always
TimeoutStartSec=900

[Install]
WantedBy=default.target

Then reload and start:

systemctl --user daemon-reload
systemctl --user start vllm-turboquant

What Gets Compressed

Data Compressed Format
Key cache vectors Yes uint8 nibble-packed indices + fp32 norms
Value cache vectors Yes uint8 nibble-packed indices + fp32 norms
Rotation matrices No Generated once per layer from fixed seed
Lloyd-Max codebook No Computed once, shared across all layers

Memory Considerations

The TQ4 backend compresses KV cache pages to 68 bytes/token/head vs 256 bytes for FP16 (3.76x compression). This is most impactful at long context lengths where KV cache dominates memory.

GPU memory for model weights is unchanged

TurboQuant only compresses the KV cache, not model weights. Peak VRAM during prefill is activation-dominated — compression savings are most visible in the permanent KV cache storage during generation.

GPU Model Max Context Notes
RTX 4090 (24 GB) Molmo2-8B 6144 --gpu-memory-utilization 0.90
RTX 4090 (24 GB) Molmo2-4B 11264 Validated in experiments

Verifying the Plugin

Confirm the TQ4 backend is active in the container logs:

INFO [cuda.py:257] Using AttentionBackendEnum.CUSTOM backend.

Or check from inside the container:

podman exec <container> python3 -c "
from turboquant_vllm.vllm import TQ4AttentionBackend
import importlib.metadata
v = importlib.metadata.version('turboquant-vllm')
print(f'turboquant-vllm {v} — plugin loaded')
"