turboquant_vllm.verify
Verify TurboQuant compression quality on a specific model and environment.
Runs a 128-token random Gaussian prefill through CompressedDynamicCache and reports per-layer cosine similarity against an uncompressed DynamicCache. Outputs PASS or FAIL against a configurable threshold (default 0.99, the compression quality tier).
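The check described above can be illustrated with a small, dependency-free sketch. It is not the module's implementation: `fake_quantize` is a hypothetical uniform round-to-nearest stand-in for the real compressor, and the layer shapes are made up. It only shows how per-layer cosine similarity is turned into a PASS/FAIL verdict against a threshold.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def fake_quantize(vec, bits):
    """Hypothetical stand-in compressor: uniform rounding over the value range.

    This is NOT TurboQuant; it only produces a lossy copy to compare against.
    """
    lo, hi = min(vec), max(vec)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels
    return [lo + round((x - lo) / step) * step for x in vec]


def verify_layers(layers, bits=4, threshold=0.99):
    """Return (per-layer similarities, "PASS" or "FAIL")."""
    sims = [cosine_similarity(v, fake_quantize(v, bits)) for v in layers]
    status = "PASS" if min(sims) >= threshold else "FAIL"
    return sims, status


rng = random.Random(0)
# Four hypothetical layers, each a flattened 128-token x 8-dim Gaussian tensor.
layers = [[rng.gauss(0.0, 1.0) for _ in range(128 * 8)] for _ in range(4)]
sims, status = verify_layers(layers, bits=8)
print(status, [round(s, 5) for s in sims])
```

The verdict here uses the minimum similarity across layers, which is one reasonable aggregation; the source does not state how the real command combines per-layer results.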
Gated HuggingFace models (e.g. Llama-3.2) are supported via the HF_TOKEN
environment variable, which is passed to all from_pretrained calls.
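As a rough sketch of how a token from the environment might be forwarded, the helper below builds the extra keyword arguments for a `from_pretrained` call. The helper name `pretrained_kwargs` is hypothetical, and treating an empty `HF_TOKEN` as unset is an assumption; `token` is the keyword that current `transformers` releases accept for authentication.

```python
import os


def pretrained_kwargs(environ=os.environ):
    """Build extra keyword arguments for from_pretrained calls.

    Returns {"token": <value>} when HF_TOKEN is set and non-empty, else {}.
    """
    token = environ.get("HF_TOKEN")
    return {"token": token} if token else {}


# Hypothetical usage with transformers (not executed here):
#   AutoConfig.from_pretrained(model_id, **pretrained_kwargs())
#   AutoModelForCausalLM.from_pretrained(model_id, **pretrained_kwargs())
```

Passing the mapping as a parameter keeps the helper easy to test without touching the real environment.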
Validated model families (Molmo2, Mistral, Llama, Qwen2.5, Phi, Gemma 2, Gemma 3,
Gemma 4) report "validation": "VALIDATED" in the output; unvalidated models
report "UNVALIDATED" as a warning.
Models with shared KV cache layers (e.g., Gemma 4 with num_kv_shared_layers)
are handled by iterating only over unique cache layers.
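One way to picture the unique-layer iteration is the sketch below. It assumes that the last `num_kv_shared_layers` layers reuse key/value tensors from earlier layers, so only the first `num_hidden_layers - num_kv_shared_layers` layers own a cache entry. That layout and the function name are assumptions for illustration, not something the source confirms.

```python
def unique_cache_layer_indices(num_hidden_layers, num_kv_shared_layers=0):
    """Indices of layers assumed to own their KV cache entry.

    Assumption: the trailing `num_kv_shared_layers` layers share KV state
    with earlier layers and therefore have no cache entry of their own.
    """
    if not 0 <= num_kv_shared_layers < num_hidden_layers:
        raise ValueError("num_kv_shared_layers must be in [0, num_hidden_layers)")
    return list(range(num_hidden_layers - num_kv_shared_layers))


# Hypothetical config values, for illustration only.
for layer_idx in unique_cache_layer_indices(num_hidden_layers=6, num_kv_shared_layers=2):
    print("compare layer", layer_idx)
```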
Usage
Examples:
from turboquant_vllm.verify import main
main(["--model", "allenai/Molmo2-4B", "--bits", "4", "--json"])
See Also
turboquant_vllm.benchmark: Full inference benchmark harness.
turboquant_vllm.CompressedDynamicCache: Compressed cache wrapper.
Functions
main
CLI entry point for the verify command.
Parses --model, --bits (or --k-bits and --v-bits together),
--threshold (default 0.99), and --json flags, runs verification,
and exits 0 (PASS) or 1 (FAIL). --k-bits and --v-bits must be
used together; --bits cannot be combined with per-component flags.
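The flag rules above can be sketched with `argparse` as follows. This is an illustrative reconstruction, not the code in verify.py: the function names, making `--model` required, and leaving the bit-width flags without defaults are all assumptions.

```python
import argparse


def parse_verify_args(argv=None):
    """Parse flags and enforce the --bits / --k-bits / --v-bits rules."""
    parser = argparse.ArgumentParser(prog="verify")
    parser.add_argument("--model", required=True)  # assumed to be required
    parser.add_argument("--bits", type=int)
    parser.add_argument("--k-bits", type=int)
    parser.add_argument("--v-bits", type=int)
    parser.add_argument("--threshold", type=float, default=0.99)
    parser.add_argument("--json", action="store_true")
    args = parser.parse_args(argv)  # falls back to sys.argv[1:] when argv is None

    per_component = (args.k_bits is not None, args.v_bits is not None)
    if any(per_component) and not all(per_component):
        parser.error("--k-bits and --v-bits must be used together")
    if args.bits is not None and any(per_component):
        parser.error("--bits cannot be combined with --k-bits/--v-bits")
    return args


def exit_code(passed):
    """Map the verification verdict to the documented exit status."""
    return 0 if passed else 1


args = parse_verify_args(["--model", "allenai/Molmo2-4B", "--bits", "4", "--json"])
print(args.bits, args.threshold, args.json)
```

Note that `parser.error` exits with status 2, argparse's convention for usage errors, which keeps invalid invocations distinct from a FAIL result (status 1).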
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| argv | list[str] \| None | Command-line arguments. Uses sys.argv[1:] if None. | None |
Source code in src/turboquant_vllm/verify.py