verify

turboquant_vllm.verify

Verify TurboQuant compression quality on a specific model and environment.

Runs a 128-token random Gaussian prefill through CompressedDynamicCache and reports the per-layer cosine similarity against an uncompressed DynamicCache. Outputs PASS or FAIL against a configurable threshold (default 0.99, the compression quality tier).
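The check described above can be pictured with a small standard-library sketch. It is not the module's implementation: the helper names `cosine_similarity` and `check_layers` are hypothetical, the toy vectors stand in for real cache tensors, and treating the minimum per-layer similarity as the PASS/FAIL criterion is an assumption.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def check_layers(reference, compressed, threshold=0.99):
    """Return per-layer similarities and an overall PASS/FAIL status.

    Assumption: the run passes only if every layer meets the threshold.
    """
    sims = [cosine_similarity(r, c) for r, c in zip(reference, compressed)]
    status = "PASS" if min(sims) >= threshold else "FAIL"
    return sims, status


# Toy data: four "layers" of Gaussian values, plus a lightly perturbed copy
# standing in for the output of a compressed cache.
rng = random.Random(0)
reference = [[rng.gauss(0.0, 1.0) for _ in range(256)] for _ in range(4)]
compressed = [[x + rng.gauss(0.0, 0.01) for x in layer] for layer in reference]

sims, status = check_layers(reference, compressed)
print(status, [round(s, 4) for s in sims])
```

Raising the threshold makes the check stricter. A layer whose reconstruction is orthogonal to the reference scores 0 and fails at any positive threshold.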

Gated HuggingFace models (e.g. Llama-3.2) are supported via the HF_TOKEN environment variable, which is passed to all from_pretrained calls.
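A minimal sketch of how a token from the environment might be forwarded is shown below. The helper name `pretrained_kwargs` is hypothetical. The `token` keyword is the one accepted by Hugging Face `from_pretrained` methods, but whether the module builds its arguments this way is an assumption.

```python
import os


def pretrained_kwargs():
    """Build extra keyword arguments for from_pretrained calls.

    Returns {"token": ...} when HF_TOKEN is set and non-empty, and an empty
    dict otherwise (anonymous access, which works for public models only).
    """
    token = os.environ.get("HF_TOKEN")
    return {"token": token} if token else {}


# Hypothetical usage with transformers installed:
#   model = AutoModelForCausalLM.from_pretrained(model_id, **pretrained_kwargs())
#   tokenizer = AutoTokenizer.from_pretrained(model_id, **pretrained_kwargs())
```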

Validated model families (Molmo2, Mistral, Llama, Qwen2.5, Phi, Gemma 2, Gemma 3, Gemma 4) report "validation": "VALIDATED" in the output; unvalidated models report "UNVALIDATED" as a warning.
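When the report is requested as JSON (the `--json` flag described under Usage), a caller can read both fields mechanically. The sketch below relies only on the `"status"` and `"validation"` keys mentioned on this page. The helper name `summarize_report` and the sample payload are hypothetical, and a real report likely contains more fields.

```python
import json


def summarize_report(stdout_text):
    """Extract status, validation label, and a warning flag from JSON output."""
    report = json.loads(stdout_text)
    status = report["status"]               # "PASS" or "FAIL"
    validation = report.get("validation")   # "VALIDATED" or "UNVALIDATED"
    needs_warning = validation == "UNVALIDATED"
    return status, validation, needs_warning


# Hypothetical, heavily trimmed payload for illustration only.
sample = '{"status": "PASS", "validation": "UNVALIDATED"}'
print(summarize_report(sample))
```

A PASS on an unvalidated model still deserves a closer look, which is why the sketch surfaces the warning flag separately from the status.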

Models with shared KV cache layers (e.g., Gemma 4 with num_kv_shared_layers) are handled by iterating only over unique cache layers.
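As a rough illustration, the number of unique cache layers can be derived from the model config. This sketch assumes the shared layers are the trailing decoder layers and reuse entries written by earlier layers, which this page does not state. The helper name `unique_cache_layer_count` and the layer counts are hypothetical.

```python
from types import SimpleNamespace


def unique_cache_layer_count(config):
    """Number of cache layers that hold their own keys and values.

    Assumption: layers counted by num_kv_shared_layers reuse an earlier
    layer's cache entry, so they add no unique layers to compare.
    """
    shared = getattr(config, "num_kv_shared_layers", 0) or 0
    return config.num_hidden_layers - shared


# Stand-in config objects with made-up sizes.
shared_cfg = SimpleNamespace(num_hidden_layers=30, num_kv_shared_layers=10)
plain_cfg = SimpleNamespace(num_hidden_layers=32)

for cfg in (shared_cfg, plain_cfg):
    print(unique_cache_layer_count(cfg))
```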

Usage
# Human-readable summary to stdout
python -m turboquant_vllm.verify --model allenai/Molmo2-4B --bits 4

# JSON to stdout, human summary to stderr (pipe-friendly)
python -m turboquant_vllm.verify --model mistralai/Mistral-7B-v0.1 --bits 4 --json

Examples:

from turboquant_vllm.verify import main

main(["--model", "allenai/Molmo2-4B", "--bits", "4", "--json"])
See Also

:mod:turboquant_vllm.benchmark: Full inference benchmark harness.
:class:turboquant_vllm.CompressedDynamicCache: Compressed cache wrapper.

Functions

main

main(argv: list[str] | None = None) -> None

CLI entry point for the verify command.

Parses --model, --bits (or --k-bits and --v-bits together), --threshold (default 0.99), and --json flags, runs verification, and exits 0 (PASS) or 1 (FAIL). --k-bits and --v-bits must be used together; --bits cannot be combined with per-component flags.

Parameters:

Name: argv
Type: list[str] | None
Description: Command-line arguments. Uses sys.argv[1:] if None.
Default: None
Source code in src/turboquant_vllm/verify.py
def main(argv: list[str] | None = None) -> None:
    """CLI entry point for the verify command.

    Parses ``--model``, ``--bits`` (or ``--k-bits`` and ``--v-bits`` together),
    ``--threshold`` (default 0.99), and ``--json`` flags, runs verification,
    and exits 0 (PASS) or 1 (FAIL).  ``--k-bits`` and ``--v-bits`` must be
    used together; ``--bits`` cannot be combined with per-component flags.

    Args:
        argv: Command-line arguments. Uses sys.argv[1:] if None.
    """
    parser = argparse.ArgumentParser(
        description="Verify TurboQuant compression quality on a model"
    )
    parser.add_argument(
        "--model",
        required=True,
        help="HuggingFace model ID (e.g., allenai/Molmo2-4B)",
    )
    parser.add_argument(
        "--bits",
        type=int,
        choices=[2, 3, 4, 5],
        default=None,
        help="Quantization bits per coordinate (shorthand for --k-bits and --v-bits)",
    )
    parser.add_argument(
        "--k-bits",
        type=int,
        choices=[2, 3, 4, 5],
        default=None,
        help="Key quantization bits (requires --v-bits; cannot be combined with --bits)",
    )
    parser.add_argument(
        "--v-bits",
        type=int,
        choices=[2, 3, 4, 5],
        default=None,
        help="Value quantization bits (requires --k-bits; cannot be combined with --bits)",
    )
    parser.add_argument(
        "--threshold",
        type=float,
        default=COMPRESSION_QUALITY_THRESHOLD,
        help=f"Minimum cosine similarity for PASS (default: {COMPRESSION_QUALITY_THRESHOLD})",
    )
    parser.add_argument(
        "--json",
        action="store_true",
        default=False,
        dest="json_output",
        help="Output JSON to stdout (human summary to stderr)",
    )

    args = parser.parse_args(argv)

    # Validate argument combinations
    if args.bits is not None and (args.k_bits is not None or args.v_bits is not None):
        parser.error("--bits cannot be used with --k-bits or --v-bits")
    if args.bits is None and args.k_bits is None and args.v_bits is None:
        parser.error("Specify --bits or --k-bits/--v-bits")
    if (args.k_bits is None) != (args.v_bits is None):
        parser.error("--k-bits and --v-bits must be used together")

    result = _run_verification(
        args.model,
        args.bits if args.bits is not None else args.k_bits,
        args.threshold,
        k_bits=args.k_bits,
        v_bits=args.v_bits,
    )
    human_summary = _format_human_summary(result)

    if args.json_output:
        # JSON to stdout, human to stderr
        print(json.dumps(result, indent=2), file=sys.stdout)
        print(human_summary, file=sys.stderr)
    else:
        # Human-readable to stdout only
        print(human_summary, file=sys.stdout)

    sys.exit(0 if result["status"] == "PASS" else 1)