turboquant_vllm.triton.molmo2_integration ¶
Fused TurboQuant attention integration for Molmo2 models.
Patches Molmo2 attention layers to compute Q @ K^T directly from nibble-packed 4-bit compressed keys using the fused Triton kernel. Keys are never materialized as full fp16 tensors during attention.
Values are stored uncompressed in fp16 (the softmax @ V path benefits less from compression and doesn't need a fused kernel).
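The nibble-packing scheme the fused kernel relies on (two 4-bit code indices per `uint8`) can be sketched as follows. This is a minimal NumPy illustration only; the helper names are hypothetical and the real kernel packs per-head indices on the GPU:

```python
# Hedged sketch: two 4-bit quantizer indices share one uint8 byte
# ("nibble packing"). Function names are illustrative, not the
# library's actual API.
import numpy as np

def pack_nibbles(indices: np.ndarray) -> np.ndarray:
    """Pack an even-length array of 4-bit indices (0..15) into uint8."""
    assert indices.size % 2 == 0
    lo = indices[0::2].astype(np.uint8)
    hi = indices[1::2].astype(np.uint8)
    return lo | (hi << 4)

def unpack_nibbles(packed: np.ndarray) -> np.ndarray:
    """Inverse: recover the 4-bit indices from packed uint8 storage."""
    lo = packed & 0x0F
    hi = packed >> 4
    return np.stack([lo, hi], axis=-1).reshape(-1)

codes = np.array([3, 15, 0, 9], dtype=np.uint8)
packed = pack_nibbles(codes)       # 2 bytes of storage instead of 4
restored = unpack_nibbles(packed)  # round-trips to the original codes
```

Packing halves key-cache memory at 4 bits; the fused kernel reads the packed bytes directly, so the full-precision keys are never rebuilt.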
Attributes:

| Name | Type | Description |
|---|---|---|
| `FusedTurboQuantRunner` | | High-level runner that patches a Molmo2 model, generates text, and cleans up. |
| `install_fused_attention` | `CompressedKVStore` | Low-level function to patch attention layers. |
Examples:

```python
runner = FusedTurboQuantRunner(model, processor, bits=4)
text, stats = runner.generate(
    prompt="Describe this scene.",
    video_path="/path/to/video.mp4",
    max_new_tokens=256,
)
```
See Also
:mod:turboquant_vllm.triton.fused_qk_attention: The Triton kernel.
:mod:turboquant_vllm.kv_cache: Unfused CompressedDynamicCache.
Classes¶
CompressedKVStore ¶
```python
CompressedKVStore(quantizer: TurboQuantMSE)
```
Bases: DynamicCache
KV store with compressed keys and standard values.
Keys are compressed into nibble-packed uint8 indices + fp32 norms
in side storage for the fused Triton kernel. Values and all
DynamicLayer bookkeeping are managed by the base DynamicCache
via the overridden update() method.
This cache is passed as past_key_values to model.generate().
Attributes:

| Name | Type | Description |
|---|---|---|
| `quantizer` | `TurboQuantMSE` | The TQ4 quantizer instance. |
| `rotation_T` | `Tensor` | Transposed rotation matrix for query pre-rotation. |
| `centroids` | `Tensor` | Lloyd-Max centroid values. |
Initialize the compressed KV store.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `quantizer` | `TurboQuantMSE` | TurboQuantMSE instance for key compression. | *required* |
Source code in src/turboquant_vllm/triton/molmo2_integration.py
Functions¶
update ¶
```python
update(
    key_states: Tensor,
    value_states: Tensor,
    layer_idx: int,
    cache_kwargs: dict[str, Any] | None = None,
) -> tuple[Tensor, Tensor]
```
Compress keys on write, store values normally via DynamicCache.
Overrides DynamicCache.update() to intercept key storage.
Keys are nibble-packed into compressed side storage for the
fused Triton kernel. Values and all DynamicLayer bookkeeping
(seq_length, layer creation, offloading) are handled by the
base class, keeping model.generate() happy.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `key_states` | `Tensor` | Key tensor. | *required* |
| `value_states` | `Tensor` | Value tensor, same shape as `key_states`. | *required* |
| `layer_idx` | `int` | Transformer layer index. | *required* |
| `cache_kwargs` | `dict[str, Any] \| None` | Additional cache arguments (passed to base). | `None` |
Returns:

| Type | Description |
|---|---|
| `tuple[Tensor, Tensor]` | Tuple of `(full_keys, full_values)` from the base class. The returned keys are uncompressed (for compatibility); the fused kernel reads from compressed side storage instead. |
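The interception pattern described above can be sketched with a stand-in base class. All names here are illustrative: transformers' `DynamicCache` and the real TurboQuant compressor differ in detail, and plain lists stand in for tensors:

```python
# Hedged sketch of overriding update() to compress keys into side
# storage while delegating value storage and bookkeeping to the base.
class BaseCache:
    """Stand-in for the base DynamicCache."""

    def __init__(self):
        self.keys, self.values = {}, {}

    def update(self, key_states, value_states, layer_idx):
        # Base class handles bookkeeping and returns the full tensors.
        self.keys.setdefault(layer_idx, []).append(key_states)
        self.values.setdefault(layer_idx, []).append(value_states)
        return sum(self.keys[layer_idx], []), sum(self.values[layer_idx], [])

class CompressedKVStoreSketch(BaseCache):
    def __init__(self, compress):
        super().__init__()
        self.compress = compress       # e.g. quantize + nibble-pack
        self.side_storage = {}         # layer_idx -> compressed key codes

    def update(self, key_states, value_states, layer_idx):
        # Compress keys into side storage for the fused kernel ...
        self.side_storage.setdefault(layer_idx, []).append(
            self.compress(key_states)
        )
        # ... then let the base class manage values and bookkeeping,
        # so generation loops that inspect the cache keep working.
        return super().update(key_states, value_states, layer_idx)

    def get_compressed_key(self, layer_idx):
        return self.side_storage[layer_idx]

# Toy "compressor": halve each value to mimic lossy quantization.
store = CompressedKVStoreSketch(compress=lambda k: [v // 2 for v in k])
full_keys, full_values = store.update([4, 6], [1, 2], layer_idx=0)
```

The key point is that `update()` writes to both places: the base class keeps the uncompressed tensors it expects, while the fused kernel consumes only the compressed side storage.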
Source code in src/turboquant_vllm/triton/molmo2_integration.py
get_compressed_key ¶

```python
get_compressed_key(layer_idx: int) -> tuple[Tensor, Tensor]
```

Return compressed key data for a layer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `layer_idx` | `int` | Transformer layer index. | *required* |
Returns:

| Type | Description |
|---|---|
| `tuple[Tensor, Tensor]` | Tuple of `(packed_indices, norms)`. |
Source code in src/turboquant_vllm/triton/molmo2_integration.py
FusedTurboQuantRunner ¶
High-level runner for fused TurboQuant inference on Molmo2.
Patches the model, runs inference, and cleans up. Handles both text-only and video inputs.
Attributes:

| Name | Type | Description |
|---|---|---|
| `model` | `Module` | The Molmo2 model. |
| `processor` | `Any` | The Molmo2 processor. |
| `bits` | `int` | Quantization bit width. |
Examples:

```python
runner = FusedTurboQuantRunner(model, processor, bits=4)
text, stats = runner.generate("Describe this.", max_new_tokens=256)
print(text)
```
Initialize the runner.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `Module` | A loaded Molmo2 model. | *required* |
| `processor` | `Any` | The corresponding Molmo2 processor. | *required* |
| `bits` | `int` | Quantization bits (default 4 for nibble packing). | `4` |
| `seed` | `int` | Random seed for reproducibility. | `42` |
Source code in src/turboquant_vllm/triton/molmo2_integration.py
Functions¶
generate ¶
```python
generate(
    prompt: str, video_path: str | None = None, max_new_tokens: int = 256
) -> tuple[str, dict]
```
Generate text with fused TurboQuant attention.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `prompt` | `str` | Text prompt. | *required* |
| `video_path` | `str \| None` | Optional path to a video file. | `None` |
| `max_new_tokens` | `int` | Maximum tokens to generate. | `256` |
Returns:

| Type | Description |
|---|---|
| `tuple[str, dict]` | Tuple of `(generated_text, stats_dict)`. |
Source code in src/turboquant_vllm/triton/molmo2_integration.py
Functions¶
install_fused_attention ¶
```python
install_fused_attention(
    model: Module, bits: int = 4, *, seed: int = 42
) -> CompressedKVStore
```
Patch all Molmo2 text attention layers to use fused TurboQuant.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `Module` | A loaded Molmo2 model. | *required* |
| `bits` | `int` | Quantization bits per coordinate (default 4 for nibble packing). | `4` |
| `seed` | `int` | Random seed for reproducibility. | `42` |
Returns:

| Type | Description |
|---|---|
| `CompressedKVStore` | A `CompressedKVStore` to pass as `past_key_values`. |
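The install/uninstall pair follows a standard save-and-restore monkey-patching pattern. The sketch below uses toy stand-ins; the attribute name `_orig_forward`, the layer discovery, and the `"fused"`/`"eager"` return values are assumptions for illustration, not the module's actual internals:

```python
# Hedged sketch: swap each attention layer's forward for a fused
# replacement, keeping the original so it can be restored exactly.
class AttentionLayer:
    def forward(self, q, k):
        return "eager"               # stands in for the eager attention path

def install(layers):
    for layer in layers:
        # Save the original bound method on the instance ...
        layer._orig_forward = layer.forward
        # ... then shadow it with the fused path (here a stub).
        layer.forward = lambda q, k: "fused"

def uninstall(layers):
    for layer in layers:
        # Restore the saved method and drop the bookkeeping attribute.
        layer.forward = layer._orig_forward
        del layer._orig_forward

layers = [AttentionLayer(), AttentionLayer()]
install(layers)
patched = layers[0].forward(None, None)   # "fused"
uninstall(layers)
restored = layers[0].forward(None, None)  # "eager"
```

Because the original forwards are saved per layer, `uninstall_fused_attention` can undo the patch without reloading the model.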
Source code in src/turboquant_vllm/triton/molmo2_integration.py
uninstall_fused_attention ¶

```python
uninstall_fused_attention(model: Module)
```

Restore original attention forwards.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `Module` | The patched Molmo2 model. | *required* |