# attention_interface

`turboquant_vllm.triton.attention_interface` ¶
HuggingFace AttentionInterface registration for Triton Flash Attention.
Registers attention backends that resolve at forward time via `ALL_ATTENTION_FUNCTIONS[config._attn_implementation]`.
Two backends:

- `triton_fa`: Phase 1 vanilla kernel (standard fp16 K/V).
- `triton_fa_tq4_kv`: Phase 3 fused TQ4 kernel (compressed K+V read from `CompressedDynamicCache` via side-channel cache reference).
Attributes:

| Name | Type | Description |
|---|---|---|
| `triton_fa_forward` | `tuple[Tensor, None]` | Phase 1 vanilla attention function. |
| `triton_fa_tq4_kv_forward` | `tuple[Tensor, None]` | Phase 3 fused TQ4 K+V attention function. |
| `register_triton_fa` | `None` | Register vanilla backend. |
| `install_triton_fa` | `None` | Activate vanilla backend on a model. |
| `install_fused_tq4_kv` | `None` | Activate fused TQ4 K+V backend with cache stash. |
Examples:

```python
from turboquant_vllm.triton.attention_interface import install_fused_tq4_kv

install_fused_tq4_kv(model, compressed_cache)
output = model.generate(inputs)
```
See Also:

- `turboquant_vllm.triton.flash_attention`: Phase 1 kernel.
- `turboquant_vllm.triton.flash_attention_tq4_kv`: Phase 3 kernel.
- `turboquant_vllm.kv_cache`: `CompressedDynamicCache` storage.
## Functions ¶
### triton_fa_forward ¶

```python
triton_fa_forward(
    module: Module,
    query: Tensor,
    key: Tensor,
    value: Tensor,
    attention_mask: Optional[Tensor],
    dropout: float = 0.0,
    scaling: Optional[float] = None,
    **kwargs: object,
) -> tuple[Tensor, None]
```
HF-compatible attention forward using Triton Flash Attention.

Signature matches `transformers.integrations.sdpa_attention.sdpa_attention_forward`. Handles GQA natively (no KV repeat expansion needed).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `module` | `Module` | The attention layer module. | *required* |
| `query` | `Tensor` | Query tensor. | *required* |
| `key` | `Tensor` | Key tensor. | *required* |
| `value` | `Tensor` | Value tensor. | *required* |
| `attention_mask` | `Optional[Tensor]` | Optional additive mask. | *required* |
| `dropout` | `float` | Dropout rate (must be 0 -- inference only). | `0.0` |
| `scaling` | `Optional[float]` | Softmax scale. Defaults to `head_dim ** -0.5` when `None`. | `None` |
Other Parameters:

| Name | Type | Description |
|---|---|---|
| `is_causal` | `bool \| None` | Override causal mode. |
| `**kwargs` | `object` | Additional model-specific arguments (ignored). |
Returns:

| Type | Description |
|---|---|
| `Tensor` | Attention output (transposed to match HF convention). |
| `None` | Placeholder for attention weights (not computed). |
Source code in src/turboquant_vllm/triton/attention_interface.py
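The return convention above (an output tensor plus `None` in place of attention weights, output transposed to HF layout) and the native GQA handling can be illustrated with a toy NumPy stand-in. This is not the Triton kernel -- shapes and the function name are illustrative only:

```python
import numpy as np

def toy_attention_forward(query, key, value, scaling=None):
    """Toy stand-in for an HF-style attention forward.

    query: [batch, num_heads, seq, head_dim]
    key/value: [batch, num_kv_heads, seq, head_dim] (GQA: num_kv_heads divides num_heads)
    Returns (output, None) -- None stands in for attention weights, per HF convention.
    """
    b, h, s, d = query.shape
    kv_h = key.shape[1]
    group = h // kv_h  # GQA: index into KV groups instead of repeating K/V
    scale = scaling if scaling is not None else d ** -0.5
    out = np.empty_like(query)
    for i in range(h):
        kk = key[:, i // group]                               # [b, s, d]
        vv = value[:, i // group]
        scores = query[:, i] @ kk.transpose(0, 2, 1) * scale  # [b, s, s]
        mask = np.triu(np.ones((s, s), dtype=bool), k=1)      # causal mask
        scores = np.where(mask, -np.inf, scores)
        p = np.exp(scores - scores.max(-1, keepdims=True))    # stable softmax
        p /= p.sum(-1, keepdims=True)
        out[:, i] = p @ vv
    # HF convention: transpose output to [batch, seq, num_heads, head_dim]
    return out.transpose(0, 2, 1, 3), None

q = np.random.rand(1, 4, 3, 8).astype(np.float32)
kv = np.random.rand(1, 2, 3, 8).astype(np.float32)  # 2 KV heads serve 4 query heads
out, weights = toy_attention_forward(q, kv, kv)
print(out.shape, weights)  # (1, 3, 4, 8) None
```

Note the loop reads each query head's KV group directly rather than materializing repeated K/V tensors -- the "no KV repeat expansion" point made above.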
### register_triton_fa ¶

Register `triton_fa` as a global attention backend in HuggingFace.

Safe to call multiple times -- overwrites the previous registration.
Source code in src/turboquant_vllm/triton/attention_interface.py
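The "safe to call multiple times" behavior follows from the registry being a plain mapping: re-registration is just an overwrite. A minimal sketch with a dict standing in for HF's `ALL_ATTENTION_FUNCTIONS` (the stand-in registry and `forward_*` functions are illustrative):

```python
# Stand-in for transformers' ALL_ATTENTION_FUNCTIONS global registry.
ATTENTION_REGISTRY = {}

def register_backend(name, fn):
    """Idempotent registration: re-registering simply overwrites the entry."""
    ATTENTION_REGISTRY[name] = fn

def forward_v1(*args): return "v1"
def forward_v2(*args): return "v2"

register_backend("triton_fa", forward_v1)
register_backend("triton_fa", forward_v2)  # safe: overwrites, no error
print(ATTENTION_REGISTRY["triton_fa"]())   # v2
```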
### install_triton_fa ¶

Register the backend and activate it on `model`.

Changes `model.config._attn_implementation` to `"triton_fa"`. The model resolves the attention function at forward time, so this takes effect on the next forward call.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `Module` | A HuggingFace model with a `config` attribute. | *required* |
Raises:

| Type | Description |
|---|---|
| `AttributeError` | If `model` has no `config` attribute. |
Source code in src/turboquant_vllm/triton/attention_interface.py
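Why installation takes effect only on the *next* forward call: the attention function is looked up from the registry at call time, keyed by `config._attn_implementation`, so "install" is just a config write. A toy model illustrating that lazy resolution (class and registry names are illustrative):

```python
from types import SimpleNamespace

# Stand-in registry mapping implementation names to attention functions.
REGISTRY = {
    "sdpa": lambda: "sdpa output",
    "triton_fa": lambda: "triton output",
}

class ToyModel:
    def __init__(self):
        self.config = SimpleNamespace(_attn_implementation="sdpa")

    def forward(self):
        # Resolved at call time: an install before the next call takes effect.
        return REGISTRY[self.config._attn_implementation]()

def install_triton_fa(model):
    if not hasattr(model, "config"):
        raise AttributeError("model has no config attribute")
    model.config._attn_implementation = "triton_fa"

m = ToyModel()
print(m.forward())   # sdpa output
install_triton_fa(m)
print(m.forward())   # triton output
```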
### triton_fa_tq4_kv_forward ¶

```python
triton_fa_tq4_kv_forward(
    module: Module,
    query: Tensor,
    key: Tensor,
    value: Tensor,
    attention_mask: Optional[Tensor],
    dropout: float = 0.0,
    scaling: Optional[float] = None,
    **kwargs: object,
) -> tuple[Tensor, None]
```
Fused TQ4 K+V attention via cache side-channel.

Reads compressed K/V from the `CompressedDynamicCache` stashed on `module._tq4_cache` (ignoring the decompressed `key`/`value` args). Falls back to vanilla Triton FA if no cache reference is found.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `module` | `Module` | Attention layer with `_tq4_cache` stashed on it. | *required* |
| `query` | `Tensor` | Query tensor. | *required* |
| `key` | `Tensor` | Decompressed key tensor (ignored when the cache reference is present). | *required* |
| `value` | `Tensor` | Decompressed value tensor (ignored when the cache reference is present). | *required* |
| `attention_mask` | `Optional[Tensor]` | Optional additive mask. | *required* |
| `dropout` | `float` | Must be 0 (inference only). | `0.0` |
| `scaling` | `Optional[float]` | Softmax scale. | `None` |
Other Parameters:

| Name | Type | Description |
|---|---|---|
| `is_causal` | `bool \| None` | Override causal mode. |
| `**kwargs` | `object` | Additional model-specific arguments (ignored). |
Returns:

| Type | Description |
|---|---|
| `tuple[Tensor, None]` | Attention output and `None` in place of attention weights. |
Source code in src/turboquant_vllm/triton/attention_interface.py
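The side-channel pattern described above -- prefer a compressed cache stashed as a module attribute, otherwise fall back to the vanilla path using the decompressed arguments -- can be sketched as follows. Only `_tq4_cache` comes from the source; every other name here is illustrative:

```python
from types import SimpleNamespace

def fused_forward_sketch(module, key, value):
    """Dispatch on the side-channel cache reference stashed on the module."""
    cache = getattr(module, "_tq4_cache", None)
    if cache is None:
        # No side-channel reference: vanilla path uses the decompressed args.
        return ("vanilla", key, value)
    # Fused path: the decompressed key/value args are ignored entirely.
    return ("fused", cache.compressed_k, cache.compressed_v)

layer = SimpleNamespace()
print(fused_forward_sketch(layer, "k_fp16", "v_fp16")[0])  # vanilla

layer._tq4_cache = SimpleNamespace(compressed_k="k_tq4", compressed_v="v_tq4")
print(fused_forward_sketch(layer, "k_fp16", "v_fp16"))     # ('fused', 'k_tq4', 'v_tq4')
```

Using `getattr(..., None)` rather than attribute access makes the fallback automatic for layers that were never configured.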
### install_fused_tq4_kv ¶

```python
install_fused_tq4_kv(model: Module, cache: CompressedDynamicCache) -> None
```
Activate fused TQ4 K+V attention on `model` with cache side-channel.

Registers the `triton_fa_tq4_kv` backend, stashes `cache` on each attention layer as `module._tq4_cache`, sets the model's `_attn_implementation`, and enables `fused_mode` on the cache to skip wasted decompression (P5b optimization).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `Module` | HuggingFace model with attention layers. | *required* |
| `cache` | `CompressedDynamicCache` | `CompressedDynamicCache` instance that stores compressed K/V. | *required* |
Raises:

| Type | Description |
|---|---|
| `AttributeError` | If `model` has no `config` attribute. |
Source code in src/turboquant_vllm/triton/attention_interface.py
### uninstall_fused_tq4_kv ¶

Remove fused TQ4 attention and restore SDPA.

Removes `_tq4_cache` from attention layers, disables `fused_mode` on the cache, and resets `_attn_implementation` to `"sdpa"`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `Module` | Model previously configured with `install_fused_tq4_kv`. | *required* |
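Taken together, install and uninstall form a reversible round trip: stash the cache on each attention layer, flip the config, enable `fused_mode`; then undo all three. A stand-in sketch of that state management (layer discovery and object layout are illustrative, not the library's actual implementation):

```python
from types import SimpleNamespace

def install_fused_tq4_kv(model, cache):
    for layer in model.layers:
        layer._tq4_cache = cache            # side-channel reference
    model.config._attn_implementation = "triton_fa_tq4_kv"
    cache.fused_mode = True                 # skip wasted decompression

def uninstall_fused_tq4_kv(model):
    for layer in model.layers:
        cache = getattr(layer, "_tq4_cache", None)
        if cache is not None:
            cache.fused_mode = False        # re-enable decompression
            del layer._tq4_cache
    model.config._attn_implementation = "sdpa"

model = SimpleNamespace(layers=[SimpleNamespace(), SimpleNamespace()],
                        config=SimpleNamespace(_attn_implementation="sdpa"))
cache = SimpleNamespace(fused_mode=False)

install_fused_tq4_kv(model, cache)
print(model.config._attn_implementation, cache.fused_mode)  # triton_fa_tq4_kv True

uninstall_fused_tq4_kv(model)
print(model.config._attn_implementation,
      hasattr(model.layers[0], "_tq4_cache"))               # sdpa False
```

Deleting the attribute (rather than setting it to `None`) restores the exact pre-install state, so the forward-time `getattr` fallback behaves as if the backend was never installed.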