tq4_compress
turboquant_vllm.triton.tq4_compress ¶
Fused Triton kernel for TQ4 compression (norm + rotate + quantize + pack).
Phase 3c.9: Replaces the multi-op PyTorch compress path with a single fused kernel. The rotation matrix is pre-split into even/odd column halves so the kernel writes packed nibble output directly without a separate interleave step.
Experiment 015 (post-3c.8) showed compress accounts for 53% of decode time (~0.149ms for K+V at 1 token). The PyTorch path launches 6+ CUDA kernels (norm, divide, matmul, bucketize, clamp, pack). Fusing into one Triton launch eliminates kernel-launch overhead.
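For orientation, the multi-op PyTorch path being replaced can be sketched as below. This is a reference sketch, not the project's actual code: shapes follow the example further down (`x` is `(N, H, D)`, `norms` is `(N, H, 1)` fp32, `packed` is `(N, H, D//2)` uint8), and the unsplit transposed rotation matrix, the boundary count, and the low/high nibble assignment are assumptions.

```python
import torch

def tq4_compress_reference(x, rotation_T, boundaries):
    """Unfused sketch of the compress path (norm, divide, matmul,
    bucketize, clamp, pack) that the fused Triton kernel replaces."""
    # 1. Per-vector L2 norm, kept as an fp32 scale for decompression.
    norms = x.float().norm(dim=-1, keepdim=True)              # (N, H, 1)
    # 2. Normalize to the unit sphere.
    x_unit = x.float() / norms
    # 3. Apply the rotation (stored transposed, so a plain matmul works).
    rotated = x_unit @ rotation_T                             # (N, H, D)
    # 4. Bucketize into 4-bit codes and clamp to the 0..15 range.
    codes = torch.bucketize(rotated, boundaries).clamp_(0, 15).to(torch.uint8)
    # 5. Pack two codes per byte: even dims in the low nibble,
    #    odd dims in the high nibble (assumed layout).
    packed = codes[..., 0::2] | (codes[..., 1::2] << 4)       # (N, H, D//2)
    return packed, norms
```

Each numbered step above is a separate CUDA kernel launch in eager PyTorch, which is the overhead the fused kernel removes.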
Attributes:

| Name | Type | Description |
|---|---|---|
| `tq4_compress` | `tuple[Tensor, Tensor]` | Python wrapper that launches the fused kernel. |
Examples:

```python
from turboquant_vllm.triton.tq4_compress import tq4_compress

packed, norms = tq4_compress(
    x,
    rotation_T_even,
    rotation_T_odd,
    boundaries,
)
# packed: (N, H, D//2) uint8, norms: (N, H, 1) fp32
```
See Also:

- `turboquant_vllm.triton.tq4_decompress`: Phase 3c.8 fused decompress.
- `turboquant_vllm.vllm.tq4_backend`: vLLM backend that calls this kernel.
Functions¶
tq4_compress ¶
```python
tq4_compress(
    x: Tensor, rotation_T_even: Tensor, rotation_T_odd: Tensor, boundaries: Tensor
) -> tuple[Tensor, Tensor]
```
Compress vectors to TQ4 nibble-packed format.
Fused Triton path: norm + normalize + tiled rotation + bucketize + nibble-pack in a single kernel launch.
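The pre-split even/odd rotation halves exist so the kernel can emit packed bytes directly: the even-column matmul produces the code for each byte's low nibble and the odd-column matmul the high nibble, with no interleave pass. A minimal pure-Python sketch of that packing scheme (which nibble holds even vs odd is an assumption here):

```python
def pack_nibbles(even_codes, odd_codes):
    """Pack paired 4-bit codes into one byte each: even code in the
    low nibble, odd code in the high nibble. Illustrative only; the
    fused kernel does this in registers before a single store."""
    return [(e & 0xF) | ((o & 0xF) << 4) for e, o in zip(even_codes, odd_codes)]

def unpack_nibbles(packed):
    """Inverse: recover (even, odd) code lists from packed bytes."""
    even = [b & 0xF for b in packed]
    odd = [b >> 4 for b in packed]
    return even, odd
```

Because `pack_nibbles` consumes the even and odd codes as separate streams, pre-splitting the rotation matrix by column parity is what makes the direct packed write possible.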
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `Tensor` | Input vectors to compress. | required |
| `rotation_T_even` | `Tensor` | Even column half of the pre-split transposed rotation matrix. | required |
| `rotation_T_odd` | `Tensor` | Odd column half of the pre-split transposed rotation matrix. | required |
| `boundaries` | `Tensor` | Quantization bucket boundaries used by the bucketize step. | required |
Returns:

| Type | Description |
|---|---|
| `tuple[Tensor, Tensor]` | Tuple of `(packed, norms)`, where `packed` is `(N, H, D//2)` uint8 and `norms` is `(N, H, 1)` fp32. |