turboquant_vllm.vllm.tq4_backend ¶
TQ4 compressed KV cache attention backend for vLLM.
Phase 3c: Packed TQ4 cache layout with real VRAM savings.
The KV cache is stored as uint8 bytes in a packed TQ4 format (68 bytes
per token per head per K/V = 136 bytes total vs 512 bytes FP16 = 3.76x
compression). Buffer allocation uses a custom TQ4FullAttentionSpec
that overrides page_size_bytes so the block allocator provisions
3.76x more blocks in the same VRAM budget. Each forward() call
decompresses the relevant blocks to FP16 and delegates to Flash Attention.
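As a quick sanity check of those figures, a minimal sketch of the arithmetic (assuming head_size = 128, which is not stated above but is consistent with the 68-byte figure):

```python
# Sketch of the byte counts behind the 68-byte / 3.76x figures above,
# assuming head_size = 128 and a per-head payload of nibble-packed 4-bit
# indices plus one fp32 norm (the packed layout described later).
head_size = 128                      # assumed; typical Llama-family head size

tq4_bytes = head_size // 2 + 4       # 64 bytes of packed nibbles + 4-byte fp32 norm = 68
fp16_bytes = head_size * 2           # 256 bytes per token per head per K/V

per_token = 2 * tq4_bytes            # K + V = 136 bytes
baseline = 2 * fp16_bytes            # K + V = 512 bytes

print(per_token, baseline, baseline / per_token)  # 136 512 3.7647...
```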
Implementation phases
- 3a (done): Passthrough skeleton -- validated plugin wiring.
- 3b (done): Compress-decompress round-trip in the standard FP16 cache.
- 3c (this phase): Packed uint8 cache with real VRAM savings.
- 3d: Production benchmark against the vLLM baseline.
Classes¶
TQ4FullAttentionSpec dataclass ¶
Bases: FullAttentionSpec
KV cache spec with TQ4 packed page size.
Overrides real_page_size_bytes so the block allocator provisions
buffers sized for the packed TQ4 format. Supports asymmetric K/V
bit-widths via TQ4_K_BITS / TQ4_V_BITS env vars.
Follows the same pattern as MLAAttentionSpec which overrides
page size for the 656-byte FlashMLA format.
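A rough sketch of the page-size computation this spec performs; the helper name, the byte-rounding rule, and the 4-bit defaults are assumptions for illustration, not the actual implementation:

```python
import os

def _packed_bytes_per_token_kv(head_size: int, bits: int) -> int:
    # Hypothetical helper: indices packed at `bits` per dimension, rounded up
    # to whole bytes, plus a 4-byte fp32 norm per head per K/V.
    return (head_size * bits + 7) // 8 + 4

def tq4_page_size_bytes(block_size: int, num_kv_heads: int, head_size: int) -> int:
    # Independent K and V bit-widths come from the env vars named above;
    # the default of 4 bits is an assumption.
    k_bits = int(os.environ.get("TQ4_K_BITS", "4"))
    v_bits = int(os.environ.get("TQ4_V_BITS", "4"))
    per_token = num_kv_heads * (
        _packed_bytes_per_token_kv(head_size, k_bits)
        + _packed_bytes_per_token_kv(head_size, v_bits)
    )
    return block_size * per_token  # bytes one cache block needs in the packed format
```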
TQ4MetadataBuilder ¶
Bases: FlashAttentionMetadataBuilder
Metadata builder for TQ4 with conditional CUDA graph support.
CUDA graphs are supported for single-token decode only when the fused paged kernel is available; otherwise CG support is NEVER (the paged decompress path has dynamic allocations). Inherits all metadata-building logic from Flash Attention; only the CUDA graph support level differs.
Functions¶
get_cudagraph_support classmethod ¶
Report CUDA graph support: single-token decode when fused available.
When fused paged decode is available, decode goes through
_fused_decode_path (CG-safe). Otherwise, decode uses
_decompress_cache_paged which has 10+ non-CG-safe operations
(torch.unique, boolean indexing, dynamic allocations).
Source code in src/turboquant_vllm/vllm/tq4_backend.py
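A minimal sketch of that decision, using a local stand-in enum because the exact vLLM enum members and the fused-kernel availability flag are assumptions here:

```python
from enum import Enum, auto

class CGSupport(Enum):
    # Stand-in for vLLM's CUDA-graph support enum; the real member names
    # live in the vLLM attention backend utilities.
    NEVER = auto()
    UNIFORM_SINGLE_TOKEN_DECODE = auto()

_HAS_FUSED_PAGED_DECODE = False  # assumed flag: set when the fused Triton kernel loads

class _TQ4MetadataBuilderSketch:
    @classmethod
    def get_cudagraph_support(cls) -> CGSupport:
        if _HAS_FUSED_PAGED_DECODE:
            # The fused decode path has static shapes and allocations, so
            # capturing uniform single-token decode batches is safe.
            return CGSupport.UNIFORM_SINGLE_TOKEN_DECODE
        # The paged decompress fallback uses torch.unique, boolean indexing,
        # and dynamic allocations -- none of which are CUDA-graph safe.
        return CGSupport.NEVER
```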
TQ4AttentionBackend ¶
Bases: FlashAttentionBackend
TQ4 compressed KV cache attention backend.
Phase 3c: packed uint8 cache layout with real VRAM savings.
The cache stores nibble-packed TQ4 indices + fp32 norms as raw bytes.
get_kv_cache_shape() returns a 3D (NB, BS, bytes_per_token)
layout matching the packed format.
Functions¶
supports_mm_prefix classmethod ¶
get_name staticmethod ¶
get_impl_cls staticmethod ¶
get_builder_cls staticmethod ¶
get_kv_cache_shape staticmethod ¶
get_kv_cache_shape(
num_blocks: int,
block_size: int,
num_kv_heads: int,
head_size: int,
cache_dtype_str: str = "auto",
) -> tuple[int, ...]
Packed TQ4 cache: (num_blocks, block_size, padded_bytes).
The last dimension packs K and V data for all heads as raw bytes
with padding for hybrid model page alignment. Only the first
num_kv_heads * _tq4_bytes_per_token_kv(head_size) bytes per
token contain packed data; trailing bytes are unused padding.
Source code in src/turboquant_vllm/vllm/tq4_backend.py
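A minimal sketch of the shape computation described above, assuming 4-bit packing with fp32 norms and a hypothetical 16-byte padding rule:

```python
def get_kv_cache_shape_sketch(
    num_blocks: int,
    block_size: int,
    num_kv_heads: int,
    head_size: int,
) -> tuple[int, ...]:
    # Packed K + V payload per token per head: 4-bit nibbles plus a 4-byte
    # fp32 norm each (68 + 68 = 136 bytes at head_size = 128).
    packed = num_kv_heads * 2 * (head_size // 2 + 4)
    # Pad the per-token byte count so pages line up across hybrid-model
    # layers; the 16-byte alignment used here is an assumption.
    padded = (packed + 15) // 16 * 16
    return (num_blocks, block_size, padded)
```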
get_kv_cache_stride_order staticmethod ¶
Raise to trigger identity fallback in reshape.
The inherited FlashAttentionBackend returns a 5-element stride
order for the standard (2, NB, BS, H, D) shape. Our 3D
packed layout (NB, BS, total_bytes) needs identity ordering.
Raising NotImplementedError triggers the fallback in
_reshape_kv_cache_tensors (same pattern as FlashMLA which
does not implement this method at all).
Source code in src/turboquant_vllm/vllm/tq4_backend.py
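What the override amounts to, sketched (the exception message is illustrative):

```python
class _TQ4BackendSketch:
    @staticmethod
    def get_kv_cache_stride_order() -> tuple[int, ...]:
        # No permutation of the inherited 5-D order applies to the 3-D packed
        # layout, so raise and let _reshape_kv_cache_tensors fall back to
        # identity ordering.
        raise NotImplementedError("TQ4 packed cache uses identity stride order")
```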
TQ4AttentionImpl ¶
Bases: FlashAttentionImpl
TQ4 attention: compress -> store -> decompress -> Flash Attention.
Phase 3c: stores packed TQ4 bytes in a uint8 cache for real VRAM
savings. Each forward() call:
- Compresses incoming K/V tokens to TQ4 packed bytes.
- Scatter-writes the packed bytes to the uint8 cache via slot_mapping.
- Decompresses the full cache to FP16 for Flash Attention.
- Calls flash_attn_varlen_func directly with the FP16 data.
Initialize TQ4 attention with compression primitives.
Source code in src/turboquant_vllm/vllm/tq4_backend.py
Functions¶
forward ¶
forward(
layer,
query,
key,
value,
kv_cache,
attn_metadata,
output=None,
output_scale=None,
output_block_scale=None,
)
TQ4 attention: compress -> store -> pre-rotate Q -> decompress -> FA -> post-rotate.
Phase 3c.8: uses the fused Triton decompress kernel, which skips the per-token rotation. The rotation is instead applied to Q before attention and to the output after it, saving O(cache_len) matmuls per decode step.
Source code in src/turboquant_vllm/vllm/tq4_backend.py
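A small numerical check of why the pre/post-rotation is equivalent, assuming the packed cache stores K and V in an orthogonally rotated basis R (so decompressing without rotation yields K @ R and V @ R):

```python
import torch

d, n = 64, 16
q = torch.randn(1, d)
k, v = torch.randn(n, d), torch.randn(n, d)
r = torch.linalg.qr(torch.randn(d, d)).Q        # random orthogonal rotation

# Reference: rotate every cached K/V back before attention (O(cache_len) matmuls).
attn = torch.softmax(q @ k.T / d**0.5, dim=-1)
out_ref = attn @ v

# Trick: leave the cache rotated, rotate Q going in and the output coming out.
k_rot, v_rot = k @ r, v @ r
attn_rot = torch.softmax((q @ r) @ k_rot.T / d**0.5, dim=-1)   # R @ R.T == I
out = (attn_rot @ v_rot) @ r.T                                  # un-rotate the output

assert torch.allclose(out, out_ref, atol=1e-5)
```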
Functions¶
register_tq4_backend ¶
Register TQ4 as the CUSTOM attention backend.
In addition to registering the backend class, this monkey-patches
Attention.get_kv_cache_spec so that decoder attention layers
return TQ4FullAttentionSpec (with dtype=torch.uint8
and TQ4-sized pages) instead of the standard FullAttentionSpec.
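A rough sketch of what such a patch can look like; the import paths and the field-copy idiom are assumptions, not the actual code:

```python
import torch
from vllm.attention.layer import Attention                 # assumed import path
from vllm.v1.kv_cache_interface import FullAttentionSpec   # assumed import path

from turboquant_vllm.vllm.tq4_backend import TQ4FullAttentionSpec

_orig_get_kv_cache_spec = Attention.get_kv_cache_spec

def _patched_get_kv_cache_spec(self, *args, **kwargs):
    spec = _orig_get_kv_cache_spec(self, *args, **kwargs)
    if type(spec) is FullAttentionSpec:
        # Re-wrap plain full-attention specs as the TQ4 variant so the block
        # allocator sees uint8 pages sized for the packed format; copying the
        # dataclass fields like this is a sketch, not the actual constructor call.
        spec = TQ4FullAttentionSpec(**{**vars(spec), "dtype": torch.uint8})
    return spec

Attention.get_kv_cache_spec = _patched_get_kv_cache_spec
```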
Called automatically by the vllm.general_plugins entry point,
or manually before starting vLLM:
from turboquant_vllm.vllm import register_tq4_backend
register_tq4_backend()
# then start vLLM with --attention-backend CUSTOM