turboquant_vllm.quantizer ¶
TurboQuant two-stage vector quantizer.
Implements the core TurboQuant algorithm: random orthogonal rotation followed by optimal scalar quantization (Stage 1, MSE) and optional QJL residual correction (Stage 2, unbiased inner products).
Stage 1 (TurboQuantMSE): Rotate → quantize each coordinate independently → store indices. Minimizes mean squared error. Best for value cache reconstruction.
Stage 2 (TurboQuantProd): Allocate (bits-1) bits to Lloyd-Max and 1 bit to QJL sign correction. Produces unbiased inner product estimates. Best for the key cache, where attention scores depend on Q·K^T dot products.
Reference: Sections 3-4 of arXiv 2504.19874.
Examples:
MSE quantization for value cache reconstruction:
# values: float tensor of shape (..., 64)
quantizer = TurboQuantMSE(dim=64, bits=4)
indices, norms = quantizer.quantize(values)
reconstructed = quantizer.dequantize(indices, norms)
Unbiased inner products for key cache attention:
# keys, query: float tensors of shape (..., 64)
quantizer = TurboQuantProd(dim=64, bits=4)
indices, norms, signs, res_norms = quantizer.quantize(keys)
scores = quantizer.estimate_inner_product(query, indices, norms, signs, res_norms)
See Also
turboquant_vllm.lloyd_max: Lloyd-Max codebook solver.
Classes¶
TurboQuantMSE ¶
Stage 1 quantizer: rotation + Lloyd-Max scalar quantization.
Achieves near-optimal MSE distortion rate for high-dimensional vectors by exploiting the concentrated Beta distribution that emerges after random rotation.
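The concentration claim above can be checked numerically. The following is a minimal sketch, not the library's LloydMaxCodebook: it samples coordinates of random unit vectors (which is what a random rotation of a fixed unit vector produces) and runs plain Lloyd iterations on them. The helper names `sample_unit_coords` and `lloyd_max_1d` are hypothetical and exist only for this illustration.

```python
import math
import random

def sample_unit_coords(dim, n_vectors, rng):
    """Coordinates of random unit vectors, pooled into one flat list."""
    coords = []
    for _ in range(n_vectors):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        coords.extend(c / norm for c in v)
    return coords

def lloyd_max_1d(samples, bits, iters=25):
    """Plain Lloyd iterations: assign to nearest centroid, then re-average."""
    levels = 2 ** bits
    lo, hi = min(samples), max(samples)
    # Start from evenly spaced centroids across the sample range.
    centroids = [lo + (hi - lo) * (i + 0.5) / levels for i in range(levels)]
    for _ in range(iters):
        sums = [0.0] * levels
        counts = [0] * levels
        for s in samples:
            j = min(range(levels), key=lambda i: abs(s - centroids[i]))
            sums[j] += s
            counts[j] += 1
        # Keep the old centroid if a cell happens to be empty.
        centroids = [sums[i] / counts[i] if counts[i] else centroids[i]
                     for i in range(levels)]
    return sorted(centroids)

rng = random.Random(0)
dim = 64
samples = sample_unit_coords(dim, 200, rng)
second_moment = sum(s * s for s in samples) / len(samples)
codebook = lloyd_max_1d(samples, bits=2)
print(f"mean squared coordinate: {second_moment:.5f} (1/dim = {1 / dim:.5f})")
print("2-bit centroids:", [round(c, 3) for c in codebook])
```

Because every sampled vector has unit norm, the mean squared coordinate is 1/dim, so the coordinates cluster tightly around zero as dim grows. That is why a single precomputed scalar codebook can serve every coordinate.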
Attributes:
| Name | Type | Description |
|---|---|---|
| dim | int | Vector dimension. |
| bits | int | Quantization bit-width. |
| codebook | LloydMaxCodebook | Precomputed Lloyd-Max codebook. |
| rotation | Tensor | Orthogonal rotation matrix, shape (dim, dim). |
Examples:
import torch
from turboquant_vllm.quantizer import TurboQuantMSE

quantizer = TurboQuantMSE(dim=64, bits=4)
indices, norms = quantizer.quantize(torch.randn(8, 64))
reconstructed = quantizer.dequantize(indices, norms)
Initialize the MSE quantizer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dim | int | Vector dimension (head dimension of the model). | required |
| bits | int | Quantization bits per coordinate (2-4 typical). | required |
| seed | int | Random seed for the rotation matrix. | 42 |
Source code in src/turboquant_vllm/quantizer.py
Functions¶
quantize ¶
Quantize vectors to centroid indices.
Applies rotation, extracts norms, normalizes to unit sphere, then quantizes each coordinate independently.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | Tensor | Input tensor of shape (..., dim). | required |
Returns:
| Type | Description |
|---|---|
| tuple[Tensor, Tensor] | Tuple of (indices, norms), where indices is a long tensor of shape (..., dim) and norms is a float tensor of shape (..., 1). |
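The steps described for quantize (rotate, extract the norm, normalize, pick the nearest centroid per coordinate) can be sketched for a single vector in plain Python. This is an illustration under stated assumptions, not the library's implementation: `random_orthogonal` and this standalone `quantize` are hypothetical helpers, and the codebook is a placeholder built from the 4-level Lloyd-Max values for a standard Gaussian scaled by 1/sqrt(dim).

```python
import math
import random

def random_orthogonal(dim, rng):
    """Gram-Schmidt on Gaussian vectors yields a random orthogonal matrix."""
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # skip the (unlikely) degenerate draw
            rows.append([a / norm for a in v])
    return rows

def quantize(x, rotation, centroids):
    """Return (indices, norm) for one nonzero vector x."""
    rotated = [sum(r_i * x_i for r_i, x_i in zip(row, x)) for row in rotation]
    norm = math.sqrt(sum(c * c for c in rotated))
    unit = [c / norm for c in rotated]  # point on the unit sphere
    indices = [min(range(len(centroids)), key=lambda j: abs(u - centroids[j]))
               for u in unit]
    return indices, norm

rng = random.Random(0)
dim = 8
rotation = random_orthogonal(dim, rng)
# Placeholder 2-bit codebook (assumption): Gaussian levels scaled by 1/sqrt(dim).
centroids = [c / math.sqrt(dim) for c in (-1.510, -0.4528, 0.4528, 1.510)]
x = [rng.gauss(0.0, 3.0) for _ in range(dim)]
indices, norm = quantize(x, rotation, centroids)
print(indices, round(norm, 4))
```

Since the rotation is orthogonal, the norm measured after rotation equals the norm of the input, so only the direction is quantized.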
dequantize ¶
Reconstruct vectors from centroid indices and norms.
Looks up centroids, applies inverse rotation, and rescales by stored norms.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| indices | Tensor | Long tensor of centroid indices, shape (..., dim). | required |
| norms | Tensor | Float tensor of vector norms, shape (..., 1). | required |
Returns:
| Type | Description |
|---|---|
| Tensor | Reconstructed float tensor of shape (..., dim). |
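The three dequantize steps (centroid lookup, inverse rotation, rescaling) can be sketched for one vector. This is a hedged illustration, not the library's code: the standalone `dequantize` function, the 2-D rotation, and the centroid values are all assumptions chosen to keep the example small. It assumes quantization applied `rotation @ x`, so the inverse is the transpose.

```python
import math

def dequantize(indices, norm, rotation, centroids):
    """Centroid lookup, inverse rotation (transpose), then rescale by norm."""
    rotated = [centroids[i] for i in indices]
    dim = len(rotated)
    # The rotation is orthogonal, so its inverse is its transpose.
    restored = [sum(rotation[r][c] * rotated[r] for r in range(dim))
                for c in range(dim)]
    return [norm * v for v in restored]

theta = 0.3
rotation = [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]
centroids = [-0.8, -0.3, 0.3, 0.8]  # illustrative values only
x_hat = dequantize([3, 1], 2.0, rotation, centroids)

# Rotating the result again recovers norm * centroid values.
check = [sum(a * b for a, b in zip(row, x_hat)) for row in rotation]
print([round(c, 6) for c in check])
```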
TurboQuantProd ¶
Two-stage quantizer with QJL correction for unbiased inner products.
Allocates (bits-1) bits to Lloyd-Max MSE quantization and 1 bit to Quantized Johnson-Lindenstrauss residual correction. The QJL step eliminates bias in dot-product estimation, which is critical for attention score computation (Q·K^T).
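The unbiased estimator is
`<q, k> ~ <q, k_mse> + ||r|| * sqrt(pi/2) / m * <S@q, sign(S@r)>`
where k_mse is the Stage 1 reconstruction, r = k - k_mse is the quantization residual, S is a random Gaussian projection matrix, and m is its number of rows (qjl_dim).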
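The sqrt(pi/2) factor in the correction term can be motivated with a short calculation. The sketch below assumes the rows s of S are i.i.d. standard Gaussian vectors, that S has m rows, and that the Stage 1 reconstruction k_mse (and hence r = k - k_mse) is held fixed while the expectation is taken over S. It illustrates why the correction is unbiased and is not a restatement of the paper's proof.

```latex
\begin{aligned}
q &= \frac{\langle q, r\rangle}{\lVert r\rVert^{2}}\, r + q_{\perp},
  \qquad \langle q_{\perp}, r\rangle = 0, \\
\mathbb{E}\bigl[\langle s, q\rangle \,\operatorname{sign}(\langle s, r\rangle)\bigr]
  &= \frac{\langle q, r\rangle}{\lVert r\rVert^{2}}\,
     \mathbb{E}\bigl|\langle s, r\rangle\bigr|
   = \sqrt{\tfrac{2}{\pi}}\,\frac{\langle q, r\rangle}{\lVert r\rVert}, \\
\mathbb{E}\Bigl[\lVert r\rVert \sqrt{\tfrac{\pi}{2}}\,\frac{1}{m}
  \bigl\langle S q, \operatorname{sign}(S r)\bigr\rangle\Bigr]
  &= \langle q, r\rangle
   = \langle q, k\rangle - \langle q, k_{\mathrm{mse}}\rangle .
\end{aligned}
```

The second line uses two facts: the projection of s onto q_perp is independent of the projection onto r and has zero mean, and the projection of s onto r is Gaussian with standard deviation ||r||, whose expected absolute value is ||r|| * sqrt(2/pi).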
Attributes:
| Name | Type | Description |
|---|---|---|
| dim | int | Vector dimension. |
| bits | int | Total bit budget (bits-1 for MSE, 1 for QJL). |
| mse_quantizer | TurboQuantMSE | Stage 1 quantizer with (bits-1) bits. |
| qjl_dim | int | Number of QJL projection dimensions. |
| qjl_matrix | Tensor | Random Gaussian projection matrix. |
Examples:
import torch
from turboquant_vllm.quantizer import TurboQuantProd

quantizer = TurboQuantProd(dim=64, bits=4)
indices, norms, signs, res_norms = quantizer.quantize(torch.randn(8, 64))
scores = quantizer.estimate_inner_product(
    torch.randn(1, 64), indices, norms, signs, res_norms
)
Initialize the two-stage quantizer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dim | int | Vector dimension (head dimension of the model). | required |
| bits | int | Total bit budget per coordinate. Must be >= 2 (at minimum, 1 bit for MSE + 1 bit for QJL). | required |
| qjl_dim | int \| None | Number of QJL projection dimensions. Defaults to dim (standard JL dimensionality). | None |
| seed | int | Random seed for rotation and projection matrices. | 42 |
Raises:
| Type | Description |
|---|---|
| ValueError | If bits < 2. |
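The bit-budget rule can be made concrete with a small sketch. The helper `split_bit_budget` and its error message are hypothetical, written only to illustrate the documented constraint. The constructor's actual validation code is not shown on this page.

```python
def split_bit_budget(bits):
    """Hypothetical helper: split a total budget into (mse_bits, qjl_bits)."""
    if bits < 2:
        raise ValueError(f"bits must be >= 2, got {bits}")
    return bits - 1, 1

mse_bits, qjl_bits = split_bit_budget(4)
num_centroids = 2 ** mse_bits  # Lloyd-Max levels available to Stage 1
print(mse_bits, qjl_bits, num_centroids)
```

With bits=4, Stage 1 therefore works with an 8-level codebook rather than the 16 levels a pure MSE quantizer would get at the same budget.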
Functions¶
quantize ¶
Quantize vectors with MSE + QJL correction.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | Tensor | Input tensor of shape (..., dim). | required |
Returns:
| Type | Description |
|---|---|
| tuple[Tensor, Tensor, Tensor, Tensor] | Tuple of (indices, norms, qjl_signs, residual_norms), described below. |

- indices: Lloyd-Max centroid indices, shape (..., dim)
- norms: Vector norms, shape (..., 1)
- qjl_signs: Sign bits of projected residuals, shape (..., qjl_dim)
- residual_norms: Norms of quantization residuals, shape (..., 1)
dequantize ¶
Reconstruct vectors from compressed representation.
Note: Full reconstruction is approximate. For attention computation, use estimate_inner_product instead; it is more accurate because QJL corrects inner-product bias, not reconstruction bias.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| indices | Tensor | Lloyd-Max centroid indices, shape (..., dim). | required |
| norms | Tensor | Vector norms, shape (..., 1). | required |
| qjl_signs | Tensor | QJL sign bits, shape (..., qjl_dim). | required |
| residual_norms | Tensor | Residual norms, shape (..., 1). | required |
Returns:
| Type | Description |
|---|---|
| Tensor | Approximately reconstructed tensor of shape (..., dim). |
estimate_inner_product ¶
estimate_inner_product(
query: Tensor,
indices: Tensor,
norms: Tensor,
qjl_signs: Tensor,
residual_norms: Tensor,
) -> Tensor
Compute unbiased inner product estimate between query and compressed key.
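Uses the two-stage estimator
`<q, k> ~ <q, k_mse> + ||r|| * sqrt(pi/2) / m * <S@q, signs>`
where k_mse is the Stage 1 reconstruction from indices and norms, ||r|| is residual_norms, signs is qjl_signs, and m is the number of QJL projection dimensions (qjl_dim).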
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query | Tensor | Query vectors, shape (..., dim). | required |
| indices | Tensor | Compressed key indices, shape (..., dim). | required |
| norms | Tensor | Key norms, shape (..., 1). | required |
| qjl_signs | Tensor | QJL sign bits for keys, shape (..., qjl_dim). | required |
| residual_norms | Tensor | Key residual norms, shape (..., 1). | required |
Returns:
| Type | Description |
|---|---|
| Tensor | Inner product estimates, with shape given by broadcasting the query and key batch dimensions. |
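The unbiasedness of the two-stage estimate can be checked with a small Monte Carlo experiment in plain Python. This is a sketch under assumptions, not the library's estimate_inner_product: `qjl_correction` is a hypothetical helper that draws a fresh Gaussian S on every call, and the Stage 1 reconstruction is replaced by a simple stand-in (the key rounded to a coarse grid) so that a nonzero residual exists.

```python
import math
import random

def qjl_correction(q, r, m, rng):
    """One draw of ||r|| * sqrt(pi/2) / m * <S@q, sign(S@r)> with Gaussian S."""
    r_norm = math.sqrt(sum(c * c for c in r))
    total = 0.0
    for _ in range(m):
        s = [rng.gauss(0.0, 1.0) for _ in range(len(q))]  # one row of S
        sq = sum(a * b for a, b in zip(s, q))
        sr = sum(a * b for a, b in zip(s, r))
        total += sq * (1.0 if sr >= 0 else -1.0)
    return r_norm * math.sqrt(math.pi / 2) / m * total

rng = random.Random(0)
dim = 16
q = [rng.gauss(0.0, 1.0) for _ in range(dim)]
k = [rng.gauss(0.0, 1.0) for _ in range(dim)]
# Stand-in for the Stage 1 reconstruction (assumption): round to a 0.5 grid.
k_mse = [round(c * 2) / 2 for c in k]
r = [a - b for a, b in zip(k, k_mse)]

exact = sum(a * b for a, b in zip(q, k))
base = sum(a * b for a, b in zip(q, k_mse))
trials = 2000
avg = sum(base + qjl_correction(q, r, dim, rng) for _ in range(trials)) / trials
print(f"exact={exact:.4f}  stage1_only={base:.4f}  two_stage_mean={avg:.4f}")
```

Averaged over many draws of S, the two-stage estimate approaches the exact inner product, while the Stage 1 term alone stays offset by the fixed amount `<q, r>`. A single draw with a stored S is noisier, which is the variance cost of spending only 1 bit on the correction.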