lloyd_max
turboquant_vllm.lloyd_max ¶
Lloyd-Max optimal scalar quantizer for Beta-distributed coordinates.
After random orthogonal rotation, each coordinate of a unit-norm vector follows a Beta distribution concentrated near zero. The Lloyd-Max algorithm finds the optimal set of centroids (reconstruction points) that minimize mean squared error for this known distribution.
For dimensions d >= 64, the Beta distribution is well-approximated by a Gaussian N(0, 1/d), which simplifies the codebook computation.
Reference: Section 3.1 of arXiv 2504.19874.
Examples:
from turboquant_vllm.lloyd_max import solve_lloyd_max, LloydMaxCodebook
centroids, boundaries = solve_lloyd_max(d=128, bits=3)
codebook = LloydMaxCodebook(centroids, boundaries, bits=3, dim=128)
See Also
:func:solve_lloyd_max: Factory that computes centroids and boundaries.
:class:LloydMaxCodebook: Dataclass wrapping a precomputed codebook.
Classes¶
LloydMaxCodebook
dataclass
¶
Precomputed optimal scalar quantizer for a given dimension and bit-width.
The codebook stores centroids and boundaries computed by the Lloyd-Max algorithm. It maps continuous coordinate values to discrete indices and back via nearest-centroid lookup.
Attributes:
| Name | Type | Description |
|---|---|---|
centroids |
Tensor
|
Reconstruction values, shape |
boundaries |
Tensor
|
Partition boundaries, shape |
bits |
int
|
Number of quantization bits. |
dim |
int
|
Vector dimension used to compute the codebook. |
Examples:
Round-trip quantize and dequantize a tensor:
codebook = LloydMaxCodebook(centroids, boundaries, bits=3, dim=128)
indices = codebook.quantize(x)
x_hat = codebook.dequantize(indices)
Functions¶
quantize ¶
Map continuous values to nearest centroid indices.
Uses bucket search on partition boundaries for O(log n) lookup.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
x
|
Tensor
|
Input tensor of any shape. |
required |
Returns:
| Type | Description |
|---|---|
Tensor
|
Integer tensor of same shape with centroid indices in |
Tensor
|
[0, 2^bits - 1]. |
Source code in src/turboquant_vllm/lloyd_max.py
dequantize ¶
Reconstruct continuous values from centroid indices.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
indices
|
Tensor
|
Integer tensor of centroid indices. |
required |
Returns:
| Type | Description |
|---|---|
Tensor
|
Float tensor of reconstructed values with same shape as indices. |
Source code in src/turboquant_vllm/lloyd_max.py
Functions¶
solve_lloyd_max ¶
solve_lloyd_max(
d: int,
bits: int,
*,
use_exact: bool = False,
max_iter: int = 200,
tol: float = 1e-10,
) -> tuple[Tensor, Tensor]
Solve the Lloyd-Max conditions for optimal scalar quantization.
Results are cached by (d, bits, use_exact) so that multi-layer models (e.g., 32 layers × 2 K/V compressors = 64 calls) pay the scipy integration cost only once. Without caching, initialization takes 2+ minutes for models like Molmo2-8B.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
d
|
int
|
Vector dimension (determines the distribution shape). |
required |
bits
|
int
|
Number of quantization bits (produces 2^bits centroids). |
required |
use_exact
|
bool
|
If True, use exact Beta PDF. If False, use Gaussian approximation (faster, accurate for d >= 64). |
False
|
max_iter
|
int
|
Maximum Lloyd-Max iterations. |
200
|
tol
|
float
|
Convergence tolerance on centroid movement. |
1e-10
|
Returns:
| Type | Description |
|---|---|
Tensor
|
Tuple of (centroids, boundaries) as 1-D tensors. Centroids has |
Tensor
|
length 2^bits, boundaries has length 2^bits - 1. |