# lloyd_max

`turboquant_vllm.lloyd_max`

Lloyd-Max optimal scalar quantizer for Beta-distributed coordinates.

After random orthogonal rotation, each coordinate of a unit-norm vector follows a Beta distribution concentrated near zero. The Lloyd-Max algorithm finds the optimal set of centroids (reconstruction points) that minimize mean squared error for this known distribution.

For dimensions d >= 64, the Beta distribution is well-approximated by a Gaussian N(0, 1/d), which simplifies the codebook computation.

Reference: Section 3.1 of arXiv 2504.19874.
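To illustrate the approximation (a self-contained sketch, not part of the library): a Gaussian vector, once normalized, is uniform on the unit sphere, so its first coordinate is a sample of the Beta-derived distribution above, and for d = 128 its empirical variance lands very close to 1/d.

```python
import math
import random

random.seed(0)
d, n = 128, 20000
coords = []
for _ in range(n):
    # Draw a standard Gaussian vector and normalize it to unit length;
    # the result is uniform on the sphere, so v[0]/||v|| samples the
    # coordinate distribution of a random unit-norm vector.
    v = [random.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    coords.append(v[0] / norm)

# Empirical variance of the coordinate vs. the N(0, 1/d) approximation.
var = sum(x * x for x in coords) / n
print(f"empirical variance {var:.6f} vs 1/d = {1 / d:.6f}")
```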

Examples:

```python
from turboquant_vllm.lloyd_max import solve_lloyd_max, LloydMaxCodebook

centroids, boundaries = solve_lloyd_max(d=128, bits=3)
codebook = LloydMaxCodebook(centroids, boundaries, bits=3, dim=128)
```

See Also

:func:`solve_lloyd_max`: Factory that computes centroids and boundaries.
:class:`LloydMaxCodebook`: Dataclass wrapping a precomputed codebook.

## Classes

### LloydMaxCodebook `dataclass`

```python
LloydMaxCodebook(centroids: Tensor, boundaries: Tensor, bits: int, dim: int)
```

Precomputed optimal scalar quantizer for a given dimension and bit-width.

The codebook stores centroids and boundaries computed by the Lloyd-Max algorithm. It maps continuous coordinate values to discrete indices and back via nearest-centroid lookup.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `centroids` | `Tensor` | Reconstruction values, shape `(2^bits,)`. |
| `boundaries` | `Tensor` | Partition boundaries, shape `(2^bits - 1,)`. |
| `bits` | `int` | Number of quantization bits. |
| `dim` | `int` | Vector dimension used to compute the codebook. |

Examples:

Round-trip quantize and dequantize a tensor:

```python
codebook = LloydMaxCodebook(centroids, boundaries, bits=3, dim=128)
indices = codebook.quantize(x)
x_hat = codebook.dequantize(indices)
```

## Functions

### quantize

```python
quantize(x: Tensor) -> Tensor
```

Map continuous values to nearest centroid indices.

Uses bucket search on partition boundaries for O(log n) lookup.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `x` | `Tensor` | Input tensor of any shape. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `Tensor` | Integer tensor of same shape with centroid indices in `[0, 2^bits - 1]`. |

Source code in `src/turboquant_vllm/lloyd_max.py`

```python
def quantize(self, x: torch.Tensor) -> torch.Tensor:
    """Map continuous values to nearest centroid indices.

    Uses bucket search on partition boundaries for O(log n) lookup.

    Args:
        x: Input tensor of any shape.

    Returns:
        Integer tensor of same shape with centroid indices in
        [0, 2^bits - 1].
    """
    bounds = self.boundaries.to(x.device)
    # bucketize returns the index of the bucket each value falls into
    return torch.bucketize(x, bounds)
```
### dequantize

```python
dequantize(indices: Tensor) -> Tensor
```

Reconstruct continuous values from centroid indices.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `indices` | `Tensor` | Integer tensor of centroid indices. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `Tensor` | Float tensor of reconstructed values with same shape as `indices`. |

Source code in `src/turboquant_vllm/lloyd_max.py`

```python
def dequantize(self, indices: torch.Tensor) -> torch.Tensor:
    """Reconstruct continuous values from centroid indices.

    Args:
        indices: Integer tensor of centroid indices.

    Returns:
        Float tensor of reconstructed values with same shape as indices.
    """
    cents = self.centroids.to(indices.device)
    return cents[indices]
```
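The quantize/dequantize pair can be sketched end to end with NumPy, using `np.digitize` as a stand-in for `torch.bucketize`. The toy centroid values below are illustrative, not a real Lloyd-Max codebook; the only structural assumption is the nearest-centroid rule, under which boundaries sit at the midpoints between adjacent centroids.

```python
import numpy as np

# Toy 2-bit codebook: 4 centroids; under the nearest-centroid rule the
# partition boundaries are the midpoints between adjacent centroids.
centroids = np.array([-0.3, -0.1, 0.1, 0.3])
boundaries = (centroids[:-1] + centroids[1:]) / 2  # [-0.2, 0.0, 0.2]

x = np.array([-0.25, -0.05, 0.02, 0.4])
indices = np.digitize(x, boundaries)  # bucket index per value, like bucketize
x_hat = centroids[indices]            # dequantize is a table lookup
print(indices.tolist())  # [0, 1, 2, 3]
print(x_hat.tolist())    # [-0.3, -0.1, 0.1, 0.3]
```

Each input lands on its nearest centroid, so the round-trip error is bounded by half the local centroid spacing.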

## Functions

### solve_lloyd_max

```python
solve_lloyd_max(
    d: int,
    bits: int,
    *,
    use_exact: bool = False,
    max_iter: int = 200,
    tol: float = 1e-10,
) -> tuple[Tensor, Tensor]
```

Solve the Lloyd-Max conditions for optimal scalar quantization.

Results are cached by (d, bits, use_exact) so that multi-layer models (e.g., 32 layers × 2 K/V compressors = 64 calls) pay the scipy integration cost only once. Without caching, initialization takes 2+ minutes for models like Molmo2-8B.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `d` | `int` | Vector dimension (determines the distribution shape). | *required* |
| `bits` | `int` | Number of quantization bits (produces 2^bits centroids). | *required* |
| `use_exact` | `bool` | If True, use the exact Beta PDF; if False, use the Gaussian approximation (faster, accurate for d >= 64). | `False` |
| `max_iter` | `int` | Maximum Lloyd-Max iterations. | `200` |
| `tol` | `float` | Convergence tolerance on centroid movement. | `1e-10` |

Returns:

| Type | Description |
| --- | --- |
| `tuple[Tensor, Tensor]` | Tuple of `(centroids, boundaries)` as 1-D tensors; `centroids` has length 2^bits, `boundaries` has length 2^bits - 1. |

Source code in `src/turboquant_vllm/lloyd_max.py`

```python
def solve_lloyd_max(
    d: int,
    bits: int,
    *,
    use_exact: bool = False,
    max_iter: int = 200,
    tol: float = 1e-10,
) -> tuple[torch.Tensor, torch.Tensor]:
    """Solve the Lloyd-Max conditions for optimal scalar quantization.

    Results are cached by (d, bits, use_exact) so that multi-layer models
    (e.g., 32 layers × 2 K/V compressors = 64 calls) pay the scipy
    integration cost only once. Without caching, initialization takes
    2+ minutes for models like Molmo2-8B.

    Args:
        d: Vector dimension (determines the distribution shape).
        bits: Number of quantization bits (produces 2^bits centroids).
        use_exact: If True, use exact Beta PDF. If False, use Gaussian
            approximation (faster, accurate for d >= 64).
        max_iter: Maximum Lloyd-Max iterations.
        tol: Convergence tolerance on centroid movement.

    Returns:
        Tuple of (centroids, boundaries) as 1-D tensors. Centroids has
        length 2^bits, boundaries has length 2^bits - 1.
    """
    return _solve_lloyd_max_cached(d, bits, use_exact, max_iter, tol)
```
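For intuition about what the solver computes, the two Lloyd-Max fixed-point conditions (boundaries at centroid midpoints, centroids at the conditional mean of their cell) can be sketched with a sample-based Lloyd iteration over draws from the Gaussian approximation N(0, 1/d). This is an illustrative stand-in, not the library's PDF-integration solver, and `lloyd_max_1d` is a hypothetical helper name.

```python
import math
import random

def lloyd_max_1d(samples, bits, max_iter=100, tol=1e-8):
    """Sample-based Lloyd iteration: boundaries at centroid midpoints,
    centroids at the mean of their cell (hypothetical sketch)."""
    k = 2 ** bits
    xs = sorted(samples)
    n = len(xs)
    # Initialize centroids at evenly spaced sample quantiles.
    cents = [xs[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(max_iter):
        bounds = [(a + b) / 2 for a, b in zip(cents, cents[1:])]
        cells = [[] for _ in range(k)]
        j = 0
        for x in xs:  # xs is sorted, so the cell index only increases
            while j < k - 1 and x > bounds[j]:
                j += 1
            cells[j].append(x)
        # Move each centroid to the mean of its cell; keep it if the
        # cell happens to be empty.
        new = [sum(c) / len(c) if c else cents[i] for i, c in enumerate(cells)]
        move = max(abs(a - b) for a, b in zip(cents, new))
        cents = new
        if move < tol:
            break
    return cents, [(a + b) / 2 for a, b in zip(cents, cents[1:])]

random.seed(0)
d = 128
samples = [random.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(50000)]
centroids, boundaries = lloyd_max_1d(samples, bits=2)
# Known result for a 2-bit Gaussian quantizer: centroids near
# +/- 0.4528 * sigma and +/- 1.510 * sigma, here with sigma = 1/sqrt(d).
print([round(c, 4) for c in centroids])
```

The sample-based estimate converges close to the classical 2-bit Gaussian centroids, which is why the cached, PDF-based solver only needs to run once per `(d, bits, use_exact)` key.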