# lloyd_max

`turboquant_vllm.lloyd_max`

Lloyd-Max optimal scalar quantizer for Beta-distributed coordinates.

After random orthogonal rotation, each coordinate of a unit-norm vector follows a Beta distribution concentrated near zero. The Lloyd-Max algorithm finds the optimal set of centroids (reconstruction points) that minimize mean squared error for this known distribution.

For dimensions d >= 64, the Beta distribution is well-approximated by a Gaussian N(0, 1/d), which simplifies the codebook computation.

Reference: Section 3.1 of arXiv 2504.19874.
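To illustrate the approximation (a self-contained sketch, not part of the library): a Gaussian vector, once normalized, is uniform on the unit sphere, so its first coordinate is a sample of the Beta-derived distribution above, and for d = 128 its empirical variance lands very close to 1/d.

```python
import math
import random

random.seed(0)
d, n = 128, 20000
coords = []
for _ in range(n):
    # Draw a standard Gaussian vector and normalize it to unit length;
    # the result is uniform on the sphere, so v[0]/||v|| samples the
    # coordinate distribution of a random unit-norm vector.
    v = [random.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    coords.append(v[0] / norm)

# Empirical variance of the coordinate vs. the N(0, 1/d) approximation.
var = sum(x * x for x in coords) / n
print(f"empirical variance {var:.6f} vs 1/d = {1 / d:.6f}")
```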

Examples:

```python
from turboquant_vllm.lloyd_max import solve_lloyd_max, LloydMaxCodebook

centroids, boundaries = solve_lloyd_max(d=128, bits=3)
codebook = LloydMaxCodebook(centroids, boundaries, bits=3, dim=128)
```

See Also

:func:`solve_lloyd_max`: Factory that computes centroids and boundaries.
:class:`LloydMaxCodebook`: Dataclass wrapping a precomputed codebook.

## Classes

### LloydMaxCodebook `dataclass`

```python
LloydMaxCodebook(centroids: Tensor, boundaries: Tensor, bits: int, dim: int)
```

Precomputed optimal scalar quantizer for a given dimension and bit-width.

The codebook stores centroids and boundaries computed by the Lloyd-Max algorithm. It maps continuous coordinate values to discrete indices and back via nearest-centroid lookup.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `centroids` | `Tensor` | Reconstruction values, shape `(2^bits,)`. |
| `boundaries` | `Tensor` | Partition boundaries, shape `(2^bits - 1,)`. |
| `bits` | `int` | Number of quantization bits. |
| `dim` | `int` | Vector dimension used to compute the codebook. |

Examples:

Round-trip quantize and dequantize a tensor:

```python
codebook = LloydMaxCodebook(centroids, boundaries, bits=3, dim=128)
indices = codebook.quantize(x)
x_hat = codebook.dequantize(indices)
```

## Functions

### quantize

```python
quantize(x: Tensor) -> Tensor
```

Map continuous values to nearest centroid indices.

Uses bucket search on partition boundaries for O(log n) lookup.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `x` | `Tensor` | Input tensor of any shape. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `Tensor` | Integer tensor of same shape with centroid indices in `[0, 2^bits - 1]`. |

Source code in `src/turboquant_vllm/lloyd_max.py`

```python
def quantize(self, x: torch.Tensor) -> torch.Tensor:
    """Map continuous values to nearest centroid indices.

    Uses bucket search on partition boundaries for O(log n) lookup.

    Args:
        x: Input tensor of any shape.

    Returns:
        Integer tensor of same shape with centroid indices in
        [0, 2^bits - 1].
    """
    bounds = self.boundaries.to(x.device)
    # bucketize returns the index of the bucket each value falls into
    return torch.bucketize(x, bounds)
```
### dequantize

```python
dequantize(indices: Tensor) -> Tensor
```

Reconstruct continuous values from centroid indices.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `indices` | `Tensor` | Integer tensor of centroid indices. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `Tensor` | Float tensor of reconstructed values with same shape as `indices`. |

Source code in `src/turboquant_vllm/lloyd_max.py`

```python
def dequantize(self, indices: torch.Tensor) -> torch.Tensor:
    """Reconstruct continuous values from centroid indices.

    Args:
        indices: Integer tensor of centroid indices.

    Returns:
        Float tensor of reconstructed values with same shape as indices.
    """
    cents = self.centroids.to(indices.device)
    return cents[indices]
```
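The quantize/dequantize pair can be sketched end to end with NumPy, using `np.digitize` as a stand-in for `torch.bucketize`. The toy centroid values below are illustrative, not a real Lloyd-Max codebook; the only structural assumption is the nearest-centroid rule, under which boundaries sit at the midpoints between adjacent centroids.

```python
import numpy as np

# Toy 2-bit codebook: 4 centroids; under the nearest-centroid rule the
# partition boundaries are the midpoints between adjacent centroids.
centroids = np.array([-0.3, -0.1, 0.1, 0.3])
boundaries = (centroids[:-1] + centroids[1:]) / 2  # [-0.2, 0.0, 0.2]

x = np.array([-0.25, -0.05, 0.02, 0.4])
indices = np.digitize(x, boundaries)  # bucket index per value, like bucketize
x_hat = centroids[indices]            # dequantize is a table lookup
print(indices.tolist())  # [0, 1, 2, 3]
print(x_hat.tolist())    # [-0.3, -0.1, 0.1, 0.3]
```

Each input lands on its nearest centroid, so the round-trip error is bounded by half the local centroid spacing.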

## Functions

### solve_lloyd_max

```python
solve_lloyd_max(
    d: int,
    bits: int,
    *,
    use_exact: bool = False,
    max_iter: int = 200,
    tol: float = 1e-10,
) -> tuple[Tensor, Tensor]
```

Solve the Lloyd-Max conditions for optimal scalar quantization.

Results are cached by (d, bits, use_exact) so that multi-layer models (e.g., 32 layers × 2 K/V compressors = 64 calls) pay the scipy integration cost only once. Without caching, initialization takes 2+ minutes for models like Molmo2-8B.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `d` | `int` | Vector dimension (determines the distribution shape). | *required* |
| `bits` | `int` | Number of quantization bits (produces 2^bits centroids). | *required* |
| `use_exact` | `bool` | If True, use the exact Beta PDF; if False, use the Gaussian approximation (faster, accurate for d >= 64). | `False` |
| `max_iter` | `int` | Maximum Lloyd-Max iterations. | `200` |
| `tol` | `float` | Convergence tolerance on centroid movement. | `1e-10` |

Returns:

| Type | Description |
| --- | --- |
| `tuple[Tensor, Tensor]` | Tuple of `(centroids, boundaries)` as 1-D tensors; `centroids` has length 2^bits, `boundaries` has length 2^bits - 1. |

Source code in `src/turboquant_vllm/lloyd_max.py`

```python
def solve_lloyd_max(
    d: int,
    bits: int,
    *,
    use_exact: bool = False,
    max_iter: int = 200,
    tol: float = 1e-10,
) -> tuple[torch.Tensor, torch.Tensor]:
    """Solve the Lloyd-Max conditions for optimal scalar quantization.

    Results are cached by (d, bits, use_exact) so that multi-layer models
    (e.g., 32 layers × 2 K/V compressors = 64 calls) pay the scipy
    integration cost only once. Without caching, initialization takes
    2+ minutes for models like Molmo2-8B.

    Args:
        d: Vector dimension (determines the distribution shape).
        bits: Number of quantization bits (produces 2^bits centroids).
        use_exact: If True, use exact Beta PDF. If False, use Gaussian
            approximation (faster, accurate for d >= 64).
        max_iter: Maximum Lloyd-Max iterations.
        tol: Convergence tolerance on centroid movement.

    Returns:
        Tuple of (centroids, boundaries) as 1-D tensors. Centroids has
        length 2^bits, boundaries has length 2^bits - 1.
    """
    return _solve_lloyd_max_cached(d, bits, use_exact, max_iter, tol)
```
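For intuition about what the solver computes, the two Lloyd-Max fixed-point conditions (boundaries at centroid midpoints, centroids at the conditional mean of their cell) can be sketched with a sample-based Lloyd iteration over draws from the Gaussian approximation N(0, 1/d). This is an illustrative stand-in, not the library's PDF-integration solver, and `lloyd_max_1d` is a hypothetical helper name.

```python
import math
import random

def lloyd_max_1d(samples, bits, max_iter=100, tol=1e-8):
    """Sample-based Lloyd iteration: boundaries at centroid midpoints,
    centroids at the mean of their cell (hypothetical sketch)."""
    k = 2 ** bits
    xs = sorted(samples)
    n = len(xs)
    # Initialize centroids at evenly spaced sample quantiles.
    cents = [xs[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(max_iter):
        bounds = [(a + b) / 2 for a, b in zip(cents, cents[1:])]
        cells = [[] for _ in range(k)]
        j = 0
        for x in xs:  # xs is sorted, so the cell index only increases
            while j < k - 1 and x > bounds[j]:
                j += 1
            cells[j].append(x)
        # Move each centroid to the mean of its cell; keep it if the
        # cell happens to be empty.
        new = [sum(c) / len(c) if c else cents[i] for i, c in enumerate(cells)]
        move = max(abs(a - b) for a, b in zip(cents, new))
        cents = new
        if move < tol:
            break
    return cents, [(a + b) / 2 for a, b in zip(cents, cents[1:])]

random.seed(0)
d = 128
samples = [random.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(50000)]
centroids, boundaries = lloyd_max_1d(samples, bits=2)
# Known result for a 2-bit Gaussian quantizer: centroids near
# +/- 0.4528 * sigma and +/- 1.510 * sigma, here with sigma = 1/sqrt(d).
print([round(c, 4) for c in centroids])
```

The sample-based estimate converges close to the classical 2-bit Gaussian centroids, which is why the cached, PDF-based solver only needs to run once per `(d, bits, use_exact)` key.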