A Guide to LLM Quantization

A step-by-step walkthrough of LLM quantization techniques.

1. Core Concepts

Quantization converts high-precision numbers (like FP32) to low-precision ones (like INT8 or FP4) to save memory and speed up calculations. The method chosen determines how the original values are mapped to the new, smaller range.

Scale Factor

A positive float that acts as a conversion ratio. It maps the range of the original FP32 values to the target quantized range. For integer quantization, this is often calculated as (original_range / quantized_range).

Zero-Point

An integer offset used in asymmetric quantization to ensure that the FP32 value 0.0 maps exactly to an integer in the new range, with no rounding error. This is crucial for data that isn't centered on zero.
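
To make these two definitions concrete, here is a minimal Python/NumPy sketch of asymmetric quantization to INT8 and the matching de-quantization. The function names and the [-128, 127] target range are illustrative assumptions, not any particular library's API.

    import numpy as np

    QMIN, QMAX = -128, 127  # assumed signed INT8 target range

    def asymmetric_quantize(values, qmin=QMIN, qmax=QMAX):
        # scale maps the original FP32 range onto the integer range
        vmin, vmax = float(values.min()), float(values.max())
        scale = (vmax - vmin) / (qmax - qmin)            # original_range / quantized_range
        zero_point = int(qmin - round(vmin / scale))     # the integer that FP32 0.0 maps to
        q = np.clip(np.round(values / scale) + zero_point, qmin, qmax).astype(np.int8)
        return q, scale, zero_point

    def dequantize(q, scale, zero_point):
        # recover approximate FP32 values from the stored integers
        return (q.astype(np.float32) - zero_point) * scale

    # Example: the weight matrix used in the walkthrough below
    W = np.array([[0.7, -0.2], [-0.5, 0.1]], dtype=np.float32)
    W_q, scale_w, zp_w = asymmetric_quantize(W)
    W_hat = dequantize(W_q, scale_w, zp_w)   # close to W, but not bit-exact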

2. Walkthrough: INT8 Quantized Multiplication

This example shows how a simple matrix multiplication ($Output = W \cdot x$) is performed using 8-bit integer (INT8) quantization. The weights (W) are quantized asymmetrically and the input (x) symmetrically.

Initial State: FP32 Values

Here are our original 32-bit floating-point matrices. Our goal is to perform the multiplication using 8-bit integers to see how close we can get to the true result.

Weight Matrix (W)

    [  0.7  -0.2 ]
    [ -0.5   0.1 ]

Input Vector (x)

    [ 0.8 ]
    [ 0.6 ]

True FP32 Result: W · x = (0.44, -0.34)

Step 1: Quantize Weight Matrix (W) to INT8

We find the range of W and calculate the scale and zero-point for asymmetric quantization.

  • Range: max(W) - min(W) = 0.7 - (-0.5) = 1.2
  • Scale (W): 1.2 / 255 ≈ 0.0047
  • Zero-Point (W): -128 - round(-0.5 / 0.0047) = -128 - (-106) = -22

Applying the formula round(value / scale) + zp gives the INT8 matrix.

Weight Matrix (W) - INT8

    [  127   -64 ]
    [ -128    -1 ]
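
Here is the same step as a minimal Python/NumPy sketch (an illustrative helper, not a specific library's API; individual integers can shift by one depending on how rounding ties are broken):

    import numpy as np

    W = np.array([[0.7, -0.2], [-0.5, 0.1]], dtype=np.float32)

    # Asymmetric parameters derived from the full range of W
    scale_w = float(W.max() - W.min()) / 255.0            # 1.2 / 255 ≈ 0.0047
    zp_w = int(-128 - round(float(W.min()) / scale_w))    # -128 - (-106) = -22

    # round(value / scale) + zero_point, clamped to the INT8 range
    W_int8 = np.clip(np.round(W / scale_w) + zp_w, -128, 127).astype(np.int8)
    # -> approximately [[127, -64], [-128, -1]]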

Step 2: Quantize Input Vector (x) to INT8

For the input, we use symmetric quantization (zero-point is 0).

  • Range: max(abs(x)) = 0.8
  • Scale (x): 0.8 / 127 ≈ 0.0063

The formula is just round(value / scale).

Input Vector (x) - INT8

    [ 127 ]
    [  95 ]
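
The matching sketch for the symmetric case (again illustrative, not a specific library's API):

    import numpy as np

    x = np.array([0.8, 0.6], dtype=np.float32)

    # Symmetric parameters: the zero-point is fixed at 0
    scale_x = float(np.abs(x).max()) / 127.0        # 0.8 / 127 ≈ 0.0063

    x_int8 = np.clip(np.round(x / scale_x), -127, 127).astype(np.int8)
    # -> approximately [127, 95]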

Step 3: Perform Integer Multiplication

Now we multiply the INT8 matrix by the INT8 vector, accumulating the sums in INT32. This is computationally very fast on modern hardware.

(127 * 127) + (-64 * 95) = 10049

(-128 * 127) + (-1 * 95) = -16351

Intermediate INT32 Result: (10049, -16351)
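
As a sketch, this step is just an integer matrix-vector product, cast up to INT32 so the accumulated sums cannot overflow 8 bits:

    import numpy as np

    W_int8 = np.array([[127, -64], [-128, -1]], dtype=np.int8)
    x_int8 = np.array([127, 95], dtype=np.int8)

    # Cast to INT32 before multiplying so the accumulation does not overflow
    acc_int32 = W_int8.astype(np.int32) @ x_int8.astype(np.int32)
    # -> [10049, -16351]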

Step 4: De-quantize with Zero-Point Correction

We apply a correction based on the weight's zero-point and then de-quantize by multiplying by the combined scale (scale_W * scale_x). The correction follows from the quantization formula: since W_int8 = round(W / scale_W) + zp_W, the true product is scale_W * scale_x * (W_int8 - zp_W) · x_int8, so we subtract zp_W times the sum of the input integers from the INT32 result before rescaling.

  • Correction: -(-22) * (127 + 95) = 4884
  • Corrected Result: (10049 + 4884, -16351 + 4884) = (14933, -11467)

Final INT8 Quantized FP32 Result: (0.443, -0.340)
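
A minimal sketch of the correction and rescale, reusing the scales and zero-point from Steps 1 and 2:

    import numpy as np

    W_int8 = np.array([[127, -64], [-128, -1]], dtype=np.int32)
    x_int8 = np.array([127, 95], dtype=np.int32)
    scale_w, zp_w = 1.2 / 255.0, -22                # from Step 1
    scale_x = 0.8 / 127.0                           # from Step 2

    acc = W_int8 @ x_int8                           # [10049, -16351]
    acc = acc - zp_w * x_int8.sum()                 # correction: + 22 * 222 = + 4884
    output = acc * (scale_w * scale_x)              # de-quantize with the combined scale
    # -> approximately [0.443, -0.340], vs. the true FP32 result (0.44, -0.34)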

3. Walkthrough: MXFP4 Quantized Multiplication

This example demonstrates a more advanced technique using a 4-bit floating-point format, often called FP4. Specifically, we'll use a variant called "micro-scaling" floating-point (MXFP4). It rests on two key ideas: every value in a block shares a single power-of-two scaling factor, and each scaled value is stored as a tiny 4-bit float.

Step 1: Quantize Weight Matrix (W) to MXFP4

First, we find the maximum absolute value in the entire matrix (our "block") to determine a single, shared scaling factor. This factor is typically a power of 2 for hardware efficiency.

  • Max absolute value in W: max(abs(0.7), abs(-0.2), ...) = 0.7
  • Shared Scaling Factor (S_w): 0.5 (the largest power of 2 not exceeding the max absolute value)

We divide each value by S_w and round to the nearest representable 4-bit float.

Weight Matrix (W) - Scaled MXFP4

    [  1.5   -0.375 ]
    [ -1.0    0.25  ]
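
Below is a minimal sketch of this block-wise quantization, assuming the standard E2M1 grid of representable 4-bit magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}. With that grid, the -0.2 and 0.1 entries snap to -0.5 and 0.0 rather than the finer -0.375 and 0.25 shown in the illustrative table above; the exact codes depend on which 4-bit format is assumed.

    import numpy as np

    # Assumed E2M1 grid of representable 4-bit magnitudes (sign is handled separately)
    FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

    def round_to_fp4(v):
        # snap each element to the nearest representable 4-bit magnitude, keeping its sign
        nearest = FP4_GRID[np.argmin(np.abs(np.abs(v)[..., None] - FP4_GRID), axis=-1)]
        return np.sign(v) * nearest

    def mx_quantize(block):
        # one shared power-of-two scale per block, then per-element 4-bit rounding
        max_abs = float(np.abs(block).max())
        shared_scale = 2.0 ** np.floor(np.log2(max_abs))   # largest power of 2 <= max_abs
        return round_to_fp4(block / shared_scale), shared_scale

    W = np.array([[0.7, -0.2], [-0.5, 0.1]])
    W_fp4, S_w = mx_quantize(W)
    # S_w == 0.5; with the E2M1 grid, -0.4 snaps to -0.5 and 0.2 snaps to 0.0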

Step 2: Quantize Input Vector (x) to MXFP4

We repeat the same process for the input vector block.

  • Max absolute value in x: max(abs(0.8), abs(0.6)) = 0.8
  • Shared Scaling Factor (S_x): 0.5

Divide by S_x and round.

Input Vector (x) - Scaled MXFP4

    [ 1.5 ]
    [ 1.0 ]

Step 3: Perform Low-Precision Floating-Point Multiplication

The hardware now performs a standard floating-point multiplication on these scaled 4-bit numbers.

(1.5 * 1.5) + (-0.375 * 1.0) = 1.875

(-1.0 * 1.5) + (0.25 * 1.0) = -1.25

Intermediate MXFP4 Result: (1.875, -1.25)

Step 4: De-quantize the Result

To get our final result, we simply multiply the intermediate result by the product of the two shared scaling factors.

  • Final De-quantize Scale: S_w * S_x = 0.5 * 0.5 = 0.25
  • 1.875 * 0.25 = 0.46875
  • -1.25 * 0.25 = -0.3125

Final MXFP4 Quantized FP32 Result: (0.469, -0.313)
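
Steps 3 and 4 together as a minimal sketch, emulating the 4-bit arithmetic with ordinary floats (real MX-capable hardware performs it natively):

    import numpy as np

    # Scaled 4-bit blocks and shared scales from the walkthrough above
    W_fp4 = np.array([[1.5, -0.375], [-1.0, 0.25]])
    x_fp4 = np.array([1.5, 1.0])
    S_w, S_x = 0.5, 0.5

    # Step 3: low-precision multiply (emulated here with ordinary floats)
    acc = W_fp4 @ x_fp4                  # [1.875, -1.25]

    # Step 4: de-quantize with the product of the two shared scales
    output = acc * (S_w * S_x)           # [0.46875, -0.3125]
    # vs. the true FP32 result (0.44, -0.34)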

4. Final Comparison & Key Takeaways

The final quantized results are remarkably close to the original FP32 calculation, demonstrating the effectiveness of both processes. The small error is an acceptable trade-off for the significant performance and memory benefits.

Result Comparison

  • True FP32 result: (0.44, -0.34)
  • INT8 quantized result: (0.443, -0.340)
  • MXFP4 quantized result: (0.469, -0.313)

Key Takeaways

  • Accuracy vs. Performance: Quantization introduces a small, manageable error in exchange for massive gains in speed and memory efficiency: INT8 weights are roughly 4x smaller than FP32, and MXFP4 weights are close to 8x smaller, minus a small overhead for the scales and zero-points.