A model optimization technique that reduces the numerical precision of model weights and activations from floating-point to lower-bit representations. Quantization dramatically reduces memory usage and inference latency, making it a key technique for deploying multimodal AI models in production.
Quantization maps continuous floating-point values (FP32) to discrete lower-precision values (INT8, INT4, or even binary). Post-training quantization applies this mapping after training, using calibration data to determine appropriate scaling factors. Quantization-aware training instead simulates quantization during training so the model adapts to the lower precision. The result is a smaller, faster model, typically with minimal accuracy loss.
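As a minimal sketch of this mapping (not tied to any particular library), the snippet below quantizes a stand-in FP32 weight tensor to INT8 with a single scale factor derived from the observed value range, then dequantizes it to measure the approximation error; the array contents and function names are purely illustrative.

```python
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Map FP32 values to INT8 with one scale fit to the observed range."""
    scale = float(np.max(np.abs(x))) / 127.0      # "calibration": use the data's max magnitude
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an FP32 approximation of the original values."""
    return q.astype(np.float32) * scale

weights = np.random.randn(256, 256).astype(np.float32)   # stand-in FP32 weight tensor
q_weights, scale = quantize_int8(weights)
recovered = dequantize(q_weights, scale)

print("max abs error:", float(np.max(np.abs(weights - recovered))))
print("bytes: %d -> %d (4x smaller)" % (weights.nbytes, q_weights.nbytes))
```

The same round-trip underlies post-training quantization in practice, except that scales are usually computed per channel or per tensor from calibration batches rather than from the weights alone.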
Common precision levels include FP16 (2x compression relative to FP32), INT8 (4x), INT4 (8x), and binary (32x). Symmetric quantization uses a single scale factor, while asymmetric quantization adds a zero-point offset to handle value ranges that are not centered on zero. GPTQ and AWQ are popular quantization methods for large language models, and GGUF is a widely used file format for distributing quantized models. Hardware accelerators (GPU tensor cores, Intel VNNI) provide optimized kernels for quantized computation. Mixed precision uses different precision levels for different layers based on each layer's sensitivity to quantization error.
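To illustrate the symmetric/asymmetric distinction, the sketch below (with made-up data) quantizes a skewed, non-negative tensor, similar to post-ReLU activations, both ways: the zero-point lets the asymmetric scheme spread the range over all 256 levels instead of only the positive half, which typically lowers the error.

```python
import numpy as np

def symmetric_int8(x: np.ndarray) -> np.ndarray:
    """Single scale, no offset; returns the dequantized approximation."""
    scale = float(np.max(np.abs(x))) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale

def asymmetric_uint8(x: np.ndarray) -> np.ndarray:
    """Scale plus zero-point; maps [min, max] onto the full 0..255 range."""
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0
    zero_point = round(-x_min / scale)            # integer offset so x_min maps to 0
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return (q.astype(np.float32) - zero_point) * scale

# Skewed, non-negative values standing in for post-ReLU activations.
activations = np.abs(np.random.randn(10_000)).astype(np.float32)
for name, fn in [("symmetric", symmetric_int8), ("asymmetric", asymmetric_uint8)]:
    mse = float(np.mean((activations - fn(activations)) ** 2))
    print(f"{name}: MSE = {mse:.8f}")
```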