A compression technique that maps high-dimensional vectors to a finite set of representative codewords, reducing storage and speeding up similarity search. Essential for scaling multimodal vector databases to billions of embeddings.
Vector quantization partitions the embedding space into regions, each represented by a centroid (codeword). During encoding, each vector is assigned to its nearest centroid, and only the centroid index is stored instead of the full vector, which cuts storage dramatically. At query time, distances are computed between the query and the small set of centroids rather than all original vectors, greatly reducing the amount of computation per search.
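A minimal sketch of this encode-and-search flow, assuming NumPy and scikit-learn; the array shapes, 256-entry codebook, and variable names are illustrative rather than a reference implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vectors = rng.standard_normal((10_000, 128)).astype(np.float32)  # toy embeddings

# Train a codebook: each k-means centroid acts as one codeword.
kmeans = KMeans(n_clusters=256, random_state=0).fit(vectors)
codebook = kmeans.cluster_centers_                    # (256, 128)

# Encode: keep only the index of the nearest centroid (1 byte per vector here
# instead of 128 float32 values).
codes = kmeans.predict(vectors).astype(np.uint8)

# Query: compute 256 query-to-centroid distances instead of 10,000
# query-to-vector distances, then rank stored vectors by their centroid's distance.
query = rng.standard_normal(128).astype(np.float32)
centroid_dists = np.linalg.norm(codebook - query, axis=1)  # (256,)
approx_dists = centroid_dists[codes]                       # one value per stored vector
top10 = np.argsort(approx_dists)[:10]
```

With a single codebook every vector in the same cell gets the same approximate distance, which is why practical systems refine this with product or residual quantization, described next.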
Common approaches include scalar quantization, which rounds each dimension to lower precision; k-means-based vector quantization with a single codebook; product quantization (PQ), which splits vectors into subspaces and quantizes each independently; and residual quantization, which iteratively encodes the error left by previous quantization stages. Codebook sizes typically range from 256 to 65,536 entries (8 to 16 bits per code). The trade-off between compression ratio and recall is controlled by the number of subspaces and the codebook size.
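A sketch of product quantization under the same NumPy/scikit-learn assumptions; the choice of 8 subspaces with 256 centroids each (8 bytes per 128-dimensional float32 vector, a 64x compression) is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

D, M, K = 128, 8, 256        # dimension, subspaces, codebook size per subspace
d_sub = D // M               # 16 dimensions per subspace

rng = np.random.default_rng(0)
vectors = rng.standard_normal((10_000, D)).astype(np.float32)

# Train one codebook per subspace; each vector is encoded as M one-byte codes.
codebooks = []
codes = np.empty((len(vectors), M), dtype=np.uint8)
for m in range(M):
    sub = vectors[:, m * d_sub:(m + 1) * d_sub]
    km = KMeans(n_clusters=K, random_state=0).fit(sub)
    codebooks.append(km.cluster_centers_.astype(np.float32))
    codes[:, m] = km.predict(sub)

# Asymmetric distance computation: precompute squared query-to-centroid
# distances per subspace, then sum M table lookups for each stored code.
query = rng.standard_normal(D).astype(np.float32)
tables = np.stack([
    np.linalg.norm(cb - query[m * d_sub:(m + 1) * d_sub], axis=1) ** 2
    for m, cb in enumerate(codebooks)
])                                            # (M, K) lookup tables
approx_dists = tables[np.arange(M), codes].sum(axis=1)
top10 = np.argsort(approx_dists)[:10]
```

Increasing M or K improves recall at the cost of larger codes and codebooks, which is the compression-versus-accuracy trade-off noted above.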