Vector Quantization - Compressing vectors by mapping to codebook entries
A compression technique that maps high-dimensional vectors to a finite set of representative codewords, reducing storage and speeding up similarity search. Essential for scaling multimodal vector databases to billions of embeddings.
How It Works
Vector quantization partitions the embedding space into regions, each represented by a centroid (codeword). During encoding, each vector is assigned to its nearest centroid, and only the centroid index is stored instead of the full vector. At query time, distances are computed between the query and the centroids rather than all original vectors, dramatically reducing computation and memory.
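As a concrete illustration, here is a minimal sketch of that encode-and-search cycle in Python, assuming NumPy and scikit-learn are available; the function names (build_codebook, encode, search) are illustrative rather than a library API, and the brute-force distance computation is for clarity, not scale.

    import numpy as np
    from sklearn.cluster import KMeans

    def build_codebook(vectors, k=256):
        # Learn k centroids (codewords) from a training sample of vectors.
        km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(vectors)
        return km.cluster_centers_

    def encode(vectors, codebook):
        # Assign each vector to its nearest centroid; store only the index.
        dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=-1)
        return dists.argmin(axis=1).astype(np.uint8)  # 1 byte per vector when k <= 256

    def search(query, codes, codebook, top_k=10):
        # Compare the query against the k centroids once, then look up a
        # per-vector approximate distance via each stored centroid index.
        centroid_dists = np.linalg.norm(codebook - query, axis=1)
        approx = centroid_dists[codes]
        return np.argsort(approx)[:top_k]

The key point is that search touches only k centroids plus an index lookup per stored vector, instead of recomputing a full d-dimensional distance per vector.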
Technical Details
Common approaches include scalar quantization, which quantizes each dimension independently; k-means vector quantization with a single flat codebook; product quantization (PQ), which splits vectors into subspaces and quantizes each independently; and residual quantization, which iteratively encodes the error left by earlier stages. Codebook sizes typically range from 256 to 65,536 entries. The trade-off between compression ratio and recall is controlled primarily by the number of subspaces and the codebook size.
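A hedged sketch of product quantization under the same assumptions (NumPy plus scikit-learn; train_pq and pq_encode are illustrative names): each vector is split into m equal chunks, and each chunk is quantized against its own codebook.

    import numpy as np
    from sklearn.cluster import KMeans

    def train_pq(vectors, m=8, k=256):
        # One codebook per subspace; assumes the dimension d is divisible by m.
        subvectors = np.split(vectors, m, axis=1)
        return [KMeans(n_clusters=k, n_init=4, random_state=0)
                .fit(sv).cluster_centers_ for sv in subvectors]

    def pq_encode(vectors, codebooks):
        # Each vector compresses to m uint8 codes (m bytes instead of 4*d bytes).
        codes = []
        for sv, cb in zip(np.split(vectors, len(codebooks), axis=1), codebooks):
            d = np.linalg.norm(sv[:, None, :] - cb[None, :, :], axis=-1)
            codes.append(d.argmin(axis=1))
        return np.stack(codes, axis=1).astype(np.uint8)

With m = 8 subspaces and k = 256 codewords per subspace, a 768-dimensional float32 vector shrinks from 3,072 bytes to 8 bytes, a 384x compression; increasing m improves recall at the cost of more storage per vector.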
Best Practices
Train codebooks on a representative sample of your actual data distribution
Use product quantization for high-dimensional vectors to retain accuracy at high compression ratios
Benchmark recall at different compression levels to find the right trade-off for your use case
Re-rank top candidates using the original vectors after the initial quantized search (sketched below)
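The re-ranking pattern from the last item above can be sketched as a two-stage search; approx_dists is assumed to come from whatever quantized scorer is in use (for example, the PQ codes sketched earlier), and the names and shortlist size are illustrative.

    import numpy as np

    def search_with_rerank(query, approx_dists, original_vectors, top_k=10, shortlist=100):
        # Stage 1: take a generous shortlist by approximate (quantized) distance.
        candidates = np.argsort(approx_dists)[:shortlist]
        # Stage 2: re-score only the shortlist with exact distances on full vectors.
        exact = np.linalg.norm(original_vectors[candidates] - query, axis=1)
        return candidates[np.argsort(exact)[:top_k]]

This keeps the expensive exact computation limited to a small candidate set, recovering most of the recall lost to compression.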
Common Pitfalls
Training codebooks on data that does not represent the production distribution
Over-compressing vectors, leading to unacceptable recall degradation
Not re-training codebooks when the data distribution shifts significantly
Applying uniform quantization to vectors that have non-uniform value distributions
Advanced Tips
Combine vector quantization with graph-based indices (HNSW) for fast approximate search
Use optimized product quantization (OPQ) to rotate vectors before quantization for better accuracy
Implement asymmetric distance computation, where queries remain unquantized, for higher precision (see the sketch after this list)
Consider learned quantization methods that train end-to-end with the retrieval objective
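As one concrete example of the asymmetric pattern, the following sketch assumes PQ codebooks and codes shaped like those from the earlier pq_encode sketch; the query is never quantized, and per-subspace lookup tables keep the cost at one table build plus one table lookup per stored code.

    import numpy as np

    def adc_distances(query, codes, codebooks):
        # Asymmetric distance computation: the query stays unquantized.
        # Precompute, per subspace, the squared distance from the query chunk
        # to every codeword, then sum one table entry per stored code.
        m = len(codebooks)
        query_chunks = np.split(query, m)
        tables = [np.sum((cb - qc) ** 2, axis=1)  # shape (k,) per subspace
                  for cb, qc in zip(codebooks, query_chunks)]
        # codes has shape (N, m); gather and sum across the m subspaces.
        return sum(tables[j][codes[:, j]] for j in range(m))

Because the query side retains full precision, ADC typically yields noticeably better recall than comparing two quantized vectors, at essentially the same per-vector cost.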