Vector Quantization - Compressing vectors by mapping to codebook entries
A compression technique that maps high-dimensional vectors to a finite set of representative codewords, reducing storage and speeding up similarity search. Essential for scaling multimodal vector databases to billions of embeddings.
How It Works
Vector quantization partitions the embedding space into regions, each represented by a centroid (codeword). During encoding, each vector is assigned to its nearest centroid, and only the centroid index is stored instead of the full vector. At query time, distances are computed between the query and the centroids rather than all original vectors, dramatically reducing computation and memory.
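As a concrete illustration, here is a minimal sketch of that encode-and-search cycle in Python, assuming NumPy and scikit-learn are available; the function names (build_codebook, encode, search) are illustrative rather than a library API, and the brute-force distance computation is for clarity, not scale.

    import numpy as np
    from sklearn.cluster import KMeans

    def build_codebook(vectors, k=256):
        # Learn k centroids (codewords) from a training sample of vectors.
        km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(vectors)
        return km.cluster_centers_

    def encode(vectors, codebook):
        # Assign each vector to its nearest centroid; store only the index.
        dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=-1)
        return dists.argmin(axis=1).astype(np.uint8)  # 1 byte per vector when k <= 256

    def search(query, codes, codebook, top_k=10):
        # Compare the query against the k centroids once, then look up a
        # per-vector approximate distance via each stored centroid index.
        centroid_dists = np.linalg.norm(codebook - query, axis=1)
        approx = centroid_dists[codes]
        return np.argsort(approx)[:top_k]

The key point is that search touches only k centroids plus an index lookup per stored vector, instead of recomputing a full d-dimensional distance per vector.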
Technical Details
Common approaches include scalar quantization, which quantizes each dimension independently; k-means vector quantization with a single flat codebook; product quantization (PQ), which splits vectors into subspaces and quantizes each independently; and residual quantization, which iteratively encodes the error left by earlier stages. Codebook sizes typically range from 256 to 65,536 entries. The trade-off between compression ratio and recall is controlled primarily by the number of subspaces and the codebook size.
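A hedged sketch of product quantization under the same assumptions (NumPy plus scikit-learn; train_pq and pq_encode are illustrative names): each vector is split into m equal chunks, and each chunk is quantized against its own codebook.

    import numpy as np
    from sklearn.cluster import KMeans

    def train_pq(vectors, m=8, k=256):
        # One codebook per subspace; assumes the dimension d is divisible by m.
        subvectors = np.split(vectors, m, axis=1)
        return [KMeans(n_clusters=k, n_init=4, random_state=0)
                .fit(sv).cluster_centers_ for sv in subvectors]

    def pq_encode(vectors, codebooks):
        # Each vector compresses to m uint8 codes (m bytes instead of 4*d bytes).
        codes = []
        for sv, cb in zip(np.split(vectors, len(codebooks), axis=1), codebooks):
            d = np.linalg.norm(sv[:, None, :] - cb[None, :, :], axis=-1)
            codes.append(d.argmin(axis=1))
        return np.stack(codes, axis=1).astype(np.uint8)

With m = 8 subspaces and k = 256 codewords per subspace, a 768-dimensional float32 vector shrinks from 3,072 bytes to 8 bytes, a 384x compression; increasing m improves recall at the cost of more storage per vector.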
Best Practices
Train codebooks on a representative sample of your actual data distribution
Use product quantization for high-dimensional vectors to retain accuracy at high compression ratios
Benchmark recall at different compression levels to find the right trade-off for your use case
Re-rank top candidates using the original vectors after the initial quantized search (sketched below)
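The re-ranking pattern from the last item above can be sketched as a two-stage search; approx_dists is assumed to come from whatever quantized scorer is in use (for example, the PQ codes sketched earlier), and the names and shortlist size are illustrative.

    import numpy as np

    def search_with_rerank(query, approx_dists, original_vectors, top_k=10, shortlist=100):
        # Stage 1: take a generous shortlist by approximate (quantized) distance.
        candidates = np.argsort(approx_dists)[:shortlist]
        # Stage 2: re-score only the shortlist with exact distances on full vectors.
        exact = np.linalg.norm(original_vectors[candidates] - query, axis=1)
        return candidates[np.argsort(exact)[:top_k]]

This keeps the expensive exact computation limited to a small candidate set, recovering most of the recall lost to compression.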
Common Pitfalls
Training codebooks on data that does not represent the production distribution
Over-compressing vectors, leading to unacceptable recall degradation
Not re-training codebooks when the data distribution shifts significantly
Applying uniform quantization to vectors that have non-uniform value distributions
Advanced Tips
Combine vector quantization with graph-based indices (HNSW) for fast approximate search
Use optimized product quantization (OPQ) to rotate vectors before quantization for better accuracy
Implement asymmetric distance computation, where queries remain unquantized, for higher precision (see the sketch after this list)
Consider learned quantization methods that train end-to-end with the retrieval objective
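As one concrete example of the asymmetric pattern, the following sketch assumes PQ codebooks and codes shaped like those from the earlier pq_encode sketch; the query is never quantized, and per-subspace lookup tables keep the cost at one table build plus one table lookup per stored code.

    import numpy as np

    def adc_distances(query, codes, codebooks):
        # Asymmetric distance computation: the query stays unquantized.
        # Precompute, per subspace, the squared distance from the query chunk
        # to every codeword, then sum one table entry per stored code.
        m = len(codebooks)
        query_chunks = np.split(query, m)
        tables = [np.sum((cb - qc) ** 2, axis=1)  # shape (k,) per subspace
                  for cb, qc in zip(codebooks, query_chunks)]
        # codes has shape (N, m); gather and sum across the m subspaces.
        return sum(tables[j][codes[:, j]] for j in range(m))

Because the query side retains full precision, ADC typically yields noticeably better recall than comparing two quantized vectors, at essentially the same per-vector cost.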