Dense Passage Retrieval - Embedding-based passage retrieval for open-domain QA
Dense Passage Retrieval (DPR) is a neural retrieval method that uses dual-encoder models to encode questions and text passages into dense vector representations. Unlike traditional sparse retrieval (BM25), DPR captures semantic meaning, allowing it to match questions with relevant passages even when there is no keyword overlap between them.
How It Works
DPR uses two separate BERT-based encoders: one for questions and one for passages. Each encoder maps its input to a single dense vector embedding. During indexing, every passage in the corpus is encoded and stored in a vector index. At query time, the question is encoded into a vector and the most similar passage vectors are retrieved via approximate nearest neighbor (ANN) search. Training uses in-batch negatives and hard negatives to teach the model to distinguish relevant passages from similar but irrelevant ones.
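The index-then-search flow above can be sketched in a few lines of NumPy. Random vectors stand in for the two trained BERT encoders, and a brute-force inner-product scan stands in for ANN search; this shows only the data flow, not real retrieval quality.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 768  # dimensionality of BERT-base CLS embeddings

# Stand-ins for encoder outputs: real DPR runs each passage through a
# trained passage encoder; random vectors here just illustrate the shapes.
passages = ["Paris is the capital of France.",
            "The Nile flows through Egypt.",
            "Python is a programming language."]
passage_index = rng.normal(size=(len(passages), DIM))  # offline indexing step

question_vec = rng.normal(size=DIM)                    # query-time encoding step

# Relevance is scored as the inner product between question and passage
# vectors; production systems replace this brute-force scan with ANN (FAISS).
scores = passage_index @ question_vec
top_k = np.argsort(-scores)[:2]
for rank, idx in enumerate(top_k, 1):
    print(rank, round(float(scores[idx]), 2), passages[idx])
```

The key design point is that passages and questions are encoded independently, so the expensive passage encoding happens once offline and only one encoder forward pass is needed per query.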
Technical Details
The original DPR model uses separate BERT-base encoders for questions and passages, each producing 768-dimensional CLS token embeddings. Training uses contrastive loss (NLL) with in-batch negatives and BM25-mined hard negatives. The passage index is built with FAISS (typically IVF-PQ for large corpora). At query time, top-k passages are retrieved in milliseconds even over millions of passages. DPR was originally designed for open-domain question answering and showed significant improvements over BM25 on datasets like Natural Questions and TriviaQA.
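The contrastive NLL objective with in-batch negatives can be sketched as a forward pass in NumPy (a real implementation would use PyTorch and backpropagate through both encoders; random embeddings here are placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
B, DIM = 4, 768  # batch size, embedding dimension

# One (question, positive passage) pair per batch row; every other row's
# passage acts as an in-batch negative for this question.
q = rng.normal(size=(B, DIM)) / np.sqrt(DIM)
p = rng.normal(size=(B, DIM)) / np.sqrt(DIM)

sim = q @ p.T                          # (B, B) similarity matrix
sim -= sim.max(axis=1, keepdims=True)  # numerical stability for softmax
log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))

# The positive for question i is passage i, i.e. the diagonal entry.
nll = -log_probs[np.arange(B), np.arange(B)].mean()
print(f"in-batch NLL: {nll:.3f}")
```

Hard negatives are incorporated by appending extra BM25-mined passage rows to `p`, which adds columns of difficult negatives to the same softmax.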
Best Practices
Use hard negative mining (BM25 negatives, or negatives mined from a previous DPR checkpoint) to train significantly stronger retrievers
Chunk documents into 100-word passages for optimal retrieval granularity
Pre-encode the entire passage corpus offline and update incrementally as new documents arrive
Use FAISS with IVF+PQ indexing for memory-efficient search over millions of passages
Evaluate with recall@k metrics to understand how many relevant passages appear in top results
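The recall@k metric from the last point can be computed as in this sketch, using toy retrieved lists and gold labels for illustration:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of queries whose top-k retrieved passage ids contain
    at least one gold-relevant passage id."""
    hits = sum(1 for ret, gold in zip(retrieved, relevant)
               if any(pid in gold for pid in ret[:k]))
    return hits / len(retrieved)

# Toy data: passage ids returned per query, and gold-relevant id sets.
retrieved = [[3, 7, 1, 9], [5, 2, 8, 4], [6, 0, 3, 2]]
relevant  = [{1}, {9}, {6}]

for k in (1, 2, 4):
    print(f"recall@{k} = {recall_at_k(retrieved, relevant, k):.2f}")
# -> recall@1 = 0.33, recall@2 = 0.33, recall@4 = 0.67
```

Plotting recall across several k values shows how much headroom a downstream reranker or reader has to work with.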
Common Pitfalls
Using only random negatives during training, which produces weak retrievers that fail on hard examples
Chunking documents too coarsely (full pages) or too finely (individual sentences), reducing retrieval quality
Not fine-tuning on domain-specific question-passage pairs, relying solely on general-domain pretraining
Ignoring the freshness problem: newly added content is not searchable until it has been encoded and added to the index
Assuming DPR always outperforms BM25 when BM25 is actually stronger for keyword-heavy queries
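The chunking-granularity pitfall above can be made concrete with a simple fixed-size word-window splitter (100 words, matching the original DPR setup; real pipelines often add overlap and respect sentence boundaries, which this sketch omits):

```python
def chunk_words(text, size=100, overlap=0):
    """Split text into fixed-size word windows, optionally overlapping."""
    words = text.split()
    step = max(1, size - overlap)
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

doc = "word " * 250  # a 250-word stand-in document
chunks = chunk_words(doc, size=100)
print([len(c.split()) for c in chunks])  # -> [100, 100, 50]
```

At this granularity each chunk is long enough to carry context for the encoder but short enough that a single relevant fact dominates its embedding.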
Advanced Tips
Combine DPR with BM25 in a hybrid retriever and use a cross-encoder reranker for the strongest pipeline
Use knowledge distillation from cross-encoders to train better dual-encoder models
Implement iterative retrieval where the reader model's output informs a second retrieval pass
Explore multi-vector DPR variants (like ColBERT) for better token-level matching
Use contrastive pre-training (ICT, Condenser) before fine-tuning for improved passage representations
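The hybrid retriever from the first tip is commonly implemented with reciprocal rank fusion (RRF), which merges the BM25 and DPR rankings without needing to calibrate their incompatible score scales. A minimal sketch with hypothetical document ids:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank).
    k=60 is the conventional smoothing constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, 1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top  = ["d3", "d1", "d7"]  # keyword-based ranking
dense_top = ["d1", "d9", "d3"]  # DPR ranking
fused = rrf_fuse([bm25_top, dense_top])
print(fused)  # -> ['d1', 'd3', 'd9', 'd7']
```

Documents appearing high in both lists ("d1", "d3") rise to the top; the fused list then typically feeds a cross-encoder reranker for the final ordering.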