
    What is Dense Passage Retrieval

    Dense Passage Retrieval - Embedding-based passage retrieval for open-domain QA

    Dense Passage Retrieval (DPR) is a neural retrieval method that uses dual-encoder models to encode questions and text passages into dense vector representations. Unlike traditional sparse retrieval (BM25), DPR captures semantic meaning, allowing it to match questions with relevant passages even when there is no keyword overlap between them.

    How It Works

    DPR uses two separate BERT-based encoders: one for questions and one for passages. Each encoder maps its input to a single dense vector embedding. During indexing, every passage in the corpus is encoded and stored in a vector index. At query time, the question is encoded into a vector, and the most similar passage vectors are retrieved using approximate nearest neighbor search. Training uses in-batch negatives and hard negatives to teach the model to distinguish relevant passages from similar but irrelevant ones.
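
The retrieval mechanics above can be sketched in a few lines. The hash-based `embed` function below is a toy stand-in for the trained BERT encoders, and the linear scan stands in for ANN search; both are illustrative assumptions, not DPR's actual components:

```python
import math

def embed(text, dim=8):
    # Toy stand-in for a BERT encoder: hash each token into a fixed-size
    # vector and L2-normalize. A real DPR system would run the trained
    # question or passage encoder here.
    vec = [0.0] * dim
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Index time: encode every passage once and store the vectors.
passages = [
    "The Eiffel Tower is in Paris.",
    "Python is a programming language.",
    "The capital of France is Paris.",
]
index = [(p, embed(p)) for p in passages]

# Query time: encode the question, rank passages by inner product.
# (Production systems replace this linear scan with ANN search, e.g. FAISS.)
def retrieve(question, k=2):
    q = embed(question)
    ranked = sorted(index, key=lambda pe: dot(q, pe[1]), reverse=True)
    return [p for p, _ in ranked[:k]]
```

The key design point is that passages are encoded once, offline, while only the short question is encoded at query time, which is what makes DPR fast enough for large corpora.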

    Technical Details

    The original DPR model uses separate BERT-base encoders for questions and passages, each producing 768-dimensional CLS token embeddings. Training uses a contrastive negative log-likelihood (NLL) loss with in-batch negatives and BM25-mined hard negatives. The passage index is built with FAISS (typically IVF-PQ for large corpora). At query time, top-k passages are retrieved in milliseconds even over millions of passages. DPR was originally designed for open-domain question answering and showed significant improvements over BM25 on datasets like Natural Questions and TriviaQA.
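
The in-batch negatives objective can be written directly from a batch similarity matrix: each question's gold passage sits on the diagonal, and every other passage in the batch serves as a negative. A minimal sketch (the example scores are made up for illustration):

```python
import math

def in_batch_nll(sim):
    """Contrastive NLL with in-batch negatives.

    sim[i][j] is the similarity between question i and passage j; the
    gold passage for question i is on the diagonal, and the other
    passages in the batch act as negatives.
    """
    loss = 0.0
    for i, row in enumerate(sim):
        log_z = math.log(sum(math.exp(s) for s in row))
        loss += -(row[i] - log_z)  # -log softmax of the positive
    return loss / len(sim)

# A batch where each question scores its own passage highest yields a
# much lower loss than one where all passages look alike.
good = [[5.0, 0.0, 0.0],
        [0.0, 5.0, 0.0],
        [0.0, 0.0, 5.0]]
uniform = [[1.0, 1.0, 1.0]] * 3
```

In-batch negatives are attractive because they reuse the batch's own passage embeddings as negatives for free; hard negatives are then appended to each row to make the contrast more discriminative.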

    Best Practices

    • Use hard negative mining (BM25 negatives or mined from a previous DPR checkpoint) for significantly better training
    • Chunk documents into 100-word passages for optimal retrieval granularity
    • Pre-encode the entire passage corpus offline and update incrementally as new documents arrive
    • Use FAISS with IVF+PQ indexing for memory-efficient search over millions of passages
    • Evaluate with recall@k metrics to understand how many relevant passages appear in top results
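
Recall@k, mentioned in the last bullet, is straightforward to compute: for each query, check whether any relevant passage id appears in the top-k retrieved ids. A small sketch with hypothetical passage ids:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of queries whose top-k results contain a relevant passage.

    retrieved: per-query ranked lists of passage ids, best first.
    relevant:  per-query sets of gold passage ids.
    """
    hits = 0
    for ranked, gold in zip(retrieved, relevant):
        if any(pid in gold for pid in ranked[:k]):
            hits += 1
    return hits / len(retrieved)

# Two queries: the first finds its gold passage at rank 2, the second at rank 3.
retrieved = [["p1", "p7", "p3"], ["p4", "p9", "p2"]]
relevant  = [{"p7"}, {"p2"}]
```

Sweeping k (e.g. recall@5, @20, @100) shows how much headroom a downstream reranker or reader has to work with.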

    Common Pitfalls

    • Using only random negatives during training, which produces weak retrievers that fail on hard examples
    • Chunking documents too coarsely (full pages) or too finely (individual sentences), reducing retrieval quality
    • Not fine-tuning on domain-specific question-passage pairs, relying solely on general-domain pretraining
    • Ignoring the passage freshness problem where new content requires re-indexing to become searchable
    • Assuming DPR always outperforms BM25; in practice, BM25 remains stronger for keyword-heavy queries and rare terms

    Advanced Tips

    • Combine DPR with BM25 in a hybrid retriever and use a cross-encoder reranker for the strongest pipeline
    • Use knowledge distillation from cross-encoders to train better dual-encoder models
    • Implement iterative retrieval where the reader model's output informs a second retrieval pass
    • Explore multi-vector DPR variants (like ColBERT) for better token-level matching
    • Use contrastive pre-training (ICT, Condenser) before fine-tuning for improved passage representations
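
One simple way to build the hybrid retriever from the first tip is reciprocal rank fusion (RRF), which merges the DPR and BM25 ranked lists using only ranks, so no score normalization is needed. This is one common fusion choice, not the only one; the passage ids below are hypothetical:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists (e.g. DPR and BM25) with RRF.

    Each ranking is a list of passage ids, best first. Each list
    contributes 1/(k + rank) to a passage's score; k=60 is the
    conventional smoothing constant.
    """
    scores = {}
    for ranking in rankings:
        for rank, pid in enumerate(ranking, start=1):
            scores[pid] = scores.get(pid, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["p3", "p1", "p5"]   # DPR ranking
sparse = ["p1", "p2", "p3"]   # BM25 ranking
fused = reciprocal_rank_fusion([dense, sparse])
```

Here "p1" wins because it ranks near the top of both lists. The fused top-k would then typically be passed to a cross-encoder reranker for the final ordering.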