The Failure Mode Dense Search Hides
Dense embedding search is so good at meaning that it is easy to forget what it throws away. Compress a passage into a single 1024-dimensional vector and you keep its gist; you lose the guarantee that a specific token survives. Ask a dense retriever for "error code E-4042 on the XR-9 valve" and it will happily return passages about valves, errors, and hardware faults -- semantically adjacent, lexically wrong. The one document that contains the literal string "E-4042" can rank below three documents that never mention it.
This is not a tuning problem. It is structural. Dense retrieval optimizes for semantic proximity, and rare tokens -- part numbers, error codes, person names, API method names, drug names, legal citations -- contribute almost no semantic signal to a dense vector. They are precisely the tokens an agent most needs to match verbatim. Sparse retrieval is the half of the system that protects them.
This guide is about that half: what sparse retrieval computes, how learned sparse models fix BM25's biggest weakness, and how to fuse a sparse ranking with a dense one so an agent gets both vocabularies at once.
Two Vocabularies, Not Two Algorithms
The cleanest mental model is that dense and sparse retrievers speak different vocabularies and retrieve complementary sets of documents.
Because their strengths are disjoint, fusing them beats either alone. Published hybrid systems routinely report recall gains in the 15-30 percent range over dense-only on mixed query workloads, and the gap is largest exactly on the rare-term queries dense search fumbles.
What BM25 Actually Computes
BM25 is the workhorse sparse method, and it is worth understanding rather than treating as a black box. It scores a document for a query by summing, over each query term, three intuitions:
1. Term frequency, saturated. A document mentioning a term more is more relevant, but with diminishing returns -- the tenth occurrence adds little over the third.
2. Inverse document frequency. A term that appears in few documents is more discriminating than a common one. "valve" is worth less than "E-4042."
3. Length normalization. A long document has more chances to contain a term by accident, so its term frequencies are discounted.
In compact form:
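score(q, d) = sum over terms t in q of
    IDF(t) * ( f(t,d) * (k1 + 1) )
           / ( f(t,d) + k1 * (1 - b + b * |d| / avgdl) )

where `f(t,d)` is the term's frequency in the document, `|d|` is the document's length in tokens, `avgdl` is the average document length in the corpus, `k1` controls how quickly term frequency saturates, and `b` controls the strength of length normalization.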
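To make the three intuitions concrete, here is a minimal sketch of BM25 scoring over a toy corpus. The whitespace tokenizer, the smoothed IDF variant, the `k1=1.2` and `b=0.75` defaults, and the three sample documents are all illustrative assumptions rather than anything this guide prescribes; production engines differ in these details.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (an assumption); it keeps codes like "e-4042" whole.
    return text.lower().split()


class BM25:
    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.tfs = [Counter(tokenize(d)) for d in docs]      # f(t, d) per document
        self.lengths = [sum(tf.values()) for tf in self.tfs]  # |d| per document
        self.avgdl = sum(self.lengths) / len(docs)
        self.n_docs = len(docs)
        self.df = Counter()                                   # documents containing t
        for tf in self.tfs:
            self.df.update(tf.keys())

    def idf(self, term):
        # Smoothed IDF (one common variant, assumed here): rarer terms weigh more.
        n_t = self.df.get(term, 0)
        return math.log(1 + (self.n_docs - n_t + 0.5) / (n_t + 0.5))

    def score(self, query, doc_index):
        tf, dl = self.tfs[doc_index], self.lengths[doc_index]
        length_norm = 1 - self.b + self.b * dl / self.avgdl
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f:
                total += self.idf(term) * f * (self.k1 + 1) / (f + self.k1 * length_norm)
        return total


docs = [
    "valve fault reported on the xr-9 valve assembly",
    "error code e-4042 on the xr-9 valve",
    "general hardware error troubleshooting guide",
]
bm25 = BM25(docs)
scores = [bm25.score("error code e-4042", i) for i in range(len(docs))]
best = max(range(len(docs)), key=scores.__getitem__)
```

In this toy corpus the document containing the literal code wins, and `bm25.idf("e-4042")` is larger than `bm25.idf("valve")` because the code appears in fewer documents.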
Learned Sparse Retrieval: BM25 With a Brain
Learned sparse models close the vocabulary gap without leaving the sparse index. The idea: instead of scoring only the terms that literally appear, use a language model to predict which vocabulary terms a passage is *about*, and assign each a learned weight. The output is still a sparse vector over the vocabulary -- so it still uses a fast inverted index -- but now it includes expansion terms the text never contained.
SPLADE (Sparse Lexical and Expansion model) is the canonical example. It runs the text through a BERT-style encoder, projects each position onto the vocabulary via the masked-language-model head, and pools the result into one weight per vocabulary term. A regularizer pushes most weights to zero, keeping the vector sparse. The effect:
# Conceptual shape of a SPLADE document vector (term -> weight);
# most of the ~30k-term vocabulary is implicitly zero.
doc_vec = {
    "notebook": 2.7, "computer": 2.3,
    "laptop": 1.9,    # expansion: never in the text
    "portable": 1.1,  # expansion
    "pc": 0.8,        # expansion
}
# Scoring against a query vector is a sparse dot product:
# score = sum(q_vec[t] * doc_vec[t] for t in shared terms)
Learned sparse models like SPLADE, uniCOIL, and DeepImpact sit between BM25 and dense retrieval: they recover much of dense retrieval's recall on paraphrases while keeping sparse retrieval's exact-term precision and inverted-index efficiency. The cost is a model forward pass at index and query time, and larger postings lists than raw BM25 because of the expansion terms.
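The sparse dot product and the inverted index it relies on can be sketched in a few lines. The vectors below are hand-written stand-ins for what a learned sparse model would emit; the document ids, terms, and weights are hypothetical, and a real index would add compression and pruning that this sketch omits.

```python
from collections import defaultdict

# Hypothetical learned-sparse document vectors (term -> weight).
doc_vecs = {
    "doc_notebook": {"notebook": 2.7, "computer": 2.3, "laptop": 1.9,
                     "portable": 1.1, "pc": 0.8},
    "doc_valve": {"valve": 2.5, "pressure": 1.4, "hardware": 0.9},
}


def build_inverted_index(doc_vecs):
    # term -> postings list of (doc_id, weight)
    index = defaultdict(list)
    for doc_id, vec in doc_vecs.items():
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index


def search(index, q_vec, top_k=10):
    # Only postings for the query's non-zero terms are visited,
    # which is what keeps sparse scoring cheap.
    scores = defaultdict(float)
    for term, q_weight in q_vec.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


index = build_inverted_index(doc_vecs)
q_vec = {"laptop": 1.5, "cheap": 0.7}  # hypothetical query vector
results = search(index, q_vec)
```

Here the query term "laptop" matches `doc_notebook` only through an expansion term, which is the behavior plain BM25 cannot provide.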
Sparsity Beyond Text: Why Agents Care
This is not a text-only trick. The moment an agent searches unstructured multimodal content, exact-term matching reappears in new forms:
The general principle: any modality with a discrete, nameable layer (text, labels, codes) benefits from a sparse index, and fusing it with the dense semantic index is what gives an agent both "what does this mean" and "did this exact thing appear."
Fusing the Two Rankings
You have a dense top-k and a sparse top-k. They are scored on incompatible scales -- a cosine of 0.31 and a BM25 of 17 cannot be added. There are three standard ways to combine them.
Reciprocal Rank Fusion (RRF). Throw away the raw scores and combine by rank position. For each document, sum 1/(K + rank) across the lists it appears in (K is a constant, typically 60).
def rrf(rank_lists, K=60):
    scores = {}
    for ranking in rank_lists:  # each is an ordered list of doc ids
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (K + rank)
    return sorted(scores, key=scores.get, reverse=True)
RRF is unsupervised, score-scale-agnostic, and robust -- it is the right default when you have no labeled queries. Its limitation is that it ignores *how much* better rank 1 was than rank 2; a runaway-confident match and a barely-ahead match contribute the same.
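A small worked example makes both the behavior and the limitation visible. The helper below restates the fusion loop so the snippet is self-contained, and returns the fused scores rather than only the ordering so the arithmetic can be inspected; the document ids and rankings are hypothetical.

```python
def rrf_scores(rank_lists, K=60):
    # Same rank-based sum as RRF, but returning the raw fused scores.
    scores = {}
    for ranking in rank_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (K + rank)
    return scores


dense_ranking = ["doc_a", "doc_b", "doc_c"]   # hypothetical dense top-3
sparse_ranking = ["doc_c", "doc_a", "doc_d"]  # hypothetical sparse top-3

fused = rrf_scores([dense_ranking, sparse_ranking])
order = sorted(fused, key=fused.get, reverse=True)

# doc_a: 1/61 + 1/62, doc_c: 1/63 + 1/61, doc_b: 1/62, doc_d: 1/63
```

Documents that appear in both lists (`doc_a`, `doc_c`) outrank those that appear in only one. The gap between rank 1 and rank 2 within a list is always 1/61 - 1/62, however far apart the underlying scores were.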
Relative Score Fusion (RSF). Min-max normalize each list's scores into [0, 1], then take a weighted sum: `final = w * dense_norm + (1 - w) * sparse_norm`. This keeps score magnitude information RRF discards, at the cost of needing a sane normalization and a tuned weight `w`. It tends to win when one method is genuinely more confident on a given query and you have validation data to set `w`.
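A minimal sketch of this weighted sum follows. It makes two assumptions: a document missing from one list contributes 0 for that list, and a list whose scores are all equal is normalized to 1.0. Both are judgment calls rather than part of any standard definition. The scores themselves are made up to mirror the cosine-versus-BM25 scale mismatch.

```python
def min_max(scores):
    # Map a {doc_id: score} dict into [0, 1].
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Assumption: a degenerate list is treated as all-tied at the top.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def rsf(dense_scores, sparse_scores, w=0.5):
    dense_norm, sparse_norm = min_max(dense_scores), min_max(sparse_scores)
    fused = {}
    for doc_id in set(dense_norm) | set(sparse_norm):
        # Assumption: a document absent from a list scores 0.0 there.
        fused[doc_id] = (w * dense_norm.get(doc_id, 0.0)
                         + (1 - w) * sparse_norm.get(doc_id, 0.0))
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


dense_scores = {"doc_a": 0.31, "doc_b": 0.29, "doc_c": 0.12}  # cosine-like
sparse_scores = {"doc_c": 17.0, "doc_a": 6.0, "doc_d": 2.0}   # BM25-like
ranked = rsf(dense_scores, sparse_scores, w=0.5)
```

Unlike rank-based fusion, the large sparse margin of `doc_c` (17.0 versus 6.0) survives normalization and lifts it above `doc_b`, which was a close second in the dense list only.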
Learned / tensor fusion. Train a small model on the per-system scores, or rerank the fused candidate pool with a cross-encoder or a late-interaction (MaxSim) scorer. Highest ceiling, needs relevance labels, and is best applied only to the small fused candidate set rather than the whole corpus. (See Late Interaction Retrieval for the MaxSim mechanism.)
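As a rough illustration of reranking only the fused candidate pool, the sketch below scores candidates with MaxSim over token vectors. The 2-dimensional vectors are toy values chosen for readability; in practice they would come from a late-interaction encoder, and the function names here are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_vecs, doc_vecs):
    # For each query token, take its best-matching document token, then sum.
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rerank(query_vecs, fused_ids, doc_token_vecs, top_n=100):
    # Rescore only the head of the fused list; the tail keeps its fused order.
    pool = fused_ids[:top_n]
    rescored = sorted(pool,
                      key=lambda d: maxsim(query_vecs, doc_token_vecs[d]),
                      reverse=True)
    return rescored + fused_ids[top_n:]


query_vecs = [[1.0, 0.0], [0.0, 1.0]]  # two toy query token vectors
doc_token_vecs = {
    "doc_a": [[0.9, 0.1], [0.2, 0.3]],
    "doc_b": [[0.8, 0.0], [0.0, 0.9]],
    "doc_c": [[0.1, 0.1]],
}
fused_ids = ["doc_a", "doc_b", "doc_c"]  # hypothetical fused ranking
reranked = rerank(query_vecs, fused_ids, doc_token_vecs, top_n=2)
```

With `top_n=2`, only `doc_a` and `doc_b` are rescored; `doc_b` moves to the front because it covers both query tokens well, while `doc_c` stays where fusion left it.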
A practical default for an agent stack: RRF to merge dense and sparse candidates cheaply, then a reranker on the top ~100 if eval shows precision leaking at the top.
A Decision Guide
Evaluation: Prove the Halves Are Complementary
Do not assume hybrid helps -- measure it, broken out by query class. Run three systems on the same labeled set: dense-only, sparse-only, and fused.
The one number that ties it together is recall on the rare-term query class. If fusion does not move it, you have paid for a sparse index that is not earning its keep -- which usually means BM25 where you needed learned sparse, or a fusion weight that is drowning the sparse signal.
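One way to produce that breakdown is recall@k averaged per query class for each system. The sketch below assumes a labeled set where every query carries a class tag and a list of relevant document ids; the field names, class names, and toy runs are hypothetical.

```python
from collections import defaultdict


def recall_at_k(retrieved, relevant, k):
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def recall_by_class(queries, systems, k=10):
    # queries: [{"id", "class", "relevant"}], systems: name -> {query_id: ranked ids}
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for q in queries:
        counts[q["class"]] += 1
        for name, runs in systems.items():
            sums[q["class"]][name] += recall_at_k(runs.get(q["id"], []),
                                                  q["relevant"], k)
    return {cls: {name: total / counts[cls] for name, total in by_sys.items()}
            for cls, by_sys in sums.items()}


queries = [
    {"id": "q1", "class": "rare_term", "relevant": ["d1"]},
    {"id": "q2", "class": "paraphrase", "relevant": ["d2"]},
]
systems = {
    "dense": {"q1": ["d3", "d4"], "q2": ["d2", "d5"]},
    "sparse": {"q1": ["d1", "d3"], "q2": ["d6", "d7"]},
    "fused": {"q1": ["d1", "d3"], "q2": ["d2", "d6"]},
}
report = recall_by_class(queries, systems, k=2)
```

In this toy run the fused system matches the better half on each class, which is the pattern to look for on real data; if the fused rare-term number tracks dense-only instead, the sparse signal is being lost somewhere.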
Doing This in Mixpeek
In Managed Mixpeek, the sparse half is a retriever configuration, not a second system to operate. A retriever can run dense vector search and BM25 lexical search over the same multimodal records -- including the text extracted from transcripts and OCR -- and fuse them with RRF in one call.
from mixpeek import Mixpeek

client = Mixpeek(api_key="mxp_sk_...")

retriever = client.retrievers.create(
    namespace="ad_creatives",
    stages=[
        # dense semantic recall over the visual + transcript embeddings
        {"stage_type": "vector_search", "field": "embedding", "top_k": 200},
        # exact-term recall over the transcript / OCR text (BM25)
        {"stage_type": "lexical_search", "field": "transcript", "top_k": 200},
        # merge the two rankings without comparing raw scores
        {"stage_type": "rank_fusion", "method": "rrf"},
    ],
)

results = client.retrievers.execute(
    retriever_id=retriever.retriever_id,
    inputs={"text": "ad that says use code SAVE20 over a kitchen demo"},
    top_k=20,
)
The query "use code SAVE20 over a kitchen demo" needs both halves: the kitchen-demo clause is a dense semantic match, and "SAVE20" is an exact term that only the lexical stage can guarantee. If you run your own learned sparse model (SPLADE or similar) and already produce sparse vectors, bring them to MVS, the Mixpeek Vector Store, and run dense, sparse, and BM25 search side by side on object storage. Either way, the agent calls one MCP tool and gets a single fused ranking that respects both vocabularies.