Learned Sparse Retrieval and Dense-Sparse Hybrid: Why Agents Need Both Vocabularies

The Failure Mode Dense Search Hides

Dense embedding search is so good at meaning that it is easy to forget what it throws away. Compress a passage into a single 1024-dimensional vector and you keep its gist; you lose the guarantee that a specific token survives. Ask a dense retriever for "error code E-4042 on the XR-9 valve" and it will happily return passages about valves, errors, and hardware faults -- semantically adjacent, lexically wrong. The one document that contains the literal string "E-4042" can rank below three documents that never mention it.

This is not a tuning problem. It is structural. Dense retrieval optimizes for semantic proximity, and rare tokens -- part numbers, error codes, person names, API method names, drug names, legal citations -- carry almost no semantic signal to spread across a dense vector. They are exactly the tokens an agent most needs to match exactly. Sparse retrieval is the half of the system that protects them.

This guide is about that half: what sparse retrieval computes, how learned sparse models fix BM25's biggest weakness, and how to fuse a sparse ranking with a dense one so an agent gets both vocabularies at once.

Two Vocabularies, Not Two Algorithms

The cleanest mental model is that dense and sparse retrievers speak different vocabularies and retrieve complementary sets of documents.

Dense retrieval maps text into a low-dimensional continuous space. Two passages are close if their meanings are close, even with zero shared words. Strong on paraphrase, synonymy, and cross-lingual matching. Weak on rare exact terms.

Sparse retrieval maps text into a very high-dimensional vector over the vocabulary -- one dimension per term, mostly zeros. Two passages match when they share weighted terms. Strong on exact tokens, entities, and phrases. Weak when the query and document use different words for the same idea.

Because their strengths are largely complementary, fusing them usually beats either alone. Published hybrid systems routinely report recall gains in the 15-30 percent range over dense-only on mixed query workloads, and the gap is largest exactly on the rare-term queries dense search fumbles.

What BM25 Actually Computes

BM25 is the workhorse sparse method, and it is worth understanding rather than treating as a black box. It scores a document for a query by summing, over each query term, three intuitions:

1. Term frequency, saturated. A document mentioning a term more is more relevant, but with diminishing returns -- the tenth occurrence adds little over the third.
2. Inverse document frequency. A term that appears in few documents is more discriminating than a common one. "valve" is worth less than "E-4042."
3. Length normalization. A long document has more chances to contain a term by accident, so its term frequencies are discounted.

In compact form:

score(q, d) = sum over terms t in q of
    IDF(t) * ( f(t,d) * (k1 + 1) )
            / ( f(t,d) + k1 * (1 - b + b * |d| / avgdl) )

where f(t,d) is the term's frequency in the document, |d| is document length, avgdl is the average length, and k1 (around 1.2-2.0) and b (around 0.75) control saturation and length normalization. The takeaway is not the formula; it is that BM25 needs the literal term to appear. Its strength is precision on exact matches; its weakness is the vocabulary gap -- a query for "laptop" will not match a document that only says "notebook computer."
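For readers who prefer code to notation, here is a minimal sketch of the formula above over pre-tokenized documents. The article does not pin down the IDF term, so this uses the smoothed, always-positive variant popularized by Lucene as an assumption; the function name bm25_scores and the toy corpus are illustrative only.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue  # no literal occurrence, no contribution
            # Assumed IDF variant (Lucene-style smoothing, never negative).
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "error code e-4042 on the xr-9 valve".split(),
    "valve errors and hardware faults in general".split(),
    "notebook computer battery replacement guide".split(),
]
valve_scores = bm25_scores(["e-4042", "valve"], docs)
laptop_scores = bm25_scores(["laptop"], docs)
```

In this toy corpus the first document outranks the second because it contains the rare term as well as the common one, and the "laptop" query scores zero everywhere, including against the "notebook computer" document. That is the vocabulary gap in miniature.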

Learned Sparse Retrieval: BM25 With a Brain

Learned sparse models close the vocabulary gap without leaving the sparse index. The idea: instead of scoring only the terms that literally appear, use a language model to predict which vocabulary terms a passage is *about*, and assign each a learned weight. The output is still a sparse vector over the vocabulary -- so it still uses a fast inverted index -- but now it includes expansion terms the text never contained.

SPLADE (Sparse Lexical and Expansion model) is the canonical example. It runs the text through a BERT-style encoder, projects each position onto the vocabulary via the masked-language-model head, and pools the result into one weight per vocabulary term. A regularizer pushes most weights to zero, keeping the vector sparse. The effect:

A passage about "notebook computer" gets nonzero weight on "laptop," "PC," and "portable" -- so the dense-style synonym match now happens inside the sparse index.

Weights are learned from relevance data, not hand-set like IDF, so the model learns which expansions actually help retrieval.

Because the representation is still sparse and term-aligned, it stays interpretable: you can read off exactly which terms drove a match.

# Conceptual shape of a SPLADE document vector (term -> weight),
# most of the ~30k-term vocabulary is implicitly zero.
doc_vec = {
    "notebook": 2.7, "computer": 2.3,
    "laptop": 1.9,   # expansion: never in the text
    "portable": 1.1, # expansion
    "pc": 0.8,       # expansion
}
# Scoring against a query vector is a sparse dot product:
#   score = sum(q_vec[t] * doc_vec[t] for t in shared terms)

Learned sparse models like SPLADE, uniCOIL, and DeepImpact sit between BM25 and dense retrieval: they recover much of dense retrieval's recall on paraphrases while keeping sparse retrieval's exact-term precision and inverted-index efficiency. The cost is a model forward pass at index and query time, and larger postings lists than raw BM25 because of the expansion terms.
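To make the inverted-index claim concrete, the following is a rough sketch of how term-weight vectors like the one above could be indexed and searched. The names build_inverted_index and sparse_search, the document ids, and the weights are all made up for illustration; a production engine would add compressed postings and pruning on top of this.

```python
from collections import defaultdict

def build_inverted_index(doc_vectors):
    """Map each term to a postings list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index

def sparse_search(query_vec, index, top_k=10):
    """Sparse dot product: only postings for the query's terms are touched."""
    scores = defaultdict(float)
    for term, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Hypothetical learned-sparse document vectors (term -> weight).
doc_vectors = {
    "doc_notebook": {"notebook": 2.7, "computer": 2.3,
                     "laptop": 1.9, "portable": 1.1, "pc": 0.8},
    "doc_valve": {"valve": 2.4, "e-4042": 3.1, "error": 1.2},
}
index = build_inverted_index(doc_vectors)
hits = sparse_search({"laptop": 2.0, "battery": 1.0}, index)
```

The "laptop" query reaches doc_notebook only through an expansion term, and every extra expansion term is one more posting to store and scan, which is where the larger postings lists come from.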

Sparsity Beyond Text: Why Agents Care

This is not a text-only trick. The moment an agent searches unstructured multimodal content, exact-term matching reappears in new forms:

Transcripts and captions. ASR output and OCR text carry product names, prices, promo codes, and spoken claims an agent must match verbatim. A dense audio or video embedding will not reliably surface "use code SAVE20"; a sparse index over the transcript will.

Extracted metadata. Detected object labels, tags, and structured fields from a perception pipeline form a natural sparse vocabulary.

Hybrid multimodal records. A video segment can have a dense visual vector, a dense audio vector, and a sparse text vector over its transcript. Fusing all three is the same machinery as text hybrid search, with one more ranked list in the merge.

The general principle: any modality with a discrete, nameable layer (text, labels, codes) benefits from a sparse index, and fusing it with the dense semantic index is what gives an agent both "what does this mean" and "did this exact thing appear."

Fusing the Two Rankings

You have a dense top-k and a sparse top-k. They are scored on incompatible scales -- a cosine of 0.31 and a BM25 of 17 cannot be added. There are three standard ways to combine them.

Reciprocal Rank Fusion (RRF). Throw away the raw scores and combine by rank position. For each document, sum 1/(K + rank) across the lists it appears in (K is a constant, typically 60).

def rrf(rank_lists, K=60):
    scores = {}
    for ranking in rank_lists:            # each is an ordered list of doc ids
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (K + rank)
    return sorted(scores, key=scores.get, reverse=True)

RRF is unsupervised, score-scale-agnostic, and robust -- it is the right default when you have no labeled queries. Its limitation is that it ignores *how much* better rank 1 was than rank 2; a runaway-confident match and a barely-ahead match contribute the same.

Relative Score Fusion (RSF). Min-max normalize each list's scores into [0, 1], then take a weighted sum: final = w * dense_norm + (1 - w) * sparse_norm. This keeps score magnitude information RRF discards, at the cost of needing a sane normalization and a tuned weight w. It tends to win when one method is genuinely more confident on a given query and you have validation data to set w.
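A minimal sketch of RSF as described, assuming each retriever returns a dict of doc id to raw score. Two choices here are assumptions rather than part of any standard: a document missing from one list contributes 0 for that list, and a list whose scores are all equal normalizes to 1.0.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}  # assumption: degenerate list gets full credit
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def rsf(dense_scores, sparse_scores, w=0.5):
    """final = w * dense_norm + (1 - w) * sparse_norm, missing docs count as 0."""
    dense_norm = min_max(dense_scores)
    sparse_norm = min_max(sparse_scores)
    fused = {
        d: w * dense_norm.get(d, 0.0) + (1 - w) * sparse_norm.get(d, 0.0)
        for d in set(dense_norm) | set(sparse_norm)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

dense = {"a": 0.31, "b": 0.28, "c": 0.10}   # cosine-like scores
sparse = {"b": 17.0, "d": 9.0, "a": 5.0}    # BM25-like scores
fused = rsf(dense, sparse, w=0.5)
```

With equal weights, document "b" comes out on top because it is near the top of both lists, while "a" is first in dense but last in sparse. Min-max is sensitive to outliers in a single result list, which is one reason w needs validation data.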

Learned / tensor fusion. Train a small model on the per-system scores, or rerank the fused candidate pool with a cross-encoder or a late-interaction (MaxSim) scorer. Highest ceiling, needs relevance labels, and is best applied only to the small fused candidate set rather than the whole corpus. (See Late Interaction Retrieval for the MaxSim mechanism.)

A practical default for an agent stack: RRF to merge dense and sparse candidates cheaply, then a reranker on the top ~100 if eval shows precision leaking at the top.
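That default can be sketched as a two-step function. The rrf helper repeats the one shown earlier so the block runs on its own; retrieve_then_rerank, score_fn, and the term-overlap scorer are hypothetical stand-ins. In practice score_fn would be a cross-encoder or late-interaction model, not word overlap.

```python
def rrf(rank_lists, K=60):
    scores = {}
    for ranking in rank_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (K + rank)
    return sorted(scores, key=scores.get, reverse=True)

def retrieve_then_rerank(query, dense_ranking, sparse_ranking,
                         score_fn, rerank_top_n=100):
    """Fuse cheaply with RRF, then rescore only the head of the fused list."""
    fused = rrf([dense_ranking, sparse_ranking])
    head, tail = fused[:rerank_top_n], fused[rerank_top_n:]
    head = sorted(head, key=lambda doc_id: score_fn(query, doc_id), reverse=True)
    return head + tail

# Toy corpus and a placeholder scorer (word overlap), for illustration only.
texts = {
    "a": "kitchen demo with promo code save20",
    "b": "kitchen appliance review",
    "c": "save20 banner ad",
}

def overlap_score(query, doc_id):
    return len(set(query.split()) & set(texts[doc_id].split()))

result = retrieve_then_rerank(
    "use code save20 kitchen demo",
    dense_ranking=["b", "a"],
    sparse_ranking=["c", "a"],
    score_fn=overlap_score,
    rerank_top_n=2,
)
```

Only the first rerank_top_n fused candidates pay the reranker's cost; everything below keeps its RRF order.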

A Decision Guide

Pure semantic, paraphrase-heavy, few rare terms (broad "find similar" queries): dense alone may suffice. Add sparse the moment exact-term recall matters.

Entity, code, name, or phrase queries (support, legal, compliance, catalog lookups): never go dense-only. BM25 at minimum; learned sparse if the vocabulary gap also bites.

Mixed agent workloads (the common case): dense + learned-sparse, fused with RRF. This is the configuration that survives the widest query distribution, because agents issue both kinds of query, often in the same session.

Exact-ID lookups: detect them and route to sparse/filter only -- running dense search on "order #99213" wastes latency and adds noise. (See Query Transformation Pipelines for routing.)
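One way to approximate the exact-ID detection step is a small heuristic router. The regular expression, the three-token threshold, and the route labels below are assumptions chosen for illustration and would need tuning against real query logs; a bare four-digit year, for example, would be misread as an identifier.

```python
import re

# Hypothetical identifier shapes: "#99213", "99213", "E-4042", "XR-9".
ID_TOKEN = re.compile(r"#?\d{4,}|[A-Za-z]{1,4}-\d+")

def route(query, max_lookup_tokens=3):
    """Return 'sparse_only' for short identifier lookups, else 'hybrid'."""
    tokens = [t.strip(".,?!") for t in query.split()]
    has_id = any(ID_TOKEN.fullmatch(t) for t in tokens)
    # A short query dominated by an identifier is treated as a pure lookup.
    if has_id and len(tokens) <= max_lookup_tokens:
        return "sparse_only"
    return "hybrid"

routes = {
    q: route(q)
    for q in [
        "order #99213",
        "error code E-4042 on the XR-9 valve",
        "videos similar to this kitchen demo",
    ]
}
```

Longer queries that merely contain an identifier stay on the hybrid path, since the surrounding words may still benefit from semantic matching.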

Evaluation: Prove the Halves Are Complementary

Do not assume hybrid helps -- measure it, broken out by query class. Run three systems on the same labeled set: dense-only, sparse-only, and fused.

Recall@k per query class (semantic, entity, exact-phrase). The signature of a healthy hybrid is sparse winning the entity/phrase classes, dense winning the semantic class, and fusion at or above the max of the two everywhere.

Win/loss attribution. For queries where fusion beats dense-only, confirm the lift came from documents only the sparse list surfaced. If not, your sparse index is not pulling its weight.

Score calibration. Fusion weights and thresholds do not transfer across models or modalities; recalibrate when you change either. (See Calibrating Similarity Scores.)
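The first two checks are easy to script. The sketch below assumes each labeled query is a dict holding a class label, a set of relevant doc ids, and one ranked list per system; the field names and the two sample queries are invented for illustration.

```python
from collections import defaultdict

SYSTEMS = ("dense", "sparse", "fused")

def recall_at_k(ranked, relevant, k):
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def recall_by_class(queries, k=10):
    """Mean recall@k per query class and per system."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for q in queries:
        counts[q["cls"]] += 1
        for system in SYSTEMS:
            sums[q["cls"]][system] += recall_at_k(q[system], q["relevant"], k)
    return {c: {s: sums[c][s] / counts[c] for s in SYSTEMS} for c in sums}

def sparse_only_wins(q, k=10):
    """Relevant docs in the fused top-k that dense missed and sparse surfaced."""
    fused_hits = set(q["fused"][:k]) & q["relevant"]
    return (fused_hits - set(q["dense"][:k])) & set(q["sparse"][:k])

queries = [
    {"cls": "entity", "relevant": {"d1"},
     "dense": ["d7", "d8", "d1"], "sparse": ["d1", "d9"],
     "fused": ["d1", "d7", "d8"]},
    {"cls": "semantic", "relevant": {"d2", "d3"},
     "dense": ["d2", "d3"], "sparse": ["d5", "d2"],
     "fused": ["d2", "d3", "d5"]},
]
report = recall_by_class(queries, k=2)
entity_wins = sparse_only_wins(queries[0], k=2)
```

On this toy set the report shows the healthy pattern: sparse carries the entity class, dense carries the semantic class, fusion matches the better of the two in both, and the entity-class lift is attributable to a document only the sparse list surfaced.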

The one number that ties it together is recall on the rare-term query class. If fusion does not move it, you have paid for a sparse index that is not earning its keep -- which usually means BM25 where you needed learned sparse, or a fusion weight that is drowning the sparse signal.

Doing This in Mixpeek

In Managed Mixpeek, the sparse half is a retriever configuration, not a second system to operate. A retriever can run dense vector search and BM25 lexical search over the same multimodal records -- including the text extracted from transcripts and OCR -- and fuse them with RRF in one call.

from mixpeek import Mixpeek

client = Mixpeek(api_key="mxp_sk_...")

retriever = client.retrievers.create(
    namespace="ad_creatives",
    stages=[
        # dense semantic recall over the visual + transcript embeddings
        {"stage_type": "vector_search", "field": "embedding", "top_k": 200},
        # exact-term recall over the transcript / OCR text (BM25)
        {"stage_type": "lexical_search", "field": "transcript", "top_k": 200},
        # merge the two rankings without comparing raw scores
        {"stage_type": "rank_fusion", "method": "rrf"},
    ],
)

results = client.retrievers.execute(
    retriever_id=retriever.retriever_id,
    inputs={"text": "ad that says use code SAVE20 over a kitchen demo"},
    top_k=20,
)

The query "ad that says use code SAVE20 over a kitchen demo" needs both halves: the kitchen-demo clause is a dense semantic match, and "SAVE20" is an exact term that only the lexical stage can guarantee. If you run your own learned sparse model (SPLADE or similar) and already produce sparse vectors, bring them to MVS, the Mixpeek Vector Store, and run dense, sparse, and BM25 search side by side on object storage. Either way, the agent calls one MCP tool and gets a single fused ranking that respects both vocabularies.
