    Retrieval
    Updated 2026-06-19

    Learned Sparse Retrieval and Dense-Sparse Hybrid: Why Agents Need Both Vocabularies

    A first-principles guide to the sparse half of hybrid search. Covers what BM25 actually computes, why dense embeddings silently drop exact terms, how learned sparse models like SPLADE expand queries in vocabulary space, and how to fuse dense and sparse rankings with RRF or relative-score fusion -- so an AI agent searching unstructured content keeps both semantic recall and exact-term precision.

    Hybrid Search
    Sparse Retrieval
    SPLADE
    BM25
    Rank Fusion

    The Failure Mode Dense Search Hides



    Dense embedding search is so good at meaning that it is easy to forget what it throws away. Compress a passage into a single 1024-dimensional vector and you keep its gist; you lose the guarantee that a specific token survives. Ask a dense retriever for "error code E-4042 on the XR-9 valve" and it will happily return passages about valves, errors, and hardware faults -- semantically adjacent, lexically wrong. The one document that contains the literal string "E-4042" can rank below three documents that never mention it.

    This is not a tuning problem. It is structural. Dense retrieval optimizes for semantic proximity, and rare tokens -- part numbers, error codes, person names, API method names, drug names, legal citations -- carry almost no semantic signal to spread across a dense vector. They are exactly the tokens an agent most needs to match exactly. Sparse retrieval is the half of the system that protects them.

    This guide is about that half: what sparse retrieval computes, how learned sparse models fix BM25's biggest weakness, and how to fuse a sparse ranking with a dense one so an agent gets both vocabularies at once.

    Two Vocabularies, Not Two Algorithms



    The cleanest mental model is that dense and sparse retrievers speak different vocabularies and retrieve complementary sets of documents.

  1. Dense retrieval maps text into a low-dimensional continuous space. Two passages are close if their meanings are close, even with zero shared words. Strong on paraphrase, synonymy, and cross-lingual matching. Weak on rare exact terms.
  2. Sparse retrieval maps text into a very high-dimensional vector over the vocabulary -- one dimension per term, mostly zeros. Two passages match when they share weighted terms. Strong on exact tokens, entities, and phrases. Weak when the query and document use different words for the same idea.


    Because their strengths are complementary, fusing them usually beats either alone. Published hybrid systems routinely report recall gains in the 15-30 percent range over dense-only on mixed query workloads, though the exact figure depends on the corpus and query mix, and the gap is largest on the rare-term queries dense search fumbles.

    What BM25 Actually Computes



    BM25 is the workhorse sparse method, and it is worth understanding rather than treating as a black box. It scores a document for a query by summing, over each query term, three intuitions:

    1. Term frequency, saturated. A document mentioning a term more is more relevant, but with diminishing returns -- the tenth occurrence adds little over the third.
    2. Inverse document frequency. A term that appears in few documents is more discriminating than a common one. "valve" is worth less than "E-4042."
    3. Length normalization. A long document has more chances to contain a term by accident, so its term frequencies are discounted.

    In compact form:

    score(q, d) = sum over terms t in q of
        IDF(t) * ( f(t,d) * (k1 + 1) )
                / ( f(t,d) + k1 * (1 - b + b * |d| / avgdl) )


    where `f(t,d)` is the term's frequency in the document, `|d|` is document length, `avgdl` is the average length, and `k1` (around 1.2-2.0) and `b` (around 0.75) control saturation and length normalization. The takeaway is not the formula; it is that BM25 needs the literal term to appear. Its strength is precision on exact matches; its weakness is the vocabulary gap -- a query for "laptop" will not match a document that only says "notebook computer."
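    The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a production scorer: it assumes documents are already tokenized into lists of lowercase terms, and because the text leaves `IDF(t)` unspecified, it plugs in one common smoothed variant. The function name `bm25_scores` and the toy documents are made up for the example.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue  # a term that never appears contributes nothing
            # Assumption: a common smoothed IDF; other variants exist.
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "error code e-4042 on the xr-9 valve".split(),
    "the valve reported a hardware fault".split(),
    "notebook computer battery replacement".split(),
]
scores = bm25_scores(["e-4042", "valve"], docs)
```

    The first document matches both query terms and scores highest, the second matches only the common term "valve", and the third scores zero. A query for "laptop" scores zero against all three, including the "notebook computer" document, which is the vocabulary gap in miniature.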

    Learned Sparse Retrieval: BM25 With a Brain



    Learned sparse models close the vocabulary gap without leaving the sparse index. The idea: instead of scoring only the terms that literally appear, use a language model to predict which vocabulary terms a passage is *about*, and assign each a learned weight. The output is still a sparse vector over the vocabulary -- so it still uses a fast inverted index -- but now it includes expansion terms the text never contained.

    SPLADE (Sparse Lexical and Expansion model) is the canonical example. It runs the text through a BERT-style encoder, projects each position onto the vocabulary via the masked-language-model head, passes the resulting logits through a log-saturated ReLU, and pools across positions (by sum or max) into one weight per vocabulary term. A sparsity regularizer pushes most weights to zero, keeping the vector sparse. The effect:

  1. A passage about "notebook computer" gets nonzero weight on "laptop," "PC," and "portable" -- so the dense-style synonym match now happens inside the sparse index.
  2. Weights are learned from relevance data, not hand-set like IDF, so the model learns which expansions actually help retrieval.
  3. Because the representation is still sparse and term-aligned, it stays interpretable: you can read off exactly which terms drove a match.
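    The project-and-pool step described above can be sketched on toy numbers. This is an illustration under stated assumptions, not SPLADE itself: the logits are invented rather than produced by an encoder, the vocabulary has four terms instead of roughly 30k, and the `log(1 + ReLU(x))` activation with max pooling reflects the published SPLADE formulation as commonly described. The name `splade_pool` is hypothetical.

```python
import math

def splade_pool(logits, vocab, pooling="max"):
    """Collapse per-position vocabulary logits into one weight per term.

    logits: list of rows, one per input position, each holding
            len(vocab) raw scores from the masked-language-model head.
    """
    weights = {}
    for j, term in enumerate(vocab):
        # Log-saturated ReLU: negative logits become 0, large ones are damped.
        activations = [math.log1p(max(0.0, row[j])) for row in logits]
        w = max(activations) if pooling == "max" else sum(activations)
        if w > 0.0:
            weights[term] = w  # zero weights are simply not stored
    return weights

vocab = ["notebook", "computer", "laptop", "valve"]   # toy vocabulary
logits = [
    [6.0, 1.0, 3.0, -2.0],   # made-up logits at the "notebook" position
    [0.5, 5.0, 2.0, -1.0],   # made-up logits at the "computer" position
]
doc_vec = splade_pool(logits, vocab)
```

    In this toy run "laptop" receives a nonzero weight even though it never appears in the input, while "valve" is dropped because its logits are negative at every position. In a trained model the regularizer is what makes most of the vocabulary end up in the second group.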


    # Conceptual shape of a SPLADE document vector (term -> weight),
    # most of the ~30k-term vocabulary is implicitly zero.
    doc_vec = {
        "notebook": 2.7, "computer": 2.3,
        "laptop": 1.9,   # expansion: never in the text
        "portable": 1.1, # expansion
        "pc": 0.8,       # expansion
    }
    # Scoring against a query vector is a sparse dot product:
    #   score = sum(q_vec[t] * doc_vec[t] for t in shared terms)
    


    Learned sparse models like SPLADE, uniCOIL, and DeepImpact sit between BM25 and dense retrieval: they recover much of dense retrieval's recall on paraphrases while keeping sparse retrieval's exact-term precision and inverted-index efficiency. The cost is a model forward pass at index and query time, and larger postings lists than raw BM25 because of the expansion terms.
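    To make the inverted-index point concrete, here is a minimal sketch of how term-weight vectors can be served from postings lists. It assumes document vectors arrive as plain `{term: weight}` dicts; the function names, document ids, and query weights are all hypothetical, and a real engine would add compression, pruning, and top-k optimizations that are omitted here.

```python
from collections import defaultdict

def build_index(doc_vecs):
    """Map each term to a postings list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vec in doc_vecs.items():
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index

def search(index, q_vec, top_k=10):
    """Sparse dot product computed by walking only the query's postings lists."""
    scores = defaultdict(float)
    for term, q_weight in q_vec.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

doc_vecs = {
    "d1": {"notebook": 2.7, "computer": 2.3, "laptop": 1.9, "portable": 1.1, "pc": 0.8},
    "d2": {"valve": 2.4, "e-4042": 3.1, "error": 1.2},
}
index = build_index(doc_vecs)
hits = search(index, {"laptop": 1.5, "battery": 0.9})   # hypothetical query weights
```

    Only documents that share at least one term with the query are ever touched, which is where the efficiency comes from. It also shows the cost mentioned above: every expansion term adds an entry to some postings list, so a learned sparse index is larger than a BM25 index over the same text.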

    Sparsity Beyond Text: Why Agents Care



    This is not a text-only trick. The moment an agent searches unstructured multimodal content, exact-term matching reappears in new forms:

  1. Transcripts and captions. ASR output and OCR text carry product names, prices, promo codes, and spoken claims an agent must match verbatim. A dense audio or video embedding will not reliably surface "use code SAVE20"; a sparse index over the transcript will.
  2. Extracted metadata. Detected object labels, tags, and structured fields from a perception pipeline form a natural sparse vocabulary.
  3. Hybrid multimodal records. A video segment can have a dense visual vector, a dense audio vector, and a sparse text vector over its transcript. Fusing all three is the same machinery as text hybrid search, with one more ranked list to merge.


    The general principle: any modality with a discrete, nameable layer (text, labels, codes) benefits from a sparse index, and fusing it with the dense semantic index is what gives an agent both "what does this mean" and "did this exact thing appear."

    Fusing the Two Rankings



    You have a dense top-k and a sparse top-k. They are scored on incompatible scales -- a cosine of 0.31 and a BM25 of 17 cannot be added. There are three standard ways to combine them.

    Reciprocal Rank Fusion (RRF). Throw away the raw scores and combine by rank position. For each document, sum 1/(K + rank) across the lists it appears in (K is a constant, typically 60).

    def rrf(rank_lists, K=60):
        scores = {}
        for ranking in rank_lists:            # each is an ordered list of doc ids
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (K + rank)
        return sorted(scores, key=scores.get, reverse=True)
    


    RRF is unsupervised, score-scale-agnostic, and robust -- it is the right default when you have no labeled queries. Its limitation is that it ignores *how much* better rank 1 was than rank 2; a runaway-confident match and a barely-ahead match contribute the same.

    Relative Score Fusion (RSF). Min-max normalize each list's scores into [0, 1], then take a weighted sum: `final = w * dense_norm + (1 - w) * sparse_norm`. This keeps score magnitude information RRF discards, at the cost of needing a sane normalization and a tuned weight `w`. It tends to win when one method is genuinely more confident on a given query and you have validation data to set `w`.
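    A minimal sketch of relative score fusion as described above, assuming each retriever returns a `{doc_id: score}` dict. Two choices here are assumptions rather than part of the definition: a document missing from one list contributes 0 for that list, and a list whose scores are all equal is normalized to 1.0. The function names and sample scores are illustrative only.

```python
def min_max(scores):
    """Rescale a non-empty {doc_id: score} dict into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}   # degenerate list (assumption)
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def rsf(dense, sparse, w=0.5):
    """Weighted sum of min-max normalized dense and sparse scores."""
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    fused = {}
    for d in set(dense_n) | set(sparse_n):
        # Missing from a list means a contribution of 0 from it (assumption).
        fused[d] = w * dense_n.get(d, 0.0) + (1 - w) * sparse_n.get(d, 0.0)
    return sorted(fused, key=fused.get, reverse=True)

dense = {"a": 0.31, "b": 0.29, "c": 0.12}     # cosine similarities
sparse = {"b": 17.0, "d": 9.0, "a": 4.0}      # BM25 scores
ranking = rsf(dense, sparse, w=0.5)
```

    Document "b" wins because it is near the top of both lists, while "a" tops the dense list but sits at the bottom of the sparse one. Note that min-max normalization is sensitive to outliers: one runaway score compresses everything else in that list toward zero, which is part of why `w` needs validation data.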

    Learned / tensor fusion. Train a small model on the per-system scores, or rerank the fused candidate pool with a cross-encoder or a late-interaction (MaxSim) scorer. Highest ceiling, needs relevance labels, and is best applied only to the small fused candidate set rather than the whole corpus. (See Late Interaction Retrieval for the MaxSim mechanism.)

    A practical default for an agent stack: RRF to merge dense and sparse candidates cheaply, then a reranker on the top ~100 if eval shows precision leaking at the top.

    A Decision Guide



  1. Pure semantic, paraphrase-heavy, few rare terms (broad "find similar" queries): dense alone may suffice. Add sparse the moment exact-term recall matters.
  2. Entity, code, name, or phrase queries (support, legal, compliance, catalog lookups): never go dense-only. BM25 at minimum; learned sparse if the vocabulary gap also bites.
  3. Mixed agent workloads (the common case): dense + learned-sparse, fused with RRF. This is the configuration that survives the widest query distribution, because agents issue both kinds of query, often in the same session.
  4. Exact-ID lookups: detect them and route to sparse/filter only -- running dense search on "order #99213" wastes latency and adds noise. (See Query Transformation Pipelines for routing.)
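    The exact-ID routing rule in the guide above can be sketched with a simple pattern check. This is a deliberately naive illustration: the regular expression is a hypothetical stand-in for whatever identifier formats a real deployment uses, and the `route` function and its return labels are made up for the example.

```python
import re

# Hypothetical patterns for ID-like queries: an optional "order" prefix plus
# a long number, or a short letter prefix, a hyphen, and digits.
EXACT_ID = re.compile(r"^(order\s*)?#?\d{4,}$|^[A-Z]{1,4}-\d{2,}$", re.IGNORECASE)

def route(query):
    """Return which retrieval path to run: 'sparse' or 'hybrid'."""
    return "sparse" if EXACT_ID.match(query.strip()) else "hybrid"
```

    With these patterns, `route("order #99213")` and `route("E-4042")` go to the sparse path, while a query such as "error code E-4042 on the XR-9 valve" stays hybrid because it mixes an identifier with descriptive text that benefits from dense recall.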


    Evaluation: Prove the Halves Are Complementary



    Do not assume hybrid helps -- measure it, broken out by query class. Run three systems on the same labeled set: dense-only, sparse-only, and fused.

  1. Recall@k per query class (semantic, entity, exact-phrase). The signature of a healthy hybrid is sparse winning the entity/phrase classes, dense winning the semantic class, and fusion at or above the max of the two everywhere.
  2. Win/loss attribution. For queries where fusion beats dense-only, confirm the lift came from documents only the sparse list surfaced. If not, your sparse index is not pulling its weight.
  3. Score calibration. Fusion weights and thresholds do not transfer across models or modalities; recalibrate when you change either. (See Calibrating Similarity Scores.)


    The one number that ties it together is recall on the rare-term query class. If fusion does not move it, you have paid for a sparse index that is not earning its keep -- which usually means BM25 where you needed learned sparse, or a fusion weight that is drowning the sparse signal.
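    A per-class recall breakdown of the kind described above might look like the following sketch. It assumes a labeled set where each query carries a class label and a list of relevant document ids, and it treats each system as a callable from query text to a ranked list of ids. The toy queries and canned rankings are invented stand-ins for real retrievers.

```python
from collections import defaultdict

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def recall_by_class(queries, system, k=10):
    """Average recall@k per query class for one retrieval system.

    queries: list of dicts with 'text', 'cls', and 'relevant' (doc ids).
    system:  callable mapping query text to a ranked list of doc ids.
    """
    per_class = defaultdict(list)
    for q in queries:
        per_class[q["cls"]].append(recall_at_k(system(q["text"]), q["relevant"], k))
    return {cls: sum(v) / len(v) for cls, v in per_class.items()}

# Toy labeled set and canned rankings standing in for real retrievers.
queries = [
    {"text": "E-4042", "cls": "entity", "relevant": ["d2"]},
    {"text": "portable pc", "cls": "semantic", "relevant": ["d1"]},
]
dense_only = {"E-4042": ["d3", "d1"], "portable pc": ["d1", "d3"]}
fused = {"E-4042": ["d2", "d3"], "portable pc": ["d1", "d2"]}

report = {
    name: recall_by_class(queries, runs.get, k=2)
    for name, runs in {"dense": dense_only, "fused": fused}.items()
}
```

    In this toy data the dense-only system misses the entity query entirely and fusion recovers it, which is the pattern to look for on a real labeled set. Keeping the classes separate matters because a single averaged recall number can hide a rare-term class that never improves.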

    Doing This in Mixpeek



    In Managed Mixpeek, the sparse half is a retriever configuration, not a second system to operate. A retriever can run dense vector search and BM25 lexical search over the same multimodal records -- including the text extracted from transcripts and OCR -- and fuse them with RRF in one call.

    from mixpeek import Mixpeek

    client = Mixpeek(api_key="mxp_sk_...")

    retriever = client.retrievers.create(
        namespace="ad_creatives",
        stages=[
            # dense semantic recall over the visual + transcript embeddings
            {"stage_type": "vector_search", "field": "embedding", "top_k": 200},
            # exact-term recall over the transcript / OCR text (BM25)
            {"stage_type": "lexical_search", "field": "transcript", "top_k": 200},
            # merge the two rankings without comparing raw scores
            {"stage_type": "rank_fusion", "method": "rrf"},
        ],
    )

    results = client.retrievers.execute(
        retriever_id=retriever.retriever_id,
        inputs={"text": "ad that says use code SAVE20 over a kitchen demo"},
        top_k=20,
    )


    The query "ad that says use code SAVE20 over a kitchen demo" needs both halves: the kitchen-demo clause is a dense semantic match, and "SAVE20" is an exact term that only the lexical stage can guarantee. If you run your own learned sparse model (SPLADE or similar) and already produce sparse vectors, bring them to MVS, the Mixpeek Vector Store, and run dense, sparse, and BM25 search side by side on object storage. Either way, the agent calls one MCP tool and gets a single fused ranking that respects both vocabularies.

    Further Reading



  1. Multi-Stage Retrieval -- the staged pipeline this fused candidate set feeds into
  2. Calibrating Similarity Scores -- why fusion weights and thresholds must be recalibrated per model
  3. Late Interaction Retrieval -- the MaxSim reranker that can rescore a fused candidate pool
  4. Query Transformation Pipelines -- routing exact-ID queries to the sparse half
  5. MVS: Agent-native vector store on object storage
  6. Managed Mixpeek

