NEWVectors or files. Pick a path.Start →
    Retrieval
    16 min read
    Updated 2026-06-19

    Diversity-Aware Retrieval: How Agents Avoid Twenty Near-Duplicate Results

    Top-k similarity search hands an agent the same evidence twenty times over. This guide teaches the algorithms that fix it: Maximal Marginal Relevance, determinantal point processes, submodular facility-location selection, and redundancy-aware context packing, then shows how to wire diverse selection into a Mixpeek retriever.

    Diversity
    Maximal Marginal Relevance
    Determinantal Point Processes
    Submodular Selection
    Context Packing
    Retrieval

    The Problem: Top-k Is Not the Same as Top-k Useful



    A nearest-neighbor search returns the k items closest to the query vector. Sort by similarity, take the first ten, done. For a single fact lookup that is fine. For an agent assembling evidence it is quietly broken, because similarity to the query and usefulness as a set are not the same thing.

    Picture an agent searching a video archive for "the moment the speaker reveals the product." Scene segmentation produced eight near-identical keyframes of that one reveal, all embedded close to the query. Plain top-10 returns six copies of the reveal and nothing else. The agent now has high relevance and almost no information. The shots before and after the reveal, the audience reaction, the on-screen price that flashed two seconds later, all of it sits at rank eleven and beyond, crowded out by duplicates.

    This shows up everywhere an agent reads unstructured content:

  1. Video. Adjacent frames and shots of the same scene cluster together and flood the results.
  2. Documents. The same boilerplate clause appears in fifty contracts; a query for it returns fifty rows that say the same thing.
  3. Audio. A repeated jingle or a recurring phrase dominates the top of every related query.
  4. RAG context windows. You can fit maybe twelve chunks in the prompt. Three of them being paraphrases of each other wastes a third of the agent's working memory.


  5. The fix is to stop optimizing each result in isolation and start optimizing the set. You want the most relevant, least redundant subset that fits the budget. That is diversity-aware retrieval, and there is a clean body of theory behind it.

    Relevance of a Set Is Not the Sum of Relevances



    The core insight is that the value of adding an item depends on what you have already selected. The fifth photo of the same reveal adds almost nothing because the first four already carried that information. The first photo of the audience reaction adds a lot because nothing in the set covered it yet.

    Formally, the value of a set has diminishing returns: adding an item to a larger set helps less than adding it to a smaller one. A function with that property is called submodular, and submodularity is the mathematical backbone of every method below. It is exactly the right shape for "I already have this information, so the marginal worth of more of the same is low."

    Submodular set functions have a famous practical gift: you cannot maximize them exactly in reasonable time, but a simple greedy algorithm gets within a guaranteed factor of the optimum. Start with an empty set, and repeatedly add the single item that increases the objective the most, until you hit your budget. For monotone submodular objectives that greedy result is provably within (1 - 1/e), about 63 percent, of the best possible set. Almost every diversity method in production is some version of this greedy loop.

    Method 1: Maximal Marginal Relevance



    Maximal Marginal Relevance, or MMR, is the original and still the most widely deployed approach. It is the greedy loop made concrete with one tunable knob.

    At each step, score every not-yet-selected candidate by trading off how relevant it is to the query against how similar it is to the items already chosen:

    mmr_score(c) =
      lambda * sim(c, query)
      - (1 - lambda) * max over s in selected of sim(c, s)

    pick the candidate with the highest mmr_score, add it to selected, repeat.


    The first term is plain relevance. The second term is a redundancy penalty: an item that looks just like something already chosen gets docked by its closest match in the current set. The lambda knob slides between the two:

  6. lambda = 1 ignores diversity entirely. You get plain top-k back.
  7. lambda = 0 ignores the query and just spreads candidates apart.
  8. lambda around 0.5 to 0.7 is the usual sweet spot: relevance leads, redundancy is penalized.


  9. MMR is cheap, it needs only the candidate-to-query similarities you already computed plus pairwise similarities among the few items you select, and it is easy to reason about. Its weakness is that the redundancy term only looks at the single nearest selected item (the max), so it can miss the case where a candidate is moderately similar to several already-chosen items at once. That is the gap the next methods close.

    Method 2: Determinantal Point Processes



    A determinantal point process, or DPP, models diversity through geometry. Build a kernel matrix L where each entry L[i][j] combines the quality of items i and j with their similarity. The probability of selecting a particular subset is proportional to the determinant of the submatrix for that subset.

    Why the determinant? Geometrically the determinant of a set of vectors is the volume of the parallelepiped they span. Two vectors pointing the same way (redundant items) span almost no volume, so the determinant, and therefore the probability, is near zero. Vectors pointing in different directions (diverse items) span a large volume and get high probability. The determinant naturally punishes redundancy across the whole set at once, not just against the nearest neighbor, which is exactly the blind spot MMR has.

    L[i][j] = quality(i) * similarity(i, j) * quality(j)

    P(subset S) is proportional to determinant( L restricted to rows/cols in S )

    high volume = diverse, dissimilar items -> high probability near-zero volume = redundant items spanning no new space -> near-zero probability


    DPPs give a principled, global notion of diversity and you can find a high-probability fixed-size subset with, again, a greedy maximization of the log-determinant. The cost is real: building and operating on the kernel scales poorly as the candidate pool grows, which is why DPPs are usually applied to a few hundred candidates after a cheap first-stage retrieval, never to the whole corpus.

    Method 3: Submodular Facility Location and Coverage



    A third framing asks for coverage: pick a small set so that every relevant item in the candidate pool is close to something you chose. The classic objective is facility location:

    coverage(S) = sum over every candidate c of (
      max over s in S of similarity(c, s)
    )
    


    Read it as: for each candidate, how well is it represented by its nearest selected item. Maximizing this picks a set of "representatives" that collectively stand in for the whole pool, which is precisely choosing one keyframe per distinct scene instead of eight frames of the same scene. Facility location is monotone submodular, so the same greedy loop with the (1 - 1/e) guarantee applies.

    You can fold relevance in by weighting each candidate's contribution by its query relevance, so the representatives you pick are biased toward the parts of the pool that matter to the query. This coverage view is often more natural than MMR for media, because it directly answers "did I capture every distinct thing that happened" rather than "is each new item far from the last."

    Method 4: Redundancy-Aware Context Packing for RAG



    When the consumer is a language model, diversity is also a budget problem. The prompt holds a fixed number of tokens, and recent token-budgeted RAG work, such as redundancy-aware greedy context selection, optimizes a set-level objective explicitly:

    maximize  sum of relevance(chunk to query)
            - penalty * redundancy(chunk to chunks already packed)
    subject to  total tokens of selected chunks <= context budget
    


    This is MMR's idea generalized two ways: the constraint is a token budget rather than a fixed count, so a long, dense chunk competes against several short ones, and the redundancy term can be a set-level penalty rather than only the nearest neighbor. The selection is still greedy by marginal gain per token. The result is a context window where every chunk earns its space, which both raises answer quality and cuts cost, because you are not paying to feed the model three phrasings of the same sentence.

    Choosing a Method



    MethodRedundancy viewCostBest for
    MMRNearest selected item onlyLowestGeneral reranking, the safe default
    DPPWhole-set volume via determinantHigh; needs a candidate poolWhen subtle multi-way redundancy hurts
    Facility locationCoverage of the whole poolModerate, greedyMedia: one representative per distinct scene or cluster
    Redundancy-aware packingSet-level under a token budgetModerate, greedyFilling an LLM context window efficiently
    Three rules cut across all of them:

  10. Diversify after relevance, not instead of it. Run a normal relevance retrieval first to get a few hundred candidates, then apply diversity to pick the final handful. Diversity over the whole corpus just returns random scattered junk.
  11. Diversity is a knob, not a default-on switch. A query for "every distinct camera angle in this shoot" wants high diversity. A query for "the single best match" wants none. The agent should set the knob from the task.
  12. Define similarity in the space that matters. Visual diversity, semantic diversity, and temporal diversity are different. Two frames can be visually identical but seconds apart, or visually different but semantically the same event. Pick the embedding, or combine several, that captures the redundancy you actually want to suppress.


  13. What to Measure



    Plain recall and precision at k will not reveal a redundancy problem, because twenty copies of one relevant item still score as twenty relevant results. Measure the set, not the items:

  14. Subtopic or aspect recall. Over a set of queries with known distinct aspects, how many aspects does the top-k cover. This is what diversity is supposed to improve and what plain similarity ignores.
  15. Intra-set redundancy. The average pairwise similarity among returned items. Lower is more diverse. Watch it move as you turn the lambda or penalty knob.
  16. Downstream answer quality for RAG. Whether the model's answer improves when the same token budget holds diverse chunks instead of redundant ones.
  17. Latency added by the diversity stage. Greedy MMR over a few hundred candidates is cheap; a DPP over thousands is not. Keep diversity on the candidate set, never the corpus, and watch the p95.


  18. Walk the lambda or penalty knob from pure relevance toward more diversity and stop where aspect recall stops improving, so you add variety only up to the point it still helps.

    Doing This in Mixpeek



    Diversity is a retriever-stage concern in Mixpeek. You keep your normal relevance retrieval as the first stage to gather candidates, then add a diversity stage that reranks those candidates into a non-redundant final set. Because it operates on the candidate list rather than the whole collection, it composes with whatever feature search or filter stages came before it.

    from mixpeek import Mixpeek

    client = Mixpeek(api_key="mxp_sk_...")

    # Retrieve relevant candidates first, then diversify the final selection. retriever = client.retrievers.create( namespace_id="ns_video", collection_ids=["col_keyframes"], retriever_name="product-reveals-diverse", stages=[ # Stage 1: relevance. Gather a generous candidate pool. {"stage_name": "knn_search", "parameters": {"top_k": 200}}, # Stage 2: diversity. Greedy MMR over the candidates. { "stage_name": "diversity_rerank", "parameters": { "method": "mmr", # or "facility_location" "lambda": 0.6, # relevance vs redundancy tradeoff "similarity_space": "visual_embedding", "top_k": 10, }, }, ], )

    # A coverage-style query wants high diversity; flip lambda lower. results = client.retrievers.execute( retriever_id=retriever.retriever_id, inputs={"text": "the moment the speaker reveals the product"}, top_k=10, )


    Treat the lambda and the similarity space as versioned config, because they change which evidence an agent sees. Measure aspect recall and intra-set redundancy before and after a change, and pick the embedding for the similarity space, on the Models page, that matches the redundancy you want to suppress: a visual encoder to collapse look-alike frames, a text encoder to collapse paraphrased chunks.

    Further Reading



  19. Multi-Stage Retrieval -- the relevance-first, diversify-on-candidates pattern this guide plugs into
  20. Cross-Encoder Reranking -- the other reranking stage, focused on precision rather than diversity
  21. Video Frame Sampling -- choosing diverse frames at ingestion, the upstream cousin of diversifying at query time
  22. Calibrating Similarity Scores -- making the similarities the diversity penalty relies on actually mean something
  23. Evaluating Multimodal Retrieval -- metrics and ground truth, including how to score a set instead of an item
  24. Managed Mixpeek

    Put multimodal search to work

    Connect a bucket and Mixpeek runs the whole multimodal search pipeline for you: extraction, indexing, and search over your own objects. No models to wire up, nothing to host.

    Start with Managed
    MVS · bring your own

    Already have vectors?

    Keep your embeddings on your own cloud and run dense, sparse, and BM25 search directly on object storage. First 1M vectors free.

    Start with MVS

    Build a Multimodal Search Pipeline

    Give agents searchable access to video, image, audio, and document evidence with Mixpeek.

    Start BuildingRead Docs

    Related guides

    Retrieval

    BM25 and the Inverted Index: The Lexical Retriever Every Hybrid Search Treats as a Black Box

    Every hybrid search pipeline pairs dense vectors with BM25, but almost no one can say where the BM25 number actually comes from, which is exactly why fusion, tuning, and exact-match failures stay mysterious. This guide opens the box: how an inverted index turns transcripts and OCR text into posting lists, the precise BM25 scoring formula with its term-frequency saturation and length normalization, what the k1 and b parameters really do, and why the tokenizer is the silent decider of whether an agent ever finds a serial number.

    Read guide →
    Retrieval

    Hybrid Search Fusion: How to Combine Dense and Lexical Retrieval Without Breaking Ranking

    An agent searching transcripts, OCR text, and captions needs both meaning (dense vectors) and exact terms (BM25), but the two return scores on incompatible scales that you cannot simply add. This guide teaches the real fusion mechanics: why score distributions make naive normalization fail, the exact math of Reciprocal Rank Fusion and how its k parameter behaves, weighted convex combination with proper normalization, and how to choose and tune a fusion method against a labeled set.

    Read guide →
    Retrieval

    Filtered Vector Search: How Agents Combine Similarity with Hard Constraints

    Almost every agentic query is a vector search plus a constraint -- 'clips from campaign X after May', 'images of red cars in the EU bucket'. This guide explains the three filtering strategies (pre-filter, post-filter, in-place predicate-aware traversal), why each one silently breaks recall or latency at different selectivities, and how a query planner picks between them.

    Read guide →