Diversity-Aware Retrieval: How Agents Avoid Twenty Near-Duplicate Results

The Problem: Top-k Is Not the Same as Top-k Useful

A nearest-neighbor search returns the k items closest to the query vector. Sort by similarity, take the first ten, done. For a single fact lookup that is fine. For an agent assembling evidence it is quietly broken, because similarity to the query and usefulness as a set are not the same thing.

Picture an agent searching a video archive for "the moment the speaker reveals the product." Scene segmentation produced eight near-identical keyframes of that one reveal, all embedded close to the query. Plain top-10 returns six copies of the reveal and nothing else. The agent now has high relevance and almost no information. The shots before and after the reveal, the audience reaction, the on-screen price that flashed two seconds later, all of it sits at rank eleven and beyond, crowded out by duplicates.

This shows up everywhere an agent reads unstructured content:

Video. Adjacent frames and shots of the same scene cluster together and flood the results.

Documents. The same boilerplate clause appears in fifty contracts; a query for it returns fifty rows that say the same thing.

Audio. A repeated jingle or a recurring phrase dominates the top of every related query.

RAG context windows. You can fit maybe twelve chunks in the prompt. Three of them being paraphrases of each other wastes a third of the agent's working memory.

The fix is to stop optimizing each result in isolation and start optimizing the set. You want the most relevant, least redundant subset that fits the budget. That is diversity-aware retrieval, and there is a clean body of theory behind it.

Relevance of a Set Is Not the Sum of Relevances

The core insight is that the value of adding an item depends on what you have already selected. The fifth photo of the same reveal adds almost nothing because the first four already carried that information. The first photo of the audience reaction adds a lot because nothing in the set covered it yet.

Formally, the value of a set has diminishing returns: adding an item to a larger set helps less than adding it to a smaller one. A function with that property is called submodular, and submodularity is the mathematical backbone of every method below. It is exactly the right shape for "I already have this information, so the marginal worth of more of the same is low."

Submodular set functions have a famous practical gift: you cannot maximize them exactly in reasonable time, but a simple greedy algorithm gets within a guaranteed factor of the optimum. Start with an empty set, and repeatedly add the single item that increases the objective the most, until you hit your budget. For monotone submodular objectives that greedy result is provably within (1 - 1/e), about 63 percent, of the best possible set. Almost every diversity method in production is some version of this greedy loop.

Method 1: Maximal Marginal Relevance

Maximal Marginal Relevance, or MMR, is the original and still the most widely deployed approach. It is the greedy loop made concrete with one tunable knob.

At each step, score every not-yet-selected candidate by trading off how relevant it is to the query against how similar it is to the items already chosen:

mmr_score(c) =
  lambda * sim(c, query)
  - (1 - lambda) * max over s in selected of sim(c, s)

pick the candidate with the highest mmr_score, add it to selected, repeat.

The first term is plain relevance. The second term is a redundancy penalty: an item that looks just like something already chosen gets docked by its closest match in the current set. The lambda knob slides between the two:

lambda = 1 ignores diversity entirely. You get plain top-k back.

lambda = 0 ignores the query and just spreads candidates apart.

lambda around 0.5 to 0.7 is the usual sweet spot: relevance leads, redundancy is penalized.

MMR is cheap, it needs only the candidate-to-query similarities you already computed plus pairwise similarities among the few items you select, and it is easy to reason about. Its weakness is that the redundancy term only looks at the single nearest selected item (the max), so it can miss the case where a candidate is moderately similar to several already-chosen items at once. That is the gap the next methods close.

Method 2: Determinantal Point Processes

A determinantal point process, or DPP, models diversity through geometry. Build a kernel matrix L where each entry L[i][j] combines the quality of items i and j with their similarity. The probability of selecting a particular subset is proportional to the determinant of the submatrix for that subset.

Why the determinant? Geometrically the determinant of a set of vectors is the volume of the parallelepiped they span. Two vectors pointing the same way (redundant items) span almost no volume, so the determinant, and therefore the probability, is near zero. Vectors pointing in different directions (diverse items) span a large volume and get high probability. The determinant naturally punishes redundancy across the whole set at once, not just against the nearest neighbor, which is exactly the blind spot MMR has.

L[i][j] = quality(i) * similarity(i, j) * quality(j)

P(subset S) is proportional to determinant( L restricted to rows/cols in S )

high volume  = diverse, dissimilar items   -> high probability
near-zero volume = redundant items spanning no new space -> near-zero probability

DPPs give a principled, global notion of diversity and you can find a high-probability fixed-size subset with, again, a greedy maximization of the log-determinant. The cost is real: building and operating on the kernel scales poorly as the candidate pool grows, which is why DPPs are usually applied to a few hundred candidates after a cheap first-stage retrieval, never to the whole corpus.

Method 3: Submodular Facility Location and Coverage

A third framing asks for coverage: pick a small set so that every relevant item in the candidate pool is close to something you chose. The classic objective is facility location:

coverage(S) = sum over every candidate c of (
  max over s in S of similarity(c, s)
)

Read it as: for each candidate, how well is it represented by its nearest selected item. Maximizing this picks a set of "representatives" that collectively stand in for the whole pool, which is precisely choosing one keyframe per distinct scene instead of eight frames of the same scene. Facility location is monotone submodular, so the same greedy loop with the (1 - 1/e) guarantee applies.

You can fold relevance in by weighting each candidate's contribution by its query relevance, so the representatives you pick are biased toward the parts of the pool that matter to the query. This coverage view is often more natural than MMR for media, because it directly answers "did I capture every distinct thing that happened" rather than "is each new item far from the last."

Method 4: Redundancy-Aware Context Packing for RAG

When the consumer is a language model, diversity is also a budget problem. The prompt holds a fixed number of tokens, and recent token-budgeted RAG work, such as redundancy-aware greedy context selection, optimizes a set-level objective explicitly:

maximize  sum of relevance(chunk to query)
        - penalty * redundancy(chunk to chunks already packed)
subject to  total tokens of selected chunks <= context budget

This is MMR's idea generalized two ways: the constraint is a token budget rather than a fixed count, so a long, dense chunk competes against several short ones, and the redundancy term can be a set-level penalty rather than only the nearest neighbor. The selection is still greedy by marginal gain per token. The result is a context window where every chunk earns its space, which both raises answer quality and cuts cost, because you are not paying to feed the model three phrasings of the same sentence.

Choosing a Method

Method

Redundancy view

Cost

Best for

MMR	Nearest selected item only	Lowest	General reranking, the safe default
DPP	Whole-set volume via determinant	High; needs a candidate pool	When subtle multi-way redundancy hurts
Facility location	Coverage of the whole pool	Moderate, greedy	Media: one representative per distinct scene or cluster
Redundancy-aware packing	Set-level under a token budget	Moderate, greedy	Filling an LLM context window efficiently

Three rules cut across all of them:

Diversify after relevance, not instead of it. Run a normal relevance retrieval first to get a few hundred candidates, then apply diversity to pick the final handful. Diversity over the whole corpus just returns random scattered junk.

Diversity is a knob, not a default-on switch. A query for "every distinct camera angle in this shoot" wants high diversity. A query for "the single best match" wants none. The agent should set the knob from the task.

Define similarity in the space that matters. Visual diversity, semantic diversity, and temporal diversity are different. Two frames can be visually identical but seconds apart, or visually different but semantically the same event. Pick the embedding, or combine several, that captures the redundancy you actually want to suppress.

What to Measure

Plain recall and precision at k will not reveal a redundancy problem, because twenty copies of one relevant item still score as twenty relevant results. Measure the set, not the items:

Subtopic or aspect recall. Over a set of queries with known distinct aspects, how many aspects does the top-k cover. This is what diversity is supposed to improve and what plain similarity ignores.

Intra-set redundancy. The average pairwise similarity among returned items. Lower is more diverse. Watch it move as you turn the lambda or penalty knob.

Downstream answer quality for RAG. Whether the model's answer improves when the same token budget holds diverse chunks instead of redundant ones.

Latency added by the diversity stage. Greedy MMR over a few hundred candidates is cheap; a DPP over thousands is not. Keep diversity on the candidate set, never the corpus, and watch the p95.

Walk the lambda or penalty knob from pure relevance toward more diversity and stop where aspect recall stops improving, so you add variety only up to the point it still helps.

Doing This in Mixpeek

Diversity is a retriever-stage concern in Mixpeek. You keep your normal relevance retrieval as the first stage to gather candidates, then add a diversity stage that reranks those candidates into a non-redundant final set. Because it operates on the candidate list rather than the whole collection, it composes with whatever feature search or filter stages came before it.

from mixpeek import Mixpeek

client = Mixpeek(api_key="mxp_sk_...")

# Retrieve relevant candidates first, then diversify the final selection.
retriever = client.retrievers.create(
    namespace_id="ns_video",
    collection_ids=["col_keyframes"],
    retriever_name="product-reveals-diverse",
    stages=[
        # Stage 1: relevance. Gather a generous candidate pool.
        {"stage_name": "knn_search", "parameters": {"top_k": 200}},
        # Stage 2: diversity. Greedy MMR over the candidates.
        {
            "stage_name": "diversity_rerank",
            "parameters": {
                "method": "mmr",        # or "facility_location"
                "lambda": 0.6,           # relevance vs redundancy tradeoff
                "similarity_space": "visual_embedding",
                "top_k": 10,
            },
        },
    ],
)

# A coverage-style query wants high diversity; flip lambda lower.
results = client.retrievers.execute(
    retriever_id=retriever.retriever_id,
    inputs={"text": "the moment the speaker reveals the product"},
    top_k=10,
)

Treat the lambda and the similarity space as versioned config, because they change which evidence an agent sees. Measure aspect recall and intra-set redundancy before and after a change, and pick the embedding for the similarity space, on the Models page, that matches the redundancy you want to suppress: a visual encoder to collapse look-alike frames, a text encoder to collapse paraphrased chunks.

The Problem: Top-k Is Not the Same as Top-k Useful

Relevance of a Set Is Not the Sum of Relevances

Method 1: Maximal Marginal Relevance

Method 2: Determinantal Point Processes

Method 3: Submodular Facility Location and Coverage

Method 4: Redundancy-Aware Context Packing for RAG

Choosing a Method

What to Measure

Doing This in Mixpeek

Further Reading

Put multimodal search to work

Already have vectors?

Run this on your own data

Related guides

Semantic Caching: How Agents Skip Work They Have Already Done

BM25 and the Inverted Index: The Lexical Retriever Every Hybrid Search Treats as a Black Box

Hybrid Search Fusion: How to Combine Dense and Lexical Retrieval Without Breaking Ranking