The Problem: Top-k Is Not the Same as Top-k Useful
A nearest-neighbor search returns the k items closest to the query vector. Sort by similarity, take the first ten, done. For a single fact lookup that is fine. For an agent assembling evidence it is quietly broken, because similarity to the query and usefulness as a set are not the same thing.
Picture an agent searching a video archive for "the moment the speaker reveals the product." Scene segmentation produced eight near-identical keyframes of that one reveal, all embedded close to the query. Plain top-10 returns six copies of the reveal and nothing else. The agent now has high relevance and almost no information. The shots before and after the reveal, the audience reaction, the on-screen price that flashed two seconds later, all of it sits at rank eleven and beyond, crowded out by duplicates.
This shows up everywhere an agent reads unstructured content:
The fix is to stop optimizing each result in isolation and start optimizing the set. You want the most relevant, least redundant subset that fits the budget. That is diversity-aware retrieval, and there is a clean body of theory behind it.
Relevance of a Set Is Not the Sum of Relevances
The core insight is that the value of adding an item depends on what you have already selected. The fifth photo of the same reveal adds almost nothing because the first four already carried that information. The first photo of the audience reaction adds a lot because nothing in the set covered it yet.
Formally, the value of a set has diminishing returns: adding an item to a larger set helps less than adding it to a smaller one. A function with that property is called submodular, and submodularity is the mathematical backbone of every method below. It is exactly the right shape for "I already have this information, so the marginal worth of more of the same is low."
Submodular set functions have a famous practical gift: you cannot maximize them exactly in reasonable time, but a simple greedy algorithm gets within a guaranteed factor of the optimum. Start with an empty set, and repeatedly add the single item that increases the objective the most, until you hit your budget. For monotone submodular objectives that greedy result is provably within (1 - 1/e), about 63 percent, of the best possible set. Almost every diversity method in production is some version of this greedy loop.
Method 1: Maximal Marginal Relevance
Maximal Marginal Relevance, or MMR, is the original and still the most widely deployed approach. It is the greedy loop made concrete with one tunable knob.
At each step, score every not-yet-selected candidate by trading off how relevant it is to the query against how similar it is to the items already chosen:
mmr_score(c) =
lambda * sim(c, query)
- (1 - lambda) * max over s in selected of sim(c, s)
pick the candidate with the highest mmr_score, add it to selected, repeat.
The first term is plain relevance. The second term is a redundancy penalty: an item that looks just like something already chosen gets docked by its closest match in the current set. The lambda knob slides between the two:
MMR is cheap, it needs only the candidate-to-query similarities you already computed plus pairwise similarities among the few items you select, and it is easy to reason about. Its weakness is that the redundancy term only looks at the single nearest selected item (the max), so it can miss the case where a candidate is moderately similar to several already-chosen items at once. That is the gap the next methods close.
Method 2: Determinantal Point Processes
A determinantal point process, or DPP, models diversity through geometry. Build a kernel matrix L where each entry L[i][j] combines the quality of items i and j with their similarity. The probability of selecting a particular subset is proportional to the determinant of the submatrix for that subset.
Why the determinant? Geometrically the determinant of a set of vectors is the volume of the parallelepiped they span. Two vectors pointing the same way (redundant items) span almost no volume, so the determinant, and therefore the probability, is near zero. Vectors pointing in different directions (diverse items) span a large volume and get high probability. The determinant naturally punishes redundancy across the whole set at once, not just against the nearest neighbor, which is exactly the blind spot MMR has.
L[i][j] = quality(i) * similarity(i, j) * quality(j)
P(subset S) is proportional to determinant( L restricted to rows/cols in S )
high volume = diverse, dissimilar items -> high probability
near-zero volume = redundant items spanning no new space -> near-zero probability
DPPs give a principled, global notion of diversity and you can find a high-probability fixed-size subset with, again, a greedy maximization of the log-determinant. The cost is real: building and operating on the kernel scales poorly as the candidate pool grows, which is why DPPs are usually applied to a few hundred candidates after a cheap first-stage retrieval, never to the whole corpus.
Method 3: Submodular Facility Location and Coverage
A third framing asks for coverage: pick a small set so that every relevant item in the candidate pool is close to something you chose. The classic objective is facility location:
coverage(S) = sum over every candidate c of (
max over s in S of similarity(c, s)
)
Read it as: for each candidate, how well is it represented by its nearest selected item. Maximizing this picks a set of "representatives" that collectively stand in for the whole pool, which is precisely choosing one keyframe per distinct scene instead of eight frames of the same scene. Facility location is monotone submodular, so the same greedy loop with the (1 - 1/e) guarantee applies.
You can fold relevance in by weighting each candidate's contribution by its query relevance, so the representatives you pick are biased toward the parts of the pool that matter to the query. This coverage view is often more natural than MMR for media, because it directly answers "did I capture every distinct thing that happened" rather than "is each new item far from the last."
Method 4: Redundancy-Aware Context Packing for RAG
When the consumer is a language model, diversity is also a budget problem. The prompt holds a fixed number of tokens, and recent token-budgeted RAG work, such as redundancy-aware greedy context selection, optimizes a set-level objective explicitly:
maximize sum of relevance(chunk to query)
- penalty * redundancy(chunk to chunks already packed)
subject to total tokens of selected chunks <= context budget
This is MMR's idea generalized two ways: the constraint is a token budget rather than a fixed count, so a long, dense chunk competes against several short ones, and the redundancy term can be a set-level penalty rather than only the nearest neighbor. The selection is still greedy by marginal gain per token. The result is a context window where every chunk earns its space, which both raises answer quality and cuts cost, because you are not paying to feed the model three phrasings of the same sentence.
Choosing a Method
| Method | Redundancy view | Cost | Best for |
| MMR | Nearest selected item only | Lowest | General reranking, the safe default |
| DPP | Whole-set volume via determinant | High; needs a candidate pool | When subtle multi-way redundancy hurts |
| Facility location | Coverage of the whole pool | Moderate, greedy | Media: one representative per distinct scene or cluster |
| Redundancy-aware packing | Set-level under a token budget | Moderate, greedy | Filling an LLM context window efficiently |
What to Measure
Plain recall and precision at k will not reveal a redundancy problem, because twenty copies of one relevant item still score as twenty relevant results. Measure the set, not the items:
Walk the lambda or penalty knob from pure relevance toward more diversity and stop where aspect recall stops improving, so you add variety only up to the point it still helps.
Doing This in Mixpeek
Diversity is a retriever-stage concern in Mixpeek. You keep your normal relevance retrieval as the first stage to gather candidates, then add a diversity stage that reranks those candidates into a non-redundant final set. Because it operates on the candidate list rather than the whole collection, it composes with whatever feature search or filter stages came before it.
from mixpeek import Mixpeek
client = Mixpeek(api_key="mxp_sk_...")
# Retrieve relevant candidates first, then diversify the final selection.
retriever = client.retrievers.create(
namespace_id="ns_video",
collection_ids=["col_keyframes"],
retriever_name="product-reveals-diverse",
stages=[
# Stage 1: relevance. Gather a generous candidate pool.
{"stage_name": "knn_search", "parameters": {"top_k": 200}},
# Stage 2: diversity. Greedy MMR over the candidates.
{
"stage_name": "diversity_rerank",
"parameters": {
"method": "mmr", # or "facility_location"
"lambda": 0.6, # relevance vs redundancy tradeoff
"similarity_space": "visual_embedding",
"top_k": 10,
},
},
],
)
# A coverage-style query wants high diversity; flip lambda lower.
results = client.retrievers.execute(
retriever_id=retriever.retriever_id,
inputs={"text": "the moment the speaker reveals the product"},
top_k=10,
)
Treat the lambda and the similarity space as versioned config, because they change which evidence an agent sees. Measure aspect recall and intra-set redundancy before and after a change, and pick the embedding for the similarity space, on the Models page, that matches the redundancy you want to suppress: a visual encoder to collapse look-alike frames, a text encoder to collapse paraphrased chunks.