The Problem: One Embedding Size Does Not Fit Every Query
An AI agent searching unstructured content faces a constant tension. To cover a corpus of millions of images, video frames, audio spans, or document chunks, it wants embeddings small enough to store cheaply and scan fast. To make a final, confident decision about which few results to act on, it wants embeddings expressive enough to separate near-duplicates. Those two needs pull in opposite directions.
The naive fix is to pick one dimensionality and live with the compromise: a 1024-dimensional vector that is precise but expensive to store and search at scale, or a 256-dimensional vector that is cheap but blurry. Neither is right for every stage of a search.
Matryoshka Representation Learning (MRL) removes the forced choice. It trains a single model so that the first *k* dimensions of every embedding are themselves a complete, usable embedding. The name comes from Russian nesting dolls: an MRL vector contains smaller vectors inside it. You can keep just the first 64, 128, 256, or 512 dimensions of a 1024-dimensional vector and each prefix still works as a coordinate in a meaningful space. That single property is what lets an agent search coarse first and refine fine later, without re-embedding anything.
This guide explains the real training objective, why truncation works at all, how it compares to alternatives, the cascade pattern agents use, and the practical gotchas that quietly break naive implementations.
The Core Idea: Coarse-to-Fine Information Packing
A normal embedding model spreads information across all dimensions with no ordering guarantee. Dimension 900 might carry as much signal as dimension 5. If you truncate that vector, you discard random pieces of meaning and the result is degraded in unpredictable ways.
MRL changes the objective so the model is forced to pack the most important, most general information into the front of the vector and progressively finer detail toward the back. The result is a single vector you can read at multiple resolutions:
1. First 64 dims: coarse semantics. Enough to tell a cat photo from a car photo, or a refund email from an invoice.
2. First 128–256 dims: mid resolution. Enough to separate breeds of cat, or distinguish a refund request from a refund confirmation.
3. Full 1024 dims: fine resolution. Enough to rank near-duplicates and resolve subtle distinctions an agent's final answer depends on.
The mental model: a normal embedding is a single photograph at one fixed resolution. An MRL embedding is a progressive JPEG. The first bytes already give you a recognizable thumbnail, and every additional byte sharpens it. You decide how much to load based on what the moment requires.
The Training Objective
The mechanism is a modified loss function, not a special architecture. You take an ordinary encoder (a transformer, a vision encoder, a multimodal model) that outputs a *d*-dimensional vector. You then choose a set of nested prefix sizes — a common choice is powers of two such as `{64, 128, 256, 512, 1024}`.
During training, for a single forward pass producing one *d*-dimensional output, you compute the training loss (for example, a contrastive or classification loss) independently at each prefix size, then sum them:
```
L_matryoshka(x) =
    sum over m in {64, 128, 256, 512, 1024} of
        w_m * L_task( truncate(f(x), m) )
```
Here `f(x)` is the full embedding, `truncate(v, m)` keeps the first *m* dimensions, `L_task` is the original loss (often computed with its own small classifier head or against the same contrastive targets), and `w_m` are per-prefix weights (frequently all equal to 1).
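As a rough illustration of that sum, the sketch below evaluates a Matryoshka-style loss for one training example in plain Python. It assumes an InfoNCE-style contrastive loss as `L_task`; the function names (`contrastive_loss`, `matryoshka_loss`) and the temperature value are illustrative choices, not taken from any particular library. A real training loop would compute the same quantity inside an autodiff framework so gradients flow back through the encoder.

```python
import math

PREFIX_SIZES = (64, 128, 256, 512, 1024)


def truncate(vec, m):
    """Keep the first m dimensions of an embedding."""
    return vec[:m]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def contrastive_loss(anchor, positive, negatives, temperature=0.05):
    """InfoNCE-style loss: negative log-softmax of the positive's similarity."""
    a = l2_normalize(anchor)
    logits = [dot(a, l2_normalize(positive)) / temperature]
    logits += [dot(a, l2_normalize(n)) / temperature for n in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[0]


def matryoshka_loss(anchor, positive, negatives,
                    prefix_sizes=PREFIX_SIZES, weights=None):
    """Weighted sum of the task loss evaluated at every nested prefix."""
    weights = weights or {m: 1.0 for m in prefix_sizes}
    total = 0.0
    for m in prefix_sizes:
        total += weights[m] * contrastive_loss(
            truncate(anchor, m),
            truncate(positive, m),
            [truncate(n, m) for n in negatives],
        )
    return total
```

All prefixes are slices of the same vectors, so the encoder runs once per example; only the loss is evaluated several times.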
The key consequences:
Why Truncation Works: Variance Lands at the Front
Because the smallest prefix must solve the task alone, the optimizer is pushed to concentrate the highest-variance, most discriminative directions into the earliest dimensions. The front of the vector ends up carrying the components that most separate items; the tail carries diminishing, more specialized refinements.
This is conceptually similar to how PCA orders components by explained variance — except MRL bakes the ordering into the learned representation through the loss, so the prefixes are optimized end-to-end for the actual retrieval task rather than for reconstruction of the input distribution. The practical payoff is graceful degradation: a 128-dim prefix of an MRL model loses only a little accuracy versus the full 1024, while a 128-dim truncation of a non-MRL model can be near-useless.
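One rough way to inspect this front-loading on your own embeddings is to measure how much of the total per-dimension variance each prefix captures. The sketch below is a diagnostic under that assumption only: variance share is a proxy, not a measure of retrieval quality, and the helper names are illustrative.

```python
def per_dimension_variance(embeddings):
    """Population variance of each dimension across a list of vectors."""
    n = len(embeddings)
    dims = len(embeddings[0])
    means = [sum(v[d] for v in embeddings) / n for d in range(dims)]
    return [
        sum((v[d] - means[d]) ** 2 for v in embeddings) / n
        for d in range(dims)
    ]


def prefix_variance_share(embeddings, prefix_sizes):
    """Fraction of total variance carried by the first m dims, for each m."""
    variances = per_dimension_variance(embeddings)
    total = sum(variances)
    if total == 0:
        raise ValueError("embeddings have zero variance")
    return {m: sum(variances[:m]) / total for m in prefix_sizes}
```

If the share climbs steeply at small prefixes, that is consistent with importance-ordered dimensions; a flat, roughly linear climb suggests the vector is not ordered and should not be truncated without checking retrieval metrics.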
How MRL Compares to the Alternatives
It is worth being precise about what MRL replaces and what it does not.
| Approach | What it does | Cost | Downside |
| --- | --- | --- | --- |
| Train N models, one per dim | Separate 128-d, 256-d, 1024-d models | N times the training and serving footprint; N inference passes if you want multiple sizes | Wasteful; vectors from different models are not comparable, so you cannot mix resolutions in one index |
| PCA / post-hoc reduction | Fit a projection after training, then reduce | Cheap to apply | Projection is fit to a distribution, not the retrieval task; quality drops faster than MRL and you must store/version the projection matrix |
| Plain quantization | Reduce bits per dimension (float32 to int8 or binary) | Very cheap | Orthogonal to dimensionality — it shrinks each number, not the count of numbers |
| MRL truncation | Slice the first *m* dims of one trained vector | One model, one inference pass, free truncation | Requires the model to be MRL-trained; truncating a non-MRL vector destroys it |
MRL versus PCA. Both produce a lower-dimensional vector ordered by importance. But PCA optimizes for reconstructing the variance of the embedding distribution, while MRL optimizes the prefixes directly against the downstream retrieval objective. Empirically MRL prefixes retain retrieval accuracy better at aggressive truncation, and they need no separate projection step or stored matrix at query time — you just keep the first *m* numbers.
MRL versus quantization. These solve different axes and compose. Dimensionality reduction (MRL) cuts the *number* of dimensions; quantization cuts the *bits per* dimension. You can take a 1024-dim float32 MRL vector, truncate to 256 dims, then quantize those 256 dims to int8 — stacking both savings. For the bit-level half of this picture, see the companion guide on embedding quantization and compression. The two techniques are most powerful used together in a storage-tiering strategy.
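A minimal sketch of stacking the two savings follows. It assumes symmetric per-vector scalar quantization to int8, which is one common scheme among several; the function names are illustrative and the byte counts ignore the small per-vector scale factor.

```python
import math


def truncate_and_normalize(vec, m):
    """Keep the first m dims, then L2-normalize the prefix."""
    prefix = vec[:m]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix] if norm > 0 else prefix


def quantize_int8(vec):
    """Symmetric scalar quantization to integer codes in [-127, 127]."""
    scale = max(abs(x) for x in vec) / 127.0 or 1.0  # avoid a zero scale
    return [round(x / scale) for x in vec], scale


def dequantize(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]


def bytes_per_vector(dims, bytes_per_dim):
    return dims * bytes_per_dim


full_bytes = bytes_per_vector(1024, 4)    # float32, full dimension
compact_bytes = bytes_per_vector(256, 1)  # int8, 256-dim prefix
print(full_bytes, compact_bytes, full_bytes // compact_bytes)
# prints: 4096 256 16
```

Truncation contributes 4x and quantization another 4x, which is why the two are usually planned together rather than treated as alternatives.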
The Adaptive Retrieval Cascade
The reason agents care about MRL is the retrieve-coarse, rerank-fine cascade. Instead of running one expensive search over full-dimension vectors, the agent runs a cheap first pass to build a shortlist, then spends precision only on that shortlist.
1. Index the full corpus at a small prefix. Store, say, a 128-dim truncation of every item in the ANN index used for the first-pass scan. This is the index the agent searches against the whole corpus.
2. Coarse shortlist. Embed the query, truncate the query vector to the same 128 dims, and run approximate nearest neighbor search to retrieve a candidate set (for example, the top 200 of 10 million items). This pass is fast and memory-light because the vectors are small.
3. Fine rerank. Fetch the full 1024-dim vectors for only those 200 candidates and re-score them with the query's full-dim vector. Reorder, then return the final top 10.
The agent pays full-dimension cost on 200 items, not 10 million. Recall stays high because the coarse pass is good enough to keep the truly relevant items inside the 200-candidate net, and the fine pass restores the precision needed to rank within it.
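The three steps can be sketched in a few lines. This version uses a brute-force scan as a stand-in for the ANN index and plain dictionaries as a stand-in for vector storage, so it only illustrates the control flow; `build_coarse_index` and `cascade_search` are hypothetical names.

```python
import math


def normalized_prefix(vec, m):
    """Truncate to the first m dims, then L2-normalize."""
    prefix = vec[:m]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix] if norm > 0 else prefix


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_ids(scores, k):
    """Ids of the k highest-scoring items, best first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [item_id for item_id, _ in ranked[:k]]


def build_coarse_index(full_vectors, coarse_dim):
    """Step 1: index only the truncated, renormalized prefix of every item."""
    return {i: normalized_prefix(v, coarse_dim) for i, v in full_vectors.items()}


def cascade_search(query, coarse_index, full_vectors,
                   coarse_dim, shortlist_k, final_k):
    # Step 2: coarse shortlist over the whole corpus at the small prefix.
    q_coarse = normalized_prefix(query, coarse_dim)
    coarse_scores = {i: dot(q_coarse, v) for i, v in coarse_index.items()}
    shortlist = top_ids(coarse_scores, shortlist_k)

    # Step 3: fetch full vectors for the shortlist only and re-score.
    full_dim = len(query)
    q_full = normalized_prefix(query, full_dim)
    fine_scores = {
        i: dot(q_full, normalized_prefix(full_vectors[i], full_dim))
        for i in shortlist
    }
    return top_ids(fine_scores, final_k)
```

In a real system the dictionary scan in step 2 would be an ANN query and the lookup in step 3 would be a batched fetch from wherever the full vectors live.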
A Worked Numeric Example
Take a corpus of 10,000,000 items embedded at 1024 dimensions in float32 (4 bytes per dimension).
Storage. Full vectors cost `10,000,000 × 1024 × 4 bytes ≈ 41 GB`. A 128-dim prefix index costs `10,000,000 × 128 × 4 bytes ≈ 5.1 GB` — an 8x reduction in the hot index that has to live in fast memory. You can keep the full 41 GB on a cheaper tier (see vector storage tiering) and only read the handful of full vectors you actually rerank.
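The storage arithmetic is easy to reproduce for your own corpus size and dimensions. The helper below is a trivial sketch that counts raw vector bytes only and uses decimal gigabytes; index overhead and metadata are not included.

```python
def index_bytes(num_items, dims, bytes_per_dim=4):
    """Raw vector storage, excluding index overhead and metadata."""
    return num_items * dims * bytes_per_dim


full = index_bytes(10_000_000, 1024)
coarse = index_bytes(10_000_000, 128)
print(f"full: {full / 1e9:.2f} GB, coarse: {coarse / 1e9:.2f} GB, "
      f"ratio: {full / coarse:.0f}x")
# prints: full: 40.96 GB, coarse: 5.12 GB, ratio: 8x
```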
Latency. Brute-force scan cost scales with dimensionality. A single-pass full-dim scan does work proportional to `10,000,000 × 1024`. The cascade does `10,000,000 × 128` for the coarse pass plus `200 × 1024` for the rerank:
```
Single full-dim pass:  10,000,000 × 1024 = 10,240,000,000 units
Cascade coarse pass:   10,000,000 × 128  =  1,280,000,000 units
Cascade rerank:               200 × 1024 =        204,800 units
Cascade total:                           ≈  1,280,204,800 units
Speedup: 10,240,000,000 / 1,280,204,800 ≈ 8x
```
(Real ANN indexes are sublinear, so absolute numbers differ, but the *ratio* between coarse and fine work is what drives the savings, and that ratio is set by the prefix size you choose.) The cost/recall knobs are the prefix size and the candidate count. A smaller prefix makes the coarse scan cheaper but noisier, so it needs a larger shortlist to keep the relevant items in the candidate set; a larger prefix costs more per comparison but reaches the same recall target with a shorter shortlist. Tune the prefix and `top_k` together against a labeled eval set rather than guessing.
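To see how the two knobs interact, the same brute-force cost model can be swept over a few settings. This is a sketch of the simplified "items × dimensions" model used in this example, not a latency prediction for any real index.

```python
def full_scan_cost(num_items, full_dim):
    return num_items * full_dim


def cascade_cost(num_items, coarse_dim, full_dim, shortlist_k):
    """Coarse scan over everything plus a full-dim rerank of the shortlist."""
    return num_items * coarse_dim + shortlist_k * full_dim


def speedup(num_items, coarse_dim, full_dim, shortlist_k):
    return full_scan_cost(num_items, full_dim) / cascade_cost(
        num_items, coarse_dim, full_dim, shortlist_k
    )


for coarse_dim in (64, 128, 256):
    for shortlist_k in (200, 2000):
        ratio = speedup(10_000_000, coarse_dim, 1024, shortlist_k)
        print(f"prefix={coarse_dim:>3} shortlist={shortlist_k:>4} "
              f"speedup={ratio:.2f}x")
```

Under this model the rerank term stays negligible at these shortlist sizes, so the prefix size dominates the cost and the shortlist can be enlarged to protect recall at little expense.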
For the index internals that make the coarse pass sublinear, see approximate nearest neighbor algorithms. For the broader pattern of staging cheap-then-expensive retrieval, see multi-stage retrieval: how agents search unstructured data.
Why This Matters for Multimodal and Agent Perception
The cascade is useful for text, but it is *decisive* for multimodal corpora, where vector counts and dimensions are both large.
Models That Ship Nested Embeddings (as of 2026)
MRL has moved from a research idea into mainstream production models. Verify the exact prefix sizes against each provider's current card before relying on a specific number, but these families are known to support truncatable / nested embeddings:
If you are unsure whether a given checkpoint was MRL-trained, treat it as not truncatable until the model card says otherwise. For broader comparisons see the curated lists of best multimodal embedding models and best self-hosted embedding models.
Practical Gotchas
These are the failure modes that quietly break naive MRL implementations.
1. Normalize After Truncation, Not Before
If your similarity metric is cosine, or you rely on unit vectors so that dot product equals cosine, you must L2-normalize the truncated prefix, not the full vector. A vector that is unit-length at 1024 dims is generally not unit-length after you slice it to 128 dims, because you dropped part of its magnitude. The correct order is: truncate first, then normalize the prefix. Normalizing the full vector and then truncating leaves prefixes with different norms, so an inner-product index no longer ranks by cosine and scores are biased toward items that happen to keep more of their magnitude in the front. Apply the same truncate-then-normalize order to both the indexed items and the query so they live in the same space.
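A small numeric check makes the ordering concrete. The 4-dim vector below is a toy chosen so the norms come out as round numbers; the helper names are illustrative.

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def l2_normalize(vec):
    norm = l2_norm(vec)
    return [x / norm for x in vec] if norm > 0 else list(vec)


def truncate_then_normalize(vec, m):
    """Correct order: slice the prefix first, then rescale it to unit length."""
    return l2_normalize(vec[:m])


raw = [3.0, 4.0, 12.0, 84.0]  # norm is 85

# Wrong order: normalize the full vector, then slice.
sliced = l2_normalize(raw)[:2]
print(round(l2_norm(sliced), 4))   # prints: 0.0588

# Right order: slice, then normalize the prefix.
prefix = truncate_then_normalize(raw, 2)
print(round(l2_norm(prefix), 4))   # prints: 1.0
```

The wrongly ordered prefix keeps only 5/85 of the unit length, so a dot product against it would be scaled down by an amount that differs from item to item.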
2. Not All Models Are MRL-Trained
Truncating a non-MRL embedding does not give you a smaller usable embedding. Because the information was never ordered front-to-back, the slice discards an arbitrary share of the signal and retrieval quality can collapse. Before you truncate anything, confirm the model was trained with a Matryoshka objective (check the model card or the provider's `dimensions` parameter docs). If it was not, you must either keep the full vector or use a proper learned reduction; a raw slice will silently wreck recall.
3. Choosing the Prefix Size
There is no universal prefix. The right size depends on corpus size, the hardness of the queries, and your recall target. A practical procedure:
1. Build a labeled eval set of queries with known relevant items.
2. Sweep candidate prefix sizes (64, 128, 256, 512) for the coarse pass.
3. For each, measure recall@k of the coarse shortlist (does the relevant item survive into the candidate set?) and end-to-end precision after the full-dim rerank.
4. Pick the smallest prefix that keeps coarse recall above your threshold. Smaller prefix means cheaper index and faster scan; you recover precision in the rerank.
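The coarse-recall half of that procedure might look like the sketch below. It assumes one known relevant item per query and uses a brute-force scan in place of the ANN index; the function names and the data layout (dictionaries keyed by id) are illustrative.

```python
import math


def normalized_prefix(vec, m):
    prefix = vec[:m]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix] if norm > 0 else prefix


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def coarse_recall_at_k(queries, relevant, corpus, prefix, k):
    """Fraction of queries whose relevant item survives into the top-k shortlist."""
    index = {i: normalized_prefix(v, prefix) for i, v in corpus.items()}
    hits = 0
    for query_id, query_vec in queries.items():
        q = normalized_prefix(query_vec, prefix)
        shortlist = sorted(index, key=lambda i: dot(q, index[i]), reverse=True)[:k]
        if relevant[query_id] in shortlist:
            hits += 1
    return hits / len(queries)


def smallest_prefix_meeting(queries, relevant, corpus, prefixes, k, threshold):
    """Smallest prefix whose coarse recall@k reaches the threshold, else None."""
    for m in sorted(prefixes):
        if coarse_recall_at_k(queries, relevant, corpus, m, k) >= threshold:
            return m
    return None
```

Running the same sweep at several values of `k` shows how much shortlist size can compensate for a smaller prefix before you measure end-to-end precision after the rerank.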
4. The ANN Index Is Built on the Truncated Dimension
This is the subtle one. Your approximate nearest neighbor index (its graph or partitions, its distance computations) is built over the vectors you actually insert. For the coarse pass, that means the index is built on the truncated dimension. You do not index full vectors and then ask the index for a 128-dim answer; you index 128-dim vectors. The full-dim vectors are stored separately (often on a cheaper tier) and fetched only for rerank. Keep the index dimension, the stored-vector dimension, and the rerank dimension explicitly distinct in your design so you never accidentally search the wrong resolution. The adaptive-indexing guide covers how to manage this when the agent changes resolution per query.
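One lightweight way to keep the three dimensions distinct is to name them in a single config object and validate vectors at the boundaries. The `CascadeDims` class below is a hypothetical sketch, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CascadeDims:
    index_dim: int   # dimension of the vectors inserted into the ANN index
    stored_dim: int  # dimension of the full vectors kept on the cheaper tier
    rerank_dim: int  # dimension used to re-score the shortlist

    def __post_init__(self):
        # A prefix can only be cut from a vector at least as long as itself.
        if not 0 < self.index_dim <= self.rerank_dim <= self.stored_dim:
            raise ValueError(
                "expected 0 < index_dim <= rerank_dim <= stored_dim, got "
                f"{self.index_dim}, {self.rerank_dim}, {self.stored_dim}"
            )

    def check_index_vector(self, vec):
        if len(vec) != self.index_dim:
            raise ValueError(f"index expects {self.index_dim} dims, got {len(vec)}")

    def check_rerank_vector(self, vec):
        if len(vec) != self.rerank_dim:
            raise ValueError(f"rerank expects {self.rerank_dim} dims, got {len(vec)}")


dims = CascadeDims(index_dim=128, stored_dim=1024, rerank_dim=1024)
```

Calling `check_index_vector` before every insert and every coarse query turns a silent wrong-resolution search into an immediate error.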
Mapping This to Mixpeek
Mixpeek expresses the coarse-to-fine cascade as a multi-stage retriever: a cheap `feature_search` stage produces a shortlist over the full corpus, and a higher-dimension stage reranks that shortlist. You index the truncated dimension for the first stage and keep the full vectors available for the rerank, exactly as the pattern above prescribes.
```shell
pip install mixpeek
```

```python
from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_API_KEY")

# Index a large multimodal corpus with an MRL-capable embedding model.
# The coarse index stores a truncated prefix; full vectors stay available
# for reranking the shortlist.
collection = mx.collections.create(
    collection_name="media_library",
    source={"type": "bucket", "uri": "s3://brand-media-library"},
    feature_extractors=[
        {
            "feature": "visual_embedding",
            "model": "nomic-ai/nomic-embed-vision",
            # Coarse prefix for the first-pass index; full vector retained.
            "matryoshka_dims": [128, 1024],
        },
    ],
)

# Build a retriever that shortlists coarse, then reranks fine.
retriever = mx.retrievers.create(
    collection_id=collection.id,
    stages=[
        # Stage 1: cheap coarse shortlist over the whole corpus (128-d prefix).
        {
            "type": "feature_search",
            "feature": "visual_embedding",
            "dimensions": 128,
            "top_k": 200,
        },
        # Stage 2: precise rerank of the 200 candidates at full dimension.
        {
            "type": "feature_search",
            "feature": "visual_embedding",
            "dimensions": 1024,
            "top_k": 10,
        },
    ],
)

# An agent runs fast-then-precise search with a single call.
results = mx.retrievers.execute(
    retriever_id=retriever.id,
    query="overhead shot of a product on a marble kitchen counter",
    return_fields=["asset_id", "timestamps", "keyframe_url", "score"],
)
```
The agent gets a fast first pass over millions of vectors at 128 dimensions, then spends full-dimension precision on only the 200 it shortlisted — the same 8x-class savings worked through above, expressed declaratively. To configure which feature dimensions are extracted and stored, see extractors; to store and query your own pre-computed MRL vectors with object-storage economics, see MVS. For cost planning across resolutions and tiers, see pricing and the docs.