Matryoshka Representation Learning: Nested Embeddings for Adaptive Multimodal Retrieval

The Problem: One Embedding Size Does Not Fit Every Query

An AI agent searching unstructured content faces a constant tension. To cover a corpus of millions of images, video frames, audio spans, or document chunks, it wants embeddings small enough to store cheaply and scan fast. To make a final, confident decision about which few results to act on, it wants embeddings expressive enough to separate near-duplicates. Those two needs pull in opposite directions.

The naive fix is to pick one dimensionality and live with the compromise: a 1024-dimensional vector that is precise but expensive to store and search at scale, or a 256-dimensional vector that is cheap but blurry. Neither is right for every stage of a search.

Matryoshka Representation Learning (MRL) removes the forced choice. It trains a single model so that the first *k* dimensions of every embedding are themselves a complete, usable embedding. Named after Russian nesting dolls, an MRL vector contains smaller vectors inside it. You can slice the first 64, 128, 256, or 512 dimensions off a 1024-dimensional vector and each prefix still works as a coordinate in a meaningful space. That single property is what lets an agent search coarse first and refine fine later, without re-embedding anything.

This guide explains the real training objective, why truncation works at all, how it compares to alternatives, the cascade pattern agents use, and the practical gotchas that quietly break naive implementations.

The Core Idea: Coarse-to-Fine Information Packing

A normal embedding model spreads information across all dimensions with no ordering guarantee. Dimension 900 might carry as much signal as dimension 5. If you truncate that vector, you discard random pieces of meaning and the result is degraded in unpredictable ways.

MRL changes the objective so the model is forced to pack the most important, most general information into the front of the vector and progressively finer detail toward the back. The result is a single vector you can read at multiple resolutions:

1. First 64 dims: coarse semantics. Enough to tell a cat photo from a car photo, or a refund email from an invoice.
2. First 128–256 dims: mid resolution. Enough to separate breeds of cat, or distinguish a refund request from a refund confirmation.
3. Full 1024 dims: fine resolution. Enough to rank near-duplicates and resolve subtle distinctions an agent's final answer depends on.

The mental model: a normal embedding is a single photograph at one fixed resolution. An MRL embedding is a progressive JPEG. The first bytes already give you a recognizable thumbnail, and every additional byte sharpens it. You decide how much to load based on what the moment requires.

The Training Objective

The mechanism is a modified loss function, not a special architecture. You take an ordinary encoder (a transformer, a vision encoder, a multimodal model) that outputs a *d*-dimensional vector. You then choose a set of nested prefix sizes — a common choice is powers of two such as {64, 128, 256, 512, 1024}.

During training, for a single forward pass producing one *d*-dimensional output, you compute the training loss (for example, a contrastive or classification loss) independently at each prefix size, then sum them:

L_matryoshka(x) =
  sum over m in {64, 128, 256, 512, 1024} of
    w_m * L_task( truncate(f(x), m) )

Here f(x) is the full embedding, truncate(v, m) keeps the first *m* dimensions, L_task is the original loss (often computed with its own small classifier head or against the same contrastive targets), and w_m are per-prefix weights (frequently all equal to 1).
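The sum over prefixes can be sketched in a few lines of NumPy. This is a minimal illustration, not a training recipe: it assumes an InfoNCE-style contrastive loss as L_task (one of the options mentioned above), the function names are hypothetical, and NumPy only computes the forward value. Real training would run the same arithmetic inside an autodiff framework so gradients reach the encoder.

```python
import numpy as np

PREFIX_SIZES = (64, 128, 256, 512, 1024)  # nested prefix sizes from the text


def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale each row to unit length."""
    return v / np.linalg.norm(v, axis=1, keepdims=True)


def info_nce(queries: np.ndarray, items: np.ndarray, temperature: float = 0.05) -> float:
    """Contrastive loss where row i of `queries` is the positive for row i of `items`."""
    logits = l2_normalize(queries) @ l2_normalize(items).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def matryoshka_loss(queries, items, prefix_sizes=PREFIX_SIZES, weights=None) -> float:
    """Sum the task loss over every nested prefix of the same embeddings."""
    weights = weights or {m: 1.0 for m in prefix_sizes}  # w_m = 1 by default
    return sum(
        weights[m] * info_nce(queries[:, :m], items[:, :m])  # truncate(f(x), m)
        for m in prefix_sizes
    )


# Synthetic stand-in for one batch of encoder outputs (not real embeddings).
rng = np.random.default_rng(0)
items = rng.normal(size=(32, 1024))
queries = items + 0.1 * rng.normal(size=(32, 1024))  # noisy copies act as positives
total = matryoshka_loss(queries, items)
```

Note that every term reads a slice of the same `queries` and `items` arrays, which is why a single forward pass is enough.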

The key consequences:

The gradient from the 64-dim loss flows only through the first 64 dimensions, the 128-dim loss through the first 128, and so on. The front dimensions receive gradient from every prefix term, so they are optimized to be useful on their own. Back dimensions are optimized only by the larger prefixes, so they specialize in the residual detail the smaller prefixes could not capture.

You train once. There is no separate model per dimension. The full vector and all its prefixes come out of the same forward pass.

Inference is unchanged. The model emits the full vector; truncation happens downstream at index time or query time. There is no extra inference cost to support multiple resolutions.

Why Truncation Works: Variance Lands at the Front

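Because the smallest prefix must solve the task alone, the optimizer is pushed to concentrate the most discriminative directions, loosely the ones carrying the most task-relevant variance, into the earliest dimensions. The objective never measures variance directly; the ordering is a side effect of asking every prefix to work by itself. The front of the vector ends up carrying the components that most separate items; the tail carries diminishing, more specialized refinements.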

This is conceptually similar to how PCA orders components by explained variance — except MRL bakes the ordering into the learned representation through the loss, so the prefixes are optimized end-to-end for the actual retrieval task rather than for reconstruction of the input distribution. The practical payoff is graceful degradation: a 128-dim prefix of an MRL model loses only a little accuracy versus the full 1024, while a 128-dim truncation of a non-MRL model can be near-useless.

How MRL Compares to the Alternatives

It is worth being precise about what MRL replaces and what it does not.

| Approach | What it does | Cost | Downside |
| --- | --- | --- | --- |
| Train N models, one per dim | Separate 128-d, 256-d, 1024-d models | N times the training and serving footprint; N inference passes if you want multiple sizes | Wasteful; vectors from different models are not comparable, so you cannot mix resolutions in one index |
| PCA / post-hoc reduction | Fit a projection after training, then reduce | Cheap to apply | Projection is fit to a distribution, not the retrieval task; quality drops faster than MRL and you must store/version the projection matrix |
| Plain quantization | Reduce bits per dimension (float32 to int8 or binary) | Very cheap | Orthogonal to dimensionality: it shrinks each number, not the count of numbers |
| MRL truncation | Slice the first m dims of one trained vector | One model, one inference pass, free truncation | Requires the model to be MRL-trained; truncating a non-MRL vector destroys it |

Two clarifications matter for agents.

MRL versus PCA. Both produce a lower-dimensional vector ordered by importance. But PCA optimizes for reconstructing the variance of the embedding distribution, while MRL optimizes the prefixes directly against the downstream retrieval objective. Empirically MRL prefixes retain retrieval accuracy better at aggressive truncation, and they need no separate projection step or stored matrix at query time — you just keep the first *m* numbers.

MRL versus quantization. These solve different axes and compose. Dimensionality reduction (MRL) cuts the *number* of dimensions; quantization cuts the *bits per* dimension. You can take a 1024-dim float32 MRL vector, truncate to 256 dims, then quantize those 256 dims to int8 — stacking both savings. For the bit-level half of this picture, see the companion guide on embedding quantization and compression. The two techniques are most powerful used together in a storage-tiering strategy.
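To make the stacking concrete, here is a rough sketch that truncates a float32 matrix to 256 dims and then quantizes it to int8. The quantizer is an assumption chosen for brevity (symmetric, one shared scale); production systems often use per-dimension or calibrated scales. The data is random and only illustrates the byte counts.

```python
import numpy as np


def truncate_and_normalize(vectors: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` columns, then L2-normalize each row."""
    prefix = vectors[:, :dims]
    return prefix / np.linalg.norm(prefix, axis=1, keepdims=True)


def quantize_int8(vectors: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric scalar quantization with a single shared scale (illustrative only)."""
    scale = float(np.abs(vectors).max()) / 127.0
    codes = np.clip(np.round(vectors / scale), -127, 127).astype(np.int8)
    return codes, scale


# Synthetic stand-in for 1,000 full-dimension float32 embeddings.
rng = np.random.default_rng(0)
full = rng.normal(size=(1000, 1024)).astype(np.float32)

prefix = truncate_and_normalize(full, 256)  # 4x fewer dimensions
codes, scale = quantize_int8(prefix)         # 4x fewer bytes per dimension

print(full.nbytes, prefix.nbytes, codes.nbytes)  # 4096000 1024000 256000
```

The two reductions multiply: 4x from dimensions times 4x from bit width gives a 16x smaller footprint for this array.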

The Adaptive Retrieval Cascade

The reason agents care about MRL is the retrieve-coarse, rerank-fine cascade. Instead of running one expensive search over full-dimension vectors, the agent runs a cheap first pass to build a shortlist, then spends precision only on that shortlist.

1. Index the full corpus at a small prefix. Store, say, a 128-dim truncation of every item in the ANN index used for the first-pass scan. This is the index the agent searches against the whole corpus.
2. Coarse shortlist. Embed the query, truncate the query vector to the same 128 dims, and run approximate nearest neighbor search to retrieve a candidate set, for example the top 200 of 10 million items. This pass is fast and memory-light because the vectors are small.
3. Fine rerank. Fetch the full 1024-dim vectors for only those 200 candidates and re-score them with the query's full-dim vector. Reorder, then return the final top 10.

The agent pays full-dimension cost on 200 items, not 10 million. Recall stays high because the coarse pass is good enough to keep the truly relevant items inside the 200-candidate net, and the fine pass restores the precision needed to rank within it.
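The three steps can be sketched with NumPy alone. This is a simplified illustration under stated assumptions: a brute-force matrix product stands in for the ANN index, the vectors are random placeholders rather than real MRL embeddings, and `cascade_search` is a hypothetical helper name.

```python
import numpy as np


def normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def cascade_search(query, corpus_full, coarse_index,
                   coarse_dims=128, shortlist=200, top_k=10):
    """Coarse scan on a truncated prefix, then full-dimension rerank."""
    # Stage 1: score every item against the truncated, renormalized query.
    q_coarse = normalize(query[:coarse_dims])
    coarse_scores = coarse_index @ q_coarse
    candidates = np.argpartition(-coarse_scores, shortlist - 1)[:shortlist]

    # Stage 2: fetch full vectors for the shortlist only and re-score them.
    fine_scores = normalize(corpus_full[candidates]) @ normalize(query)
    order = np.argsort(-fine_scores)[:top_k]
    return candidates[order], fine_scores[order]


# Synthetic stand-in data (random vectors, not real MRL embeddings).
rng = np.random.default_rng(0)
corpus_full = rng.normal(size=(5000, 1024)).astype(np.float32)
coarse_index = normalize(corpus_full[:, :128])  # truncate, then normalize
query = corpus_full[42] + 0.1 * rng.normal(size=1024).astype(np.float32)

ids, scores = cascade_search(query, corpus_full, coarse_index)
```

Only `coarse_index` is touched for all 5,000 items; `corpus_full` is read for the 200 shortlisted rows, which is the access pattern that lets the full vectors live on a cheaper tier.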

A Worked Numeric Example

Take a corpus of 10,000,000 items embedded at 1024 dimensions in float32 (4 bytes per dimension).

Storage. Full vectors cost 10,000,000 × 1024 × 4 bytes ≈ 41 GB. A 128-dim prefix index costs 10,000,000 × 128 × 4 bytes ≈ 5.1 GB — an 8x reduction in the hot index that has to live in fast memory. You can keep the full 41 GB on a cheaper tier (see vector storage tiering) and only read the handful of full vectors you actually rerank.

Latency. Brute-force scan cost scales with dimensionality. A single-pass full-dim scan does work proportional to 10,000,000 × 1024. The cascade does 10,000,000 × 128 for the coarse pass plus 200 × 1024 for the rerank:

Single full-dim pass:  10,000,000 × 1024        = 10,240,000,000 units
Cascade coarse pass:   10,000,000 ×  128         =  1,280,000,000 units
Cascade rerank:               200 × 1024         =        204,800 units
Cascade total:                                   ≈  1,280,204,800 units
Speedup:               10,240,000,000 / 1,280,204,800  ≈ 8x

(Real ANN indexes are sublinear, so absolute numbers differ, but the *ratio* between coarse and fine work is what drives the savings, and that ratio is set by the prefix size you choose.) The cost/recall knobs are the prefix size and the candidate count. A smaller prefix makes the scan cheaper but the coarse ranking noisier, so it needs a larger shortlist to keep relevant items in the candidate set; a larger prefix costs more per comparison but reaches the same recall target with a smaller shortlist. Tune the prefix and top_k together against a labeled eval set rather than guessing.
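The arithmetic above is easy to re-run for other corpus sizes or prefixes. The sketch below reproduces the worked numbers under the same brute-force assumption; the helper names are hypothetical.

```python
def storage_bytes(n_items: int, dims: int, bytes_per_dim: int = 4) -> int:
    """Raw vector storage, ignoring index overhead."""
    return n_items * dims * bytes_per_dim


def scan_units(n_items: int, dims: int) -> int:
    """Brute-force work, proportional to items times dimensions."""
    return n_items * dims


N, FULL_DIMS, COARSE_DIMS, SHORTLIST = 10_000_000, 1024, 128, 200

full_gb = storage_bytes(N, FULL_DIMS) / 1e9
coarse_gb = storage_bytes(N, COARSE_DIMS) / 1e9

single_pass = scan_units(N, FULL_DIMS)
cascade = scan_units(N, COARSE_DIMS) + scan_units(SHORTLIST, FULL_DIMS)

print(f"{full_gb:.2f} GB full, {coarse_gb:.2f} GB coarse")  # 40.96 GB full, 5.12 GB coarse
print(f"{cascade:,} cascade units")                         # 1,280,204,800 cascade units
print(f"{single_pass / cascade:.2f}x speedup")              # 8.00x speedup
```

Changing `COARSE_DIMS` or `SHORTLIST` shows how little the rerank term matters until the shortlist grows by several orders of magnitude.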

For the index internals that make the coarse pass sublinear, see approximate nearest neighbor algorithms. For the broader pattern of staging cheap-then-expensive retrieval, see multi-stage retrieval: how agents search unstructured data.

Why This Matters for Multimodal and Agent Perception

The cascade is useful for text, but it is *decisive* for multimodal corpora, where vector counts and dimensions are both large.

Images and video. A single hour of video segmented into frames or shots can produce thousands of vectors. A large media library can produce billions. Storing every frame at full dimension is the dominant cost in many perception systems. Truncating the first-pass index slashes that cost while the full vectors stay available for the few segments an agent reranks.

Audio. Long-form audio split into spans behaves like video — many vectors per asset — and benefits from the same coarse index.

Agents that need a fast first look. An agent answering a question rarely needs maximum precision on its first move. It needs a fast, cheap survey of what exists, then precision on the narrow set it decides to inspect. MRL maps directly onto that loop: coarse scan to orient, fine rerank to commit. This pairs naturally with budget-aware and adaptive strategies — see budget-aware multi-vector retrieval and adaptive indexing for agentic search.

Models That Ship Nested Embeddings (as of 2026)

MRL has moved from a research idea into mainstream production models. Verify the exact prefix sizes against each provider's current card before relying on a specific number, but these families are known to support truncatable / nested embeddings:

OpenAI text-embedding-3 (small and large). Expose a dimensions parameter that lets you request shorter vectors trained to remain useful when truncated.

Nomic Embed. Released as an open model explicitly trained with Matryoshka so you can truncate the output.

Jina embeddings (v3 and later). Support Matryoshka-style truncation of the output dimensionality.

Google EmbeddingGemma. Ships nested-embedding support, letting you trade dimensionality for footprint.

Alibaba GTE / Qwen3-Embedding family. Models in this line support reduced output dimensions via MRL-style training.

If you are unsure whether a given checkpoint was MRL-trained, treat it as not truncatable until the model card says otherwise. For broader comparisons see the curated lists of best multimodal embedding models and best self-hosted embedding models.

Practical Gotchas

These are the failure modes that quietly break naive MRL implementations.

1. Normalize After Truncation, Not Before

If you score with a dot product over vectors you assume are unit-length (the usual way to get cosine similarity cheaply), you must L2-normalize the truncated prefix, not the full vector. A vector that is unit-length at 1024 dims is generally shorter than unit length after you slice it to 128 dims, because you dropped part of its magnitude. The correct order is: truncate first, then normalize the prefix. Normalizing the full vector and then truncating leaves prefixes with differing norms, so dot-product scores no longer equal cosine similarity and rankings can shift. A metric that computes true cosine renormalizes internally and is unaffected, but many indexes assume pre-normalized input. Apply the same truncate-then-normalize order to both the indexed items and the query so they live in the same space.
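A small check makes the ordering issue visible. The vector here is random and the helper name is hypothetical; the point is only the norm of each result.

```python
import numpy as np


def truncate_then_normalize(vector: np.ndarray, dims: int) -> np.ndarray:
    """Correct order: slice the prefix first, then scale it to unit length."""
    prefix = vector[:dims]
    return prefix / np.linalg.norm(prefix)


rng = np.random.default_rng(0)
full = rng.normal(size=1024)

# Wrong order: normalize at 1024 dims, then slice. The prefix keeps only part
# of the unit vector's magnitude, so its norm falls below 1.
wrong = (full / np.linalg.norm(full))[:128]

# Right order: slice, then normalize.
right = truncate_then_normalize(full, 128)

print(np.linalg.norm(wrong))  # less than 1
print(np.linalg.norm(right))  # 1.0 up to floating-point error
```

Both prefixes point in the same direction; only the length differs. That is why the mistake is invisible under true cosine scoring and shows up only when an index treats the dot product as cosine.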

2. Not All Models Are MRL-Trained

Truncating a non-MRL embedding does not give you a smaller usable embedding. Because the information was never ordered front-to-back, a raw slice discards an arbitrary share of the signal and quality typically drops sharply. Before you truncate anything, confirm the model was trained with a Matryoshka objective (check the model card or the provider's dimensions parameter docs). If it was not, you must either keep the full vector or use a proper learned reduction; a raw slice will silently wreck recall.

3. Choosing the Prefix Size

There is no universal prefix. The right size depends on corpus size, the hardness of the queries, and your recall target. A practical procedure:

1. Build a labeled eval set of queries with known relevant items.
2. Sweep candidate prefix sizes (64, 128, 256, 512) for the coarse pass.
3. For each, measure recall@k of the coarse shortlist (does the relevant item survive into the candidate set?) and end-to-end precision after the full-dim rerank.
4. Pick the smallest prefix that keeps coarse recall above your threshold. Smaller prefix means cheaper index and faster scan; you recover precision in the rerank.
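The sweep can be sketched as follows. This is a minimal version that assumes one known relevant item per query and uses brute-force scoring; the function names are hypothetical, and the data is a random placeholder for a real labeled eval set, so the recall values it produces say nothing about any actual model.

```python
import numpy as np


def normalize_rows(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v, axis=1, keepdims=True)


def coarse_recall_at_k(queries, corpus, relevant_ids, dims, k) -> float:
    """Fraction of queries whose relevant item survives into the top-k shortlist."""
    scores = normalize_rows(queries[:, :dims]) @ normalize_rows(corpus[:, :dims]).T
    topk = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    hits = (topk == relevant_ids[:, None]).any(axis=1)
    return float(hits.mean())


def smallest_prefix(queries, corpus, relevant_ids, prefix_sizes, k, threshold):
    """Return the smallest prefix whose coarse recall@k meets the threshold."""
    for dims in sorted(prefix_sizes):
        if coarse_recall_at_k(queries, corpus, relevant_ids, dims, k) >= threshold:
            return dims
    return None  # no candidate prefix is good enough


# Placeholder eval set: each query is a noisy copy of one corpus item.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(2000, 1024))
relevant_ids = rng.choice(2000, size=50, replace=False)
queries = corpus[relevant_ids] + 0.3 * rng.normal(size=(50, 1024))

recalls = {
    dims: coarse_recall_at_k(queries, corpus, relevant_ids, dims, k=200)
    for dims in (64, 128, 256, 512)
}
chosen = smallest_prefix(queries, corpus, relevant_ids, (64, 128, 256, 512), k=200, threshold=0.95)
```

Measuring end-to-end precision after the rerank is a separate step; keeping it out of this function is what lets you tell which stage needs tuning.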

4. The ANN Index Is Built on the Truncated Dimension

This is the subtle one. Your approximate nearest neighbor index — its graph or partitions, its distance computations — is built over the vectors you actually insert. For the coarse pass, that means the index is built on the truncated dimension. You do not index full vectors and then ask the index for a 128-dim answer; you index 128-dim vectors. The full-dim vectors are stored separately (often on a cheaper tier) and fetched only for rerank. Keep the index dimension, the stored-vector dimension, and the rerank dimension explicitly distinct in your design so you never accidentally search the wrong resolution. The adaptive-indexing guide above covers how to manage this when the agent changes resolution per query.
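One lightweight way to keep the three dimensions distinct is to name them in a single validated config object. The sketch below is an illustrative design, not part of any library; `CascadeDims` and `prepare_for_index` are hypothetical names.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class CascadeDims:
    index_dim: int   # dimension the coarse ANN index is built on
    stored_dim: int  # dimension of the vectors kept for rerank
    rerank_dim: int  # dimension used to re-score the shortlist

    def __post_init__(self):
        # The index prefix cannot exceed the rerank prefix, and the rerank
        # prefix cannot exceed what is actually stored.
        if not (0 < self.index_dim <= self.rerank_dim <= self.stored_dim):
            raise ValueError(
                "expected 0 < index_dim <= rerank_dim <= stored_dim, got "
                f"{self.index_dim}, {self.rerank_dim}, {self.stored_dim}"
            )


def prepare_for_index(vector: np.ndarray, dims: CascadeDims) -> np.ndarray:
    """Truncate a stored vector to the index dimension, then L2-normalize it."""
    if vector.shape[-1] != dims.stored_dim:
        raise ValueError(f"expected {dims.stored_dim} dims, got {vector.shape[-1]}")
    prefix = vector[: dims.index_dim]
    return prefix / np.linalg.norm(prefix)


dims = CascadeDims(index_dim=128, stored_dim=1024, rerank_dim=1024)
indexed = prepare_for_index(np.ones(1024), dims)
```

Passing the same `dims` object to both the indexing path and the query path makes a resolution mismatch fail loudly instead of silently degrading recall.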

Mapping This to Mixpeek

Mixpeek expresses the coarse-to-fine cascade as a multi-stage retriever: a cheap feature_search stage produces a shortlist over the full corpus, and a higher-dimension stage reranks that shortlist. You index the truncated dimension for the first stage and keep the full vectors available for the rerank, exactly as the pattern above prescribes.

pip install mixpeek

from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_API_KEY")

# Index a large multimodal corpus with an MRL-capable embedding model.
# The coarse index stores a truncated prefix; full vectors stay available
# for reranking the shortlist.
collection = mx.collections.create(
    collection_name="media_library",
    source={"type": "bucket", "uri": "s3://brand-media-library"},
    feature_extractors=[
        {
            "feature": "visual_embedding",
            "model": "nomic-ai/nomic-embed-vision",
            # Coarse prefix for the first-pass index; full vector retained.
            "matryoshka_dims": [128, 1024],
        },
    ],
)

# Build a retriever that shortlists coarse, then reranks fine.
retriever = mx.retrievers.create(
    collection_id=collection.id,
    stages=[
        # Stage 1: cheap coarse shortlist over the whole corpus (128-d prefix).
        {
            "type": "feature_search",
            "feature": "visual_embedding",
            "dimensions": 128,
            "top_k": 200,
        },
        # Stage 2: precise rerank of the 200 candidates at full dimension.
        {
            "type": "feature_search",
            "feature": "visual_embedding",
            "dimensions": 1024,
            "top_k": 10,
        },
    ],
)

# An agent runs fast-then-precise search with a single call.
results = mx.retrievers.execute(
    retriever_id=retriever.id,
    query="overhead shot of a product on a marble kitchen counter",
    return_fields=["asset_id", "timestamps", "keyframe_url", "score"],
)

The agent gets a fast first pass over millions of vectors at 128 dimensions, then spends full-dimension precision on only the 200 it shortlisted — the same 8x-class savings worked through above, expressed declaratively. To configure which feature dimensions are extracted and stored, see extractors; to store and query your own pre-computed MRL vectors with object-storage economics, see MVS. For cost planning across resolutions and tiers, see pricing and the docs.

Production Checklist

Confirm the embedding model is genuinely MRL-trained before truncating anything.

Truncate first, then L2-normalize the prefix — apply the same order to items and queries.

Build the coarse ANN index on the truncated dimension; store full vectors separately for rerank.

Pick the prefix size from a labeled eval, not intuition: smallest prefix that holds coarse recall.

Tune prefix size and shortlist top_k together against your recall/latency target.

Compose MRL with quantization for stacked savings (fewer dims and fewer bits per dim).

Tier storage: cheap coarse index hot, full vectors on cheaper storage, fetched only on rerank.

Measure coarse recall@k separately from end-to-end precision so you know which stage to tune.

Key Takeaways

MRL trains one model so the first *k* dimensions of every vector are an independently usable embedding — coarse semantics at the front, fine detail toward the back.

It works because the nested loss forces the highest-variance, most discriminative directions into the earliest dimensions, giving graceful degradation under truncation.

It beats training N models (one pass, comparable vectors) and beats PCA (task-optimized prefixes, no stored projection); it is orthogonal to and composable with quantization.

The payoff for agents is the retrieve-coarse, rerank-fine cascade: a cheap low-dim scan over the whole corpus, then full-dim precision on a small shortlist — roughly 8x storage and scan savings in the worked example.

It matters most for multimodal and agent perception, where vector counts are enormous and an agent wants a fast first look before committing precision.

Watch the gotchas: normalize after truncation, never truncate non-MRL models, choose the prefix empirically, and remember the ANN index is built on the truncated dimension.

Related Resources

Embedding Quantization and Compression -- the bit-level companion to dimensionality reduction

Approximate Nearest Neighbor Algorithms -- what makes the coarse pass sublinear

Multi-Stage Retrieval: How Agents Search Unstructured Data -- the staging pattern in full

Budget-Aware Multi-Vector Retrieval -- spending compute where it counts

Adaptive Indexing for Agentic Search -- changing resolution per query

Vector Storage Tiering -- hot coarse index, cold full vectors

Best Multimodal Embedding Models -- compare nested-embedding support

Best Self-Hosted Embedding Models -- open MRL-capable options

Best Vector Databases -- where the truncated index lives