The Modality Gap: Why Text-to-Image Search Underperforms and How Agents Fix It

The Bug That Looks Like a Tuning Problem

You build a CLIP-style index. Text-to-image search works: type "a dog on a beach" and you get dogs on beaches. Then you try image-to-image search using the same model's image encoder, and the results are worse than you expected. Then you build one shared index holding both image vectors and text-caption vectors so an agent can search either way, and you notice something strange: a text query overwhelmingly retrieves other text, and an image query overwhelmingly retrieves other images, almost regardless of meaning.

This is not a bug in your code and it is not a hyperparameter you forgot to tune. It is a structural property of how joint embedding models are trained, called the modality gap, and it silently caps the quality of cross-modal and mixed-modal retrieval. For an AI agent that has to search across images, video frames, audio, and text in one space, understanding the gap is the difference between trusting your similarity scores and being quietly misled by them.

A shared embedding space is really two well-separated cones: all image embeddings in one region, all text embeddings in another, matched pairs still closer to each other than to mismatches but never registered onto the same point.

See the full diagram →

What the Gap Actually Is

Train a dual-encoder model like CLIP and you would naively expect a semantically matched image and caption to land at nearly the same point in the shared space. They do not. Instead, all image embeddings occupy one narrow region (a cone) of the hypersphere, and all text embeddings occupy a different, well-separated region. The two clouds barely overlap. A matched image-text pair is closer to each other than to mismatched pairs, but both still live inside their own modality's cone, separated by a consistent offset vector.

        text cone
        *  *  *
       *  *  *  *
        *  *  *
            \
             \   <- the gap: a roughly constant
              \     offset between the two clouds
               *  *  *
              *  *  *  *
               *  *  *
        image cone

Two facts make this counterintuitive:

1. The gap is large relative to the within-modality spread. The distance from the text cloud to the image cloud is often bigger than the distance between a query and its correct match inside a cloud. So raw cosine similarity is dominated by "which modality is this?" before it ever gets to "what does this mean?" 2. The gap is roughly constant in direction. You can estimate a single offset vector (the difference of the two cloud centroids) that explains most of the separation. That constancy is what makes the gap fixable.

Why It Happens

The gap is not an accident of one checkpoint. Three forces create and preserve it.

Separate encoders, separate parameters. A dual encoder runs images through one network and text through another, with no shared weights. Nothing forces the two networks to use the same region of the output space; they only have to make matched pairs score higher than mismatched pairs.

Contrastive loss only constrains relative order, not absolute position. The InfoNCE objective rewards a matched pair for being more similar than the in-batch negatives. It is perfectly satisfied if every image sits in cone A and every caption in cone B, as long as within that arrangement the right pairs are closest. The loss has no term that says "put the cat image on top of the word cat."

Initialization and temperature lock it in. At initialization each encoder's outputs already occupy a small cone (a known effect of deep networks on the sphere), and the low softmax temperature used in contrastive training sharpens the contrast between matched and unmatched pairs without ever pulling the cones together. Early training widens the gap and later training preserves it.

The cleanest way to internalize this: the model was trained to rank, not to register. Cross-modal ranking within a query works because the offset is shared across all candidates and cancels out. Anything that mixes modalities in the candidate set, or compares absolute scores across modalities, exposes the offset.

The Three Failure Modes Agents Hit

The gap is invisible until your retrieval pattern stops being "text query over an image-only index." These are the patterns that break.

1. Mixed-modality indexes (intra-modal ranking bias). Put image vectors and text vectors in one collection and let an agent query with either. Because each query sits inside its own cone, the nearest neighbors are dominated by same-modality items. A text query buries relevant images under barely-relevant captions; an image query buries relevant captions under barely-relevant images. The agent thinks it ran a semantic search; it actually ran a modality filter with a semantic tiebreaker.

2. Intra-modal retrieval with a cross-modal model (image-to-image search). CLIP's image encoder is excellent at the contrastive task but was never optimized to make two images of the same object land close together; that constraint was never in the loss. So image-to-image and text-to-text similarity from a CLIP-style model is systematically weaker than a model trained for that modality directly. Reaching for the joint model's single-modality encoder because it is already in your stack is a common and quiet quality regression.

3. Cross-modal score thresholds and fusion. If you set "relevant if cosine > 0.28" from text-to-image experiments and reuse it on text-to-text, the threshold is meaningless because the score distributions live on different scales separated by the gap. The same problem corrupts naive score fusion across modalities. (This is the cross-modal twin of the calibration problem in Calibrating Similarity Scores.)

Four Ways to Close or Sidestep the Gap

There is no single fix; there is a menu, and the right choice depends on whether you can change the model, what you store, and which retrieval pattern you run.

1. Subtract the offset (gap closing)

Because the gap is roughly a constant translation, estimate it and remove it. Compute the centroid of a sample of text embeddings and the centroid of a sample of image embeddings, take the difference, and shift one modality by that vector before indexing. A slightly stronger version learns a small linear map (a rotation plus translation) on held-out matched pairs.

import numpy as np

# matched samples used only to estimate the gap, not to train the encoders
text_mean = text_vecs.mean(axis=0)
img_mean = img_vecs.mean(axis=0)
gap = text_mean - img_mean          # the (roughly constant) offset

# shift image vectors toward the text cone before indexing, then renormalize
img_aligned = img_vecs + gap
img_aligned /= np.linalg.norm(img_aligned, axis=1, keepdims=True)

This is cheap, needs no retraining, and typically recovers a meaningful slice of cross-modal recall. Its ceiling is set by how constant the gap really is; the residual, query-dependent part of the offset it cannot remove.

2. Route by modality, then search within (gap sidestepping)

The most robust pattern for mixed corpora is to stop mixing modalities in one ranked list. Keep a separate index per modality, run the query against each, and merge the rankings with a rank-based method like RRF that never compares raw cross-modal scores. The gap stops mattering because every comparison happens inside one modality's cone, and rank fusion is scale-agnostic by construction. (This is the same fusion machinery described in Learned Sparse Retrieval and Dense-Sparse Hybrid and Multi-Index Search Architecture.)

3. Store multiple vectors per item

Give each item one embedding per modality it has: an image vector for the visual, a text vector for its caption or OCR, an audio vector for its sound. A text query hits the text vectors (a within-modality, gap-free comparison) and an image query hits the image vectors. You trade storage for recall and you sidestep the gap entirely, at the cost of running an extra encoder per item. This is the multi-index pattern made per-item rather than per-collection.

4. Use a model trained against the gap

Newer training recipes explicitly add intra-modal terms (image-to-image and text-to-text contrastive losses alongside the cross-modal one), or post-hoc alignment objectives that pull the cones together. SigLIP-style sigmoid losses and "mind the gap" fine-tuning reduce, though do not eliminate, the separation. If you control the model or can pick one, a model with intra-modal constraints is strictly better for any agent that does both cross-modal and within-modal search. (See Contrastive Learning: How CLIP, SigLIP, and CLAP Work for the loss internals.)

A Decision Guide

Text query over an image-only index: the gap cancels; do nothing.

Mixed-modality index, either query type: never trust one fused cosine list. Route by modality and merge with RRF, or store per-modality vectors.

Image-to-image or text-to-text search: do not reuse a cross-modal encoder's single tower if quality matters. Use a model with intra-modal constraints or a dedicated single-modality encoder.

Cross-modal thresholds or score fusion: recalibrate per modality pair. A threshold from text-to-image does not transfer to text-to-text.

Can't change the model, can change the index: subtract the offset before indexing. Cheap, no retraining, partial fix.

Measuring the Gap Before You Trust the Index

Do not assume the gap is small for your model and data. Measure it, with three quick diagnostics on a labeled sample.

Cloud separation. Compute the centroids of text and image embeddings and the distance between them, relative to the average within-cloud distance. A separation larger than the within-cloud spread means raw cross-modal scores are gap-dominated.

Intra-modal bias rate. Build a mixed index, run matched queries, and measure what fraction of the top-k come from the query's own modality. A healthy mixed search returns a balance that reflects the corpus; a number near 100 percent same-modality is the gap leaking into your results.

Within-modal recall delta. Compare image-to-image recall from your cross-modal model against a dedicated single-modality encoder on the same data. A large delta tells you the joint model's tower is the wrong tool for intra-modal search.

The single number to watch for an agent stack is the intra-modal bias rate on a mixed index. If it is pinned high, your "semantic" search is really a modality sort, and the fix is routing or per-modality vectors, not a better query.

Doing This in Mixpeek

For agents, the durable answer is to never force incompatible modalities into one ranked list and to keep within-modality comparisons within-modality. In Managed Mixpeek, that is a retriever configuration rather than custom plumbing: index a per-modality feature for each record (visual, transcript text, audio), run a stage per modality, and fuse with RRF so no raw cross-modal score is ever compared directly.

from mixpeek import Mixpeek

client = Mixpeek(api_key="mxp_sk_...")

retriever = client.retrievers.create(
    namespace="media_library",
    stages=[
        # within-modality: the text query scores against caption/OCR text
        {"stage_type": "vector_search", "field": "text_embedding", "top_k": 200},
        # within-modality: the same query, embedded for the visual space
        {"stage_type": "vector_search", "field": "visual_embedding", "top_k": 200},
        # merge by rank so the modality gap never enters the comparison
        {"stage_type": "rank_fusion", "method": "rrf"},
    ],
)

results = client.retrievers.execute(
    retriever_id=retriever.retriever_id,
    inputs={"text": "a dog running on a beach at sunset"},
    top_k=20,
)

If you bring your own embeddings, you can apply the offset-subtraction trick before indexing and run dense search over the aligned vectors in MVS, the Mixpeek Vector Store, side by side with the raw vectors to measure the recall lift. Either way the agent issues one query and gets a single fused ranking whose comparisons all happened inside a single modality, so the gap never silently reorders its evidence.

The Bug That Looks Like a Tuning Problem

What the Gap Actually Is

Why It Happens

The Three Failure Modes Agents Hit

Four Ways to Close or Sidestep the Gap

1. Subtract the offset (gap closing)

2. Route by modality, then search within (gap sidestepping)

3. Store multiple vectors per item

4. Use a model trained against the gap

A Decision Guide

Measuring the Gap Before You Trust the Index

Doing This in Mixpeek

Further Reading

Put multimodal search to work

Already have vectors?

Run this on your own data

Related guides

How Does LoRA Fine-Tuning Work? (Adapters, QLoRA, DoRA, and Fine-Tuning Retrieval Models)

How to Switch Embedding Models Without Re-Embedding Everything

Matryoshka Representation Learning: Nested Embeddings for Adaptive Multimodal Retrieval