NEWVectors or files. Pick a path.Start →
    Embeddings
    17 min read
    Updated 2026-06-19

    The Modality Gap: Why Text-to-Image Search Underperforms and How Agents Fix It

    A geometric guide to the modality gap in dual-encoder models like CLIP. Learn why text and image embeddings sit in separate cones of the same space, why this quietly caps cross-modal recall and breaks image-to-image search, and the four concrete techniques an AI agent stack can use to close or sidestep the gap when searching unstructured content.

    Modality Gap
    Cross-Modal Retrieval
    CLIP
    Multimodal Embeddings
    Embedding Geometry

    The Bug That Looks Like a Tuning Problem



    You build a CLIP-style index. Text-to-image search works: type "a dog on a beach" and you get dogs on beaches. Then you try image-to-image search using the same model's image encoder, and the results are worse than you expected. Then you build one shared index holding both image vectors and text-caption vectors so an agent can search either way, and you notice something strange: a text query overwhelmingly retrieves other text, and an image query overwhelmingly retrieves other images, almost regardless of meaning.

    This is not a bug in your code and it is not a hyperparameter you forgot to tune. It is a structural property of how joint embedding models are trained, called the modality gap, and it silently caps the quality of cross-modal and mixed-modal retrieval. For an AI agent that has to search across images, video frames, audio, and text in one space, understanding the gap is the difference between trusting your similarity scores and being quietly misled by them.

    What the Gap Actually Is



    Train a dual-encoder model like CLIP and you would naively expect a semantically matched image and caption to land at nearly the same point in the shared space. They do not. Instead, all image embeddings occupy one narrow region (a cone) of the hypersphere, and all text embeddings occupy a different, well-separated region. The two clouds barely overlap. A matched image-text pair is closer to each other than to mismatched pairs, but both still live inside their own modality's cone, separated by a consistent offset vector.

            text cone
            *  *  *
           *  *  *  *
            *  *  *
                \
                 \   <- the gap: a roughly constant
                  \     offset between the two clouds
                   *  *  *
                  *  *  *  *
                   *  *  *
            image cone
    


    Two facts make this counterintuitive:

    1. The gap is large relative to the within-modality spread. The distance from the text cloud to the image cloud is often bigger than the distance between a query and its correct match inside a cloud. So raw cosine similarity is dominated by "which modality is this?" before it ever gets to "what does this mean?" 2. The gap is roughly constant in direction. You can estimate a single offset vector (the difference of the two cloud centroids) that explains most of the separation. That constancy is what makes the gap fixable.

    Why It Happens



    The gap is not an accident of one checkpoint. Three forces create and preserve it.

  1. Separate encoders, separate parameters. A dual encoder runs images through one network and text through another, with no shared weights. Nothing forces the two networks to use the same region of the output space; they only have to make matched pairs score higher than mismatched pairs.
  2. Contrastive loss only constrains relative order, not absolute position. The InfoNCE objective rewards a matched pair for being more similar than the in-batch negatives. It is perfectly satisfied if every image sits in cone A and every caption in cone B, as long as within that arrangement the right pairs are closest. The loss has no term that says "put the cat image on top of the word cat."
  3. Initialization and temperature lock it in. At initialization each encoder's outputs already occupy a small cone (a known effect of deep networks on the sphere), and the low softmax temperature used in contrastive training sharpens the contrast between matched and unmatched pairs without ever pulling the cones together. Early training widens the gap and later training preserves it.


  4. The cleanest way to internalize this: the model was trained to rank, not to register. Cross-modal ranking within a query works because the offset is shared across all candidates and cancels out. Anything that mixes modalities in the candidate set, or compares absolute scores across modalities, exposes the offset.

    The Three Failure Modes Agents Hit



    The gap is invisible until your retrieval pattern stops being "text query over an image-only index." These are the patterns that break.

    1. Mixed-modality indexes (intra-modal ranking bias). Put image vectors and text vectors in one collection and let an agent query with either. Because each query sits inside its own cone, the nearest neighbors are dominated by same-modality items. A text query buries relevant images under barely-relevant captions; an image query buries relevant captions under barely-relevant images. The agent thinks it ran a semantic search; it actually ran a modality filter with a semantic tiebreaker.

    2. Intra-modal retrieval with a cross-modal model (image-to-image search). CLIP's image encoder is excellent at the contrastive task but was never optimized to make two images of the same object land close together; that constraint was never in the loss. So image-to-image and text-to-text similarity from a CLIP-style model is systematically weaker than a model trained for that modality directly. Reaching for the joint model's single-modality encoder because it is already in your stack is a common and quiet quality regression.

    3. Cross-modal score thresholds and fusion. If you set "relevant if cosine > 0.28" from text-to-image experiments and reuse it on text-to-text, the threshold is meaningless because the score distributions live on different scales separated by the gap. The same problem corrupts naive score fusion across modalities. (This is the cross-modal twin of the calibration problem in Calibrating Similarity Scores.)

    Four Ways to Close or Sidestep the Gap



    There is no single fix; there is a menu, and the right choice depends on whether you can change the model, what you store, and which retrieval pattern you run.

    1. Subtract the offset (gap closing)



    Because the gap is roughly a constant translation, estimate it and remove it. Compute the centroid of a sample of text embeddings and the centroid of a sample of image embeddings, take the difference, and shift one modality by that vector before indexing. A slightly stronger version learns a small linear map (a rotation plus translation) on held-out matched pairs.

    import numpy as np

    # matched samples used only to estimate the gap, not to train the encoders text_mean = text_vecs.mean(axis=0) img_mean = img_vecs.mean(axis=0) gap = text_mean - img_mean # the (roughly constant) offset

    # shift image vectors toward the text cone before indexing, then renormalize img_aligned = img_vecs + gap img_aligned /= np.linalg.norm(img_aligned, axis=1, keepdims=True)


    This is cheap, needs no retraining, and typically recovers a meaningful slice of cross-modal recall. Its ceiling is set by how constant the gap really is; the residual, query-dependent part of the offset it cannot remove.

    2. Route by modality, then search within (gap sidestepping)



    The most robust pattern for mixed corpora is to stop mixing modalities in one ranked list. Keep a separate index per modality, run the query against each, and merge the rankings with a rank-based method like RRF that never compares raw cross-modal scores. The gap stops mattering because every comparison happens inside one modality's cone, and rank fusion is scale-agnostic by construction. (This is the same fusion machinery described in Learned Sparse Retrieval and Dense-Sparse Hybrid and Multi-Index Search Architecture.)

    3. Store multiple vectors per item



    Give each item one embedding per modality it has: an image vector for the visual, a text vector for its caption or OCR, an audio vector for its sound. A text query hits the text vectors (a within-modality, gap-free comparison) and an image query hits the image vectors. You trade storage for recall and you sidestep the gap entirely, at the cost of running an extra encoder per item. This is the multi-index pattern made per-item rather than per-collection.

    4. Use a model trained against the gap



    Newer training recipes explicitly add intra-modal terms (image-to-image and text-to-text contrastive losses alongside the cross-modal one), or post-hoc alignment objectives that pull the cones together. SigLIP-style sigmoid losses and "mind the gap" fine-tuning reduce, though do not eliminate, the separation. If you control the model or can pick one, a model with intra-modal constraints is strictly better for any agent that does both cross-modal and within-modal search. (See Contrastive Learning: How CLIP, SigLIP, and CLAP Work for the loss internals.)

    A Decision Guide



  5. Text query over an image-only index: the gap cancels; do nothing.
  6. Mixed-modality index, either query type: never trust one fused cosine list. Route by modality and merge with RRF, or store per-modality vectors.
  7. Image-to-image or text-to-text search: do not reuse a cross-modal encoder's single tower if quality matters. Use a model with intra-modal constraints or a dedicated single-modality encoder.
  8. Cross-modal thresholds or score fusion: recalibrate per modality pair. A threshold from text-to-image does not transfer to text-to-text.
  9. Can't change the model, can change the index: subtract the offset before indexing. Cheap, no retraining, partial fix.


  10. Measuring the Gap Before You Trust the Index



    Do not assume the gap is small for your model and data. Measure it, with three quick diagnostics on a labeled sample.

  11. Cloud separation. Compute the centroids of text and image embeddings and the distance between them, relative to the average within-cloud distance. A separation larger than the within-cloud spread means raw cross-modal scores are gap-dominated.
  12. Intra-modal bias rate. Build a mixed index, run matched queries, and measure what fraction of the top-k come from the query's own modality. A healthy mixed search returns a balance that reflects the corpus; a number near 100 percent same-modality is the gap leaking into your results.
  13. Within-modal recall delta. Compare image-to-image recall from your cross-modal model against a dedicated single-modality encoder on the same data. A large delta tells you the joint model's tower is the wrong tool for intra-modal search.


  14. The single number to watch for an agent stack is the intra-modal bias rate on a mixed index. If it is pinned high, your "semantic" search is really a modality sort, and the fix is routing or per-modality vectors, not a better query.

    Doing This in Mixpeek



    For agents, the durable answer is to never force incompatible modalities into one ranked list and to keep within-modality comparisons within-modality. In Managed Mixpeek, that is a retriever configuration rather than custom plumbing: index a per-modality feature for each record (visual, transcript text, audio), run a stage per modality, and fuse with RRF so no raw cross-modal score is ever compared directly.

    from mixpeek import Mixpeek

    client = Mixpeek(api_key="mxp_sk_...")

    retriever = client.retrievers.create( namespace="media_library", stages=[ # within-modality: the text query scores against caption/OCR text {"stage_type": "vector_search", "field": "text_embedding", "top_k": 200}, # within-modality: the same query, embedded for the visual space {"stage_type": "vector_search", "field": "visual_embedding", "top_k": 200}, # merge by rank so the modality gap never enters the comparison {"stage_type": "rank_fusion", "method": "rrf"}, ], )

    results = client.retrievers.execute( retriever_id=retriever.retriever_id, inputs={"text": "a dog running on a beach at sunset"}, top_k=20, )


    If you bring your own embeddings, you can apply the offset-subtraction trick before indexing and run dense search over the aligned vectors in MVS, the Mixpeek Vector Store, side by side with the raw vectors to measure the recall lift. Either way the agent issues one query and gets a single fused ranking whose comparisons all happened inside a single modality, so the gap never silently reorders its evidence.

    Further Reading



  15. Contrastive Learning: How CLIP, SigLIP, and CLAP Work -- the dual-encoder loss that creates the gap
  16. Embedding Space Geometry -- why cosine similarity does not always mean what you think
  17. Omnimodal Embeddings -- single-model approaches to text, image, audio, and video
  18. Multi-Index Search Architecture -- combining per-modality indexes the gap-safe way
  19. Calibrating Similarity Scores -- recalibrating thresholds across modality pairs
  20. Managed Mixpeek

    Put multimodal search to work

    Connect a bucket and Mixpeek runs the whole multimodal search pipeline for you: extraction, indexing, and search over your own objects. No models to wire up, nothing to host.

    Start with Managed
    MVS · bring your own

    Already have vectors?

    Keep your embeddings on your own cloud and run dense, sparse, and BM25 search directly on object storage. First 1M vectors free.

    Start with MVS

    Build a Multimodal Search Pipeline

    Give agents searchable access to video, image, audio, and document evidence with Mixpeek.

    Start BuildingRead Docs