The Bug That Looks Like a Tuning Problem
You build a CLIP-style index. Text-to-image search works: type "a dog on a beach" and you get dogs on beaches. Then you try image-to-image search using the same model's image encoder, and the results are worse than you expected. Then you build one shared index holding both image vectors and text-caption vectors so an agent can search either way, and you notice something strange: a text query overwhelmingly retrieves other text, and an image query overwhelmingly retrieves other images, almost regardless of meaning.
This is not a bug in your code and it is not a hyperparameter you forgot to tune. It is a structural property of how joint embedding models are trained, called the modality gap, and it silently caps the quality of cross-modal and mixed-modal retrieval. For an AI agent that has to search across images, video frames, audio, and text in one space, understanding the gap is the difference between trusting your similarity scores and being quietly misled by them.
What the Gap Actually Is
Train a dual-encoder model like CLIP and you would naively expect a semantically matched image and caption to land at nearly the same point in the shared space. They do not. Instead, all image embeddings occupy one narrow region (a cone) of the hypersphere, and all text embeddings occupy a different, well-separated region. The two clouds barely overlap. A matched image-text pair is closer to each other than to mismatched pairs, but both still live inside their own modality's cone, separated by a consistent offset vector.
text cone
* * *
* * * *
* * *
\
\ <- the gap: a roughly constant
\ offset between the two clouds
* * *
* * * *
* * *
image cone
Two facts make this counterintuitive:
1. The gap is large relative to the within-modality spread. The distance from the text cloud to the image cloud is often bigger than the distance between a query and its correct match inside a cloud. So raw cosine similarity is dominated by "which modality is this?" before it ever gets to "what does this mean?" 2. The gap is roughly constant in direction. You can estimate a single offset vector (the difference of the two cloud centroids) that explains most of the separation. That constancy is what makes the gap fixable.
Why It Happens
The gap is not an accident of one checkpoint. Three forces create and preserve it.
The cleanest way to internalize this: the model was trained to rank, not to register. Cross-modal ranking within a query works because the offset is shared across all candidates and cancels out. Anything that mixes modalities in the candidate set, or compares absolute scores across modalities, exposes the offset.
The Three Failure Modes Agents Hit
The gap is invisible until your retrieval pattern stops being "text query over an image-only index." These are the patterns that break.
1. Mixed-modality indexes (intra-modal ranking bias). Put image vectors and text vectors in one collection and let an agent query with either. Because each query sits inside its own cone, the nearest neighbors are dominated by same-modality items. A text query buries relevant images under barely-relevant captions; an image query buries relevant captions under barely-relevant images. The agent thinks it ran a semantic search; it actually ran a modality filter with a semantic tiebreaker.
2. Intra-modal retrieval with a cross-modal model (image-to-image search). CLIP's image encoder is excellent at the contrastive task but was never optimized to make two images of the same object land close together; that constraint was never in the loss. So image-to-image and text-to-text similarity from a CLIP-style model is systematically weaker than a model trained for that modality directly. Reaching for the joint model's single-modality encoder because it is already in your stack is a common and quiet quality regression.
3. Cross-modal score thresholds and fusion. If you set "relevant if cosine > 0.28" from text-to-image experiments and reuse it on text-to-text, the threshold is meaningless because the score distributions live on different scales separated by the gap. The same problem corrupts naive score fusion across modalities. (This is the cross-modal twin of the calibration problem in Calibrating Similarity Scores.)
Four Ways to Close or Sidestep the Gap
There is no single fix; there is a menu, and the right choice depends on whether you can change the model, what you store, and which retrieval pattern you run.
1. Subtract the offset (gap closing)
Because the gap is roughly a constant translation, estimate it and remove it. Compute the centroid of a sample of text embeddings and the centroid of a sample of image embeddings, take the difference, and shift one modality by that vector before indexing. A slightly stronger version learns a small linear map (a rotation plus translation) on held-out matched pairs.
import numpy as np
# matched samples used only to estimate the gap, not to train the encoders
text_mean = text_vecs.mean(axis=0)
img_mean = img_vecs.mean(axis=0)
gap = text_mean - img_mean # the (roughly constant) offset
# shift image vectors toward the text cone before indexing, then renormalize
img_aligned = img_vecs + gap
img_aligned /= np.linalg.norm(img_aligned, axis=1, keepdims=True)
This is cheap, needs no retraining, and typically recovers a meaningful slice of cross-modal recall. Its ceiling is set by how constant the gap really is; the residual, query-dependent part of the offset it cannot remove.
2. Route by modality, then search within (gap sidestepping)
The most robust pattern for mixed corpora is to stop mixing modalities in one ranked list. Keep a separate index per modality, run the query against each, and merge the rankings with a rank-based method like RRF that never compares raw cross-modal scores. The gap stops mattering because every comparison happens inside one modality's cone, and rank fusion is scale-agnostic by construction. (This is the same fusion machinery described in Learned Sparse Retrieval and Dense-Sparse Hybrid and Multi-Index Search Architecture.)
3. Store multiple vectors per item
Give each item one embedding per modality it has: an image vector for the visual, a text vector for its caption or OCR, an audio vector for its sound. A text query hits the text vectors (a within-modality, gap-free comparison) and an image query hits the image vectors. You trade storage for recall and you sidestep the gap entirely, at the cost of running an extra encoder per item. This is the multi-index pattern made per-item rather than per-collection.
4. Use a model trained against the gap
Newer training recipes explicitly add intra-modal terms (image-to-image and text-to-text contrastive losses alongside the cross-modal one), or post-hoc alignment objectives that pull the cones together. SigLIP-style sigmoid losses and "mind the gap" fine-tuning reduce, though do not eliminate, the separation. If you control the model or can pick one, a model with intra-modal constraints is strictly better for any agent that does both cross-modal and within-modal search. (See Contrastive Learning: How CLIP, SigLIP, and CLAP Work for the loss internals.)
A Decision Guide
Measuring the Gap Before You Trust the Index
Do not assume the gap is small for your model and data. Measure it, with three quick diagnostics on a labeled sample.
The single number to watch for an agent stack is the intra-modal bias rate on a mixed index. If it is pinned high, your "semantic" search is really a modality sort, and the fix is routing or per-modality vectors, not a better query.
Doing This in Mixpeek
For agents, the durable answer is to never force incompatible modalities into one ranked list and to keep within-modality comparisons within-modality. In Managed Mixpeek, that is a retriever configuration rather than custom plumbing: index a per-modality feature for each record (visual, transcript text, audio), run a stage per modality, and fuse with RRF so no raw cross-modal score is ever compared directly.
from mixpeek import Mixpeek
client = Mixpeek(api_key="mxp_sk_...")
retriever = client.retrievers.create(
namespace="media_library",
stages=[
# within-modality: the text query scores against caption/OCR text
{"stage_type": "vector_search", "field": "text_embedding", "top_k": 200},
# within-modality: the same query, embedded for the visual space
{"stage_type": "vector_search", "field": "visual_embedding", "top_k": 200},
# merge by rank so the modality gap never enters the comparison
{"stage_type": "rank_fusion", "method": "rrf"},
],
)
results = client.retrievers.execute(
retriever_id=retriever.retriever_id,
inputs={"text": "a dog running on a beach at sunset"},
top_k=20,
)
If you bring your own embeddings, you can apply the offset-subtraction trick before indexing and run dense search over the aligned vectors in MVS, the Mixpeek Vector Store, side by side with the raw vectors to measure the recall lift. Either way the agent issues one query and gets a single fused ranking whose comparisons all happened inside a single modality, so the gap never silently reorders its evidence.