Why This Matters for Agents
Every cross-modal search query your agent runs — "find frames matching this description," "retrieve audio that sounds like this," "find documents similar to this image" — depends on contrastive learning. CLIP, SigLIP, CLAP, BGE, and virtually every embedding model used for multimodal retrieval was trained with some variant of contrastive learning.
Previous guides in this series covered what these models do. This guide covers how they learn — the training dynamics that determine whether your search results are good or garbage. Understanding contrastive learning is not academic curiosity. It directly predicts:
The Core Idea
Contrastive learning trains a model to answer one question: are these two things related or not?
Given a batch of N (anchor, positive) pairs, the model learns to produce embeddings where: 1. The anchor and its positive have high similarity (close in embedding space) 2. The anchor and all other items in the batch have low similarity (far apart)
The "anchor" and "positive" can be anything: an image and its caption (CLIP), two augmented views of the same image (SimCLR), an audio clip and its text description (CLAP), or a query and a relevant document (BGE).
What makes contrastive learning powerful is that you never need to define similarity explicitly. You just need pairs that should match and let the loss function push everything else apart. The model discovers what "similar" means by learning which features matter for the matching task.
The InfoNCE Loss
The dominant contrastive loss function is InfoNCE (Information Noise-Contrastive Estimation), introduced in 2018 and used by CLIP, CLAP, and most contrastive models.
For a batch of N pairs, consider one anchor embedding `a_i` and its positive `p_i`. The loss for this pair is:
L_i = -log( exp(sim(a_i, p_i) / τ) / Σ_j exp(sim(a_i, p_j) / τ) )
Where:
This is a softmax cross-entropy loss where the "correct class" is the matching positive and the "wrong classes" are all other items in the batch. The total loss averages over all N anchors.
Intuition: For each anchor, the model assigns a probability to every item in the batch being the correct match. Training pushes the probability mass toward the true match.
Why Batch Size Matters Enormously
The denominator sums over all N items in the batch. Each non-matching item acts as a negative example. More negatives means the model must learn finer distinctions to identify the correct positive.
CLIP was trained with a batch size of 32,768. This means each anchor is contrasted against 32,767 negatives simultaneously. At this scale, the model cannot succeed by learning coarse features ("this is a photo" vs "this is text"). It must learn fine-grained semantic alignment.
With a batch size of 32, the task is much easier — the model can often find the correct match by superficial cues. This is why small-batch contrastive training produces poor embeddings even with the same architecture and data.
Practical implication: If you fine-tune a contrastive model on your domain data with small batches, the resulting embeddings will be less discriminative than the original. Minimum batch size of 256 is a common rule of thumb; 1024+ is better.
Temperature: The Most Misunderstood Parameter
The temperature τ controls the "sharpness" of the similarity distribution. It is arguably the most important hyperparameter in contrastive learning.
Low temperature (τ → 0): The softmax becomes very sharp. Only the highest-similarity item gets significant probability mass. The model aggressively separates positives from negatives. This pushes embeddings toward a uniform distribution on the hypersphere — maximum spread, maximum discrimination.
High temperature (τ → ∞): The softmax becomes flat. All items get roughly equal probability regardless of similarity. The gradients are weak and the model learns slowly.
CLIP uses a learnable log-temperature initialized at `log(1/0.07) ≈ 2.66`, clamped to a maximum of `log(100) ≈ 4.6`. During training, the temperature typically converges to around 0.01, which is very sharp. This extreme sharpness is why CLIP embeddings have a distinctive property: most pairs have very low cosine similarity (near 0), and even highly similar items rarely exceed 0.35-0.40.
Why this matters for search: When you set a similarity threshold for retrieval (e.g., "return results with similarity > 0.25"), the right threshold depends on the temperature the model was trained with. A CLIP similarity of 0.30 is a strong match. A BGE similarity of 0.30 might be weak. There is no universal "good" threshold — it is model-specific and determined by training temperature.
CLIP: The Softmax Formulation
CLIP (Contrastive Language-Image Pre-training, OpenAI 2021) trains two encoders simultaneously:
Image → Vision Transformer (ViT) → image embedding (512/768-dim)
Text → Transformer (GPT-like) → text embedding (512/768-dim)
Both projected to shared dimensionality via learned linear layers
Given a batch of N (image, text) pairs, CLIP computes an N×N matrix of cosine similarities. The loss is symmetric:
L_image = (1/N) Σ_i CrossEntropy(row_i, target=i)
L_text = (1/N) Σ_i CrossEntropy(col_i, target=i)
L_total = (L_image + L_text) / 2
Each row represents one image contrasted against all texts. Each column represents one text contrasted against all images. The target is the diagonal — each image should match its own text.
The Softmax Problem
The softmax denominator grows linearly with batch size. For a batch of 32,768, computing the full N×N similarity matrix requires 32,768² ≈ 1 billion similarity computations. This is computationally expensive and, more importantly, the softmax formulation creates an implicit competition between all pairs.
If two images in the batch have genuinely similar captions (e.g., "a dog running in a park" and "a puppy running on grass"), the model is penalized for assigning high similarity to both. The softmax forces it to choose one. This is a false negative problem — the model learns to avoid features shared by multiple positive pairs, which can suppress genuinely useful semantic similarities.
SigLIP: The Sigmoid Alternative
SigLIP (Sigmoid Loss for Language-Image Pre-training, Google 2023) replaces the softmax with independent sigmoid losses:
For each pair (i, j):
y_ij = 1 if (i, j) is a matching pair, -1 otherwise
L_ij = -log σ(y_ij · sim(i, j) / τ + b)
L_total = (1/N²) Σ_i Σ_j L_ij
Where σ is the sigmoid function and b is a learnable bias.
The key difference: each (image, text) pair is classified independently as matching or not, with no softmax competition. Two similar captions can both have high similarity to their respective images without penalty.
Why SigLIP is Better (Sometimes)
1. No false negatives: Similar-but-not-paired items do not compete. The model can learn that "dog in park" and "puppy on grass" are both close to park dog images. 2. Scales better: No N×N softmax — the loss decomposes into independent binary classifications. SigLIP was trained with chunks of 32K pairs but evaluated on smaller sub-batches. 3. Simpler temperature dynamics: The bias term b absorbs some of the work that temperature does in InfoNCE, leading to more stable training.
SigLIP 2 (Google 2024-2025) extends this with better data curation and multi-resolution training, achieving SOTA on many zero-shot benchmarks with fewer parameters than CLIP.
But: The softmax formulation has an advantage when the batch does contain true duplicates — it forces the model to find the exact match, which can improve fine-grained discrimination. SigLIP trades fine-grained precision for better recall.
CLAP: Contrastive Learning for Audio
CLAP (Contrastive Language-Audio Pretraining) applies the same framework to audio:
Audio → HTSAT Encoder (Swin Transformer on mel spectrograms) → 512-dim
Text → RoBERTa Encoder → 512-dim
InfoNCE loss on (audio, text) pairs
The architecture is conceptually identical to CLIP, but the audio encoder processes mel spectrograms (2D time-frequency representations) instead of images. HTSAT uses hierarchical windowed self-attention, similar to Swin Transformer but adapted for the rectangular aspect ratio of spectrograms (long in time, short in frequency).
Audio-Specific Challenges
Variable length: Audio clips range from 1 second to minutes. CLAP handles this by padding/truncating to a fixed window (typically 10 seconds) and using a global average pooling to produce a single embedding. Longer audio must be chunked.
Temporal structure: Unlike images where spatial information is crucial, audio has strong temporal ordering. The HTSAT encoder processes the spectrogram with shifted windows that span the time axis, capturing temporal patterns (speech rhythm, musical phrases, sound event sequences).
Sparse positive labels: Most audio datasets have short, noisy captions. AudioCaps (the main CLAP training dataset) has ~50K audio-caption pairs, compared to CLIP's 400M image-text pairs. This scarcity is partially compensated by data augmentation (time stretching, pitch shifting, noise addition) and by using additional weakly-labeled audio datasets.
The Modality Gap
Every contrastive model trained on cross-modal data exhibits the modality gap — embeddings from different modalities (image vs text, audio vs text) occupy distinct sub-regions of the shared space, separated by a systematic offset.
In CLIP, if you compute the average image embedding and the average text embedding, they are far apart despite being in the "same" space. The gap is consistent: all image embeddings live in a cone-shaped region, and all text embeddings live in a different cone-shaped region. The cones overlap but are offset.
Why the Gap Exists
The modality gap emerges from two factors:
1. Initialization bias: The image and text encoders are initialized differently (ViT vs GPT-like transformer). Their initial embeddings occupy different regions of the space, and training does not fully close this gap because it only needs to align *relative* similarities, not absolute positions.
2. Temperature-driven uniformity: Low temperature pushes all embeddings to be uniformly spread on the hypersphere. Within each modality, embeddings spread to maximize entropy. But the spreading happens independently within each modality's sub-region. The two regions settle into a configuration where they are "near" each other (cross-modal matching works) but not "on top of" each other.
Practical Impact
The modality gap means:
1. Cross-modal similarity scores are lower than within-modal scores. A text query's similarity to a matching image (say, 0.28) is typically lower than the similarity between two semantically similar images (say, 0.85). Your threshold for cross-modal search should be much lower than for within-modal search.
2. Nearest-neighbor search can fail. If you mix image and text embeddings in the same vector index and run a text query, the nearest neighbors will always be other text embeddings (same modality, higher similarity) rather than matching images. The standard workaround is to search only across modalities: text queries search only image embeddings, and vice versa.
3. Fusion helps. If a document has both a text description and an image, generating embeddings for both and fusing the search results (via RRF or learned combination) consistently outperforms searching either modality alone.
Hard Negatives: Making Training Harder on Purpose
The quality of contrastive embeddings depends heavily on the quality of negative examples. Random negatives (any other item in the batch) work for initial training but plateau quickly. Hard negatives — items that are similar to the anchor but not actual matches — force the model to learn finer distinctions.
Types of Hard Negatives
In-batch hard negatives: The hardest negatives that naturally occur in the batch. With large batch sizes, there will be items that share some semantic overlap with the anchor (e.g., two different dog images). These are "free" hard negatives.
Mined hard negatives: Use a preliminary model to find near-miss items, then include them in training batches. For text embeddings (BGE, E5), this typically means finding passages that are topically related but not actually relevant to the query.
Synthetic hard negatives: Generate confusing negatives using LLMs or data augmentation. For example, given the caption "a red car parked on a street," a synthetic hard negative might be "a blue car parked on a street" — forcing the model to attend to attributes, not just objects.
The Hard Negative Trap
Too many hard negatives cause mode collapse — the model learns to map everything to a small region of the embedding space, producing embeddings that all look alike. The loss drops to zero because the model finds a trivially uniform solution.
The standard mitigation is to mix hard negatives with easy negatives at a controlled ratio. BGE uses a 1:7 ratio (1 hard negative for every 7 random negatives). Training first with random negatives, then introducing hard negatives in later stages, is another common strategy.
Evaluating Contrastive Embeddings
Standard Benchmarks
Zero-shot classification (images): Encode class names as text, encode test images, find the closest text embedding for each image. Report accuracy.
Retrieval metrics: Given a query, rank all candidates by cosine similarity. Report:
MTEB (Massive Text Embedding Benchmark): 56+ tasks covering retrieval, classification, clustering, reranking, and semantic textual similarity for text embeddings.
Production-Relevant Metrics
Benchmarks test models on clean, curated datasets. Production data is messier. Additional metrics to track:
Distribution coverage: Do your embeddings span the space or cluster in a small region? Compute the average pairwise distance between embeddings. If it is very low, the model may have collapsed or your data is too homogeneous.
Cross-modal gap magnitude: Compute the mean embedding per modality and measure their distance. A gap > 0.5 (in cosine distance) suggests cross-modal search will be unreliable without threshold calibration.
Recall at your actual retrieval cutoff: If your pipeline returns top-20 results, Recall@1000 is irrelevant. Measure Recall@20 on a representative sample of queries.
Debugging Common Embedding Issues
"All my search results have the same similarity score"
Cause: Embedding collapse — the model maps everything to a similar region. This happens with over-training on hard negatives, too-low temperature, or degenerate data.
Fix: Check the standard deviation of cosine similarities in your index. For healthy CLIP embeddings, the std of random pair similarities should be ~0.04-0.08. If it is < 0.01, the embeddings have collapsed.
"Cross-modal search returns irrelevant results but within-modal search works fine"
Cause: Modality gap is too large. The text and image embeddings are in separate regions of the space, and your similarity threshold cannot bridge the gap.
Fix: Calibrate thresholds per query type. Or use a learned projection to align modalities post-hoc (a small linear layer trained on matched pairs to reduce the gap).
"Search works for common queries but fails on rare or specific ones"
Cause: The model was trained on generic data and has not seen your domain's concepts. Contrastive models generalize through the composition of learned features, but truly novel concepts (specific industrial parts, niche medical terminology) may not be representable.
Fix: Fine-tune with domain-specific (query, positive, negative) triples. Even 1,000-10,000 domain triples can dramatically improve recall for your specific vocabulary.
"Adding more data made search quality worse"
Cause: The new data introduced false negatives. If your dataset has duplicate or near-duplicate items with different labels, the contrastive loss penalizes the model for correctly recognizing their similarity.
Fix: Deduplicate your training data. Remove or merge items that are semantically identical but have different labels/captions.
Building Search Pipelines with Contrastive Models
Single-Model Pipeline
The simplest approach: one contrastive model, one vector index, cosine similarity search.
# Index time
image_embedding = clip.encode_image(image)
vector_store.upsert(id=doc_id, vector=image_embedding)
# Query time
query_embedding = clip.encode_text(query)
results = vector_store.search(query_embedding, top_k=20)
When it works: Homogeneous content, single modality, broad queries.
When it fails: Mixed modalities (images + text + audio), specific queries, domain-specific vocabulary.
Multi-Model Fusion Pipeline
Use multiple contrastive models and fuse their results:
# Index time: generate embeddings from multiple models
clip_emb = clip.encode_image(image)
siglip_emb = siglip.encode_image(image)
text_emb = bge.encode(caption)
# Store each in its own index
clip_index.upsert(id=doc_id, vector=clip_emb)
siglip_index.upsert(id=doc_id, vector=siglip_emb)
text_index.upsert(id=doc_id, vector=text_emb)
# Query time: search each index, fuse with RRF
clip_results = clip_index.search(clip.encode_text(query), top_k=100)
siglip_results = siglip_index.search(siglip.encode_text(query), top_k=100)
text_results = text_index.search(bge.encode(query), top_k=100)
fused = reciprocal_rank_fusion([clip_results, siglip_results, text_results])
When to use: When different models have complementary strengths. CLIP may excel at objects and scenes; SigLIP may handle text-in-images better; BGE captures textual semantics that visual models miss.
Agent-Driven Adaptive Retrieval
An agent decides which model(s) to query based on the input:
def agent_search(query: str, context: dict) -> list:
# Agent analyzes the query to decide retrieval strategy
if mentions_visual_content(query):
results = clip_search(query, top_k=50)
elif mentions_audio_content(query):
results = clap_search(query, top_k=50)
else:
results = text_search(query, top_k=50)
# Agent can refine with follow-up searches
if results[0].score < confidence_threshold:
# Try a different modality
fallback = multi_model_fusion(query)
results = merge(results, fallback)
return results
When to use: When the query space is diverse and a single retrieval strategy cannot cover all cases.
The Training Data Question
Contrastive models are only as good as their training pairs. Understanding what data went into a model predicts its strengths and weaknesses:
| Model | Training Data | Size | Strengths | Weaknesses |
| CLIP (OpenAI) | WebImageText (web-scraped) | 400M pairs | Broad coverage, objects, scenes | Specialized domains, fine-grained attributes |
| SigLIP 2 (Google) | WebLI (web-scraped, filtered) | 10B+ pairs | Text-in-images, multilingual, detailed descriptions | Less common in open-source tooling |
| CLAP (LAION) | AudioCaps + AudioSet + FreeSound | ~630K pairs | Environmental sounds, music classification | Speech content, rare sounds |
| BGE (BAAI) | Curated text pairs + synthetic | ~1B pairs | Text retrieval, multilingual | No visual understanding |
| Jina CLIP v2 | Curated multimodal | 400M+ pairs | Long captions, document screenshots | Smaller community |
Common Pitfalls
Comparing scores across models. A cosine similarity of 0.30 from CLIP and 0.30 from SigLIP do not indicate the same confidence. Each model's score distribution is shaped by its training temperature, batch size, and data distribution. Always calibrate thresholds per model.
Using L2 distance instead of cosine similarity. Contrastive models are trained with cosine similarity. The embeddings are typically L2-normalized, making cosine similarity equivalent to dot product. Using raw L2 distance without normalization produces meaningless rankings.
Ignoring the modality gap in mixed indexes. If you store image embeddings and text embeddings in the same vector index and run nearest-neighbor search, results will be dominated by same-modality matches. Always search across modalities (text query → image index, not text query → mixed index).
Fine-tuning with a learning rate that is too high. Contrastive models are pre-trained on massive data with carefully tuned learning rates (typically 1e-5 to 5e-4). Fine-tuning with 1e-3 will destroy the learned representations in a few steps. Use learning rates 10-100x smaller than from-scratch training, and freeze the first several layers.
Assuming embedding dimensions are comparable. CLIP produces 512 or 768-dim embeddings. BGE produces 1024-dim. SigLIP produces 1152-dim. These are not interchangeable — you cannot concatenate embeddings from different models or compare them directly. Each model's embedding space is its own learned coordinate system.