Contrastive Learning: How CLIP, SigLIP, and CLAP Actually Work

Why This Matters for Agents

Every cross-modal search query your agent runs ("find frames matching this description," "retrieve audio that sounds like this," "find documents similar to this image") depends on contrastive learning. CLIP, SigLIP, CLAP, BGE, and virtually every embedding model used for multimodal retrieval was trained with some variant of contrastive learning.

Previous guides in this series covered what these models do. This guide covers how they learn: the training dynamics that determine whether your search results are good or garbage. Understanding contrastive learning is not academic curiosity. It directly predicts:

Why your text-to-image search returns irrelevant results at certain thresholds

Why the same cosine similarity score means different things for different models

Why adding more data sometimes makes embeddings worse

Why your embeddings cluster by modality instead of by meaning

Contrastive training scores a batch as an N by N grid of cosine similarities where the diagonal holds the true pairings; InfoNCE pulls the diagonal together and pushes everything else apart, which is why batch size is a hyperparameter rather than a throughput knob.

See the full diagram →

The Core Idea

Contrastive learning trains a model to answer one question: are these two things related or not?

Given a batch of N (anchor, positive) pairs, the model learns to produce embeddings where: 1. The anchor and its positive have high similarity (close in embedding space) 2. The anchor and all other items in the batch have low similarity (far apart)

The "anchor" and "positive" can be anything: an image and its caption (CLIP), two augmented views of the same image (SimCLR), an audio clip and its text description (CLAP), or a query and a relevant document (BGE).

What makes contrastive learning powerful is that you never need to define similarity explicitly. You just need pairs that should match and let the loss function push everything else apart. The model discovers what "similar" means by learning which features matter for the matching task.

The InfoNCE Loss

The dominant contrastive loss function is InfoNCE (Information Noise-Contrastive Estimation), introduced in 2018 and used by CLIP, CLAP, and most contrastive models.

For a batch of N pairs, consider one anchor embedding a_i and its positive p_i. The loss for this pair is:

L_i = -log( exp(sim(a_i, p_i) / τ) / Σ_j exp(sim(a_i, p_j) / τ) )

Where:

sim(x, y) is cosine similarity between embeddings

τ (tau) is a learnable temperature parameter

The sum in the denominator runs over all positives in the batch, including the correct one

This is a softmax cross-entropy loss where the "correct class" is the matching positive and the "wrong classes" are all other items in the batch. The total loss averages over all N anchors.

Intuition: For each anchor, the model assigns a probability to every item in the batch being the correct match. Training pushes the probability mass toward the true match.

Why Batch Size Matters Enormously

The denominator sums over all N items in the batch. Each non-matching item acts as a negative example. More negatives means the model must learn finer distinctions to identify the correct positive.

CLIP was trained with a batch size of 32,768. This means each anchor is contrasted against 32,767 negatives simultaneously. At this scale, the model cannot succeed by learning coarse features ("this is a photo" vs "this is text"). It must learn fine-grained semantic alignment.

With a batch size of 32, the task is much easier: the model can often find the correct match by superficial cues. This is why small-batch contrastive training produces poor embeddings even with the same architecture and data.

Practical implication: If you fine-tune a contrastive model on your domain data with small batches, the resulting embeddings will be less discriminative than the original. Minimum batch size of 256 is a common rule of thumb; 1024+ is better.

Temperature: The Most Misunderstood Parameter

The temperature τ controls the "sharpness" of the similarity distribution. It is arguably the most important hyperparameter in contrastive learning.

Low temperature (τ → 0): The softmax becomes very sharp. Only the highest-similarity item gets significant probability mass. The model aggressively separates positives from negatives. This pushes embeddings toward a uniform distribution on the hypersphere: maximum spread, maximum discrimination.

High temperature (τ → ∞): The softmax becomes flat. All items get roughly equal probability regardless of similarity. The gradients are weak and the model learns slowly.

CLIP uses a learnable log-temperature initialized at log(1/0.07) ≈ 2.66, clamped to a maximum of log(100) ≈ 4.6. During training, the temperature typically converges to around 0.01, which is very sharp. This extreme sharpness is why CLIP embeddings have a distinctive property: most pairs have very low cosine similarity (near 0), and even highly similar items rarely exceed 0.35-0.40.

Why this matters for search: When you set a similarity threshold for retrieval (e.g., "return results with similarity > 0.25"), the right threshold depends on the temperature the model was trained with. A CLIP similarity of 0.30 is a strong match. A BGE similarity of 0.30 might be weak. There is no universal "good" threshold: it is model-specific and determined by training temperature.

CLIP: The Softmax Formulation

CLIP (Contrastive Language-Image Pre-training, OpenAI 2021) trains two encoders simultaneously:

Image → Vision Transformer (ViT) → image embedding (512/768-dim)
Text  → Transformer (GPT-like)   → text embedding (512/768-dim)

Both projected to shared dimensionality via learned linear layers

Given a batch of N (image, text) pairs, CLIP computes an N×N matrix of cosine similarities. The loss is symmetric:

L_image = (1/N) Σ_i CrossEntropy(row_i, target=i)
L_text  = (1/N) Σ_i CrossEntropy(col_i, target=i)
L_total = (L_image + L_text) / 2

Each row represents one image contrasted against all texts. Each column represents one text contrasted against all images. The target is the diagonal: each image should match its own text.

The Softmax Problem

The softmax denominator grows linearly with batch size. For a batch of 32,768, computing the full N×N similarity matrix requires 32,768² ≈ 1 billion similarity computations. This is computationally expensive and, more importantly, the softmax formulation creates an implicit competition between all pairs.

If two images in the batch have genuinely similar captions (e.g., "a dog running in a park" and "a puppy running on grass"), the model is penalized for assigning high similarity to both. The softmax forces it to choose one. This is a false negative problem: the model learns to avoid features shared by multiple positive pairs, which can suppress genuinely useful semantic similarities.

SigLIP: The Sigmoid Alternative

SigLIP (Sigmoid Loss for Language-Image Pre-training, Google 2023) replaces the softmax with independent sigmoid losses:

For each pair (i, j):
  y_ij = 1 if (i, j) is a matching pair, -1 otherwise
  L_ij = -log σ(y_ij · sim(i, j) / τ + b)

L_total = (1/N²) Σ_i Σ_j L_ij

Where σ is the sigmoid function and b is a learnable bias.

The key difference: each (image, text) pair is classified independently as matching or not, with no softmax competition. Two similar captions can both have high similarity to their respective images without penalty.

Why SigLIP is Better (Sometimes)

1. No false negatives: Similar-but-not-paired items do not compete. The model can learn that "dog in park" and "puppy on grass" are both close to park dog images. 2. Scales better: No N×N softmax, the loss decomposes into independent binary classifications. SigLIP was trained with chunks of 32K pairs but evaluated on smaller sub-batches. 3. Simpler temperature dynamics: The bias term b absorbs some of the work that temperature does in InfoNCE, leading to more stable training.

SigLIP 2 (Google 2024-2025) extends this with better data curation and multi-resolution training, achieving SOTA on many zero-shot benchmarks with fewer parameters than CLIP.

But: The softmax formulation has an advantage when the batch does contain true duplicates, it forces the model to find the exact match, which can improve fine-grained discrimination. SigLIP trades fine-grained precision for better recall.

CLAP: Contrastive Learning for Audio

CLAP (Contrastive Language-Audio Pretraining) applies the same framework to audio:

Audio → HTSAT Encoder (Swin Transformer on mel spectrograms) → 512-dim
Text  → RoBERTa Encoder                                     → 512-dim

InfoNCE loss on (audio, text) pairs

The architecture is conceptually identical to CLIP, but the audio encoder processes mel spectrograms (2D time-frequency representations) instead of images. HTSAT uses hierarchical windowed self-attention, similar to Swin Transformer but adapted for the rectangular aspect ratio of spectrograms (long in time, short in frequency).

Audio-Specific Challenges

Variable length: Audio clips range from 1 second to minutes. CLAP handles this by padding/truncating to a fixed window (typically 10 seconds) and using a global average pooling to produce a single embedding. Longer audio must be chunked.

Temporal structure: Unlike images where spatial information is crucial, audio has strong temporal ordering. The HTSAT encoder processes the spectrogram with shifted windows that span the time axis, capturing temporal patterns (speech rhythm, musical phrases, sound event sequences).

Sparse positive labels: Most audio datasets have short, noisy captions. AudioCaps (the main CLAP training dataset) has ~50K audio-caption pairs, compared to CLIP's 400M image-text pairs. This scarcity is partially compensated by data augmentation (time stretching, pitch shifting, noise addition) and by using additional weakly-labeled audio datasets.

The Modality Gap

Every contrastive model trained on cross-modal data exhibits the modality gap: embeddings from different modalities (image vs text, audio vs text) occupy distinct sub-regions of the shared space, separated by a systematic offset.

In CLIP, if you compute the average image embedding and the average text embedding, they are far apart despite being in the "same" space. The gap is consistent: all image embeddings live in a cone-shaped region, and all text embeddings live in a different cone-shaped region. The cones overlap but are offset.

Why the Gap Exists

The modality gap emerges from two factors:

1. Initialization bias: The image and text encoders are initialized differently (ViT vs GPT-like transformer). Their initial embeddings occupy different regions of the space, and training does not fully close this gap because it only needs to align *relative* similarities, not absolute positions.

2. Temperature-driven uniformity: Low temperature pushes all embeddings to be uniformly spread on the hypersphere. Within each modality, embeddings spread to maximize entropy. But the spreading happens independently within each modality's sub-region. The two regions settle into a configuration where they are "near" each other (cross-modal matching works) but not "on top of" each other.

Practical Impact

The modality gap means:

1. Cross-modal similarity scores are lower than within-modal scores. A text query's similarity to a matching image (say, 0.28) is typically lower than the similarity between two semantically similar images (say, 0.85). Your threshold for cross-modal search should be much lower than for within-modal search.

2. Nearest-neighbor search can fail. If you mix image and text embeddings in the same vector index and run a text query, the nearest neighbors will always be other text embeddings (same modality, higher similarity) rather than matching images. The standard workaround is to search only across modalities: text queries search only image embeddings, and vice versa.

3. Fusion helps. If a document has both a text description and an image, generating embeddings for both and fusing the search results (via RRF or learned combination) consistently outperforms searching either modality alone.

Hard Negatives: Making Training Harder on Purpose

The quality of contrastive embeddings depends heavily on the quality of negative examples. Random negatives (any other item in the batch) work for initial training but plateau quickly. Hard negatives, items that are similar to the anchor but not actual matches, force the model to learn finer distinctions.

Types of Hard Negatives

In-batch hard negatives: The hardest negatives that naturally occur in the batch. With large batch sizes, there will be items that share some semantic overlap with the anchor (e.g., two different dog images). These are "free" hard negatives.

Mined hard negatives: Use a preliminary model to find near-miss items, then include them in training batches. For text embeddings (BGE, E5), this typically means finding passages that are topically related but not actually relevant to the query.

Synthetic hard negatives: Generate confusing negatives using LLMs or data augmentation. For example, given the caption "a red car parked on a street," a synthetic hard negative might be "a blue car parked on a street", forcing the model to attend to attributes, not just objects.

The Hard Negative Trap

Too many hard negatives cause mode collapse: the model learns to map everything to a small region of the embedding space, producing embeddings that all look alike. The loss drops to zero because the model finds a trivially uniform solution.

The standard mitigation is to mix hard negatives with easy negatives at a controlled ratio. BGE uses a 1:7 ratio (1 hard negative for every 7 random negatives). Training first with random negatives, then introducing hard negatives in later stages, is another common strategy.

Evaluating Contrastive Embeddings

Standard Benchmarks

Zero-shot classification (images): Encode class names as text, encode test images, find the closest text embedding for each image. Report accuracy.

Retrieval metrics: Given a query, rank all candidates by cosine similarity. Report:

Recall@K: Fraction of queries where the correct match is in the top K results

Mean Reciprocal Rank (MRR): Average of 1/rank of the first correct result

NDCG@K: Normalized discounted cumulative gain, accounting for graded relevance

MTEB (Massive Text Embedding Benchmark): 56+ tasks covering retrieval, classification, clustering, reranking, and semantic textual similarity for text embeddings.

Production-Relevant Metrics

Benchmarks test models on clean, curated datasets. Production data is messier. Additional metrics to track:

Distribution coverage: Do your embeddings span the space or cluster in a small region? Compute the average pairwise distance between embeddings. If it is very low, the model may have collapsed or your data is too homogeneous.

Cross-modal gap magnitude: Compute the mean embedding per modality and measure their distance. A gap > 0.5 (in cosine distance) suggests cross-modal search will be unreliable without threshold calibration.

Recall at your actual retrieval cutoff: If your pipeline returns top-20 results, Recall@1000 is irrelevant. Measure Recall@20 on a representative sample of queries.

Debugging Common Embedding Issues

"All my search results have the same similarity score"

Cause: Embedding collapse, the model maps everything to a similar region. This happens with over-training on hard negatives, too-low temperature, or degenerate data.

Fix: Check the standard deviation of cosine similarities in your index. For healthy CLIP embeddings, the std of random pair similarities should be ~0.04-0.08. If it is < 0.01, the embeddings have collapsed.

"Cross-modal search returns irrelevant results but within-modal search works fine"

Cause: Modality gap is too large. The text and image embeddings are in separate regions of the space, and your similarity threshold cannot bridge the gap.

Fix: Calibrate thresholds per query type. Or use a learned projection to align modalities post-hoc (a small linear layer trained on matched pairs to reduce the gap).

"Search works for common queries but fails on rare or specific ones"

Cause: The model was trained on generic data and has not seen your domain's concepts. Contrastive models generalize through the composition of learned features, but truly novel concepts (specific industrial parts, niche medical terminology) may not be representable.

Fix: Fine-tune with domain-specific (query, positive, negative) triples. Even 1,000-10,000 domain triples can dramatically improve recall for your specific vocabulary.

"Adding more data made search quality worse"

Cause: The new data introduced false negatives. If your dataset has duplicate or near-duplicate items with different labels, the contrastive loss penalizes the model for correctly recognizing their similarity.

Fix: Deduplicate your training data. Remove or merge items that are semantically identical but have different labels/captions.

Building Search Pipelines with Contrastive Models

Single-Model Pipeline

The simplest approach: one contrastive model, one vector index, cosine similarity search.

# Index time
image_embedding = clip.encode_image(image)
vector_store.upsert(id=doc_id, vector=image_embedding)

# Query time
query_embedding = clip.encode_text(query)
results = vector_store.search(query_embedding, top_k=20)

When it works: Homogeneous content, single modality, broad queries.

When it fails: Mixed modalities (images + text + audio), specific queries, domain-specific vocabulary.

Multi-Model Fusion Pipeline

Use multiple contrastive models and fuse their results:

# Index time: generate embeddings from multiple models
clip_emb = clip.encode_image(image)
siglip_emb = siglip.encode_image(image)
text_emb = bge.encode(caption)

# Store each in its own index
clip_index.upsert(id=doc_id, vector=clip_emb)
siglip_index.upsert(id=doc_id, vector=siglip_emb)
text_index.upsert(id=doc_id, vector=text_emb)

# Query time: search each index, fuse with RRF
clip_results = clip_index.search(clip.encode_text(query), top_k=100)
siglip_results = siglip_index.search(siglip.encode_text(query), top_k=100)
text_results = text_index.search(bge.encode(query), top_k=100)

fused = reciprocal_rank_fusion([clip_results, siglip_results, text_results])

When to use: When different models have complementary strengths. CLIP may excel at objects and scenes; SigLIP may handle text-in-images better; BGE captures textual semantics that visual models miss.

Agent-Driven Adaptive Retrieval

An agent decides which model(s) to query based on the input:

def agent_search(query: str, context: dict) -> list:
    # Agent analyzes the query to decide retrieval strategy
    if mentions_visual_content(query):
        results = clip_search(query, top_k=50)
    elif mentions_audio_content(query):
        results = clap_search(query, top_k=50)
    else:
        results = text_search(query, top_k=50)

    # Agent can refine with follow-up searches
    if results[0].score < confidence_threshold:
        # Try a different modality
        fallback = multi_model_fusion(query)
        results = merge(results, fallback)

    return results

When to use: When the query space is diverse and a single retrieval strategy cannot cover all cases.

The Training Data Question

Contrastive models are only as good as their training pairs. Understanding what data went into a model predicts its strengths and weaknesses:

Model

Training Data

Size

Strengths

Weaknesses

CLIP (OpenAI)	WebImageText (web-scraped)	400M pairs	Broad coverage, objects, scenes	Specialized domains, fine-grained attributes
SigLIP 2 (Google)	WebLI (web-scraped, filtered)	10B+ pairs	Text-in-images, multilingual, detailed descriptions	Less common in open-source tooling
CLAP (LAION)	AudioCaps + AudioSet + FreeSound	~630K pairs	Environmental sounds, music classification	Speech content, rare sounds
BGE (BAAI)	Curated text pairs + synthetic	~1B pairs	Text retrieval, multilingual	No visual understanding
Jina CLIP v2	Curated multimodal	400M+ pairs	Long captions, document screenshots	Smaller community

Rule of thumb: If your domain was well-represented in the training data, the model works out of the box. If not, fine-tuning on domain pairs is required. Medical imaging, satellite imagery, industrial inspection, and niche audio domains almost always require fine-tuning.

Common Pitfalls

Comparing scores across models. A cosine similarity of 0.30 from CLIP and 0.30 from SigLIP do not indicate the same confidence. Each model's score distribution is shaped by its training temperature, batch size, and data distribution. Always calibrate thresholds per model.

Using L2 distance instead of cosine similarity. Contrastive models are trained with cosine similarity. The embeddings are typically L2-normalized, making cosine similarity equivalent to dot product. Using raw L2 distance without normalization produces meaningless rankings.

Ignoring the modality gap in mixed indexes. If you store image embeddings and text embeddings in the same vector index and run nearest-neighbor search, results will be dominated by same-modality matches. Always search across modalities (text query → image index, not text query → mixed index).

Fine-tuning with a learning rate that is too high. Contrastive models are pre-trained on massive data with carefully tuned learning rates (typically 1e-5 to 5e-4). Fine-tuning with 1e-3 will destroy the learned representations in a few steps. Use learning rates 10-100x smaller than from-scratch training, and freeze the first several layers.

Assuming embedding dimensions are comparable. CLIP produces 512 or 768-dim embeddings. BGE produces 1024-dim. SigLIP produces 1152-dim. These are not interchangeable: you cannot concatenate embeddings from different models or compare them directly. Each model's embedding space is its own learned coordinate system.