A self-supervised learning paradigm that trains models to bring similar data points closer together and push dissimilar ones apart in embedding space. Contrastive learning is fundamental to training multimodal models like CLIP that align images and text.
Contrastive learning trains an encoder by presenting pairs or groups of examples. Positive pairs (semantically similar items) are pulled together in the embedding space, while negative pairs (dissimilar items) are pushed apart. The model learns to produce embeddings where distance reflects semantic similarity, without requiring explicit class labels.
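The mechanism can be sketched in a few lines of PyTorch. The encoder, dimensions, and margin below are hypothetical placeholders chosen for illustration; the point is only that the objective rewards high similarity for positive pairs and penalizes it for negative pairs, with no class labels involved.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy encoder for illustration; in practice this is a ResNet, ViT, or text transformer.
encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))

def embed(x):
    # L2-normalize so cosine similarity reduces to a dot product.
    return F.normalize(encoder(x), dim=-1)

anchor   = embed(torch.randn(8, 32))   # a batch of examples
positive = embed(torch.randn(8, 32))   # e.g. augmented views of the same examples
negative = embed(torch.randn(8, 32))   # unrelated examples

# Pull positives together (drive similarity toward 1) and push negatives apart
# (penalize similarity above a margin).
sim_pos = (anchor * positive).sum(dim=-1)
sim_neg = (anchor * negative).sum(dim=-1)
loss = (1 - sim_pos).mean() + F.relu(sim_neg - 0.2).mean()
loss.backward()
```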
Key loss functions include InfoNCE (the basis of CLIP's objective), NT-Xent (the normalized temperature-scaled variant used in SimCLR), and triplet loss. Training requires careful construction of positive pairs (via data augmentation or naturally paired data, such as image-caption pairs) and a negative sampling strategy. Temperature scaling controls the sharpness of the similarity distribution. Batch size matters because larger batches supply more in-batch negatives, making the contrastive task harder and the learned embeddings more discriminative.
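A minimal sketch of an InfoNCE-style loss with temperature scaling, assuming embeddings arrive as a batch where row i of one modality matches row i of the other and every other row serves as an in-batch negative; the function names and the temperature value 0.07 are illustrative, not CLIP's exact implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(queries, keys, temperature=0.07):
    """InfoNCE over a batch: row i of `queries` matches row i of `keys`;
    all other rows act as negatives. Lower temperature sharpens the softmax."""
    q = F.normalize(queries, dim=-1)
    k = F.normalize(keys, dim=-1)
    logits = q @ k.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)    # matching index = diagonal
    return F.cross_entropy(logits, targets)

def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # CLIP-style symmetric objective: average the loss in both directions.
    return 0.5 * (info_nce(image_emb, text_emb, temperature)
                  + info_nce(text_emb, image_emb, temperature))
```

Because every other example in the batch is treated as a negative, the size of the similarity matrix, and hence the difficulty of the classification task, grows directly with batch size, which is why large batches help.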