A training paradigm where models learn meaningful representations from unlabeled data by solving automatically generated pretext tasks. Self-supervised learning is the foundation of modern pretrained models in vision, language, and multimodal AI.
Self-supervised learning creates supervision signals from the data itself rather than relying on human labels. Pretext tasks include predicting masked portions of the input (masked language modeling, masked image modeling), identifying which augmented views come from the same underlying example (contrastive learning), or predicting future observations from past ones. The representations the model learns while solving these pretext tasks are general enough to transfer well to downstream tasks.
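As a concrete illustration, the following is a minimal PyTorch sketch of a masked-prediction pretext task: tokens are hidden at random and the model is trained to recover them from context, with the loss computed only at masked positions. The `ToyEncoder` module, vocabulary size, masking rate, and other hyperparameters are illustrative assumptions, not a reference implementation of any particular model.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, MASK_ID, HIDDEN = 1000, 0, 64  # toy vocabulary; id 0 reserved as [MASK]

class ToyEncoder(nn.Module):
    """Stand-in for a Transformer encoder: embeds tokens, contextualizes them,
    and predicts a distribution over the vocabulary at every position."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.layer = nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=4, batch_first=True)
        self.head = nn.Linear(HIDDEN, VOCAB_SIZE)

    def forward(self, tokens):
        return self.head(self.layer(self.embed(tokens)))

def masked_lm_loss(encoder, tokens, mask_prob=0.15):
    """Create supervision from the data itself: hide tokens, predict them back."""
    mask = torch.rand(tokens.shape) < mask_prob      # choose positions to hide
    corrupted = tokens.clone()
    corrupted[mask] = MASK_ID                        # replace chosen tokens with [MASK]
    logits = encoder(corrupted)                      # (batch, seq, vocab)
    # The loss is computed only on the masked positions.
    return nn.functional.cross_entropy(logits[mask], tokens[mask])

encoder = ToyEncoder()
tokens = torch.randint(1, VOCAB_SIZE, (8, 32))       # a batch of unlabeled "text"
loss = masked_lm_loss(encoder, tokens)
loss.backward()
```

The same recipe underlies masked image modeling, where the "tokens" are image patches and the reconstruction target is pixel values or discrete patch codes rather than vocabulary ids.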
Major paradigms include contrastive methods (SimCLR, MoCo, CLIP), masked prediction (BERT, MAE, BEiT), and self-distillation (DINO, BYOL). Contrastive methods maximize agreement between representations of different views of the same example while pushing apart representations of different examples, as in the sketch below. Masked prediction reconstructs hidden portions of the input from the visible context. Self-distillation trains a student network to match the outputs of a slowly updated (momentum) teacher across views, without explicit negative pairs. These approaches scale to billions of unlabeled examples and produce embeddings useful for downstream classification, retrieval, and generation tasks.
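The contrastive objective can be sketched just as compactly. This simplified InfoNCE-style loss treats the two augmented views of each example in a batch as positives and every other example as a negative, in the spirit of SimCLR and CLIP; the temperature, embedding dimension, and batch size below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    """Pull the two views of each example together; push all other examples apart.
    z1, z2: (batch, dim) embeddings of two augmented views of the same batch."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)   # unit-norm embeddings
    logits = z1 @ z2.t() / temperature                        # pairwise cosine similarities
    targets = torch.arange(z1.size(0))                        # positives lie on the diagonal
    # Symmetric cross-entropy: view 1 -> view 2 and view 2 -> view 1.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Usage: embeddings produced by an encoder from two random augmentations of one batch.
z1, z2 = torch.randn(16, 128), torch.randn(16, 128)
loss = contrastive_loss(z1, z2)
```

The temperature controls how sharply the loss concentrates on the hardest negatives; in practice it is a tuned hyperparameter, and larger batches supply more negatives per positive pair.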