A training paradigm where models learn meaningful representations from unlabeled data by solving automatically generated pretext tasks. Self-supervised learning is the foundation of modern pretrained models in vision, language, and multimodal AI.
Self-supervised learning derives its supervision signal from the data itself rather than from human labels. Common pretext tasks include predicting masked portions of the input (masked language modeling, masked image modeling), recognizing which augmented views come from the same example (contrastive learning), and predicting future observations from past ones. Solving these tasks forces the model to learn general-purpose representations, and those representations transfer well to downstream tasks.
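As a minimal sketch of how a masked-prediction pretext task turns unlabeled data into training pairs, the snippet below hides random tokens and keeps the originals as targets. The function name `make_masked_example`, the `[MASK]` placeholder, and the 15% default masking rate are illustrative assumptions, not a specific library's API.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, chosen for illustration


def make_masked_example(tokens, mask_prob=0.15, seed=0):
    """Build one self-supervised example from unlabeled tokens.

    Returns (masked_tokens, targets), where targets maps each masked
    position to the original token the model should predict.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok
            masked[i] = MASK
    # Ensure at least one target so every example carries a training signal.
    if not targets and tokens:
        i = rng.randrange(len(tokens))
        targets[i] = tokens[i]
        masked[i] = MASK
    return masked, targets


tokens = "the model learns from raw text".split()
masked, targets = make_masked_example(tokens, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```

The labels here cost nothing to produce: they are simply the tokens that were hidden, which is why the approach can use arbitrarily large unlabeled corpora.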
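Major paradigms include contrastive methods (SimCLR, MoCo, CLIP), masked prediction (BERT, MAE, BEiT), and self-distillation (DINO, BYOL). Contrastive methods pull together the embeddings of different views of the same example while pushing apart the embeddings of different examples; CLIP applies the same idea to paired images and captions. Masked prediction reconstructs hidden portions of the input from the visible context. Self-distillation trains a student network to match the output of a teacher network on a different view of the same input, with the teacher maintained as a moving average of the student. These approaches scale to billions of unlabeled examples and produce embeddings useful for downstream classification, retrieval, and generation tasks.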
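The contrastive objective can be illustrated with a simplified InfoNCE-style loss. This is a sketch under stated assumptions: embeddings are plain Python lists of non-zero vectors, `views_a[i]` and `views_b[i]` are two views of the same example, and the loss is computed in one direction only. Methods such as SimCLR use a symmetric variant with more negatives per batch; the names `cosine` and `info_nce` are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(views_a, views_b, temperature=0.1):
    """Mean one-directional InfoNCE loss over a batch.

    views_a[i] and views_b[i] form a positive pair; every views_b[j]
    with j != i acts as a negative for views_a[i].
    """
    n = len(views_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(views_a[i], views_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy with the positive pair as the correct class.
        total += log_denominator - logits[i]
    return total / n


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each row is close to its anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each row is close to the wrong anchor

print(info_nce(anchors, aligned))
print(info_nce(anchors, swapped))
```

The loss is lower when matching views have similar embeddings and mismatched views do not, which is the "maximize agreement, push apart" behavior described above. The temperature controls how sharply the loss penalizes hard negatives.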