A vision-language model that replaces CLIP's softmax contrastive loss with a sigmoid loss function, enabling more efficient training on image-text pairs and better performance on retrieval and classification tasks.
SigLIP learns to align images and text in a shared embedding space, similar to CLIP, but uses a fundamentally different training objective. While CLIP uses a softmax-based contrastive loss that treats each batch as a classification problem, SigLIP uses a sigmoid loss that independently evaluates each image-text pair as a binary classification (matching or not matching). This eliminates the need for a global normalization step across the batch, allowing more efficient training on larger datasets and producing embeddings that perform better on downstream retrieval and zero-shot classification tasks.
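The per-pair objective above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the temperature `t` and bias `b` are learned parameters in real SigLIP training, and the values used here are arbitrary placeholders.

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Sigmoid loss over a batch of L2-normalized image/text embeddings.

    Each of the n*n image-text pairs in the batch is scored as an
    independent binary classification: diagonal pairs are positives
    (label +1), off-diagonal pairs are negatives (label -1). Unlike a
    softmax contrastive loss, no normalization across the batch is
    computed. t and b stand in for SigLIP's learned temperature/bias.
    """
    n = img_emb.shape[0]
    logits = t * img_emb @ txt_emb.T + b      # (n, n) pairwise similarity logits
    labels = 2.0 * np.eye(n) - 1.0            # +1 on the diagonal, -1 elsewhere
    # Binary cross-entropy per pair: -log sigmoid(label * logit)
    pairwise_nll = np.log1p(np.exp(-labels * logits))
    return pairwise_nll.sum() / n             # average over images
```

Because each pair contributes an independent term, the loss decomposes cleanly across devices: a worker only needs the similarity scores for the pairs it owns, not a batch-wide normalizer.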
SigLIP consists of a vision encoder (typically a Vision Transformer) and a text encoder (a transformer language model) trained jointly on large-scale image-text datasets. The sigmoid loss computes a binary cross-entropy for each image-text pair independently, in contrast to the InfoNCE-style softmax loss used by CLIP: every pair in the batch, positive or negative, is scored on its own, with no normalization term computed across the batch. This makes the loss cheaper to compute in distributed training, lets it remain stable across a wide range of batch sizes, and yields embeddings with strong retrieval properties. SigLIP models are available in multiple sizes and are supported as feature extractors in Mixpeek's image processing pipeline.
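One practical consequence of the sigmoid objective is that downstream scores can be read as independent per-pair match probabilities rather than a softmax distribution over candidates. The helper below is a hypothetical sketch of zero-shot classification over precomputed embeddings; the `t` and `b` values are illustrative stand-ins for the model's learned temperature and bias.

```python
import numpy as np

def zero_shot_scores(image_emb, text_embs, t=10.0, b=-10.0):
    """Score candidate text embeddings against one image embedding.

    Returns an independent sigmoid probability per candidate; the
    scores need not sum to 1, since each pair is judged on its own.
    t and b are placeholder values, not trained parameters.
    """
    logits = t * text_embs @ image_emb + b          # one logit per candidate
    return 1.0 / (1.0 + np.exp(-logits))            # per-pair probabilities
```

Picking the highest-scoring candidate gives a zero-shot label; thresholding the probabilities instead supports "none of the above" decisions, which a softmax over candidates cannot express directly.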