SigLIP (Sigmoid Loss for Language-Image Pre-training) is a vision-language model that replaces CLIP's softmax contrastive loss with a pairwise sigmoid loss, enabling more efficient training on image-text pairs and often better performance on retrieval and zero-shot classification tasks.
SigLIP learns to align images and text in a shared embedding space, similar to CLIP, but uses a fundamentally different training objective. CLIP uses a softmax-based contrastive loss that treats each image (and each text) as a classification problem over every candidate in the batch. SigLIP instead uses a sigmoid loss that evaluates each image-text pair independently as a binary classification (matching or not matching). This removes the softmax normalization over the whole batch, which makes training more efficient at large batch sizes and produces embeddings that tend to perform better on downstream retrieval and zero-shot classification tasks.
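The two objectives can be written side by side. This is a sketch in notation that does not appear on this page: $\mathbf{x}_i$ and $\mathbf{y}_j$ are assumed to be L2-normalized image and text embeddings, $t$ a temperature, $b$ a bias, $\mathcal{B}$ the batch, and $z_{ij} = 1$ for a matching pair ($i = j$) and $-1$ otherwise.

```latex
% Softmax (CLIP-style) loss: every term is normalized over the whole batch.
\mathcal{L}_{\text{softmax}} =
  -\frac{1}{2|\mathcal{B}|} \sum_{i=1}^{|\mathcal{B}|}
  \left(
    \log \frac{e^{t\,\mathbf{x}_i \cdot \mathbf{y}_i}}{\sum_{j=1}^{|\mathcal{B}|} e^{t\,\mathbf{x}_i \cdot \mathbf{y}_j}}
    +
    \log \frac{e^{t\,\mathbf{x}_i \cdot \mathbf{y}_i}}{\sum_{j=1}^{|\mathcal{B}|} e^{t\,\mathbf{x}_j \cdot \mathbf{y}_i}}
  \right)

% Sigmoid (SigLIP-style) loss: each pair (i, j) contributes an independent binary term.
\mathcal{L}_{\text{sigmoid}} =
  -\frac{1}{|\mathcal{B}|} \sum_{i=1}^{|\mathcal{B}|} \sum_{j=1}^{|\mathcal{B}|}
  \log \sigma\!\left( z_{ij} \left( t\,\mathbf{x}_i \cdot \mathbf{y}_j + b \right) \right),
  \qquad \sigma(u) = \frac{1}{1 + e^{-u}}
```

In the softmax form, the denominators couple every pair in the batch. In the sigmoid form, each summand depends on a single pair, which is what allows the loss to be computed in independent chunks.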
SigLIP consists of a vision encoder (typically a Vision Transformer) and a text encoder (a transformer language model) trained jointly on large-scale image-text datasets. Instead of the InfoNCE softmax loss used by CLIP, the sigmoid loss computes a binary cross-entropy term for each image-text pair independently. Every pair in the batch, matching or not, is scored on its own, so no softmax normalization across the batch is required. The result is a model that scales better with batch size and produces embeddings with strong retrieval properties. SigLIP models are available in multiple sizes and are supported as feature extractors in Mixpeek's image processing pipeline.
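The difference between the two losses is easier to see in code. The following is a minimal NumPy sketch, not SigLIP's or Mixpeek's actual implementation: the function names are hypothetical, the embeddings stand in for encoder outputs, and the default `temperature=10.0` and `bias=-10.0` are assumptions that follow the initial values reported in the SigLIP paper (in training, both are learnable).

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def sigmoid_pairwise_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss (sketch): one binary term per image-text pair."""
    img = l2_normalize(np.asarray(img_emb, dtype=np.float64))
    txt = l2_normalize(np.asarray(txt_emb, dtype=np.float64))
    logits = temperature * img @ txt.T + bias
    n = logits.shape[0]
    # +1 on the diagonal (matching pairs), -1 everywhere else.
    labels = 2.0 * np.eye(n) - 1.0
    # -log(sigmoid(u)) == log(1 + exp(-u)), computed stably with logaddexp.
    per_pair = np.logaddexp(0.0, -labels * logits)
    return per_pair.sum() / n


def softmax_contrastive_loss(img_emb, txt_emb, temperature=10.0):
    """CLIP-style softmax loss (sketch), shown only for comparison."""
    img = l2_normalize(np.asarray(img_emb, dtype=np.float64))
    txt = l2_normalize(np.asarray(txt_emb, dtype=np.float64))
    logits = temperature * img @ txt.T

    def cross_entropy_on_diagonal(l):
        # Each row is normalized over all candidates in the batch.
        log_probs = l - np.logaddexp.reduce(l, axis=1, keepdims=True)
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


if __name__ == "__main__":
    aligned = np.eye(4)                      # toy embeddings: pair i matches i
    shuffled = np.roll(aligned, 1, axis=0)   # deliberately mismatched texts
    print("sigmoid, aligned: ", sigmoid_pairwise_loss(aligned, aligned))
    print("sigmoid, shuffled:", sigmoid_pairwise_loss(aligned, shuffled))
    print("softmax, aligned: ", softmax_contrastive_loss(aligned, aligned))
    print("softmax, shuffled:", softmax_contrastive_loss(aligned, shuffled))
```

Note that `sigmoid_pairwise_loss` never computes a sum over a row or column before applying the loss, whereas `softmax_contrastive_loss` needs the full row (and column) of logits to normalize each term. Both losses should come out lower for the aligned toy embeddings than for the shuffled ones.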