siglip-base-patch16-224
by Google
Sigmoid Loss for Language Image Pre-Training — efficient contrastive learning
google/siglip-base-patch16-224

Overview
SigLIP replaces CLIP's softmax-based contrastive loss with a simple pairwise sigmoid loss, enabling more efficient training on larger batch sizes without requiring a global normalization step.
On Mixpeek, SigLIP offers a lighter-weight alternative to CLIP for visual embedding extraction, with comparable accuracy on many benchmarks while being faster to run at inference time.
Architecture
Vision Transformer (ViT-B/16) backbone with 12 layers, 768-dim hidden size, and 12 attention heads, operating on 224×224 inputs split into 16×16 patches. Trained with a pairwise sigmoid contrastive loss instead of softmax, which removes the global normalization over all image-text pairs in the batch.
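The pairwise loss can be sketched in a few lines: each image-text pair is scored independently with a sigmoid, labeled +1 for the matched pair and -1 otherwise, so no softmax over the batch is needed. This is an illustrative sketch of the loss from the paper, not part of the Mixpeek SDK; `t` (temperature) and `b` (bias) are learned scalars in the original training setup.

```typescript
// Dot product of two equal-length vectors.
function dot(a: number[], b: number[]): number {
  return a.reduce((s, v, i) => s + v * b[i], 0);
}

// Pairwise sigmoid loss over a batch of L2-normalized embeddings.
// imgEmb, txtEmb: shape [n][d]; row i of each is a matched pair.
function siglipLoss(
  imgEmb: number[][],
  txtEmb: number[][],
  t: number, // learned temperature
  b: number  // learned bias
): number {
  const n = imgEmb.length;
  let loss = 0;
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < n; j++) {
      const z = i === j ? 1 : -1; // +1 for matched pair, -1 otherwise
      const logit = t * dot(imgEmb[i], txtEmb[j]) + b;
      loss += Math.log1p(Math.exp(-z * logit)); // -log sigmoid(z * logit)
    }
  }
  return loss / n; // averaged over images
}
```

Because every pair contributes an independent sigmoid term, the loss decomposes across devices and scales to large batches without synchronizing a normalization constant.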
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/image.jpg" },
  feature_extractors: [{
    name: "image_embedding",
    version: "v1",
    params: {
      model_id: "google/siglip-base-patch16-224"
    }
  }]
});

Capabilities
- Efficient contrastive image-text learning
- 768-dimensional dense vector embeddings
- Lower memory footprint than CLIP ViT-L
- Strong zero-shot classification performance
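Once extracted, the 768-dimensional embeddings can be compared with cosine similarity to rank candidates against a query. A minimal sketch; the helper below is illustrative and not part of the Mixpeek SDK:

```typescript
// Cosine similarity between two embedding vectors of equal length
// (e.g. the 768-dim vectors this model produces).
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank stored embeddings against a query embedding, highest first.
function rank(query: number[], docs: number[][]): number[] {
  return docs
    .map((d, i) => ({ i, score: cosine(query, d) }))
    .sort((a, b) => b.score - a.score)
    .map((r) => r.i);
}
```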
Specification
Research Paper
Sigmoid Loss for Language Image Pre-Training
arxiv.org

Build a pipeline with siglip-base-patch16-224
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.