siglip-base-patch16-224
by Google
Sigmoid Loss for Language Image Pre-Training — efficient contrastive learning
google/siglip-base-patch16-224

Overview
SigLIP replaces CLIP's softmax-based contrastive loss with a simple pairwise sigmoid loss, enabling more efficient training on larger batch sizes without requiring a global normalization step.
On Mixpeek, SigLIP offers a lighter-weight alternative to CLIP for visual embedding extraction, with comparable accuracy on many benchmarks while being faster to run at inference time.
Architecture
Vision Transformer (ViT-B/16) backbone with 12 layers, 768-dim hidden size, and 12 attention heads, operating on 224×224 inputs split into 16×16 patches. Trained with a pairwise sigmoid contrastive loss instead of softmax, which removes the global normalization over all image-text pairs in the batch.
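The pairwise loss can be sketched in a few lines: each image-text pair is scored independently with a sigmoid, labeled +1 for the matched pair and -1 otherwise, so no softmax over the batch is needed. This is an illustrative sketch of the loss from the paper, not part of the Mixpeek SDK; `t` (temperature) and `b` (bias) are learned scalars in the original training setup.

```typescript
// Dot product of two equal-length vectors.
function dot(a: number[], b: number[]): number {
  return a.reduce((s, v, i) => s + v * b[i], 0);
}

// Pairwise sigmoid loss over a batch of L2-normalized embeddings.
// imgEmb, txtEmb: shape [n][d]; row i of each is a matched pair.
function siglipLoss(
  imgEmb: number[][],
  txtEmb: number[][],
  t: number, // learned temperature
  b: number  // learned bias
): number {
  const n = imgEmb.length;
  let loss = 0;
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < n; j++) {
      const z = i === j ? 1 : -1; // +1 for matched pair, -1 otherwise
      const logit = t * dot(imgEmb[i], txtEmb[j]) + b;
      loss += Math.log1p(Math.exp(-z * logit)); // -log sigmoid(z * logit)
    }
  }
  return loss / n; // averaged over images
}
```

Because every pair contributes an independent sigmoid term, the loss decomposes across devices and scales to large batches without synchronizing a normalization constant.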
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/image.jpg" },
  feature_extractors: [{
    name: "image_embedding",
    version: "v1",
    params: {
      model_id: "google/siglip-base-patch16-224"
    }
  }]
});

Capabilities
- Efficient contrastive image-text learning
- 768-dimensional dense vector embeddings
- Lower memory footprint than CLIP ViT-L
- Strong zero-shot classification performance
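Once extracted, the 768-dimensional embeddings can be compared with cosine similarity to rank candidates against a query. A minimal sketch; the helper below is illustrative and not part of the Mixpeek SDK:

```typescript
// Cosine similarity between two embedding vectors of equal length
// (e.g. the 768-dim vectors this model produces).
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank stored embeddings against a query embedding, highest first.
function rank(query: number[], docs: number[][]): number[] {
  return docs
    .map((d, i) => ({ i, score: cosine(query, d) }))
    .sort((a, b) => b.score - a.score)
    .map((r) => r.i);
}
```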
Specification
Research Paper
Sigmoid Loss for Language Image Pre-Training
arxiv.org

Build a pipeline with siglip-base-patch16-224
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.