
    Multimodal Image Search with SigLIP and RRF

    Search 120K pieces of art by text, image, or both. How we built a multimodal retriever with SigLIP and reciprocal rank fusion.


    We built a visual search engine for the National Gallery of Art's public collection. Text search, image search, and hybrid queries with reciprocal rank fusion.

    Demo | Code | Video


    The Stack

    πŸ’‘
    Indexing throughput: ~2 hours for 120K images (~60ms/image)

    Why SigLIP over CLIP

    CLIP uses softmax loss: it optimizes for relative ranking within a batch. SigLIP uses sigmoid loss, treating each image-text pair as an independent binary classification.

    Practical difference: SigLIP embeddings live in a global semantic space. Similarity scores are consistent whether you're comparing 10 documents or 10 million. Better for retrieval at scale.

    The base model, siglip-base-patch16-224, hits ~84% zero-shot on ImageNet. Good enough out of the box, no fine-tuning needed for general visual similarity.
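
    To make the shared embedding space concrete, here is a minimal sketch of embedding a text query and an image with the Hugging Face checkpoint (assuming google/siglip-base-patch16-224 is the hosted version of the model above). It's illustrative, not the Mixpeek extractor itself.

    # Minimal sketch: text and image land in the same 768-d space, so one
    # cosine similarity covers text-to-image and image-to-image search.
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoProcessor

    MODEL_ID = "google/siglip-base-patch16-224"
    model = AutoModel.from_pretrained(MODEL_ID)
    processor = AutoProcessor.from_pretrained(MODEL_ID)

    # SigLIP was trained with max-length padding, so keep it for text.
    text_inputs = processor(text=["portrait of a scientist"],
                            padding="max_length", return_tensors="pt")
    image_inputs = processor(images=Image.open("painting.jpg"),  # any image from the collection
                             return_tensors="pt")

    with torch.no_grad():
        text_emb = model.get_text_features(**text_inputs)     # shape (1, 768)
        image_emb = model.get_image_features(**image_inputs)  # shape (1, 768)

    # L2-normalize so cosine similarity is a plain dot product.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    print((text_emb @ image_emb.T).item())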

    Pipeline

    The collection config:

    {
      "feature_extractor": {
        "feature_extractor_name": "image_extractor",
        "version": "v1",
        "input_mappings": { "image": "image" },
        "parameters": {
          "model": "siglip-base-patch16-224",
          "generate_thumbnail": true
        }
      }
    }
    

    Thumbnails are generated and pushed to CloudFront. The entire 120K-image batch runs as a feature extractor in a dedicated Ray job so it can fully saturate our GPUs and CPUs, orchestrated by Anyscale on GCP, which scales the cluster up and down as needed.
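
    Conceptually, the batch job looks something like the sketch below. It's illustrative only (assuming Ray Data 2.9+ and made-up bucket paths), not the managed image_extractor itself.

    # Illustrative batch-embedding sketch with Ray Data: class-based workers,
    # fractional GPUs per worker, scaled by the cluster. Paths and sizes are made up.
    import ray
    import torch
    from transformers import AutoModel, AutoProcessor

    class SiglipEmbedder:
        def __init__(self):
            self.model = AutoModel.from_pretrained("google/siglip-base-patch16-224").cuda().eval()
            self.processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

        def __call__(self, batch: dict) -> dict:
            inputs = self.processor(images=list(batch["image"]), return_tensors="pt").to("cuda")
            with torch.no_grad():
                batch["embedding"] = self.model.get_image_features(**inputs).cpu().numpy()
            return batch

    ds = ray.data.read_images("gs://nga-open-images/", size=(224, 224))  # hypothetical bucket
    ds = ds.map_batches(SiglipEmbedder, batch_size=64,
                        num_gpus=0.25, concurrency=16)                   # fractional GPUs per worker
    ds.write_parquet("gs://nga-embeddings/")                             # hypothetical output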

    πŸ’‘
    Feature extractors are fully-managed indexing pipelines composed of models, workflows and code to deliver SOTA retrieval for any filetype at terabyte scale.

    The output is then stored in a collection as documents, ready for retrieval.

    The Retriever

    When we create the retriever, we wire inputs into its stages with standard Jinja templating, which gives us three query types in a single feature_search stage:

    {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          {
            "feature_uri": "mixpeek://image_extractor@v1/google_siglip_base_v1",
            "on_empty": "skip",
            "query": {
              "input_mode": "text",
              "value": "{{INPUT.text}}"
            },
            "top_k": 250
          },
          {
            "feature_uri": "mixpeek://image_extractor@v1/google_siglip_base_v1",
            "on_empty": "skip",
            "query": {
              "input_mode": "content",
              "value": "{{INPUT.image}}"
            },
            "top_k": 250
          },
          {
            "feature_uri": "mixpeek://image_extractor@v1/google_siglip_base_v1",
            "on_empty": "skip",
            "query": {
              "input_mode": "document",
              "document_ref": {
                "collection_id": "{{INPUT.doc_ref.collection_id}}",
                "document_id": "{{INPUT.doc_ref.document_id}}"
              }
            },
            "top_k": 250
          }
        ],
        "final_top_k": 500,
        "fusion": "rrf"
      }
    }
    • Text: "portrait of a scientist" → text encoder → kNN
    • Image: upload a reference image → vision encoder → kNN
    • Document reference: look up the stored embedding → kNN (the "find similar" button)

    The feature_uri maps the query to the right index, embedding model, extractor, and version. The feature_search stage then calls a hot Ray Serve deployment with fractional GPU allocation.
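
    As a rough sketch of what a hot, fractional-GPU encoder deployment can look like in Ray Serve (illustrative; the deployment name and payload shape are assumptions, not Mixpeek's internal service):

    # Rough sketch of a fractional-GPU Ray Serve deployment for query-time encoding.
    import torch
    from ray import serve
    from transformers import AutoModel, AutoProcessor

    @serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 0.25})
    class QueryEncoder:
        def __init__(self):
            self.model = AutoModel.from_pretrained("google/siglip-base-patch16-224").cuda().eval()
            self.processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

        async def __call__(self, request):
            text = (await request.json())["text"]
            inputs = self.processor(text=[text], padding="max_length",
                                    return_tensors="pt").to("cuda")
            with torch.no_grad():
                emb = self.model.get_text_features(**inputs)
            return {"embedding": emb[0].cpu().tolist()}

    app = QueryEncoder.bind()  # deploy with serve.run(app)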

    πŸ’‘
    Learn more about the feature search stage

    "on_empty": "skip" means you pass whatever inputs you have. One query type? It runs just that search. Multiple? They get fused with RRF using default weights.

    We refer to this architecture as an Exploratory Multimodal Retriever: a single retrieval pipeline that accepts optional text, image, or document-reference inputs and produces a navigable similarity space.

    Reciprocal Rank Fusion merges ranked lists without caring about raw scores:

    score(d) = Σ_i 1/(k + rank_i(d))

    where rank_i(d) is d's position in ranked list i and k is a small constant (commonly 60).
    

    Why this matters: text-to-image similarity might cluster in [0.2, 0.4] while image-to-image clusters in [0.6, 0.9]. Score-based fusion would be biased. RRF normalizes by rank.
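
    A minimal sketch of the fusion step (generic RRF over per-search result lists, not Mixpeek's internal code):

    # Minimal RRF sketch: merge ranked lists using positions only, ignoring raw scores.
    from collections import defaultdict

    def rrf(ranked_lists, k=60, top_n=None):
        """ranked_lists: iterable of lists of doc ids, best first."""
        scores = defaultdict(float)
        for results in ranked_lists:
            for rank, doc_id in enumerate(results, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        fused = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return fused[:top_n] if top_n else fused

    # Text-to-image and image-to-image scores live in different ranges,
    # but ranks are always comparable:
    text_hits  = ["art_17", "art_03", "art_88"]
    image_hits = ["art_03", "art_42", "art_17"]
    print(rrf([text_hits, image_hits], top_n=3))
    # art_03 and art_17 rise to the top because both lists agree on them.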

    Learn more about RRF in our hybrid search university module.

    The killer query: pass a doc_ref (a reference portrait) plus text ("but wearing blue"). RRF combines structural similarity with the color constraint.

    Execution

    Since this is a named retriever, you only provide the inputs when you call it; the mapping happens at execution time.

    curl -X POST ".../retrievers/<RETRIEVER_ID>/execute" \
      -d '{
        "inputs": {
          "text": "dog outside"
        }
      }'
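
    The hybrid "killer query" from earlier is the same call with both inputs populated. A hedged sketch with Python requests, using the input names from the retriever config (endpoint and auth elided, as in the curl above):

    # Hedged sketch: one call combining a document reference and a text constraint.
    import requests

    ENDPOINT = ".../retrievers/<RETRIEVER_ID>/execute"  # fill in your API base URL and retriever id

    resp = requests.post(
        ENDPOINT,
        json={
            "inputs": {
                "text": "but wearing blue",
                "doc_ref": {
                    "collection_id": "<COLLECTION_ID>",
                    "document_id": "<DOCUMENT_ID>",
                },
            }
        },
    )
    print(resp.json())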
    

    Response includes per-stage timing so you can see where latency lives.

    {
      "stage_name": "feature_search",
      "num_features": 3,
      "fusion_strategy": "rrf",
      "total_results": 250,
      "duration_ms": 899.33,
      "cache_hit": false
    }

    https://docs.mixpeek.com/retrieval/retrievers

    Mixpeek caches retriever results by hashing the normalized inputs and pipeline configuration, reusing full or stage-level outputs on repeat queries. That cuts latency and inference cost while respecting TTLs and invalidation rules. About caching
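
    Conceptually, the cache key is a stable hash of "what was asked" plus "how the pipeline is configured". A rough sketch of that idea (not Mixpeek's actual implementation):

    # Conceptual sketch: hash normalized inputs together with the pipeline config,
    # so a change to either one produces a different cache key.
    import hashlib
    import json

    def cache_key(inputs: dict, pipeline_config: dict) -> str:
        normalized = {
            "inputs": {k: v for k, v in sorted(inputs.items()) if v not in (None, "", [])},
            "config": pipeline_config,
        }
        payload = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    key = cache_key({"text": "dog outside"}, {"stage_id": "feature_search", "fusion": "rrf"})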

    Numbers

    Metric                   Value
    Images indexed           120,000+
    Processing time          ~2 hours
    Embedding dimensions     768
    Vector data size         ~350MB
    Query latency (p95)      <800ms
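
    Quick sanity check on the vector footprint: 120,000 vectors × 768 dimensions × 4 bytes (assuming float32) ≈ 369 MB (≈ 352 MiB), which lines up with the ~350MB figure before any index overhead.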

    Same pattern works for product catalogs, media DAMs, real estate photos, medical imaging. Swap the data source, keep the architecture!