Multimodal Image Search with SigLIP and RRF
Search 120K pieces of art by text, image, or both. How we built a multimodal retriever with SigLIP and reciprocal rank fusion.

We built a visual search engine for the National Gallery of Art's public collection. Text search, image search, and hybrid queries with reciprocal rank fusion.
The Stack
- Feature Extractor: Google SigLIP (768-dim vectors)
- Stack: FastAPI -> Celery -> Ray -> Qdrant
- Resources: 2× NVIDIA L4, 32GB on GCP via Anyscale
Why SigLIP over CLIP
CLIP uses a softmax loss: it optimizes for relative ranking within a batch. SigLIP uses a sigmoid loss, treating each image-text pair as an independent binary classification.
Practical difference: SigLIP embeddings live in a global semantic space. Similarity scores are consistent whether you're comparing 10 documents or 10 million. Better for retrieval at scale.
The base model, siglip-base-patch16-224, hits roughly 76% zero-shot top-1 on ImageNet. Good enough out of the box: no fine-tuning needed for general visual similarity.
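To make that concrete, here's a minimal sketch of pulling those 768-dim embeddings with the Hugging Face transformers checkpoint; the image path and query string are placeholders, and our production extractor wraps this inside the pipeline described below.

```python
# Minimal SigLIP embedding sketch (placeholder file name and query string).
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

MODEL_ID = "google/siglip-base-patch16-224"
model = AutoModel.from_pretrained(MODEL_ID).eval()
processor = AutoProcessor.from_pretrained(MODEL_ID)

image = Image.open("portrait.jpg").convert("RGB")

with torch.no_grad():
    # Vision tower: one 768-dim vector per image
    image_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
    # Text tower: same 768-dim space, so text queries compare directly to images
    text_inputs = processor(text=["portrait of a scientist"],
                            padding="max_length", return_tensors="pt")
    text_emb = model.get_text_features(**text_inputs)

# Cosine similarity is the retrieval score
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print((text_emb @ image_emb.T).item())
```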
Pipeline

The collection config:
{
  "feature_extractor": {
    "feature_extractor_name": "image_extractor",
    "version": "v1",
    "input_mappings": { "image": "image" },
    "parameters": {
      "model": "siglip-base-patch16-224",
      "generate_thumbnail": true
    }
  }
}
Thumbnails are generated and pushed to CloudFront. The entire 120K-image batch runs as a dedicated Ray job (the feature extractor) to fully saturate our GPUs and CPUs, with Anyscale orchestrating the cluster on GCP and scaling it up and down as needed.
The output is then stored in a collection as documents, ready for retrieval.
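As a rough illustration of that batch job (not the actual extractor code: the bucket paths, batch size, and map_batches arguments are assumptions, and exact arguments vary across Ray versions):

```python
# Illustrative Ray Data batch-embedding job; paths and tuning values are placeholders.
import numpy as np
import ray
import torch
from transformers import AutoModel, AutoProcessor

MODEL_ID = "google/siglip-base-patch16-224"

class EmbedImages:
    def __init__(self):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = AutoModel.from_pretrained(MODEL_ID).to(self.device).eval()
        self.processor = AutoProcessor.from_pretrained(MODEL_ID)

    def __call__(self, batch: dict) -> dict:
        inputs = self.processor(images=list(batch["image"]),
                                return_tensors="pt").to(self.device)
        with torch.no_grad():
            emb = self.model.get_image_features(**inputs)
        batch["embedding"] = emb.cpu().numpy().astype(np.float32)
        return batch

ds = ray.data.read_images("gs://<bucket>/images/", include_paths=True)
ds = ds.map_batches(EmbedImages, batch_size=64, num_gpus=1, concurrency=2)  # one actor per L4
ds.write_parquet("gs://<bucket>/embeddings/")  # then upserted into Qdrant
```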
The Retriever
When we create the retriever, we wire inputs into the stages with standard Jinja templating, which gives us three query types in a single feature_search stage:
{
  "stage_id": "feature_search",
  "parameters": {
    "searches": [
      {
        "feature_uri": "mixpeek://image_extractor@v1/google_siglip_base_v1",
        "on_empty": "skip",
        "query": {
          "input_mode": "text",
          "value": "{{INPUT.text}}"
        },
        "top_k": 250
      },
      {
        "feature_uri": "mixpeek://image_extractor@v1/google_siglip_base_v1",
        "on_empty": "skip",
        "query": {
          "input_mode": "content",
          "value": "{{INPUT.image}}"
        },
        "top_k": 250
      },
      {
        "feature_uri": "mixpeek://image_extractor@v1/google_siglip_base_v1",
        "on_empty": "skip",
        "query": {
          "input_mode": "document",
          "document_ref": {
            "collection_id": "{{INPUT.doc_ref.collection_id}}",
            "document_id": "{{INPUT.doc_ref.document_id}}"
          }
        },
        "top_k": 250
      }
    ],
    "final_top_k": 500,
    "fusion": "rrf"
  }
}

- Text: "portrait of a scientist" → text encoder → kNN
- Image: upload reference → vision encoder → kNN
- Document reference: look up stored embedding → kNN (the "find similar" button)
feature_uri maps each query to the right index, embedding model, extractor, and version. The feature search then calls a hot Ray Serve deployment with fractional GPU allocation.
"on_empty": "skip" means you pass whatever inputs you have. One query type? That search runs alone. Multiple? They're fused with RRF using default weights.
We refer to this architecture as an Exploratory Multimodal Retriever: a single retrieval pipeline that accepts optional text, image, or document-reference inputs and produces a navigable similarity space.
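The document-reference mode is the one worth spelling out: it never touches the model, it just reuses the embedding already stored for that document. A hedged sketch of that lookup against a Qdrant collection (endpoint, collection name, and document id are placeholders; this shows the concept, not Mixpeek's internals):

```python
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")  # placeholder endpoint
COLLECTION = "nga_siglip_v1"                        # placeholder collection name

# Text and image modes embed the query with SigLIP (see the sketch above);
# the document-reference mode skips the encoder and reuses the stored vector.
ref = client.retrieve(COLLECTION, ids=["<DOCUMENT_ID>"], with_vectors=True)[0]

hits = client.search(
    collection_name=COLLECTION,
    query_vector=ref.vector,
    limit=250,  # matches top_k in the stage config
)
```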
RRF for Hybrid Search
Reciprocal Rank Fusion merges ranked lists without caring about raw scores:
score(d) = Σ_i 1/(k + rank_i(d)), where rank_i(d) is d's position in the i-th result list
Why this matters: text-to-image similarity might cluster in [0.2, 0.4] while image-to-image clusters in [0.6, 0.9]. Score-based fusion would be biased. RRF normalizes by rank.
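A toy implementation (k=60 is the conventional constant; the default per-search weights mentioned above are omitted here):

```python
from collections import defaultdict

def rrf(ranked_lists, k=60):
    """Fuse ranked lists of document ids; a doc's score grows by 1/(k + rank) per list."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

text_hits = ["doc_7", "doc_2", "doc_9"]    # ranked by text-to-image similarity
image_hits = ["doc_2", "doc_4", "doc_7"]   # ranked by image-to-image similarity
print(rrf([text_hits, image_hits]))        # doc_2 and doc_7 surface first
```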
Learn more about RRF in our hybrid search university module.
The killer query: pass a document_id (reference portrait) + query_text ("but wearing blue"). RRF combines structural similarity with the color constraint.
Execution
Since this is a named retriever, you only provide the inputs when you call it; the mapping happens at execution time.
curl -X POST ".../retrievers/<RETRIEVER_ID>/execute" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": {
      "text": "dog outside"
    }
  }'
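Hybrid inputs go through the same endpoint. The "killer query" from the RRF section, sketched with Python's requests library (base URL, auth, and ids are placeholders; input names follow the retriever config above):

```python
import requests

BASE_URL = "https://<YOUR_MIXPEEK_HOST>"  # placeholder; same endpoint as the curl above

resp = requests.post(
    f"{BASE_URL}/retrievers/<RETRIEVER_ID>/execute",
    # auth headers omitted; add whatever your deployment requires
    json={
        "inputs": {
            "text": "but wearing blue",
            "doc_ref": {
                "collection_id": "<COLLECTION_ID>",
                "document_id": "<DOCUMENT_ID>",
            },
        }
    },
)
print(resp.json())
```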
Response includes per-stage timing so you can see where latency lives.
{
  "stage_name": "feature_search",
  "num_features": 3,
  "fusion_strategy": "rrf",
  "total_results": 250,
  "duration_ms": 899.33,
  "cache_hit": false
}

More on retriever execution: https://docs.mixpeek.com/retrieval/retrievers
Mixpeek caches retriever results by hashing normalized inputs and pipeline configuration, reusing full or stage-level outputs on repeat queries to cut latency and inference cost while respecting TTLs and invalidation rules.
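The gist of that cache key, as a toy sketch (not Mixpeek's actual implementation):

```python
import hashlib
import json

def cache_key(retriever_id: str, inputs: dict, pipeline_config: dict) -> str:
    # Canonical JSON (sorted keys) so identical queries hash identically;
    # including the config means any pipeline change invalidates old entries.
    payload = {"retriever": retriever_id, "inputs": inputs, "config": pipeline_config}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

key = cache_key("ret_nga", {"text": "dog outside"},
                {"fusion": "rrf", "final_top_k": 500})
```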
Numbers
| Metric | Value |
|---|---|
| Images indexed | 120,000+ |
| Processing time | ~2 hours |
| Embedding dimensions | 768 |
| Vector data size | ~350MB |
| Query latency | <800ms p95 |
Same pattern works for product catalogs, media DAMs, real estate photos, medical imaging. Swap the data source, keep the architecture!
