
    Multimodal Image Search with SigLIP and RRF

    Search 120K pieces of art by text, image, or both. How we built a multimodal retriever with SigLIP and reciprocal rank fusion.


    We built a visual search engine for the National Gallery of Art's public collection. Text search, image search, and hybrid queries with reciprocal rank fusion.

    Demo | Code | Video


    The Stack

    πŸ’‘
    Indexing throughput: ~2 hours for 120K images (~60ms/image)

    Why SigLIP over CLIP

    CLIP uses softmax loss: it optimizes for relative ranking within a batch. SigLIP uses sigmoid loss, treating each image-text pair as an independent binary classification.

    Practical difference: SigLIP embeddings live in a global semantic space. Similarity scores are consistent whether you're comparing 10 documents or 10 million. Better for retrieval at scale.

    The base model, siglip-base-patch16-224, hits ~84% zero-shot on ImageNet. Good enough out of the box, no fine-tuning needed for general visual similarity.
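
    To make the shared embedding space concrete, here is a minimal sketch of embedding a text query and an image with the Hugging Face checkpoint (assuming google/siglip-base-patch16-224 is the hosted version of the model above). It's illustrative, not the Mixpeek extractor itself.

    # Minimal sketch: text and image land in the same 768-d space, so one
    # cosine similarity covers text-to-image and image-to-image search.
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoProcessor

    MODEL_ID = "google/siglip-base-patch16-224"
    model = AutoModel.from_pretrained(MODEL_ID)
    processor = AutoProcessor.from_pretrained(MODEL_ID)

    # SigLIP was trained with max-length padding, so keep it for text.
    text_inputs = processor(text=["portrait of a scientist"],
                            padding="max_length", return_tensors="pt")
    image_inputs = processor(images=Image.open("painting.jpg"),  # any image from the collection
                             return_tensors="pt")

    with torch.no_grad():
        text_emb = model.get_text_features(**text_inputs)     # shape (1, 768)
        image_emb = model.get_image_features(**image_inputs)  # shape (1, 768)

    # L2-normalize so cosine similarity is a plain dot product.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    print((text_emb @ image_emb.T).item())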

    Pipeline

    The collection config:

    {
      "feature_extractor": {
        "feature_extractor_name": "image_extractor",
        "version": "v1",
        "input_mappings": { "image": "image" },
        "parameters": {
          "model": "siglip-base-patch16-224",
          "generate_thumbnail": true
        }
      }
    }
    

    Thumbnails are generated and pushed to CloudFront. The entire 120K-image batch runs as a feature extractor in a dedicated Ray job so it can fully saturate our GPUs and CPUs, orchestrated by Anyscale on GCP, which scales the cluster up and down as needed.
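
    Conceptually, the batch job looks something like the sketch below. It's illustrative only (assuming Ray Data 2.9+ and made-up bucket paths), not the managed image_extractor itself.

    # Illustrative batch-embedding sketch with Ray Data: class-based workers,
    # fractional GPUs per worker, scaled by the cluster. Paths and sizes are made up.
    import ray
    import torch
    from transformers import AutoModel, AutoProcessor

    class SiglipEmbedder:
        def __init__(self):
            self.model = AutoModel.from_pretrained("google/siglip-base-patch16-224").cuda().eval()
            self.processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

        def __call__(self, batch: dict) -> dict:
            inputs = self.processor(images=list(batch["image"]), return_tensors="pt").to("cuda")
            with torch.no_grad():
                batch["embedding"] = self.model.get_image_features(**inputs).cpu().numpy()
            return batch

    ds = ray.data.read_images("gs://nga-open-images/", size=(224, 224))  # hypothetical bucket
    ds = ds.map_batches(SiglipEmbedder, batch_size=64,
                        num_gpus=0.25, concurrency=16)                   # fractional GPUs per worker
    ds.write_parquet("gs://nga-embeddings/")                             # hypothetical output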

    πŸ’‘
    Feature extractors are fully-managed indexing pipelines composed of models, workflows and code to deliver SOTA retrieval for any filetype at terabyte scale.

    The output is then stored in a collection as documents, ready for retrieval.

    The Retriever

    When we create the retriever, we wire inputs into its stages with standard Jinja templating, which gives us three query types in a single feature_search stage:

    {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          {
            "feature_uri": "mixpeek://image_extractor@v1/google_siglip_base_v1",
            "on_empty": "skip",
            "query": {
              "input_mode": "text",
              "value": "{{INPUT.text}}"
            },
            "top_k": 250
          },
          {
            "feature_uri": "mixpeek://image_extractor@v1/google_siglip_base_v1",
            "on_empty": "skip",
            "query": {
              "input_mode": "content",
              "value": "{{INPUT.image}}"
            },
            "top_k": 250
          },
          {
            "feature_uri": "mixpeek://image_extractor@v1/google_siglip_base_v1",
            "on_empty": "skip",
            "query": {
              "input_mode": "document",
              "document_ref": {
                "collection_id": "{{INPUT.doc_ref.collection_id}}",
                "document_id": "{{INPUT.doc_ref.document_id}}"
              }
            },
            "top_k": 250
          }
        ],
        "final_top_k": 500,
        "fusion": "rrf"
      }
    }
    • Text: "portrait of a scientist" → text encoder → kNN
    • Image: upload a reference image → vision encoder → kNN
    • Document reference: look up the stored embedding → kNN (the "find similar" button)

    The feature_uri maps the query to the right index, embedding model, extractor, and version. The feature_search stage then calls a hot Ray Serve deployment with fractional GPU allocation.
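
    As a rough sketch of what a hot, fractional-GPU encoder deployment can look like in Ray Serve (illustrative; the deployment name and payload shape are assumptions, not Mixpeek's internal service):

    # Rough sketch of a fractional-GPU Ray Serve deployment for query-time encoding.
    import torch
    from ray import serve
    from transformers import AutoModel, AutoProcessor

    @serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 0.25})
    class QueryEncoder:
        def __init__(self):
            self.model = AutoModel.from_pretrained("google/siglip-base-patch16-224").cuda().eval()
            self.processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

        async def __call__(self, request):
            text = (await request.json())["text"]
            inputs = self.processor(text=[text], padding="max_length",
                                    return_tensors="pt").to("cuda")
            with torch.no_grad():
                emb = self.model.get_text_features(**inputs)
            return {"embedding": emb[0].cpu().tolist()}

    app = QueryEncoder.bind()  # deploy with serve.run(app)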

    πŸ’‘
    Learn more about the feature search stage

    "on_empty": "skip" means you pass whatever inputs you have. One query type? It runs just that search. Multiple? They get fused with RRF using default weights.

    We refer to this architecture as an Exploratory Multimodal Retriever: a single retrieval pipeline that accepts optional text, image, or document-reference inputs and produces a navigable similarity space.

    Reciprocal Rank Fusion merges ranked lists without caring about raw scores:

    score(d) = Σ_i 1/(k + rank_i(d))

    where rank_i(d) is d's position in ranked list i and k is a small constant (commonly 60).
    

    Why this matters: text-to-image similarity might cluster in [0.2, 0.4] while image-to-image clusters in [0.6, 0.9]. Score-based fusion would be biased. RRF normalizes by rank.
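
    A minimal sketch of the fusion step (generic RRF over per-search result lists, not Mixpeek's internal code):

    # Minimal RRF sketch: merge ranked lists using positions only, ignoring raw scores.
    from collections import defaultdict

    def rrf(ranked_lists, k=60, top_n=None):
        """ranked_lists: iterable of lists of doc ids, best first."""
        scores = defaultdict(float)
        for results in ranked_lists:
            for rank, doc_id in enumerate(results, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        fused = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return fused[:top_n] if top_n else fused

    # Text-to-image and image-to-image scores live in different ranges,
    # but ranks are always comparable:
    text_hits  = ["art_17", "art_03", "art_88"]
    image_hits = ["art_03", "art_42", "art_17"]
    print(rrf([text_hits, image_hits], top_n=3))
    # art_03 and art_17 rise to the top because both lists agree on them.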

    Learn more about RRF in our hybrid search university module.

    The killer query: pass a doc_ref (a reference portrait) plus text ("but wearing blue"). RRF combines structural similarity with the color constraint.

    Execution

    Since this is a named retriever, you only provide the inputs when you call it; the mapping happens at execution time.

    curl -X POST ".../retrievers/<RETRIEVER_ID>/execute" \
      -d '{
        "inputs": {
          "text": "dog outside"
        }
      }'
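
    The hybrid "killer query" from earlier is the same call with both inputs populated. A hedged sketch with Python requests, using the input names from the retriever config (endpoint and auth elided, as in the curl above):

    # Hedged sketch: one call combining a document reference and a text constraint.
    import requests

    ENDPOINT = ".../retrievers/<RETRIEVER_ID>/execute"  # fill in your API base URL and retriever id

    resp = requests.post(
        ENDPOINT,
        json={
            "inputs": {
                "text": "but wearing blue",
                "doc_ref": {
                    "collection_id": "<COLLECTION_ID>",
                    "document_id": "<DOCUMENT_ID>",
                },
            }
        },
    )
    print(resp.json())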
    

    Response includes per-stage timing so you can see where latency lives.

    {
      "stage_name": "feature_search",
      "num_features": 3,
      "fusion_strategy": "rrf",
      "total_results": 250,
      "duration_ms": 899.33,
      "cache_hit": false
    }

    https://docs.mixpeek.com/retrieval/retrievers

    Mixpeek caches retriever results by hashing the normalized inputs and pipeline configuration, reusing full or stage-level outputs on repeat queries. That cuts latency and inference cost while respecting TTLs and invalidation rules. About caching
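
    Conceptually, the cache key is a stable hash of "what was asked" plus "how the pipeline is configured". A rough sketch of that idea (not Mixpeek's actual implementation):

    # Conceptual sketch: hash normalized inputs together with the pipeline config,
    # so a change to either one produces a different cache key.
    import hashlib
    import json

    def cache_key(inputs: dict, pipeline_config: dict) -> str:
        normalized = {
            "inputs": {k: v for k, v in sorted(inputs.items()) if v not in (None, "", [])},
            "config": pipeline_config,
        }
        payload = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    key = cache_key({"text": "dog outside"}, {"stage_id": "feature_search", "fusion": "rrf"})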

    Numbers

    Metric                   Value
    Images indexed           120,000+
    Processing time          ~2 hours
    Embedding dimensions     768
    Vector data size         ~350MB
    Query latency (p95)      <800ms
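
    Quick sanity check on the vector footprint: 120,000 vectors × 768 dimensions × 4 bytes (assuming float32) ≈ 369 MB (≈ 352 MiB), which lines up with the ~350MB figure before any index overhead.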

    Same pattern works for product catalogs, media DAMs, real estate photos, medical imaging. Swap the data source, keep the architecture!