    HF · Visual Embeddings · MIT

    colqwen-omni-v0.1

    by vidore

    Omnimodal ColBERT retrieval for documents, audio, and video search

    781 downloads/month
    ~3B params
    Identifiers
    Model ID: vidore/colqwen-omni-v0.1
    Feature URI: mixpeek://image_extractor@v1/vidore_colqwen_omni_v1

    Overview

    ColQwen Omni extends the ColPali paradigm to all modalities — documents, audio, and video — using ColBERT-style multi-vector representations built on Qwen2.5-Omni-3B. Unlike dense single-vector models, multi-vector retrieval preserves fine-grained token-level matching, delivering higher precision on complex queries.
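To make the token-level matching concrete, here is a minimal sketch of ColBERT-style late-interaction (MaxSim) scoring. It is an illustration under stated assumptions, not the model's or Mixpeek's actual implementation: the function names are hypothetical, vectors are assumed to be L2-normalized, and the toy vectors have 2 dimensions instead of the 128 the model emits per token.

```typescript
// Late-interaction (MaxSim) scoring sketch.
// ASSUMPTION: vectors are L2-normalized, so dot product equals cosine similarity.
type Vec = number[];

function dot(a: Vec, b: Vec): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// For each query token, keep its best-matching document token, then sum the maxima.
function maxSimScore(queryTokens: Vec[], docTokens: Vec[]): number {
  if (docTokens.length === 0) return 0;
  let score = 0;
  for (const q of queryTokens) {
    let best = -Infinity;
    for (const d of docTokens) best = Math.max(best, dot(q, d));
    score += best;
  }
  return score;
}

// Toy data: two query tokens, two candidate "pages".
const query: Vec[] = [[1, 0], [0, 1]];
const pageA: Vec[] = [[1, 0], [0, 1], [0.6, 0.8]]; // has an exact match for each query token
const pageB: Vec[] = [[0.6, 0.8]];                 // one token that partially matches both

console.log(maxSimScore(query, pageA)); // 2
console.log(maxSimScore(query, pageB)); // approximately 1.4
```

A single pooled vector would have to blend both query tokens into one direction; the per-token maxima let pageA score higher because each query token finds its own exact match.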

    On Mixpeek, ColQwen Omni powers late-interaction retrieval across document pages, audio recordings, and video content. Its zero-shot audio retrieval (no audio training data needed) makes it especially useful for indexing podcasts, meetings, and lecture recordings alongside visual content.

    Architecture

    Qwen2.5-Omni-3B fine-tuned for ColBERT-style multi-vector output, with dynamic image resolution (max 1024 patches). The audio and video towers were frozen during training, so audio retrieval is zero-shot. Trained with colpali-engine 0.3.11 on 127K query-page pairs.

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your own Mixpeek API key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest one audio file and embed it with ColQwen Omni.
    await mx.collections.ingest({
      collection_id: "mixed-media",
      source: { url: "https://example.com/podcast.mp3" },
      feature_extractors: [{
        feature: "multimodal_embedding",
        model: "vidore/colqwen-omni-v0.1"
      }]
    });
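For a mixed archive, the same request shape can be generated once per source. The sketch below is only an illustration: `buildIngestRequests` and the `IngestRequest` type are hypothetical helpers, not part of the Mixpeek SDK, and the URLs are placeholders. It assumes the request fields are exactly those shown in the snippet above.

```typescript
// Hypothetical helper (not part of the Mixpeek SDK): builds one request object
// per source URL, mirroring the fields used in the snippet above.
interface IngestRequest {
  collection_id: string;
  source: { url: string };
  feature_extractors: { feature: string; model: string }[];
}

const MODEL_ID = "vidore/colqwen-omni-v0.1";

function buildIngestRequests(collectionId: string, urls: string[]): IngestRequest[] {
  return urls.map((url) => ({
    collection_id: collectionId,
    source: { url },
    feature_extractors: [{ feature: "multimodal_embedding", model: MODEL_ID }],
  }));
}

// Placeholder URLs, for illustration only.
const requests = buildIngestRequests("mixed-media", [
  "https://example.com/podcast.mp3",
  "https://example.com/report.pdf",
  "https://example.com/lecture.mp4",
]);

console.log(requests.length); // 3
```

Each element could then be passed to `mx.collections.ingest(...)` as in the snippet above. Keeping payload construction in a pure function makes it easy to inspect or test the requests before anything is sent.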

    Capabilities

    • ColBERT-style multi-vector retrieval across all modalities
    • Zero-shot audio retrieval without audio training data
    • Dynamic image resolution up to 1024 patches
    • 30-minute podcast embedded in under 10 seconds
    • Fine-grained token-level matching for complex queries

    Use Cases on Mixpeek

    Document retrieval: find specific pages in scanned PDFs by content description
    Podcast search: query spoken content without pre-transcribing audio
    Video moment retrieval: locate specific scenes using natural language
    Multi-format archives: search across mixed document, audio, and video collections

    Benchmarks

    Dataset                  Metric   Score   Source
    ViDoRe V1 (visual doc)   nDCG@5   ~90%    Vidore Blog, 2025
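For readers unfamiliar with the metric, here is a rough sketch of how nDCG@5 can be computed for a single query. It assumes the common linear-gain form, DCG@k = sum of rel_i / log2(i + 1) over ranks i = 1..k, and assumes the ranked list contains every relevant item, so the ideal ordering can be derived by sorting it. The function names are illustrative, and the benchmark's own evaluation code may differ in detail.

```typescript
// DCG@k with linear gain. `i` is 0-based here, so rank = i + 1 and the
// discount is log2(rank + 1) = log2(i + 2).
function dcgAtK(relevances: number[], k: number): number {
  return relevances
    .slice(0, k)
    .reduce((sum, rel, i) => sum + rel / Math.log2(i + 2), 0);
}

// nDCG@k = DCG@k of the actual ranking divided by DCG@k of the ideal ranking.
// ASSUMPTION: `rankedRelevances` includes all relevant items for the query.
function ndcgAtK(rankedRelevances: number[], k: number): number {
  const ideal = [...rankedRelevances].sort((a, b) => b - a);
  const idcg = dcgAtK(ideal, k);
  return idcg === 0 ? 0 : dcgAtK(rankedRelevances, k) / idcg;
}

// One relevant page, retrieved at rank 1 versus rank 2.
console.log(ndcgAtK([1, 0, 0, 0, 0], 5)); // 1
console.log(ndcgAtK([0, 1, 0, 0, 0], 5)); // 1 / log2(3), roughly 0.63
```

A benchmark score is the mean of this per-query value over all queries in the dataset.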

    Performance

    Input Size: Variable (doc pages / audio / video)
    Embedding Dim: 128 per token (multi-vector)
    GPU Latency: ~25 ms / page (A100)
    GPU Throughput: ~40 pages/sec (A100)
    GPU Memory: ~8 GB
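The figures above support some quick back-of-envelope estimates. The sketch below is only that: it assumes pages are processed one at a time, that embeddings are stored as 16-bit floats, and that a page uses the full 1024-patch cap mentioned under Architecture (ignoring any special tokens). The function names are illustrative.

```typescript
// Figures quoted on this page.
const LATENCY_MS_PER_PAGE = 25; // ~25 ms / page (A100)
const EMBEDDING_DIM = 128;      // dimensions per token
const MAX_PATCHES = 1024;       // dynamic-resolution cap
// ASSUMPTION: embeddings stored as fp16 (2 bytes per value).
const BYTES_PER_VALUE = 2;

// Sequential throughput implied by a per-page latency.
function pagesPerSecond(latencyMs: number): number {
  return 1000 / latencyMs;
}

// Upper bound on embedding storage for one page.
function maxBytesPerPage(tokens: number, dim: number, bytesPerValue: number): number {
  return tokens * dim * bytesPerValue;
}

// Time to embed a corpus when pages are processed one after another.
function indexingSeconds(pages: number, latencyMs: number): number {
  return (pages * latencyMs) / 1000;
}

console.log(pagesPerSecond(LATENCY_MS_PER_PAGE));                           // 40
console.log(maxBytesPerPage(MAX_PATCHES, EMBEDDING_DIM, BYTES_PER_VALUE));  // 262144 (256 KiB)
console.log(indexingSeconds(10_000, LATENCY_MS_PER_PAGE));                  // 250
```

The first result shows that the latency and throughput rows describe the same rate (1000 / 25 = 40). The second is a reminder that multi-vector indexes are much larger per item than single-vector ones, which is worth budgeting for.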

    Specification

    Framework: HF
    Organization: vidore
    Feature: Visual Embeddings
    Output: 128-dim vector per token (multi-vector)
    Modalities: image, audio, video
    Retriever: Vector Search
    Parameters: ~3B
    License: MIT
    Downloads/mo: 781

    Research Paper

    ColPali: Efficient Document Retrieval with Vision Language Models

    arxiv.org

    Build a pipeline with colqwen-omni-v0.1

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
