    HF · Visual Embeddings · MIT

    BGE-VL-v1.5-zs

    by BAAI

    Zero-shot multimodal retrieval from BAAI's MegaPairs-trained BGE-VL family

    41 downloads/month
    9 likes
    7.6B parameters
    Identifiers

    Model ID: BAAI/BGE-VL-v1.5-zs
    Feature URI: mixpeek://image_extractor@v1/baai_bge_vl_15_zs_v1

    Overview

    BGE-VL v1.5 ZS is a zero-shot vision-language embedding model trained for universal multimodal retrieval. The BGE-VL family uses MegaPairs, a large synthetic triplet dataset for image, text, and composed image retrieval, to improve retrieval generalization beyond standard CLIP-style contrastive pairs.

    On Mixpeek, BGE-VL v1.5 ZS is useful when agents need instruction-style visual retrieval over screenshots, product images, documents, and video frames. It can retrieve by text, image, or combined text-plus-image intent before a heavier VLM reads the selected evidence.

    Architecture

    BGE-VL v1.5 ZS is a Sentence Transformers-compatible multimodal embedding model built on a LLaVA-NeXT-style vision-language backbone. It maps text, image, and composed text-image inputs into a shared retrieval space and supports task prompts for query formatting.
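
    Because text, image, and composed inputs land in one shared space, retrieval reduces to nearest-neighbor search over vectors. The sketch below ranks a few stored vectors against a query by cosine similarity. The 3-dimensional vectors, the item IDs, and the helper names cosineSimilarity and rankBySimilarity are illustrative assumptions, not model output and not part of any Mixpeek or BAAI API.

```typescript
// Toy illustration of retrieval in a shared embedding space.
// All vectors are hand-written placeholders, not real BGE-VL embeddings.

type Embedded = { id: string; vector: number[] };

// Cosine similarity: dot product divided by the product of the vector norms.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat a zero vector as having no similarity to anything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every stored item against the query and sort best-first.
function rankBySimilarity(
  query: number[],
  items: Embedded[],
): { id: string; score: number }[] {
  return items
    .map((item) => ({ id: item.id, score: cosineSimilarity(query, item.vector) }))
    .sort((x, y) => y.score - x.score);
}

// Hypothetical index: in practice these vectors would come from the model.
const index: Embedded[] = [
  { id: "screenshot-login", vector: [0.9, 0.1, 0.0] },
  { id: "product-shoe", vector: [0.0, 1.0, 0.0] },
  { id: "doc-page-3", vector: [0.1, 0.0, 0.9] },
];

// Hypothetical query vector (text, image, or composed query after embedding).
const queryVector = [1.0, 0.2, 0.0];
const ranked = rankBySimilarity(queryVector, index);
console.log(ranked.map((r) => r.id));
```

    At production scale a vector database replaces the linear scan, but the scoring idea is the same.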

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your own key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest a bucket and extract visual embeddings with BGE-VL-v1.5-zs.
    await mx.collections.ingest({
      collection_id: "visual-evidence",
      source: { url: "s3://visual-evidence/" },
      feature_extractors: [{
        feature: "visual_embeddings",
        model: "BAAI/BGE-VL-v1.5-zs"
      }]
    });

    Capabilities

    • Zero-shot text-image and composed image retrieval
    • Instruction-style prompts for query embeddings
    • Sentence Transformers integration
    • MIT license

    Use Cases on Mixpeek

    • Search screenshots by visual state plus task intent
    • Find product images from composed queries, such as a reference image plus a constraint
    • Retrieve visual document pages before OCR or VLM extraction
    • Add stronger visual retrieval to agent perception pipelines
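
    Several of these use cases share a retrieve-then-read shape: a cheap vector search shortlists candidates, and only the shortlist goes to the expensive step (OCR or a VLM). A minimal sketch of that control flow follows. The Scored type, shortlist, retrieveThenRead, and the stub reader are hypothetical names for illustration, and the scores are made up; a real pipeline would take them from vector search.

```typescript
// Sketch of a two-stage pipeline: shortlist by score, then run a costly reader.
// Names and scores are hypothetical; no real retrieval or VLM call is made.

type Scored = { id: string; score: number };

// Keep the k highest-scoring candidates, dropping any below minScore.
function shortlist(candidates: Scored[], k: number, minScore = 0): Scored[] {
  return [...candidates] // copy so the caller's array is not reordered
    .filter((c) => c.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Run the expensive reader only on the shortlisted items, in rank order.
async function retrieveThenRead<T>(
  candidates: Scored[],
  k: number,
  read: (id: string) => Promise<T>,
): Promise<{ id: string; result: T }[]> {
  const selected = shortlist(candidates, k);
  const out: { id: string; result: T }[] = [];
  for (const c of selected) {
    out.push({ id: c.id, result: await read(c.id) });
  }
  return out;
}

// Stand-in for OCR or a VLM call; it only records which IDs it was asked to read.
const readCalls: string[] = [];
async function stubReader(id: string): Promise<string> {
  readCalls.push(id);
  return `extracted text for ${id}`;
}

const candidates: Scored[] = [
  { id: "page-7", score: 0.81 },
  { id: "page-12", score: 0.93 },
  { id: "page-3", score: 0.4 },
  { id: "page-44", score: 0.12 },
];

retrieveThenRead(candidates, 2, stubReader).then((results) => {
  console.log(results.map((r) => r.id));
});
```

    The reader runs at most k times regardless of collection size, which is what keeps the heavier model off the critical path for most items.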

    Performance

    Input Size: Text, image, or text-plus-image query
    Embedding Dim: Model dependent
    GPU Latency: Input dependent
    GPU Throughput: Batch images for best throughput
    GPU Memory: ~16 GB plus serving overhead
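
    The ~16 GB figure is consistent with 7.6B parameters stored at 2 bytes each (16-bit weights), and the throughput note suggests sending images in batches. The sketch below works through both. The 2-bytes-per-parameter assumption, the batch size of 8, and the helper names weightMemoryGB and chunk are illustrative; the page does not state the serving precision or a recommended batch size.

```typescript
// Back-of-the-envelope weight memory, plus a generic batching helper.
// Precision and batch size are assumptions, not documented values.

// Weight memory in decimal gigabytes (1 GB = 1e9 bytes).
function weightMemoryGB(params: number, bytesPerParam: number): number {
  return (params * bytesPerParam) / 1e9;
}

// Split a list into consecutive batches of at most batchSize items.
function chunk<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new Error("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// 7.6B parameters at an assumed 2 bytes each (16-bit weights).
const halfPrecisionGB = weightMemoryGB(7.6e9, 2);
console.log(`weights at 16-bit: ${halfPrecisionGB.toFixed(1)} GB`);

// Hypothetical image keys grouped into batches of 8 before embedding.
const imageKeys = Array.from({ length: 20 }, (_, i) => `frame-${i}.jpg`);
const batches = chunk(imageKeys, 8);
console.log(batches.map((b) => b.length));
```

    Activations, image tokens, and the serving runtime add to the weight footprint, which is why the listed figure includes "plus serving overhead".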

    Specification

    Framework: HF
    Organization: BAAI
    Feature: Visual Embeddings
    Output: 768-dim vector
    Modalities: video, image
    Retriever: Vector Search
    Parameters: 7.6B
    License: MIT
    Downloads/mo: 41
    Likes: 9

    Research Paper

    MegaPairs: Massive Data Synthesis for Universal Multimodal Retrieval

    arxiv.org

    Build a pipeline with BGE-VL-v1.5-zs

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
