BGE-VL-v1.5-zs

by BAAI

Zero-shot multimodal retrieval from BAAI's MegaPairs-trained BGE-VL family

41 downloads/month

9 likes

7.6B params


Identifiers

Model ID

BAAI/BGE-VL-v1.5-zs

Feature URI

mixpeek://image_extractor@v1/baai_bge_vl_15_zs_v1

Overview

BGE-VL v1.5 ZS is a zero-shot vision-language embedding model trained for universal multimodal retrieval. The BGE-VL family uses MegaPairs, a large synthetic triplet dataset for image, text, and composed image retrieval, to improve retrieval generalization beyond standard CLIP-style contrastive pairs.

On Mixpeek, BGE-VL v1.5 ZS is useful when agents need instruction-style visual retrieval over screenshots, product images, documents, and video frames. It can retrieve by text, image, or combined text-plus-image intent before a heavier VLM reads the selected evidence.
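
The retrieve-then-read pattern described above can be sketched as a top-k cosine-similarity search over precomputed embeddings. This is a minimal illustration, not Mixpeek's or BAAI's implementation: the `Embedded` type and the `cosine` and `topK` helpers are hypothetical names, and the vectors are assumed to have already been produced by the embedding model.

```typescript
// Hypothetical shape for a stored item: an id plus an embedding that is
// assumed to have been computed ahead of time by the embedding model.
type Embedded = { id: string; vector: number[] };

// Cosine similarity between two vectors of equal length.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  // Treat a zero vector as having no similarity to anything.
  return denom === 0 ? 0 : dot / denom;
}

// Score every item against the query embedding and keep the k best.
// The returned ids are the "selected evidence" a heavier VLM would read next.
function topK(
  query: number[],
  items: Embedded[],
  k: number
): { id: string; score: number }[] {
  return items
    .map((item) => ({ id: item.id, score: cosine(query, item.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Because text, image, and composed queries all land in the same embedding space, the same `topK` call would serve all three query types. A production index would normally replace the linear scan with an approximate nearest-neighbor search.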

Architecture

BGE-VL v1.5 ZS is a Sentence Transformers-compatible multimodal embedding model built on a LLaVA-NeXT-style vision-language backbone. It maps text, image, and composed text-image inputs into a shared retrieval space and supports task prompts for query formatting.
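
To make the three input modes and the task-prompt idea concrete, here is a small sketch of how a caller might model queries before sending them to an embedding service. Everything in it is an assumption for illustration: the `RetrievalQuery` and `FormattedQuery` types, the `formatQuery` helper, and the way the instruction is prepended are hypothetical, and the model's actual prompt template is not shown on this page.

```typescript
// Hypothetical query model covering the three supported input modes.
type RetrievalQuery =
  | { kind: "text"; text: string }
  | { kind: "image"; imageUrl: string }
  | { kind: "composed"; text: string; imageUrl: string };

// Hypothetical payload handed to whatever component computes the embedding.
interface FormattedQuery {
  prompt: string;
  imageUrl?: string;
}

// Prepend a task instruction to the textual part of a query.
// The "<instruction> <text>" layout is an illustrative choice only.
function formatQuery(q: RetrievalQuery, taskInstruction: string): FormattedQuery {
  const instruction = taskInstruction.trim();
  switch (q.kind) {
    case "text":
      return { prompt: `${instruction} ${q.text}`.trim() };
    case "image":
      // Image-only queries carry just the instruction as text.
      return { prompt: instruction, imageUrl: q.imageUrl };
    case "composed":
      // Composed queries pair a reference image with a textual modification.
      return { prompt: `${instruction} ${q.text}`.trim(), imageUrl: q.imageUrl };
  }
}
```

A composed query such as `{ kind: "composed", text: "same chart in dark mode", imageUrl: "https://example.com/a.png" }` keeps the reference image and the requested change together, which matches the text-plus-image intent described above.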

Mixpeek SDK Integration

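import { Mixpeek } from "mixpeek";

// "API_KEY" is a placeholder; substitute your own Mixpeek API key.
const mx = new Mixpeek({ apiKey: "API_KEY" });

// Ingest objects from the source bucket and embed each one with BGE-VL v1.5 ZS.
await mx.collections.ingest({
  collection_id: "visual-evidence",
  source: { url: "s3://visual-evidence/" },
  feature_extractors: [{
    feature: "visual_embeddings",
    model: "BAAI/BGE-VL-v1.5-zs"
  }]
});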