BGE-VL-v1.5-zs
by BAAI
Zero-shot multimodal retrieval from BAAI's MegaPairs-trained BGE-VL family
BAAI/BGE-VL-v1.5-zs
mixpeek://image_extractor@v1/baai_bge_vl_15_zs_v1
Overview
BGE-VL v1.5 ZS is a zero-shot vision-language embedding model trained for universal multimodal retrieval. The BGE-VL family uses MegaPairs, a large synthetic triplet dataset for image, text, and composed image retrieval, to improve retrieval generalization beyond standard CLIP-style contrastive pairs.
On Mixpeek, BGE-VL v1.5 ZS is useful when agents need instruction-style visual retrieval over screenshots, product images, documents, and video frames. It can retrieve by text, image, or combined text-plus-image intent before a heavier VLM reads the selected evidence.
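The retrieve-then-read flow described above can be sketched as follows. This is a minimal illustration, not Mixpeek's API: `retrieveThenRead`, `Retrieve`, `Read`, and the stub data are hypothetical names, and both stages are kept synchronous for brevity, whereas a real retriever and VLM call would be asynchronous.

```typescript
// Hypothetical shapes: a scored retrieval hit and the two pipeline stages.
type Hit = { id: string; score: number };
type Retrieve = (query: string) => Hit[];
type Read = (query: string, evidenceIds: string[]) => string;

// Stage 1 narrows the corpus with cheap embedding retrieval;
// stage 2 hands only the top-K items to the heavier reader.
function retrieveThenRead(
  query: string,
  retrieve: Retrieve,
  read: Read,
  topK: number
): { evidenceIds: string[]; answer: string } {
  const evidenceIds = [...retrieve(query)] // copy so the caller's array is not mutated
    .sort((a, b) => b.score - a.score)     // highest similarity first
    .slice(0, topK)
    .map((hit) => hit.id);
  return { evidenceIds, answer: read(query, evidenceIds) };
}

// Stubs standing in for an embedding search and a VLM call (assumed data).
const stubRetrieve: Retrieve = () => [
  { id: "frame-007", score: 0.41 },
  { id: "invoice-page-2", score: 0.93 },
  { id: "screenshot-checkout", score: 0.78 },
];
const stubRead: Read = (_query, evidenceIds) => `read ${evidenceIds.length} items`;

const result = retrieveThenRead("total on the invoice", stubRetrieve, stubRead, 2);
console.log(result.evidenceIds, result.answer);
```

Passing the two stages in as functions keeps the sketch independent of any particular SDK, so the retriever and reader can be swapped without touching the control flow.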
Architecture
BGE-VL v1.5 ZS is a Sentence Transformers-compatible multimodal embedding model built on a LLaVA-NeXT-style vision-language backbone. It maps text, image, and composed text-image inputs into a shared retrieval space and supports task prompts for query formatting.
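Because queries and candidates share one embedding space, retrieval reduces to a nearest-neighbor comparison between vectors. The sketch below ranks candidates by cosine similarity. It assumes the embeddings have already been produced elsewhere; the three-dimensional vectors, the ids, and the helper names `cosineSimilarity` and `rankBySimilarity` are illustrative only, and cosine similarity is an assumed choice of metric rather than something this page specifies.

```typescript
// Hypothetical container for an item and its precomputed embedding.
type Embedded = { id: string; vector: number[] };

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every candidate against the query and keep the topK best.
function rankBySimilarity(
  query: number[],
  candidates: Embedded[],
  topK: number
): { id: string; score: number }[] {
  return candidates
    .map((c) => ({ id: c.id, score: cosineSimilarity(query, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Toy vectors standing in for real model output (assumed values).
const candidates: Embedded[] = [
  { id: "screenshot-login", vector: [0.9, 0.1, 0.0] },
  { id: "product-shoe", vector: [0.1, 0.9, 0.2] },
  { id: "video-frame-42", vector: [0.0, 0.2, 0.9] },
];
const queryVector = [0.2, 0.8, 0.1]; // could come from text, image, or composed input

const top = rankBySimilarity(queryVector, candidates, 2);
console.log(top.map((t) => t.id));
```

The same ranking applies whether the query vector came from text, an image, or a composed text-plus-image input, since all three land in the same space.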
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "visual-evidence",
  source: { url: "s3://visual-evidence/" },
  feature_extractors: [
    {
      feature: "visual_embeddings",
      model: "BAAI/BGE-VL-v1.5-zs"
    }
  ]
});
Capabilities
- Zero-shot text-image and composed image retrieval
- Instruction-style prompts for query embeddings
- Sentence Transformers integration
- MIT license
Use Cases on Mixpeek
Performance
Specification
Research Paper
MegaPairs: Massive Data Synthesis for Universal Multimodal Retrieval
arxiv.org
Build a pipeline with BGE-VL-v1.5-zs
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.