bge-m3
by BAAI
Hybrid retrieval in one model -- dense, sparse, and ColBERT embeddings from a single forward pass
BAAI/bge-m3
mixpeek://text_extractor@v1/baai_bge_m3_v1
Overview
BGE-M3 is BAAI's multi-functionality embedding model that produces three types of embeddings simultaneously: dense vectors for semantic search, sparse vectors for lexical matching, and ColBERT-style multi-vector representations for fine-grained late interaction. This eliminates the need to run separate models for different retrieval strategies.
The model supports 100+ languages and handles up to 8192 tokens of input, making it suitable for long documents. On Mixpeek, BGE-M3 powers hybrid retrieval pipelines where a single ingest pass produces all three representation types, and the retriever fuses them at query time for higher recall than any single strategy alone.
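One common way to fuse several ranked result lists at query time is reciprocal rank fusion (RRF). The sketch below is only an illustration: the page does not say which fusion method Mixpeek uses, and the function name, the constant `k = 60`, and the document IDs are all assumptions.

```typescript
// Reciprocal rank fusion (RRF): merge several ranked lists into one.
// Assumption: illustrative only; not necessarily the fusion Mixpeek applies.

type RankedList = string[]; // document IDs, best match first

function reciprocalRankFusion(lists: RankedList[], k = 60): [string, number][] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((docId, index) => {
      // Ranks are 1-based, so the top hit contributes 1 / (k + 1)
      const contribution = 1 / (k + index + 1);
      scores.set(docId, (scores.get(docId) ?? 0) + contribution);
    });
  }
  // Highest fused score first
  return [...scores.entries()].sort((a, b) => b[1] - a[1]);
}

// Hypothetical per-strategy results for one query
const denseHits: RankedList = ["doc-a", "doc-b", "doc-c"];
const sparseHits: RankedList = ["doc-b", "doc-d", "doc-a"];
const colbertHits: RankedList = ["doc-b", "doc-a", "doc-e"];

const fused = reciprocalRankFusion([denseHits, sparseHits, colbertHits]);
console.log(fused.map(([id]) => id));
```

A document that ranks well in several lists (here `doc-b`) rises to the top even if no single strategy is decisive, which is the intuition behind fusing the three representations.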
Architecture
XLM-RoBERTa backbone, 568M parameters. Produces dense embeddings (1024d), sparse term-weight vectors, and ColBERT multi-vector representations from one forward pass. Trained with self-knowledge distillation across 100+ languages.
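To make the three representation types concrete, the sketch below scores a query against a document in each mode: an inner product for dense vectors, a sum of weight products over shared terms for sparse vectors, and a ColBERT-style MaxSim averaged over query tokens for multi-vectors. The function names and the toy vectors are assumptions for illustration, not model output or Mixpeek API calls.

```typescript
// Toy scoring functions for the three representation types.
// Assumption: all vectors below are made-up values, not real embeddings.

type SparseVector = Record<string, number>; // term -> weight

function dot(a: number[], b: number[]): number {
  return a.reduce((sum, value, i) => sum + value * b[i], 0);
}

// Dense: inner product of the two single-vector embeddings
function denseScore(query: number[], doc: number[]): number {
  return dot(query, doc);
}

// Sparse: sum of weight products over terms present in both query and document
function sparseScore(query: SparseVector, doc: SparseVector): number {
  let score = 0;
  for (const [term, weight] of Object.entries(query)) {
    if (term in doc) score += weight * doc[term];
  }
  return score;
}

// Multi-vector (MaxSim): each query token vector takes its best match among
// the document token vectors; the result is averaged over query tokens
function multiVectorScore(query: number[][], doc: number[][]): number {
  const total = query.reduce(
    (sum, q) => sum + Math.max(...doc.map((d) => dot(q, d))),
    0
  );
  return total / query.length;
}

const dense = denseScore([1, 0], [0.6, 0.8]);
const sparse = sparseScore(
  { retrieval: 0.5, hybrid: 0.2 },
  { retrieval: 0.4, model: 0.9 }
);
const multi = multiVectorScore(
  [[1, 0], [0, 1]],
  [[1, 0], [0.6, 0.8]]
);
console.log({ dense, sparse, multi });
```

Only the term `retrieval` appears on both sides of the sparse example, so it alone contributes to the lexical score; the dense and multi-vector scores depend on vector geometry instead.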
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/report.pdf" },
  feature_extractors: [
    {
      name: "text_embedding",
      version: "v1",
      params: { model_id: "BAAI/bge-m3" }
    }
  ]
});
Capabilities
- Dense, sparse, and ColBERT embeddings in one pass
- 100+ language support
- 8192 token context window
- Hybrid retrieval without multiple models
- Matryoshka dimension reduction
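Matryoshka-style reduction is usually applied by keeping the leading dimensions of a dense vector and re-normalizing. The sketch below assumes the 1024-d dense embedding is ordered that way; the helper name and the toy vector are hypothetical, and retrieval quality at a reduced size should be measured before relying on it.

```typescript
// Truncate a dense embedding to its first `dims` values and L2-normalize it.
// Assumption: leading dimensions carry the most information (Matryoshka-style).

function truncateAndNormalize(embedding: number[], dims: number): number[] {
  if (dims <= 0 || dims > embedding.length) {
    throw new RangeError(`dims must be between 1 and ${embedding.length}`);
  }
  const head = embedding.slice(0, dims);
  const norm = Math.hypot(...head);
  // Guard against a zero vector to avoid dividing by zero
  return norm === 0 ? head : head.map((v) => v / norm);
}

// Toy 4-d vector standing in for a 1024-d embedding
const full = [3, 4, 12, 0];
const reduced = truncateAndNormalize(full, 2);
console.log(reduced);
```

Re-normalizing after truncation keeps inner-product scores comparable to cosine similarity, which matters if the vector index assumes unit-length vectors.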
Use Cases on Mixpeek
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| MIRACL (avg, 18 languages) | nDCG@10 | 71.9% | BAAI, 2024 -- Paper Table 3 |
| MTEB Retrieval (en) | nDCG@10 | 67.2% | MTEB Leaderboard |
Performance
Specification
Research Paper
BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity
arxiv.org
Build a pipeline with bge-m3
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.