bge-m3
by BAAI
Hybrid retrieval in one model -- dense, sparse, and ColBERT embeddings from a single forward pass
BAAI/bge-m3
mixpeek://text_extractor@v1/baai_bge_m3_v1
Overview
BGE-M3 is BAAI's multi-functionality embedding model that produces three types of embeddings simultaneously: dense vectors for semantic search, sparse vectors for lexical matching, and ColBERT-style multi-vector representations for fine-grained late interaction. This eliminates the need to run separate models for different retrieval strategies.
The model supports more than 100 languages and accepts inputs of up to 8192 tokens, which makes it suitable for long documents. On Mixpeek, BGE-M3 powers hybrid retrieval pipelines: a single ingest pass produces all three representation types, and the retriever fuses them at query time, which typically yields higher recall than any single strategy alone.
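As a rough illustration of query-time fusion, the sketch below scores one query against one document with each of the three representations and combines the results with a weighted sum. The `M3Embedding` shape, the function names, and the default weights are assumptions made for this example; they are not Mixpeek's retriever API, and the weights are placeholders rather than tuned values.

```typescript
// Hypothetical container for the three BGE-M3 outputs (illustrative names only).
interface M3Embedding {
  dense: number[];                 // one L2-normalized vector
  sparse: Record<string, number>;  // token -> lexical term weight
  colbert: number[][];             // one L2-normalized vector per token
}

const dot = (a: number[], b: number[]): number =>
  a.reduce((sum, x, i) => sum + x * b[i], 0);

// Dense score: inner product of the two normalized vectors.
function denseScore(q: M3Embedding, d: M3Embedding): number {
  return dot(q.dense, d.dense);
}

// Sparse score: sum of weight products over terms that occur in both texts.
function sparseScore(q: M3Embedding, d: M3Embedding): number {
  let score = 0;
  for (const term of Object.keys(q.sparse)) {
    const wd = d.sparse[term];
    if (wd !== undefined) score += q.sparse[term] * wd;
  }
  return score;
}

// ColBERT-style score: for each query token take its best-matching
// document token (MaxSim), then average over the query tokens.
function colbertScore(q: M3Embedding, d: M3Embedding): number {
  if (q.colbert.length === 0 || d.colbert.length === 0) return 0;
  const total = q.colbert.reduce(
    (sum, qv) => sum + Math.max(...d.colbert.map((dv) => dot(qv, dv))),
    0,
  );
  return total / q.colbert.length;
}

// Weighted fusion of the three scores. The default weights are placeholders.
function hybridScore(
  q: M3Embedding,
  d: M3Embedding,
  weights = { dense: 0.4, sparse: 0.2, colbert: 0.4 },
): number {
  return (
    weights.dense * denseScore(q, d) +
    weights.sparse * sparseScore(q, d) +
    weights.colbert * colbertScore(q, d)
  );
}
```

A weighted sum is only one possible fusion rule; rank-based schemes such as reciprocal rank fusion are a common alternative when the three score scales are hard to compare.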
Architecture
XLM-RoBERTa backbone, 568M parameters. Produces dense embeddings (1024d), sparse term-weight vectors, and ColBERT multi-vector representations from one forward pass. Trained with self-knowledge distillation across 100+ languages.
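To make the "one forward pass, three outputs" idea concrete, here is a toy sketch of how three output heads could share the same per-token hidden states. It follows the general recipe in the BGE-M3 paper (normalized `[CLS]` state for dense, a ReLU-activated scalar per token for sparse, projected and normalized token states for multi-vector), but the `Heads` type, the `m3Outputs` function, and the skipping of position 0 for sparse weights are assumptions for illustration, and the hidden states are supplied by hand rather than produced by an encoder.

```typescript
type Vec = number[];

const inner = (a: Vec, b: Vec): number =>
  a.reduce((sum, x, i) => sum + x * b[i], 0);

const l2normalize = (v: Vec): Vec => {
  const norm = Math.sqrt(inner(v, v));
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
};

// Multiply a matrix (given as rows) by a vector.
const matVec = (m: number[][], v: Vec): Vec => m.map((row) => inner(row, v));

// Hypothetical head parameters; a real model learns these during training.
interface Heads {
  lexWeights: Vec;         // maps a hidden state to one scalar term weight
  colbertProj: number[][]; // projects a hidden state for late interaction
}

function m3Outputs(tokens: string[], hidden: Vec[], heads: Heads) {
  // Dense: normalized hidden state at position 0, assumed to be [CLS].
  const dense = l2normalize(hidden[0]);

  // Sparse: ReLU of a scalar projection per token. Repeated tokens keep
  // their maximum weight; position 0 (the special token) is skipped.
  const sparse: Record<string, number> = {};
  for (let i = 1; i < tokens.length; i++) {
    const w = Math.max(0, inner(heads.lexWeights, hidden[i]));
    if (w > 0) sparse[tokens[i]] = Math.max(sparse[tokens[i]] || 0, w);
  }

  // Multi-vector: project every token state, then normalize each one.
  const colbert = hidden.map((h) => l2normalize(matVec(heads.colbertProj, h)));

  return { dense, sparse, colbert };
}
```

Because all three heads read the same hidden states, the encoder cost is paid once per input regardless of how many representation types are kept.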
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "text_embedding",
    version: "v1",
    parameters: { model_id: "BAAI/bge-m3" },
  },
});
Capabilities
- Dense, sparse, and ColBERT embeddings in one pass
- 100+ language support
- 8192 token context window
- Hybrid retrieval without multiple models
- Matryoshka dimension reduction
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| MIRACL (avg, 18 languages) | nDCG@10 | 71.9% | BAAI, 2024 -- Paper Table 3 |
| MTEB Retrieval (en) | nDCG@10 | 67.2% | MTEB Leaderboard |
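For readers unfamiliar with the metric in the table, the following is a minimal sketch of nDCG@k using the linear-gain form (relevance divided by a log2 position discount). The function names are illustrative, and benchmark suites may use a different gain function such as 2^rel - 1; the two forms coincide for binary relevance labels.

```typescript
// Discounted cumulative gain over the top k results.
// `relevances` holds the graded relevance of each returned document, in rank order.
function dcgAtK(relevances: number[], k: number): number {
  return relevances
    .slice(0, k)
    .reduce((sum, rel, i) => sum + rel / (Math.log(i + 2) / Math.LN2), 0);
}

// nDCG@k: DCG of the actual ranking divided by the DCG of the ideal ranking.
// `judged` holds the relevance labels of all judged documents for the query.
function ndcgAtK(ranked: number[], judged: number[], k: number): number {
  const ideal = judged.slice().sort((a, b) => b - a);
  const idcg = dcgAtK(ideal, k);
  return idcg === 0 ? 0 : dcgAtK(ranked, k) / idcg;
}
```

A score of 1 means the top k results are in the ideal order; a reported benchmark figure is the average of this value over all queries in the dataset.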
Research Paper
BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity
arxiv.org
Build a pipeline with bge-m3
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.