bge-large-en-v1.5
by BAAI
BAAI General Embedding — state-of-the-art text retrieval
BAAI/bge-large-en-v1.5
mixpeek://text_extractor@v1/baai_bge_large_v1
Overview
BGE (BAAI General Embedding) is a family of text embedding models that achieve top performance on the MTEB benchmark. The large-en-v1.5 variant produces 1024-dimensional embeddings optimized for English text retrieval and semantic similarity.
On Mixpeek, BGE powers text-based semantic search over extracted text content — transcriptions, captions, OCR results, and document text.
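Once text features are embedded, queries can be served from the same collection. The sketch below is hypothetical: the retrievers.execute method name, the retriever_id, and the payload shape are assumptions for illustration, not documented Mixpeek SDK surface.

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// HYPOTHETICAL: method name and payload shape are assumed for illustration;
// check the Mixpeek API reference for the actual retrieval interface.
const results = await mx.retrievers.execute({
  retriever_id: "my-text-retriever",
  query: { text: "what did the report say about revenue growth?" },
});
console.log(results);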
Architecture
BERT-Large architecture (24 layers, 1024-dimensional hidden states, 16 attention heads), fine-tuned with contrastive learning on curated text pairs. Embeddings are taken from the [CLS] token, with an optional instruction prefix for retrieval queries.
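For intuition, the snippet below sketches producing one embedding outside Mixpeek. It assumes the Xenova ONNX port of the model and the transformers.js feature-extraction pipeline; both the package and the model id are assumptions here, not part of the Mixpeek SDK.

import { pipeline } from "@xenova/transformers";

// Load an ONNX port of bge-large-en-v1.5 (assumed Xenova Hugging Face repo).
const embed = await pipeline("feature-extraction", "Xenova/bge-large-en-v1.5");

// BGE v1.5 recommends an instruction prefix for retrieval queries;
// passages are embedded without it.
const query =
  "Represent this sentence for searching relevant passages: " +
  "quarterly revenue growth";

// [CLS] pooling plus L2 normalization yields a 1024-dimensional unit vector.
const output = await embed(query, { pooling: "cls", normalize: true });
console.log(output.dims); // [1, 1024]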
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Ingest a document; extracted text (transcriptions, captions, OCR) is
// embedded with bge-large-en-v1.5 via the text_embedding extractor.
await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/report.pdf" },
  feature_extractors: [{
    name: "text_embedding",
    version: "v1",
    params: {
      model_id: "BAAI/bge-large-en-v1.5",
    },
  }],
});

Capabilities
- 1024-dimensional dense text embeddings
- Top-ranked on MTEB retrieval benchmarks
- Instruction-aware embedding with task prefixes
- Optimized for asymmetric retrieval (query vs. passage); see the sketch below
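To make the asymmetry concrete, the sketch below prefixes only the query, embeds both sides, and ranks passages by dot product (equal to cosine similarity on unit vectors). It again assumes the Xenova ONNX port via transformers.js rather than any Mixpeek API.

import { pipeline } from "@xenova/transformers";

const embed = await pipeline("feature-extraction", "Xenova/bge-large-en-v1.5");
const PREFIX = "Represent this sentence for searching relevant passages: ";

// Embed one string to a normalized 1024-dimensional vector ([CLS] pooling).
async function vec(text: string): Promise<number[]> {
  const t = await embed(text, { pooling: "cls", normalize: true });
  return Array.from(t.data as Float32Array);
}

const passages = [
  "Revenue grew 12% year over year, driven by subscriptions.",
  "The office relocated to a larger building downtown.",
];

// Only the query gets the instruction prefix; passages are embedded verbatim.
const q = await vec(PREFIX + "how fast is revenue growing?");
const scored = await Promise.all(
  passages.map(async (p) => {
    const v = await vec(p);
    // Vectors are unit-length, so the dot product equals cosine similarity.
    const score = v.reduce((s, x, i) => s + x * q[i], 0);
    return { p, score };
  }),
);
scored.sort((a, b) => b.score - a.score);
console.log(scored[0].p); // expect the revenue passage to rank first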
Research Paper
C-Pack: Packaged Resources To Advance General Chinese Embedding
arxiv.org
Build a pipeline with bge-large-en-v1.5
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.