bge-large-en-v1.5

by BAAI

BAAI General Embedding — state-of-the-art text retrieval

5.8M downloads/month

631 likes

335M params

Identifiers

Model ID

BAAI/bge-large-en-v1.5

Feature URI

mixpeek://text_extractor@v1/baai_bge_large_v1

Overview

BGE (BAAI General Embedding) is a family of text embedding models that have ranked at or near the top of the MTEB (Massive Text Embedding Benchmark) leaderboard. The large-en-v1.5 variant maps each input text to a single 1024-dimensional vector and is optimized for English text retrieval and semantic similarity.
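To make "semantic similarity" concrete, the sketch below compares embedding vectors with cosine similarity and ranks documents against a query. It assumes the embeddings have already been computed and are available as plain number arrays. The helper names cosineSimilarity and rankBySimilarity are illustrative and are not part of the Mixpeek SDK or the model's own tooling.

```typescript
// Cosine similarity between two embedding vectors of equal length.
// Hypothetical helper for illustration; not part of any SDK.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Embedding dimensions do not match");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat a zero vector as having no similarity to anything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return document indices ordered from most to least similar to the query.
function rankBySimilarity(query: number[], docs: number[][]): number[] {
  return docs
    .map((doc, index) => ({ index, score: cosineSimilarity(query, doc) }))
    .sort((x, y) => y.score - x.score)
    .map((r) => r.index);
}
```

Real embeddings from this model would have 1024 entries each. The same functions work unchanged, since they only require that both vectors share a length.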

On Mixpeek, BGE powers text-based semantic search over extracted text content — transcriptions, captions, OCR results, and document text.

Architecture

BGE-large-en-v1.5 uses the BERT-Large architecture (24 layers, 1024-dimensional hidden states, 16 attention heads), fine-tuned with contrastive learning on curated text pairs. The final hidden state of the [CLS] token serves as the embedding, and an instruction prefix can optionally be prepended to queries.
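The pooling step described above can be sketched as follows, assuming the encoder's last-layer hidden states for one input are available as a [sequence length][hidden size] array. The function names are hypothetical. The instruction string is the one recommended for queries in the upstream model card, so verify it against that card before relying on it. The L2 normalization step is an assumption added here because it is common practice for cosine-similarity search; the text above does not mention it.

```typescript
// Query-side instruction from the upstream model card (assumption: confirm
// against the card). It is typically applied to queries, not to passages.
const QUERY_INSTRUCTION =
  "Represent this sentence for searching relevant passages: ";

// Optionally prepend the instruction prefix to a search query.
function prepareQuery(query: string, useInstruction: boolean = true): string {
  return useInstruction ? QUERY_INSTRUCTION + query : query;
}

// [CLS] pooling: the embedding is the hidden state of the first token.
// tokenStates has shape [sequenceLength][hiddenSize].
function clsPool(tokenStates: number[][]): number[] {
  if (tokenStates.length === 0) {
    throw new Error("Cannot pool an empty sequence");
  }
  return tokenStates[0].slice();
}

// Scale a vector to unit length so a dot product equals cosine similarity.
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}
```

With the real model, hiddenSize is 1024, so clsPool returns the 1024-dimensional vector described in the overview.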

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/report.pdf" },
  feature_extractors: [{
    name: "text_embedding",
    version: "v1",
    params: {
      model_id: "BAAI/bge-large-en-v1.5"
    }
  }]
});