    HF | Text Embeddings | MIT

    bge-large-en-v1.5

    by BAAI

    BAAI General Embedding, state-of-the-art text retrieval

    7.1M downloads/month
    643 likes
    335M params
    Identifiers
    Model ID
    BAAI/bge-large-en-v1.5
    Feature URI
    mixpeek://text_extractor@v1/baai_bge_large_v1

    Overview

    BGE (BAAI General Embedding) is a family of text embedding models that rank near the top of the MTEB benchmark (scores for this variant are listed under Benchmarks). The large-en-v1.5 variant produces 1024-dimensional embeddings optimized for English text retrieval and semantic similarity.

    On Mixpeek, BGE powers text-based semantic search over extracted text content, transcriptions, captions, OCR results, and document text.

    Architecture

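    The model uses the BERT-Large architecture (24 layers, 1024-dimensional hidden states, 16 attention heads), with task-specific training based on contrastive learning over curated text pairs. The embedding is taken from the [CLS] token (CLS pooling), and an instruction prefix can optionally be prepended to the input.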
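The pooling step described above can be sketched as follows. This is a minimal illustration, not the model's actual inference code: it assumes the encoder output is available as a plain array of per-token vectors with the [CLS] vector first, and `clsPool` and `withQueryInstruction` are hypothetical helper names. The instruction string is the query-side prefix published with the English BGE v1.5 models; the L2 normalization is a common convention (an assumption here) so that dot products equal cosine similarity.

```typescript
// Query-side instruction published with the English BGE v1.5 models.
// Passages are embedded without any prefix.
const QUERY_INSTRUCTION =
  "Represent this sentence for searching relevant passages: ";

// Hypothetical helper: prepend the instruction to a search query.
function withQueryInstruction(query: string): string {
  return QUERY_INSTRUCTION + query;
}

// Hypothetical helper: [CLS] pooling over encoder output.
// Assumes tokenStates[0] is the final hidden state of the [CLS] token.
function clsPool(tokenStates: number[][]): number[] {
  if (tokenStates.length === 0) {
    throw new Error("expected at least one token vector");
  }
  const cls = tokenStates[0];
  const norm = Math.sqrt(cls.reduce((sum, x) => sum + x * x, 0));
  // L2-normalize so a dot product between embeddings is a cosine similarity.
  return norm === 0 ? cls.slice() : cls.map((x) => x / norm);
}

// Toy 2-dimensional example (real hidden states are 1024-dimensional).
const pooled = clsPool([[3, 4], [10, 10], [-1, 2]]);
console.log(pooled); // [0.6, 0.8]
```

Only the first token's vector contributes to the result; the remaining token vectors are ignored, which is what distinguishes CLS pooling from mean pooling.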

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    const mx = new Mixpeek({ apiKey: "API_KEY" });

    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/report.pdf" },
      feature_extractors: [{
        name: "text_embedding",
        version: "v1",
        params: {
          model_id: "BAAI/bge-large-en-v1.5"
        }
      }]
    });

    Capabilities

    • 1024-dimensional dense text embeddings
    • Top-ranked on MTEB retrieval benchmarks
    • Instruction-aware embedding with task prefixes
    • Optimized for asymmetric retrieval (query vs. passage)
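To make the retrieval capability concrete, the sketch below ranks passage embeddings against a query embedding by cosine similarity. It is an illustration under assumptions: the vectors are toy 3-dimensional stand-ins for real 1024-dimensional embeddings, and `cosine` and `rankPassages` are hypothetical helpers, not Mixpeek SDK calls. In an asymmetric setup the query vector would come from the (optionally instruction-prefixed) query and the passage vectors from unprefixed passages.

```typescript
type Scored = { id: string; score: number };

// Cosine similarity between two vectors of equal length.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
}

// Score every passage against the query and sort best-first.
function rankPassages(
  queryVec: number[],
  passages: { id: string; vec: number[] }[]
): Scored[] {
  return passages
    .map((p) => ({ id: p.id, score: cosine(queryVec, p.vec) }))
    .sort((x, y) => y.score - x.score);
}

const ranked = rankPassages([1, 0, 0], [
  { id: "a", vec: [0, 1, 0] },
  { id: "b", vec: [2, 0, 0] },
  { id: "c", vec: [1, 1, 0] },
]);
console.log(ranked.map((r) => r.id)); // ["b", "c", "a"]
```

Passage "b" ranks first because cosine similarity ignores vector length and only compares direction. A production system would typically use an approximate nearest-neighbor index instead of this exhaustive scan.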

    Use Cases on Mixpeek

    • Semantic search over transcribed audio/video content
    • Document similarity and deduplication
    • RAG pipeline embedding backend
    • Cross-document concept matching

    Benchmarks

    Dataset              Metric      Score   Source
    MTEB (56 datasets)   Avg Score   64.23   MTEB Leaderboard, bge-large-en-v1.5
    MS MARCO (Passage)   MRR@10      41.2    Xiao et al., 2024, Table 3
    NLI (STS)            Spearman    86.4    MTEB Leaderboard

    Performance

    Input Size: 512 tokens max
    Embedding Dim: 1024
    GPU Latency: ~3 ms / passage (A100)
    CPU Latency: ~28 ms / passage
    GPU Throughput: ~330 passages/sec (A100)
    GPU Memory: ~1.3 GB
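The figures above translate into rough capacity estimates. The sketch below is back-of-the-envelope arithmetic only: it takes the approximate A100 throughput and the embedding dimension from the list above, and it assumes vectors are stored as uncompressed float32 (4 bytes per value), which may not match how a given vector store keeps them. `estimateIndexing` is a hypothetical helper.

```typescript
const EMBEDDING_DIM = 1024;        // embedding dimension listed above
const GPU_PASSAGES_PER_SEC = 330;  // approximate A100 throughput listed above
const BYTES_PER_FLOAT32 = 4;       // assumption: uncompressed float32 storage

// Hypothetical helper: estimate GPU time and raw vector storage for a corpus.
function estimateIndexing(passageCount: number): {
  gpuSeconds: number;
  vectorBytes: number;
} {
  return {
    gpuSeconds: passageCount / GPU_PASSAGES_PER_SEC,
    vectorBytes: passageCount * EMBEDDING_DIM * BYTES_PER_FLOAT32,
  };
}

const est = estimateIndexing(1000000);
console.log((est.gpuSeconds / 60).toFixed(1), "GPU minutes");
console.log((est.vectorBytes / 1e9).toFixed(1), "GB of raw vectors");
```

For one million passages this works out to roughly 50 minutes of single-GPU embedding time and about 4.1 GB of raw vector storage, before any index overhead or quantization.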

    Specification

    Framework: HF
    Organization: BAAI
    Feature: Text Embeddings
    Output: 1024-dim vector
    Modalities: document, audio
    Retriever: Text Similarity
    Parameters: 335M
    License: MIT
    Downloads/mo: 7.1M
    Likes: 643

    Research Paper

    C-Pack: Packaged Resources To Advance General Chinese Embedding

    arxiv.org

    Build a pipeline with bge-large-en-v1.5

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
