    HF · Visual Embeddings · Apache 2.0

    BidirLM-Omni-2.5B-Embedding

    by BidirLM

    Bidirectional omni-modal encoder for text, images, and audio in a shared vector space

    8K downloads/month
    2.5B parameters
    Identifiers
    Model ID: BidirLM/BidirLM-Omni-2.5B-Embedding
    Feature URI: mixpeek://image_extractor@v1/bidirlm_omni_25b_v1

    Overview

    BidirLM-Omni-2.5B-Embedding is a 2.5B-parameter bidirectional embedding model that encodes text, images, and audio into a shared 2048-dimensional vector space. It is based on Qwen3 with custom bidirectional attention in place of the standard causal mask. It reports state-of-the-art results at the 2.5B scale on the MTEB Multilingual V2 (text), MIEB (image), and MAEB (audio) benchmarks, which makes it one of the first models to top leaderboards across all three modalities at once. It supports 119+ languages with a 32K context window.

    Architecture

    The model is a modified Qwen3-2.5B in which bidirectional attention replaces causal attention for encoding tasks. Modality-specific input adapters project images (via CLIP-style patches) and audio (via mel-spectrogram frames) into the same token space as text. Mean pooling over the final hidden states produces 2048-dimensional embeddings. The bidirectional attention is critical: under a causal mask, earlier tokens cannot attend to later ones, so their hidden states miss the context that follows them, and that degrades an embedding pooled over every position.
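
The difference between the two attention patterns, and the mean-pooling step, can be illustrated with a small pure-Python sketch. This is not the model's implementation: the function names, the toy sequence length, and the 2-dimensional hidden states are all assumptions chosen for illustration.

```python
def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every token may attend to every other token."""
    return [[True] * n for _ in range(n)]


def mean_pool(hidden_states, attention_mask):
    """Average the hidden states of non-padding tokens into one vector."""
    kept = [h for h, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    dim = len(kept[0])
    return [sum(h[d] for h in kept) / len(kept) for d in range(dim)]


if __name__ == "__main__":
    n = 4
    # Number of positions each token can see under each mask.
    print([sum(row) for row in causal_mask(n)])         # [1, 2, 3, 4]
    print([sum(row) for row in bidirectional_mask(n)])  # [4, 4, 4, 4]

    # Toy 3-token sequence with 2-dim hidden states; the last token is padding.
    states = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
    print(mean_pool(states, [1, 1, 0]))  # [2.0, 3.0]
```

Under the causal mask the first token sees only itself, so its hidden state carries no information about the rest of the input, yet mean pooling still weights it equally with every other position.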

    Mixpeek SDK Integration

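    from mixpeek import Mixpeek

    # Authenticate with your Mixpeek API key.
    mx = Mixpeek(api_key="YOUR_KEY")

    # Ingest a mixed-media bucket and embed each item with BidirLM-Omni.
    mx.ingest.videos(
        source="s3://media/mixed-content/",
        collection="omni_search",
        feature_extractors=[{
            "name": "visual_embeddings",
            "model": "BidirLM/BidirLM-Omni-2.5B-Embedding",
            # Request all three modalities at the full 2048-dim output size.
            "params": {"modalities": ["text", "image", "audio"], "dim": 2048},
        }],
    )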

    Capabilities

    • Unified text, image, and audio embeddings in shared vector space
    • Cross-modal retrieval (text query → image/audio results and vice versa)
    • 119+ language support for multilingual text embedding
    • 32K context window for long document embedding
    • State-of-the-art across MTEB, MIEB, and MAEB simultaneously
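
Because all modalities share one vector space, cross-modal retrieval reduces to a nearest-neighbour search over embeddings. The sketch below ranks items by cosine similarity in pure Python. The 3-dimensional vectors and the item names are hand-written stand-ins for real 2048-dimensional embeddings, and `cosine_similarity` and `rank` are hypothetical helpers, not part of any SDK.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank(query, items):
    """Return (item_id, score) pairs sorted from most to least similar."""
    scored = [(item_id, cosine_similarity(query, vec))
              for item_id, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Illustrative stand-ins for embedded images and audio clips.
    library = {
        "image:dog.jpg": [0.9, 0.1, 0.0],
        "audio:bark.wav": [0.8, 0.2, 0.1],
        "image:invoice.png": [0.0, 0.1, 0.9],
    }
    text_query = [1.0, 0.0, 0.0]  # stand-in for an embedded text query
    for item_id, score in rank(text_query, library):
        print(f"{item_id}\t{score:.3f}")
```

A production system would typically hand this search to a vector index rather than scoring every item, but the ranking criterion stays the same.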

    Use Cases on Mixpeek

    • Cross-modal search across mixed media libraries
    • Unified embedding pipeline replacing separate text + image + audio encoders
    • Multilingual multimodal retrieval
    • Podcast/video search using audio similarity
    • Building shared vector spaces for agent perception across modalities

    Benchmarks

    Dataset              | Metric     | Score              | Source
    MTEB Multilingual V2 | Mean Score | SOTA at 2.5B scale | Text embedding benchmark
    MIEB                 | Mean Score | SOTA at 2.5B scale | Image embedding benchmark
    MAEB                 | Mean Score | SOTA at 2.5B scale | Audio embedding benchmark

    Performance

    Input Size: Variable
    GPU Latency: Input dependent
    GPU Throughput: ~200 items/sec (A100, batch 32, text)
    GPU Memory: ~6 GB
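
For rough capacity planning, the throughput figure above and the 2048-dimensional output can be turned into back-of-the-envelope estimates. The sketch below assumes float32 storage (4 bytes per value) and a hypothetical corpus of one million text items. The ~200 items/sec figure applies to text on an A100 at batch size 32, so image and audio workloads would need their own measurements.

```python
def embedding_storage_bytes(num_vectors, dim=2048, bytes_per_value=4):
    """Raw size of the vectors alone, excluding any index overhead."""
    return num_vectors * dim * bytes_per_value


def embedding_time_seconds(num_items, items_per_second=200.0):
    """Wall-clock time to embed num_items at a steady throughput."""
    if items_per_second <= 0:
        raise ValueError("items_per_second must be positive")
    return num_items / items_per_second


if __name__ == "__main__":
    corpus = 1_000_000  # hypothetical corpus size
    size = embedding_storage_bytes(corpus)
    seconds = embedding_time_seconds(corpus)
    print(f"storage: {size / 1e9:.2f} GB")      # storage: 8.19 GB
    print(f"time: {seconds / 60:.1f} minutes")  # time: 83.3 minutes
```

Each float32 vector occupies 2048 × 4 = 8192 bytes, so storage grows linearly with corpus size; halving the precision to float16 would halve that figure.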

    Specification

    Framework: HF
    Organization: BidirLM
    Feature: Visual Embeddings
    Output: 2048-dim vector
    Modalities: video, image
    Retriever: Vector Search
    Parameters: 2.5B
    License: Apache 2.0
    Downloads/mo: 8K

    Build a pipeline with BidirLM-Omni-2.5B-Embedding

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
