e5-omni-7B

by Haon-Chen

State-of-the-art omnimodal embedding with explicit cross-modal alignment

261 downloads/month

~9B params


Identifiers

Model ID

Haon-Chen/e5-omni-7B

Feature URI

mixpeek://image_extractor@v1/haon_chen_e5_omni_7b_v1

Overview

E5-Omni is Microsoft's omnimodal embedding model. It achieves state-of-the-art results on the MMEB-V2 benchmark across text, image, audio, and video tasks. Built on Qwen2.5-Omni-7B, it introduces three techniques for cross-modal alignment: modality-aware temperature calibration, controllable negative curriculum learning, and batch whitening.

On Mixpeek, E5-Omni delivers the highest-quality cross-modal embeddings available. Because of its explicit alignment techniques, similarity scores between different modalities (e.g., a text query vs. an audio clip) are more reliable than those produced by models trained with simple contrastive objectives.
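To make "similarity scores between different modalities" concrete, here is a minimal sketch of how embeddings that share one vector space can be compared and ranked. It assumes cosine similarity as the scoring function and uses a hypothetical Candidate shape; neither is taken from the Mixpeek SDK, and the vectors would come from whatever extractor output you have on hand.

```typescript
type Modality = "text" | "image" | "audio" | "video";

// Hypothetical record shape for illustration; not a Mixpeek SDK type.
interface Candidate {
  id: string;
  modality: Modality;
  embedding: number[];
}

// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("embedding dimensions must match");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat its similarity as 0.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every candidate against the query and sort best-first,
// regardless of which modality each candidate came from.
function rankBySimilarity(
  query: number[],
  candidates: Candidate[],
): { id: string; modality: Modality; score: number }[] {
  return candidates
    .map((c) => ({
      id: c.id,
      modality: c.modality,
      score: cosineSimilarity(query, c.embedding),
    }))
    .sort((x, y) => y.score - x.score);
}
```

Because all modalities land in one space, the same ranking function can serve a text query against audio clips, images, or video segments without per-modality score normalization, which is the practical benefit the alignment claim points at.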

Architecture

Qwen2.5-Omni-7B backbone with three alignment components: (1) modality-aware temperature calibration, (2) controllable negative curriculum that progressively masks easy negatives, (3) batch whitening and covariance alignment. ~9B total parameters. Unified embedding space for all modalities.
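The first two components can be illustrated with a generic contrastive (InfoNCE) loss. This is a rough sketch under stated assumptions, not the model's actual training code: the per-modality temperature values, the similarity cutoff used to mask easy negatives, and all names below are hypothetical, and batch whitening is not shown.

```typescript
type TargetModality = "text" | "image" | "audio" | "video";

// Hypothetical per-modality temperatures; illustrative values only.
const TEMPERATURE: Record<TargetModality, number> = {
  text: 0.02,
  image: 0.03,
  audio: 0.05,
  video: 0.04,
};

// Numerically stable log(sum(exp(x))) for a non-empty array.
function logSumExp(xs: number[]): number {
  const m = Math.max(...xs);
  return m + Math.log(xs.reduce((acc, x) => acc + Math.exp(x - m), 0));
}

// InfoNCE loss for one query: -log softmax of the positive over positive + negatives.
// `positive` and `negatives` are similarity scores (e.g. cosine similarities).
function infoNceLoss(
  positive: number,
  negatives: number[],
  targetModality: TargetModality,
  easyNegativeCutoff: number = -Infinity,
): number {
  // Modality-aware temperature: the scale depends on what is being retrieved.
  const tau = TEMPERATURE[targetModality];
  // Assumed curriculum step: negatives scoring below the cutoff are "easy"
  // and are masked out; raising the cutoff over training masks more of them.
  const kept = negatives.filter((s) => s >= easyNegativeCutoff);
  const positiveLogit = positive / tau;
  const logits = [positiveLogit, ...kept.map((s) => s / tau)];
  return logSumExp(logits) - positiveLogit;
}
```

With the default cutoff no negatives are masked and this reduces to ordinary temperature-scaled InfoNCE. The intuition for a per-modality temperature is that raw similarity scores can be distributed differently for, say, text-audio pairs than for text-image pairs, so a single shared scale may not suit every pair type.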

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "multimodal_embedding",
    version: "v1",
    parameters: { model_id: "Haon-Chen/e5-omni-7B" },
  },
});

Capabilities

SOTA on MMEB-V2 benchmark (66.4 overall across 78 tasks)
Best audio retrieval among omnimodal models (37.7 Recall@1 on AudioCaps)
Unified text, image, audio, and video embeddings
Explicit cross-modal alignment for reliable similarity scores
Outperforms 3B models by 15+ points on MMEB-V2

Use Cases on Mixpeek

Cross-modal retrieval: find audio clips matching a text description

Multimedia RAG: unified retrieval across all content types

Audio-visual search: query meetings by both spoken content and visual slides

Research libraries: embed papers, presentations, and recorded talks together

Benchmarks

Dataset	Metric	Score	Source
MMEB-V2 (78 tasks)	Overall	66.4	Chen et al., 2025: arXiv:2601.03666
MMEB-V2 Image (36 tasks)	Hit@1	71.2	Chen et al., 2025: arXiv:2601.03666
AudioCaps	Recall@1	37.7	Chen et al., 2025: arXiv:2601.03666
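
For readers unfamiliar with the metrics in the table, the sketch below shows how a Recall@K (or Hit@1, when K = 1) score is commonly computed from a query-by-candidate similarity matrix. It assumes exactly one relevant candidate per query, which is a simplification; the function name and input layout are illustrative and are not taken from the benchmark's evaluation code.

```typescript
// scores[q][c] is the similarity between query q and candidate c.
// relevant[q] is the index of the single correct candidate for query q.
// Returns the fraction of queries whose correct candidate ranks in the top k.
function recallAtK(scores: number[][], relevant: number[], k: number): number {
  if (scores.length !== relevant.length) {
    throw new Error("need one relevant index per query");
  }
  if (scores.length === 0) return 0;
  let hits = 0;
  scores.forEach((row, q) => {
    const target = row[relevant[q]];
    // Count candidates that strictly outscore the correct one;
    // the correct candidate is in the top k if fewer than k do.
    const higher = row.filter((s) => s > target).length;
    if (higher < k) hits++;
  });
  return hits / scores.length;
}
```

Multiplying the result by 100 gives a value on the same scale as the table, so a Recall@1 of 37.7 corresponds to the correct item ranking first for roughly 38 out of every 100 queries.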