e5-omni-7B
by Haon-Chen
State-of-the-art omnimodal embedding with explicit cross-modal alignment
Haon-Chen/e5-omni-7B
mixpeek://image_extractor@v1/haon_chen_e5_omni_7b_v1
Overview
E5-Omni is Microsoft's omnimodal embedding model. It achieves state-of-the-art results on the MMEB-V2 benchmark across text, image, audio, and video tasks. Built on Qwen2.5-Omni-7B, it introduces modality-aware temperature calibration, controllable negative curriculum learning, and batch whitening for cross-modal alignment.
On Mixpeek, E5-Omni delivers the highest-quality cross-modal embeddings available. Because of its explicit alignment techniques, similarity scores between different modalities (e.g., a text query vs. an audio clip) are more reliable than those from models trained with simple contrastive objectives.
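To make "similarity scores between different modalities" concrete, here is a minimal sketch of how embeddings from a unified space are typically compared. It assumes cosine similarity as the scoring function, which the page does not state explicitly. The `Candidate` type, the function names, and the toy 3-dimensional vectors are hypothetical and used only for illustration; real embeddings have far more dimensions.

```typescript
type Modality = "text" | "image" | "audio" | "video";

// Hypothetical record shape for an embedded item; not a Mixpeek SDK type.
interface Candidate {
  id: string;
  modality: Modality;
  embedding: number[];
}

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Embedding dimensions must match");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat a zero vector as having no similarity to anything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every candidate against the query and sort best-first.
function rankBySimilarity(
  query: number[],
  candidates: Candidate[]
): Array<{ id: string; modality: Modality; score: number }> {
  return candidates
    .map((c) => ({
      id: c.id,
      modality: c.modality,
      score: cosineSimilarity(query, c.embedding),
    }))
    .sort((x, y) => y.score - x.score);
}

// Toy data: a text-query vector compared against an audio clip and an image.
const toyQuery = [1, 0, 0];
const toyCandidates: Candidate[] = [
  { id: "frame-b", modality: "image", embedding: [0, 1, 0] },
  { id: "clip-a", modality: "audio", embedding: [0.9, 0.1, 0] },
];
const ranked = rankBySimilarity(toyQuery, toyCandidates);
```

Because all modalities share one space, the same scoring function applies whether the candidate is an audio clip, an image, or a video segment.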
Architecture
Qwen2.5-Omni-7B backbone with three alignment components: (1) modality-aware temperature calibration, (2) controllable negative curriculum that progressively masks easy negatives, (3) batch whitening and covariance alignment. ~9B total parameters. Unified embedding space for all modalities.
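The first two alignment components can be illustrated with a small sketch. This is one plausible reading, not the paper's actual formulation: it assumes a standard InfoNCE contrastive loss in which the softmax temperature depends on the modality, and a negative curriculum that keeps only the hardest (most similar) negatives. The temperature values, function names, and similarity scores below are all hypothetical. Batch whitening is omitted.

```typescript
type AlignModality = "text" | "image" | "audio" | "video";

// Illustrative per-modality temperatures; these numbers are assumptions,
// not values reported for e5-omni.
const modalityTemperature: Record<AlignModality, number> = {
  text: 0.02,
  image: 0.03,
  audio: 0.05,
  video: 0.04,
};

// InfoNCE loss for a single query.
// `sims` holds similarity scores to each candidate; `positiveIndex` marks the match.
function infoNceLoss(
  sims: number[],
  positiveIndex: number,
  temperature: number
): number {
  const logits = sims.map((s) => s / temperature);
  // Subtract the max logit before exponentiating for numerical stability.
  const maxLogit = Math.max(...logits);
  const sumExp = logits.reduce((acc, l) => acc + Math.exp(l - maxLogit), 0);
  const logSumExp = maxLogit + Math.log(sumExp);
  return logSumExp - logits[positiveIndex];
}

// Curriculum step: keep the positive plus the `keep` hardest negatives,
// dropping the easy (low-similarity) ones. The positive moves to index 0.
function maskEasyNegatives(
  sims: number[],
  positiveIndex: number,
  keep: number
): { sims: number[]; positiveIndex: number } {
  const hardest = sims
    .filter((_, i) => i !== positiveIndex)
    .sort((a, b) => b - a)
    .slice(0, keep);
  return { sims: [sims[positiveIndex], ...hardest], positiveIndex: 0 };
}

// Example: an audio query with one positive (index 0) and three negatives.
const exampleSims = [0.8, 0.3, 0.1, 0.6];
const masked = maskEasyNegatives(exampleSims, 0, 2);
const audioLoss = infoNceLoss(
  masked.sims,
  masked.positiveIndex,
  modalityTemperature.audio
);
```

A lower temperature sharpens the softmax, so a modality whose raw similarities are compressed into a narrow range can be given a smaller value to compensate. In a curriculum, `keep` would shrink over training so that later steps focus on hard negatives.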
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "research-library",
  source: { url: "https://example.com/lecture-recording.mp4" },
  feature_extractors: [
    {
      feature: "multimodal_embedding",
      model: "Haon-Chen/e5-omni-7B"
    }
  ]
});
Capabilities
- SOTA on MMEB-V2 benchmark (66.4 overall across 78 tasks)
- Best audio retrieval among omnimodal models (37.7 Recall@1 on AudioCaps)
- Unified text, image, audio, and video embeddings
- Explicit cross-modal alignment for reliable similarity scores
- Outperforms 3B models by 15+ points on MMEB-V2
Use Cases on Mixpeek
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| MMEB-V2 (78 tasks) | Overall | 66.4 | Chen et al., 2025 — arxiv:2601.03666 |
| MMEB-V2 Image (36 tasks) | Hit@1 | 71.2 | Chen et al., 2025 — arxiv:2601.03666 |
| AudioCaps | Recall@1 | 37.7 | Chen et al., 2025 — arxiv:2601.03666 |
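For readers unfamiliar with the retrieval metrics in the table, the sketch below shows how a Recall@K (or Hit@K) score is commonly computed: the percentage of queries for which at least one correct item appears in the top K results. This is the generic hit-rate definition and an assumption here; the paper's exact evaluation protocol may differ. The function name and the toy ids are hypothetical.

```typescript
// rankedIds[q]: candidate ids for query q, ordered best-first.
// relevantIds[q]: ids that count as correct for query q.
// Returns the percentage of queries with a correct id in the top k.
function recallAtK(
  rankedIds: string[][],
  relevantIds: string[][],
  k: number
): number {
  if (rankedIds.length !== relevantIds.length) {
    throw new Error("Each query needs both a ranking and a relevance list");
  }
  if (rankedIds.length === 0) return 0;
  let hits = 0;
  for (let q = 0; q < rankedIds.length; q++) {
    const topK = rankedIds[q].slice(0, k);
    if (topK.some((id) => relevantIds[q].includes(id))) {
      hits++;
    }
  }
  return (hits / rankedIds.length) * 100;
}

// Toy example with two text queries over three audio clips.
const toyRankings = [
  ["clip-1", "clip-2", "clip-3"], // query 0: correct clip at rank 1
  ["clip-3", "clip-2", "clip-1"], // query 1: correct clip at rank 2
];
const toyRelevant = [["clip-1"], ["clip-2"]];
const recallAt1 = recallAtK(toyRankings, toyRelevant, 1);
const recallAt2 = recallAtK(toyRankings, toyRelevant, 2);
```

Under this definition, a Recall@1 of 37.7 would mean the top-ranked result is correct for roughly 38 of every 100 queries.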
Performance
Specification
Research Paper
e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings
arxiv.org
Build a pipeline with e5-omni-7B
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.