nomic-embed-multimodal-3b
by nomic-ai
3B visual-document retriever for text queries over screenshots, pages, and image-heavy documents
nomic-ai/nomic-embed-multimodal-3b
mixpeek://image_extractor@v1/nomic_embed_multimodal_3b_v1
Overview
Nomic Embed Multimodal 3B is a visual-document retrieval model built on Qwen2.5-VL-3B. It is trained for text-to-visual-document retrieval, where the indexed unit is the rendered page or screenshot rather than OCR text alone.
On Mixpeek, it is useful when an agent needs to search document pages that contain charts, forms, tables, product screenshots, or other information that does not survive plain text extraction.
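To make the retrieval step concrete, here is a minimal sketch of ranking page embeddings against a query embedding by cosine similarity. It assumes each rendered page and each text query is represented by a single dense vector; the `PageEmbedding` type, the `rankPages` helper, the page IDs, and the toy three-dimensional vectors are all hypothetical and are not part of the Mixpeek or Nomic APIs.

```typescript
// Hypothetical types and toy vectors for illustration only;
// none of these names come from the Mixpeek SDK or the model itself.
interface PageEmbedding {
  pageId: string;   // identifies one rendered page or screenshot
  vector: number[]; // embedding of the page image
}

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid division by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every page against the query vector and keep the topK best.
function rankPages(
  query: number[],
  pages: PageEmbedding[],
  topK: number
): { pageId: string; score: number }[] {
  return pages
    .map((p) => ({ pageId: p.pageId, score: cosineSimilarity(query, p.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Toy data standing in for real page and query embeddings.
const pages: PageEmbedding[] = [
  { pageId: "report#page=1", vector: [0.9, 0.1, 0.0] },
  { pageId: "report#page=2", vector: [0.1, 0.9, 0.2] },
  { pageId: "report#page=3", vector: [0.0, 0.2, 0.9] },
];
const queryVector = [0.0, 1.0, 0.1];

const top = rankPages(queryVector, pages, 2);
console.log(top.map((r) => r.pageId));
```

In practice the vectors would come from the embedding model and the ranking would usually be delegated to a vector index rather than a linear scan; the sketch only shows what "the indexed unit is the page" means at query time.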
Architecture
The model is a PEFT adapter on Qwen2.5-VL-3B-Instruct, trained for visual-document retrieval. It is aligned for text queries against page images with English, Italian, French, German, and Spanish content.
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

await mx.collections.ingest({
  collection_id: "visual-docs",
  source: { url: "https://example.com/annual-report.pdf" },
  feature_extractors: [
    {
      feature: "multimodal_embedding",
      model: "nomic-ai/nomic-embed-multimodal-3b"
    }
  ]
});
Capabilities
- Text-to-visual-document retrieval
- Multilingual page retrieval across five documented languages
- Works on screenshots and rendered pages where OCR can lose layout
- Fits between lightweight CLIP-style retrieval and larger VLM reranking
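The last capability suggests a staged setup in which cheaper scorers narrow the candidate set before more expensive ones run. The following is a rough sketch of such a cascade, not a description of how Mixpeek implements retrieval stages. The `Stage` interface, the `cascade` function, and the hard-coded score tables are hypothetical stand-ins; a real pipeline would call actual models in place of the lookup tables.

```typescript
// Hypothetical cascade: each stage rescores the surviving candidates
// and keeps only its best `keep` pages for the next stage.
interface Stage {
  name: string;
  score: (query: string, pageId: string) => number;
  keep: number;
}

function cascade(query: string, pageIds: string[], stages: Stage[]): string[] {
  let candidates = [...pageIds];
  for (const stage of stages) {
    candidates = candidates
      .map((id) => ({ id, score: stage.score(query, id) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, stage.keep)
      .map((c) => c.id);
  }
  return candidates;
}

// Toy score tables standing in for three models of increasing cost.
const coarse: Record<string, number> = { p1: 0.2, p2: 0.8, p3: 0.7, p4: 0.6 };
const mid: Record<string, number> = { p1: 0.9, p2: 0.5, p3: 0.8, p4: 0.7 };
const fine: Record<string, number> = { p1: 0.9, p2: 0.1, p3: 0.4, p4: 0.95 };

const stages: Stage[] = [
  { name: "lightweight stand-in", score: (_q, id) => coarse[id] ?? 0, keep: 3 },
  { name: "visual-document retriever stand-in", score: (_q, id) => mid[id] ?? 0, keep: 2 },
  { name: "reranker stand-in", score: (_q, id) => fine[id] ?? 0, keep: 1 },
];

const result = cascade("q3 revenue chart", ["p1", "p2", "p3", "p4"], stages);
console.log(result);
```

Note that `p1` is dropped by the first stage even though the later tables score it highly. That is the usual trade-off in a cascade: an early, cheap stage bounds the recall of everything after it, so its `keep` value should be generous.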
Use Cases on Mixpeek
Specification
Research Paper
Nomic Embed Multimodal 3B
arxiv.org
Build a pipeline with nomic-embed-multimodal-3b
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.