Search PDFs, scanned pages, and figure-heavy reports by visual content using cross-modal embeddings — no OCR required
Visual document retrieval treats each page as an image and embeds it with a cross-modal model, so a text query can find the right page by what it looks like — tables, charts, diagrams, scans — without relying on OCR.
Traditional document search pipelines drop information at every stage:
OCR mangles tables, equations, and low-contrast scans.
Layout parsers miss chart and diagram semantics.
Text-only embeddings never see the visual structure of the page.
Mixpeek’s multimodal_extractor embeds page images directly with Google’s Vertex multimodal model into a shared text-image space (vertex_multimodal_embedding). Because the space is cross-modal, a text query retrieves visually relevant pages: the words “revenue breakdown by region” can match a page dominated by a financial table, even when the page has no clean extractable text.
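To make the shared-space idea concrete, here is a minimal sketch of how a text-query vector could be ranked against per-page image vectors by cosine similarity. The three-dimensional vectors are made-up toy values, not output of the Vertex model, and `cosine` and `rank_pages` are hypothetical helper names rather than part of any Mixpeek API; Mixpeek's actual scoring is not described here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_pages(query_vec, page_vecs, top_k=3):
    """Return (page_number, score) pairs, best match first.

    page_vecs maps page_number -> that page's single embedding.
    """
    scored = [(page, cosine(query_vec, vec)) for page, vec in page_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy data (assumption): one vector per page, plus one text-query vector
# that is imagined to live in the same space as the page vectors.
page_vecs = {
    1: [0.9, 0.1, 0.0],
    2: [0.1, 0.9, 0.1],
    3: [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]

results = rank_pages(query_vec, page_vecs)
```

With these toy values the query vector points mostly along the same direction as page 1, so page 1 ranks first. The point is only that text and image vectors are compared directly, with no extracted text involved.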
This is single-vector cross-modal retrieval (one embedding per page). Mixpeek does not currently offer ColPali-style multi-vector late-interaction (per-patch MaxSim) scoring. For born-digital, text-heavy PDFs where you want extracted text and OCR, use the universal extractor instead (see Document Intelligence).
Render PDF pages to images client-side (e.g. with pdftoppm or pdf2image) and create one object per page, so that each page becomes an independently retrievable result with its own page_number.
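The rendering step might look like the sketch below, which shells out to pdftoppm and then builds one record per page image. It assumes the pdftoppm binary (from Poppler) is on the PATH. The record keys `image_path` and `source_pdf` are illustrative names, not Mixpeek's object schema; only `page_number` comes from the text above. Uploading the records is left out because the API for that is not described here.

```python
import re
import subprocess
from pathlib import Path

# pdftoppm names its outputs "<prefix>-<page>.png", zero-padding the page
# number when the document has ten or more pages (e.g. report-01.png).
_PAGE_RE = re.compile(r"-(\d+)\.png$")


def pdftoppm_command(pdf_path, out_prefix, dpi=150):
    """Build the pdftoppm argument list: PNG output at the given DPI."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(out_prefix)]


def render_pages(pdf_path, out_dir, dpi=150):
    """Render every page of pdf_path into out_dir and return the PNG paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    prefix = out_dir / Path(pdf_path).stem
    subprocess.run(pdftoppm_command(pdf_path, prefix, dpi), check=True)
    return list(out_dir.glob(f"{prefix.name}-*.png"))


def page_objects(image_paths, source_pdf):
    """Turn rendered page images into one record per page.

    The keys other than page_number are hypothetical placeholders.
    """
    objects = []
    for path in image_paths:
        match = _PAGE_RE.search(Path(path).name)
        if match is None:
            continue  # skip files that do not follow pdftoppm's naming
        objects.append(
            {
                "image_path": str(path),
                "page_number": int(match.group(1)),
                "source_pdf": str(source_pdf),
            }
        )
    # Sort numerically so the order does not depend on zero-padding.
    return sorted(objects, key=lambda obj: obj["page_number"])
```

A typical call would be `page_objects(render_pages("report.pdf", "pages"), "report.pdf")`, where `report.pdf` is a placeholder file name. Parsing the page number from the file name keeps each record tied to its original page even if the images are later processed out of order.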