What Gets Extracted
| Feature | Model | Dimensions | Extractor |
|---|---|---|---|
| Text content | PyMuPDF / parser | — | multimodal_extractor |
| Text embeddings | E5-Large | 1024D | multimodal_extractor |
| OCR text (scanned PDFs) | Gemini | — | multimodal_extractor |
| Tables and structured data | Gemini | — | multimodal_extractor |
| Page thumbnails | FFmpeg | — | multimodal_extractor |
Choosing an Extractor
The multimodal_extractor handles documents alongside images and video in a unified pipeline. It parses text from born-digital PDFs, runs OCR on scanned pages via Gemini, and generates E5-Large embeddings for semantic search.
| Goal | Extractor | Why |
|---|---|---|
| Semantic search over document text | multimodal_extractor | E5-Large 1024D embeddings with cross-modal support |
| OCR for scanned PDFs and images | multimodal_extractor | Gemini-based OCR handles low-quality scans |
| Structured extraction (invoices, forms) | multimodal_extractor with response_shape | LLM extracts structured JSON from document content |
| Documents searchable alongside video and images | multimodal_extractor | Unified embedding space across all modalities |
For scanned PDFs with poor text layers, enable run_ocr to extract text via Gemini. This works alongside the standard text parser for mixed-quality documents.
Create a Collection for Documents
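The request schema is not reproduced on this page, so the following is only a sketch of what a collection definition might look like. Only the extractor name multimodal_extractor and the run_ocr parameter come from this page; the helper build_collection_config and every other key (collection_name, feature_extractor, feature_extractor_name, parameters) are hypothetical and should be checked against the Multimodal Extractor parameter reference before use.

```python
import json
from typing import Any, Dict


def build_collection_config(name: str, run_ocr: bool = True) -> Dict[str, Any]:
    """Assemble a HYPOTHETICAL collection definition for document extraction.

    The key layout is an illustrative assumption, not the confirmed API schema.
    """
    return {
        "collection_name": name,  # assumed key name
        "feature_extractor": {  # assumed key name
            # Extractor name taken from this page.
            "feature_extractor_name": "multimodal_extractor",
            "parameters": {  # assumed key name
                # Parameter named on this page: OCR for scanned pages via Gemini.
                "run_ocr": run_ocr,
            },
        },
    }


# Build the payload and print it as JSON for inspection.
config = build_collection_config("contracts", run_ocr=True)
print(json.dumps(config, indent=2))
```

Keeping the payload in a small builder function makes it easy to toggle run_ocr per collection, for example leaving it off for corpora known to be born-digital.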
This collection extracts text from documents, generates E5-Large embeddings, and enables OCR for scanned pages.
Search Documents
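Conceptually, semantic search ranks stored chunk embeddings by their similarity to the embedding of the query. The sketch below illustrates that idea locally with cosine similarity; it does not call the retriever API. The function names cosine_similarity and rank_chunks are hypothetical, and the 3-dimensional toy vectors stand in for real 1024-dimensional E5-Large embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_chunks(
    query_vec: Sequence[float],
    chunks: Sequence[Tuple[str, Sequence[float]]],
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Return the top_k (score, text) pairs, highest similarity first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Made-up chunk texts and toy vectors, for illustration only.
chunks = [
    ("Invoice total and payment terms", [0.9, 0.1, 0.0]),
    ("Employee onboarding checklist", [0.0, 0.2, 0.9]),
    ("Late payment penalties", [0.8, 0.3, 0.1]),
]
query = [1.0, 0.2, 0.0]

for score, text in rank_chunks(query, chunks, top_k=2):
    print(f"{score:.3f}  {text}")
```

In a hosted setup the embedding and the nearest-neighbor lookup happen server-side; the sketch only shows why chunks about similar topics rank close to the query.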
Create a retriever for semantic search over your document corpus, then execute it with a natural language query.
Output Schema
After extraction, each document (or document chunk) produces a record with the following fields:
| Field | Type | Description |
|---|---|---|
| text | string | Extracted text content from the page or chunk |
| ocr_text | string | Gemini OCR output for scanned pages |
| page_number | integer | Source page number (1-indexed) |
| thumbnail_url | string | S3 URL of the page thumbnail |
| source_document_url | string | Original source document URL |
| multimodal_extractor_v1_text_embedding | float[1024] | E5-Large dense embedding |
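As a rough illustration of consuming such a record on the client side, the sketch below types the fields from the table and adds two small helpers. The names DocumentRecord, searchable_text, and validate are hypothetical, the sample values (including the example.com URLs) are made up, and the rule of preferring text over ocr_text is an assumption rather than documented behavior.

```python
from typing import List, Optional, TypedDict

EMBEDDING_DIM = 1024  # E5-Large dimension listed in the schema table


class DocumentRecord(TypedDict, total=False):
    """Field names and types mirror the output schema table."""
    text: str
    ocr_text: str
    page_number: int
    thumbnail_url: str
    source_document_url: str
    multimodal_extractor_v1_text_embedding: List[float]


def searchable_text(record: DocumentRecord) -> Optional[str]:
    """Prefer parser text, fall back to OCR output (assumed policy)."""
    for key in ("text", "ocr_text"):
        value = record.get(key, "")
        if value and value.strip():
            return value
    return None


def validate(record: DocumentRecord) -> List[str]:
    """Return a list of problems found; an empty list means the record looks sane."""
    problems: List[str] = []
    if record.get("page_number", 0) < 1:
        problems.append("page_number must be >= 1 (pages are 1-indexed)")
    embedding = record.get("multimodal_extractor_v1_text_embedding")
    if embedding is not None and len(embedding) != EMBEDDING_DIM:
        problems.append(f"embedding has {len(embedding)} values, expected {EMBEDDING_DIM}")
    return problems


# Made-up sample: a scanned page with an empty text layer and OCR output.
record: DocumentRecord = {
    "text": "",
    "ocr_text": "Payment is due within 30 days of the invoice date.",
    "page_number": 2,
    "thumbnail_url": "https://example.com/thumbnails/page-2.jpg",
    "source_document_url": "https://example.com/contracts/sample.pdf",
    "multimodal_extractor_v1_text_embedding": [0.0] * EMBEDDING_DIM,
}

print(searchable_text(record))
print(validate(record))
```

Falling back to ocr_text only when text is empty keeps born-digital pages on the cheaper parser output while still covering scanned pages.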
Related
- Multimodal Extractor — full parameter reference
- Retrievers — build search pipelines over extracted features
- From Images — image extraction with the same multimodal pipeline
- From Video — video extraction with multimodal embeddings

