Mixpeek extracts text, tables, and structured data from PDFs, Word docs, and other document formats, then generates searchable embeddings. Each document (or chunk) becomes a searchable record with dense vector indexes for semantic retrieval.
| Feature | Model | Dimensions | Extractor |
|---|---|---|---|
| Text content | PyMuPDF / parser | — | multimodal_extractor |
| Text embeddings | E5-Large | 1024D | multimodal_extractor |
| OCR text (scanned PDFs) | Gemini | — | multimodal_extractor |
| Tables and structured data | Gemini | — | multimodal_extractor |
| Page thumbnails | FFmpeg | — | multimodal_extractor |
The multimodal_extractor handles documents alongside images and video in a unified pipeline. It parses text from born-digital PDFs, runs OCR on scanned pages via Gemini, and generates E5-Large embeddings for semantic search.
| Goal | Extractor | Why |
|---|---|---|
| Semantic search over document text | multimodal_extractor | E5-Large 1024D embeddings with cross-modal support |
| OCR for scanned PDFs and images | multimodal_extractor | Gemini-based OCR handles low-quality scans |
| Structured extraction (invoices, forms) | multimodal_extractor with response_shape | LLM extracts structured JSON from document content |
| Documents searchable alongside video and images | multimodal_extractor | Unified embedding space across all modalities |
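For scanned PDFs with missing or poor text layers, set run_ocr to true in the extractor parameters to extract text via Gemini. OCR runs alongside the standard text parser, so it also suits mixed-quality documents that combine born-digital and scanned pages.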
Create a Collection for Documents
This collection extracts text from documents, generates E5-Large embeddings, and enables OCR for scanned pages.
curl -X POST https://api.mixpeek.com/v1/collections \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "document-library",
    "source": { "type": "bucket", "bucket_id": "bkt_documents" },
    "feature_extractor": {
      "feature_extractor_name": "multimodal_extractor",
      "version": "v1",
      "input_mappings": {
        "document": "payload.document_url"
      },
      "field_passthrough": [
        { "source_path": "metadata.doc_id" },
        { "source_path": "metadata.title" },
        { "source_path": "metadata.author" }
      ],
      "parameters": {
        "run_text_embedding": true,
        "run_ocr": true,
        "enable_thumbnails": true
      }
    }
  }'
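If you create several document collections, it can help to assemble the request body in code rather than by hand. The following is a minimal sketch in Python that reproduces the JSON body from the curl example above. The helper name build_collection_payload and its arguments are hypothetical and not part of any Mixpeek SDK; only the field names are taken from the example.

```python
import json


def build_collection_payload(name, bucket_id, run_ocr=True,
                             passthrough=("doc_id", "title", "author")):
    """Assemble the JSON body for POST /v1/collections.

    Hypothetical helper: it mirrors the fields of the curl example and
    does not call the API.
    """
    return {
        "collection_name": name,
        "source": {"type": "bucket", "bucket_id": bucket_id},
        "feature_extractor": {
            "feature_extractor_name": "multimodal_extractor",
            "version": "v1",
            "input_mappings": {"document": "payload.document_url"},
            # Each passthrough entry copies one metadata field onto the record.
            "field_passthrough": [
                {"source_path": f"metadata.{field}"} for field in passthrough
            ],
            "parameters": {
                "run_text_embedding": True,
                "run_ocr": run_ocr,  # set to False for born-digital corpora
                "enable_thumbnails": True,
            },
        },
    }


payload = build_collection_payload("document-library", "bkt_documents")
print(json.dumps(payload, indent=2))
```

The printed JSON can be passed to curl with -d or sent with any HTTP client, using the same headers as the example above.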
Search Documents
Create a retriever for semantic search over your document corpus, then execute it with a natural language query.
curl -X POST https://api.mixpeek.com/v1/retrievers \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "retriever_name": "doc-search",
    "collection_ids": ["col_document_library"],
    "input_schema": {
      "properties": {
        "query": { "type": "text", "required": true }
      }
    },
    "stages": [
      {
        "stage_name": "semantic_search",
        "stage_type": "filter",
        "config": {
          "stage_id": "feature_search",
          "parameters": {
            "query": "{{INPUT.query}}",
            "top_k": 20
          }
        }
      }
    ]
  }'
Execute the retriever by ID (ret_doc456 below is a placeholder for your retriever's ID):
curl -X POST https://api.mixpeek.com/v1/retrievers/ret_doc456/execute \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": { "query": "termination clause for breach of contract" },
    "limit": 10
  }'
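The same execute call can be prepared from application code. The sketch below uses only the Python standard library and builds the request without sending it, so it can be inspected or tested offline. The function name build_execute_request is hypothetical, and the environment variable names are assumed to match those used in the curl examples.

```python
import json
import os
import urllib.request

API_BASE = "https://api.mixpeek.com/v1"


def build_execute_request(retriever_id, query, limit=10):
    """Prepare (but do not send) the POST request that executes a retriever."""
    body = json.dumps({"inputs": {"query": query}, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/retrievers/{retriever_id}/execute",
        data=body,
        method="POST",
        headers={
            # Credentials come from the same variables the curl examples use.
            "Authorization": f"Bearer {os.environ.get('MIXPEEK_API_KEY', '')}",
            "X-Namespace": os.environ.get("NAMESPACE", ""),
            "Content-Type": "application/json",
        },
    )


req = build_execute_request("ret_doc456", "termination clause for breach of contract")
print(req.get_method(), req.full_url)

# To send the request:
#   with urllib.request.urlopen(req) as resp:
#       results = json.load(resp)
# The response layout is not shown in this section, so it is not parsed here.
```

Keeping request construction separate from sending makes it straightforward to unit test the query and limit values before any network call is made.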
Output Schema
After extraction, each document (or document chunk) produces a record like this:
{
  "document_id": "doc_pdf_001",
  "text": "The service provider may terminate this agreement with 30 days written notice. Either party may terminate immediately upon material breach.",
  "ocr_text": "CONFIDENTIAL - Master Services Agreement Rev. 3",
  "page_number": 4,
  "thumbnail_url": "s3://mixpeek-storage/ns_123/thumbnails/page_004.jpg",
  "source_document_url": "s3://my-bucket/contracts/msa-2025.pdf",
  "metadata": {
    "doc_id": "CONTRACT-2025-001",
    "title": "Master Services Agreement",
    "author": "Legal Team"
  },
  "multimodal_extractor_v1_text_embedding": [0.023, -0.041, "...1024 floats"]
}
| Field | Type | Description |
|---|---|---|
| text | string | Extracted text content from the page or chunk |
| ocr_text | string | Gemini OCR output for scanned pages |
| page_number | integer | Source page number (1-indexed) |
| thumbnail_url | string | S3 URL of the page thumbnail |
| source_document_url | string | Original source document URL |
| multimodal_extractor_v1_text_embedding | float[1024] | E5-Large dense embedding |
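When consuming these records downstream, a quick shape check can catch ingestion problems early. The sketch below validates a record against the field table above. The validate_record helper is hypothetical, and it assumes every listed field is present on every record; relax that check if, for example, your born-digital pages omit ocr_text.

```python
EMBEDDING_FIELD = "multimodal_extractor_v1_text_embedding"

# Field names and types as listed in the field table.
EXPECTED_TYPES = {
    "text": str,
    "ocr_text": str,
    "page_number": int,
    "thumbnail_url": str,
    "source_document_url": str,
}


def validate_record(record, dims=1024):
    """Return a list of problems found in a record; an empty list means it passed."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        value = record.get(field)
        if value is None:
            problems.append(f"missing field: {field}")
        elif isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"{field}: expected {expected.__name__}")

    page = record.get("page_number")
    if isinstance(page, int) and not isinstance(page, bool) and page < 1:
        problems.append("page_number: must be >= 1 (1-indexed)")

    embedding = record.get(EMBEDDING_FIELD)
    if not isinstance(embedding, list):
        problems.append(f"missing field: {EMBEDDING_FIELD}")
    elif len(embedding) != dims:
        problems.append(f"{EMBEDDING_FIELD}: expected {dims} values, got {len(embedding)}")
    elif not all(isinstance(x, (int, float)) and not isinstance(x, bool) for x in embedding):
        problems.append(f"{EMBEDDING_FIELD}: all values must be numbers")
    return problems


good = {
    "text": "Either party may terminate immediately upon material breach.",
    "ocr_text": "CONFIDENTIAL - Master Services Agreement Rev. 3",
    "page_number": 4,
    "thumbnail_url": "s3://mixpeek-storage/ns_123/thumbnails/page_004.jpg",
    "source_document_url": "s3://my-bucket/contracts/msa-2025.pdf",
    EMBEDDING_FIELD: [0.0] * 1024,  # placeholder vector for illustration
}
print(validate_record(good))
```

Returning a list of problems rather than raising on the first failure lets you log every issue with a record in a single pass.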