Mixpeek extracts text, tables, and structured data from PDFs, Word docs, and other document formats, then generates searchable embeddings. Each document (or chunk) becomes a searchable record with dense vector indexes for semantic retrieval.

What Gets Extracted

| Feature | Model | Dimensions | Extractor |
| --- | --- | --- | --- |
| Text content | PyMuPDF / parser | - | multimodal_extractor |
| Text embeddings | E5-Large | 1024D | multimodal_extractor |
| OCR text (scanned PDFs) | Gemini | - | multimodal_extractor |
| Tables and structured data | Gemini | - | multimodal_extractor |
| Page thumbnails | FFmpeg | - | multimodal_extractor |

Choosing an Extractor

The multimodal_extractor handles documents alongside images and video in a unified pipeline. It parses text from born-digital PDFs, runs OCR on scanned pages via Gemini, and generates E5-Large embeddings for semantic search.
| Goal | Extractor | Why |
| --- | --- | --- |
| Semantic search over document text | multimodal_extractor | E5-Large 1024D embeddings with cross-modal support |
| OCR for scanned PDFs and images | multimodal_extractor | Gemini-based OCR handles low-quality scans |
| Structured extraction (invoices, forms) | multimodal_extractor with response_shape | LLM extracts structured JSON from document content |
| Documents searchable alongside video and images | multimodal_extractor | Unified embedding space across all modalities |
For scanned PDFs with poor text layers, set run_ocr to true in the extractor parameters to extract text via Gemini. OCR runs alongside the standard text parser, so documents that mix born-digital and scanned pages are still covered.
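As a rough illustration of that choice, the helper below assembles a `parameters` block for the multimodal_extractor. The flag names (`run_text_embedding`, `run_ocr`, `enable_thumbnails`) are the ones used in the collection request in the next section; the function `extractor_parameters` and its arguments are hypothetical conveniences for this sketch, not part of any Mixpeek SDK.

```python
def extractor_parameters(has_scanned_pages: bool, want_thumbnails: bool = True) -> dict:
    """Build a multimodal_extractor `parameters` block (hypothetical helper)."""
    return {
        # Embeddings are what make the text semantically searchable.
        "run_text_embedding": True,
        # Only request OCR when the corpus contains scanned or low-quality pages.
        "run_ocr": has_scanned_pages,
        "enable_thumbnails": want_thumbnails,
    }


print(extractor_parameters(has_scanned_pages=True))
```

Keeping the flags in one place like this makes it easier to reuse the same settings across several collections.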

Create a Collection for Documents

This collection extracts text from documents, generates E5-Large embeddings, and enables OCR for scanned pages.
curl -X POST https://api.mixpeek.com/v1/collections \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "document-library",
    "source": { "type": "bucket", "bucket_id": "bkt_documents" },
    "feature_extractor": {
      "feature_extractor_name": "multimodal_extractor",
      "version": "v1",
      "input_mappings": {
        "document": "payload.document_url"
      },
      "field_passthrough": [
        { "source_path": "metadata.doc_id" },
        { "source_path": "metadata.title" },
        { "source_path": "metadata.author" }
      ],
      "parameters": {
        "run_text_embedding": true,
        "run_ocr": true,
        "enable_thumbnails": true
      }
    }
  }'
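The same request can be assembled in Python. The sketch below uses only the standard library and copies the endpoint, headers, and field names from the curl command above; `build_collection_request` and its arguments are hypothetical names introduced for illustration. The request is built but not sent.

```python
import json
import os
import urllib.request

API_URL = "https://api.mixpeek.com/v1/collections"  # endpoint from the curl example


def build_collection_request(name, bucket_id, passthrough_fields, api_key, namespace):
    """Return an unsent POST request mirroring the curl example (hypothetical helper)."""
    payload = {
        "collection_name": name,
        "source": {"type": "bucket", "bucket_id": bucket_id},
        "feature_extractor": {
            "feature_extractor_name": "multimodal_extractor",
            "version": "v1",
            "input_mappings": {"document": "payload.document_url"},
            # One passthrough entry per metadata path to carry into each record.
            "field_passthrough": [{"source_path": p} for p in passthrough_fields],
            "parameters": {
                "run_text_embedding": True,
                "run_ocr": True,
                "enable_thumbnails": True,
            },
        },
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "X-Namespace": namespace,
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_collection_request(
    "document-library",
    "bkt_documents",
    ["metadata.doc_id", "metadata.title", "metadata.author"],
    api_key=os.environ.get("MIXPEEK_API_KEY", ""),
    namespace=os.environ.get("NAMESPACE", ""),
)
# To actually send it: urllib.request.urlopen(request)
```

Building the payload as a dictionary and serializing it with `json.dumps` avoids the quoting mistakes that are easy to make when editing JSON inside a shell string.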

Search Documents

Create a retriever for semantic search over your document corpus, then execute it with a natural language query.
curl -X POST https://api.mixpeek.com/v1/retrievers \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "retriever_name": "doc-search",
    "collection_ids": ["col_document_library"],
    "input_schema": {
      "properties": {
        "query": { "type": "text", "required": true }
      }
    },
    "stages": [
      {
        "stage_name": "semantic_search",
        "stage_type": "filter",
        "config": {
          "stage_id": "feature_search",
          "parameters": {
            "query": "{{INPUT.query}}",
            "top_k": 20
          }
        }
      }
    ]
  }'
Execute the retriever by its ID (ret_doc456 in this example):
curl -X POST https://api.mixpeek.com/v1/retrievers/ret_doc456/execute \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": { "query": "termination clause for breach of contract" },
    "limit": 10
  }'
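A minimal Python sketch of the execute call follows, again standard library only. The URL pattern, headers, and body come from the curl command above; `build_execute_request` is a hypothetical helper, and `ret_doc456` is the placeholder ID from the example. It seems reasonable to assume that the keys under `inputs` should match the property names declared in the retriever's `input_schema` (here, `query`), since the stage references `{{INPUT.query}}`.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.mixpeek.com/v1/retrievers"  # from the curl example


def build_execute_request(retriever_id, query, limit, api_key, namespace):
    """Return an unsent POST request for the execute endpoint (hypothetical helper)."""
    # Quote the ID so unexpected characters cannot alter the URL path.
    url = f"{BASE_URL}/{urllib.parse.quote(retriever_id, safe='')}/execute"
    body = {"inputs": {"query": query}, "limit": limit}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "X-Namespace": namespace,
            "Content-Type": "application/json",
        },
        method="POST",
    )


execute_request = build_execute_request(
    "ret_doc456",
    "termination clause for breach of contract",
    limit=10,
    api_key="YOUR_API_KEY",      # placeholder, not a real credential
    namespace="YOUR_NAMESPACE",  # placeholder
)
print(execute_request.full_url)
# To actually send it: urllib.request.urlopen(execute_request)
```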

Output Schema

After extraction, each document (or document chunk) produces a record like this:
{
  "document_id": "doc_pdf_001",
  "text": "The service provider may terminate this agreement with 30 days written notice. Either party may terminate immediately upon material breach.",
  "ocr_text": "CONFIDENTIAL - Master Services Agreement Rev. 3",
  "page_number": 4,
  "thumbnail_url": "s3://mixpeek-storage/ns_123/thumbnails/page_004.jpg",
  "source_document_url": "s3://my-bucket/contracts/msa-2025.pdf",
  "metadata": {
    "doc_id": "CONTRACT-2025-001",
    "title": "Master Services Agreement",
    "author": "Legal Team"
  },
  "multimodal_extractor_v1_text_embedding": [0.023, -0.041, "...1024 floats"]
}
| Field | Type | Description |
| --- | --- | --- |
| text | string | Extracted text content from the page or chunk |
| ocr_text | string | Gemini OCR output for scanned pages |
| page_number | integer | Source page number (1-indexed) |
| thumbnail_url | string | S3 URL of the page thumbnail |
| source_document_url | string | Original source document URL |
| multimodal_extractor_v1_text_embedding | float[1024] | E5-Large dense embedding |
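To show how a record with this shape might be consumed, the sketch below picks usable text, checks the embedding length, and compares two vectors. The field names and the 1024 dimension come from the schema above; the helper names are hypothetical. Cosine similarity is used here only as a common way to compare dense embeddings; this page does not state which metric Mixpeek's index uses.

```python
import math

EMBEDDING_FIELD = "multimodal_extractor_v1_text_embedding"
EMBEDDING_DIM = 1024  # E5-Large dimension listed in the schema


def best_text(record):
    """Prefer parsed text; fall back to OCR output when the text layer is empty."""
    return (record.get("text") or "").strip() or (record.get("ocr_text") or "").strip()


def has_valid_embedding(record):
    """Check that the embedding is a list of EMBEDDING_DIM numbers."""
    vec = record.get(EMBEDDING_FIELD)
    return (
        isinstance(vec, list)
        and len(vec) == EMBEDDING_DIM
        and all(isinstance(x, (int, float)) for x in vec)
    )


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Synthetic record for illustration only; real embeddings come from the extractor.
record = {
    "text": "",
    "ocr_text": "CONFIDENTIAL - Master Services Agreement Rev. 3",
    "page_number": 4,
    EMBEDDING_FIELD: [0.0] * 1023 + [1.0],
}

print(best_text(record))
print(has_valid_embedding(record))
```

Falling back to `ocr_text` only when `text` is empty keeps born-digital pages on the parser output while still covering scanned pages.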