From Documents - Mixpeek

Document extraction: PDFs and docs are parsed, OCR'd, chunked, and embedded for semantic search

Mixpeek extracts text, tables, and structured data from PDFs, Word docs, and other document formats, then generates searchable embeddings. Each document (or chunk) becomes a searchable record with dense vector indexes for semantic retrieval.

What Gets Extracted

Feature	Model	Dimensions	Extractor
Text content	PyMuPDF / parser	—	`multimodal_extractor`
Text embeddings	E5-Large	1024D	`multimodal_extractor`
OCR text (scanned PDFs)	Gemini	—	`multimodal_extractor`
Tables and structured data	Gemini	—	`multimodal_extractor`
Page thumbnails	FFmpeg	—	`multimodal_extractor`

Choosing an Extractor

The `multimodal_extractor` handles documents alongside images and video in a unified pipeline. It parses text from born-digital PDFs, runs OCR on scanned pages via Gemini, and generates E5-Large embeddings for semantic search.

Goal	Extractor	Why
Semantic search over document text	`multimodal_extractor`	E5-Large 1024D embeddings with cross-modal support
OCR for scanned PDFs and images	`multimodal_extractor`	Gemini-based OCR handles low-quality scans
Structured extraction (invoices, forms)	`multimodal_extractor` with `response_shape`	LLM extracts structured JSON from document content
Documents searchable alongside video and images	`multimodal_extractor`	Unified embedding space across all modalities

For scanned PDFs with poor text layers, enable `run_ocr` to extract text via Gemini. OCR runs alongside the standard text parser, which suits documents that mix born-digital and scanned pages.

Create a Collection for Documents

This collection extracts text from documents, generates E5-Large embeddings, and enables OCR for scanned pages.

curl -X POST https://api.mixpeek.com/v1/collections \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "document-library",
    "source": { "type": "bucket", "bucket_ids": ["bkt_documents"] },
    "feature_extractor": {
      "feature_extractor_name": "multimodal_extractor",
      "version": "v1",
      "input_mappings": {
        "document": "document_url"
      },
      "field_passthrough": [
        { "source_path": "metadata.doc_id" },
        { "source_path": "metadata.title" },
        { "source_path": "metadata.author" }
      ],
      "parameters": {
        "run_text_embedding": true,
        "run_ocr": true,
        "enable_thumbnails": true
      }
    }
  }'
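The same request can be issued from Python. The following is a minimal sketch using only the standard library, not an official client: the endpoint, headers, and body fields are copied from the curl command above, while the helper names `build_collection_payload` and `create_collection` are hypothetical and the shape of the response is not assumed.

```python
import json
import os
import urllib.request

API_BASE = "https://api.mixpeek.com/v1"


def build_collection_payload(bucket_id, passthrough_fields, run_ocr=True):
    """Build the collection body shown in the curl example (hypothetical helper)."""
    return {
        "collection_name": "document-library",
        "source": {"type": "bucket", "bucket_ids": [bucket_id]},
        "feature_extractor": {
            "feature_extractor_name": "multimodal_extractor",
            "version": "v1",
            "input_mappings": {"document": "document_url"},
            # One passthrough entry per metadata path to copy onto each record.
            "field_passthrough": [{"source_path": p} for p in passthrough_fields],
            "parameters": {
                "run_text_embedding": True,
                "run_ocr": run_ocr,
                "enable_thumbnails": True,
            },
        },
    }


def create_collection(payload):
    """POST the payload; reads MIXPEEK_API_KEY and NAMESPACE from the environment."""
    req = urllib.request.Request(
        f"{API_BASE}/collections",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['MIXPEEK_API_KEY']}",
            "X-Namespace": os.environ["NAMESPACE"],
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


payload = build_collection_payload(
    "bkt_documents",
    ["metadata.doc_id", "metadata.title", "metadata.author"],
)
print(json.dumps(payload, indent=2))
# create_collection(payload)  # uncomment to send the request
```

Keeping payload construction separate from the network call makes it possible to inspect or log the body before anything is sent.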

Search Documents

Create a retriever for semantic search over your document corpus, then execute it with a natural language query.

curl -X POST https://api.mixpeek.com/v1/retrievers \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "retriever_name": "doc-search",
    "collection_identifiers": ["col_document_library"],
    "input_schema": {
      "query": { "type": "text", "required": true }
    },
    "stages": [
      {
        "stage_name": "semantic_search",
        "stage_type": "filter",
        "config": {
          "stage_id": "feature_search",
          "parameters": {
            "searches": [
              {
                "feature_uri": "mixpeek://multimodal_extractor@v1/transcription_embedding",
                "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
                "top_k": 20
              }
            ],
            "final_top_k": 20
          }
        }
      }
    ]
  }'
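When the same retriever definition is needed for several collections, it may help to generate the body in code. The sketch below reproduces the JSON from the request above as a Python dict. The function name `build_retriever_payload` and its parameters are hypothetical, and tying `final_top_k` to `top_k` simply mirrors the example, where both are 20.

```python
import json

# Feature URI copied from the request above.
FEATURE_URI = "mixpeek://multimodal_extractor@v1/transcription_embedding"


def build_retriever_payload(collection_id, top_k=20, name="doc-search"):
    """Return the single-stage semantic search retriever body (hypothetical helper)."""
    if top_k < 1:
        raise ValueError("top_k must be a positive integer")
    search = {
        "feature_uri": FEATURE_URI,
        # "{{INPUT.query}}" is a template placeholder filled in at execution time.
        "query": {"input_mode": "text", "value": "{{INPUT.query}}"},
        "top_k": top_k,
    }
    return {
        "retriever_name": name,
        "collection_identifiers": [collection_id],
        "input_schema": {"query": {"type": "text", "required": True}},
        "stages": [
            {
                "stage_name": "semantic_search",
                "stage_type": "filter",
                "config": {
                    "stage_id": "feature_search",
                    "parameters": {
                        "searches": [search],
                        # Assumption: keep the final cut equal to the per-search cut,
                        # as in the example request.
                        "final_top_k": top_k,
                    },
                },
            }
        ],
    }


retriever_payload = build_retriever_payload("col_document_library")
print(json.dumps(retriever_payload, indent=2))
```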

Execute the retriever, replacing ret_doc456 with the ID of your own retriever:

curl -X POST https://api.mixpeek.com/v1/retrievers/ret_doc456/execute \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": { "query": "termination clause for breach of contract" }
  }'
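As a rough Python equivalent of the execute call, the sketch below builds the request object without sending it, so the URL and body can be checked first. The function `build_execute_request` is a hypothetical helper, the credential strings are placeholders, and nothing is assumed about the structure of the response.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.mixpeek.com/v1"


def build_execute_request(retriever_id, query, api_key, namespace):
    """Build (but do not send) the POST request that executes a retriever."""
    # Percent-encode the ID so unexpected characters cannot alter the path.
    safe_id = urllib.parse.quote(retriever_id, safe="")
    body = json.dumps({"inputs": {"query": query}}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/retrievers/{safe_id}/execute",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "X-Namespace": namespace,
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_execute_request(
    "ret_doc456",
    "termination clause for breach of contract",
    api_key="YOUR_API_KEY",      # placeholder
    namespace="YOUR_NAMESPACE",  # placeholder
)
print(req.get_method(), req.full_url)

# To send the request:
# with urllib.request.urlopen(req) as resp:
#     results = json.load(resp)
```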

Output Schema

After extraction, each document (or document chunk) produces a record like this:

{
  "document_id": "doc_pdf_001",
  "text": "The service provider may terminate this agreement with 30 days written notice. Either party may terminate immediately upon material breach.",
  "ocr_text": "CONFIDENTIAL - Master Services Agreement Rev. 3",
  "page_number": 4,
  "thumbnail_url": "s3://mixpeek-storage/ns_123/thumbnails/page_004.jpg",
  "source_document_url": "s3://my-bucket/contracts/msa-2025.pdf",
  "metadata": {
    "doc_id": "CONTRACT-2025-001",
    "title": "Master Services Agreement",
    "author": "Legal Team"
  },
  "multimodal_extractor_v1_transcription_embedding": [0.023, -0.041, "...1024 floats"]
}

Field	Type	Description
`text`	string	Extracted text content from the page or chunk
`ocr_text`	string	Gemini OCR output for scanned pages
`page_number`	integer	Source page number (1-indexed)
`thumbnail_url`	string	S3 URL of the page thumbnail
`source_document_url`	string	Original source document URL
`multimodal_extractor_v1_transcription_embedding`	float[1024]	E5-Large dense embedding
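Downstream code can check records against the field table before using them. The sketch below is one possible validator, not part of Mixpeek: the helper names are hypothetical, and treating the string fields as optional (for example, `ocr_text` being absent on born-digital pages) is an assumption rather than documented behavior.

```python
from typing import Any, Dict, List

EMBEDDING_FIELD = "multimodal_extractor_v1_transcription_embedding"
EMBEDDING_DIM = 1024  # E5-Large dimension from the table above
STRING_FIELDS = ("text", "ocr_text", "thumbnail_url", "source_document_url")


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of schema problems; an empty list means the record looks valid."""
    problems = []
    for name in STRING_FIELDS:
        # Assumption: string fields may be missing, but must be strings when present.
        if name in record and not isinstance(record[name], str):
            problems.append(f"{name} should be a string")
    page = record.get("page_number")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(page, int) or isinstance(page, bool) or page < 1:
        problems.append("page_number should be an integer >= 1")
    emb = record.get(EMBEDDING_FIELD)
    if not isinstance(emb, list) or len(emb) != EMBEDDING_DIM:
        problems.append(f"{EMBEDDING_FIELD} should be a list of {EMBEDDING_DIM} numbers")
    elif not all(isinstance(x, (int, float)) and not isinstance(x, bool) for x in emb):
        problems.append(f"{EMBEDDING_FIELD} should contain only numbers")
    return problems


def display_text(record: Dict[str, Any]) -> str:
    """Prefer parsed text and fall back to OCR output (an assumed convention)."""
    return record.get("text") or record.get("ocr_text") or ""


sample = {
    "text": "Either party may terminate immediately upon material breach.",
    "page_number": 4,
    EMBEDDING_FIELD: [0.0] * EMBEDDING_DIM,
}
print(validate_record(sample))
print(display_text(sample))
```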

Related

Multimodal Extractor: full parameter reference
Retrievers: build search pipelines over extracted features
From Images: image extraction with the same multimodal pipeline
From Video: video extraction with multimodal embeddings
