Document extraction: PDFs and docs are parsed, OCR'd, chunked, and embedded for semantic search
Mixpeek extracts text, tables, and structured data from PDFs, Word docs, and other document formats, then generates searchable embeddings. Each document (or chunk) becomes a searchable record with dense vector indexes for semantic retrieval.

What Gets Extracted

| Feature | Model | Dimensions | Extractor |
| --- | --- | --- | --- |
| Text content | PyMuPDF / parser | | multimodal_extractor |
| Text embeddings | E5-Large | 1024D | multimodal_extractor |
| OCR text (scanned PDFs) | Gemini | | multimodal_extractor |
| Tables and structured data | Gemini | | multimodal_extractor |
| Page thumbnails | FFmpeg | | multimodal_extractor |

Choosing an Extractor

The multimodal_extractor handles documents alongside images and video in a unified pipeline. It parses text from born-digital PDFs, runs OCR on scanned pages via Gemini, and generates E5-Large embeddings for semantic search.

| Goal | Extractor | Why |
| --- | --- | --- |
| Semantic search over document text | multimodal_extractor | E5-Large 1024D embeddings with cross-modal support |
| OCR for scanned PDFs and images | multimodal_extractor | Gemini-based OCR handles low-quality scans |
| Structured extraction (invoices, forms) | multimodal_extractor with response_shape | LLM extracts structured JSON from document content |
| Documents searchable alongside video and images | multimodal_extractor | Unified embedding space across all modalities |

For scanned PDFs with poor text layers, enable run_ocr to extract text via Gemini. This works alongside the standard text parser for mixed-quality documents.
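
The mixed-quality case can be pictured with a small client-side sketch. The function below is hypothetical and not part of Mixpeek, and it does not describe how the service decides internally. It assumes you already have the parsed text of each page from some local parser, and it flags pages whose text layer is too thin to trust, which are the pages where OCR output would matter. The `min_chars` threshold is an arbitrary assumption.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-indexed page numbers whose parsed text layer is empty or very short.

    page_texts: list of strings, one per page, as produced by a text parser.
    min_chars:  assumed cutoff below which a page is treated as scanned.
    """
    return [
        page_number
        for page_number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]


def extractor_parameters(page_texts):
    """Build a parameters object, turning on run_ocr only when some page needs it."""
    return {
        "run_text_embedding": True,
        "run_ocr": bool(pages_needing_ocr(page_texts)),
        "enable_thumbnails": True,
    }
```

In practice it is simpler to leave run_ocr enabled for any corpus that might contain scans; the sketch only shows why a document can need both the parser and OCR.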

Create a Collection for Documents

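This collection extracts text from documents, generates E5-Large embeddings, enables OCR for scanned pages, and produces page thumbnails. Three metadata fields (doc_id, title, author) are passed through to each record.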
curl -X POST https://api.mixpeek.com/v1/collections \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "document-library",
    "source": { "type": "bucket", "bucket_id": "bkt_documents" },
    "feature_extractor": {
      "feature_extractor_name": "multimodal_extractor",
      "version": "v1",
      "input_mappings": {
        "document": "payload.document_url"
      },
      "field_passthrough": [
        { "source_path": "metadata.doc_id" },
        { "source_path": "metadata.title" },
        { "source_path": "metadata.author" }
      ],
      "parameters": {
        "run_text_embedding": true,
        "run_ocr": true,
        "enable_thumbnails": true
      }
    }
  }'
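
For scripted setups, the same request can be assembled in Python. This is a minimal sketch using only the standard library. The endpoint, headers, and body fields are copied from the curl example above; the helper names (`build_collection_payload`, `build_request`, `send`) are hypothetical, and the response is returned as parsed JSON because its shape is not described on this page.

```python
import json
import urllib.request

API_BASE = "https://api.mixpeek.com/v1"  # host taken from the curl example


def build_collection_payload(name, bucket_id, passthrough_fields, run_ocr=True):
    """Assemble the JSON body shown in the curl example."""
    return {
        "collection_name": name,
        "source": {"type": "bucket", "bucket_id": bucket_id},
        "feature_extractor": {
            "feature_extractor_name": "multimodal_extractor",
            "version": "v1",
            "input_mappings": {"document": "payload.document_url"},
            "field_passthrough": [{"source_path": p} for p in passthrough_fields],
            "parameters": {
                "run_text_embedding": True,
                "run_ocr": run_ocr,
                "enable_thumbnails": True,
            },
        },
    }


def build_request(path, payload, api_key, namespace):
    """Create (but do not send) a POST request with the curl example's headers."""
    return urllib.request.Request(
        url=f"{API_BASE}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "X-Namespace": namespace,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON response as-is."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A typical call would read the key and namespace from the MIXPEEK_API_KEY and NAMESPACE environment variables, build the payload for "document-library", and pass `build_request("/collections", payload, key, namespace)` to `send`.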

Search Documents

Create a retriever for semantic search over your document corpus, then execute it with a natural language query.
curl -X POST https://api.mixpeek.com/v1/retrievers \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "retriever_name": "doc-search",
    "collection_ids": ["col_document_library"],
    "input_schema": {
      "properties": {
        "query": { "type": "text", "required": true }
      }
    },
    "stages": [
      {
        "stage_name": "semantic_search",
        "stage_type": "filter",
        "config": {
          "stage_id": "feature_search",
          "parameters": {
            "query": "{{INPUT.query}}",
            "top_k": 20
          }
        }
      }
    ]
  }'
Execute the retriever, substituting your own retriever ID for ret_doc456:
curl -X POST https://api.mixpeek.com/v1/retrievers/ret_doc456/execute \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": { "query": "termination clause for breach of contract" },
    "limit": 10
  }'
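
Before calling the execute endpoint, it can help to check the inputs locally against the retriever's input_schema. The sketch below is a client-side convenience under stated assumptions, not API behavior: `build_execute_body` and `execute_url` are hypothetical helpers, the schema is copied from the retriever definition above, and the URL pattern is inferred from the curl example.

```python
import json

# input_schema copied from the retriever definition above
INPUT_SCHEMA = {"properties": {"query": {"type": "text", "required": True}}}


def build_execute_body(inputs, limit=10, schema=INPUT_SCHEMA):
    """Return the JSON body for the execute call after a local input check."""
    properties = schema["properties"]
    missing = [
        name
        for name, spec in properties.items()
        if spec.get("required") and name not in inputs
    ]
    if missing:
        raise ValueError(f"missing required inputs: {missing}")
    unknown = [name for name in inputs if name not in properties]
    if unknown:
        raise ValueError(f"inputs not declared in input_schema: {unknown}")
    return {"inputs": dict(inputs), "limit": limit}


def execute_url(retriever_id, base="https://api.mixpeek.com/v1"):
    """Build the execute URL, with the retriever ID in the path."""
    return f"{base}/retrievers/{retriever_id}/execute"


body = build_execute_body({"query": "termination clause for breach of contract"})
print(json.dumps(body, indent=2))
print(execute_url("ret_doc456"))
```

The resulting body and URL match the curl call above; an input name that the schema does not declare, or a missing required one, raises ValueError before any request is made.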

Output Schema

After extraction, each document (or document chunk) produces a record like this:
{
  "document_id": "doc_pdf_001",
  "text": "The service provider may terminate this agreement with 30 days written notice. Either party may terminate immediately upon material breach.",
  "ocr_text": "CONFIDENTIAL - Master Services Agreement Rev. 3",
  "page_number": 4,
  "thumbnail_url": "s3://mixpeek-storage/ns_123/thumbnails/page_004.jpg",
  "source_document_url": "s3://my-bucket/contracts/msa-2025.pdf",
  "metadata": {
    "doc_id": "CONTRACT-2025-001",
    "title": "Master Services Agreement",
    "author": "Legal Team"
  },
  "multimodal_extractor_v1_text_embedding": [0.023, -0.041, "...1024 floats"]
}

| Field | Type | Description |
| --- | --- | --- |
| text | string | Extracted text content from the page or chunk |
| ocr_text | string | Gemini OCR output for scanned pages |
| page_number | integer | Source page number (1-indexed) |
| thumbnail_url | string | S3 URL of the page thumbnail |
| source_document_url | string | Original source document URL |
| multimodal_extractor_v1_text_embedding | float[1024] | E5-Large dense embedding |
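
Downstream code can sanity-check records against the field table. The validator below is a minimal sketch: `validate_record` is a hypothetical helper, and it assumes every listed field is present on every record, which this page does not state (for example, ocr_text may be absent on born-digital pages). Adjust the required set to match what your collection actually returns.

```python
EXPECTED_TYPES = {
    "text": str,
    "ocr_text": str,
    "page_number": int,
    "thumbnail_url": str,
    "source_document_url": str,
}
EMBEDDING_FIELD = "multimodal_extractor_v1_text_embedding"
EMBEDDING_DIM = 1024  # E5-Large dimension from the table


def validate_record(record):
    """Return a list of problems; an empty list means the record matches the table."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")  # assumption: all fields required
            continue
        value = record[field]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"{field}: expected {expected.__name__}")
    page = record.get("page_number")
    if isinstance(page, int) and not isinstance(page, bool) and page < 1:
        problems.append("page_number: must be >= 1 (1-indexed)")
    embedding = record.get(EMBEDDING_FIELD)
    if not isinstance(embedding, list):
        problems.append(f"{EMBEDDING_FIELD}: expected a list")
    elif len(embedding) != EMBEDDING_DIM:
        problems.append(f"{EMBEDDING_FIELD}: expected {EMBEDDING_DIM} values")
    elif not all(
        isinstance(x, (int, float)) and not isinstance(x, bool) for x in embedding
    ):
        problems.append(f"{EMBEDDING_FIELD}: all values must be numbers")
    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to log every mismatch for a batch of records in one pass.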