PDF, Image & Document Processing

Document Processing API

Mixpeek decomposes each document into features and stores them across cost tiers in the multimodal data warehouse. Extract text, tables, and layout structure from PDFs, contracts, and reports, then query your entire corpus with composable multi-stage retrieval pipelines.

Document Processing Capabilities

From raw document input to searchable, structured output in a single pipeline.

OCR & Text Extraction

Extract text from scanned documents, photographs, and image-based PDFs using OCR models that support 100+ languages.

Layout Analysis

Detect and preserve document structure -- headers, paragraphs, tables, figures, and reading order -- for structured data extraction.

Table Extraction

Identify and extract tabular data from documents, preserving row and column relationships for downstream analysis and indexing.

Semantic Embedding

Generate vector embeddings from document content that capture meaning, enabling semantic search across your entire document corpus.

How It Works

A four-stage pipeline transforms raw documents into searchable, structured data.

Upload

Send documents through the API or upload to an S3-compatible bucket. Supports PDF, DOCX, PPTX, images, and scanned files.

Classify

Documents are automatically classified by type and routed to the appropriate extraction pipeline based on their structure and content.

Extract

OCR, layout analysis, table extraction, and text parsing run in parallel. Each extractor produces structured output with position metadata.

Embed & Index

Extracted content is embedded into vectors and indexed alongside structural metadata for semantic search and retrieval.
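
The four stages can be pictured as a chain of small functions. The sketch below is a local illustration only, not the Mixpeek implementation: the `Record` class, the `ROUTES` table, and the toy embedding are all hypothetical stand-ins, and classification is reduced to a file-extension lookup.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath

# Hypothetical routing table: file extension -> extraction pipeline name.
ROUTES = {".pdf": "pdf", ".docx": "office", ".pptx": "office", ".png": "image"}


@dataclass
class Record:
    source: str                       # Stage 1 (Upload): where the document came from
    pipeline: str = ""
    chunks: list = field(default_factory=list)
    vectors: list = field(default_factory=list)


def classify(record):
    """Stage 2: choose a pipeline from the file extension (stand-in for real classification)."""
    suffix = PurePosixPath(record.source).suffix.lower()
    record.pipeline = ROUTES.get(suffix, "unknown")
    return record


def extract(record, raw_text):
    """Stage 3: split text into chunks and keep a position for each one."""
    paragraphs = [p.strip() for p in raw_text.split("\n\n") if p.strip()]
    record.chunks = [{"text": p, "position": i} for i, p in enumerate(paragraphs)]
    return record


def embed(record):
    """Stage 4: toy embedding (character count, word count), not a real model."""
    record.vectors = [(len(c["text"]), len(c["text"].split())) for c in record.chunks]
    return record


record = Record("s3://legal-docs/contract-2024.pdf")
record = embed(extract(classify(record), "Parties\n\nThis agreement is made between A and B."))
print(record.pipeline, len(record.chunks), record.vectors)
```

Each stage takes a record and returns it enriched, which is what lets stages be swapped or run per document type.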

Supported Document Formats

Process every format listed below through a single API endpoint.

PDF Documents

Digital PDFs
Scanned PDFs
Multi-page documents
Fillable forms

Office Documents

DOCX (Word)
PPTX (PowerPoint)
XLSX (Excel)
ODT / ODP

Images

JPEG / PNG / WebP
TIFF (multi-page)
Scanned documents
Photographs of text

Structured Data

CSV / TSV files
JSON documents
XML files
HTML pages

Use Cases

Document processing powers intelligent workflows across industries.

Contract Analysis

Extract clauses, parties, dates, and obligations from legal contracts. Enable semantic search across thousands of agreements to find relevant precedents and terms.

Invoice Processing

Automatically extract vendor names, line items, amounts, and dates from invoices in any format -- digital PDFs, scanned documents, or photographed receipts.

Medical Records

Parse clinical documents, lab reports, and imaging results. Extract structured data from unstructured medical records for analysis and compliance.

Research Literature

Process academic papers, patents, and technical documents. Extract figures, tables, citations, and full text for semantic search and knowledge discovery.

Mixpeek vs Traditional Document Processing

See what changes when you move beyond basic text extraction.

Feature	Traditional Tools	Mixpeek
Input Types	Digital text PDFs only	PDF, images, scans, Office docs, structured data
Understanding	Raw text extraction	Layout-aware extraction with structural metadata
Search	Keyword matching	Semantic vector search across document content
Table Handling	Tables lost in extraction	Table structure preserved with row/column relationships
Scalability	Sequential processing	Distributed batch processing across GPU workers
Output	Plain text	Structured data + embeddings + searchable index

Simple API Integration

Process documents and search across them with a few lines of code.

document_processing.py

from mixpeek import Mixpeek

client = Mixpeek(api_key="YOUR_API_KEY")

# Process a PDF document
result = client.collections.process(
    collection_id="contracts",
    source={
        "type": "file",
        "url": "s3://legal-docs/contract-2024.pdf"
    }
)

# Search across all processed documents
results = client.retrievers.execute(
    "document-search",
    inputs={"query": "indemnification clauses with liability caps"},
    filters={
        "AND": [
            {"key": "document_type", "value": "contract", "operator": "eq"},
            {"key": "year", "value": 2024, "operator": "gte"}
        ]
    },
    limit=10
)

for doc in results["documents"]:
    print(f"Page {doc.metadata['page']}: {doc.score:.3f}")
    print(f"  {doc.text[:200]}...")
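
The filter in the example combines two conditions under "AND". As a rough local illustration of what such a filter selects, the sketch below evaluates the same structure against plain dictionaries. It assumes "eq" means equality and "gte" means greater than or equal; the `matches` helper and the sample records are hypothetical and do not call the Mixpeek SDK.

```python
import operator

# Only the two operators that appear in the example; their meaning is an assumption.
OPERATORS = {"eq": operator.eq, "gte": operator.ge}


def matches(metadata, filters):
    """Return True when the metadata satisfies every condition under "AND"."""
    return all(
        cond["key"] in metadata
        and OPERATORS[cond["operator"]](metadata[cond["key"]], cond["value"])
        for cond in filters["AND"]
    )


filters = {
    "AND": [
        {"key": "document_type", "value": "contract", "operator": "eq"},
        {"key": "year", "value": 2024, "operator": "gte"},
    ]
}

# Hypothetical metadata records.
docs = [
    {"id": "a", "document_type": "contract", "year": 2024},
    {"id": "b", "document_type": "contract", "year": 2021},
    {"id": "c", "document_type": "invoice", "year": 2025},
]

selected = [d["id"] for d in docs if matches(d, filters)]
print(selected)
```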

Build this from the docs

The extractors and tutorials behind the pipeline above, with request shapes and worked examples.

Document extractor

Layout-aware parsing that keeps tables and structure intact.

ColPali tutorial

Retrieve from the page image when the answer is a chart.

Feature search stage

Turn a question into ranked passages. API reference.

Namespaces

Where parsed documents live and how to scope them.

Frequently Asked Questions

What document formats does Mixpeek support?

Mixpeek processes PDF (digital and scanned), DOCX, PPTX, XLSX, ODT/ODP, images (JPEG, PNG, WebP, TIFF), CSV/TSV, JSON, XML, and HTML. The format is detected automatically, and each document is routed to the appropriate extraction pipeline based on its format and structure.

How does Mixpeek handle scanned documents and images of text?

Scanned documents and images are processed through OCR models that extract text with high accuracy across 100+ languages. Layout analysis identifies document structure -- headers, paragraphs, tables, and figures -- preserving the reading order and spatial relationships that give context to the extracted text.

Can Mixpeek extract tables from PDFs?

Yes. The document processing pipeline includes table detection and extraction models that identify tabular structures within documents and extract cell values with preserved row and column relationships. The output includes both the raw table data and structural metadata that enables downstream processing and querying.
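
To make "preserved row and column relationships" concrete, here is a minimal sketch that rebuilds a table from cells tagged with row and column indices. The cell format is an assumption for illustration, not the documented output schema.

```python
def to_grid(cells):
    """Place each cell at its (row, col) position; missing cells stay empty strings."""
    n_rows = max(c["row"] for c in cells) + 1
    n_cols = max(c["col"] for c in cells) + 1
    grid = [[""] * n_cols for _ in range(n_rows)]
    for c in cells:
        grid[c["row"]][c["col"]] = c["text"]
    return grid


# Hypothetical extracted cells, deliberately out of order.
cells = [
    {"row": 0, "col": 0, "text": "Item"},
    {"row": 0, "col": 1, "text": "Amount"},
    {"row": 1, "col": 0, "text": "Hosting"},
    {"row": 1, "col": 1, "text": "120.00"},
    {"row": 2, "col": 1, "text": "45.50"},
    {"row": 2, "col": 0, "text": "Support"},
]

header, *body = to_grid(cells)
records = [dict(zip(header, row)) for row in body]  # one dict per data row
print(records)
```

Because every cell carries its position, the order in which cells arrive does not matter, and each row can be turned into a record keyed by the header.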

How does semantic search work on documents?

After text and structure are extracted, the content is embedded into vector representations using language models. These embeddings capture the meaning of the content, not just keywords. You can then search across your entire document corpus using natural language queries, finding relevant passages even when they use different terminology than your search terms.
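
A small sketch of the ranking step, assuming cosine similarity as the comparison (a common choice; the source does not name the metric). The three-dimensional vectors are hand-made stand-ins for model embeddings, chosen so that the first passage sits close to the query even though it never uses the word "indemnification".

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical passages with made-up embeddings.
passages = {
    "The supplier shall hold the buyer harmless from third-party claims.": [0.9, 0.1, 0.0],
    "Payment is due within 30 days of invoice.": [0.1, 0.9, 0.1],
    "This agreement is governed by the laws of Delaware.": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # pretend embedding of "indemnification clauses"

ranked = sorted(passages, key=lambda p: cosine(query_vector, passages[p]), reverse=True)
for passage in ranked:
    print(f"{cosine(query_vector, passages[passage]):.3f}  {passage}")
```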

What is the processing latency for documents?

Processing time depends on document complexity, page count, and configured extractors. A typical single-page PDF processes in 1-3 seconds including OCR, layout analysis, and embedding generation. Multi-page documents are processed in parallel across pages. Batch processing of large document collections is distributed across workers for high throughput.
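
Per-page parallelism can be sketched with a thread pool from the Python standard library. This is an illustration of the idea only: `process_page` is a hypothetical placeholder for OCR, layout analysis, and embedding, and the real service distributes work across its own workers.

```python
from concurrent.futures import ThreadPoolExecutor


def process_page(page):
    """Placeholder for per-page work; here it only counts words."""
    number, text = page
    return {"page": number, "words": len(text.split())}


# Hypothetical page contents.
pages = [
    (1, "Master services agreement"),
    (2, "Term and termination"),
    (3, "Limitation of liability"),
]

with ThreadPoolExecutor(max_workers=3) as pool:
    # map() runs pages concurrently but returns results in input order.
    results = list(pool.map(process_page, pages))

print(results)
```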

Can I process documents in languages other than English?

Yes. The OCR and text extraction models support over 100 languages. Embedding models are available for multilingual content, enabling semantic search across documents in any supported language. Language detection is automatic, so mixed-language document collections are handled without manual configuration.

How do I integrate document processing into my existing workflow?

Mixpeek provides a REST API for document upload and processing, an S3-compatible bucket trigger for automated processing of new documents, and webhooks for notification when processing completes. SDKs are available for Python and JavaScript. Documents can be uploaded individually or in batches.

Does Mixpeek preserve the original document layout?

Yes. The layout analysis stage identifies structural elements like headers, paragraphs, lists, tables, and figures along with their spatial coordinates and reading order. This structural metadata is stored alongside the extracted text, enabling applications that need to reconstruct the original document layout or reference specific regions.
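
One way to picture how coordinates relate to reading order is the sketch below, which sorts elements top to bottom and then left to right. The `Element` fields and the row tolerance are assumptions for illustration; real layout models handle multi-column pages with more care than this.

```python
from dataclasses import dataclass


@dataclass
class Element:
    kind: str
    text: str
    x: float  # left edge
    y: float  # top edge, with the origin at the top-left of the page


def reading_order(elements, row_height=5.0):
    """Sort by row bucket first, then by horizontal position within the row."""
    return sorted(elements, key=lambda e: (int(e.y // row_height), e.x))


# Hypothetical layout output, deliberately shuffled.
elements = [
    Element("table", "Fee schedule", 50, 300),
    Element("figure", "Signature block", 320, 102),
    Element("header", "Service Agreement", 50, 40),
    Element("paragraph", "This agreement is made...", 50, 100),
]

ordered = [e.kind for e in reading_order(elements)]
print(ordered)
```

The paragraph and the figure start within a few units of each other vertically, so they land in the same row bucket and are ordered left to right.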

Start Processing Documents Today

One API to extract, embed, and search across PDFs, images, and office documents. Get started with our free tier or talk to us about enterprise deployment.