
    OCR API: AI-Powered Text Extraction from Images and Documents

OCR is one extractor in the Mixpeek warehouse's Decompose layer. Go beyond simple text extraction: combine optical character recognition with multimodal document understanding to extract text, tables, and layouts into searchable indexes you can query with composable retriever pipelines.

    Beyond Traditional OCR

    Traditional OCR extracts text. Mixpeek understands documents. Combine text extraction with visual layout analysis, table detection, and multimodal embeddings for true document intelligence.

    Extract + Understand

    Traditional OCR

    Returns raw text strings with no context about document structure, table relationships, or visual layout. You get a wall of text that loses the document's meaning.

    Mixpeek Document Intelligence

    Extracts text with layout context, table structures, spatial coordinates, and reading order. Every text block knows where it sits in the document and how it relates to other elements.

    Extract + Search

    Traditional OCR

    Gives you extracted text as a file output. Building search over that text requires a separate pipeline -- text processing, embedding generation, vector database, and search API.

    Mixpeek OCR + Search

    Extracted text is automatically embedded and indexed into Qdrant namespaces. Composable retriever pipelines let you search immediately with hybrid keyword and semantic matching.

    Extract + Multimodal

    Traditional OCR

    Ignores the visual appearance of documents. A chart, diagram, or annotated image is reduced to whatever text characters appear in it, losing critical information.

    Mixpeek Multimodal OCR

    Generates document embeddings that capture both text content and visual layout. Search by what a document looks like, not just what text it contains. Charts, diagrams, and formatted content are semantically searchable.

    OCR Capabilities

    From printed text to handwritten notes, from simple pages to complex multi-column layouts -- comprehensive document text extraction.

    Printed and Handwritten Text

    Extract text from printed documents and handwritten notes with high accuracy. Support for diverse fonts, layouts, and handwriting styles across scanned documents, photographs, and digital images.

    • High-accuracy printed text recognition
    • Handwritten text extraction (ICR)
    • Mixed printed and handwritten document support

    Table Extraction

    Detect and extract structured table data from documents, invoices, and spreadsheets. Preserve row-column relationships and output structured data that is immediately searchable and queryable.

    • Automatic table boundary detection
    • Row-column relationship preservation
    • Export to structured JSON with cell coordinates
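To make the "structured JSON with cell coordinates" concrete, here is a minimal sketch of what cell-level table output can look like and how to rebuild the row-column grid from it. The field names (`row`, `col`, `text`, `bbox`) are illustrative, not Mixpeek's actual schema.

```python
# Illustrative shape for OCR table output: each cell carries its text,
# row/column indices, and a pixel bounding box. (Hypothetical schema --
# Mixpeek's actual JSON may differ.)
cells = [
    {"row": 0, "col": 0, "text": "Item", "bbox": [40, 100, 200, 130]},
    {"row": 0, "col": 1, "text": "Amount", "bbox": [210, 100, 320, 130]},
    {"row": 1, "col": 0, "text": "Consulting", "bbox": [40, 140, 200, 170]},
    {"row": 1, "col": 1, "text": "$5,000", "bbox": [210, 140, 320, 170]},
]

def to_grid(cells):
    """Rebuild the row-column grid from flat cell records."""
    n_rows = max(c["row"] for c in cells) + 1
    n_cols = max(c["col"] for c in cells) + 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        grid[c["row"]][c["col"]] = c["text"]
    return grid

print(to_grid(cells))  # [['Item', 'Amount'], ['Consulting', '$5,000']]
```

Because each cell keeps its indices and coordinates, the same records can be filtered, queried, or highlighted on the original page image.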

    Layout Preservation

    Understand document layout -- headers, paragraphs, columns, sidebars, and footnotes. Extract text with spatial context so search results reference specific document regions, not just raw text.

    • Document layout analysis and segmentation
    • Reading order detection for multi-column layouts
    • Spatial coordinates for every text block
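Reading-order detection for multi-column pages can be sketched in a few lines: group text blocks into columns by their left edge, then read each column top to bottom. The block tuple format and the fixed column split are simplifying assumptions; real layout analysis infers columns from the page geometry.

```python
# A minimal reading-order sketch for a two-column page: group text blocks
# into columns by their left edge, then read each column top to bottom.
# Block format (x0, y0, x1, y1, text) is illustrative, not Mixpeek's schema.
blocks = [
    (320, 80, 600, 110, "Right col, first line"),
    (40, 80, 300, 110, "Left col, first line"),
    (40, 120, 300, 150, "Left col, second line"),
    (320, 120, 600, 150, "Right col, second line"),
]

def reading_order(blocks, column_split=310):
    left = sorted((b for b in blocks if b[0] < column_split), key=lambda b: b[1])
    right = sorted((b for b in blocks if b[0] >= column_split), key=lambda b: b[1])
    return [b[4] for b in left + right]

print(reading_order(blocks))
```

Without this step, a naive top-to-bottom scan would interleave the two columns and scramble the document's meaning.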

    Multi-Language Recognition

    Extract text in 100+ languages with script-specific models. Handle mixed-language documents, right-to-left scripts, and CJK characters with dedicated recognition pipelines.

    • 100+ languages supported
    • Mixed-language document handling
    • Right-to-left and CJK script support
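To illustrate per-region script routing, here is a rough detector that inspects Unicode character names to decide which recognition model a text region should go to. This is a simplified sketch; production systems use trained script classifiers rather than name lookups.

```python
import unicodedata

# Rough per-region script detector: inspect Unicode character names to
# decide which recognition model to route the region to. Simplified
# sketch -- not how Mixpeek's pipeline is actually implemented.
def detect_script(text):
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("CJK"):
            return "cjk"
        if "ARABIC" in name:
            return "arabic"
        if "CYRILLIC" in name:
            return "cyrillic"
        if "DEVANAGARI" in name:
            return "devanagari"
    return "latin"

print(detect_script("Invoice total"))  # latin
print(detect_script("请求发票"))        # cjk
print(detect_script("فاتورة"))          # arabic
```

Running a detector like this per text region is what lets mixed-language documents get the right recognition model for each block.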

    Supported Document Formats

    Process documents in all standard formats. Mixpeek handles format conversion, resolution optimization, and page splitting automatically.

    PDF

    Single and multi-page documents, scanned and digital

    JPEG/PNG

    Photographs, screenshots, scanned images

    TIFF

    High-resolution scans, multi-page TIFF files

    WebP

    Web-optimized images with text content

    BMP

    Bitmap images from legacy systems

    HEIC

    Apple device photos with text content
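The format handling above amounts to a dispatch step: each supported extension maps to the preprocessing it needs before OCR. Mixpeek performs this conversion automatically; this sketch just illustrates the routing logic with a hypothetical table.

```python
from pathlib import Path

# Hypothetical routing table: which preprocessing step each supported
# format needs before OCR. Mixpeek handles this automatically; the
# step names here are illustrative.
PREPROCESS = {
    ".pdf": "split_pages",
    ".tiff": "split_pages",   # multi-page TIFF
    ".jpg": "normalize",
    ".jpeg": "normalize",
    ".png": "normalize",
    ".webp": "convert_to_png",
    ".bmp": "convert_to_png",
    ".heic": "convert_to_png",
}

def route(filename):
    ext = Path(filename).suffix.lower()
    step = PREPROCESS.get(ext)
    if step is None:
        raise ValueError(f"unsupported format: {ext}")
    return step

print(route("invoice_2024_001.pdf"))  # split_pages
print(route("receipt_scan.jpg"))      # normalize
```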

    Industry Use Cases

    From financial document processing to historical archive digitization -- OCR powers document intelligence across industries.

    Financial Document Processing

    Extract data from invoices, receipts, bank statements, and financial reports. OCR combined with table extraction automates data entry workflows. Index extracted text for compliance search and audit trail queries across years of financial documents.

    Legal Document Analysis

    Process contracts, court filings, patents, and legal correspondence at scale. Extract clauses, signatures, dates, and party names. Build searchable legal archives where attorneys can find relevant precedents and contract terms using natural language queries.

    Healthcare Records Digitization

    Convert handwritten clinical notes, prescription labels, lab reports, and patient forms into searchable text. Preserve document structure for regulatory compliance. Enable clinicians to search across patient records by symptom, medication, or diagnosis.

    Archive and Library Digitization

    Digitize historical documents, manuscripts, newspapers, and book collections. Handle degraded print, faded text, and unusual fonts common in archival materials. Build full-text search indexes across massive document collections for researchers.

    Mixpeek OCR vs. Alternatives

    See how Mixpeek compares to dedicated OCR and document processing services.

    Feature             | Mixpeek                                           | Mindee                               | Textract                     | Document AI                        | ABBYY
    OCR Approach        | Multimodal (OCR + visual understanding + embeddings) | Template-based document parsing   | Layout-aware text extraction | Document understanding with entities | Traditional OCR + structure recognition
    Search Integration  | Built-in hybrid search over extracted text        | No search (extraction only)          | No search (extraction only)  | No search (extraction only)        | No search (extraction only)
    Multimodal Context  | Text + visual layout + document embeddings        | Text extraction only                 | Text + table + form extraction | Text + entity + layout extraction | Text + layout recognition
    Custom Models       | Docker-based plugin system on Ray GPUs            | Custom document types via training   | Not supported                | Custom processor training          | Limited customization
    Retriever Pipelines | Composable multi-stage (filter, search, rerank)   | Not available                        | Not available                | Not available                      | Not available
    Deployment Options  | Managed, Dedicated, BYO Cloud                     | Managed SaaS or on-prem              | AWS only                     | Google Cloud only                  | On-prem or cloud

    Build Document Search with OCR in Minutes

    A simple Python API to extract text from documents and build searchable indexes with composable retriever pipelines.

    document_ocr.py
    from mixpeek import Mixpeek
    
    client = Mixpeek(api_key="YOUR_API_KEY")
    
    # Create a collection with OCR and document extractors
    collection = client.collections.create(
        name="invoice-archive",
        namespace="invoices",
        extractors=[
            {
                "type": "ocr",
                "model": "paddleocr-v4",
                "config": {
                    "languages": ["en", "es", "fr"],
                    "detect_tables": True,
                    "preserve_layout": True
                }
            },
            {
                "type": "document_embedding",
                "model": "colpali-v1.3",
                "config": {
                    "include_visual_features": True
                }
            },
            {
                "type": "text_embedding",
                "model": "sentence-transformers/all-MiniLM-L6-v2",
                "config": {
                    "source": "ocr_output",
                    "chunk_size": 512
                }
            }
        ]
    )
    
    # Upload documents to trigger OCR processing
    client.buckets.upload(
        bucket="my-bucket",
        files=["invoice_2024_001.pdf", "receipt_scan.jpg"],
        collection=collection.id
    )
    
    # Search extracted text with hybrid retrieval
    results = client.retrievers.execute(
        namespace="invoices",
        stages=[
            {
                "type": "feature_search",
                "method": "hybrid",
                "query": {
                    "text": "payment due date March 2026 over $5000",
                    "modalities": ["text", "image"]
                },
                "limit": 20
            },
            {
                "type": "filter",
                "conditions": {
                    "metadata.doc_type": "invoice",
                    "metadata.tables_detected": True
                }
            },
            {
                "type": "rerank",
                "model": "cross-encoder",
                "limit": 5
            }
        ]
    )
    
    for result in results:
        print(f"Document: {result.metadata['filename']}")
        print(f"Page: {result.metadata['page_number']}")
        print(f"Tables: {result.metadata.get('tables', [])}")
        print(f"Text: {result.content[:300]}")

    Frequently Asked Questions

    What is OCR and how does it work?

    OCR (Optical Character Recognition) is the technology that converts images of text into machine-readable text data. Modern OCR uses deep learning models to detect text regions in images, recognize individual characters and words, and reconstruct the document's text content. Mixpeek extends traditional OCR with multimodal understanding -- combining text extraction with visual layout analysis, table detection, and document embeddings for comprehensive document search.
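The three stages described above (detect text regions, recognize characters, reconstruct page text) can be sketched as a skeleton pipeline. The detector and recognizer below are stand-in stubs; real implementations are deep learning models.

```python
# The three OCR stages as a skeleton pipeline: detect text regions,
# recognize characters in each region, reconstruct the page text in
# reading order. Detector and recognizer here are stand-in stubs.
def detect_regions(image):
    # A real detector returns bounding boxes found in the image pixels;
    # we fake two regions with pre-cropped content.
    return [{"bbox": (0, 0, 100, 20), "crop": "INVOICE"},
            {"bbox": (0, 30, 100, 50), "crop": "Total: $5,000"}]

def recognize(region):
    # A real recognizer decodes characters from the cropped pixels.
    return region["crop"]

def ocr(image):
    # Sort regions top-to-bottom (by y0) to reconstruct reading order.
    regions = sorted(detect_regions(image), key=lambda r: r["bbox"][1])
    return "\n".join(recognize(r) for r in regions)

print(ocr("page.png"))
```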

    How is Mixpeek OCR different from traditional OCR services?

    Traditional OCR services extract text and return it as a file. You then need to build your own search infrastructure to make that text queryable. Mixpeek is an end-to-end platform: OCR extraction runs on Ray GPU clusters, extracted text is automatically indexed alongside visual document embeddings in Qdrant namespaces, and composable retriever pipelines let you search immediately. You also get multimodal context -- the visual layout and appearance of documents is searchable alongside the extracted text.

    What document formats does Mixpeek OCR support?

    Mixpeek OCR supports PDF (single and multi-page, both scanned and digital), JPEG, PNG, TIFF (including multi-page), WebP, BMP, and HEIC image formats. For PDFs, Mixpeek intelligently routes digital PDFs through text extraction and scanned PDFs through OCR processing. The platform handles format conversion, resolution optimization, and page splitting automatically.
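The digital-vs-scanned routing decision can be sketched as a simple heuristic: a page whose embedded text layer yields enough characters skips OCR, while a scanned page (little or no extractable text) goes through OCR. The 50-character threshold is an illustrative assumption, not Mixpeek's actual rule.

```python
# Routing sketch: a PDF page with a usable embedded text layer skips OCR;
# a scanned page (little or no extractable text) goes through OCR.
# The 50-character threshold is an illustrative heuristic.
def route_pdf_page(text_layer: str, min_chars: int = 50) -> str:
    return "text_extraction" if len(text_layer.strip()) >= min_chars else "ocr"

print(route_pdf_page("Lorem ipsum dolor sit amet, " * 5))  # text_extraction
print(route_pdf_page(""))                                   # ocr
```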

    Can Mixpeek extract tables from documents?

    Yes. Mixpeek's OCR extractors include table detection and extraction capabilities. They identify table boundaries, detect row and column structure, and extract cell content while preserving the relational layout. Table data is stored as structured metadata on documents in your namespace, making it filterable and searchable through retriever pipelines.

    How accurate is Mixpeek OCR for handwritten text?

    Accuracy depends on handwriting legibility and image quality. Mixpeek uses state-of-the-art ICR (Intelligent Character Recognition) models trained on large handwriting datasets. For clean handwriting on standard forms, accuracy typically exceeds 90%. For degraded or unusual handwriting, accuracy may be lower. You can improve results by using custom models fine-tuned on your specific handwriting styles via the Docker plugin system.
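Figures like "exceeds 90%" are typically character-level accuracy: one minus the character error rate (CER), where CER is the Levenshtein edit distance between the OCR output and the ground truth, divided by the ground-truth length. A self-contained sketch:

```python
# Character-level accuracy: 1 - CER, where CER is the Levenshtein edit
# distance between OCR output and ground truth, divided by the length
# of the ground truth.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def char_accuracy(ocr_text: str, truth: str) -> float:
    return 1.0 - edit_distance(ocr_text, truth) / len(truth)

# Two character substitutions ('1' for 'i', '0' for 'o') in 19 characters.
print(round(char_accuracy("The qu1ck brown f0x", "The quick brown fox"), 3))  # 0.895
```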

    Does Mixpeek preserve document layout during OCR?

    Yes. Mixpeek's layout-aware OCR preserves the spatial structure of documents -- headers, paragraphs, columns, sidebars, tables, and footnotes are identified and segmented. Each text block includes bounding box coordinates and reading order information. This means search results can reference specific regions of a document, not just raw text strings.

    How many languages does Mixpeek OCR support?

    Mixpeek OCR supports 100+ languages through models like PaddleOCR and Tesseract. This includes Latin, Cyrillic, Arabic, Chinese, Japanese, Korean, Devanagari, and other scripts. Mixed-language documents are handled automatically -- the system detects script types and applies appropriate recognition models per text region.

    Can I combine OCR with semantic search in Mixpeek?

    Yes. This is a core advantage of Mixpeek's approach. OCR-extracted text is embedded using text embedding models and indexed alongside visual document embeddings in the same namespace. A single retriever query can combine semantic search over extracted text content, visual similarity over document appearance, metadata filtering by document type or date, and keyword matching for exact terms -- all in one composable pipeline.
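The hybrid combination described above can be sketched as a weighted blend of a keyword score (term overlap) and a semantic score (cosine similarity between embeddings). The 0.5/0.5 weights and toy two-dimensional vectors are illustrative assumptions, not Mixpeek's actual scoring.

```python
import math

# Hybrid retrieval sketch: blend a keyword score (term overlap) with a
# semantic score (cosine similarity between embedding vectors).
def keyword_score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def hybrid_score(query, doc, q_vec, d_vec, alpha=0.5):
    # alpha weights keyword matching vs. semantic similarity.
    return alpha * keyword_score(query, doc) + (1 - alpha) * cosine(q_vec, d_vec)

score = hybrid_score("payment due date", "Invoice payment due 2026-03-01",
                     [1.0, 0.0], [0.8, 0.6])
print(round(score, 3))  # 0.733
```

Keyword overlap catches exact terms like invoice numbers; the embedding side catches paraphrases the keywords miss. Blending the two is what makes a single retriever stage serve both query styles.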

    Transform Documents into Searchable Intelligence

    Stop extracting text into files that nobody searches. Build document intelligence with OCR, multimodal embeddings, and composable retriever pipelines.