How OCR Actually Works: Detection, Recognition, Reading Order, and Tables

The Black Box That Quietly Corrupts Your Search

An AI agent that searches documents almost always sits downstream of OCR (optical character recognition), even when nobody calls it that. A PDF is parsed, a scan is read, a screenshot is transcribed, and the resulting text is chunked, embedded, and indexed. If the OCR step is wrong, every layer above it is wrong too, and the failure is silent: the agent retrieves a chunk, the chunk contains plausible text, and the text is subtly garbled in a way that no similarity score can detect.

The classic symptoms are familiar to anyone who has shipped document search. A two-column page comes back with sentences from the left column interleaved with the right. A financial table turns into a flat run of numbers with no headers attached. A footnote lands in the middle of a paragraph. A scanned form reads "1OO" where it should read "100." None of these are embedding problems or retriever problems. They are OCR problems, and you cannot fix them in the retriever.

The fix starts with understanding that OCR is not one model. It is a pipeline of distinct stages, each of which can fail independently, and knowing which stage failed tells you what to do about it. This guide walks the pipeline that turns a page image into searchable, structure-preserving text.

The Pipeline at a Glance

A modern OCR or document-AI system has four stages that matter for retrieval:

1. Text detection. Find *where* the text is. Output: boxes or polygons around words, lines, or text regions. No characters are read yet.
2. Text recognition. Read *what* each detected region says. Output: a character string per region, usually with a confidence.
3. Layout analysis and reading order. Group regions into paragraphs, columns, headers, captions, and put them in the order a human would read.
4. Structure recognition. For tables, forms, and key-value content, recover the relationships: which cell is under which header, which value belongs to which field.

Older engines fused detection and recognition loosely and skipped stages 3 and 4 almost entirely, which is exactly why legacy OCR scrambles anything that is not a single clean column of prose. Each stage is worth understanding on its own.
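One way to keep the four stages separable in code is to carry a single record per text region and let each stage fill in its own fields. The sketch below is illustrative only: the class name, field names, and the helper are assumptions made for this example, not the schema of any particular OCR engine.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class TextRegion:
    # Stage 1 (detection): where the text is, and how sure the detector is.
    polygon: List[Point]
    det_confidence: float
    # Stage 2 (recognition): what the region says.
    text: Optional[str] = None
    rec_confidence: Optional[float] = None
    # Stage 3 (layout and reading order): which block, and where in the sequence.
    block_id: Optional[int] = None
    reading_index: Optional[int] = None
    # Stage 4 (structure): the headers that govern a table cell or form value.
    row_header: Optional[str] = None
    col_header: Optional[str] = None


def last_completed_stage(region: TextRegion) -> int:
    """Return the number of the last stage that filled in this region."""
    if region.text is None:
        return 1
    if region.reading_index is None:
        return 2
    # Plain prose never gets headers, so stage 3 is the end state for it.
    if region.row_header is None and region.col_header is None:
        return 3
    return 4


region = TextRegion(polygon=[(0, 0), (80, 0), (80, 12), (0, 12)], det_confidence=0.98)
print(last_completed_stage(region))
```

Keeping the stage outputs on one record makes it easier to ask which stage failed for a given region, rather than debugging a flat string after the fact.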

Stage 1: Text Detection (Finding the Words)

Detection answers a deceptively hard question: which pixels are text? Text in the wild is rotated, curved, varies wildly in size, sits on noisy backgrounds, and can be dense (a spreadsheet) or sparse (a logo). Two families of detectors dominate.

Regression-based detectors predict bounding boxes directly, the way an object detector predicts boxes around dogs and cars. EAST is the canonical example: a fully convolutional network outputs, for every pixel, a score for "is this text" plus the geometry of the box that pixel belongs to. It is fast and works well for straight, well-separated text, but struggles with long lines and curved text because a single rotated rectangle cannot describe a curve.

Segmentation-based detectors predict a per-pixel text or not-text mask, then group pixels into regions. CRAFT predicts two heat maps, one for character centers and one for the affinity between adjacent characters, then links characters into words. DBNet (Differentiable Binarization) learns the binarization threshold itself instead of hard-coding it, which makes it robust to faint or uneven ink. Segmentation detectors handle curved and arbitrarily shaped text far better, at some extra cost, because they output polygons rather than rectangles.
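To make the "learned threshold" idea concrete: differentiable binarization replaces the hard step (pixel is text if probability exceeds a fixed threshold) with a steep sigmoid of the difference between a probability map and a threshold map, so the threshold can be trained per pixel. The sketch below applies that function to single scalar values; the function name and the example numbers are illustrative, and k = 50 is the steepness commonly used with this formulation.

```python
import math


def differentiable_binarization(prob: float, thresh: float, k: float = 50.0) -> float:
    """Soft step: close to 1 when prob is above thresh, close to 0 when below."""
    return 1.0 / (1.0 + math.exp(-k * (prob - thresh)))


faint_ink = 0.35  # hypothetical text probability for a faint stroke

# With a fixed global threshold of 0.5, the faint stroke is rejected.
print(differentiable_binarization(faint_ink, thresh=0.5))

# With a locally learned threshold of 0.3, the same stroke is kept.
print(differentiable_binarization(faint_ink, thresh=0.3))
```

The same pixel flips from background to text purely because the threshold adapted to the local region, which is the behavior that helps on faint or uneven ink.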

The output of detection is purely spatial: a set of regions, often word-level or line-level, each with a polygon and a detection confidence. A subtle but important point for downstream search: detection granularity decides recognition granularity. Word-level boxes give you word confidences (useful for filtering garbage); line-level boxes give better recognition accuracy because the recognizer sees more context, but coarser confidence. Most production engines detect at the line level and recognize whole lines.

Stage 2: Text Recognition (Reading the Characters)

Recognition takes one cropped region, usually a single text line normalized to a fixed height, and produces a character string. This is a sequence-to-sequence problem: a variable-width image goes in, a variable-length string comes out, and there is no fixed alignment between pixel columns and characters. Three decoding strategies solve the alignment problem, and they behave very differently in the failure cases an agent will hit.

CTC decoding

A convolutional or recurrent backbone slices the line image into a horizontal sequence of frames and predicts a character distribution per frame, including a special blank symbol. Connectionist Temporal Classification (CTC) then collapses repeated characters and removes blanks to produce the final string. The rule: collapse consecutive identical labels, then drop blanks, so the frame sequence h h - e l l - l o becomes hello (the blank between the two l groups is what keeps them from merging).

frames:   h  h  -  e  l  l  -  l  o
collapse: h     -  e  l     -  l  o
drop -  : h        e  l        l  o   -> "hello"

CTC is fast, parallel, and the workhorse of high-throughput OCR. Its weakness is that each frame is predicted fairly independently, so it has a weak internal language model and can produce locally plausible character errors (rn read as m, 1 as l).
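The collapse rule above fits in a few lines of Python. This is a minimal sketch of greedy (best-path) decoding only, assuming the per-frame argmax labels are already available and that "-" stands for the blank symbol as in the diagram; real decoders work on label indices and often use beam search instead.

```python
from itertools import groupby

BLANK = "-"  # assumed blank symbol, matching the diagram above


def ctc_greedy_collapse(frames, blank=BLANK):
    """Collapse consecutive identical labels, then drop blanks."""
    collapsed = [label for label, _ in groupby(frames)]
    return "".join(label for label in collapsed if label != blank)


frames = ["h", "h", "-", "e", "l", "l", "-", "l", "o"]
print(ctc_greedy_collapse(frames))  # hello
```

Without the blank between the two l groups, the same frames would collapse to "helo", which is why the blank symbol is needed to emit doubled letters.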

Attention decoding

An encoder-decoder with attention generates characters one at a time, attending to the relevant slice of the image for each output character. This gives an implicit language model (the decoder conditions each character on the ones before it), which improves accuracy on natural text. The cost is speed (autoregressive, one character per step) and a specific failure mode: attention drift, where the decoder loses its place on long or low-quality lines and either repeats a span or hallucinates a fluent-but-wrong continuation. That last failure is dangerous for agents because the output looks clean.
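The repeated-span form of attention drift can sometimes be caught with a cheap text-only check after recognition. The heuristic below is an assumption of this guide's editor rather than a standard technique: it flags a line when a run of at least min_len tokens is immediately repeated. It does nothing for the fluent-but-wrong case, which needs confidence scores or a comparison against the source region.

```python
def has_repeated_span(text: str, min_len: int = 4) -> bool:
    """True if some span of at least min_len tokens repeats back-to-back."""
    tokens = text.split()
    n = len(tokens)
    for size in range(min_len, n // 2 + 1):
        for start in range(0, n - 2 * size + 1):
            first = tokens[start:start + size]
            second = tokens[start + size:start + 2 * size]
            if first == second:
                return True
    return False


drifted = "the party shall pay the fee the party shall pay the fee within 30 days"
clean = "the party shall pay the fee within 30 days"
print(has_repeated_span(drifted), has_repeated_span(clean))
```

A flagged line is a candidate for re-recognition with a different decoder, not proof of an error, since legitimate text occasionally repeats.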

Transformer / VLM recognition

The current frontier treats recognition as image-to-text with a vision transformer encoder and a language-model decoder (the TrOCR and Donut lineage, and increasingly general document VLMs). These models bring a strong language prior, handle full lines or whole regions, and can even skip explicit detection by reading a region directly. They are the most accurate on hard inputs and the most expensive, and they inherit the LLM failure mode of confidently inventing text when the image is ambiguous, which is why confidence calibration and source-region grounding matter more, not less, as recognizers get smarter.

Why the recognizer's language model matters for search

A recognizer with a strong language model fixes "teh" to "the" for free, which helps prose. The same prior actively hurts on content that is not natural language: part numbers, license plates, chemical formulas, monetary amounts, and codes. A model that "knows" English will happily normalize SKU-X0OI into something more word-like. When an agent later does an exact-match or filter query for that code, it silently misses. The practical rule: pick or configure recognizers per content type, and never assume one model is right for both paragraphs and identifiers.
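One way to apply that rule, sketched under the assumption that you can obtain two transcripts for the same region (one from a recognizer with little language prior and one from a language-model-heavy recognizer): keep the raw transcript whenever the token looks like an identifier or amount. The function names and the "contains a digit, no whitespace" test are hypothetical simplifications, not a feature of any specific engine.

```python
import re

# Hypothetical test: a single token with at least one digit is treated as an
# identifier or amount, where language-model "corrections" are not trusted.
IDENTIFIER_LIKE = re.compile(r"(?=\S*\d)\S+")


def looks_like_identifier(token: str) -> bool:
    return IDENTIFIER_LIKE.fullmatch(token) is not None


def choose_transcript(raw: str, lm_corrected: str) -> str:
    """Prefer the raw reading for identifiers, the corrected reading for prose."""
    return raw if looks_like_identifier(raw) else lm_corrected


print(choose_transcript("SKU-X0OI", "SKU-XOOI"))  # SKU-X0OI
print(choose_transcript("teh", "the"))            # the
```

In practice the content type is often better taken from layout (a form field labeled "Part No." versus a paragraph) than guessed from the characters, but the routing decision has the same shape.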

Stage 3: Layout Analysis and Reading Order

Detection plus recognition gives you a bag of text lines with positions. It does not give you a *document*. To produce text an agent can read and chunk correctly, the system has to group lines into logical blocks (paragraphs, headings, captions, lists) and order those blocks the way a human reads them. This is where the infamous "columns merged into each other" bug lives.

Reading-order reconstruction has two common approaches:

Geometric heuristics. Cluster lines into columns by x-position, sort within a column top-to-bottom, then order columns left-to-right (or by script direction). Cheap and good for clean layouts; brittle on magazines, sidebars, multi-panel pages, and rotated content.

Learned reading order. Treat blocks as nodes and predict the reading sequence with a model, often a layout-aware transformer that consumes text, position, and sometimes visual features together (the LayoutLM family is the well-known lineage). This handles complex pages far better and is what modern document-AI services rely on.
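The geometric heuristic can be sketched in a few lines. This is a deliberately naive version that assumes each line carries its left and top coordinates and that columns are separated by a gap larger than column_gap; the Line class and the threshold value are illustrative, and the sketch breaks on full-width headings, sidebars, and indented lines exactly as described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    text: str
    x0: float  # left edge of the line
    y0: float  # top edge of the line


def reading_order(lines: List[Line], column_gap: float = 100.0) -> List[Line]:
    """Cluster lines into columns by left edge, then read each column top to bottom."""
    columns: List[List[Line]] = []
    for line in sorted(lines, key=lambda l: l.x0):
        # Join the current column if the left edge is near the column's leftmost line.
        if columns and line.x0 - columns[-1][0].x0 <= column_gap:
            columns[-1].append(line)
        else:
            columns.append([line])
    ordered: List[Line] = []
    for column in columns:  # columns are already in left-to-right order
        ordered.extend(sorted(column, key=lambda l: l.y0))
    return ordered


page = [
    Line("left 1", 50, 100), Line("right 1", 400, 100),
    Line("left 2", 52, 130), Line("right 2", 398, 130),
]
print([l.text for l in reading_order(page)])
# ['left 1', 'left 2', 'right 1', 'right 2']
```

Sorting the same lines purely top to bottom would yield left 1, right 1, left 2, right 2, which is the interleaved-columns bug in miniature.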

For an agent's retrieval quality, reading order is not cosmetic. Chunking happens *after* reading order is resolved, so a wrong order means a chunk that splices unrelated sentences, which produces an embedding that represents nothing coherent. A retriever cannot recover from a chunk that should never have existed. Reading-order errors are upstream of, and invisible to, every retrieval metric you track.

Stage 4: Table and Form Structure Recognition

Tables are where naive OCR does the most damage, because the meaning of a table cell is its position in a grid, and flat text throws that position away. A "Q3 revenue" cell might OCR perfectly, as the string Q3 in a header and the string 1,240 in the body, and yet leave the agent with no way to know they belong together.

Table structure recognition (TSR) recovers the grid. The two dominant formulations:

Cell and line detection. Detect ruling lines and cell boundaries, then infer rows and columns from their intersections. Works on bordered tables; fails on borderless ones held together only by whitespace.

Structure as a sequence or graph. Models predict the table as structured markup (rows, columns, spans) directly from the image, or predict a graph linking cells to their row and column headers. This handles borderless tables, merged cells, and spanning headers, which the line-detection approach cannot.

The output that matters is the header-to-value binding: every value cell tagged with the row header and column header that govern it. Once you have that, you can serialize a cell as a self-contained, searchable record like Q3 | revenue | 1,240, and an agent can filter or retrieve on it without reconstructing the table in its head. Forms are the same problem in disguise: key-value extraction binds a field label ("Invoice Number") to its value, so the agent can answer "what is the invoice number" with a span instead of a guess.
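Once the grid is recovered, the header-to-value binding is a small transformation. The sketch below assumes the simplest possible case, a rectangular grid whose first row holds column headers and whose first column holds row headers, with no merged or spanning cells; the function name and the sample figures (other than the Q3 revenue cell from the text) are made up for illustration.

```python
from typing import List


def bind_cells(grid: List[List[str]]) -> List[str]:
    """Serialize each value cell as 'column header | row header | value'."""
    col_headers = grid[0][1:]
    records = []
    for row in grid[1:]:
        row_header, values = row[0], row[1:]
        for col_header, value in zip(col_headers, values):
            records.append(f"{col_header} | {row_header} | {value}")
    return records


grid = [
    ["",        "Q2",    "Q3"],
    ["revenue", "1,180", "1,240"],
    ["costs",   "900",   "950"],
]
for record in bind_cells(grid):
    print(record)
```

Each record is self-contained, so it can be indexed, filtered, or embedded on its own without the agent needing the rest of the table.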

Confidence Is a First-Class Signal, Not an Afterthought

Every stage emits confidence, and most pipelines throw it away. That is a mistake for agentic search. Detection confidence tells you whether a region is really text. Recognition confidence tells you whether a string is trustworthy enough to index as-is. A low-confidence run on a scanned form is the difference between "index this and let users find wrong answers" and "flag this page for re-scan or human review."

Carry confidence through to the index and expose it to the agent. An agent that knows a transcribed clause came from a 0.55-confidence region on a skewed scan can hedge, ask for the source image, or weight that evidence less, instead of asserting a garbled number as fact. Treating OCR confidence the way you would treat any other calibrated similarity score is what turns a brittle transcription into trustworthy evidence.
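A minimal sketch of that routing decision follows. The threshold values and route names are placeholders chosen for this example, not recommendations; useful cutoffs depend on how well calibrated a given recognizer's confidence actually is.

```python
def route_by_confidence(confidence: float,
                        index_at: float = 0.90,
                        review_below: float = 0.60) -> str:
    """Decide what to do with a recognized span based on its confidence."""
    if confidence >= index_at:
        return "index"          # trustworthy enough to index as-is
    if confidence < review_below:
        return "review"         # flag for re-scan or human review
    return "index_flagged"      # index, but expose the confidence to the agent


for conf in (0.97, 0.75, 0.55):
    print(conf, route_by_confidence(conf))
```

The middle route matters most for agents: the text stays searchable, but the confidence travels with it so the agent can hedge or ask for the source image.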

When To Skip OCR Entirely

Understanding the pipeline also tells you when to abandon it. If your documents are dense with tables, charts, diagrams, mixed layouts, or handwriting, the cascade in this guide compounds errors at every stage, and a vision-language retriever that embeds the page image directly can outperform the whole chain by never decomposing the page into fragile intermediate text. That approach (ColPali-style late interaction over page images) is covered in depth in Visual Document Retrieval. The honest framing: OCR and visual retrieval are complementary. OCR gives you exact, filterable, machine-readable text (essential for codes, amounts, and compliance lookups); visual retrieval gives you robust semantic recall on messy pages. Production document agents usually run both and fuse them.

Doing This in Mixpeek

For an agent, the durable goal is that document text enters the index already detected, recognized, ordered, and structure-aware, with confidence attached, so the retriever and the agent never inherit a scrambling bug they cannot see. In Managed Mixpeek that is a feature-extractor choice on the collection rather than a hand-built four-stage pipeline: declare an OCR feature for exact text and a visual document feature for layout-robust recall, and let the retriever fuse them.

from mixpeek import Mixpeek

mx = Mixpeek(api_key="mxp_sk_...")

collection = mx.collections.create(
    collection_name="contracts_and_invoices",
    source={"type": "bucket", "uri": "s3://legal-docs"},
    feature_extractors=[
        # exact, structure-aware text: detection + recognition + reading order + tables
        {"feature": "ocr", "model": "PaddlePaddle/paddleocr"},
        # layout-robust semantic recall over the page image itself
        {"feature": "visual_document_embedding", "model": "vidore/colqwen2-v1.0"},
    ],
)

retriever = mx.retrievers.create(
    collection_id=collection.id,
    stages=[
        # exact lookups (invoice numbers, amounts, clause text) hit OCR text
        {"type": "feature_search", "feature": "ocr", "top_k": 100},
        # semantic / messy-page recall hits the page-image embedding
        {"type": "feature_search", "feature": "visual_document_embedding", "top_k": 100},
        # merge by rank so the two scales never get compared directly
        {"type": "rank_fusion", "method": "rrf"},
    ],
)

results = mx.retrievers.execute(
    retriever_id=retriever.id,
    query="net 30 payment terms on the Q3 invoice",
    return_fields=["doc_id", "page", "ocr_text", "ocr_confidence", "bbox"],
)

Returning the source bounding box and OCR confidence alongside the text is what lets the agent ground a claim back to a region of the page instead of asserting a transcription it cannot verify. If you already run your own OCR and only need search, you can push the recognized text and its embeddings into MVS, Mixpeek's vector store on object storage, and keep the detection metadata as payload so structure and confidence travel with every result.
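For readers who want to see what merging by rank means, here is a generic sketch of reciprocal rank fusion (RRF), the method named in the retriever's final stage. It is a standalone illustration of the published formula, score = sum over lists of 1 / (k + rank) with the conventional k = 60, and is not a description of how Mixpeek implements that stage; the document ids are invented.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of ids using only each id's rank in each list."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


ocr_hits = ["a", "b", "c"]      # hypothetical ids ranked by OCR text search
visual_hits = ["b", "d", "a"]   # hypothetical ids ranked by page-image search
print(reciprocal_rank_fusion([ocr_hits, visual_hits]))  # ['b', 'a', 'd', 'c']
```

Because only ranks enter the formula, a lexical score and an embedding similarity never have to be put on a common scale.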


Related guides

Visual Document Retrieval: How AI Agents Search Documents Without OCR

Streaming Video Understanding: How Agents Watch an Unbounded Live Feed in Real Time

Perceptual Image Hashing: How Agents Recognize the Same Picture After It Has Been Re-Encoded, Cropped, and Recolored