    Agent Perception
    19 min read
    Updated 2026-06-19

    How OCR Actually Works: Detection, Recognition, Reading Order, and Tables

    Most pipelines treat OCR as a black box that turns a page into text, then act surprised when tables scramble and columns merge. This guide opens the box: how text detection finds regions, how recognition decodes characters, how reading order is reconstructed, and how table structure recognition keeps a number attached to its header, so an agent can actually search what a document says.

    OCR
    Document AI
    Text Detection
    Reading Order
    Table Structure
    Agent Perception

    The Black Box That Quietly Corrupts Your Search



    An AI agent that searches documents almost always sits downstream of optical character recognition (OCR), even when nobody calls it that. A PDF is parsed, a scan is read, a screenshot is transcribed, and the resulting text is chunked, embedded, and indexed. If the OCR step is wrong, every layer above it is wrong too, and the failure is silent: the agent retrieves a chunk, the chunk contains plausible text, and the text is subtly garbled in a way that no similarity score can detect.

    The classic symptoms are familiar to anyone who has shipped document search. A two-column page comes back with sentences from the left column interleaved with the right. A financial table turns into a flat run of numbers with no headers attached. A footnote lands in the middle of a paragraph. A scanned form reads "1OO" where it should read "100." None of these are embedding problems or retriever problems. They are OCR problems, and you cannot fix them in the retriever.

    The fix starts with understanding that OCR is not one model. It is a pipeline of distinct stages, each of which can fail independently, and knowing which stage failed tells you what to do about it. This guide walks the pipeline that turns a page image into searchable, structure-preserving text.

    The Pipeline at a Glance



    A modern OCR or document-AI system has four stages that matter for retrieval:

  1. Text detection. Find *where* the text is. Output: boxes or polygons around words, lines, or text regions. No characters are read yet.
  2. Text recognition. Read *what* each detected region says. Output: a character string per region, usually with a confidence.
  3. Layout analysis and reading order. Group regions into paragraphs, columns, headers, captions, and put them in the order a human would read.
  4. Structure recognition. For tables, forms, and key-value content, recover the relationships: which cell is under which header, which value belongs to which field.
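
    As a rough sketch of what each stage hands to the next, the four outputs can be modeled as plain records. The class and field names here are illustrative assumptions, not the schema of any particular OCR engine.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DetectedRegion:
    """Stage 1 output: where the text is. No characters yet."""
    polygon: List[Tuple[float, float]]  # corner points in page coordinates
    det_confidence: float


@dataclass
class RecognizedLine:
    """Stage 2 output: what one detected region says."""
    region: DetectedRegion
    text: str
    rec_confidence: float


@dataclass
class Block:
    """Stage 3 output: lines grouped into a logical unit, with a reading position."""
    kind: str  # e.g. "paragraph", "heading", "caption" (hypothetical labels)
    lines: List[RecognizedLine]
    reading_index: int


@dataclass
class TableCell:
    """Stage 4 output: a value bound to the headers that govern it."""
    value: RecognizedLine
    row_header: str
    column_header: str


region = DetectedRegion(polygon=[(40, 100), (300, 100), (300, 120), (40, 120)], det_confidence=0.98)
line = RecognizedLine(region=region, text="Net 30 payment terms", rec_confidence=0.93)
block = Block(kind="paragraph", lines=[line], reading_index=0)
print(block.lines[0].text, block.lines[0].region.det_confidence)
```

    Keeping the region attached to every later record is what makes it possible to trace a retrieved string back to a place on the page.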

    Older engines fused detection and recognition loosely and skipped stages 3 and 4 almost entirely, which is exactly why legacy OCR scrambles anything that is not a single clean column of prose. Each stage is worth understanding on its own.

    Stage 1: Text Detection (Finding the Words)



    Detection answers a deceptively hard question: which pixels are text? Text in the wild is rotated, curved, varies wildly in size, sits on noisy backgrounds, and can be dense (a spreadsheet) or sparse (a logo). Two families of detectors dominate.

    Regression-based detectors predict bounding boxes directly, the way an object detector predicts boxes around dogs and cars. EAST is the canonical example: a fully convolutional network outputs, for every pixel, a score for "is this text" plus the geometry of the box that pixel belongs to. It is fast and works well for straight, well-separated text. It struggles with very long lines, where the network's receptive field limits how far a pixel can see, and with curved text, because a single rotated rectangle cannot describe a curve.

    Segmentation-based detectors predict a per-pixel text or not-text mask, then group pixels into regions. CRAFT predicts two heat maps, one for character centers and one for the affinity between adjacent characters, then links characters into words. DBNet (Differentiable Binarization) learns the binarization threshold itself instead of hard-coding it, which makes it robust to faint or uneven ink. Segmentation detectors handle curved and arbitrarily shaped text far better, at some extra cost, because they output polygons rather than rectangles.
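
    The "mask, then group" step of a segmentation-based detector can be sketched in a few lines. This is a toy version under stated assumptions: the probability map is a small hand-written grid, the threshold is a fixed constant (DBNet would learn it), and regions come back as axis-aligned boxes rather than the polygons a real detector traces. The function name is hypothetical.

```python
from collections import deque


def group_text_pixels(prob_map, threshold=0.5):
    """Threshold a per-pixel text probability map and group the surviving
    pixels into 4-connected regions. Returns one box per region as
    (x_min, y_min, x_max, y_max)."""
    height = len(prob_map)
    width = len(prob_map[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if seen[y][x] or prob_map[y][x] < threshold:
                continue
            # Flood-fill one connected region of text pixels.
            queue = deque([(x, y)])
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and not seen[ny][nx] and prob_map[ny][nx] >= threshold:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes


prob_map = [
    [0.9, 0.8, 0.1, 0.0, 0.7],
    [0.9, 0.2, 0.1, 0.0, 0.9],
    [0.0, 0.0, 0.0, 0.0, 0.6],
]
print(group_text_pixels(prob_map))  # [(0, 0, 1, 1), (4, 0, 4, 2)]
```

    Because grouping follows the pixels rather than a box template, the same idea extends to curved text once the bounding box is replaced by a contour.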

    The output of detection is purely spatial: a set of regions, often word-level or line-level, each with a polygon and a detection confidence. A subtle but important point for downstream search: detection granularity decides recognition granularity. Word-level boxes give you word confidences (useful for filtering garbage); line-level boxes give better recognition accuracy because the recognizer sees more context, but coarser confidence. Most production engines detect at the line level and recognize whole lines.

    Stage 2: Text Recognition (Reading the Characters)



    Recognition takes one cropped region, usually a single text line normalized to a fixed height, and produces a character string. This is a sequence-to-sequence problem: a variable-width image goes in, a variable-length string comes out, and there is no fixed alignment between pixel columns and characters. Three decoding strategies solve the alignment problem, and they behave very differently in the failure cases an agent will hit.

    CTC decoding



    A convolutional or recurrent backbone slices the line image into a horizontal sequence of frames and predicts a character distribution per frame, including a special blank symbol. Connectionist Temporal Classification (CTC) then collapses repeated characters and removes blanks to produce the final string. The rule: collapse consecutive identical labels, then drop blanks, so the frame sequence `h h - e l l - l o` becomes `hello` (the blank between the two `l` groups is what keeps them from merging).

    frames:   h  h  -  e  l  l  -  l  o
    collapse: h     -  e  l     -  l  o
    drop -  : h        e  l        l  o   -> "hello"
    


    CTC is fast, parallel, and the workhorse of high-throughput OCR. Its weakness is that each frame is predicted fairly independently, so it has a weak internal language model and can produce locally plausible character errors (`rn` read as `m`, `1` as `l`).
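
    The collapse rule is small enough to write out. The sketch below assumes greedy (best-path) decoding, where each frame simply takes its most likely label; production recognizers often use beam search instead. The blank symbol `-` and the function names are illustrative choices.

```python
from itertools import groupby

BLANK = "-"


def ctc_collapse(frames, blank=BLANK):
    """Apply the CTC rule: merge consecutive identical labels, then drop blanks."""
    merged = [label for label, _ in groupby(frames)]
    return "".join(label for label in merged if label != blank)


def ctc_greedy_decode(frame_probs, alphabet, blank=BLANK):
    """Pick the most likely label per frame, then collapse.

    frame_probs: one probability list per frame, aligned with `alphabet`.
    """
    best_path = [
        alphabet[max(range(len(probs)), key=probs.__getitem__)]
        for probs in frame_probs
    ]
    return ctc_collapse(best_path, blank)


print(ctc_collapse("hh-ell-lo"))  # hello
print(ctc_collapse("hhelllo"))    # helo  (no blank, so the double l merges)

alphabet = ["-", "a", "b"]
frame_probs = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a (repeat, merged)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.8],    # b
]
print(ctc_greedy_decode(frame_probs, alphabet))  # ab
```

    The second call shows why the blank exists: without it, a genuine double letter is indistinguishable from one letter spread across several frames.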

    Attention decoding



    An encoder-decoder with attention generates characters one at a time, attending to the relevant slice of the image for each output character. This gives an implicit language model (the decoder conditions each character on the ones before it), which improves accuracy on natural text. The cost is speed (autoregressive, one character per step) and a specific failure mode: attention drift, where the decoder loses its place on long or low-quality lines and either repeats a span or hallucinates a fluent-but-wrong continuation. That last failure is dangerous for agents because the output looks clean.
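
    One of the two drift symptoms, the repeated span, is cheap to flag after the fact. The following is a heuristic sketch, not a standard algorithm: it looks for a run of words that is immediately followed by an identical run. The function name and the `min_words` default are assumptions, and the check does nothing for the fluent-but-wrong continuation, which needs confidence scores or a second reader to catch.

```python
def find_repeated_span(text, min_words=3):
    """Return the longest run of at least `min_words` words that is
    immediately repeated in `text`, or None if there is no such run."""
    words = text.split()
    n = len(words)
    # Try the longest possible span first so the biggest repeat wins.
    for size in range(n // 2, min_words - 1, -1):
        for start in range(0, n - 2 * size + 1):
            first = words[start:start + size]
            second = words[start + size:start + 2 * size]
            if first == second:
                return " ".join(first)
    return None


drifted = "the buyer shall pay the buyer shall pay within thirty days"
clean = "payment is due within thirty days"
print(find_repeated_span(drifted))  # the buyer shall pay
print(find_repeated_span(clean))    # None
```

    A hit is a reason to re-run the line or route it to review, not proof of an error, since legal and tabular text can repeat legitimately.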

    Transformer / VLM recognition



    The current frontier treats recognition as image-to-text with a vision transformer encoder and a language-model decoder (the TrOCR and Donut lineage, and increasingly general document VLMs). These models bring a strong language prior, handle full lines or whole regions, and can even skip explicit detection by reading a region directly. They are the most accurate on hard inputs and the most expensive, and they inherit the LLM failure mode of confidently inventing text when the image is ambiguous, which is why confidence calibration and source-region grounding matter more, not less, as recognizers get smarter.

    Why the recognizer's language model matters for search



    A recognizer with a strong language model fixes `teh` to `the` for free, which helps prose. The same prior actively hurts on content that is not natural language: part numbers, license plates, chemical formulas, monetary amounts, and codes. A model that "knows" English will happily normalize `SKU-X0OI` into something more word-like. When an agent later does an exact-match or filter query for that code, it silently misses. The practical rule: pick or configure recognizers per content type, and never assume one model is right for both paragraphs and identifiers.
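
    A minimal sketch of treating identifiers differently from prose, assuming two hypothetical helpers: a regex that guesses whether a token is a code, and a "fold key" that maps commonly confused glyphs to one canonical character so that an exact-match lookup survives an `O`/`0` or `I`/`1` swap. The confusion table is an illustrative assumption and should be tuned to the fonts and codes in your own corpus.

```python
import re

# Hypothetical rule: a token is identifier-like if it contains at least one
# digit and consists of alphanumeric runs joined by -, _, / or .
IDENTIFIER = re.compile(r"(?=.*\d)[A-Za-z0-9]+(?:[-_/.][A-Za-z0-9]+)*")

# Hypothetical confusion table, applied after upper-casing.
CONFUSABLE = str.maketrans({"O": "0", "I": "1", "L": "1"})


def looks_like_identifier(token):
    """Guess whether a token should bypass language-model correction."""
    return IDENTIFIER.fullmatch(token) is not None


def fold_key(token):
    """Canonical matching key: visually confusable glyphs collapse together."""
    return token.upper().translate(CONFUSABLE)


print(looks_like_identifier("SKU-X0OI"))  # True
print(looks_like_identifier("payment"))   # False
print(fold_key("SKU-X0OI"))               # SKU-X001
print(fold_key("SKU-XOO1"))               # SKU-X001
```

    Index the fold key next to the raw string and fold the query the same way. The trade-off is that distinct codes differing only in confusable characters collide, so keep the raw text for display and for disambiguation.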

    Stage 3: Layout Analysis and Reading Order



    Detection plus recognition gives you a bag of text lines with positions. It does not give you a *document*. To produce text an agent can read and chunk correctly, the system has to group lines into logical blocks (paragraphs, headings, captions, lists) and order those blocks the way a human reads them. This is where the infamous "columns merged into each other" bug lives.

    Reading-order reconstruction has two common approaches:

  1. Geometric heuristics. Cluster lines into columns by x-position, sort within a column top-to-bottom, then order columns left-to-right (or by script direction). Cheap and good for clean layouts; brittle on magazines, sidebars, multi-panel pages, and rotated content.
  2. Learned reading order. Treat blocks as nodes and predict the reading sequence with a model, often a layout-aware transformer that consumes text, position, and sometimes visual features together (the LayoutLM family is the well-known lineage). This handles complex pages far better and is what modern document-AI services rely on.


    For an agent's retrieval quality, reading order is not cosmetic. Chunking happens *after* reading order is resolved, so a wrong order means a chunk that splices unrelated sentences, which produces an embedding that represents nothing coherent. A retriever cannot recover from a chunk that should never have existed. Reading-order errors are upstream of, and invisible to, every retrieval metric you track.
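
    The geometric heuristic is easy to sketch, which also makes its brittleness easy to see. The version below assumes left-to-right script, lines that start near their column's left edge, and a hand-picked `column_gap`; the `Line` record and function name are hypothetical. A full-width heading or a sidebar breaks it immediately.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    x0: float  # left edge
    y0: float  # top edge


def reading_order(lines, column_gap=50.0):
    """Cluster lines into columns by left edge, then read each column
    top-to-bottom and the columns left-to-right."""
    columns = []
    for line in sorted(lines, key=lambda l: l.x0):
        # Same column if the left edge is close to the column's leftmost line.
        if columns and line.x0 - columns[-1][0].x0 <= column_gap:
            columns[-1].append(line)
        else:
            columns.append([line])
    ordered = []
    for column in columns:  # columns are already sorted left-to-right
        ordered.extend(sorted(column, key=lambda l: l.y0))
    return ordered


page = [
    Line("left 1", x0=40, y0=100),
    Line("right 1", x0=320, y0=100),
    Line("left 2", x0=42, y0=130),
    Line("right 2", x0=320, y0=130),
]
print([l.text for l in reading_order(page)])
# ['left 1', 'left 2', 'right 1', 'right 2']
print([l.text for l in sorted(page, key=lambda l: (l.y0, l.x0))])
# ['left 1', 'right 1', 'left 2', 'right 2']  <- the merged-columns bug
```

    The second print is the naive top-to-bottom sort, which interleaves the two columns exactly as described above.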

    Stage 4: Table and Form Structure Recognition



    Tables are where naive OCR does the most damage, because the meaning of a table cell is its position in a grid, and flat text throws that position away. "Q3 revenue" might OCR perfectly as the string `Q3` and the string `1,240` and yet leave the agent with no way to know they belong together.

    Table structure recognition (TSR) recovers the grid. The two dominant formulations:

  1. Cell and line detection. Detect ruling lines and cell boundaries, then infer rows and columns from their intersections. Works on bordered tables; fails on borderless ones held together only by whitespace.
  2. Structure as a sequence or graph. Models predict the table as structured markup (rows, columns, spans) directly from the image, or predict a graph linking cells to their row and column headers. This handles borderless tables, merged cells, and spanning headers, which the line-detection approach cannot.


    The output that matters is the header-to-value binding: every value cell tagged with the row header and column header that govern it. Once you have that, you can serialize a cell as a self-contained, searchable record like `Q3 revenue 1,240`, and an agent can filter or retrieve on it without reconstructing the table in its head. Forms are the same problem in disguise: key-value extraction binds a field label ("Invoice Number") to its value, so the agent can answer "what is the invoice number" with a span instead of a guess.
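
    Once the grid is recovered, the binding itself is mechanical. This sketch assumes the simplest case: one header row, one header column, and no merged or spanning cells. The function name, the record keys, and the serialization format are illustrative assumptions.

```python
def table_to_records(grid):
    """Turn a recovered grid into one record per value cell, each carrying
    the row header and column header that govern it."""
    column_headers = grid[0][1:]
    records = []
    for row in grid[1:]:
        row_header, values = row[0], row[1:]
        for column_header, value in zip(column_headers, values):
            records.append({"row": row_header, "column": column_header, "value": value})
    return records


def serialize(record):
    """Flatten one record into a self-contained, searchable string."""
    return f"{record['row']} {record['column']} {record['value']}"


grid = [
    ["",   "revenue", "costs"],
    ["Q2", "1,105",   "870"],
    ["Q3", "1,240",   "910"],
]
records = table_to_records(grid)
print(records[2])             # {'row': 'Q3', 'column': 'revenue', 'value': '1,240'}
print(serialize(records[2]))  # Q3 revenue 1,240
```

    Each record can be indexed on its own, so a filter on `row == "Q3"` and `column == "revenue"` finds the value without the agent ever seeing the rest of the table.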

    Confidence Is a First-Class Signal, Not an Afterthought



    Every stage emits confidence, and most pipelines throw it away. That is a mistake for agentic search. Detection confidence tells you whether a region is really text. Recognition confidence tells you whether a string is trustworthy enough to index as-is. A low-confidence run on a scanned form is the difference between "index this and let users find wrong answers" and "flag this page for re-scan or human review."

    Carry confidence through to the index and expose it to the agent. An agent that knows a transcribed clause came from a 0.55-confidence region on a skewed scan can hedge, ask for the source image, or weight that evidence less, instead of asserting a garbled number as fact. Treating OCR confidence the way you would treat any other calibrated similarity score is what turns a brittle transcription into trustworthy evidence.
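
    A minimal sketch of acting on recognition confidence at index time. The two thresholds are placeholder assumptions to be calibrated against labeled samples from your own scans, and the bucket names are hypothetical.

```python
def triage(regions, trust=0.90, reject=0.60):
    """Sort recognized regions into three buckets by recognition confidence.

    regions: iterable of dicts with "text" and "confidence" keys.
    """
    decisions = {"index": [], "index_with_flag": [], "review": []}
    for region in regions:
        confidence = region["confidence"]
        if confidence >= trust:
            bucket = "index"            # trustworthy enough to index as-is
        elif confidence >= reject:
            bucket = "index_with_flag"  # index, but expose the doubt to the agent
        else:
            bucket = "review"           # re-scan or human review
        decisions[bucket].append(region["text"])
    return decisions


regions = [
    {"text": "Invoice Number: 48213", "confidence": 0.97},
    {"text": "Net 30 payment terms", "confidence": 0.72},
    {"text": "Tota1 due: 1OO.00", "confidence": 0.55},
]
print(triage(regions))
# {'index': ['Invoice Number: 48213'],
#  'index_with_flag': ['Net 30 payment terms'],
#  'review': ['Tota1 due: 1OO.00']}
```

    The middle bucket is the one that matters for agents: the text is still searchable, but the confidence travels with it so the agent can hedge.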

    When To Skip OCR Entirely



    Understanding the pipeline also tells you when to abandon it. If your documents are dense with tables, charts, diagrams, mixed layouts, or handwriting, the cascade in this guide compounds errors at every stage, and a vision-language retriever that embeds the page image directly can outperform the whole chain by never decomposing the page into fragile intermediate text. That approach (ColPali-style late interaction over page images) is covered in depth in Visual Document Retrieval. The honest framing: OCR and visual retrieval are complementary. OCR gives you exact, filterable, machine-readable text (essential for codes, amounts, and compliance lookups); visual retrieval gives you robust semantic recall on messy pages. Production document agents usually run both and fuse them.
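
    One common way to fuse two ranked lists whose scores are not comparable is reciprocal rank fusion (RRF), which uses only each item's rank. The sketch below is a generic implementation, not a description of any particular product's internals; `k=60` is the constant conventionally used with RRF, and the page ids are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids. Each list contributes 1 / (k + rank)
    for every id it contains, with rank starting at 1."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


ocr_hits = ["p3", "p1", "p7"]     # ranked by text match
visual_hits = ["p1", "p9", "p3"]  # ranked by page-image similarity
print(reciprocal_rank_fusion([ocr_hits, visual_hits]))
# ['p1', 'p3', 'p9', 'p7']
```

    Pages found by both retrievers rise to the top, and a page found by only one still survives, which is the behavior you want when OCR and visual retrieval fail on different pages.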

    Doing This in Mixpeek



    For an agent, the durable goal is that document text enters the index already detected, recognized, ordered, and structure-aware, with confidence attached, so the retriever and the agent never inherit a scrambling bug they cannot see. In Managed Mixpeek that is a feature-extractor choice on the collection rather than a hand-built four-stage pipeline: declare an OCR feature for exact text and a visual document feature for layout-robust recall, and let the retriever fuse them.

    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="mxp_sk_...")

    collection = mx.collections.create(
        collection_name="contracts_and_invoices",
        source={"type": "bucket", "uri": "s3://legal-docs"},
        feature_extractors=[
            # exact, structure-aware text: detection + recognition + reading order + tables
            {"feature": "ocr", "model": "PaddlePaddle/paddleocr"},
            # layout-robust semantic recall over the page image itself
            {"feature": "visual_document_embedding", "model": "vidore/colqwen2-v1.0"},
        ],
    )

    retriever = mx.retrievers.create(
        collection_id=collection.id,
        stages=[
            # exact lookups (invoice numbers, amounts, clause text) hit OCR text
            {"type": "feature_search", "feature": "ocr", "top_k": 100},
            # semantic / messy-page recall hits the page-image embedding
            {"type": "feature_search", "feature": "visual_document_embedding", "top_k": 100},
            # merge by rank so the two scales never get compared directly
            {"type": "rank_fusion", "method": "rrf"},
        ],
    )

    results = mx.retrievers.execute(
        retriever_id=retriever.id,
        query="net 30 payment terms on the Q3 invoice",
        return_fields=["doc_id", "page", "ocr_text", "ocr_confidence", "bbox"],
    )


    Returning the source bounding box and OCR confidence alongside the text is what lets the agent ground a claim back to a region of the page instead of asserting a transcription it cannot verify. If you already run your own OCR and only need search, you can push the recognized text and its embeddings into MVS, Mixpeek's vector store on object storage, and keep the detection metadata as payload so structure and confidence travel with every result.

    Further Reading



  1. Visual Document Retrieval: How AI Agents Search Documents Without OCR -- when to embed the page image instead of decoding it
  2. Structured Extraction from Unstructured Documents -- turning recognized text and tables into queryable fields
  3. Multimodal Chunking Strategies -- why reading order has to be right before you chunk
  4. Calibrating Similarity Scores -- treating OCR confidence as a first-class, calibrated signal
  5. Managed Mixpeek
