Optical Context Compression: Reading Documents as Images, Not Text

A Picture of Text Can Be Cheaper Than the Text

Here is a result that sounds backwards. Take a page of a document, roughly a thousand words. Tokenize it as text and a modern LLM sees on the order of 1,000 to 1,500 tokens. Now render that same page as an image and hand it to a good vision encoder. The encoder can represent the whole page in about 100 continuous vision tokens, and a decoder can read the text back at roughly 97% accuracy. Vision, long treated as the *expensive* modality an agent pays extra to look at, turns out to be a strong compressor of text.

This is the core claim behind optical context compression (also called contexts optical compression), the direction popularized by DeepSeek-OCR in late 2025 and extended by the document-VLM models that followed. It reframes OCR from "convert pixels to characters" into "use pixels as a dense, decodable representation of language," and it gives agents a new knob for the oldest problem in long-context work: the context window is small and attention is quadratic.

This guide is about the idea, the architecture that makes it work, and where it stops working. The concepts are vendor-neutral. Mixpeek shows up at the end as one place the technique is already in production.

Why Text Tokens Are the Bottleneck

Every token an agent reads costs twice. It costs compute, because self-attention is O(n²) in sequence length, so doubling the context roughly quadruples the attention work. And it costs *budget*, because the context window is a hard ceiling: a 100-page contract, a quarter's worth of financial filings, or a week of meeting transcripts simply does not fit, so the agent has to chunk, summarize, or drop material and hope the retrieval step guessed right about what mattered.

The industry response has mostly been to make the window bigger and the attention cheaper (sparse attention, KV-cache tricks, retrieval). Optical compression attacks the problem from the other side: make each token carry more content. If one vision token can stand in for ten text tokens with acceptable fidelity, a 10-page document fits in the token budget of a single page.
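The arithmetic behind that claim can be sketched in a few lines. The per-page figures are illustrative assumptions based on the rough numbers in this guide (about 1,000 text tokens or about 100 vision tokens per page), the 32,000-token budget is arbitrary, and `pages_that_fit` is a hypothetical helper rather than part of any library.

```python
def pages_that_fit(budget_tokens: int, tokens_per_page: int) -> int:
    """Number of whole pages that fit in a fixed token budget."""
    if tokens_per_page <= 0:
        raise ValueError("tokens_per_page must be positive")
    return budget_tokens // tokens_per_page


# Illustrative assumptions, not measurements.
TEXT_TOKENS_PER_PAGE = 1_000
VISION_TOKENS_PER_PAGE = 100
BUDGET = 32_000

as_text = pages_that_fit(BUDGET, TEXT_TOKENS_PER_PAGE)
as_vision = pages_that_fit(BUDGET, VISION_TOKENS_PER_PAGE)
print(f"as text: {as_text} pages, as vision tokens: {as_vision} pages")
```

The ratio between the two results is simply the compression ratio, which is why the rest of this guide focuses on how far that ratio can be pushed before fidelity suffers.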

The Core Idea: Optical 2D Mapping

Text is a 1D stream of tokens. A document is inherently 2D: characters have positions, lines have order, tables have rows and columns. When you tokenize text you throw the 2D structure away and pay per character-chunk. When you *render* text into an image, you keep the 2D layout for free and hand the compression problem to a vision encoder that was built to summarize dense pixels into a compact feature set.

The pipeline is three stages:

1. Render or scan the text as a 2D image (a page, a slide, a screenshot).
2. Encode the image into a small set of continuous vision tokens with a vision encoder tuned for high compression.
3. Decode those vision tokens back into text (or structured markup) with a language model.
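As a shape for those three stages, here is a minimal wiring sketch. Every name in it (`OpticalPipeline`, `PageImage`, `VisionTokens`) is a hypothetical placeholder, not an API from DeepSeek-OCR or any other library; the real renderer, encoder, and decoder are whatever components you plug in.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical stand-in types for illustration only.
PageImage = List[List[int]]            # toy grayscale bitmap: rows of pixel values
VisionTokens = List[Sequence[float]]   # one continuous vector per vision token


@dataclass
class OpticalPipeline:
    render: Callable[[str], PageImage]           # stage 1: text -> image
    encode: Callable[[PageImage], VisionTokens]  # stage 2: image -> vision tokens
    decode: Callable[[VisionTokens], str]        # stage 3: vision tokens -> text

    def compress(self, text: str) -> VisionTokens:
        """Stages 1 and 2: the compact form that goes into the context."""
        return self.encode(self.render(text))

    def roundtrip(self, text: str) -> str:
        """All three stages: what the decoder reads back."""
        return self.decode(self.compress(text))
```

Keeping `compress` separate from `roundtrip` reflects how the stages are used: the vision tokens are what occupy the context window, and decoding back to text happens only when something downstream needs characters.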

Step 2 is where the savings live. The number of vision tokens is a design choice, and pushing it down is exactly the compression ratio. The interesting engineering question is: how do you keep the token count tiny while still processing a high-resolution page, without the encoder's activations exploding?

How the Encoder Achieves High Compression

DeepSeek-OCR's answer is an encoder called DeepEncoder, about 250M parameters, built to stay cheap at high input resolution. It cascades two well-known vision backbones with a compressor between them:

A SAM-style ViT first, using window attention. High-resolution pages produce a lot of patch tokens. Window (local) attention keeps the cost of that first stage roughly linear in the number of patches, so you can feed a full page without a quadratic blowup. This stage does the fine-grained "see the ink" work.

A 16x convolutional compressor at the bridge. Between the two backbones, a convolutional downsampler cuts the token count by 16x. This is the step that turns thousands of patch tokens into hundreds.

A CLIP-style ViT second, using global attention. Global (all-to-all) attention is expensive, so you only pay for it on the *already-compressed* handful of tokens. This stage does the "understand the whole page" work, tying distant regions together.

The pattern is the lesson: spend local attention on the many raw tokens, spend global attention only on the few compressed ones. That ordering is what lets a 250M-parameter encoder handle a dense page and still emit around 100 to 800 vision tokens. The decoder in DeepSeek-OCR is a small mixture-of-experts LLM (3B total, ~570M active) that reconstructs text from those tokens.
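A back-of-the-envelope sketch shows why that ordering pays off. The 16x downsampling factor comes from the description above; the 16-pixel patch size and the 196-token attention window are assumptions made for illustration, and both helper functions are hypothetical.

```python
from typing import Optional, Tuple


def encoder_token_counts(side_px: int, patch_px: int = 16,
                         downsample: int = 16) -> Tuple[int, int]:
    """Return (patch_tokens, vision_tokens) for a square side_px image."""
    if side_px % patch_px != 0:
        raise ValueError("side_px must be a multiple of patch_px")
    per_side = side_px // patch_px
    patch_tokens = per_side * per_side
    return patch_tokens, patch_tokens // downsample


def attention_pairs(tokens: int, window: Optional[int] = None) -> int:
    """Query-key pairs scored by one attention layer.

    Global attention (window=None) scores every pair of tokens. Window
    attention scores only pairs inside a window of `window` tokens, so
    its cost grows linearly with the token count.
    """
    if window is None:
        return tokens * tokens
    return tokens * min(window, tokens)


for side in (640, 1024):
    patches, vision = encoder_token_counts(side)
    print(f"{side}px: {patches} patch tokens -> {vision} vision tokens")
    print(f"  window attention on patches: {attention_pairs(patches, window=196):,} pairs")
    print(f"  global attention on patches: {attention_pairs(patches):,} pairs")
    print(f"  global attention on vision:  {attention_pairs(vision):,} pairs")
```

Under these assumptions a 640px page yields 1,600 patch tokens and 100 vision tokens, in line with the "about 100" figure quoted earlier, and global attention over the compressed tokens scores a small fraction of the pairs it would over the raw patches.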

The Compression-versus-Precision Curve

Optical compression is lossy, and the loss is tunable. The reported curve is the headline number to remember:

At a compression ratio under 10x (text tokens are fewer than 10 times the vision tokens), decoding precision is about 97%. This is the safe operating range: near-lossless, big savings.

At 20x compression, precision drops to roughly 60%. Still useful for gist or search, not for exact reconstruction.
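To make the two operating points concrete, here is a minimal sketch that computes a compression ratio, labels it by zone, and scores a decoded string against its reference. The zone labels are hypothetical. The accuracy function uses a plain Levenshtein edit distance normalized by the longer string, which is one common way to score OCR output; the exact normalization behind the reported figures is an assumption here. No precision values are interpolated between the two reported points.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Text tokens represented per vision token."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


def operating_zone(ratio: float) -> str:
    """Label a ratio using only the two reported points (under 10x, 20x)."""
    if ratio < 10:
        return "near-lossless"   # about 97% precision reported
    if ratio < 20:
        return "degrading"       # between the reported points
    return "gist-only"           # roughly 60% precision reported at 20x


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def char_accuracy(reference: str, decoded: str) -> float:
    """1.0 means an exact match; lower means more character errors."""
    if not reference and not decoded:
        return 1.0
    longest = max(len(reference), len(decoded))
    return 1.0 - levenshtein(reference, decoded) / longest


ratio = compression_ratio(text_tokens=1_200, vision_tokens=100)
print(ratio, operating_zone(ratio))
print(char_accuracy("optical compression", "optical compresion"))
```

In practice you would run `char_accuracy` over a sample of your own pages at each candidate token budget, since the reported curve describes the benchmark documents rather than yours.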

"Precision" here is an OCR-style edit-distance accuracy: how faithfully the decoder reproduces the original characters. The practical reading is that there is a knob, and for most documents you can sit comfortably in the near-lossless zone while still cutting tokens by an order of magnitude. Push past 10x and you trade fidelity for reach.

The Benchmarks That Made People Notice

Two comparisons on OmniDocBench (a document-parsing benchmark) are what turned this from a curiosity into a trend:

DeepSeek-OCR matched or beat GOT-OCR2.0, which uses 256 tokens per page, while using only about 100 vision tokens.

It also outperformed MinerU2.0, which averages 6,000+ tokens per page, while using fewer than 800 vision tokens.

Equal or better parsing quality at a fraction of the tokens (roughly 2.5x fewer than GOT-OCR2.0 and more than 7x fewer than MinerU2.0) is the kind of result that changes how people budget context. The 2026 document-VLM models extend the line: Baidu's Unlimited-OCR, for instance, applies the same compressed-vision-token approach to *multi-page* one-shot parsing with a 32K-token decoder context, so a whole PDF is read in a single pass instead of page by page.

The Memory Analogy: Tiered Optical Compression

The most agent-relevant framing is a memory hierarchy. Human memory does not store last year at the same fidelity as last hour; older, less-relevant context gets compressed and blurred. Optical compression gives an agent the same lever: render recent or high-value context at low compression (crisp, ~97%), and older or lower-value context at high compression (blurry, cheaper). Instead of a hard cutoff where old context is dropped, you get a smooth decay where old context stays present but costs less. That maps cleanly onto how an agent should manage a long-running session or a large corpus it revisits.
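One way to picture the tiering is a small policy that maps the age of a piece of context to a per-page vision-token budget. The tier names, age thresholds, and token budgets below are all hypothetical values chosen for illustration; none of them are recommendations from the source.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Tier:
    name: str
    max_age_turns: int   # inclusive upper bound on age for this tier
    vision_tokens: int   # per-page token budget at this tier


# Hypothetical tiers: crisp for recent context, blurrier as it ages.
TIERS = (
    Tier("recent", 10, 400),
    Tier("older", 100, 100),
    Tier("archive", 10**9, 64),
)


def tier_for(age_turns: int, pinned: bool = False) -> Tier:
    """Pick a tier by age; pinned (high-value) pages always stay crisp."""
    if pinned:
        return TIERS[0]
    for tier in TIERS:
        if age_turns <= tier.max_age_turns:
            return tier
    return TIERS[-1]


def session_cost(ages: Iterable[int]) -> int:
    """Total vision tokens for one page at each of the given ages."""
    return sum(tier_for(age).vision_tokens for age in ages)


print(tier_for(3).name, tier_for(50).name, tier_for(5_000).name)
print(session_cost([1, 50, 5_000]))
```

The `pinned` flag is the "high-value" half of the idea: age alone should not blur a page the agent keeps returning to.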

What This Means for Agents That Have to Read

Three concrete implications for anything that needs to *see* text in the wild:

1. Longer documents fit. The same context budget now covers roughly an order of magnitude more pages. Contracts, filings, manuals, and long transcripts rendered as pages stop needing aggressive pre-chunking.
2. OCR and compression are one pass. You no longer run a detector, then a recognizer, then a separate summarizer. A single vision-language model turns a page image into compact tokens that are already decodable to text and markup. Fewer stages, fewer places to lose layout.
3. Layout survives because it is spatial. Tables, reading order, headers, and figure captions are 2D facts. Because the representation starts as an image, the model can preserve them in the decoded markup, which is exactly what downstream chunking and section-aware retrieval need.

Limits and Open Questions

This is not a free lunch, and treating it like one will burn you:

It is lossy above ~10x. For anything requiring exact text (legal quoting, code, numbers you will do math on), stay in the near-lossless zone or keep the source.

Reasoning over compressed text is unproven at the edges. Reconstructing characters is not the same as reasoning over them; heavy compression may hurt tasks that need every token.

Rendering quality matters. Scan noise, tiny fonts, and dense multi-column layouts push you up the error curve. Resolution and page-tiling choices are real hyperparameters (models expose modes like a 640px "fast" pass versus a 1024px "base" pass).

It is overkill for short text. If the content already fits comfortably as text tokens, rendering it to pixels just adds an encoder. The win shows up at length.
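Those caveats can be folded into a simple routing rule that runs before anything is rendered. This is a sketch under stated assumptions: the `Plan` type, the `plan_for` function, and the 2,000-token "short text" threshold are hypothetical, and the 10x and 20x caps simply reuse the two reported operating points.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Plan:
    use_optical: bool
    max_ratio: Optional[float]  # cap on compression ratio, if optical
    reason: str


def plan_for(text_tokens: int, needs_exact_text: bool,
             gist_only: bool = False,
             short_text_threshold: int = 2_000) -> Plan:
    """Decide whether to render content as an image, and how hard to compress."""
    if text_tokens <= short_text_threshold:
        return Plan(False, None, "short text: the encoder adds cost without a win")
    if needs_exact_text:
        return Plan(True, 10.0, "exact text needed: stay under 10x and keep the source")
    if gist_only:
        return Plan(True, 20.0, "gist or search only: heavier compression is acceptable")
    return Plan(True, 10.0, "default: stay in the near-lossless zone")


print(plan_for(500, needs_exact_text=False))
print(plan_for(80_000, needs_exact_text=True))
print(plan_for(80_000, needs_exact_text=False, gist_only=True))
```

Rendering quality is deliberately left out of this rule, since scan noise and font size are properties of the page image and are better handled by measuring decoded accuracy on samples.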

Where This Lands in Practice: Mixpeek

Optical compression is the reason "OCR" and "document understanding" have merged into a single vision-language step, and it is how Mixpeek's OCR extractor turns document images into searchable, layout-preserving text. Point a Managed collection at a bucket of PDFs and Mixpeek runs a document VLM in this lineage, for example Baidu's Unlimited-OCR for one-shot multi-page parsing, or PaddleOCR-VL, recovering tables, reading order, and headings rather than a flat character dump. The recovered text is then embedded and indexed, so an agent searches *what the document says*, including its structure, at a fraction of the token cost of stuffing raw pages into a prompt.

If you are building the retrieval side of this, the companion guides on structured extraction from unstructured documents, visual document retrieval, and multimodal chunking strategies pick up where this one leaves off. The short version: read the page as an image, compress it into vision tokens, decode structure not just characters, and let the agent search the result.

