A Picture of Text Can Be Cheaper Than the Text
Here is a result that sounds backwards. Take a page of a document, roughly a thousand words. Tokenize it as text and a modern LLM sees on the order of 1,000 to 1,500 tokens. Now render that same page as an image and hand it to a good vision encoder. The encoder can represent the whole page in about 100 continuous vision tokens, and a decoder can read the text back at roughly 97% accuracy. Vision, long treated as the *expensive* modality an agent pays extra to look at, turns out to be a strong compressor of text.
This is the core claim behind optical context compression (also called contexts optical compression), the direction popularized by DeepSeek-OCR in late 2025 and extended by the document-VLM models that followed. It reframes OCR from "convert pixels to characters" into "use pixels as a dense, decodable representation of language," and it gives agents a new knob for the oldest problem in long-context work: the context window is small and attention is quadratic.
This guide is about the idea, the architecture that makes it work, and where it stops working. The concepts are vendor-neutral. Mixpeek shows up at the end as one place the technique is already in production.
Why Text Tokens Are the Bottleneck
Every token an agent reads costs twice. It costs compute and memory, because self-attention is O(n^2) in sequence length, so doubling the context roughly quadruples the attention compute. And it costs *budget*, because the context window is a hard ceiling: a 100-page contract, a quarter of financial filings, or a week of meeting transcripts simply does not fit, so the agent has to chunk, summarize, or drop material and hope the retrieval step guessed right about what mattered.
The industry response has mostly been to make the window bigger and the attention cheaper (sparse attention, KV-cache tricks, retrieval). Optical compression attacks the problem from the other side: make each token carry more content. If one vision token can stand in for ten text tokens with acceptable fidelity, a 10-page document fits in the token budget of a single page of text.
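The arithmetic behind both costs is easy to check. The sketch below uses illustrative numbers only (about 1,000 text tokens per page, a 10x compression ratio, a 10,000-token budget); the function names are made up for this example, and the quadratic cost model ignores everything except the attention term.

```python
def attention_cost_ratio(n_before: int, n_after: int) -> float:
    """Relative self-attention compute, assuming cost scales as n^2."""
    return (n_after / n_before) ** 2


def pages_in_budget(budget_tokens: int, tokens_per_page: int) -> int:
    """Whole pages that fit in a fixed token budget."""
    return budget_tokens // tokens_per_page


# Illustrative assumptions, not measurements.
TEXT_TOKENS_PER_PAGE = 1_000
COMPRESSION = 10  # text tokens represented by one vision token
VISION_TOKENS_PER_PAGE = TEXT_TOKENS_PER_PAGE // COMPRESSION

BUDGET = 10_000
print(pages_in_budget(BUDGET, TEXT_TOKENS_PER_PAGE))    # 10 pages as text
print(pages_in_budget(BUDGET, VISION_TOKENS_PER_PAGE))  # 100 pages as vision tokens
print(attention_cost_ratio(1_000, 2_000))               # 4.0: doubling n quadruples cost
```

The same budget covers ten times as many pages, and because attention cost is quadratic, the attention term for any single page shrinks by far more than 10x.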
The Core Idea: Optical 2D Mapping
Text is a 1D stream of tokens. A document is inherently 2D: characters have positions, lines have order, tables have rows and columns. When you tokenize text you throw the 2D structure away and pay per character-chunk. When you *render* text into an image, you keep the 2D layout for free and hand the compression problem to a vision encoder that was built to summarize dense pixels into a compact feature set.
The pipeline is three stages:
1. Render or scan the text as a 2D image (a page, a slide, a screenshot).
2. Encode the image into a small set of continuous vision tokens with a vision encoder tuned for high compression.
3. Decode those vision tokens back into text (or structured markup) with a language model.
Step 2 is where the savings live. The number of vision tokens is a design choice, and the compression ratio is simply the text-token count divided by that number, so every token you remove raises it. The interesting engineering question is: how do you keep the token count tiny while still processing a high-resolution page, without the encoder's activations exploding?
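As a structural sketch, the three stages can be wired together as plain callables. Everything here is hypothetical scaffolding: `render`, `encode`, and `decode` stand in for a real renderer, vision encoder, and language-model decoder, and the `Image` and `VisionTokens` aliases are placeholders rather than any library's types.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Placeholder types (assumptions for illustration).
Image = bytes
VisionTokens = Sequence[Sequence[float]]  # one embedding vector per vision token


@dataclass
class RoundTrip:
    decoded_text: str
    vision_tokens: int
    compression_ratio: float  # text tokens / vision tokens


def optical_round_trip(
    text: str,
    text_token_count: int,
    render: Callable[[str], Image],
    encode: Callable[[Image], VisionTokens],
    decode: Callable[[VisionTokens], str],
) -> RoundTrip:
    image = render(text)      # stage 1: text -> 2D image
    tokens = encode(image)    # stage 2: image -> a few continuous tokens
    if len(tokens) == 0:
        raise ValueError("encoder returned no vision tokens")
    decoded = decode(tokens)  # stage 3: tokens -> text or markup
    return RoundTrip(decoded, len(tokens), text_token_count / len(tokens))
```

Keeping the stages as injected callables makes the token count from stage 2 the one number to measure when comparing encoders, since it sets the compression ratio directly.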
How the Encoder Achieves High Compression
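DeepSeek-OCR's answer is an encoder called DeepEncoder, about 250M parameters, built to stay cheap at high input resolution. It cascades two well-known vision backbones with a compressor between them: a SAM-style stage that runs windowed (local) attention over the many raw patch tokens, a convolutional compressor that cuts the token count (16x in the paper), and a CLIP-style stage that runs global attention over the few tokens that remain.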
The pattern is the lesson: spend local attention on the many raw tokens, spend global attention only on the few compressed ones. That ordering is what lets a 250M-parameter encoder handle a dense page and still emit around 100 to 800 vision tokens. The decoder in DeepSeek-OCR is a small mixture-of-experts LLM (3B total, ~570M active) that reconstructs text from those tokens.
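A rough cost model shows why the ordering matters. The sketch below assumes square inputs, 16-pixel patches, a 16x token compressor, and a 14x14 local attention window; these values are assumptions chosen so that a 640-pixel page lands near 100 tokens, and the cost figures count token-pair interactions only, not real FLOPs.

```python
def patch_tokens(side_px: int, patch_px: int = 16) -> int:
    """Raw patch tokens for a square image cut into patch_px x patch_px patches."""
    per_side = side_px // patch_px
    return per_side * per_side


def cascade_cost(side_px: int, patch_px: int = 16, window: int = 196,
                 downsample: int = 16) -> tuple[int, int, int, int]:
    """Return (raw tokens, compressed tokens, cascade cost, naive cost)."""
    raw = patch_tokens(side_px, patch_px)
    compressed = raw // downsample
    local_cost = raw * window      # each raw token attends only within its window
    global_cost = compressed ** 2  # global attention over the few compressed tokens
    naive_cost = raw ** 2          # global attention over every raw token
    return raw, compressed, local_cost + global_cost, naive_cost


for side in (640, 1024):
    raw, compressed, cascade, naive = cascade_cost(side)
    print(side, raw, compressed, f"{naive / cascade:.1f}x cheaper than naive")
```

Under these assumptions a 640-pixel page yields 1,600 raw patches and 100 output tokens, and a 1024-pixel page yields 4,096 and 256. The gap between the cascade and naive global attention widens as resolution grows, because only the naive term is quadratic in the raw token count.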
The Compression-versus-Precision Curve
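Optical compression is lossy, and the loss is tunable. The reported curve is the headline number to remember: the DeepSeek-OCR paper reports roughly 97% decoding precision while the compression ratio (text tokens per vision token) stays under about 10x, falling to around 60% at 20x.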
"Precision" here is an OCR-style edit-distance accuracy: how faithfully the decoder reproduces the original characters. The practical reading is that there is a knob, and for most documents you can sit comfortably in the near-lossless zone while still cutting tokens by an order of magnitude. Push past 10x and you trade fidelity for reach.
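One common way to compute a character-level score of this kind is one minus the normalized Levenshtein distance. The sketch below is an illustration under that assumption; the exact metric and normalization used in any given paper or benchmark may differ, and `ocr_precision` is a name invented for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def ocr_precision(reference: str, decoded: str) -> float:
    """1 - edit_distance / len(reference), clamped to [0, 1]."""
    if not reference:
        return 1.0 if not decoded else 0.0
    return max(0.0, 1.0 - levenshtein(reference, decoded) / len(reference))


print(ocr_precision("optical compression", "optical compression"))  # 1.0
print(ocr_precision("abcd", "abxd"))                                # 0.75
```

A score like this is character-faithful but blind to what the wrong characters mean: a single swapped digit in an invoice total costs the same as a swapped letter in a filler word.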
The Benchmarks That Made People Notice
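Two comparisons on OmniDocBench (a document-parsing benchmark) are what turned this from a curiosity into a trend: DeepSeek-OCR is reported to surpass GOT-OCR2.0 (256 tokens per page) using only 100 vision tokens, and to outperform MinerU2.0 (more than 6,000 tokens per page on average) using fewer than 800.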
An order of magnitude fewer tokens at equal or better parsing quality is the kind of result that changes how people budget context. The 2026 document-VLM models extend the line: Baidu's Unlimited-OCR, for instance, applies the same compressed-vision-token approach to *multi-page* one-shot parsing with a 32K-token decoder context, so a whole PDF is read in a single pass instead of page by page.
The Memory Analogy: Tiered Optical Compression
The most agent-relevant framing is a memory hierarchy. Human memory does not store last year at the same fidelity as last hour; older, less-relevant context gets compressed and blurred. Optical compression gives an agent the same lever: render recent or high-value context at low compression (crisp, ~97%), and older or lower-value context at high compression (blurry, cheaper). Instead of a hard cutoff where old context is dropped, you get a smooth decay where old context stays present but costs less. That maps cleanly onto how an agent should manage a long-running session or a large corpus it revisits.
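A minimal sketch of that decay, assuming context is tracked as pages with an age in turns: the tier boundaries and compression ratios below are hypothetical values picked for illustration, not settings from any published system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    max_age_turns: int  # tier applies to pages at most this many turns old
    compression: int    # text tokens represented by one vision token


# Hypothetical tiers: crisp for recent pages, blurrier as they age.
TIERS = (Tier(5, 5), Tier(50, 10), Tier(10**9, 20))


def compression_for(age_turns: int) -> int:
    """Pick the compression ratio of the first tier that covers this age."""
    for tier in TIERS:
        if age_turns <= tier.max_age_turns:
            return tier.compression
    return TIERS[-1].compression


def session_cost(pages: list[tuple[int, int]]) -> int:
    """Total vision tokens for pages given as (age_turns, text_tokens)."""
    total = 0
    for age, text_tokens in pages:
        total += -(-text_tokens // compression_for(age))  # ceiling division
    return total


# Three 1,000-token pages: fresh, middle-aged, old.
print(session_cost([(1, 1000), (20, 1000), (500, 1000)]))  # 200 + 100 + 50 = 350
```

Nothing is dropped: the oldest page still occupies 50 tokens and can be decoded, just at lower fidelity. A real policy would likely weigh relevance as well as age when choosing a tier.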
What This Means for Agents That Have to Read
Three concrete implications for anything that needs to *see* text in the wild:
1. Longer documents fit. The same context budget now covers roughly an order of magnitude more pages. Contracts, filings, manuals, and long transcripts-rendered-as-pages stop needing aggressive pre-chunking.
2. OCR and compression are one pass. You no longer run a detector, then a recognizer, then a separate summarizer. A single vision-language model turns a page image into compact tokens that are already decodable to text and markup. Fewer stages, fewer places to lose layout.
3. Layout survives because it is spatial. Tables, reading order, headers, and figure captions are 2D facts. Because the representation starts as an image, the model can preserve them in the decoded markup, which is exactly what downstream chunking and section-aware retrieval need.
Limits and Open Questions
This is not a free lunch, and treating it like one will burn you:
Where This Lands in Practice: Mixpeek
Optical compression is the reason "OCR" and "document understanding" have merged into a single vision-language step, and it is how Mixpeek's OCR extractor turns document images into searchable, layout-preserving text. Point a Managed collection at a bucket of PDFs and Mixpeek runs a document VLM in this lineage, for example Baidu's Unlimited-OCR for one-shot multi-page parsing, or PaddleOCR-VL, recovering tables, reading order, and headings rather than a flat character dump. The recovered text is then embedded and indexed, so an agent searches *what the document says*, including its structure, at a fraction of the token cost of stuffing raw pages into a prompt.
If you are building the retrieval side of this, the companion guides on structured extraction from unstructured documents, visual document retrieval, and multimodal chunking strategies pick up where this one leaves off. The short version: read the page as an image, compress it into vision tokens, decode structure not just characters, and let the agent search the result.