How Vision-Language Models Fuse Image and Text Tokens

The Component That Lets an Agent See

When an AI agent describes a video frame, answers a question about a screenshot, or reads a chart inside a document, the work is done by a vision-language model (VLM). A VLM is the bridge between two worlds that do not naturally speak the same language: an image is a grid of pixels, and a transformer language model only consumes a sequence of token embeddings. The entire job of a VLM architecture is to convert what a camera saw into tokens the language model can attend to, alongside the words of a prompt, so the model can reason over both at once.

Most teams treat the VLM as a black box: image in, caption out. That works until it does not, and when it fails the failures are specific and architectural. The model invents text that is not in the image, misses small objects, scrambles a table, or ignores half a high-resolution page. Every one of those failures traces back to a concrete decision in how the model turns pixels into tokens and how it fuses them with text. This guide opens the box.

The Three-Part Skeleton

Almost every modern VLM, regardless of vendor, is built from three parts wired in series.

   image ──> [ Vision Encoder ] ──> patch features
                                        |
                                        v
                              [ Projector / Connector ]
                                        |
                                        v
   text  ──> [ Tokenizer ] ──> word tokens ──┐
                                              v
                                   [ Language Model ] ──> output text

1. Vision encoder. A pretrained image model (usually a Vision Transformer) that turns the image into a grid of feature vectors, one per image patch.

2. Projector (also called connector or adapter). A small module that maps those visual features into the language model's embedding space and, often, reduces how many of them there are.

3. Language model. A standard decoder-only transformer that consumes the projected visual tokens plus the text tokens and generates the answer.

The art is almost entirely in parts 2 and 3: how the visual features become tokens, and how those tokens get fused with text. But you cannot reason about fusion without first understanding what comes out of the encoder.

Step 1: The Vision Encoder Turns an Image Into Patches

A Vision Transformer does not look at an image pixel by pixel. It cuts the image into a grid of fixed-size patches, typically 14x14 or 16x16 pixels each, flattens each patch, and projects it into a vector. A 224x224 image at patch size 14 becomes a 16x16 grid, which is 256 patches, so 256 vectors. A 336x336 image at the same patch size becomes a 24x24 grid, or 576 patches. The encoder runs self-attention over this patch sequence so each patch vector ends up encoding not just its own pixels but its relationship to the rest of the image.
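The patch arithmetic above can be checked in a few lines. This is a minimal sketch that assumes a square image whose side is an exact multiple of the patch size; `patch_grid` is a hypothetical helper name, not part of any library.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches_per_side, total_patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


for size in (224, 336, 448):
    per_side, total = patch_grid(size, 14)
    print(f"{size}x{size} @ patch 14 -> {per_side}x{per_side} grid = {total} patches")
```

Going from 224 to 448 pixels per side doubles the resolution and takes the count from 256 to 1024 patches, which is the quadratic growth that drives token cost downstream.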

Two numbers from this step govern everything downstream:

Patch count = visual token count (before any reduction). This is how many tokens the image will cost in the language model's context. It grows with the square of the input resolution. Double the resolution and you roughly quadruple the patch count.

Encoder training objective. Most VLM vision encoders are initialized from a contrastive model like CLIP or SigLIP, because contrastive pretraining already aligns visual features with language-like semantics. That alignment is why a small projector is often enough to connect the two halves. (The loss that creates this alignment is covered in Contrastive Learning: How CLIP, SigLIP, and CLAP Work.)

The encoder's resolution is a hard ceiling on what the model can read. If a contract clause or a small UI label occupies fewer pixels than a patch can resolve, the information is gone before fusion ever happens. This is the root cause of most "the VLM cannot read fine print" complaints, and it is why high-resolution handling (below) matters so much.

Step 2: The Projector Turns Patches Into Language Tokens

The vision encoder outputs vectors in its own space and dimensionality. The language model expects tokens in its embedding space. The projector closes that gap, and the design choice here splits VLMs into families.

MLP projector (the LLaVA family). The simplest and now most common connector is a two-layer MLP applied to each patch feature independently. It maps each visual vector into the language model's embedding dimension and changes nothing about the count: 576 patches become 576 visual tokens. It is cheap to train and surprisingly strong, but it forwards every patch without reducing the count, so the image token cost stays high.
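As an illustration of "applied to each patch feature independently", here is a pure-Python sketch of a two-layer MLP projector. The class name, the GELU activation, the random initialization, and the toy widths (8 and 16) are all assumptions chosen so the example runs quickly without dependencies; production widths are in the thousands and the weights are learned.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(vec, weight, bias):
    """Compute weight @ vec + bias for a weight stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


class MLPProjector:
    """Two-layer MLP applied to each patch feature independently (hypothetical sketch)."""

    def __init__(self, vision_dim, llm_dim, seed=0):
        rng = random.Random(seed)
        self.w1 = random_matrix(llm_dim, vision_dim, rng)
        self.b1 = [0.0] * llm_dim
        self.w2 = random_matrix(llm_dim, llm_dim, rng)
        self.b2 = [0.0] * llm_dim

    def __call__(self, patch_features):
        tokens = []
        for feat in patch_features:  # no interaction between patches
            hidden = [gelu(h) for h in linear(feat, self.w1, self.b1)]
            tokens.append(linear(hidden, self.w2, self.b2))
        return tokens


# Toy widths keep the pure-Python loops fast.
rng = random.Random(1)
patches = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(576)]
projector = MLPProjector(vision_dim=8, llm_dim=16)
visual_tokens = projector(patches)
print(len(patches), "patches ->", len(visual_tokens), "tokens of width", len(visual_tokens[0]))
```

The width of each vector changes to match the language model, but the number of vectors does not, which is exactly why the token cost is unchanged.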

Query-based resampler (Q-Former, Perceiver Resampler). Instead of forwarding all patches, a resampler introduces a small fixed set of learned query vectors, say 32 or 64, that attend to the full patch grid through cross-attention and distill it into that fixed number of output tokens. The BLIP-2 Q-Former and the Flamingo Perceiver Resampler work this way. The win is a constant, small token budget regardless of image size; the cost is an information bottleneck. Whatever the queries fail to pull out is lost, which can hurt fine-grained reading tasks.

   576 patch features
        |  cross-attention
        v
   [ 64 learned queries ]  ──>  64 visual tokens   (resampler: fixed, lossy)

   576 patch features  ──> MLP per patch ──> 576 visual tokens  (MLP: full, expensive)

The trade is direct: a resampler buys you a cheap context footprint at the risk of dropping detail; an MLP keeps every patch at the cost of context budget. This is the same density-versus-cost tension that shows up everywhere in multimodal systems, including how you sample frames from video, discussed in Video Frame Sampling for Embeddings.
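The fixed token budget of a resampler is easiest to see in code. The sketch below is a deliberately stripped-down, single-head cross-attention with no key, value, or output projections and with random vectors standing in for learned queries; it is not the Q-Former or the Perceiver Resampler, only the pooling idea they share. The function name `resample` and the toy width are assumptions.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def resample(patch_features, queries):
    """Single-head cross-attention: each query pools over all patches."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, p)) for p in patch_features]
        weights = softmax(scores)
        pooled = [sum(w * p[i] for w, p in zip(weights, patch_features)) for i in range(dim)]
        outputs.append(pooled)
    return outputs


rng = random.Random(0)
dim = 8
# 64 random vectors stand in for the learned queries.
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(64)]

for num_patches in (576, 2304):
    patches = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
    tokens = resample(patches, queries)
    print(num_patches, "patches ->", len(tokens), "visual tokens")
```

Quadrupling the patch count leaves the output at 64 tokens. Each output is a weighted average over patches, so detail that no query weights heavily does not survive, which is the bottleneck described above.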

Step 3: Fusion, Where Vision Meets Language

Now you have a sequence of visual tokens and a sequence of text tokens. How the language model combines them is the fusion strategy, and there are two dominant approaches.

Prefix / early fusion (token concatenation)

The visual tokens are simply placed into the input sequence alongside the text tokens, and the whole thing is fed to the decoder. A typical layout puts the image tokens first, as a prefix to the prompt:

[BOS] <img_1> <img_2> ... <img_576> "What" "is" "on" "the" "sign" "?"
       \______ visual tokens ______/  \________ text tokens ________/

The language model's normal self-attention does the fusion: every text token can attend to every visual token, layer after layer. (In a standard causal decoder the reverse does not hold: visual tokens placed before the prompt cannot attend to the text that follows them.) This is the LLaVA approach, and it dominates today because it reuses the language model unchanged and lets the full depth of the model reason jointly over pixels and words. The cost is that visual tokens consume the context window directly, and self-attention is quadratic in sequence length, so a high-resolution image with thousands of visual tokens gets expensive fast.
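A short sketch makes the layout and the quadratic cost concrete. It assumes a decoder with an ordinary causal mask, where position i attends to positions 0 through i; both helper names are hypothetical, and the pair count is a rough proxy for attention work per head per layer rather than a measured cost.

```python
def build_prefix_sequence(num_visual_tokens, text_tokens):
    """Lay out [BOS], then all visual tokens, then the text tokens."""
    visual = [f"<img_{i}>" for i in range(1, num_visual_tokens + 1)]
    return ["[BOS]"] + visual + list(text_tokens)


def causal_attention_pairs(seq_len):
    """Number of (query, key) pairs under a causal mask: position i sees 0..i."""
    return seq_len * (seq_len + 1) // 2


prompt = ["What", "is", "on", "the", "sign", "?"]
for n_visual in (576, 5760):
    seq = build_prefix_sequence(n_visual, prompt)
    pairs = causal_attention_pairs(len(seq))
    print(len(seq), "tokens ->", pairs, "attention pairs per head per layer")
```

Ten times the visual tokens produces roughly a hundred times the attention pairs, even though the prompt itself is only six words long.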

Cross-attention / gated fusion

Instead of inserting visual tokens into the sequence, the architecture inserts new cross-attention layers into the language model. The text stream stays the main sequence, and at intervals it attends out to the visual features through dedicated cross-attention blocks, often gated so the model can learn how much to let vision influence each layer. Flamingo pioneered this. The benefit is that text sequence length is decoupled from image token count, so you can feed many images or video frames without blowing up the text context. The cost is added parameters and a more complex training recipe, and historically slightly weaker fine-grained grounding than full early fusion.

   text tokens ──> self-attn ──> [ gated cross-attn ] ──> ...
                                        ^
                                        | attends to
                                  visual features
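The gating idea can be sketched as a residual update of the form text + tanh(gate) * cross_attention(text, vision). The code below is an assumption-laden toy: single head, no learned projections, a scalar gate, and tiny hand-written vectors. It is meant to show the shape of the computation, not any particular model's layer.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(text_states, visual_features):
    """Each text position pools over the visual features (single head, no projections)."""
    dim = len(text_states[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for t in text_states:
        weights = softmax([scale * sum(a * b for a, b in zip(t, v)) for v in visual_features])
        attended.append([sum(w * v[i] for w, v in zip(weights, visual_features)) for i in range(dim)])
    return attended


def gated_cross_attention(text_states, visual_features, gate):
    """Residual update: text + tanh(gate) * cross_attention(text, vision)."""
    g = math.tanh(gate)
    attended = cross_attend(text_states, visual_features)
    return [[t_i + g * a_i for t_i, a_i in zip(t, a)] for t, a in zip(text_states, attended)]


text = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
vision = [[2.0, -1.0], [0.0, 3.0], [1.0, 1.0], [-1.0, 0.5]]

closed = gated_cross_attention(text, vision, gate=0.0)  # tanh(0) = 0: vision contributes nothing
opened = gated_cross_attention(text, vision, gate=1.0)  # vision mixed into every text position
print(len(closed), len(opened))  # output length always equals the text length
```

Two properties carry over to real designs: a gate at zero leaves the pretrained text stream untouched, and the output has one vector per text position no matter how many visual features were attended to.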

The practical rule of thumb: early-fusion / concatenation models tend to be stronger at precise single-image reading (documents, charts, OCR-like tasks); cross-attention models scale more gracefully to many images and video because they do not pay text-context cost per frame. Neither is universally better, and the field has largely converged on concatenation for single rich images and cross-attention or heavy resampling for many-image and video settings.

The High-Resolution Problem and the Tiling Fix

A fixed encoder resolution (say 336x336) is fine for a photo of a dog and useless for a dense invoice or a 4K screenshot. Downscaling a detailed page to 336 pixels destroys the text. The standard fix is tiling (also called AnyRes or dynamic resolution): split the high-resolution image into a grid of tiles, run each tile through the encoder at the encoder's native resolution, and also encode a downscaled thumbnail of the whole image for global context. Each tile contributes its own block of visual tokens.

   4K page ──┬──> tile(0,0) ──> encoder ──> 576 tokens
             ├──> tile(0,1) ──> encoder ──> 576 tokens
             ├──> ...                          ...
             └──> thumbnail  ──> encoder ──> 576 tokens (global view)

This is why a VLM can suddenly read fine print when you enable high-resolution mode, and also why your token cost (and latency, and bill) can jump several-fold on the same image. A 3x3 tiling plus thumbnail is ten encoder passes and ten times the visual tokens. Understanding tiling is the difference between "the model cannot read my documents" and "the model can, for 10x the tokens," which is a budget decision, not a capability one. For document-heavy retrieval, an alternative is to keep the page as an image end to end, covered in Visual Document Retrieval.
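The token multiplication from tiling is simple enough to estimate before sending an image. This sketch assumes 336-pixel tiles, patch size 14, no token reduction in the projector, and a plain ceiling-division grid; real implementations typically cap the number of tiles and pick from a set of allowed grid shapes, so treat `tiled_visual_tokens` as a hypothetical estimator.

```python
import math


def tiled_visual_tokens(width, height, tile_size=336, patch_size=14, add_thumbnail=True):
    """Return (grid_cols, grid_rows, encoder_passes, visual_tokens) for a tiled image."""
    cols = math.ceil(width / tile_size)
    rows = math.ceil(height / tile_size)
    tokens_per_pass = (tile_size // patch_size) ** 2
    passes = cols * rows + (1 if add_thumbnail else 0)
    return cols, rows, passes, passes * tokens_per_pass


print(tiled_visual_tokens(336, 336, add_thumbnail=False))  # single-pass baseline
print(tiled_visual_tokens(1008, 1008))                      # 3x3 grid plus thumbnail
```

The baseline is one pass and 576 tokens; the 1008x1008 image is ten passes and 5760 tokens, the tenfold jump described above.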

How Training Locks the Parts Together

VLMs are almost never trained end to end from scratch. The dominant recipe is two stages:

1. Alignment (projector) pretraining. Freeze the vision encoder and the language model; train only the projector on a large set of image-caption pairs. This teaches the connector to speak the language model's dialect without disturbing either pretrained half. It is cheap because the projector is small.

2. Instruction tuning. Unfreeze the language model (and sometimes the encoder) and train on multimodal instruction data: questions about images, reasoning over charts, multi-turn visual dialogue. This is what turns a captioner into something an agent can actually instruct.
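The two stages amount to a table of which components receive gradient updates. The sketch below encodes that table as plain data. The names are hypothetical, and keeping the projector trainable during instruction tuning is an assumption, since the recipe above only says which parts get unfrozen.

```python
# True means the component is trained in that stage; False means it is frozen.
STAGES = {
    "alignment": {"vision_encoder": False, "projector": True, "language_model": False},
    # Assumption: the projector keeps training alongside the language model.
    "instruction_tuning": {"vision_encoder": False, "projector": True, "language_model": True},
}


def trainable_parts(stage, unfreeze_encoder=False):
    """Return the names of the components that receive gradient updates in a stage."""
    plan = dict(STAGES[stage])
    if stage == "instruction_tuning" and unfreeze_encoder:
        plan["vision_encoder"] = True  # the "sometimes the encoder" case
    return sorted(name for name, trainable in plan.items() if trainable)


for stage in STAGES:
    print(stage, "->", trainable_parts(stage))
```

In a real training script this plan would be applied by disabling gradients on the frozen modules before building the optimizer.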

Two consequences matter for anyone deploying a VLM. First, because the encoder is often frozen, the model's perceptual ceiling is set by a vision model that was trained for a different objective; it can describe what it was taught to align with and struggles with anything outside that distribution. Second, the language model half carries strong text priors, which is the mechanistic source of hallucination: when the visual evidence is weak (small object, low resolution, ambiguous tile), the decoder falls back on what is statistically likely given the prompt rather than what is in the image. Knowing this tells you the fix is usually more pixels (higher resolution, tiling) or a better encoder, not a longer prompt.

What This Means for an Agent Stack

An agent that uses a VLM to perceive should treat these internals as operational levers, not trivia.

Resolution is a recall setting. If the agent misses small details, the first lever is encoder resolution and tiling, not prompt engineering. Pixels the encoder cannot resolve cannot be reasoned about.

Visual tokens are a budget. Every image token competes with text in the context window and in cost. Resampled models are cheap per image but lossy; concatenation models are faithful but expensive at high resolution. Choose per task.

Hallucination is an evidence problem. When the model invents content, suspect insufficient visual tokens for the detail in question before you suspect the prompt. (For how to measure this rigorously, see Agent Perception Evals.)

The VLM is for reasoning, not retrieval. A generative VLM is the wrong tool for searching a million frames; it is the right tool for reading the handful a retriever already surfaced. The clean pattern is retrieve with embeddings, then reason with a VLM over the top results.
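Treating visual tokens as a budget can be turned into a pre-flight check: given a context window, how fine a tile grid can the agent afford for one image? This is a rough sketch assuming a concatenation-style model, 576 tokens per encoder pass, square grids, and one thumbnail pass; `max_affordable_grid` and all of its parameters are hypothetical.

```python
def max_affordable_grid(context_window, prompt_tokens, answer_tokens,
                        tokens_per_tile=576, add_thumbnail=True):
    """Largest square tile grid whose visual tokens fit beside the prompt and answer.

    Returns the grid side length (3 means a 3x3 grid); 0 means not even one tile fits.
    """
    budget = context_window - prompt_tokens - answer_tokens
    side = 0
    while True:
        # Cost of stepping up to a (side + 1) x (side + 1) grid.
        passes = (side + 1) ** 2 + (1 if add_thumbnail else 0)
        if passes * tokens_per_tile > budget:
            return side
        side += 1


print(max_affordable_grid(context_window=8192, prompt_tokens=200, answer_tokens=500))
```

If the affordable grid is too coarse for the detail the task needs, the options are to crop to the region of interest before calling the model or to use a model with a resampling connector.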

Doing This in Mixpeek

In an agent pipeline, the VLM sits at the reasoning step, after retrieval has narrowed millions of frames or pages to a handful. Mixpeek's ingestion runs the perception layer (frame sampling, scene boundaries, transcripts, embeddings) so that retrieval is fast and cheap, and the VLM is only invoked on the small candidate set a retriever returns, not on the whole corpus.

from mixpeek import Mixpeek

client = Mixpeek(api_key="mxp_sk_...")

# 1. Retrieve the handful of frames worth reasoning over (embeddings, not a VLM)
results = client.retrievers.execute(
    retriever_id="video_frames",
    inputs={"text": "the slide showing the Q3 revenue table"},
    top_k=5,
)

# 2. Hand only those frames to a VLM for fine-grained reading.
#    Enable high-resolution tiling so the table text survives the encoder.
for hit in results.results:
    answer = client.vlm.read(
        image_url=hit["keyframe_url"],
        prompt="Extract the Q3 revenue figure from this table.",
        high_resolution=True,   # tile the frame so small text is resolvable
    )
    print(hit["timestamp"], answer)

The architectural point carries straight into the API: high_resolution=True is the tiling lever from this guide, trading visual tokens for the ability to read fine print, and calling the VLM only on retrieved candidates keeps the expensive, token-hungry component off the hot path of search. The agent perceives with embeddings and reasons with a VLM, each used where its architecture is strong.

Key Takeaways

1. Every VLM is a vision encoder, a projector, and a language model in series; the encoder turns the image into patch features, the projector turns those into language tokens, and the language model fuses them with text.

2. Patch count equals visual token cost, and it grows with the square of resolution. This is the lever behind both detail and expense.

3. Projectors split into MLP (keep every patch, faithful, expensive) and resamplers (fixed small token count, cheap, lossy). The choice is a fidelity-versus-budget trade.

4. Fusion is either concatenation (visual tokens in the sequence, strong single-image reading) or cross-attention (text attends out to vision, scales to many images and video).

5. Tiling / high-resolution mode is what lets a VLM read fine print, at a multiplied token cost; hallucination is usually a lack of visual evidence, fixed with more pixels, not a longer prompt.

