The Problem: A Chunk Without Context Is Unsearchable
Almost every document RAG system splits long text into chunks, embeds each chunk independently, and stores the vectors. This is the default, and it has a quiet failure mode that an AI agent feels every day: a chunk torn out of its document loses the context that made it findable.
Consider a chunk that reads:
> "He refused the offer, and the board accepted his resignation the following week."
Who is "he"? What offer? The answer lives three paragraphs earlier — "the CEO, Maria Chen, was presented with a buyout proposal." Embedded on its own, this chunk has no signal connecting it to "Maria Chen" or "buyout." An agent asking "what happened with the CEO's buyout?" will rank this chunk low, even though it is the exact passage that answers the question.
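A rough way to see the gap is to count shared content words between the query and the chunk, with and without the earlier sentence. This is only a sketch: word overlap is a crude stand-in for retrieval signal (a real embedding model is softer than exact matching), and the `tokens` helper and stopword list are illustrative assumptions, not part of any library.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"the", "and", "his", "he", "with", "what", "was", "a", "of", "to"}

def tokens(text: str) -> set[str]:
    """Lowercased content words, with possessive 's dropped."""
    words = re.findall(r"[a-z]+", text.lower().replace("'s", ""))
    return {w for w in words if w not in STOPWORDS}

query = "what happened with the CEO's buyout?"
bare_chunk = "He refused the offer, and the board accepted his resignation the following week."
context = "The CEO, Maria Chen, was presented with a buyout proposal."

# Content words shared by the query and each version of the chunk.
bare_overlap = tokens(query) & tokens(bare_chunk)
contextual_overlap = tokens(query) & tokens(context + " " + bare_chunk)

print(bare_overlap)        # empty: nothing ties the bare chunk to the query
print(contextual_overlap)  # contains "ceo" and "buyout"
```

The bare chunk shares no content words with the query; once the earlier sentence travels with it, both "ceo" and "buyout" match.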
This is not an edge case. Pronouns, acronyms defined once, section headings that scope everything below them, tables whose meaning lives in a caption, transcript turns that only make sense given the previous speaker — all of these break under naive chunking. The retrieval failure is invisible: the system returns *something*, just not the right thing, and the agent confidently reasons over the wrong evidence.
Two techniques fix this without making chunks huge (which over-compresses semantics and hurts precision): late chunking and contextual retrieval. They attack the same root cause from opposite directions.
Why Not Just Use Bigger Chunks?
The naive instinct is to make chunks larger so each one carries more context. This trades one problem for another. Dense embedding models compress an entire chunk into a single fixed-size vector. The more text you stuff into one chunk, the more distinct ideas get averaged into one point in vector space, and the embedding drifts toward a bland centroid that matches everything weakly and nothing strongly.
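The dilution effect can be illustrated with a toy model. The sketch below assumes each distinct idea is its own orthogonal direction and that a chunk's vector is the plain mean of the ideas it contains. Real embedding models are not this simple, so read the numbers as an intuition pump rather than a measurement.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Toy assumption: four unrelated ideas, one orthogonal axis each.
ideas = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
query = ideas[0]  # the query is about the first idea only

# Similarity of the query to a chunk that mixes the first k ideas.
similarity_by_k = {k: cosine(query, mean_vector(ideas[:k])) for k in (1, 2, 4)}
for k, sim in similarity_by_k.items():
    print(k, round(sim, 3))  # 1 -> 1.0, 2 -> 0.707, 4 -> 0.5
```

Under these assumptions the similarity falls as 1/sqrt(k): every extra idea packed into the chunk weakens its match to a query about any single one of them.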
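So you are caught between two failures: small chunks lose the context that makes them findable, and large chunks blur into a single vector that matches nothing strongly.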
Both late chunking and contextual retrieval let you keep chunks small *and* contextual.
Technique 1: Late Chunking
Late chunking, introduced by Jina AI in 2024, comes from a simple observation about *when* you split. Naive chunking splits the text first, then embeds each piece. Late chunking embeds first, then splits.
The Algorithm
A long-context embedding model (8K+ tokens) is a transformer whose attention lets every token see every other token. Late chunking exploits this:
1. Concatenate the chunks (or take the whole document) as one sequence.
2. Run a single forward pass through the long-context model. Every token embedding has now attended to the full document: the token for "he" has already absorbed information from "Maria Chen" three paragraphs back.
3. Mark chunk boundaries over the token sequence (by sentence, fixed size, or semantic split).
4. Mean-pool within each chunk's token span to produce one vector per chunk.
The difference is entirely in the pooling boundary. Naive chunking pools *before* the chunks have seen each other; late chunking pools *after*.
    Naive chunking:
      split --> [chunk A] [chunk B] [chunk C]
      embed each independently (no cross-chunk attention)
      pool  --> vec(A), vec(B), vec(C)

    Late chunking:
      concatenate --> [chunk A chunk B chunk C]
      ONE forward pass with full attention across A, B, C
      token embeddings: t1 t2 t3 ... tN   (each conditioned on all others)
      pool within boundaries --> vec(A) from t1..ti
                                 vec(B) from ti+1..tj
                                 vec(C) from tj+1..tN
Pseudocode
    def late_chunk(document, model, chunk_spans):
        # 1. ONE forward pass over the whole document
        token_embeddings = model.encode_tokens(document)  # shape: [num_tokens, dim]

        # 2. Mean-pool within each chunk's token range
        chunk_vectors = []
        for (start, end) in chunk_spans:          # spans are token indices
            span = token_embeddings[start:end]    # tokens already saw full doc
            chunk_vectors.append(span.mean(axis=0))
        return chunk_vectors
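The pseudocode above takes `chunk_spans` as token indices, but chunk boundaries usually start life as character positions. Here is a minimal sketch of that mapping, assuming you have a `(char_start, char_end)` offset for every token. Whitespace splitting stands in for a real tokenizer, and the token embeddings are made-up numbers; both are assumptions for illustration.

```python
import re

def char_spans_to_token_spans(token_offsets, chunk_char_spans):
    """Map character-level chunk boundaries to end-exclusive token index ranges.

    A token is assigned to the chunk that contains its first character.
    """
    spans = []
    for c_start, c_end in chunk_char_spans:
        idxs = [i for i, (t_start, _) in enumerate(token_offsets)
                if c_start <= t_start < c_end]
        if idxs:
            spans.append((idxs[0], idxs[-1] + 1))
    return spans

def mean_pool(token_embeddings, span):
    """Average the token vectors inside one (start, end) token span."""
    start, end = span
    rows = token_embeddings[start:end]
    return [sum(col) / len(rows) for col in zip(*rows)]

text = "Maria Chen got an offer. He refused it."
# Stand-in tokenizer: whitespace-separated pieces with their character offsets.
token_offsets = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

# Two sentence-level chunks, expressed as character ranges.
chunk_char_spans = [(0, 24), (25, 39)]
token_spans = char_spans_to_token_spans(token_offsets, chunk_char_spans)

# Fake 2-dimensional "contextualized" token embeddings, one per token.
token_embeddings = [[float(i), 1.0] for i in range(len(token_offsets))]
chunk_vectors = [mean_pool(token_embeddings, span) for span in token_spans]

print(token_spans)    # [(0, 5), (5, 8)]
print(chunk_vectors)  # [[2.0, 1.0], [6.0, 1.0]]
```

With a real model, `token_embeddings` would come from the single long-context forward pass, and the offsets from the tokenizer's own offset output rather than a regex.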
Why It Works
The contextual signal is injected by attention, for free, at encode time. No extra model calls, no extra storage — you still store one vector per chunk. The chunk for "He refused the offer..." now carries a trace of "Maria Chen" and "buyout" because those tokens were in the attention window during the single forward pass.
The cost is that you need a genuine long-context embedding model and enough context budget to fit the relevant span. For documents longer than the model's window, you process overlapping macro-windows and pool within each.
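One way to lay out those macro-windows is sketched below. The window size, the overlap, and especially the rule for choosing which window a chunk is pooled from (the one that leaves the most context on the chunk's tighter side) are assumptions made for illustration; the text above only says that windows overlap and that pooling happens within each.

```python
def macro_windows(num_tokens, window, overlap):
    """Token ranges [start, end) covering the document with a fixed overlap."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be >= 0 and smaller than window")
    step = window - overlap
    windows, start = [], 0
    while True:
        end = min(start + window, num_tokens)
        windows.append((start, end))
        if end == num_tokens:
            return windows
        start += step

def assign_chunks(chunk_spans, windows):
    """For each chunk, choose a window that fully contains it (None if none does)."""
    assignment = []
    for c_start, c_end in chunk_spans:
        containing = [i for i, (w_start, w_end) in enumerate(windows)
                      if w_start <= c_start and c_end <= w_end]
        if not containing:
            assignment.append(None)
            continue
        # Prefer the window with the most context on the chunk's tighter side.
        best = max(containing,
                   key=lambda i: min(c_start - windows[i][0], windows[i][1] - c_end))
        assignment.append(best)
    return assignment

windows = macro_windows(num_tokens=20000, window=8192, overlap=1024)
chunk_spans = [(0, 300), (7500, 7900), (15000, 15300)]
assignment = assign_chunks(chunk_spans, windows)

print(windows)     # [(0, 8192), (7168, 15360), (14336, 20000)]
print(assignment)  # [0, 1, 2]
```

Each window gets its own forward pass. A chunk assigned to window `(w_start, w_end)` is then pooled over the local token range `(c_start - w_start, c_end - w_start)` of that window's token embeddings.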
What the Numbers Say
In Jina's evaluation across three models and four retrieval datasets, late chunking delivered a consistent ~2-4% relative nDCG improvement over naive chunking (sentence, fixed-size, and semantic boundaries alike) with zero additional training. The gains concentrate exactly where you expect: chunks that are ambiguous on their own.
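For readers unfamiliar with the metric, here is a small sketch of nDCG@k and of what "relative improvement" means. The two rankings are invented toy data, not Jina's results, and the ideal ordering is computed from the ranked list itself, which assumes that list contains every judged document.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: relevance discounted by log2 of rank position."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(ranked_relevances, k):
    """DCG of the top k results divided by the DCG of the best possible top k."""
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True)[:k])
    return dcg(ranked_relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy data: 1 marks the single relevant chunk, 0 marks irrelevant ones.
naive_ranking = [0, 0, 1, 0]  # relevant chunk retrieved at rank 3
late_ranking = [0, 1, 0, 0]   # relevant chunk retrieved at rank 2

naive_score = ndcg_at_k(naive_ranking, k=4)
late_score = ndcg_at_k(late_ranking, k=4)

# Relative improvement is measured against the baseline score.
relative_improvement = (late_score - naive_score) / naive_score

print(round(naive_score, 3), round(late_score, 3))
print(f"{relative_improvement:.1%}")
```

A single query moving up one rank produces a large swing; a benchmark-level figure is an average over many queries, most of which do not change at all.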
Technique 2: Contextual Retrieval
Anthropic's contextual retrieval (2024) attacks the same problem with an LLM instead of attention. Instead of relying on the embedding model to propagate context, it *writes the context into the chunk text* before embedding.
The Algorithm
For each chunk, call an LLM with the whole document and that chunk, and ask it to produce a short situating description. Prepend that description to the chunk, then embed and index the augmented chunk.
    For each chunk:
        context = LLM("Given this document: {doc}\n"
                      "Write 1-2 sentences situating this chunk: {chunk}")
        augmented_chunk = context + "\n\n" + chunk
        index(embed(augmented_chunk))   # dense vector
        index(bm25(augmented_chunk))    # sparse / keyword
A generated context might read: *"This chunk is from the Q3 board minutes describing CEO Maria Chen's response to the buyout proposal."* Now both the dense embedding and the keyword index contain "Maria Chen" and "buyout" — the chunk is findable by either path.
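The augmentation step can be sketched as a plain function that takes any text-generation callable. The prompt wording here is an illustrative paraphrase, not Anthropic's published prompt, and `fake_llm` is a hypothetical stand-in that returns a canned string so the example runs without a model.

```python
from typing import Callable

# Illustrative prompt template (an assumption, not the published prompt).
PROMPT_TEMPLATE = (
    "<document>\n{doc}\n</document>\n"
    "Here is a chunk from that document:\n<chunk>\n{chunk}\n</chunk>\n"
    "Write 1-2 sentences situating this chunk within the document, "
    "to improve search retrieval of the chunk. Answer with the context only."
)

def contextualize(doc: str, chunks: list[str],
                  generate: Callable[[str], str]) -> list[str]:
    """Return each chunk prefixed with an LLM-written situating description."""
    augmented = []
    for chunk in chunks:
        context = generate(PROMPT_TEMPLATE.format(doc=doc, chunk=chunk)).strip()
        augmented.append(f"{context}\n\n{chunk}")
    return augmented

def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM client call.
    return ("This chunk is from the Q3 board minutes describing CEO Maria "
            "Chen's response to the buyout proposal.")

doc = ("The CEO, Maria Chen, was presented with a buyout proposal. ... "
       "He refused the offer, and the board accepted his resignation the following week.")
chunk = "He refused the offer, and the board accepted his resignation the following week."

augmented_chunks = contextualize(doc, [chunk], generate=fake_llm)
print(augmented_chunks[0])
```

The same augmented string is what you would pass to both the embedding model and the keyword index, so the two retrieval paths see identical text.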
The Full Pipeline
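Anthropic's published pipeline stacks three reinforcing steps, each adding measurable lift to retrieval recall: contextual embeddings, contextual BM25 over the same augmented text, and a reranking pass over the combined candidates.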
The lesson generalizes beyond this one technique: contextualization, hybrid dense+sparse retrieval, and reranking are complementary, not redundant.
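To make the hybrid step concrete, here is a sketch of reciprocal rank fusion (RRF), one common way to merge a dense ranking and a keyword ranking without having to calibrate their scores against each other. Treat the choice of RRF as an assumption about how you might combine the lists, not as a description of Anthropic's exact method. The chunk ids are made up, and `k=60` is the conventional default constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids: each list contributes 1 / (k + rank) per id."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

dense_ranking = ["c7", "c2", "c9"]   # from the vector index
sparse_ranking = ["c2", "c4", "c7"]  # from BM25

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)  # ['c2', 'c7', 'c4', 'c9']
```

Chunks that appear in both lists rise to the top, and a reranker can then rescore just the head of the fused list.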
The Tradeoff
Contextual retrieval is more expensive than late chunking: you make one LLM call per chunk at index time, and each call carries the whole document. Prompt caching makes this affordable (you cache the document once and vary only the chunk), but it still adds real cost and latency at ingestion. In exchange you get a human-readable, hybrid-searchable chunk that does not depend on a long-context embedding model, and the context is explicit rather than implicit in the vector.
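A back-of-the-envelope model shows why caching matters. Everything numeric below is a placeholder: `price_per_token` is set to 1.0 so results read as token-equivalents, and `cached_read_factor` (the fraction of the normal price paid for cached document tokens) is a hypothetical rate, not any provider's actual pricing. The sketch also ignores output tokens and any surcharge for writing to the cache.

```python
def contextualization_cost(doc_tokens, chunk_tokens, num_chunks,
                           price_per_token=1.0, cached_read_factor=1.0):
    """Estimated input cost of one LLM call per chunk, each carrying the document.

    cached_read_factor=1.0 means no caching: every call pays full price for the doc.
    """
    if num_chunks == 0:
        return 0.0
    # The first call always pays full price for the document.
    first_call = (doc_tokens + chunk_tokens) * price_per_token
    # Later calls pay the (possibly discounted) cached rate for the document.
    later_calls = ((num_chunks - 1)
                   * (doc_tokens * cached_read_factor + chunk_tokens)
                   * price_per_token)
    return first_call + later_calls

# Hypothetical document: 8,000 tokens split into 40 chunks of 200 tokens.
uncached = contextualization_cost(8000, 200, 40)
cached = contextualization_cost(8000, 200, 40, cached_read_factor=0.1)

print(uncached)  # 328000 token-equivalents
print(cached)    # about 47200 token-equivalents
print(f"{cached / uncached:.0%} of the uncached cost")
```

Without caching, cost grows with document length times chunk count; with it, the repeated document is paid at the discounted rate and the per-chunk text dominates.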
Choosing Between Them
| Dimension | Naive chunking | Late chunking | Contextual retrieval |
| --- | --- | --- | --- |
| Where context comes from | nowhere | attention (single forward pass) | LLM-generated text prefix |
| Index-time cost | lowest | one long-context encode | one LLM call per chunk |
| Needs long-context encoder | no | yes | no |
| Works with BM25 / keyword | poorly | poorly | yes (text is augmented) |
| Storage | 1 vector/chunk | 1 vector/chunk | 1 vector/chunk (+ longer text) |
| Best when | short, self-contained docs | long docs, latency-sensitive ingest | names/codes/entities matter, hybrid search |
This Applies to More Than Text
The same failure mode appears in every modality an agent perceives:
Whenever you decompose continuous content into searchable pieces, ask: does each piece still carry the context that makes it findable? If not, you need one of these techniques.
Doing This in Mixpeek
Mixpeek ingests documents into collections through feature extractors, and the chunking + embedding strategy is part of the collection's extraction config — so you decide the contextualization policy once, at the collection level, and every document inherits it. Retrievers then search those chunk vectors, and Mixpeek's hybrid retriever fuses dense and BM25 stages, which is exactly the dense+sparse pattern contextual retrieval relies on.
    from mixpeek import Mixpeek

    client = Mixpeek(api_key="mxp_sk_...")

    # A document collection whose chunks are embedded with a long-context model
    # (late-chunking-style context) and also indexed for keyword search.
    collection = client.collections.create(
        namespace_id="ns_docs",
        collection_name="board-minutes",
        source={"type": "documents"},
        feature_extractors=[
            {
                "feature_extractor_name": "text_chunk_embedding",
                "parameters": {
                    "chunk_strategy": "semantic",            # sentence / fixed / semantic boundaries
                    "embedding_model": "long-context-text",  # 8K+ context window
                },
            }
        ],
    )

    # Hybrid retriever: dense chunk vectors + BM25 over the same chunk text.
    retriever = client.retrievers.create(
        namespace_id="ns_docs",
        collection_ids=[collection.collection_id],
        retriever_name="board-minutes-hybrid",
        stages=[
            {"stage_name": "knn_search", "parameters": {"top_k": 50}},
            {"stage_name": "keyword_search", "parameters": {"top_k": 50}},
            {"stage_name": "rrf_fusion", "parameters": {}},
            {"stage_name": "score_threshold", "parameters": {"min_score": 0.4}},
        ],
    )

    # The agent's query now matches the contextual chunk, not a bare fragment.
    results = client.retrievers.execute(
        retriever_id=retriever.retriever_id,
        inputs={"text": "what happened with the CEO's buyout?"},
        top_k=10,
    )
If you need explicit, LLM-generated context per chunk (the Anthropic-style prefix), do that contextualization in your ingestion code before upload, and store the augmented text on the document — the dense and keyword stages above will then both benefit. Keep the chunking and embedding choices versioned with the collection: changing the chunk strategy or embedding model changes what is findable, so treat it like a migration, not a tweak.
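One lightweight way to treat these choices as a migration is to derive a fingerprint from the chunking and embedding settings and store it alongside the index, so a config change is detectable before old and new vectors get mixed. This is a plain-Python sketch, not a Mixpeek feature; `index_fingerprint` is a hypothetical helper, and the config keys simply mirror the example parameters above.

```python
import hashlib
import json

def index_fingerprint(config: dict) -> str:
    """Short, stable hash of the settings that determine what is findable."""
    # sort_keys makes the hash independent of dict key order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

current = {"chunk_strategy": "semantic", "embedding_model": "long-context-text"}
reordered = {"embedding_model": "long-context-text", "chunk_strategy": "semantic"}
proposed = {"chunk_strategy": "fixed", "embedding_model": "long-context-text"}

same_index = index_fingerprint(current) == index_fingerprint(reordered)
needs_reindex = index_fingerprint(current) != index_fingerprint(proposed)

print(same_index)     # True: key order does not matter
print(needs_reindex)  # True: a new chunk strategy means a new index version
```

If the stored fingerprint differs from the one computed at ingest time, re-embed into a fresh collection rather than appending to the existing one.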