Agents Ask the Same Thing a Hundred Ways
Point an agent at a task and it does not ask one question; it asks dozens. It decomposes, retries, re-derives, and re-checks. Put that agent in front of many users and the traffic fills with near-duplicates: the same intent phrased a hundred ways. "What is your refund policy," "how do refunds work," and "can I get my money back" are one question. By default, every one of them re-runs the whole expensive pipeline: embed, retrieve, rerank, and generate.
A traditional cache does not help, because it keys on the exact string (or a hash of it). "how do refunds work" and "can I get my money back" hash to different keys, so the cache misses on both even though the answer is identical. The fix is to cache by meaning. This guide is about how that works, and, more importantly, where it goes wrong. The concepts are vendor-neutral; Mixpeek shows up at the end.
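To make the miss concrete, here is a minimal sketch of exact-match keying. The `exact_key` helper and its lowercase-and-collapse-whitespace normalization are assumptions for illustration, not any particular cache's scheme. Even with that normalization, the three paraphrases land on three unrelated keys.

```python
import hashlib


def exact_key(query: str) -> str:
    """Key an exact-match cache on a hash of the normalized query string."""
    # Hypothetical normalization: lowercase and collapse runs of whitespace.
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


queries = [
    "What is your refund policy",
    "how do refunds work",
    "can I get my money back",
]

# One intent, but every paraphrase hashes to a different key,
# so an exact-match cache misses on each of them.
distinct_keys = len({exact_key(q) for q in queries})
```

Normalization only rescues trivial variants such as casing and spacing; it does nothing for paraphrases.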
The Core Idea: A Cache That Is a Vector Index
A semantic cache is, mechanically, a tiny vector search index of past queries:
1. When a query comes in, embed it into a vector.
2. Search the cache index for the nearest stored query embedding.
3. If the top match's similarity is above a threshold, it is a cache hit: return the stored answer (or results) without running the real pipeline.
4. Otherwise it is a miss: run the full pipeline, then store the new (query embedding, answer) pair so the next near-duplicate hits.
That is the whole loop. The lookup itself is a single approximate-nearest-neighbor query, microseconds to low milliseconds on an index this small, on top of the query embedding the full pipeline would have computed anyway. Compare that with a full retrieval-plus-generation round trip that can take hundreds of milliseconds and cost real money. On agent workloads with heavy near-duplication, hit rates of 30 to 70 percent are common, and every hit is latency and cost you did not pay.
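The four steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production design: the `SemanticCache` class is hypothetical, a brute-force cosine scan stands in for a real ANN index, and `TOY_VECTORS` are hand-written stand-ins for what an embedding model would return.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class SemanticCache:
    embed: Callable[[str], Vector]  # assumed: your embedding model
    threshold: float = 0.9          # placeholder value; calibrate on your data
    entries: List[Tuple[Vector, str]] = field(default_factory=list)

    def lookup(self, query_vec: Vector) -> Optional[str]:
        # Step 2: nearest stored query (linear scan in place of an ANN index).
        best_score, best_answer = -1.0, None
        for vec, answer in self.entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        # Step 3: reuse the answer only if the best match clears the threshold.
        return best_answer if best_score >= self.threshold else None

    def answer(self, query: str, run_pipeline: Callable[[str], str]) -> Tuple[str, bool]:
        vec = self.embed(query)             # Step 1: embed the query
        cached = self.lookup(vec)
        if cached is not None:
            return cached, True             # hit: the pipeline is skipped
        result = run_pipeline(query)        # Step 4: miss, run the real pipeline
        self.entries.append((vec, result))  # store it for the next near-duplicate
        return result, False


# Hand-written toy vectors standing in for a real embedding model.
TOY_VECTORS = {
    "how do refunds work": [0.9, 0.1, 0.0],
    "can I get my money back": [0.88, 0.15, 0.02],
    "where is my order": [0.1, 0.9, 0.1],
}
calls: List[str] = []


def fake_pipeline(query: str) -> str:
    calls.append(query)  # record how often the expensive path runs
    return f"answer for: {query}"


cache = SemanticCache(embed=TOY_VECTORS.__getitem__, threshold=0.95)
first = cache.answer("how do refunds work", fake_pipeline)       # miss
second = cache.answer("can I get my money back", fake_pipeline)  # hit
third = cache.answer("where is my order", fake_pipeline)         # miss
```

In this toy run the pipeline executes twice for three queries: the second refund phrasing reuses the first one's stored answer.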
The Threshold Is the Entire Ballgame
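Everything rides on one number: how similar is similar enough to reuse an answer. This is a precision versus recall dial, and both failure directions are bad. Set the threshold too low and the cache returns a stored answer for a question that only sounds similar, which is a false hit. Set it too high and genuine near-duplicates miss, so the cache saves almost nothing.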
There is no universal right value, because it depends on your embedding model and your domain (legal and medical queries need tighter thresholds than casual FAQ). The practical method is to calibrate: take a labeled set of query pairs marked same-answer or different-answer, sweep the threshold, and pick the point where false-hit rate is acceptably low while hit rate is still worthwhile. Similarity scores are not probabilities and their scale differs per model, so calibrate on your own data rather than copying a magic number. The companion guide on calibrating similarity scores covers this directly.
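A threshold sweep over labeled pairs might look like the sketch below. The scores and labels are made-up illustrative values, not measurements. The helper names `sweep` and `pick_threshold` are hypothetical. "False-hit rate" is taken here to mean the share of served hits whose label is different-answer, which is one reasonable definition among several.

```python
from typing import List, Optional, Sequence, Tuple

# (similarity score of the pair, True if both queries deserve the same answer)
Pair = Tuple[float, bool]


def sweep(pairs: Sequence[Pair], thresholds: Sequence[float]) -> List[Tuple[float, float, float]]:
    """Return (threshold, hit_rate, false_hit_rate) for each candidate threshold."""
    rows = []
    for t in thresholds:
        hits = [same for score, same in pairs if score >= t]
        hit_rate = len(hits) / len(pairs)
        # Share of served hits that would have returned the wrong answer.
        false_hit_rate = sum(1 for same in hits if not same) / len(hits) if hits else 0.0
        rows.append((t, hit_rate, false_hit_rate))
    return rows


def pick_threshold(
    pairs: Sequence[Pair],
    thresholds: Sequence[float],
    max_false_hit_rate: float,
) -> Optional[float]:
    """Among thresholds within the false-hit budget, pick the highest hit rate."""
    ok = [
        (t, hr)
        for t, hr, fhr in sweep(pairs, thresholds)
        if fhr <= max_false_hit_rate and hr > 0
    ]
    if not ok:
        return None
    # Ties on hit rate go to the stricter (higher) threshold.
    return max(ok, key=lambda row: (row[1], row[0]))[0]


# Illustrative labeled pairs only; replace with scores from your own model and data.
labeled_pairs: List[Pair] = [
    (0.97, True), (0.95, True), (0.93, True), (0.91, False), (0.90, True),
    (0.86, False), (0.84, True), (0.80, False), (0.72, False), (0.65, False),
]
candidates = [0.80, 0.85, 0.90, 0.92, 0.95]

strict = pick_threshold(labeled_pairs, candidates, max_false_hit_rate=0.05)
lenient = pick_threshold(labeled_pairs, candidates, max_false_hit_rate=0.25)
```

On this toy set, tightening the false-hit budget pushes the chosen threshold up and the hit rate down, which is the precision versus recall trade the section describes.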
Where Semantic Caches Break
The threshold is necessary but not sufficient. The subtle failures are semantic:
The general defense is to make the cache key richer than the query embedding alone: include tenant, user scope, and any structured filters, and reserve pure-semantic matching for genuinely global, non-personalized answers.
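One way to sketch a richer key is to split it in two: an exact-match scope (tenant, user scope, structured filters) that selects a partition, and a semantic match that only runs inside that partition. `scope_key` and `ScopedSemanticCache` are hypothetical names, and the linear scan again stands in for a real vector index.

```python
import hashlib
import json
import math
from typing import Any, Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def scope_key(tenant_id: str, user_scope: Optional[str], filters: Dict[str, Any]) -> str:
    """Deterministic key for the exact-match part of the cache key.

    user_scope=None is used here to mean a global, non-personalized answer.
    """
    payload = json.dumps(
        {"tenant": tenant_id, "user": user_scope, "filters": filters},
        sort_keys=True,  # filter order must not change the key
        default=str,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ScopedSemanticCache:
    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self._partitions: Dict[str, List[Tuple[Vector, str]]] = {}

    def put(self, scope: str, query_vec: Vector, answer: str) -> None:
        self._partitions.setdefault(scope, []).append((query_vec, answer))

    def get(self, scope: str, query_vec: Vector) -> Optional[str]:
        # Semantic matching never crosses a scope boundary.
        best_score, best_answer = -1.0, None
        for vec, answer in self._partitions.get(scope, []):
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


scoped = ScopedSemanticCache(threshold=0.95)
refund_vec = [0.9, 0.1, 0.0]  # toy embedding of a refund question
acme = scope_key("acme", None, {"locale": "en", "product": "pro"})
globex = scope_key("globex", None, {"locale": "en", "product": "pro"})

scoped.put(acme, refund_vec, "answer scoped to acme")
same_tenant = scoped.get(acme, refund_vec)     # hit inside the same scope
other_tenant = scoped.get(globex, refund_vec)  # identical query, different tenant: miss
```

The trade-off is a lower hit rate, since entries are no longer shared across scopes, in exchange for never serving one tenant's answer to another.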
Invalidation: The Hard Part
There are two easy strategies and one hard one:
This is why semantic caching is easiest on stable knowledge (product docs, policies, reference material) and hardest on fast-changing data (live inventory, prices, breaking events), where you either accept staleness or set aggressive TTLs.
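As a rough sketch of combining a clock with change-driven eviction, each cached entry below carries an expiry time and the IDs of the source documents its answer was built from. The `InvalidatingStore` class, the `source_ids` tagging scheme, and the injected `clock` are all assumptions for illustration. The vector matching is left out: `entry_id` stands for whatever the nearest-neighbor lookup returned.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Entry:
    answer: str
    source_ids: FrozenSet[str]  # documents this answer was derived from
    expires_at: float


class InvalidatingStore:
    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries: Dict[str, Entry] = {}

    def put(self, entry_id: str, answer: str, source_ids: Iterable[str]) -> None:
        expires_at = self.clock() + self.ttl_seconds
        self._entries[entry_id] = Entry(answer, frozenset(source_ids), expires_at)

    def get(self, entry_id: str) -> Optional[str]:
        entry = self._entries.get(entry_id)
        if entry is None:
            return None
        if self.clock() >= entry.expires_at:
            del self._entries[entry_id]  # evict on a clock
            return None
        return entry.answer

    def invalidate_source(self, source_id: str) -> int:
        """Evict on change: drop every entry derived from the changed document."""
        stale = [eid for eid, e in self._entries.items() if source_id in e.source_ids]
        for eid in stale:
            del self._entries[eid]
        return len(stale)


now = [0.0]  # fake clock so the example is deterministic
store = InvalidatingStore(ttl_seconds=60.0, clock=lambda: now[0])
store.put("q1", "refund answer", ["doc-refunds"])
store.put("q2", "shipping answer", ["doc-shipping"])

evicted = store.invalidate_source("doc-refunds")  # the refund doc was edited
after_change = store.get("q1")                    # gone immediately
still_fresh = store.get("q2")                     # untouched by that change
now[0] = 61.0
after_ttl = store.get("q2")                       # expired by TTL
```

The difficult part in practice is populating `source_ids` accurately, since it requires the pipeline to report which documents contributed to each answer.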
Cache More Than Answers
The pattern generalizes past final answers:
Why This Matters for Agents Specifically
Human search is bursty and diverse; agent search is repetitive and parallel. An agent planning a task issues many overlapping sub-queries, and a swarm of agents on similar tasks issues many overlapping queries across sessions. That redundancy is exactly what a semantic cache eats. The payoff is not only cost: it is latency and consistency. Cache hits return in milliseconds, and reusing a vetted answer for equivalent queries makes the agent's behavior more consistent than re-deriving it fresh each time (with all the variance that entails). The caution is the mirror image: a wrong cached answer is now served fast and confidently to every near-duplicate, so the correctness bar on the threshold and the key design is higher for agents than for a human-in-the-loop app.
Where This Lands in Practice: Mixpeek
A semantic cache sits naturally in front of a retrieval layer, and it is built from the same primitives you already have: an embedding model and a vector index. On Mixpeek, the retriever execution path is where this fits: embed the incoming query, check it against a small index of recent query embeddings scoped by namespace and filters, and only run the full multi-stage retrieval pipeline on a miss. Because the cache is itself a vector search, it uses the same ANN machinery and the same similarity-score calibration as the main index, and it inherits the same invalidation concerns as any incremental index. The short version: cache by meaning, keep the key scoped to who and what, calibrate the threshold on your own data, and evict on change, not just on a clock.