Semantic Caching: How Agents Skip Work They Have Already Done

A vendor-neutral guide to caching by meaning instead of by exact string. Covers why hash-based caches almost never hit on agent traffic, how a semantic cache is really a tiny vector index of query embeddings, the similarity-threshold precision/recall tradeoff that makes or breaks it, the failure modes (false hits, staleness, negation and entity flips), invalidation strategies, and how to cache retrieval results and tool calls, not just answers, for agents that fan out many near-duplicate queries.

Agents Ask the Same Thing a Hundred Ways

Point an agent at a task and it does not ask one question; it asks dozens. It decomposes, retries, re-derives, and re-checks. Put that agent in front of many users and the traffic is full of near-duplicates: the same intent phrased a hundred ways. "What is your refund policy," "how do refunds work," and "can I get my money back" are one question. By default, every one of them re-runs the whole expensive pipeline: embed, retrieve, rerank, and generate.

A traditional cache does not help, because it keys on the exact string (or a hash of it). "how do refunds work" and "can I get my money back" hash to different keys, so the cache misses on both even though the answer is identical. The fix is to cache by meaning. This guide is about how that works, and, more importantly, where it goes wrong. The concepts are vendor-neutral; Mixpeek shows up at the end.

The Core Idea: A Cache That Is a Vector Index

A semantic cache is, mechanically, a tiny vector search index of past queries:

1. When a query comes in, embed it into a vector.
2. Search the cache index for the nearest stored query embedding.
3. If the top match's similarity is above a threshold, it is a cache hit: return the stored answer (or results) without running the real pipeline.
4. Otherwise it is a miss: run the full pipeline, then store the new (query embedding, answer) pair so the next near-duplicate hits.

That is the whole loop. The cache lookup is a single approximate-nearest-neighbor query, microseconds to low milliseconds on a small index, on top of the cost of embedding the query. Compare that with a full retrieval-plus-generation round trip, which can take hundreds of milliseconds and cost real money. On agent workloads with heavy near-duplication, hit rates of 30 to 70 percent are common, and every hit is latency and cost you did not pay.
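The four steps can be sketched in a few lines of Python. This is a minimal sketch under loud assumptions, not a production design: `toy_embed` is a hypothetical stand-in (a normalized bag of words, so it only captures word overlap, not meaning), the lookup is a brute-force scan where a real cache would use an ANN index, and `SemanticCache`, `answer_with_cache`, and the threshold value are illustrative names, not part of any library.

```python
import math
import re
from collections import Counter
from typing import Callable, Optional

Vector = dict[str, float]


def toy_embed(text: str) -> Vector:
    """ASSUMPTION: stand-in for a real embedding model (unit-length bag of words)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two unit-length sparse vectors (a plain dot product)."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


class SemanticCache:
    def __init__(self, embed: Callable[[str], Vector], threshold: float) -> None:
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[Vector, str]] = []  # (query embedding, answer)

    def lookup(self, query: str) -> Optional[str]:
        # Steps 1-3: embed, find the nearest stored query, compare to threshold.
        q = self.embed(query)
        best_score, best_answer = -1.0, None
        for vec, answer in self.entries:  # brute force; an ANN index in practice
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def store(self, query: str, answer: str) -> None:
        # Step 4: remember the pair so the next near-duplicate hits.
        self.entries.append((self.embed(query), answer))


def answer_with_cache(
    cache: SemanticCache, query: str, pipeline: Callable[[str], str]
) -> tuple[str, bool]:
    """Return (answer, was_cache_hit); run the full pipeline only on a miss."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached, True
    answer = pipeline(query)
    cache.store(query, answer)
    return answer, False
```

In use, `pipeline` would be whatever function runs the real embed, retrieve, rerank, and generate path. Note that this sketch embeds the query twice on a miss (once in `lookup`, once in `store`); a real implementation would reuse the vector.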

The Threshold Is the Entire Ballgame

Everything rides on one number: how similar is similar enough to reuse an answer. This is a precision versus recall dial, and both failure directions are bad:

Threshold too loose (too low): you get false hits, returning a cached answer for a query that only looks similar. "What is the capital of France" and "what was the capital of France in 1789" embed close together but have different answers. A loose threshold serves the wrong one confidently. In a semantic cache, a false hit is not a slow response, it is a wrong response.

Threshold too tight (too high): almost nothing hits, and you have paid for an embedding and an index lookup on every query for near-zero benefit.

There is no universal right value, because it depends on your embedding model and your domain (legal and medical queries need tighter thresholds than casual FAQ). The practical method is to calibrate: take a labeled set of query pairs marked same-answer or different-answer, sweep the threshold, and pick the point where false-hit rate is acceptably low while hit rate is still worthwhile. Similarity scores are not probabilities and their scale differs per model, so calibrate on your own data rather than copying a magic number. The companion guide on calibrating similarity scores covers this directly.
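The calibration sweep described above might look like the following. This is a sketch under stated assumptions: the similarity scores are taken as already computed by your own embedding model, "false-hit rate" is defined here as the fraction of would-be hits whose label is different-answer, and `LabeledPair`, `sweep`, and `pick_threshold` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple, Optional, Sequence


class LabeledPair(NamedTuple):
    similarity: float   # score from your embedding model for the two queries
    same_answer: bool   # human label: do the two queries share an answer?


class SweepRow(NamedTuple):
    threshold: float
    hit_rate: float        # fraction of all pairs that would be cache hits
    false_hit_rate: float  # fraction of those hits labeled different-answer


def sweep(pairs: Sequence[LabeledPair], thresholds: Sequence[float]) -> list[SweepRow]:
    """Evaluate each candidate threshold against the labeled pairs."""
    rows = []
    for t in thresholds:
        hits = [p for p in pairs if p.similarity >= t]
        false_hits = [p for p in hits if not p.same_answer]
        hit_rate = len(hits) / len(pairs)
        false_hit_rate = len(false_hits) / len(hits) if hits else 0.0
        rows.append(SweepRow(t, hit_rate, false_hit_rate))
    return rows


def pick_threshold(rows: Sequence[SweepRow], max_false_hit_rate: float) -> Optional[float]:
    """Highest hit rate among thresholds whose false-hit rate is acceptable."""
    ok = [r for r in rows if r.hit_rate > 0 and r.false_hit_rate <= max_false_hit_rate]
    if not ok:
        return None
    return max(ok, key=lambda r: r.hit_rate).threshold
```

The acceptable false-hit rate is a domain decision rather than something the sweep can tell you; a legal or medical deployment would pass a much smaller `max_false_hit_rate` than a casual FAQ.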

Where Semantic Caches Break

The threshold is necessary but not sufficient. The subtle failures are semantic:

Negation and small operators. "Show accounts that are active" and "show accounts that are not active" are near-identical in embedding space but opposite in meaning. Embeddings blur exactly the little words that flip an answer. This is the classic semantic-cache footgun.

Entity and number swaps. "revenue in Q1" versus "revenue in Q3," or "policy for Acme" versus "policy for Globex," sit close in vector space but must not share an answer. Naive semantic caching leaks one tenant's or one entity's answer to another.

Staleness. The cached answer was correct when stored. If the underlying documents change, the cache keeps serving the old answer. A cache with no invalidation is a bug that ages into a data-integrity problem.

Personalization and permissions. If the answer depends on who is asking (their tenant, their access level), a query-only cache key is wrong. The cache key has to include the authorization scope, or you serve someone content they should not see.

The general defense is to make the cache key richer than the query embedding alone: include tenant, user scope, and any structured filters, and reserve pure-semantic matching for genuinely global, non-personalized answers.
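One way to make the key richer is to partition the cache by an exact-match scope and only do semantic matching inside a partition. The sketch below assumes query vectors are already unit-length (so a dot product is cosine similarity); `scope_key`, `ScopedSemanticCache`, and the field names `tenant`, `scope`, and `filters` are hypothetical, chosen for illustration.

```python
import json
from collections import defaultdict
from typing import Optional, Sequence

Vector = Sequence[float]


def scope_key(tenant_id: str, access_scope: str, filters: dict) -> str:
    """Deterministic exact-match key; sort_keys makes filter order irrelevant."""
    return json.dumps(
        {"tenant": tenant_id, "scope": access_scope, "filters": filters},
        sort_keys=True,
    )


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


class ScopedSemanticCache:
    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        # scope key -> list of (query embedding, answer)
        self.partitions: dict[str, list[tuple[Vector, str]]] = defaultdict(list)

    def lookup(self, key: str, query_vec: Vector) -> Optional[str]:
        # Only entries stored under the identical scope are candidates at all.
        best_score, best_answer = -1.0, None
        for vec, answer in self.partitions.get(key, []):
            score = dot(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def store(self, key: str, query_vec: Vector, answer: str) -> None:
        self.partitions[key].append((query_vec, answer))
```

Putting structured values such as the quarter or the entity name into `filters` keeps a "Q1" entry from ever matching a "Q3" query, no matter how close the two embeddings are. That only works if those values are extracted into filters upstream; it does nothing for a negation buried in free text.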

Invalidation: The Hard Part

There are two easy strategies and one hard one:

TTL (time to live): every entry expires after N minutes or hours. Simple, and bounds staleness, but evicts good entries and keeps bad ones until they age out.

Event-based eviction: when a document is updated or deleted, evict cache entries in response. The crude version clears everything on any write (safe, but low hit rate). The hard version figures out which cached answers a specific data change actually affects and evicts only those. Doing that precisely is an open problem, because it requires knowing the provenance of every cached answer (which source chunks it depended on). A practical middle ground is to tag each cache entry with the document IDs that produced it, and evict entries whose sources changed.
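TTL expiry and the source-tagging middle ground can live in the same structure. The sketch below is illustrative only: entries are looked up by a plain string ID to keep the example short (in a real semantic cache the ID would come from the nearest-neighbor match), and `InvalidatingCache`, `on_document_changed`, and the injectable `clock` are hypothetical names, not an existing API.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Entry:
    answer: str
    source_doc_ids: frozenset  # documents this answer was derived from
    stored_at: float


class InvalidatingCache:
    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries: dict[str, Entry] = {}
        self.by_doc: dict[str, set[str]] = {}  # doc id -> ids of dependent entries

    def put(self, entry_id: str, answer: str, source_doc_ids: Iterable[str]) -> None:
        self._drop(entry_id)  # overwriting must not leave stale reverse links
        sources = frozenset(source_doc_ids)
        self.entries[entry_id] = Entry(answer, sources, self.clock())
        for doc_id in sources:
            self.by_doc.setdefault(doc_id, set()).add(entry_id)

    def get(self, entry_id: str) -> Optional[str]:
        entry = self.entries.get(entry_id)
        if entry is None:
            return None
        if self.clock() - entry.stored_at > self.ttl:  # TTL: bound staleness
            self._drop(entry_id)
            return None
        return entry.answer

    def on_document_changed(self, doc_id: str) -> int:
        """Evict only entries tagged with the changed document; return the count."""
        affected = list(self.by_doc.get(doc_id, ()))
        for entry_id in affected:
            self._drop(entry_id)
        return len(affected)

    def _drop(self, entry_id: str) -> None:
        entry = self.entries.pop(entry_id, None)
        if entry is None:
            return
        for doc_id in entry.source_doc_ids:
            ids = self.by_doc.get(doc_id)
            if ids is not None:
                ids.discard(entry_id)
                if not ids:
                    del self.by_doc[doc_id]
```

The reverse index `by_doc` is what makes targeted eviction cheap: a document update touches only the entries that cited it. It does not catch answers that a newly added document should have changed, which is part of why precise invalidation remains hard.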

This is why semantic caching is easiest on stable knowledge (product docs, policies, reference material) and hardest on fast-changing data (live inventory, prices, breaking events), where you either accept staleness or set aggressive TTLs.

Cache More Than Answers

The pattern generalizes past final answers:

Cache retrieval results. Store the retrieved document set for a query embedding, so a near-duplicate skips the retrieval stage but still runs a fresh generation. This is safer than caching the final answer because it re-reasons over current context while still saving the expensive search.

Cache tool and function-call results. An agent that repeatedly calls the same tool with semantically-equivalent arguments can cache by the argument embedding.

Cache multimodally. The query can be an image or an audio clip, not just text. "Find frames like this one" is cacheable by the image embedding the same way text is, which matters for agents doing repeated visual search over a library.
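As an illustration of caching a stage rather than the final answer, the sketch below caches only the retrieved document IDs and always runs generation fresh. It assumes unit-length query vectors supplied by the caller; `RetrievalCache`, `answer`, `retrieve`, and `generate` are hypothetical names standing in for your own retrieval and generation functions.

```python
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


class RetrievalCache:
    """Maps query embeddings to retrieved document IDs, not to final answers."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.entries: list[tuple[Vector, list[str]]] = []

    def lookup(self, query_vec: Vector) -> Optional[list[str]]:
        best_score, best_docs = -1.0, None
        for vec, doc_ids in self.entries:  # brute force; an ANN index in practice
            score = dot(query_vec, vec)
            if score > best_score:
                best_score, best_docs = score, doc_ids
        return best_docs if best_score >= self.threshold else None

    def store(self, query_vec: Vector, doc_ids: Sequence[str]) -> None:
        self.entries.append((query_vec, list(doc_ids)))


def answer(
    question: str,
    query_vec: Vector,
    cache: RetrievalCache,
    retrieve: Callable[[Vector], list[str]],
    generate: Callable[[str, list[str]], str],
) -> str:
    doc_ids = cache.lookup(query_vec)
    if doc_ids is None:
        doc_ids = retrieve(query_vec)  # expensive search runs only on a miss
        cache.store(query_vec, doc_ids)
    # Generation is never cached here: it sees the actual wording of this question.
    return generate(question, doc_ids)
```

The same structure could plausibly serve tool-call results by embedding a serialized form of the arguments instead of a user query, with the stored value being the tool's output; whether that is safe depends on the tool being free of side effects.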

Why This Matters for Agents Specifically

Human search is bursty and diverse; agent search is repetitive and parallel. An agent planning a task issues many overlapping sub-queries, and a swarm of agents on similar tasks issues many overlapping queries across sessions. That redundancy is exactly what a semantic cache eats. The payoff is not only cost: it is latency and consistency. Cache hits return in milliseconds, and reusing a vetted answer for equivalent queries makes the agent's behavior more consistent than re-deriving it fresh each time (with all the variance that entails). The caution is the mirror image: a wrong cached answer is now served fast and confidently to every near-duplicate, so the correctness bar on the threshold and the key design is higher for agents than for a human-in-the-loop app.

Where This Lands in Practice: Mixpeek

A semantic cache sits naturally in front of a retrieval layer, and it is built from the same primitives you already have: an embedding model and a vector index. On Mixpeek, the retriever execution path is where this fits: embed the incoming query, check it against a small index of recent query embeddings scoped by namespace and filters, and only run the full multi-stage retrieval pipeline on a miss. Because the cache is itself a vector search, it uses the same ANN machinery and the same similarity-score calibration as the main index, and it inherits the same invalidation concerns as any incremental index. The short version: cache by meaning, keep the key scoped to who and what, calibrate the threshold on your own data, and evict on change, not just on a clock.

Related guides

Calibrating Similarity Scores: What Cosine Similarity Actually Means for Retrieval

BM25 and the Inverted Index: The Lexical Retriever Every Hybrid Search Treats as a Black Box

Hybrid Search Fusion: How to Combine Dense and Lexical Retrieval Without Breaking Ranking