    Retrieval
    15 min read
    Updated 2026-07-02

    Semantic Caching: How Agents Skip Work They Have Already Done

    A vendor-neutral guide to caching by meaning instead of by exact string. Covers why hash-based caches almost never hit on agent traffic, how a semantic cache is really a tiny vector index of query embeddings, the similarity-threshold precision/recall tradeoff that makes or breaks it, the failure modes (false hits, staleness, negation and entity flips), invalidation strategies, and how to cache retrieval results and tool calls, not just answers, for agents that fan out many near-duplicate queries.

    Semantic Cache
    Retrieval
    Agent Infrastructure
    Embeddings
    Latency
    Cost

    Agents Ask the Same Thing a Hundred Ways



    Point an agent at a task and it does not ask one question; it asks dozens, because it decomposes, retries, re-derives, and re-checks. Put that agent in front of many users and the traffic is full of near-duplicates: the same intent phrased a hundred ways. "What is your refund policy," "how do refunds work," and "can I get my money back" are one question. By default, every one of them re-runs the whole expensive pipeline: embed, retrieve, rerank, and generate.

    A traditional cache does not help, because it keys on the exact string (or a hash of it). "how do refunds work" and "can I get my money back" hash to different keys, so the cache misses on both even though the answer is identical. The fix is to cache by meaning. This guide is about how that works, and, more importantly, where it goes wrong. The concepts are vendor-neutral; Mixpeek shows up at the end.
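
    To make the miss concrete, here is a minimal sketch of exact-match keying. The `exact_key` helper and its lowercase-and-collapse-whitespace normalization are illustrative assumptions, not any particular cache's API; the point is only that paraphrases never share a key.

```python
import hashlib


def exact_key(query: str) -> str:
    """Key an exact-match cache the usual way: hash the normalized string."""
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


queries = [
    "What is your refund policy",
    "how do refunds work",
    "can I get my money back",
]

# Each phrasing hashes to its own key, so each one is a separate cache miss.
keys = {exact_key(q) for q in queries}
print(len(keys))  # 3
```

    Normalization rescues trivial variants (casing, extra spaces) but nothing more: any change in wording produces a different key.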

    The Core Idea: A Cache That Is a Vector Index



    A semantic cache is, mechanically, a tiny vector search index of past queries:

  1. When a query comes in, embed it into a vector.
  2. Search the cache index for the nearest stored query embedding.
  3. If the top match's similarity is above a threshold, it is a cache hit: return the stored answer (or results) without running the real pipeline.
  4. Otherwise it is a miss: run the full pipeline, then store the new (query embedding, answer) pair so the next near-duplicate hits.

    That is the whole loop. On a hit, the only work is embedding the query and running a single approximate-nearest-neighbor lookup. The lookup itself takes microseconds to low milliseconds, versus a full retrieval-plus-generation round trip that can take hundreds of milliseconds and cost real money. On agent workloads with heavy near-duplication, hit rates of 30 to 70 percent are common, and every hit is latency and cost you did not pay.
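
    The loop can be sketched in a few lines. This is a toy under stated assumptions, not a production design: `toy_embed` is a bag-of-words stand-in for a real embedding model, the linear scan stands in for an ANN index, and `SemanticCache`, `answer_with_cache`, and `expensive_pipeline` are hypothetical names chosen for illustration.

```python
import math
from collections import Counter
from typing import Callable, Optional

Vector = dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in embedding (word counts). A real system would call an embedding model."""
    return dict(Counter(text.lower().split()))


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Vector], threshold: float) -> None:
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[Vector, str]] = []

    def lookup(self, query: str) -> Optional[str]:
        """Return the stored answer of the nearest past query, if similar enough."""
        q = self.embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:  # brute-force scan; an ANN index at scale
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def store(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))


def answer_with_cache(cache: SemanticCache, query: str,
                      pipeline: Callable[[str], str]) -> tuple[str, bool]:
    """Return (answer, was_hit); run the full pipeline only on a miss."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached, True
    answer = pipeline(query)
    cache.store(query, answer)
    return answer, False


calls: list[str] = []


def expensive_pipeline(query: str) -> str:
    """Hypothetical stand-in for embed + retrieve + rerank + generate."""
    calls.append(query)
    return f"answer derived for: {query}"


cache = SemanticCache(toy_embed, threshold=0.8)
print(answer_with_cache(cache, "how do refunds work", expensive_pipeline)[1])          # False
print(answer_with_cache(cache, "how do refunds work exactly", expensive_pipeline)[1])  # True
print(len(calls))  # 1
```

    The second query never reaches the pipeline. The 0.8 threshold here is arbitrary and only meaningful for the toy embedding.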

    The Threshold Is the Entire Ballgame



    Everything rides on one number: how similar is similar enough to reuse an answer. This is a precision versus recall dial, and both failure directions are bad:

  1. Threshold too loose (too low): you get false hits, returning a cached answer for a query that only looks similar. "What is the capital of France" and "what was the capital of France in 1789" embed close together but have different answers. A loose threshold serves the wrong one confidently. In a semantic cache, a false hit is not a slow response, it is a wrong response.
  2. Threshold too tight (too high): almost nothing hits, and you have paid for an embedding and an index lookup on every query for near-zero benefit.


    There is no universal right value, because it depends on your embedding model and your domain (legal and medical queries need tighter thresholds than casual FAQ). The practical method is to calibrate: take a labeled set of query pairs marked same-answer or different-answer, sweep the threshold, and pick the point where false-hit rate is acceptably low while hit rate is still worthwhile. Similarity scores are not probabilities and their scale differs per model, so calibrate on your own data rather than copying a magic number. The companion guide on calibrating similarity scores covers this directly.
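
    The calibration sweep can be sketched as follows. The labeled pairs below are made-up illustrative numbers, not measurements, and `LabeledPair`, `sweep`, and `pick_threshold` are hypothetical names. False-hit rate is defined here as the share of hits that are wrong (one minus precision); other definitions are possible.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LabeledPair:
    similarity: float  # score between two queries under your embedding model
    same_answer: bool  # human label: may the two queries share an answer?


def sweep(pairs: list[LabeledPair], thresholds: list[float]) -> list[dict]:
    """For each threshold, report how often the cache would hit and how often wrongly."""
    rows = []
    for t in thresholds:
        hits = [p for p in pairs if p.similarity >= t]
        false_hits = [p for p in hits if not p.same_answer]
        rows.append({
            "threshold": t,
            "hit_rate": len(hits) / len(pairs),
            "false_hit_rate": len(false_hits) / len(hits) if hits else 0.0,
        })
    return rows


def pick_threshold(rows: list[dict], max_false_hit_rate: float) -> Optional[float]:
    """Highest hit rate among thresholds whose false-hit rate is acceptable."""
    ok = [r for r in rows if r["false_hit_rate"] <= max_false_hit_rate]
    if not ok:
        return None
    return max(ok, key=lambda r: (r["hit_rate"], r["threshold"]))["threshold"]


# Illustrative labels only; replace with pairs sampled from your own traffic.
pairs = [
    LabeledPair(0.97, True), LabeledPair(0.93, True), LabeledPair(0.91, False),
    LabeledPair(0.88, True), LabeledPair(0.84, False), LabeledPair(0.79, True),
    LabeledPair(0.72, False), LabeledPair(0.60, False),
]
rows = sweep(pairs, thresholds=[0.70, 0.80, 0.90, 0.95])
for row in rows:
    print(row)
print(pick_threshold(rows, max_false_hit_rate=0.0))  # 0.95
```

    On this toy set, demanding zero false hits forces the threshold up to 0.95, where only one pair in eight still hits. That is the tradeoff in miniature: the acceptable false-hit rate is a product decision, and the threshold follows from it.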

    Where Semantic Caches Break



    The threshold is necessary but not sufficient. The subtle failures are semantic:

  1. Negation and small operators. "Show accounts that are active" and "show accounts that are not active" are near-identical in embedding space but opposite in meaning. Embeddings blur exactly the little words that flip an answer. This is the classic semantic-cache footgun.
  2. Entity and number swaps. "revenue in Q1" versus "revenue in Q3," or "policy for Acme" versus "policy for Globex," sit close in vector space but must not share an answer. Naive semantic caching leaks one tenant's or one entity's answer to another.
  3. Staleness. The cached answer was correct when stored. If the underlying documents change, the cache keeps serving the old answer. A cache with no invalidation is a bug that ages into a data-integrity problem.
  4. Personalization and permissions. If the answer depends on who is asking (their tenant, their access level), a query-only cache key is wrong. The cache key has to include the authorization scope, or you serve someone content they should not see.


    The general defense is to make the cache key richer than the query embedding alone: include tenant, user scope, and any structured filters, and reserve pure-semantic matching for genuinely global, non-personalized answers.
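
    One way to realize a richer key is to partition the cache by an exact-match scope and only do similarity matching inside a partition. The sketch below assumes that design; `make_scope`, `ScopedSemanticCache`, and the `token_overlap` similarity (a toy stand-in for embedding similarity) are hypothetical.

```python
from typing import Optional

Scope = tuple  # (tenant, sorted roles, sorted filter items)


def make_scope(tenant: str, roles: set[str], filters: dict[str, str]) -> Scope:
    """Everything besides the query text that can change the answer."""
    return (tenant, tuple(sorted(roles)), tuple(sorted(filters.items())))


def token_overlap(a: str, b: str) -> float:
    """Toy similarity (Jaccard over tokens); a stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


class ScopedSemanticCache:
    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        # One small index per scope; entries never match across scopes.
        self.partitions: dict[Scope, list[tuple[str, str]]] = {}

    def store(self, scope: Scope, query: str, answer: str) -> None:
        self.partitions.setdefault(scope, []).append((query, answer))

    def lookup(self, scope: Scope, query: str) -> Optional[str]:
        best_score, best_answer = 0.0, None
        for stored_query, answer in self.partitions.get(scope, []):
            score = token_overlap(query, stored_query)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


cache = ScopedSemanticCache(threshold=0.8)
acme_q1 = make_scope("acme", {"analyst"}, {"quarter": "Q1"})
acme_q3 = make_scope("acme", {"analyst"}, {"quarter": "Q3"})
globex_q1 = make_scope("globex", {"analyst"}, {"quarter": "Q1"})

cache.store(acme_q1, "show revenue by region", "acme Q1 revenue table")
print(cache.lookup(acme_q1, "show revenue by region"))    # acme Q1 revenue table
print(cache.lookup(acme_q3, "show revenue by region"))    # None
print(cache.lookup(globex_q1, "show revenue by region"))  # None
```

    Because the scope is compared exactly, an identical query from another tenant or with another filter can never hit, however close the embeddings are. This only protects against swaps that are captured as structured fields; an entity named only in free text still depends on the threshold.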

    Invalidation: The Hard Part



    There are two easy strategies and one hard one:

  1. TTL (time to live): every entry expires after N minutes or hours. Simple, and bounds staleness, but evicts good entries and keeps bad ones until they age out.
  2. Coarse event-based eviction: when any document is updated or deleted, clear the whole cache. Safe, but the hit rate suffers.
  3. Targeted event-based eviction (the hard one): figure out which cached answers a specific data change actually affects, and evict only those. Doing that precisely is an open problem, because it requires knowing the provenance of every cached answer (which source chunks it depended on). A practical middle ground is to tag each cache entry with the document IDs that produced it, and evict entries whose sources changed.


    This is why semantic caching is easiest on stable knowledge (product docs, policies, reference material) and hardest on fast-changing data (live inventory, prices, breaking events), where you either accept staleness or set aggressive TTLs.
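
    TTL expiry and document-ID tagging combine naturally in one store. The sketch below assumes the similarity lookup has already resolved a query to an `entry_id`, and shows only the invalidation side; `InvalidatingStore` and its method names are hypothetical, and the injectable clock exists so the behavior can be demonstrated without waiting.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class CacheEntry:
    answer: str
    source_doc_ids: frozenset[str]  # provenance: documents the answer was built from
    stored_at: float


class InvalidatingStore:
    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.entries: dict[str, CacheEntry] = {}

    def put(self, entry_id: str, answer: str, source_doc_ids: Iterable[str]) -> None:
        self.entries[entry_id] = CacheEntry(answer, frozenset(source_doc_ids), self.clock())

    def get(self, entry_id: str) -> Optional[str]:
        entry = self.entries.get(entry_id)
        if entry is None:
            return None
        if self.clock() - entry.stored_at > self.ttl_seconds:  # expired by the clock
            del self.entries[entry_id]
            return None
        return entry.answer

    def on_document_changed(self, doc_id: str) -> int:
        """Evict every entry that depended on the changed document; return the count."""
        stale = [eid for eid, e in self.entries.items() if doc_id in e.source_doc_ids]
        for eid in stale:
            del self.entries[eid]
        return len(stale)


now = [0.0]  # fake clock so the example is deterministic
store = InvalidatingStore(ttl_seconds=60, clock=lambda: now[0])
store.put("refund-policy", "answer A", {"doc-1", "doc-2"})
store.put("shipping-cost", "answer B", {"doc-3"})

print(store.on_document_changed("doc-2"))  # 1
print(store.get("refund-policy"))          # None
print(store.get("shipping-cost"))          # answer B
now[0] = 61.0
print(store.get("shipping-cost"))          # None
```

    The scan in `on_document_changed` is linear in the number of entries; a reverse map from document ID to entry IDs would avoid that at the cost of more bookkeeping.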

    Cache More Than Answers



    The pattern generalizes past final answers:

  1. Cache retrieval results. Store the retrieved document set for a query embedding, so a near-duplicate skips the retrieval stage but still runs a fresh generation. This is safer than caching the final answer because it re-reasons over current context while still saving the expensive search.
  2. Cache tool and function-call results. An agent that repeatedly calls the same tool with semantically equivalent arguments can cache by the argument embedding.
  3. Cache multimodally. The query can be an image or an audio clip, not just text. "Find frames like this one" is cacheable by the image embedding the same way text is, which matters for agents doing repeated visual search over a library.
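
    The first variant, caching retrieval results while still generating fresh, can be sketched like this. The `retrieve` and `generate` callables are hypothetical stand-ins for your own pipeline stages, and `token_overlap` is again a toy substitute for embedding similarity.

```python
from typing import Callable, Optional


def token_overlap(a: str, b: str) -> float:
    """Toy similarity (Jaccard over tokens); a stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


class RetrievalCache:
    """Caches retrieved document IDs per query; it never stores final answers."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.entries: list[tuple[str, list[str]]] = []

    def lookup(self, query: str) -> Optional[list[str]]:
        best_score, best_docs = 0.0, None
        for stored_query, doc_ids in self.entries:
            score = token_overlap(query, stored_query)
            if score > best_score:
                best_score, best_docs = score, doc_ids
        return best_docs if best_score >= self.threshold else None

    def store(self, query: str, doc_ids: list[str]) -> None:
        self.entries.append((query, list(doc_ids)))


def answer(query: str, cache: RetrievalCache,
           retrieve: Callable[[str], list[str]],
           generate: Callable[[str, list[str]], str]) -> str:
    doc_ids = cache.lookup(query)
    if doc_ids is None:  # miss: pay for the search once, then remember the result
        doc_ids = retrieve(query)
        cache.store(query, doc_ids)
    return generate(query, doc_ids)  # generation always runs on the current query


retrieve_calls: list[str] = []


def retrieve(query: str) -> list[str]:
    retrieve_calls.append(query)
    return ["doc-refunds", "doc-returns"]


def generate(query: str, doc_ids: list[str]) -> str:
    return f"{query} -> grounded in {', '.join(doc_ids)}"


cache = RetrievalCache(threshold=0.75)
print(answer("how do refunds work", cache, retrieve, generate))
print(answer("how do refunds work exactly", cache, retrieve, generate))
print(len(retrieve_calls))  # 1
```

    The second call reuses the stored document set but produces its own generation, so a near-duplicate query with a different nuance still gets an answer written for its actual wording.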


    Why This Matters for Agents Specifically



    Human search is bursty and diverse; agent search is repetitive and parallel. An agent planning a task issues many overlapping sub-queries, and a swarm of agents on similar tasks issues many overlapping queries across sessions. That redundancy is exactly what a semantic cache eats. The payoff is not only cost: it is latency and consistency. Cache hits return in milliseconds, and reusing a vetted answer for equivalent queries makes the agent's behavior more consistent than re-deriving it fresh each time (with all the variance that entails). The caution is the mirror image: a wrong cached answer is now served fast and confidently to every near-duplicate, so the correctness bar on the threshold and the key design is higher for agents than for a human-in-the-loop app.

    Where This Lands in Practice: Mixpeek



    A semantic cache sits naturally in front of a retrieval layer, and it is built from the same primitives you already have: an embedding model and a vector index. On Mixpeek, the retriever execution path is where this fits: embed the incoming query, check it against a small index of recent query embeddings scoped by namespace and filters, and only run the full multi-stage retrieval pipeline on a miss. Because the cache is itself a vector search, it uses the same ANN machinery and the same similarity-score calibration as the main index, and it inherits the same invalidation concerns as any incremental index. The short version: cache by meaning, keep the key scoped to who and what, calibrate the threshold on your own data, and evict on change, not just on a clock.

    Related guides

    Retrieval

    Calibrating Similarity Scores: What Cosine Similarity Actually Means for Retrieval

    A first-principles guide to similarity scores in vector search: what cosine similarity computes, why a raw score is not a confidence, why thresholds do not transfer across models or modalities, and how to calibrate -- per-model thresholds, score normalization, and Platt/isotonic mapping to probabilities -- so an AI agent can decide when a retrieved result is actually good enough to act on.

    Retrieval

    BM25 and the Inverted Index: The Lexical Retriever Every Hybrid Search Treats as a Black Box

    Every hybrid search pipeline pairs dense vectors with BM25, but almost no one can say where the BM25 number actually comes from, which is exactly why fusion, tuning, and exact-match failures stay mysterious. This guide opens the box: how an inverted index turns transcripts and OCR text into posting lists, the precise BM25 scoring formula with its term-frequency saturation and length normalization, what the k1 and b parameters really do, and why the tokenizer is the silent decider of whether an agent ever finds a serial number.

    Retrieval

    Hybrid Search Fusion: How to Combine Dense and Lexical Retrieval Without Breaking Ranking

    An agent searching transcripts, OCR text, and captions needs both meaning (dense vectors) and exact terms (BM25), but the two return scores on incompatible scales that you cannot simply add. This guide teaches the real fusion mechanics: why score distributions make naive normalization fail, the exact math of Reciprocal Rank Fusion and how its k parameter behaves, weighted convex combination with proper normalization, and how to choose and tune a fusion method against a labeled set.
