Filtered Vector Search: How Agents Combine Similarity with Hard Constraints

The Query Agents Actually Send

An AI agent almost never asks for "the nearest vectors." It asks for the nearest vectors that also satisfy a structured condition: clips from one campaign after a date, images from one bucket in one region, documents owned by one tenant with a compliance flag set. The vector half expresses what the content looks or sounds like; the filter half expresses a hard business or safety constraint that must hold.

This combination -- approximate nearest neighbor search over an unstructured embedding, gated by an exact predicate over structured metadata -- is the most common shape of real retrieval. It is also the place where systems most quietly go wrong. A filter that looks harmless can drop recall from 95% to 30% with no error, no warning, and no change to the API call. Understanding why is the difference between an agent that reasons over complete evidence and one that hallucinates because its retrieval silently returned almost nothing.

Selectivity Is the Hidden Variable

The single number that governs every decision below is selectivity: the fraction of the corpus that passes the filter.

selectivity = (rows matching the filter) / (total rows)

A filter on \(tenant\_id\) in a system with 10,000 tenants might have selectivity 0.0001 (very selective). A filter on \(language = "en"\) might have selectivity 0.7 (barely selective). The same filtering strategy that is optimal at 0.7 is catastrophic at 0.0001, and vice versa. There is no single correct way to do filtered search; there is only the right strategy for the selectivity of this particular query.

Keep that in mind through all three strategies below.

Strategy 1: Post-Filter (Search, Then Drop)

Run ordinary ANN search over the full index, get the top candidates, then throw away the ones that fail the filter.

candidates = ann_search(query, k=K)        # ignores the filter
results    = [c for c in candidates if passes(c, filter)][:k]

This is the default in many systems because it requires nothing special from the index -- the ANN graph or cells never even see the predicate.

Why It Breaks at Low Selectivity

The ANN search returns the \(K\) globally nearest vectors. If only a tiny fraction of the corpus passes the filter, almost none of those \(K\) will survive. Suppose selectivity is 0.001 and you fetch \(K = 1000\). On average, exactly one candidate passes. You asked for \(k = 20\) and got 1. The agent sees a near-empty result and concludes the content does not exist.

The standard mitigation is over-fetch: retrieve \(K' = k / selectivity\) candidates so that, in expectation, \(k\) survive. But this fails in two ways. First, you usually do not know selectivity ahead of time. Second, when selectivity is 0.0001 you would need to fetch \(K' = 200{,}000\) candidates to expect 20 survivors -- at which point you have abandoned approximate search and are nearly scanning the corpus. Post-filter degrades to brute force exactly when the filter is most useful.

When Post-Filter Is Correct

When the filter is weak (selectivity above ~0.5), almost everything passes, over-fetch is cheap, and post-filter is the simplest and fastest option. A small constant over-fetch (fetch \(2k\) or \(3k\)) absorbs the attrition with negligible cost.

Strategy 2: Pre-Filter (Restrict, Then Search)

Compute the set of IDs that pass the filter first, then search only within that set.

allowed = metadata_index.lookup(filter)    # e.g. an inverted index or B-tree
results = brute_force_knn(query, restricted_to=allowed)[:k]

At very low selectivity this is wonderful: if only 500 rows match, a brute-force distance scan over 500 vectors is trivially fast and returns exact, perfect-recall results. No graph, no approximation, no attrition.

Why It Breaks at High Selectivity

Pre-filter discards the ANN index entirely and falls back to scanning the allowed set. If the filter passes 40% of a 100-million-vector corpus, "the allowed set" is 40 million vectors, and a brute-force scan over 40 million vectors is the 40-second nightmare that ANN existed to avoid. Pre-filter is exact and fast when the allowed set is small, and unusably slow when it is large.

A subtler trap: some systems try to keep using the ANN index over the restricted set by masking -- traversing the normal HNSW graph but ignoring nodes that fail the filter. That brings us to the real problem.

The Graph-Connectivity Problem

Here is the failure that surprises people. You have an HNSW index. You want filtered results, so you traverse the graph as usual but only keep nodes that pass the filter. Surely that gives correct results?

It does not. HNSW finds neighbors by greedy graph walks: from the current node, you hop to whichever neighbor is closest to the query, repeat, and stop when you can no longer improve. The graph was built over the whole corpus. The path from the entry point to a relevant filter-passing node almost always runs through nodes that fail the filter.

If you refuse to traverse failing nodes, you cut those bridges. The greedy walk gets stranded in a region of the graph it can reach, which may be far from the true nearest filtered neighbors. Recall collapses, and it collapses worse as selectivity drops, because the filter-passing nodes become a sparser and sparser island inside a graph whose connectivity assumed every node was reachable. Measured recall of 30-60% at moderate selectivity is typical for naive filtered HNSW traversal, and the API gives no hint that anything is wrong.

Strategy 3: In-Place Predicate-Aware Traversal

The modern fix keeps using the index but makes the traversal aware of the predicate, so connectivity is preserved while the result set is constrained. Two families dominate.

Filtered DiskANN / Filtered HNSW with Routing Through Failing Nodes

The key idea: traverse through filter-failing nodes for navigation, but only admit filter-passing nodes into the result heap. A failing node is still a useful stepping stone; it just is not an answer. This restores the bridges the naive mask cut, recovering high recall while still returning only valid results.

function filtered_search(query, filter, k, ef):
    visited   = {}
    frontier  = {entry_point}        # ordered by distance to query
    results   = {}                   # only filter-passing nodes
    while frontier not empty and frontier.best improves results.worst:
        node = frontier.pop_closest()
        for nbr in neighbors(node):
            if nbr in visited: continue
            visited.add(nbr)
            d = distance(query, nbr)
            frontier.push(nbr, d)            # route through ANY node
            if passes(nbr, filter):          # but only admit matches
                results.push(nbr, d)
    return results.top(k)

The cost is a larger effective \(ef\): because many visited nodes are discarded, you must explore more of the graph to collect \(k\) passing results. Systems compensate by scaling \(ef\) up roughly in proportion to \(1 / selectivity\), with a ceiling that triggers a fall-back to pre-filter when selectivity gets extreme.

ACORN-Style Predicate-Aware Neighbor Expansion

ACORN and similar designs go further: instead of treating the existing graph as fixed, they expand a node's neighbor list at query time to include two-hop neighbors when the immediate neighbors all fail the filter. Conceptually, this constructs a filter-specific subgraph on the fly that stays connected even when passing nodes are sparse. The index is built once, predicate-agnostic, but every query can synthesize the connectivity it needs for its specific filter. This gives high recall across a wide selectivity range without per-filter indexes, at the cost of more distance computations per hop.

Building Filters Into the Structure

A third option pushes the predicate into the index layout itself. Partitioned indexes build a separate ANN structure per value of a low-cardinality, frequently-filtered field (for example, one HNSW graph per tenant or per region). A filtered query routes to the right partition and runs ordinary unfiltered search inside it -- perfect connectivity, full recall, no masking. The trade-off is combinatorial: this only works for one or a few filter fields with limited cardinality, and arbitrary predicates over other fields still need one of the runtime strategies above.

A Query Planner, Not a Strategy

Because the right choice depends on selectivity, mature systems do not pick one strategy. They estimate selectivity and route:

Estimated selectivity

Best strategy

Why

> ~0.5 (weak filter)	Post-filter with small over-fetch	Almost everything passes; ANN is already near-optimal
~0.01 to ~0.5 (moderate)	In-place predicate-aware traversal	Preserves graph connectivity; recall stays high
< ~0.01 (very selective)	Pre-filter + brute force over the allowed set	Allowed set is small enough to scan exactly

Selectivity itself is estimated the way relational databases estimate cardinality: histograms and per-value counts over the metadata columns, refreshed as data changes. A predicate combining several conditions multiplies (for independent ANDs) or unions (for ORs) the component selectivities. The planner then compares the estimated cost of each strategy and routes the query. This is exactly the cost-based optimization a SQL engine does, applied to the vector-plus-predicate world.

The reason this matters for agents specifically: an agent issues many heterogeneous queries in one reasoning chain, and their selectivities vary wildly. A fixed filtering strategy is guaranteed to be wrong for a large fraction of them. Only a planner that adapts per query keeps recall and latency stable across the whole chain.

Correctness Traps Worth Naming

Silent recall loss. The deadliest property of filtered search is that wrong strategies fail without errors. Always measure recall@k on filtered queries against a brute-force ground truth, not just on unfiltered ones.

Over-fetch that is not enough. A constant over-fetch factor (always fetch \(10k\)) looks like a fix but breaks the moment a query is more selective than the factor assumed. Tie over-fetch to estimated selectivity, or use predicate-aware traversal instead.

OR predicates explode the allowed set. ANDs make filters more selective; ORs make them less. A query plan tuned for selective ANDs can fall off a cliff when an agent issues an OR of many values.

Stale statistics mislead the planner. If histograms lag behind a fast-changing corpus, the planner mis-estimates selectivity and picks the wrong strategy. Filter planning is only as good as the freshness of its metadata statistics.

How This Looks in Mixpeek

When an agent searches a Mixpeek namespace, the filters it passes alongside the vector query are not applied as a naive post-filter. The retriever estimates predicate selectivity from namespace statistics and routes the query to the appropriate strategy, so that a highly selective tenant or date filter does not silently starve the result set, and a weak language filter does not pay for unnecessary graph exploration.

from mixpeek import Mixpeek

client = Mixpeek(api_key="YOUR_KEY")

# Agent query: "red sports car shots from the spring campaign, EU only"
# Vector half (what it looks like) + filter half (hard constraints).
results = client.search(
    namespace="ad-creative",
    queries=[
        {"type": "text", "value": "red sports car on a mountain road", "field_name": "embedding"}
    ],
    filters={
        "AND": [
            {"field": "campaign_id", "operator": "eq", "value": "spring-2026"},
            {"field": "region", "operator": "eq", "value": "EU"},
            {"field": "created_at", "operator": "gte", "value": "2026-03-01"}
        ]
    },
    limit=20
)

for r in results:
    print(f"{r['score']:.4f}  {r['file_name']}")

The agent expresses intent declaratively; the system decides whether campaign_id = spring-2026 AND region = EU is selective enough to favor pre-filtering, or weak enough to favor in-place traversal. The agent gets stable recall without having to reason about index internals -- which is exactly what lets it trust that an empty result means "nothing matches" rather than "the filter strategy failed."

Key Takeaways

1. Real agent queries are vector similarity plus a hard predicate, and the predicate's selectivity determines everything.

2. Post-filter is right for weak filters and catastrophic for selective ones; pre-filter is right for selective filters and catastrophic for weak ones.

3. Naively masking a shared HNSW graph cuts the bridges the greedy walk needs, collapsing recall silently. Route through failing nodes but admit only passing ones.

4. In-place predicate-aware traversal (filtered DiskANN, ACORN-style expansion) keeps connectivity across a wide selectivity range from a single predicate-agnostic index.

5. Production systems behave like a cost-based query planner: estimate selectivity from metadata statistics, then pick the strategy per query. A fixed strategy is wrong for a large share of an agent's heterogeneous queries.

The Query Agents Actually Send

Selectivity Is the Hidden Variable

Strategy 1: Post-Filter (Search, Then Drop)

Why It Breaks at Low Selectivity

When Post-Filter Is Correct

Strategy 2: Pre-Filter (Restrict, Then Search)

Why It Breaks at High Selectivity

The Graph-Connectivity Problem

Strategy 3: In-Place Predicate-Aware Traversal

Filtered DiskANN / Filtered HNSW with Routing Through Failing Nodes

ACORN-Style Predicate-Aware Neighbor Expansion

Building Filters Into the Structure

A Query Planner, Not a Strategy

Correctness Traps Worth Naming

How This Looks in Mixpeek

Key Takeaways

Further Reading

Put multimodal search to work

Already have vectors?

Run this on your own data

Related guides

Index Freshness and Incremental Updates: How Just-Ingested Content Becomes Searchable

Calibrating Similarity Scores: What Cosine Similarity Actually Means for Retrieval

Approximate Nearest Neighbor Search: The Algorithms Behind Fast Vector Retrieval