Index Freshness and Incremental Updates: How Just-Ingested Content Becomes Searchable

Why Freshness Is an Agent Problem

An AI agent that ingests unstructured content has a strict expectation that human search products usually do not: it wants to retrieve what it just stored. A perception agent transcribes a meeting, then immediately asks "what did the CFO say about Q3 guidance." A research agent ingests a PDF, then queries it in the same reasoning chain. A monitoring agent indexes a new camera clip, then checks whether a similar event happened in the last minute.

In all of these cases the content was created seconds ago. If the search index has not absorbed it yet, the agent gets a wrong answer that looks confident. It does not see an error. It sees an empty result set or stale neighbors, and it reasons forward from incomplete evidence. Index freshness is the property that decides whether the agent can see, hear, and search what it just produced.

Freshness is not free. The data structures that make approximate nearest neighbor search fast (graphs, inverted lists, quantization codebooks) are expensive to mutate. The whole engineering problem is reconciling two opposing forces:

Read efficiency wants a large, well-optimized, immutable index.

Write freshness wants every new vector visible immediately, with no rebuild.

This guide explains how production systems resolve that tension, and what each design choice costs an agent.

Defining the Freshness Metrics

Before tuning anything, name the quantities you are trading.

Indexing latency (visibility lag): the time from "content accepted" to "content returned by a query that should match it." This is the number agents care about most.

Ingest throughput: how many items per second the system can absorb without falling behind.

Query recall: the fraction of true neighbors returned. Freshness tricks often degrade recall on the most recent data first.

Write amplification: how many times a single ingested vector gets physically rewritten before it reaches its final resting structure. High write amplification burns CPU and IO and, for GPU-extracted embeddings, can quietly re-pay extraction cost if a pipeline re-derives features during a rebuild.

A system that claims "real-time indexing" has made a specific choice on each of these. There is no design that maximizes all four at once.
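The four quantities above can be computed from ordinary ingest logs. The following is a minimal sketch, assuming each ingested item is logged with an acceptance time, a first-visible time, and a count of physical writes; the `IngestRecord` fields and helper names are hypothetical and not part of any particular store's API.

```python
import math
from dataclasses import dataclass


@dataclass
class IngestRecord:
    accepted_at: float        # seconds: when the store acknowledged the write
    first_visible_at: float   # seconds: when a matching query first returned it
    physical_writes: int      # times the vector was written (memtable, seal, merges)


def visibility_lags(records):
    """Per-item indexing latency: accepted -> first returned by a query."""
    return [r.first_visible_at - r.accepted_at for r in records]


def percentile(values, pct):
    """Nearest-rank percentile; adequate for a dashboard-level sketch."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def write_amplification(records):
    """Average number of physical writes per logically ingested vector."""
    return sum(r.physical_writes for r in records) / len(records)


def recall_at_k(returned_ids, true_ids):
    """Fraction of the true top-k neighbors that the index returned."""
    return len(set(returned_ids) & set(true_ids)) / len(true_ids)


records = [
    IngestRecord(0.0, 0.2, 3),
    IngestRecord(1.0, 1.5, 4),
    IngestRecord(2.0, 4.0, 5),
]
lags = visibility_lags(records)
p50_lag = percentile(lags, 50)
p95_lag = percentile(lags, 95)
amplification = write_amplification(records)
```

Tracking a high percentile of visibility lag, not the mean, is usually the more honest number for an agent, because a single slow item is enough to produce an empty result set.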

The Core Pattern: LSM-Style Segments

The dominant architecture for fresh vector search borrows from log-structured merge trees, the same idea behind RocksDB, Cassandra, and Lucene.

The index is not one monolithic structure. It is a set of segments:

1. Writable segment (memtable). A small, in-memory structure that accepts new vectors with cheap inserts. New content lands here first and becomes queryable almost immediately.

2. Sealed segments. When the writable segment reaches a size or age threshold, it is sealed (made immutable) and a new writable segment opens. Sealed segments are optimized for read performance.

3. Large base segments. Background processes merge many sealed segments into fewer large ones, rebuilding the ANN structure for better recall and lower per-query overhead.

A query fans out across every segment, gathers top-k candidates from each, and merges the results:

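import heapq

def query(q, k, all_segments):
    """Fan out to every segment, then merge the per-segment top-k lists."""
    candidates = []
    for segment in all_segments:        # writable + sealed + base
        # segment.search is assumed to return (distance, id) pairs
        candidates.extend(segment.search(q, k))
    return heapq.nsmallest(k, candidates)  # global top-k by distance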

This is why freshness is achievable at all. The writable segment is tiny, so even a brute-force or lightly-indexed scan over it is fast, and it makes brand-new content visible without touching the large optimized segments. The large segments carry the bulk of the corpus and are rebuilt rarely.

The cost is query fan-out: more segments means more sub-searches to merge. A system that never compacts ends up with thousands of tiny segments and slow queries. A system that compacts too aggressively spends all its CPU rebuilding. Compaction policy is the dial between freshness and steady-state query cost.
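To make the segment lifecycle concrete, here is a toy sketch of the pattern, assuming brute-force search inside every segment and a seal threshold counted in vectors. The `Segment` and `SegmentedIndex` classes are illustrative names only; a real store would build an ANN structure when sealing and would also merge sealed segments in the background.

```python
import heapq
import math


class Segment:
    """A bag of (id, vector) pairs searched by brute force."""

    def __init__(self):
        self.items = []
        self.sealed = False

    def add(self, vec_id, vector):
        if self.sealed:
            raise RuntimeError("sealed segments are immutable")
        self.items.append((vec_id, vector))

    def search(self, q, k):
        # Returns up to k (distance, id) pairs, closest first.
        scored = ((math.dist(q, v), vec_id) for vec_id, v in self.items)
        return heapq.nsmallest(k, scored)


class SegmentedIndex:
    def __init__(self, seal_threshold=2):
        self.seal_threshold = seal_threshold
        self.writable = Segment()
        self.sealed = []

    def insert(self, vec_id, vector):
        self.writable.add(vec_id, vector)      # visible to the very next query
        if len(self.writable.items) >= self.seal_threshold:
            self.writable.sealed = True        # freeze it, open a fresh one
            self.sealed.append(self.writable)
            self.writable = Segment()

    def query(self, q, k):
        candidates = []
        for segment in [self.writable, *self.sealed]:   # query fan-out
            candidates.extend(segment.search(q, k))
        return [vec_id for _, vec_id in heapq.nsmallest(k, candidates)]


index = SegmentedIndex(seal_threshold=2)
index.insert("a", (0.0, 0.0))
index.insert("b", (10.0, 10.0))   # reaching the threshold seals the first segment
index.insert("c", (1.0, 1.0))     # lands in the new writable segment
nearest = index.query((0.9, 0.9), k=2)
```

The vector `"c"` is returned even though it never entered a sealed segment, which is the freshness property in miniature. The loop in `query` is also where fan-out cost appears: every additional segment adds one more sub-search.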

Incremental Inserts in HNSW

The graph-based index HNSW (Hierarchical Navigable Small World) is naturally insert-friendly, which is why it dominates fresh-search workloads. Inserting a vector does not require a rebuild:

1. Assign the new node a random maximum layer (drawn from an exponential distribution, so most nodes live only on the bottom layer).

2. Greedily descend from the top entry point to find the nearest neighbors at each layer.

3. At each layer up to the node's max, connect it to its M closest neighbors and add back-links.

4. Prune over-full neighbor lists using the heuristic that keeps diverse, navigable connections rather than just the closest ones.
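Step 1 is the easiest part to show in isolation. The sketch below uses the layer formula from the original HNSW paper, floor(-ln(U) * mL) with the normalization mL = 1 / ln(M); the value M = 16 and the function name are illustrative assumptions, and individual libraries may parameterize this differently.

```python
import math
import random
from collections import Counter


def random_max_layer(m, rng):
    """Draw a node's top layer: floor(-ln(U) * mL) with mL = 1 / ln(M)."""
    m_l = 1.0 / math.log(m)
    u = 1.0 - rng.random()          # in (0, 1], which avoids log(0)
    return int(-math.log(u) * m_l)


rng = random.Random(0)
layers = Counter(random_max_layer(16, rng) for _ in range(10_000))
bottom_only = layers[0] / 10_000    # share of nodes that never leave layer 0
```

With mL = 1 / ln(M), the probability that a node stays on the bottom layer is 1 - 1/M, so for M = 16 roughly 94% of nodes are expected to live only on layer 0. That is why the upper layers stay sparse, and also why a handful of early inserts can end up shaping them.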

The cost is roughly \(O(M \cdot \log n)\) per insert, which is cheap enough for streaming. But two problems accumulate slowly:

Entry-point drift. Early inserts shape the upper-layer graph. As the distribution of ingested content shifts (a new camera angle, a new document language, a new product category), the upper layers can become poorly representative, hurting recall on recent data. This is one reason periodic full rebuilds still matter even with incremental inserts.

Graph degradation under churn. Heavy insert-and-delete cycles fragment the neighbor lists and leave dangling or suboptimal links. Navigability degrades silently. The fix is background re-optimization, not a runtime flag.

The Delete Problem and Tombstones

Deletes are far harder than inserts in graph indexes. Removing a node tears holes in the navigation graph: its neighbors lose a hop they relied on, and the greedy search can get stranded. Physically repairing the graph on every delete is too expensive for high-churn workloads.

The near-universal answer is the tombstone, a soft delete:

1. Mark the vector as deleted with a flag in its payload or a deleted-id bitmap. Leave it physically in the graph.

2. At query time, retrieve candidates from the ANN structure as usual, then filter out any tombstoned ids before returning results to the agent.

3. During background compaction, physically drop tombstoned vectors when the segment is rebuilt, reclaiming memory and removing them from the graph for good.
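The three steps can be sketched in a few lines. This is an illustration under simplifying assumptions: a dictionary with brute-force ranking stands in for the ANN graph, a Python set stands in for the deleted-id bitmap, and the `overfetch` parameter and class name are hypothetical.

```python
import math


class TombstonedSegment:
    def __init__(self, vectors):
        self.vectors = dict(vectors)   # id -> vector; stands in for the graph
        self.deleted = set()           # stands in for a deleted-id bitmap

    def delete(self, vec_id):
        self.deleted.add(vec_id)       # O(1); the vector stays physically present

    def search(self, q, k, overfetch=2):
        ranked = sorted(self.vectors, key=lambda i: math.dist(q, self.vectors[i]))
        raw = ranked[: k * overfetch]  # raw candidates, tombstones included
        return [i for i in raw if i not in self.deleted][:k]

    def compact(self):
        """Rebuild without tombstoned vectors, then clear the tombstones."""
        self.vectors = {i: v for i, v in self.vectors.items() if i not in self.deleted}
        self.deleted.clear()


segment = TombstonedSegment({"a": (0, 0), "b": (1, 1), "c": (2, 2), "d": (3, 3)})
segment.delete("b")
hits = segment.search((0, 0), k=2)
size_before_compaction = len(segment.vectors)
segment.compact()
size_after_compaction = len(segment.vectors)
```

Between `delete` and `compact`, the deleted vector is invisible to the caller yet still stored and still ranked.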

Tombstones make deletes \(O(1)\) and keep the graph intact, but they create two non-intuitive costs an agent operator must understand:

Over-retrieval. If a segment is 40% tombstoned, a query for top-10 must fetch more than 10 raw candidates to survive filtering: about 17 on average if deletions are spread evenly, and more if they cluster near the query. Systems compensate by searching with a larger ef (the HNSW search-time candidate list size) or a wider candidate pool, which raises latency. A "deleted" document that you cannot see in results is still costing you query work until compaction runs.

Stale memory and recall drift. Tombstoned vectors still occupy RAM and still participate in graph navigation, so a heavily-churned index can be large and slow even though its logical size is small. This is exactly the kind of surprise that looks like "search got slow for no reason" until you check the tombstone ratio.
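The over-retrieval cost follows from simple arithmetic: if a fraction r of a segment is tombstoned, only about (1 - r) of the raw candidates survive filtering, so roughly k / (1 - r) must be fetched. The helper below is a back-of-the-envelope sketch that assumes deletions are spread evenly across the segment; the function name and the `safety` multiplier are illustrative.

```python
import math


def raw_candidates_needed(k, tombstone_ratio, safety=1.0):
    """Raw candidates to fetch so about k survive tombstone filtering."""
    if not 0.0 <= tombstone_ratio < 1.0:
        raise ValueError("tombstone_ratio must be in [0, 1)")
    return math.ceil(k * safety / (1.0 - tombstone_ratio))


clean = raw_candidates_needed(10, 0.0)
churned = raw_candidates_needed(10, 0.4)
half_dead = raw_candidates_needed(10, 0.5)
```

When deletions cluster around the region being queried, which is common when an agent replaces near-duplicate content, the even-spread assumption underestimates the need and a `safety` value above 1.0 is the cheap hedge.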

Background Compaction and Streaming Merge

Compaction is the janitor that pays down the debt that inserts and tombstones accumulate. It runs off the query path and does three jobs:

1. Merge small segments into larger ones to cut query fan-out.

2. Purge tombstones by physically rebuilding without the deleted vectors.

3. Re-optimize the graph so recent inserts get clean, navigable links.

For disk-resident indexes the canonical design is a streaming merge (popularized by FreshDiskANN): new vectors go into a small in-memory delta graph for instant visibility, deletes are recorded as tombstones, and a background process periodically folds the delta and the tombstones into the large on-disk graph. The agent always queries the union of the on-disk graph and the in-memory delta, so it sees fresh content immediately while the expensive merge happens asynchronously. In-place update schemes such as SPFresh push this further by patching the existing structure rather than rebuilding whole partitions, trading implementation complexity for lower write amplification at billion scale.

The operational lesson: compaction is not a tuning detail you can ignore. If it falls behind, freshness, recall, latency, and memory all degrade together. Monitor the segment count and tombstone ratio the way you monitor disk space.
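One way to act on segment count and tombstone ratio is a small planning function that a background worker runs periodically. The sketch below is hypothetical: the thresholds, the `SegmentStats` fields, and the rule of merging only small segments are assumptions chosen for illustration, not the policy of any specific engine.

```python
from dataclasses import dataclass


@dataclass
class SegmentStats:
    name: str
    live: int          # vectors still visible to queries
    tombstoned: int    # soft-deleted vectors awaiting purge

    @property
    def tombstone_ratio(self):
        total = self.live + self.tombstoned
        return self.tombstoned / total if total else 0.0


def plan_compaction(segments, max_segments=8, max_tombstone_ratio=0.2, small=1000):
    """Return (segments to rebuild for tombstones, small segments to merge)."""
    purge = [s.name for s in segments if s.tombstone_ratio > max_tombstone_ratio]
    merge = []
    if len(segments) > max_segments:   # fan-out is too high: fold small ones together
        merge = [
            s.name
            for s in segments
            if s.live + s.tombstoned < small and s.name not in purge
        ]
    return purge, merge


segments = [
    SegmentStats("base", live=100_000, tombstoned=5_000),
    SegmentStats("s1", live=500, tombstoned=400),
    SegmentStats("s2", live=200, tombstoned=0),
    SegmentStats("s3", live=300, tombstoned=0),
]
purge, merge = plan_compaction(segments, max_segments=3)
```

The same two inputs double as alerting signals: a purge list that keeps growing between runs means compaction is falling behind.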

The Sparse and Multimodal Wrinkle

Freshness is not just a dense-vector concern. Agents over unstructured content usually run hybrid retrieval, and each index type has its own freshness story:

Lexical (BM25) indexes are inverted lists keyed by term. Adding a document updates posting lists and global statistics (document frequency, average document length). A common production failure is a lexical index that does not get rebuilt on a snapshot recovery, leaving a "lexical: true" retriever that silently returns zero documents because its posting lists were never restored. New content existing in the dense index but missing from the sparse index produces hybrid results that are subtly wrong.

Payload and filter indexes must be updated transactionally with the vector, or a filter like "ingested in the last hour" will exclude content that is technically present in the vector index but missing from the filter index.

Multimodal segments make this worse: a single ingested video produces transcript vectors, frame vectors, OCR spans, and object detections, often written to different index structures. Freshness for that item is the slowest of its constituent indexes. The agent does not perceive the video as searchable until every modality it might query has landed.

The takeaway for agent perception: define freshness per query shape, not per item. "The video is indexed" is meaningless if the agent's next question hits a modality that has not finished.
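Expressed as code, a per-query-shape check takes the set of indexes a query will touch and asks whether all of them have caught up. This is a sketch with assumed names: the modality keys and the `"indexed"` and `"pending"` status strings are placeholders for whatever a given pipeline actually reports.

```python
def is_searchable(query_modalities, index_status):
    """True only if every index the query touches has finished for this item."""
    return all(index_status.get(m) == "indexed" for m in query_modalities)


def lagging_modalities(query_modalities, index_status):
    """The sub-indexes that still gate freshness for this query shape."""
    return sorted(m for m in query_modalities if index_status.get(m) != "indexed")


# Hypothetical per-modality status for one freshly ingested video.
status = {
    "transcript": "indexed",
    "lexical": "indexed",
    "frames": "indexed",
    "ocr": "pending",
}

transcript_query_ready = is_searchable({"transcript", "lexical"}, status)
ocr_query_ready = is_searchable({"transcript", "ocr"}, status)
blockers = lagging_modalities({"transcript", "ocr"}, status)
```

The same object is ready for one question and not for another, which is why a single per-item "indexed" flag is too coarse for an agent.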

Freshness Strategies and Their Tradeoffs

Strategy	Visibility lag	Recall on new data	Cost driver
Full periodic rebuild	Minutes to hours	Excellent after rebuild	Wasted recompute, high write amplification
Writable memtable + sealed segments	Seconds	Good, slightly lower on newest	Query fan-out, compaction CPU
HNSW incremental insert	Sub-second	Good, drifts under distribution shift	Graph degradation, periodic re-optimize
In-memory delta + streaming merge	Sub-second	Good	Background merge IO, memory for delta
Tombstone-only deletes	N/A (deletes)	Degrades with churn	Over-retrieval, stale memory

There is no universally correct row. A batch analytics corpus that updates nightly should prefer a periodic rebuild for maximum recall and minimum operational surface. An agent writing to its own memory in a tight loop needs sub-second visibility and must accept compaction overhead and slightly noisier recall on the freshest vectors.

How This Applies to Mixpeek

When an agent ingests content through Mixpeek, the object flows through extraction (embeddings, transcripts, OCR, detections) and into the underlying vector store (MVS). The freshness contract is what determines whether a retriever can immediately find the new object across every modality it was decomposed into.

from mixpeek import Mixpeek

client = Mixpeek(api_key="YOUR_KEY")

# 1. An agent ingests a new clip. Extraction produces multiple
#    feature types (visual, transcript, OCR) into the collection.
obj = client.ingest(
    namespace="agent-memory",
    bucket_id="session-clips",
    blobs=[{"type": "video", "url": "s3://bucket/clip-2719.mp4"}],
)

# 2. Before querying, confirm the object reached an indexed state.
#    Treat freshness as a query-shape property, not a single boolean:
#    poll status rather than assuming the next read will see it.
status = client.objects.get(namespace="agent-memory", object_id=obj["object_id"])
# status reflects extraction + indexing progress per feature type

# 3. Once indexed, the retriever sees the new content alongside the
#    rest of the corpus. The fan-out across writable and base segments
#    is handled by the store, not the agent.
results = client.retrievers.execute(
    namespace="agent-memory",
    retriever_id="hybrid_search",
    inputs={"query": "what did the CFO say about Q3 guidance"},
    filters={"AND": [
        {"field": "created_at", "operator": "gte", "value": "2026-06-19T00:00:00Z"}
    ]},
)

The agent-relevant design rules that fall out of everything above:

Do not assume read-after-write. An ingest call returning success means accepted, not searchable. Poll object status or design the agent loop to tolerate brief visibility lag, especially for multimodal items where the slowest modality gates freshness.

Watch the tombstone and segment health, not just logical counts. A collection with heavy churn can be slow and memory-hungry even when its logical size is small. Compaction lag is the usual culprit.

Keep dense, sparse, and payload indexes in sync. Hybrid retrievers are only as fresh as their least-fresh sub-index. A restored snapshot that skips the lexical rebuild will return confidently wrong hybrid results.
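The first rule can be turned into a small helper that an agent loop calls between ingest and query. The following is a generic sketch, not Mixpeek's API: `get_status` is a caller-supplied function assumed to return a mapping from modality to a status string, and the `"indexed"` value, the timeout, and the backoff cap are all illustrative assumptions.

```python
import time


def wait_until_searchable(get_status, required, timeout_s=30.0,
                          initial_delay_s=0.1, clock=time.monotonic,
                          sleep=time.sleep):
    """Poll with exponential backoff until every required modality is indexed.

    Returns True once all are indexed, or False if the timeout would be exceeded.
    """
    deadline = clock() + timeout_s
    delay = initial_delay_s
    while True:
        status = get_status()
        if all(status.get(m) == "indexed" for m in required):
            return True
        if clock() + delay > deadline:
            return False               # caller decides: retry, degrade, or report
        sleep(delay)
        delay = min(delay * 2, 2.0)    # back off, capped at 2 seconds


# Simulated status source: the transcript lands on the third poll.
responses = iter([
    {"transcript": "pending"},
    {"transcript": "pending"},
    {"transcript": "indexed"},
])
became_visible = wait_until_searchable(
    lambda: next(responses), required={"transcript"}, sleep=lambda s: None
)
timed_out = wait_until_searchable(
    lambda: {"transcript": "pending"}, required={"transcript"}, timeout_s=0.0
)
```

Returning False instead of raising leaves the decision with the agent, which can then state that its evidence is incomplete and avoid reasoning forward from an empty result set.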

Key Takeaways

1. Freshness is whether an agent can retrieve what it just ingested, and it is the metric that most directly governs whether the agent reasons over complete evidence.

2. The standard solution is an LSM-style segment architecture: a tiny writable segment for instant visibility, sealed segments for reads, and large base segments rebuilt by background compaction.

3. HNSW makes inserts cheap but suffers entry-point drift and graph degradation under churn, so periodic re-optimization still matters.

4. Deletes use tombstones for \(O(1)\) soft removal, at the cost of over-retrieval and stale memory until compaction physically purges them.

5. Compaction and streaming merge are the load-bearing background work. If they fall behind, freshness, recall, latency, and memory all degrade together.

6. Freshness is per query shape, not per item. A multimodal object is only searchable once every modality the agent might query has been indexed, and hybrid retrieval is only as fresh as its least-fresh sub-index.

