
    The Semantic Join

Better search isn't about better embeddings; it's about semantic joins between extracted content and business taxonomies.


    Last month, a founder showed me their product video search system, powered by Vertex embeddings.

I asked it to find "videos where someone unboxes the blue variant." It returned every video containing the color blue, plus every video with any unboxing.

This wasn't actually a bug; it was the system doing exactly what it was designed to do.

    The embedding trap

    The current playbook is: ingest content, preprocess it, generate features, embed it into a vector database, and then do semantic search.

This architecture shines with vendors like Google Vertex and TwelveLabs for simple queries like "Find cooking videos," right up until users start hitting the index's upper bound.

Searching for content like "the part where she adds the garlic to the cast iron" or "frames showing the warning label" simply won't work. The indexes weren't designed for these types of queries.
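To make the trap concrete, here's a toy version of that playbook: one embedding per asset, one flat index, one similarity score. Everything in it is invented for illustration (the embed() stand-in, the file names, the descriptions); the shape of the architecture is the point, not the model.

```python
from math import sqrt

def embed(text: str, dim: int = 8) -> list[float]:
    # Toy stand-in for a real multimodal embedding model: hashes characters
    # into a fixed-size vector and normalizes it.
    vec = [0.0] * dim
    for i, ch in enumerate(text.lower()):
        vec[i % dim] += ord(ch)
    norm = sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already unit-length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

# One vector per whole asset, all in a single flat index.
index = {
    "unboxing_review.mp4": embed("creator unboxes the blue variant of the sneaker"),
    "blue_broll.mp4":      embed("lifestyle b-roll with lots of blue tones"),
    "cooking_demo.mp4":    embed("chef adds garlic to a cast iron pan"),
}

query = embed("videos where someone unboxes the blue variant")
for name, score in sorted(
    ((n, cosine(query, v)) for n, v in index.items()),
    key=lambda kv: kv[1], reverse=True,
):
    print(f"{name:<22} {score:.3f}")
```

With a single vector per video there is nowhere to store "unboxing" and "blue variant" as separately queryable facts, so a compound query can only land near whatever dominates the overall embedding.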

    The access pattern mismatch

    Every index implies an access pattern. A transcript index supports text queries. A scene embedding index supports visual similarity. An OCR index supports finding on-screen text. Each is optimized for specific retrieval operations.

    A single embedding per asset assumes a single access pattern. But real queries are heterogeneous:

    • "Videos mentioning competitor pricing" → transcript access pattern
    • "Frames with our logo visible" → object detection access pattern
    • "Moments someone demonstrates the product" → action recognition access pattern

    One index can't serve all of these well. When users hit the quality ceiling, they're not discovering a bug, they're discovering the boundary of what that index can retrieve.
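A quick way to see the mismatch is to write the routing down. The heuristics below are deliberately naive and the collection names are hypothetical; what matters is that the three example queries land in three different indexes.

```python
def route(query: str) -> str:
    # Naive intent routing, purely for illustration.
    q = query.lower()
    if "mention" in q or "pricing" in q or "says" in q:
        return "transcripts"          # text access pattern
    if "logo" in q or "frame" in q or "visible" in q:
        return "detected_objects"     # object-detection access pattern
    if "demonstrates" in q or "unboxes" in q:
        return "action_segments"      # action-recognition access pattern
    return "scene_embeddings"         # fall back to visual similarity

for q in [
    "Videos mentioning competitor pricing",
    "Frames with our logo visible",
    "Moments someone demonstrates the product",
]:
    print(f"{q!r} -> {route(q)}")
```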

    The fix is building indexes that match how people actually search.

    Decomposition: multi-tier feature extraction

    If a single index can't serve different access patterns, build multiple indexes from the same source, each for a different access pattern.

Video decomposes into: transcribed utterances (timestamped), detected faces (with frame locations), scene embeddings (per visual segment), extracted text overlays (OCR), and detected objects (products, logos, etc.).

    Documents decompose into: page images, extracted text blocks (with bounding boxes), tables (into structured data), figures (with captions), headers and sections (preserving hierarchy).

    Audio decomposes into: transcribed segments, speaker embeddings, music detection, sound event classification, silence/speech boundaries.

    Each extractor produces documents in its own index (we call these collections). A single source video might generate documents across six collections—all linked back to the source through lineage, all independently queryable.

    This isn't "feature extraction" in the ML sense. It's creating parallel searchable representations at different granularities.
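Here's a minimal sketch of what decomposition produces, with stubbed extractor output and invented field names (this isn't Mixpeek's actual data model): each extractor writes documents into its own collection, and every document carries lineage back to its source asset.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    collection: str   # which index this document lives in
    source_id: str    # lineage: the asset it was extracted from
    payload: dict     # extractor-specific fields (text, bbox, vector, ...)

@dataclass
class CollectionStore:
    collections: dict = field(default_factory=dict)

    def add(self, doc: Document):
        self.collections.setdefault(doc.collection, []).append(doc)

    def by_source(self, source_id: str):
        # Lineage lookup: every document extracted from one asset.
        return [d for docs in self.collections.values()
                for d in docs if d.source_id == source_id]

def decompose_video(source_id: str, store: CollectionStore):
    # Stubbed extractor outputs; a real system would run ASR, face
    # detection, scene embedding, OCR, and object detection here.
    store.add(Document("transcripts",      source_id, {"t": 12.4, "text": "let's unbox the blue one"}))
    store.add(Document("detected_faces",   source_id, {"frame": 310, "face_vec": [0.1, 0.7]}))
    store.add(Document("scene_embeddings", source_id, {"start": 10.0, "end": 18.0, "vec": [0.3, 0.2]}))
    store.add(Document("ocr_text",         source_id, {"frame": 420, "text": "WARNING: small parts"}))
    store.add(Document("detected_objects", source_id, {"frame": 310, "label": "sneaker-blue"}))

store = CollectionStore()
decompose_video("video_001", store)
print(sorted(store.collections))                      # independently queryable collections
print(len(store.by_source("video_001")), "documents linked by lineage")
```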

Introducing: feature extractors.

    Recomposition: multi-stage retrieval

These extracted collections are only useful if you can query across them intelligently. Recomposing the source content means chaining retrieval stages into sequential pipelines.

A stage can be anything from a simple KNN search to a model-driven filter or an external lookup. Here are some examples:

    • KNN search: vector similarity against any collection (what most people think of when they think of vector search)
    • LLM filter: natural language conditions evaluated by a model
    • Web lookup: enrichment from external sources

    Each stage can also be thought of as an agent tool, performing a specific task on the working set and passing the results to the next stage.

    "Find clips where someone mentions pricing" becomes: KNN search on transcript collection → filter to segments containing dollar amounts → enrich with speaker face from face collection → return with timestamps.

    "Find product shots with negative sentiment in comments" becomes: KNN search on visual collection for product matches → join to UGC metadata collection → LLM filter for negative sentiment → aggregate by product SKU.

    The pipeline is a configurable way to chain together steps for each access pattern.

Introducing: retrieval pipelines.

    Taxonomies: joining extraction to business context

    Decomposition creates searchable collections. Recomposition lets you query across them. But there's a third piece: connecting what you've extracted to what the business already knows.

    There are two types of collections in this architecture. Extracted collections are what decomposition produces—faces, objects, transcripts, scenes—derived from your content, with no inherent business meaning. Reference collections are what you already have—employee directories, product catalogs, brand registries—curated data that represents how your business thinks.

    Enterprises don't need to invent taxonomies—they already have them. Product hierarchies in the PIM. Org charts in the HRIS. Brand guidelines in the DAM. These represent decades of accumulated knowledge about how the business operates.

    The problem: unstructured content exists outside these systems. Marketing has a product taxonomy. They also have 50,000 videos. The two don't connect.

    Taxonomies bridge this gap. They define retrieval pipelines that join extracted collections to reference collections via similarity matching—enriching unknown content with known business context.

    A detected face gets joined to your employee directory, enriched with name, department, and level. A detected product gets joined to your PIM, enriched with SKU, category, and hierarchy position. A detected logo gets joined to your brand registry, enriched with brand owner and usage rights.
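In code, a taxonomy join is small. The sketch below uses an invented employee-directory fragment and a plain cosine match; a real system would use an ANN index and calibrated thresholds, but the join-then-enrich shape is the same.

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a)) or 1.0
    nb = sqrt(sum(y * y for y in b)) or 1.0
    return dot / (na * nb)

# Reference collection: curated data the business already has (HRIS export).
employee_directory = [
    {"name": "A. Rivera", "department": "Engineering", "level": "Director", "face_vec": [0.9, 0.1, 0.2]},
    {"name": "B. Chen",   "department": "Marketing",   "level": "Manager",  "face_vec": [0.1, 0.8, 0.3]},
]

# Extracted collection: produced by decomposition, no business meaning yet.
detected_face = {"video": "training_04.mp4", "frame": 1820, "face_vec": [0.88, 0.15, 0.22]}

def taxonomy_join(doc, reference, vec_field="face_vec", threshold=0.9):
    # Find the closest reference entry and, if it clears the threshold,
    # enrich the extracted document with the reference fields.
    best = max(reference, key=lambda r: cosine(doc[vec_field], r[vec_field]))
    if cosine(doc[vec_field], best[vec_field]) < threshold:
        return doc  # no confident match; leave the document unenriched
    enrichment = {k: v for k, v in best.items() if k != vec_field}
    return {**doc, **enrichment}

print(taxonomy_join(detected_face, employee_directory))
# -> the detected face, enriched with name, department, and level
```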

    The hierarchy isn't something we impose. It's your hierarchy—from your PIM, your HRIS, your DAM. We're just making it queryable over unstructured content.

Introducing: multimodal taxonomies.

    What this changes

    "Find all training videos featuring the engineering leadership team."

    Without taxonomies: manually tag every video with who appears, keep tags synchronized with org changes.

    With taxonomies: faces decomposed into a collection, recomposition pipeline joins to employee directory via face similarity, enriched with department and level. When someone gets promoted to leadership, queries immediately reflect it—the join is live.

    "Show me all product content for Spring 2024 Athletic Footwear."

    Without taxonomies: semantic search for "athletic footwear" returns everything vaguely athletic.

    With taxonomies: detected products joined to your PIM via visual similarity. Query filters by your existing category structure. You get exactly the content featuring SKUs that roll up to Spring 2024 Athletic Footwear—as defined by your merchandising team, not by embedding similarity.
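Once the join has attached SKUs, the "Spring 2024 Athletic Footwear" filter is just a walk up your category tree. The PIM fragment below is invented; the roll-up logic is the point.

```python
# Tiny invented slice of a PIM hierarchy: child category -> parent category.
category_parent = {
    "Trail Runners":                 "Spring 2024 Athletic Footwear",
    "Court Shoes":                   "Spring 2024 Athletic Footwear",
    "Spring 2024 Athletic Footwear": "Footwear",
    "Sandals":                       "Summer 2024 Casual Footwear",
}

sku_category = {"SKU-8812": "Trail Runners", "SKU-1034": "Sandals"}

def rolls_up_to(category, ancestor):
    # Walk up the tree until we hit the ancestor or run out of parents.
    while category is not None:
        if category == ancestor:
            return True
        category = category_parent.get(category)
    return False

# Detected products already joined to SKUs by the taxonomy step above.
detected = [
    {"video": "campaign_a.mp4", "sku": "SKU-8812"},
    {"video": "campaign_b.mp4", "sku": "SKU-1034"},
]

hits = [d for d in detected
        if rolls_up_to(sku_category[d["sku"]], "Spring 2024 Athletic Footwear")]
print(hits)   # only campaign_a.mp4: its SKU rolls up to the target node
```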

Legal teams query using compliance frameworks. Marketing teams query using brand hierarchies. Product teams query using PIM structure.

    Everyone searches unstructured content in the language they already speak.

    The shift

The index is your bottleneck. Fix it with three steps:

    Decompose content into collections—each optimized for a specific access pattern. Transcripts, faces, objects, scenes, text: independently queryable, linked by lineage.

    Recompose with multi-stage retrieval—chain search, filter, enrich, and transform stages into pipelines that match how people actually need to search.

    Connect extracted collections to reference collections via taxonomies, bridging what AI extracts to what the business already knows.

The founder I mentioned? With this architecture, "videos where someone unboxes the blue variant" becomes: search product detection collection for blue variant SKU → join to video segments → filter to unboxing action classification → return with timestamps.
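Expressed as a declarative pipeline it might look like this; the stage names, collection names, and SKU are illustrative, not a real configuration schema.

```python
# Hypothetical pipeline definition for the blue-variant unboxing query.
blue_variant_unboxing = [
    {"stage": "search",
     "collection": "detected_objects",
     "filter": {"sku": "SNKR-BLUE-001"}},        # blue variant, via the PIM join
    {"stage": "join",
     "collection": "scene_segments",
     "on": ["video_id", "frame_range"]},         # attach the surrounding segment
    {"stage": "filter",
     "collection": "action_segments",
     "where": {"action": "unboxing"}},           # keep only unboxing moments
    {"stage": "project",
     "fields": ["video_id", "start", "end"]},    # return with timestamps
]

for step in blue_variant_unboxing:
    print(step["stage"], "->", {k: v for k, v in step.items() if k != "stage"})
```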

    Same query. Completely different result. Not because the embeddings got smarter, but because the indexes match the access pattern.