
    The 3072-Dimension Problem

    A 3072-dimensional embedding encodes everything about a video and distinguishes nothing. Decomposing content into named, measurable features, then placing them in a queryable hierarchy, is how multimodal search actually works at scale.


    You embedded your video library with Gemini. Or CLIP, or SigLIP, or whatever the model of the month was when the project started. You stored a few million vectors in Pinecone or Qdrant or Weaviate. You wired up cosine similarity. You ran your first query.

    The results were fine. Not great. Fine.

    A search for "person holding a coffee cup" returned videos with people, and videos with cups, and a surprising number of videos with neither. Reranking helped a little. Hybrid BM25 helped a little more. But somewhere around the fourth or fifth round of tuning, you started to suspect that the problem wasn't your retriever, or your reranker, or your prompt. The problem was that a 3072-dimensional embedding is not a useful unit of work.

    It encodes everything. The lighting, the camera angle, the dominant color, the demographic of the person on screen, the room they're in, the vibe of the room they're in. All of it, smeared across three thousand floats. Cosine similarity treats every dimension equally. Your application does not.

    Most teams hit that wall about six weeks into a multimodal project. A better vector database won't get you past it.


    The reframe

    A data science lead we work with (thirty years in the industry, ran one of the first audio fingerprinting startups in the early 2000s) described our job to me recently in a way I haven't stopped thinking about:

    Reduce the number of dimensions in a video to a handful of measurable features, then place every new video into a hierarchical structure defined by those features.

    He called the output a "compact fingerprint."

    When I shared that framing with an engineer at AWS, he pushed back immediately. Every kind of representation learning does that, he said. Find small, interpretable representations of complex things. He was right. The idea is old. What's rare is running it as managed infrastructure, continuously, across millions of objects, with the reduced features exposed as first-class queryable surfaces.

    The job of multimodal infrastructure is not to be a faster vector database. It's the systems layer that takes you from "I have embeddings" to "I have a usable application." That layer has a specific shape, and most of the work is in the shape.


    How decomposition works in practice

    The shape, in our system, is six primitives that compose in one direction.

    [Diagram: Bucket (raw objects land) → Extractor (decompose features) → Collection (queryable surfaces) → Retriever (compose pipelines) → Taxonomy (hierarchical placement) → Cluster (emergent structure, feeds back). The first three cover reduction & addressing; the last three cover composition & structure.]

    Buckets are where raw objects land. Videos, images, documents, audio. A bucket has a schema and an optional sync connection (S3, GCS, Drive) and its only job is to be the entry point. Nothing has happened to the content yet.
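
    As a rough sketch of what that entry point amounts to (the class and field names below are illustrative assumptions, not the actual API), a bucket is little more than a schema plus a pointer at a source:

    ```python
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Bucket:
        # A bucket is only an entry point: a schema plus an optional sync source.
        # No extraction has touched the content at this stage.
        name: str
        schema: dict
        sync_source: Optional[str] = None  # e.g. an S3 / GCS / Drive location

    # Hypothetical values -- the bucket just records where raw objects land.
    ads = Bucket(
        name="ad_creative",
        schema={"video": "file", "title": "str", "campaign_id": "str"},
        sync_source="s3://ad-creative-raw/",
    )
    ```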

    Feature extractors are the decomposition step, where the dimension reduction actually lives. A shot detector turns a thirty-second video into eight scenes with timestamps. A face identity model turns a frame into bounding boxes and 512-dim face embeddings. A multimodal extractor turns a scene into a Gemini embedding plus OCR output plus dominant colors plus whatever else you configured. Custom plugins (Python code, optionally with model weights) let you do the same for proprietary file types or domain-specific features.
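
    A minimal sketch of a custom extractor plugin, assuming a simple extract() contract (the interface and field names are assumptions, not the real plugin API). The point is the shape of the step: one raw object in, many small feature records out.

    ```python
    import numpy as np

    class SceneExtractor:
        """Hypothetical extractor plugin: decomposes one video into scene-level records."""

        def extract(self, video_path: str) -> list[dict]:
            records = []
            for start, end in self._detect_shots(video_path):
                records.append({
                    "source": video_path,
                    "start_s": start,
                    "end_s": end,
                    # Multimodal embedding for just this scene (stubbed here).
                    "embedding": np.zeros(3072, dtype=np.float32),
                    "dominant_colors": ["#1a1a1a", "#e0c040"],  # placeholder values
                    "ocr_text": "",
                })
            return records

        def _detect_shots(self, video_path: str) -> list[tuple[float, float]]:
            # Stand-in for a real shot detector; returns (start, end) pairs in seconds.
            return [(0.0, 3.8), (3.8, 11.2), (11.2, 30.0)]
    ```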

    Collections are where the reduced features live. Each collection is the output of one extractor, which means each has its own schema, its own embedding space, its own indices. A single bucket fans out into many collections: scenes, faces, OCR text, objects, audio segments. Instead of one giant vector per video, you get many small structured records, each measurable along a dimension you chose on purpose. That's the move that makes everything else work.
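
    Concretely, the fan-out looks something like this (illustrative record shapes, not actual schemas): one video contributes small rows to several collections, each with its own fields and its own embedding space.

    ```python
    # One raw video from the bucket, decomposed into three collections.
    collections = {
        "scenes": [
            {"video_id": "ad_0417", "start_s": 0.0, "end_s": 3.8, "embedding_dim": 3072},
        ],
        "faces": [
            {"video_id": "ad_0417", "frame_s": 1.2, "bbox": [412, 88, 590, 300], "embedding_dim": 512},
        ],
        "ocr": [
            {"video_id": "ad_0417", "start_s": 11.2, "text": "Order now", "lang": "en"},
        ],
    }
    ```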

    Retrievers compose collections. A retriever is a multi-stage pipeline: feature search on collection A, attribute filter on collection B, LLM filter on the merged set, reciprocal rank fusion at the end. Stages pass documents forward in a working set. If you've written a MongoDB aggregation pipeline, the mental model transfers directly. Retrievers are pipelines because no real application wants a single similarity score. It wants similarity plus a metadata filter plus a rerank plus a join.
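
    Declared as data, such a pipeline might look like the sketch below, in the spirit of a MongoDB aggregation pipeline. The stage names mirror the ones just described; the syntax is illustrative, not the actual API.

    ```python
    retriever = [
        # Stage 1: similarity search over the scenes collection.
        {"stage": "feature_search", "collection": "scenes",
         "query": "person holding a coffee cup", "limit": 200},
        # Stage 2: structured filter against a second collection's attributes.
        {"stage": "attribute_filter", "collection": "ads",
         "filter": {"campaign_quarter": "Q3"}},
        # Stage 3: LLM filter over the merged working set.
        {"stage": "llm_filter",
         "prompt": "Keep only results where the cup is clearly visible."},
        # Stage 4: merge ranked lists from the earlier stages.
        {"stage": "rank_fusion", "method": "reciprocal_rank", "limit": 20},
    ]
    ```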

    Taxonomies are the hierarchy. A taxonomy is a semantic join between two collections, where the join operation is itself a retriever pipeline. The canonical example: collection A is faces extracted from a casting database (with names attached), collection B is faces extracted from new ad creative. The taxonomy says for every face in B, find the nearest face in A above some threshold, and enrich B with the name. Run that at ingest time and every new ad arrives pre-labeled with its talent.
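
    Stripped down to plain numpy, the join in that canonical example behaves roughly like the sketch below (a simplification: in the real system the join operation is a full retriever pipeline, not one cosine threshold).

    ```python
    import numpy as np

    def enrich_with_names(new_faces: list[dict], casting_faces: list[dict],
                          threshold: float = 0.8) -> list[dict]:
        """For each face in the new creative, find the nearest casting-database face
        above a cosine-similarity threshold and copy its name over."""
        cast_vecs = np.stack([f["embedding"] for f in casting_faces]).astype(np.float32)
        cast_vecs /= np.linalg.norm(cast_vecs, axis=1, keepdims=True)
        for face in new_faces:
            v = face["embedding"] / np.linalg.norm(face["embedding"])
            sims = cast_vecs @ v
            best = int(np.argmax(sims))
            face["name"] = casting_faces[best]["name"] if sims[best] >= threshold else None
        return new_faces
    ```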

    Clusters are the dual of taxonomies. Taxonomies impose structure top-down (you defined the casting database). Clusters discover structure bottom-up. Run HDBSCAN on the embeddings in a collection, send the centroids to an LLM, and you get labeled groups you didn't have to specify in advance. The output is, of course, another collection, which can feed another retriever, which can populate another taxonomy.
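
    A compact sketch of that loop using the hdbscan package, with the LLM labeling step left as a placeholder:

    ```python
    import numpy as np
    import hdbscan  # pip install hdbscan

    def discover_clusters(embeddings: np.ndarray, records: list[dict]) -> dict[int, dict]:
        """Cluster a collection's embeddings, then gather members and a centroid per
        cluster; in the real loop the centroid and members go to an LLM for a label."""
        labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(embeddings)
        clusters = {}
        for cluster_id in set(labels):
            if cluster_id == -1:  # HDBSCAN marks noise points as -1
                continue
            members = [r for r, l in zip(records, labels) if l == cluster_id]
            clusters[cluster_id] = {
                "centroid": embeddings[labels == cluster_id].mean(axis=0),
                "members": members,
                "label": None,  # to be filled in by the LLM labeling step
            }
        return clusters
    ```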

    Six primitives. They compose in one direction: raw object, decomposed feature, queryable surface, composed pipeline, hierarchical placement, emergent structure. The operation is reduce to measurable features and make them addressable.


    Why hierarchy matters

    Reduction alone is not enough. If you stop after the extractor stage, you have a smarter vector database. Better features, sure, but still a system where every query starts from scratch.

    The leverage is in the hierarchy. Once you have a taxonomy of brands, products, talent, scenes, moods (or whatever your domain calls for), every new piece of content gets placed against it at ingest time. The compact fingerprint is computed once. Its location in the hierarchy is computed once. After that, retrieval is mostly traversal.

    That collapses a distinction most teams treat as fundamental: enrichment versus search.

    • Flat embedding world: enrichment is a separate batch job. Run a labeling pipeline, write labels back to the database, hope the labels stay fresh
    • Decomposed-and-placed world: enrichment is the same operation as search, run in reverse

    The taxonomy that locates a new ad in your brand hierarchy is the same retriever pipeline a user would invoke to ask "show me ads in this brand." One traversal, two directions.
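
    A toy sketch of that symmetry (hypothetical function names, with the join pipeline passed in as a callable): the same traversal either enriches a new ad at ingest or answers a user's query, depending on which end you start from.

    ```python
    def place(ad: dict, brand_taxonomy, join_pipeline) -> dict:
        # Enrichment direction: new ad in, brand label out.
        ad["brand"] = join_pipeline(query=ad["scene_embedding"], against=brand_taxonomy)
        return ad

    def search(brand: dict, ads_collection, join_pipeline) -> list[dict]:
        # Search direction: brand in, matching ads out.
        return join_pipeline(query=brand["embedding"], against=ads_collection)
    ```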

    Reducing dimensions is table stakes. Making the reduction a permanent, queryable, hierarchical structure is the part that compounds.


    What this unlocks

    Three things fall out of this architecture that are hard to build any other way.

    Agentic retrieval

    When your features are decomposed into named collections with named extractors, you can hand them to an LLM as tools. The LLM doesn't get a single search endpoint. It gets a feature search tool, an attribute filter tool, an LLM filter tool, each with explicit input and output shapes. The agent composes retriever stages dynamically based on the task.

    "Find ads featuring this actor that performed well in Q3 and use a similar color palette to this reference image"

    That becomes a four-stage pipeline the agent assembles on its own. You can't do that against a flat vector store because there's nothing to compose.
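
    In practice that means exposing the stages as tool schemas rather than one opaque search endpoint. The definitions below follow a generic function-calling shape; the tool names and parameters are assumptions for illustration.

    ```python
    tools = [
        {"name": "feature_search",
         "description": "Similarity search over one collection's embedding space.",
         "parameters": {"collection": "str", "query": "str", "limit": "int"}},
        {"name": "attribute_filter",
         "description": "Filter the working set on structured metadata.",
         "parameters": {"collection": "str", "filter": "dict"}},
        {"name": "llm_filter",
         "description": "Keep only documents that satisfy a natural-language predicate.",
         "parameters": {"prompt": "str"}},
        {"name": "rank_fusion",
         "description": "Merge ranked lists from earlier stages (e.g. reciprocal rank fusion).",
         "parameters": {"method": "str", "limit": "int"}},
    ]
    ```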

    Cross-collection joins

    The casting-database example is the simple form. The general form: any two collections with comparable feature spaces can be joined via a taxonomy, so you can enrich any feature with any other feature.

    • Faces with names
    • Scenes with brands
    • Audio with transcripts
    • Products with categories

    The join is a retriever, the retriever is reusable, and the enriched output is itself a collection that feeds the next join.
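
    Written out as data, the general form might look like this (illustrative syntax only): each row is a join, each join is backed by a reusable retriever, and each enriched output lands in a new collection.

    ```python
    taxonomy_joins = [
        {"left": "faces",    "right": "casting_db",   "on": "face_embedding",    "enrich_with": "name"},
        {"left": "scenes",   "right": "brand_assets", "on": "scene_embedding",   "enrich_with": "brand"},
        {"left": "audio",    "right": "transcripts",  "on": "segment_id",        "enrich_with": "text"},
        {"left": "products", "right": "catalog",      "on": "product_embedding", "enrich_with": "category"},
    ]
    ```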

    Clustering that closes the loop

    Run a clustering job on a collection, label the centroids with an LLM, and the labels become a taxonomy you didn't have to design. Apply that taxonomy to incoming content and every new object gets placed in a category that emerged from your data. The system bootstraps its own hierarchy.

    The flywheel:

    1. More content produces better clusters
    2. Better clusters produce better taxonomies
    3. Better taxonomies produce better placement of the next batch
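
    One turn of that flywheel, sketched with hypothetical callables standing in for the clustering, labeling, and placement jobs described above:

    ```python
    def flywheel_turn(current_batch, next_batch, cluster_fn, label_fn, place_fn):
        clusters = cluster_fn(current_batch)                           # 1. content -> clusters
        taxonomy = {cid: label_fn(c) for cid, c in clusters.items()}   # 2. clusters -> taxonomy
        return [place_fn(obj, taxonomy) for obj in next_batch]         # 3. taxonomy -> placement
    ```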

    The conceptual frame

    The point of decomposition isn't to do anything magical. It takes the operation that representation learning has always done (reduce complex things to small interpretable representations) and makes it a piece of infrastructure instead of a piece of research.

    • Extractors instead of one-off model training
    • Collections instead of pickled embeddings
    • Retrievers instead of bespoke search code
    • Taxonomies instead of manual labeling
    • Clusters instead of EDA notebooks

    A 3072-dimensional embedding is a starting point. The interesting work is in what you do after the embedding. If your stack stops there, you'll spend the next eighteen months reinventing the rest in application code. We know because we've watched it happen.

    If any of that resonates, the docs walk through the primitives in the order they compose, with code.