
    The 3072-Dimension Problem

    A 3072-dimensional embedding encodes everything about a video and distinguishes nothing. Decomposing content into named, measurable features, then placing them in a queryable hierarchy, is how multimodal search actually works at scale.


    You embedded your video library with Gemini. Or CLIP, or SigLIP, or whatever the model of the month was when the project started. You stored a few million vectors in Pinecone or Qdrant or Weaviate. You wired up cosine similarity. You ran your first query.

    The results were fine. Not great. Fine.

    A search for "person holding a coffee cup" returned videos with people, and videos with cups, and a surprising number of videos with neither. Reranking helped a little. Hybrid BM25 helped a little more. But somewhere around the fourth or fifth round of tuning, you started to suspect that the problem wasn't your retriever, or your reranker, or your prompt. The problem was that a 3072-dimensional embedding is not a useful unit of work.

    It encodes everything. The lighting, the camera angle, the dominant color, the demographic of the person on screen, the room they're in, the vibe of the room they're in. All of it, smeared across three thousand floats. Cosine similarity treats every dimension equally. Your application does not.

    Most teams hit that wall about six weeks into a multimodal project. A better vector database won't get you past it.


    The reframe

    A data science lead we work with (thirty years in the industry, ran one of the first audio fingerprinting startups in the early 2000s) described our job to me recently in a way I haven't stopped thinking about:

    Reduce the number of dimensions in a video to a handful of measurable features, then place every new video into a hierarchical structure defined by those features.

    He called the output a "compact fingerprint."

    When I shared that framing with an engineer at AWS, he pushed back immediately. Every kind of representation learning does that, he said. Find small, interpretable representations of complex things. He was right. The idea is old. What's rare is running it as managed infrastructure, continuously, across millions of objects, with the reduced features exposed as first-class queryable surfaces.

    The job of multimodal infrastructure is not to be a faster vector database. It's the systems layer that takes you from "I have embeddings" to "I have a usable application." That layer has a specific shape, and most of the work is in the shape.


    How decomposition works in practice

    The shape, in our system, is six primitives that compose in one direction.

    [Diagram: Bucket (raw objects land) → Extractor (decompose features) → Collection (queryable surfaces) → Retriever (compose pipelines) → Taxonomy (hierarchical placement) → Cluster (emergent structure, feeds back). The first three cover reduction & addressing; the last three cover composition & structure.]

    Buckets are where raw objects land. Videos, images, documents, audio. A bucket has a schema and an optional sync connection (S3, GCS, Drive) and its only job is to be the entry point. Nothing has happened to the content yet.
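
    As a rough sketch of what that entry point amounts to (the class and field names below are illustrative assumptions, not the actual API), a bucket is little more than a schema plus a pointer at a source:

    ```python
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Bucket:
        # A bucket is only an entry point: a schema plus an optional sync source.
        # No extraction has touched the content at this stage.
        name: str
        schema: dict
        sync_source: Optional[str] = None  # e.g. an S3 / GCS / Drive location

    # Hypothetical values -- the bucket just records where raw objects land.
    ads = Bucket(
        name="ad_creative",
        schema={"video": "file", "title": "str", "campaign_id": "str"},
        sync_source="s3://ad-creative-raw/",
    )
    ```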

    Feature extractors are the decomposition step, where the dimension reduction actually lives. A shot detector turns a thirty-second video into eight scenes with timestamps. A face identity model turns a frame into bounding boxes and 512-dim face embeddings. A multimodal extractor turns a scene into a Gemini embedding plus OCR output plus dominant colors plus whatever else you configured. Custom plugins (Python code, optionally with model weights) let you do the same for proprietary file types or domain-specific features.
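
    A minimal sketch of a custom extractor plugin, assuming a simple extract() contract (the interface and field names are assumptions, not the real plugin API). The point is the shape of the step: one raw object in, many small feature records out.

    ```python
    import numpy as np

    class SceneExtractor:
        """Hypothetical extractor plugin: decomposes one video into scene-level records."""

        def extract(self, video_path: str) -> list[dict]:
            records = []
            for start, end in self._detect_shots(video_path):
                records.append({
                    "source": video_path,
                    "start_s": start,
                    "end_s": end,
                    # Multimodal embedding for just this scene (stubbed here).
                    "embedding": np.zeros(3072, dtype=np.float32),
                    "dominant_colors": ["#1a1a1a", "#e0c040"],  # placeholder values
                    "ocr_text": "",
                })
            return records

        def _detect_shots(self, video_path: str) -> list[tuple[float, float]]:
            # Stand-in for a real shot detector; returns (start, end) pairs in seconds.
            return [(0.0, 3.8), (3.8, 11.2), (11.2, 30.0)]
    ```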

    Collections are where the reduced features live. Each collection is the output of one extractor, which means each has its own schema, its own embedding space, its own indices. A single bucket fans out into many collections: scenes, faces, OCR text, objects, audio segments. Instead of one giant vector per video, you get many small structured records, each measurable along a dimension you chose on purpose. That's the move that makes everything else work.
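
    Concretely, the fan-out looks something like this (illustrative record shapes, not actual schemas): one video contributes small rows to several collections, each with its own fields and its own embedding space.

    ```python
    # One raw video from the bucket, decomposed into three collections.
    collections = {
        "scenes": [
            {"video_id": "ad_0417", "start_s": 0.0, "end_s": 3.8, "embedding_dim": 3072},
        ],
        "faces": [
            {"video_id": "ad_0417", "frame_s": 1.2, "bbox": [412, 88, 590, 300], "embedding_dim": 512},
        ],
        "ocr": [
            {"video_id": "ad_0417", "start_s": 11.2, "text": "Order now", "lang": "en"},
        ],
    }
    ```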

    Retrievers compose collections. A retriever is a multi-stage pipeline: feature search on collection A, attribute filter on collection B, LLM filter on the merged set, reciprocal rank fusion at the end. Stages pass documents forward in a working set. If you've written a MongoDB aggregation pipeline, the mental model transfers directly. Retrievers are pipelines because no real application wants a single similarity score. It wants similarity plus a metadata filter plus a rerank plus a join.
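
    Declared as data, such a pipeline might look like the sketch below, in the spirit of a MongoDB aggregation pipeline. The stage names mirror the ones just described; the syntax is illustrative, not the actual API.

    ```python
    retriever = [
        # Stage 1: similarity search over the scenes collection.
        {"stage": "feature_search", "collection": "scenes",
         "query": "person holding a coffee cup", "limit": 200},
        # Stage 2: structured filter against a second collection's attributes.
        {"stage": "attribute_filter", "collection": "ads",
         "filter": {"campaign_quarter": "Q3"}},
        # Stage 3: LLM filter over the merged working set.
        {"stage": "llm_filter",
         "prompt": "Keep only results where the cup is clearly visible."},
        # Stage 4: merge ranked lists from the earlier stages.
        {"stage": "rank_fusion", "method": "reciprocal_rank", "limit": 20},
    ]
    ```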

    Taxonomies are the hierarchy. A taxonomy is a semantic join between two collections, where the join operation is itself a retriever pipeline. The canonical example: collection A is faces extracted from a casting database (with names attached), collection B is faces extracted from new ad creative. The taxonomy says for every face in B, find the nearest face in A above some threshold, and enrich B with the name. Run that at ingest time and every new ad arrives pre-labeled with its talent.
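
    Stripped down to plain numpy, the join in that canonical example behaves roughly like the sketch below (a simplification: in the real system the join operation is a full retriever pipeline, not one cosine threshold).

    ```python
    import numpy as np

    def enrich_with_names(new_faces: list[dict], casting_faces: list[dict],
                          threshold: float = 0.8) -> list[dict]:
        """For each face in the new creative, find the nearest casting-database face
        above a cosine-similarity threshold and copy its name over."""
        cast_vecs = np.stack([f["embedding"] for f in casting_faces]).astype(np.float32)
        cast_vecs /= np.linalg.norm(cast_vecs, axis=1, keepdims=True)
        for face in new_faces:
            v = face["embedding"] / np.linalg.norm(face["embedding"])
            sims = cast_vecs @ v
            best = int(np.argmax(sims))
            face["name"] = casting_faces[best]["name"] if sims[best] >= threshold else None
        return new_faces
    ```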

    Clusters are the dual of taxonomies. Taxonomies impose structure top-down (you defined the casting database). Clusters discover structure bottom-up. Run HDBSCAN on the embeddings in a collection, send the centroids to an LLM, and you get labeled groups you didn't have to specify in advance. The output is, of course, another collection, which can feed another retriever, which can populate another taxonomy.
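
    A compact sketch of that loop using the hdbscan package, with the LLM labeling step left as a placeholder:

    ```python
    import numpy as np
    import hdbscan  # pip install hdbscan

    def discover_clusters(embeddings: np.ndarray, records: list[dict]) -> dict[int, dict]:
        """Cluster a collection's embeddings, then gather members and a centroid per
        cluster; in the real loop the centroid and members go to an LLM for a label."""
        labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(embeddings)
        clusters = {}
        for cluster_id in set(labels):
            if cluster_id == -1:  # HDBSCAN marks noise points as -1
                continue
            members = [r for r, l in zip(records, labels) if l == cluster_id]
            clusters[cluster_id] = {
                "centroid": embeddings[labels == cluster_id].mean(axis=0),
                "members": members,
                "label": None,  # to be filled in by the LLM labeling step
            }
        return clusters
    ```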

    Six primitives. They compose in one direction: raw object, decomposed feature, queryable surface, composed pipeline, hierarchical placement, emergent structure. The operation is reduce to measurable features and make them addressable.


    Why hierarchy matters

    Reduction alone is not enough. If you stop after the extractor stage, you have a smarter vector database. Better features, sure, but still a system where every query starts from scratch.

    The leverage is in the hierarchy. Once you have a taxonomy of brands, products, talent, scenes, moods (or whatever your domain calls for), every new piece of content gets placed against it at ingest time. The compact fingerprint is computed once. Its location in the hierarchy is computed once. After that, retrieval is mostly traversal.

    That collapses a distinction most teams treat as fundamental: enrichment versus search.

    • Flat embedding world: enrichment is a separate batch job. Run a labeling pipeline, write labels back to the database, hope the labels stay fresh
    • Decomposed-and-placed world: enrichment is the same operation as search, run in reverse

    The taxonomy that locates a new ad in your brand hierarchy is the same retriever pipeline a user would invoke to ask "show me ads in this brand." One traversal, two directions.
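
    A toy sketch of that symmetry (hypothetical function names, with the join pipeline passed in as a callable): the same traversal either enriches a new ad at ingest or answers a user's query, depending on which end you start from.

    ```python
    def place(ad: dict, brand_taxonomy, join_pipeline) -> dict:
        # Enrichment direction: new ad in, brand label out.
        ad["brand"] = join_pipeline(query=ad["scene_embedding"], against=brand_taxonomy)
        return ad

    def search(brand: dict, ads_collection, join_pipeline) -> list[dict]:
        # Search direction: brand in, matching ads out.
        return join_pipeline(query=brand["embedding"], against=ads_collection)
    ```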

    Reducing dimensions is table stakes. Making the reduction a permanent, queryable, hierarchical structure is the part that compounds.


    What this unlocks

    Three things fall out of this architecture that are hard to build any other way.

    Agentic retrieval

    When your features are decomposed into named collections with named extractors, you can hand them to an LLM as tools. The LLM doesn't get a single search endpoint. It gets a feature search tool, an attribute filter tool, an LLM filter tool, each with explicit input and output shapes. The agent composes retriever stages dynamically based on the task.

    "Find ads featuring this actor that performed well in Q3 and use a similar color palette to this reference image"

    That becomes a four-stage pipeline the agent assembles on its own. You can't do that against a flat vector store because there's nothing to compose.
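
    In practice that means exposing the stages as tool schemas rather than one opaque search endpoint. The definitions below follow a generic function-calling shape; the tool names and parameters are assumptions for illustration.

    ```python
    tools = [
        {"name": "feature_search",
         "description": "Similarity search over one collection's embedding space.",
         "parameters": {"collection": "str", "query": "str", "limit": "int"}},
        {"name": "attribute_filter",
         "description": "Filter the working set on structured metadata.",
         "parameters": {"collection": "str", "filter": "dict"}},
        {"name": "llm_filter",
         "description": "Keep only documents that satisfy a natural-language predicate.",
         "parameters": {"prompt": "str"}},
        {"name": "rank_fusion",
         "description": "Merge ranked lists from earlier stages (e.g. reciprocal rank fusion).",
         "parameters": {"method": "str", "limit": "int"}},
    ]
    ```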

    Cross-collection joins

    The casting-database example is the simple form. The general form: any two collections with comparable feature spaces can be joined via a taxonomy, so you can enrich any feature with any other feature.

    • Faces with names
    • Scenes with brands
    • Audio with transcripts
    • Products with categories

    The join is a retriever, the retriever is reusable, and the enriched output is itself a collection that feeds the next join.
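
    Written out as data, the general form might look like this (illustrative syntax only): each row is a join, each join is backed by a reusable retriever, and each enriched output lands in a new collection.

    ```python
    taxonomy_joins = [
        {"left": "faces",    "right": "casting_db",   "on": "face_embedding",    "enrich_with": "name"},
        {"left": "scenes",   "right": "brand_assets", "on": "scene_embedding",   "enrich_with": "brand"},
        {"left": "audio",    "right": "transcripts",  "on": "segment_id",        "enrich_with": "text"},
        {"left": "products", "right": "catalog",      "on": "product_embedding", "enrich_with": "category"},
    ]
    ```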

    Clustering that closes the loop

    Run a clustering job on a collection, label the centroids with an LLM, and the labels become a taxonomy you didn't have to design. Apply that taxonomy to incoming content and every new object gets placed in a category that emerged from your data. The system bootstraps its own hierarchy.

    The flywheel:

    1. More content produces better clusters
    2. Better clusters produce better taxonomies
    3. Better taxonomies produce better placement of the next batch
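
    One turn of that flywheel, sketched with hypothetical callables standing in for the clustering, labeling, and placement jobs described above:

    ```python
    def flywheel_turn(current_batch, next_batch, cluster_fn, label_fn, place_fn):
        clusters = cluster_fn(current_batch)                           # 1. content -> clusters
        taxonomy = {cid: label_fn(c) for cid, c in clusters.items()}   # 2. clusters -> taxonomy
        return [place_fn(obj, taxonomy) for obj in next_batch]         # 3. taxonomy -> placement
    ```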

    The conceptual frame

    The point of decomposition isn't to do anything magical. It takes the operation that representation learning has always done (reduce complex things to small interpretable representations) and makes it a piece of infrastructure instead of a piece of research.

    • Extractors instead of one-off model training
    • Collections instead of pickled embeddings
    • Retrievers instead of bespoke search code
    • Taxonomies instead of manual labeling
    • Clusters instead of EDA notebooks

    A 3072-dimensional embedding is a starting point. The interesting work is in what you do after the embedding. If your stack stops there, you'll spend the next eighteen months reinventing the rest in application code. We know because we've watched it happen.

    If any of that resonates, the docs walk through the primitives in the order they compose, with code.