Your company brain can't watch a video

The most useful thing anyone at your company said last quarter is probably sitting in a video nobody will watch again. It is in the all-hands recording, the customer call your account exec ran, the design review where someone explained why the old approach failed. Your company brain, the Glean or Dust or Notion AI you wired into every SaaS tool, cannot see any of it. It read the calendar invite and the follow-up doc. The forty minutes of people actually thinking, it skipped.

That is not a connector gap you close by adding one more integration. It is the architecture. Every company-brain product I have looked at is text in, text out, and most of what a company knows is not text. Before you argue about how to index institutional knowledge, you have to be able to read it in the first place.

The argument everyone is having is the wrong one

The category is hot. Glean reportedly crossed $200M ARR at a $7.2B valuation, Dust raised a $40M Series B, and Notion AI and Guru are in every stack. The engineering conversation around these tools has collapsed into a single debate: naive vector RAG or a knowledge graph.

If you have been on Hacker News lately you have seen the fight. The honest answer from the benchmark literature is that neither side wins. A systematic evaluation of RAG versus GraphRAG found no consistent winner: plain dense retrieval beat graph approaches on single-hop factual lookups and lost on multi-hop reasoning. A separate study clocked knowledge-graph RAG at 13.4% lower accuracy than vanilla RAG on Natural Questions. And the graph is not free: building it ran roughly 57x slower and querying it about 8x slower in one head-to-head. The graph only helps when the answer entities actually made it into the graph, and often they do not (around 65% coverage on common benchmarks).

So the space is pouring its energy into graph versus vector. Both sides are indexing the same thing: text. That is the miss.

Most of what a company knows is not text

Think about where knowledge actually lives. The all-hands recording. The Gong and Zoom call archive. The Figma files. Product and marketing footage. The screenshot someone pasted into the incident channel with the real stack trace. The whiteboard photo. The recorded onboarding session. Almost none of it is a document.

The standard move is to transcribe the audio and index the transcript. That throws away most of the signal. VideoRAG puts it well: a transcript “might describe the dog’s barking or growling” but fails to capture “baring teeth, raised hackles, or narrowed eyes.” The words are a fraction of what a video knows, and the authors note that RAG research has largely overlooked video entirely.

A company brain that reads transcripts is a company brain with its eyes closed. Step one is not a better index. It is refusing to flatten.

Decomposition: break the asset, do not summarize it

The alternative to flattening is decomposition: take each asset apart into named features, in the modality it was born in. This is where I reach for Mixpeek. One raw file enters as an object, gets segmented, and runs through versioned extractors that write features into a collection.

A forty-minute all-hands does not become one transcript. It becomes scene segments, each carrying a visual embedding from SigLIP, a speech segment transcribed by Whisper and embedded, and face-identity vectors so you can jump to the part where the CFO spoke. A PDF decomposes by page and paragraph. An image maps one to one. Each modality keeps its own representation instead of collapsing into English.
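To make that concrete, here is a minimal sketch of what "one asset, many addressable segments" can look like as a data structure. The Segment class, its field names, the feature keys, and the fixed-length windowing are all my own stand-ins for illustration. A real pipeline would cut on detected scene boundaries and fill the vectors in with actual models.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Segment:
    """One addressable piece of an asset (hypothetical schema, for illustration)."""
    asset_id: str
    modality: str      # e.g. "video_scene", "pdf_paragraph", "image"
    start_s: float
    end_s: float
    # feature name -> vector; left empty here, extractors would fill these in
    features: Dict[str, List[float]] = field(default_factory=dict)


def segment_video(asset_id: str, duration_s: float, window_s: float) -> List[Segment]:
    """Cut a video into fixed windows.

    Assumption: real scene detection is content-aware; fixed windows only
    stand in for it so the example stays self-contained.
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        segments.append(Segment(asset_id, "video_scene", start, end))
        start = end
    return segments


all_hands = segment_video("all_hands_q3", duration_s=40 * 60, window_s=600)
for seg in all_hands:
    # Placeholder slots for the three features named above; vectors omitted.
    seg.features = {"visual_embedding": [], "speech_embedding": [], "face_identity": []}
print(len(all_hands), all_hands[-1].end_s)
```

The point of the shape is that each segment, not the whole file, is the unit you later search, classify, and cluster.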

Every feature gets a stable address, a Feature URI, like mixpeek://multimodal_extractor@v2/vertex_multimodal_embedding. The model identifier is baked into the address, which is what lets you compare and fuse across models and swap embedding models later without re-flattening the whole archive. Now the video is queryable as what it is, not as a paragraph describing it.
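Because the address is structured, it is easy to take apart in code. The sketch below parses the example URI into extractor, version, and feature name. The three-part layout is inferred from that one example, and FeatureURI and parse_feature_uri are names I made up, so treat this as an illustration rather than the official grammar.

```python
from dataclasses import dataclass

PREFIX = "mixpeek://"


@dataclass(frozen=True)
class FeatureURI:
    extractor: str  # e.g. "multimodal_extractor"
    version: str    # e.g. "v2"
    feature: str    # e.g. "vertex_multimodal_embedding"

    def __str__(self) -> str:
        return f"{PREFIX}{self.extractor}@{self.version}/{self.feature}"


def parse_feature_uri(uri: str) -> FeatureURI:
    """Parse the assumed form mixpeek://<extractor>@<version>/<feature>."""
    if not uri.startswith(PREFIX):
        raise ValueError(f"not a feature URI: {uri!r}")
    authority, slash, feature = uri[len(PREFIX):].partition("/")
    extractor, at, version = authority.partition("@")
    if not (slash and at and extractor and version and feature):
        raise ValueError(f"malformed feature URI: {uri!r}")
    return FeatureURI(extractor, version, feature)


uri = parse_feature_uri("mixpeek://multimodal_extractor@v2/vertex_multimodal_embedding")
print(uri.extractor, uri.version, uri.feature)
```

Keeping the version as its own field is what makes "re-extract with v3, keep v2 around for comparison" a bookkeeping question instead of a migration.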

Retrieval: one query across every modality

With features in place, a retriever is the query layer, the closest thing unstructured data has to SQL. A single multimodal search runs a feature-search stage across modalities and fuses the results. One text query, ranked across video scenes, PDF pages, and images at once:

{
  "stage_name": "search",
  "stage_type": "filter",
  "config": {
    "stage_id": "feature_search",
    "parameters": {
      "searches": [
        {
          "feature_uri": "mixpeek://multimodal_extractor@v2/vertex_multimodal_embedding",
          "query": { "input_mode": "text", "value": "why we deprecated the v1 pipeline" },
          "top_k": 100
        },
        {
          "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
          "query": { "input_mode": "text", "value": "why we deprecated the v1 pipeline" },
          "lexical": true,
          "top_k": 100
        }
      ],
      "fusion": "rrf",
      "final_top_k": 25
    }
  }
}

Fusion defaults to reciprocal rank fusion at k=60, which sidesteps the score-scale mismatch between cosine similarity and BM25. Single searches land in 10 to 50ms, multi-search with fusion in 20 to 80ms. The honest caveat: at hundreds of thousands of assets, a cold novel-query vector search still costs a couple of seconds at p95 before the cache warms. That is the real number, not the demo number. But retrieval by itself is a trap, and it is the same trap the agent crowd just walked into.
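For reference, reciprocal rank fusion itself is only a few lines. This is the textbook formula, where each document scores the sum of 1 / (k + rank) over every list it appears in. It is not Mixpeek's implementation, and the document IDs are invented for the example. Because it only looks at ranks, it never has to compare a cosine score with a BM25 score.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def reciprocal_rank_fusion(
    rankings: List[List[str]],
    k: int = 60,
    final_top_k: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Fuse several ranked lists of document IDs using ranks only."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; break ties by ID so the output is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused if final_top_k is None else fused[:final_top_k]


# Hypothetical results from a vector search and a lexical search.
vector_hits = ["scene_12", "page_3", "img_7"]
lexical_hits = ["page_3", "scene_40", "scene_12"]

for doc_id, score in reciprocal_rank_fusion([vector_hits, lexical_hits]):
    print(f"{doc_id}\t{score:.5f}")
```

A document that shows up near the top of both lists beats one that tops a single list, which is the behavior you want when the two searches disagree.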

Why agents do not fix flat retrieval

The pitch for agentic retrieval was that a flat vector index is fine if you let an agent search it iteratively, refine, and retry. In practice it underdelivers in specific, measurable ways.

The agent has no map of the corpus. As one recent paper on the problem puts it, “each search query is a shot in the dark… the agent operates without a map.” It also drifts, since autonomous query reformulation wanders off the user’s original intent, and it gets lazy: in one controlled run, follow-up retrieval fired 95% of the time with a light context and only 25% once the context filled up. On a government-document QA benchmark, an iterative agent tied a single-pass retriever, 90% to 91%. More rounds, same answer.

The root issue is the lack of structure. Flat top-k never lets the model see how the corpus is organized, so it never sees the forest for the trees. The approaches that actually move the numbers give the agent a structure to walk. RAPTOR builds a tree by recursively embedding, clustering, and summarizing text chunks, and it set a new state of the art on the QuALITY benchmark. A newer method, Corpus2Skill, compiles the corpus into a navigable tree of cluster summaries the agent browses instead of blindly searching, and beats RAPTOR, agentic retrieval, and dense retrieval on its test set. That is one preprint on one dataset, so hold it loosely. But the direction is consistent: structure first, then navigate.

Give the brain a shape

Two constructs turn a pile of features into something navigable. The first is taxonomies, which classify every asset into your hierarchy by embedding similarity, not by an LLM guess and not by regex rules. Think of it as the multimodal equivalent of a JOIN: each document is matched against reference collections of canonical categories, and the match writes a labeled path onto the document. A taxonomy can be flat or hierarchical. Either way, your product footage gets categorized the moment it lands.
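The JOIN analogy is easy to sketch: compare the asset's embedding with one reference vector per category and keep the best match if it clears a threshold. Everything here is assumed for illustration, including the function names, the 0.5 threshold, the category paths, and the three-dimensional toy vectors. A real taxonomy would match against collections of reference documents, not a single vector per node.

```python
import math
from typing import List, Optional, Sequence, Tuple

Path = Tuple[str, ...]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_taxonomy(
    doc_vec: Sequence[float],
    nodes: List[Tuple[Path, Sequence[float]]],
    min_score: float = 0.5,
) -> Optional[Tuple[Path, float]]:
    """Return (category path, similarity) for the closest node, or None
    when nothing is similar enough to count as a match."""
    best: Optional[Tuple[Path, float]] = None
    for path, ref_vec in nodes:
        score = cosine(doc_vec, ref_vec)
        if best is None or score > best[1]:
            best = (path, score)
    if best is None or best[1] < min_score:
        return None
    return best


# Hypothetical reference categories with toy embeddings.
nodes = [
    (("Product", "Demos"), [1.0, 0.0, 0.0]),
    (("Product", "Incidents"), [0.0, 1.0, 0.0]),
    (("Marketing", "Launch footage"), [0.0, 0.0, 1.0]),
]

match = assign_taxonomy([0.9, 0.1, 0.0], nodes)
print(match)
```

The returned path is what gets written onto the document, so a hierarchical label like Product / Demos comes for free from how the reference nodes are named.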

Clustering is the other half. Clusters are warehouse-native grouping, a GROUP BY for embeddings. Run HDBSCAN or Leiden over the features, recursively sub-cluster into a labeled hierarchy, and let a small model name each cluster with a discriminative prompt so the labels stay distinct. That is the RAPTOR-shaped structure, built once over your real multimodal corpus instead of a text-only slice of it.
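The recursive part is the piece worth seeing in code. The sketch below builds a nested hierarchy by clustering, then re-clustering each group with a tighter similarity threshold. To keep it dependency-free I swapped in a toy greedy "leader" clusterer where HDBSCAN or Leiden would go, and I left out the model-generated labels. The asset names, vectors, thresholds, and function names are all invented for the example.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def leader_cluster(ids: List[str], vectors: Dict[str, List[float]], threshold: float) -> List[List[str]]:
    """Toy stand-in for HDBSCAN/Leiden: an item joins the first group whose
    leader (first member) is at least `threshold` similar, else starts a group."""
    groups: List[List[str]] = []
    for item in ids:
        for group in groups:
            if cosine(vectors[item], vectors[group[0]]) >= threshold:
                group.append(item)
                break
        else:
            groups.append([item])
    return groups


def build_hierarchy(ids, vectors, threshold=0.5, step=0.3, min_size=2, max_depth=3, depth=0):
    """Recursively sub-cluster, tightening the threshold at each level."""
    node = {"members": list(ids), "children": []}
    if depth >= max_depth or len(ids) <= min_size:
        return node
    groups = leader_cluster(list(ids), vectors, threshold)
    if len(groups) < 2:
        return node  # nothing to split at this threshold, keep as a leaf
    for group in groups:
        node["children"].append(
            build_hierarchy(group, vectors, min(threshold + step, 0.999),
                            step, min_size, max_depth, depth + 1)
        )
    return node


# Hypothetical assets with two-dimensional toy embeddings.
vectors = {
    "allhands_q1": [1.0, 0.0],
    "allhands_q2": [0.98, 0.2],
    "design_review": [0.6, 0.8],
    "incident_a": [0.0, 1.0],
    "incident_b": [0.1, 0.99],
}
tree = build_hierarchy(list(vectors), vectors)
print([child["members"] for child in tree["children"]])
```

In a real build, each node would also get a short label from a small model prompted to distinguish it from its siblings, which is what makes the tree readable to an agent.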

After both run, the brain has a shape: named categories plus a labeled cluster tree. The agent finally has a map.

Now let the agent navigate

Expose the whole thing as tools. The stack publishes retrievers, taxonomies, and clusters as an MCP server, around fifty tools, so an agent gets execute_retriever, execute_taxonomy, execute_cluster, and namespace search. Wiring Claude to it is one command:

claude mcp add --transport http mixpeek https://mcp.mixpeek.com/mcp \
  --header "Authorization: Bearer YOUR_API_KEY"

Now the agent does not fire blind vector queries at a flat index. It reads the taxonomy to see which categories exist, walks the cluster hierarchy to find the right neighborhood, then retrieves inside it, across video, audio, images, and documents together. That is navigation over a multimodal structure instead of retrieval over a text blob. An agent needs a warehouse, not a database, and this pattern is roughly how we built our own multimodal research agent.
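Stripped of the LLM, the navigate-then-retrieve loop looks something like the sketch below: descend the cluster tree toward the branch closest to the query, then rank only the members of the node you land in. This is a deliberately simplified greedy descent over a hand-built tree. It makes no MCP or Mixpeek calls, since the tool signatures are not given here, and a real agent would choose branches by reading the labels rather than by comparing centroids. All names and vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vecs: List[Sequence[float]]) -> List[float]:
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def navigate(query: Sequence[float], node: dict, vectors: Dict[str, List[float]],
             top_k: int = 3) -> Tuple[List[str], List[str]]:
    """Walk down the tree toward the most similar child at each level, then
    rank the members of the final node. Returns (label path, ranked IDs)."""
    path = [node["label"]]
    while node["children"]:
        node = max(
            node["children"],
            key=lambda child: cosine(query, centroid([vectors[m] for m in child["members"]])),
        )
        path.append(node["label"])
    ranked = sorted(node["members"], key=lambda m: cosine(query, vectors[m]), reverse=True)
    return path, ranked[:top_k]


vectors = {
    "allhands_q1": [1.0, 0.0],
    "allhands_q2": [0.98, 0.2],
    "incident_a": [0.0, 1.0],
    "incident_b": [0.1, 0.99],
}
tree = {
    "label": "corpus",
    "members": list(vectors),
    "children": [
        {"label": "all-hands", "members": ["allhands_q1", "allhands_q2"], "children": []},
        {"label": "incidents", "members": ["incident_a", "incident_b"], "children": []},
    ],
}

path, hits = navigate([0.3, 0.95], tree, vectors)
print(" > ".join(path), hits)
```

The useful side effect is the path itself: the agent can report which neighborhood it searched, which a flat top-k result never tells you.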

The company-brain race is being run on the wrong track. The money and the benchmarks pour into graph versus vector, an argument about how to index text, while the recordings and design files and call archives sit unindexed because the tools cannot read them. Those are the parts of the company that were expensive to produce and are impossible to reproduce.

The order that works is boring and specific. Decompose every asset in its own modality, structure it with taxonomies and hierarchical clustering, then let an agent navigate the structure. Get that right and “search our company’s knowledge” finally includes the all-hands nobody rewatched.

There are open questions I do not have clean answers to. Cold-start latency on novel queries at scale is still seconds, not milliseconds. Incrementally updating a cluster hierarchy as the corpus grows is unsolved even in the cleanest write-ups. And whether navigate-do-not-retrieve holds up outside a single benchmark is genuinely unknown. But the multimodal part is not the speculative part. That one is just overdue.