Mixpeek follows a two-tower architecture, a well-known pattern in recommendation systems adapted here for multimodal search.
## Document Tower (Ingestion)
Source files enter through a bucket, trigger one or more collections, and pass through the Ray engine for feature extraction. Each collection produces a different representation (text embeddings, multimodal embeddings, metadata, taxonomy labels), all stored as named vectors on a single Qdrant point. Documents are encoded once at ingest time. Adding a new extractor or updating a taxonomy triggers a re-process on the bucket: documents gain new representations without changing the ingestion path.
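As a rough sketch of what "named vectors on a single Qdrant point" means in practice, here is how a document with two feature representations could be stored with `qdrant-client`. The collection name, vector names, sizes, and payload fields are illustrative assumptions, not Mixpeek's actual schema:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")

# Hypothetical schema: one named vector per feature space, on the same point.
client.create_collection(
    collection_name="documents",  # illustrative name
    vectors_config={
        "text": VectorParams(size=768, distance=Distance.COSINE),
        "multimodal": VectorParams(size=512, distance=Distance.COSINE),
    },
)

# Encode once at ingest: each extractor writes its vector onto the same point.
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=1,
            vector={
                "text": [0.1] * 768,        # stand-in for a real text embedding
                "multimodal": [0.2] * 512,  # stand-in for a real multimodal embedding
            },
            payload={"taxonomy": ["sports"], "source": "bucket/video_0042.mp4"},
        )
    ],
)
```

This layout keeps every representation of a document addressable under one point id, which is what lets the query tower run one vector search per space.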
## Query Tower (Retrieval)

A query arrives, gets encoded, and passes through a multi-stage retriever. The key stage is feature search, which runs a separate vector query per embedding space and fuses the results. The fusion strategy determines how per-feature scores combine into a final ranking:

| Strategy | Behavior |
|---|---|
| `rrf` | Rank-based, no tuning needed |
| `weighted` | Manual weights you set |
| `learned` | Weights sampled from Beta distributions, updated by user behavior |
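To make the fusion step concrete, here is a minimal, self-contained sketch of Reciprocal Rank Fusion, the rank-based idea behind `rrf`. It uses the standard RRF formula (each list contributes 1/(k + rank) per document), not necessarily Mixpeek's exact implementation; the document ids and the constant `k` are illustrative:

```python
from collections import defaultdict

def rrf_fuse(rankings: dict[str, list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Combine per-feature rankings without any score calibration.

    `rankings` maps a feature space name (e.g. "text", "multimodal")
    to an ordered list of document ids, best first.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)  # standard RRF contribution
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Each embedding space returns its own ranking; fusion merges them.
fused = rrf_fuse({
    "text": ["doc_a", "doc_b", "doc_c"],
    "multimodal": ["doc_a", "doc_d", "doc_b"],
})
print(fused)  # doc_a leads because it ranks highly in both spaces
```

Because RRF only looks at ranks, it needs no tuning and is robust to embedding spaces whose raw scores live on different scales; `weighted` and `learned` instead blend the scores themselves.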
## Closing the Loop
With learned fusion, the two towers aren't static; they are connected by a feedback loop, sketched in code after this list:

- Results are shown to users
- Interactions (clicks, purchases, skips) are captured and stored in ClickHouse
- Thompson Sampling aggregates interactions into Beta(α, β) distributions per feature — α counts positive signals, β counts non-engagement
- Sampled weights are drawn from those distributions on each query, naturally balancing exploration and exploitation
- Weights converge toward the optimal blend as interactions accumulate
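Here is a minimal sketch of the Thompson Sampling step, assuming per-feature Beta posteriors as described above. The feature names, counts, and attribution model are hypothetical; the sample-then-normalize pattern is the core idea:

```python
import random

# Hypothetical interaction counts: alpha counts positive signals
# (clicks, purchases), beta counts impressions without engagement.
feature_stats = {
    "text":       {"alpha": 42.0, "beta": 18.0},
    "multimodal": {"alpha": 12.0, "beta": 30.0},
    "metadata":   {"alpha": 5.0,  "beta": 5.0},  # few observations: wide, exploratory
}

def sample_fusion_weights(stats):
    """Draw one weight per feature from its Beta(alpha, beta) posterior,
    then normalize so the weights sum to 1 for score blending."""
    draws = {name: random.betavariate(s["alpha"], s["beta"]) for name, s in stats.items()}
    total = sum(draws.values())
    return {name: w / total for name, w in draws.items()}

def record_interaction(stats, feature, engaged: bool):
    """Update a feature's posterior after an interaction is attributed to it."""
    stats[feature]["alpha" if engaged else "beta"] += 1.0

weights = sample_fusion_weights(feature_stats)  # fresh draw on every query
print(weights)  # "text" usually dominates; "metadata" still gets explored
```

Because a fresh draw happens per query, uncertain features occasionally win the draw (exploration), while features with strong evidence dominate on average (exploitation).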
## What Makes This Different
Standard two-tower systems learn a single embedding space. Mixpeek's document tower fans out into N representation spaces (visual, audio, text, multimodal, metadata), and the query tower traverses them in sequence through multi-stage retrieval. Learning happens at the fusion layer (which spaces to weight), not inside the embeddings themselves. This means you can add a new extractor, re-process your data, and the bandit will automatically discover whether the new feature improves results, without retraining any model.
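Reusing the hypothetical `feature_stats` / `sample_fusion_weights` sketch from the feedback-loop section, a newly registered extractor would simply enter the bandit as one more Beta arm with an uninformative prior:

```python
# New feature space starts at Beta(1, 1): maximum uncertainty, real traffic share.
feature_stats["audio"] = {"alpha": 1.0, "beta": 1.0}

weights = sample_fusion_weights(feature_stats)
# Early on, "audio" draws noisy weights (exploration). If interactions
# attributed to it are positive, alpha grows and its weight converges upward;
# if not, beta grows and the bandit quietly sidelines it.
```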
## Related

- Feedback Loop Tutorial — step-by-step setup guide
- Learned Fusion — Thompson Sampling algorithm details
- Interaction Signals — which signals to capture and when
- Fusion Strategies — all 5 strategies compared

