text: 0.7, image: 0.3), the system learns from interaction data which features produce results users engage with.
How It Works
Learned fusion uses Thompson Sampling, a well-studied algorithm for the multi-armed bandit problem. Here’s how it applies to search fusion:
Initialize with uniform priors
Each search feature (e.g., text embeddings, image embeddings) starts with a Beta(1, 1) distribution — a flat line that assigns equal probability to all weight values. This means zero assumptions about which feature is better.
Sample weights at query time
When a query arrives, the system draws a random weight from each feature’s Beta distribution and normalizes them to sum to 1. Early on, samples are highly variable (exploration). As data accumulates, they stabilize (exploitation).
Execute search with sampled weights
The feature search stage runs each embedding search and fuses results using the sampled weights — functionally identical to weighted fusion, but with dynamically chosen weights.
Capture user interactions
Users interact with results: clicks, purchases, skips. Each interaction is recorded with the document ID, position, and the context key that identifies which weight sample was used.
Update Beta distributions
Positive interactions (clicks, purchases) increment the alpha parameter: alpha = 1 + clicks. Non-engagement increments beta: beta = 1 + (impressions - clicks). This shifts the distribution toward weights that produce engaging results.
Thompson Sampling Explained
Think of it like flipping weighted coins. Each feature has its own coin:
- At the start, both coins are fair: you have no idea which feature is better, so you flip both and take whatever comes up.
- After 50 interactions, the text feature’s coin lands “heads” 65% of the time (users click on text-matched results more). You naturally start weighting text higher, but still try image sometimes.
- After 1000 interactions, the text coin lands heads 72% of the time with very little variance. You’re confident in the weights and rarely deviate.
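The sample-and-update loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the system's implementation: the `Arm` class and the function names are hypothetical, and the click counts are made-up numbers chosen to echo the coin example.

```python
import random
from dataclasses import dataclass


@dataclass
class Arm:
    """Beta posterior for one search feature (hypothetical structure)."""
    alpha: float = 1.0  # prior of 1 plus positive interactions
    beta: float = 1.0   # prior of 1 plus impressions without engagement


def sample_weights(arms: dict[str, Arm], rng: random.Random) -> dict[str, float]:
    """Draw one value per feature from its Beta distribution, then normalize to sum to 1."""
    draws = {name: rng.betavariate(arm.alpha, arm.beta) for name, arm in arms.items()}
    total = sum(draws.values())
    return {name: value / total for name, value in draws.items()}


def update(arm: Arm, clicks: int, impressions: int) -> None:
    """Clicks raise alpha; impressions that drew no click raise beta."""
    arm.alpha += clicks
    arm.beta += impressions - clicks


rng = random.Random(0)
arms = {"text": Arm(), "image": Arm()}

early = sample_weights(arms, rng)  # flat priors: samples vary widely (exploration)

# Illustrative data only: text results were clicked more often than image results.
update(arms["text"], clicks=72, impressions=100)
update(arms["image"], clicks=28, impressions=100)

late = sample_weights(arms, rng)   # narrower posteriors: samples stabilize (exploitation)
```

Repeated calls to `sample_weights` after the update tend to favor `text`, while still occasionally giving `image` a larger share.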
Hierarchical Fallback
Not every user has enough interaction history for personalized weights. The system uses a three-level fallback, with the uniform prior as a last resort:
| Level | Context | Min Interactions | When Used |
|---|---|---|---|
| Personal | Individual user | 5 | User has clicked/purchased enough for reliable weights |
| Demographic | User segment | 1 | User is new, but their segment has data |
| Global | All users | 1 | No segment data; uses aggregate behavior |
| Prior | Uniform | 0 | No interactions at all; falls back to equal weights |
Including user_id in your interaction signals enables personal-level learning. The segment field (e.g., “enterprise”, “consumer”, “power-user”) enables demographic-level learning.
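One way to picture the fallback order is a lookup that walks from the most specific context to the least specific. The sketch below assumes interaction counts are available in a plain dictionary; the function name and the `segment:<name>` and `global` key formats are hypothetical (only the `user:<id>` form appears in this document).

```python
from typing import Optional

# Minimum interaction counts per level, taken from the table above.
MIN_INTERACTIONS = {"personal": 5, "demographic": 1, "global": 1}


def resolve_context(
    counts: dict[str, int],
    user_id: Optional[str],
    segment: Optional[str],
) -> tuple[str, Optional[str]]:
    """Return (level, context_key) for the most specific level with enough data."""
    candidates = []
    if user_id:
        candidates.append(("personal", f"user:{user_id}"))
    if segment:
        candidates.append(("demographic", f"segment:{segment}"))  # hypothetical key format
    candidates.append(("global", "global"))                       # hypothetical key format

    for level, key in candidates:
        if counts.get(key, 0) >= MIN_INTERACTIONS[level]:
            return level, key
    return "prior", None  # no interactions anywhere: equal weights


# A user with only 3 interactions falls through to their segment.
counts = {"user:user_456": 3, "segment:enterprise": 40}
level, key = resolve_context(counts, "user_456", "enterprise")
```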
End-to-End Walkthrough
1. Create a retriever with learned fusion
2. Execute a search
The response includes the execution_id you’ll use for interaction tracking.
3. Capture interactions
4. Improved results over time
After 100+ interactions, the same search for user_456 returns results with personalized fusion weights. If this user consistently engages with text-matched results over image-matched ones, the text feature weight increases for their queries.
5. Verify convergence
Use analytics to check how weights are evolving.
Response Metadata
When learned fusion is active, the execution response includes a __learned_fusion__ object in each result’s metadata. Use it to verify the system is working and debug weight evolution:
| Field | Description |
|---|---|
| context_level | Which fallback level was used: personal, demographic, global, or none (circuit breaker triggered, using uniform weights) |
| context_key | The key used to look up interaction history (e.g., user:user_456) |
| sampled_weights | The actual weights used for this query, keyed by feature URI |
| effective_exploration | Current exploration multiplier after decay; lower means more exploitation |
| circuit_breaker_triggered | true if the weight lookup timed out and fell back to uniform weights |
| weight_resolution_ms | How long the weight lookup took in milliseconds |
Configuration Reference
Set to "learned" to enable Thompson Sampling fusion.
Each feature URI defines an “arm” in the bandit. The system learns a separate weight for each.
Passed at execution time. Enables personal-level weight learning. Without this, the system uses global weights only.
| Parameter | Default | Description |
|---|---|---|
| prior_alpha | 1.0 | Beta distribution alpha prior (uniform) |
| prior_beta | 1.0 | Beta distribution beta prior (uniform) |
| exploration_bonus | 1.0 | Multiplier for distribution variance; >1 increases exploration |
| min_interactions | 5 | Minimum interactions before using personal context |
When to Use Learned vs Static
| Scenario | Recommendation | Why |
|---|---|---|
| New product, no interaction data | rrf | No data to learn from; RRF is a strong default |
| Domain expert knows feature importance | weighted | Manual weights capture expert knowledge immediately |
| Diverse user base with different preferences | learned | Different users may benefit from different feature weights |
| A/B testing fusion approaches | rrf → learned | Start with baseline, measure improvement with evaluations |
| Single search feature | None needed | Fusion only applies when combining multiple features |
Session-Level Adaptation
Learned fusion persists weight state in ClickHouse, which has write-then-read latency (seconds to minutes). For within-session adaptation, where a user’s first few clicks should influence their next search immediately, the system uses a Redis session cache. When a user interacts with a result, the interaction is written to both ClickHouse (durable) and a Redis session cache (ephemeral, 1-hour TTL). On the next search in the same session, the bandit merges the session cache entries into the ClickHouse-backed Beta distributions before sampling.
Pass session_id on both search and interaction requests to enable this.
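Conceptually, the merge step adds the session's not-yet-ingested outcomes on top of the durable Beta parameters before a weight is sampled. The sketch below illustrates only that arithmetic. The function name and data shapes are hypothetical, no real ClickHouse or Redis calls are made, and it ignores how the system avoids double-counting events once they have been ingested.

```python
def merged_beta_params(
    durable: tuple[float, float],
    session_events: list[bool],
) -> tuple[float, float]:
    """Combine durable (alpha, beta) with cached session outcomes.

    `durable` stands in for the state read from the durable store;
    `session_events` holds one bool per cached impression (True = engaged).
    """
    alpha, beta = durable
    engaged = sum(session_events)
    return alpha + engaged, beta + (len(session_events) - engaged)


# Durable state reflects 10 clicks in 30 impressions (plus the Beta(1, 1) prior).
# The current session adds two clicks and one skip that are not yet ingested.
alpha, beta = merged_beta_params((11.0, 21.0), [True, True, False])
```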
Without session_id, the system still learns from interactions; it just won’t reflect them until ClickHouse ingests them (typically a few seconds). Session-level adaptation is optional but recommended for real-time UX.
Temporal Decay
User preferences change over time. The system applies exponential decay to older interactions so recent behavior matters more. With decay_factor: 0.995, the decay curve looks like:
| Age | Retained Weight | Effect |
|---|---|---|
| 1 day | 99.5% | Essentially full strength |
| 30 days | 86% (0.995^30) | Still strong |
| 90 days | 64% (0.995^90) | Noticeably faded |
| 180 days | 41% (0.995^180) | Weak influence |
| 365 days | 16% (0.995^365) | Nearly gone |
learning_config:
- decay_factor (default 0.995): per-day multiplier. Set to 1.0 to disable decay entirely.
- decay_window_days (default 365): interactions older than this are ignored completely, reducing query cost.
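The retained-weight column in the table follows directly from raising the per-day multiplier to the interaction's age in days. A small sketch, with a hypothetical function name and the two documented defaults as arguments:

```python
def retained_weight(
    age_days: float,
    decay_factor: float = 0.995,
    decay_window_days: int = 365,
) -> float:
    """Weight given to an interaction that is `age_days` old."""
    if age_days > decay_window_days:
        return 0.0  # outside the window: ignored completely
    return decay_factor ** age_days


def decayed_count(ages_in_days: list[float]) -> float:
    """Effective number of interactions after decay (one entry per interaction)."""
    return sum(retained_weight(age) for age in ages_in_days)


# Reproduce the table rows as percentages.
table = {days: round(retained_weight(days) * 100) for days in (30, 90, 180, 365)}
```

Under this sketch, three clicks made 1, 90, and 400 days ago count as roughly 1.63 clicks, because the oldest one falls outside the window.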
Weight Clamping
Thompson Sampling can produce extreme weights that effectively silence a feature (e.g., text: 0.99, image: 0.01). Weight clamping prevents this by enforcing minimum and maximum bounds.
Each sampled weight is clamped to [min_weight, max_weight] and then re-normalized. This guarantees that every feature contributes at least min_weight to the final fusion, even for users with heavily skewed interaction histories.
Why this matters: Without clamping, a user who clicks only text results could end up with image: 0.01 — effectively removing image search from their experience. If their preferences shift later, recovery is slow because the silenced feature produces almost no impressions to learn from.
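Clamping followed by re-normalization can be sketched as below. The function name is hypothetical, and the bounds 0.1 and 0.9 are illustrative values only, since this document does not state the defaults. Note that in this simple single-pass version, with more than two features the re-normalization step can move a clamped weight slightly off its bound.

```python
def clamp_and_normalize(
    weights: dict[str, float],
    min_weight: float,
    max_weight: float,
) -> dict[str, float]:
    """Clamp each weight into [min_weight, max_weight], then rescale to sum to 1."""
    clamped = {
        name: min(max(value, min_weight), max_weight)
        for name, value in weights.items()
    }
    total = sum(clamped.values())
    return {name: value / total for name, value in clamped.items()}


# The extreme sample from the text, with illustrative bounds of 0.1 and 0.9.
result = clamp_and_normalize({"text": 0.99, "image": 0.01}, min_weight=0.1, max_weight=0.9)
```

Here `image` is lifted from 0.01 to about 0.1, so image-matched results keep receiving impressions the bandit can learn from.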
Exploration Decay
The exploration_bonus parameter controls how much the bandit explores (tries different weight combinations) vs. exploits (uses what it has learned). With a static bonus, the bandit never fully settles on the best weights.
Exploration decay reduces the bonus as interactions accumulate:
learning_config:
- exploration_bonus (default 1.0): initial exploration multiplier. Higher values mean more random early sampling.
- exploration_decay (default 0.99): per-interaction decay rate.
- exploration_floor (default 0.1): minimum exploration. The bandit never fully stops exploring, which prevents it from getting permanently stuck on suboptimal weights if preferences change.
After 100 interactions: 1.0 * 0.99^100 = 0.37. After 500: 1.0 * 0.99^500 = 0.007 (floored to 0.1). The system converges toward exploitation while maintaining a baseline level of exploration.
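The arithmetic above amounts to a geometric decay with a lower bound. A sketch using the documented defaults (the function name is hypothetical):

```python
def effective_exploration(
    interactions: int,
    exploration_bonus: float = 1.0,
    exploration_decay: float = 0.99,
    exploration_floor: float = 0.1,
) -> float:
    """Exploration multiplier after `interactions`, never below the floor."""
    decayed = exploration_bonus * exploration_decay ** interactions
    return max(exploration_floor, decayed)


after_100 = effective_exploration(100)  # still above the floor
after_500 = effective_exploration(500)  # raw value is below 0.1, so the floor applies
```

With these defaults the floor takes over after roughly 230 interactions, since 0.99^230 is just under 0.1.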
Multi-Signal Rewards
By default, learned fusion treats click as the only learning signal. The reward_map lets you assign different reward magnitudes to different interaction types.
Positive values increase the alpha parameter for the associated feature (making it more likely to be weighted higher). Negative values increase the beta parameter (penalizing the feature). A purchase at 3.0 shifts weights three times as much as a click at 1.0.
Per-interaction rewards are also capped at max_reward_per_interaction (default 5.0) to prevent a single buggy or malicious interaction batch from dominating the learned weights.
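A reward-weighted update might look like the sketch below. The function name is hypothetical, the `skip` value of -0.5 is an illustrative assumption (only click at 1.0 and purchase at 3.0 come from the text above), and applying the cap symmetrically to negative rewards is also an assumption.

```python
def apply_reward(
    alpha: float,
    beta: float,
    interaction_type: str,
    reward_map: dict[str, float],
    max_reward_per_interaction: float = 5.0,
) -> tuple[float, float]:
    """Return updated (alpha, beta) after one interaction."""
    reward = reward_map.get(interaction_type, 0.0)  # unknown types have no effect
    # Assumption: the cap limits the magnitude of both positive and negative rewards.
    capped = max(-max_reward_per_interaction, min(reward, max_reward_per_interaction))
    if capped >= 0:
        return alpha + capped, beta   # positive reward strengthens the feature
    return alpha, beta - capped       # negative reward adds its magnitude to beta


reward_map = {"click": 1.0, "purchase": 3.0, "skip": -0.5}  # "skip" value is illustrative

after_purchase = apply_reward(1.0, 1.0, "purchase", reward_map)
after_skip = apply_reward(1.0, 1.0, "skip", reward_map)
```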
See the Reward Signals reference for all 14 supported interaction types and guidance on choosing reward values.
Related
- Auto-Tune overview — the top-level guide to the full feedback loop
- Reward Signals — configuring which interactions drive learning
- Rollout Guide — traffic splitting, shadow mode, kill switch
- Fusion Strategies — comparison of all 5 strategies
- Interaction Signals — capturing the data that powers learning
- Evaluations — measuring learned fusion quality
- Feature Search stage — where fusion is configured

