The Similarity Score Illusion
You embed a query, search your vector index, and get back results with cosine similarity scores: 0.82, 0.79, 0.76, 0.74, 0.71. The top result is clearly the best match, right? And anything below 0.70 is probably irrelevant?
Neither assumption is safe. Cosine similarity scores are not probabilities. They are not percentages. A score of 0.80 does not mean "80% similar." And the difference between 0.82 and 0.79 may be meaningful for one model and meaningless for another.
This guide explains why — by examining what actually happens in high-dimensional embedding spaces. Understanding the geometry changes how you set thresholds, evaluate retrieval quality, compare models, and build agents that make reliable decisions based on search scores.
The Narrow Band Problem
The first surprise: for most embedding models, cosine similarity scores concentrate in a narrow band rather than spreading across the full [-1, 1] range.
Run a quick experiment. Take CLIP ViT-L/14, embed 10,000 random images, and compute pairwise cosine similarities. You might expect scores spread evenly from -1 to 1. Instead, you will typically find them clustered in a narrow band, roughly between 0.15 and 0.40, with a peak around 0.25.
This is not a bug in CLIP. It is a mathematical consequence of high-dimensional geometry.
Why Scores Cluster: The Concentration of Measure
In high dimensions, random vectors become approximately orthogonal. For two independent random vectors in \(d\) dimensions, their cosine similarity converges to an approximately normal distribution with mean 0 and standard deviation \(1/\sqrt{d}\).
For CLIP's 768-dimensional embedding space, that gives a standard deviation of about 0.036. So random (unrelated) items cluster tightly around a cosine similarity of 0 with very little spread.
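The \(1/\sqrt{d}\) spread can be checked with a quick simulation. The sketch below is a minimal illustration using only the standard library; the helper names and sample sizes are arbitrary choices for this example. It draws pairs of independent Gaussian vectors and compares the empirical standard deviation of their cosine similarities with \(1/\sqrt{d}\).

```python
import math
import random
import statistics


def cosine(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def random_pair_cosines(dim, n_pairs, seed=0):
    """Cosine similarities for n_pairs of independent Gaussian random vectors."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_pairs):
        a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        b = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        sims.append(cosine(a, b))
    return sims


for dim in (16, 128, 768):
    sims = random_pair_cosines(dim, n_pairs=500)
    print(
        f"d={dim:4d}  mean={statistics.mean(sims):+.3f}  "
        f"std={statistics.stdev(sims):.3f}  1/sqrt(d)={1 / math.sqrt(dim):.3f}"
    )
```

Random Gaussian vectors are only a stand-in for "unrelated" items. Real model embeddings are not isotropic, which is why their baseline sits above zero, but the shrinking spread with growing dimension is the same effect.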
When the model is trained with contrastive learning (CLIP, SigLIP, CLAP), the distribution shifts upward for all items in the training domain, because the model learns to place semantically similar items in the same region of the sphere. But the concentration effect remains — the useful signal occupies a narrow band, not the full range.
Practical consequence: The difference between a cosine score of 0.30 and 0.35 with CLIP may represent a dramatic change in semantic relevance. Treating scores as intuitive percentages (where 0.30 means "30% similar") misses this entirely.
Distance Metrics: Cosine, L2, and Dot Product
Three distance metrics dominate vector search. They are mathematically related but behave differently in practice.
Cosine Similarity
Measures the angle between two vectors, ignoring magnitude:
cosine(a, b) = (a · b) / (‖a‖ × ‖b‖)
Range: [-1, 1]. Value of 1 means identical direction, 0 means orthogonal, -1 means opposite.
When to use: When embedding magnitudes are uninformative — which is the case for most contrastive models (CLIP, SigLIP, BGE, E5). These models are trained with normalized embeddings, so magnitude carries no semantic signal.
L2 (Euclidean) Distance
Measures the straight-line distance between two points:
L2(a, b) = sqrt(sum((a_i - b_i)²))
Range: [0, ∞). Lower is more similar.
When to use: When magnitudes matter. Some models (DINOv2, some autoencoders) produce embeddings where the vector length correlates with confidence or specificity. L2 preserves this signal; cosine discards it.
Maximum Inner Product Search (MIPS / Dot Product)
Just the raw dot product without normalization:
dot(a, b) = sum(a_i × b_i)
Range: (-∞, ∞). Higher is more similar.
When to use: When models are trained to directly predict relevance scores via dot product (some cross-encoder distillation methods, recommendation systems). Also equivalent to cosine similarity when vectors are L2-normalized.
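To make the difference concrete, here is a small sketch of the three metrics in plain Python. The vectors are made-up toy values. The second document points in the same direction as the first but is three times longer, so cosine is unchanged while dot product and L2 distance both move.

```python
import math


def dot(a, b):
    """Raw inner product."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """L2 length of a vector."""
    return math.sqrt(dot(a, a))


def cosine(a, b):
    """Angle-based similarity; ignores magnitude."""
    return dot(a, b) / (norm(a) * norm(b))


def l2_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 2.0, 2.0]
doc = [2.0, 1.0, 2.0]
doc_scaled = [3.0 * x for x in doc]  # same direction, three times the magnitude

for name, d in (("doc", doc), ("doc_scaled", doc_scaled)):
    print(
        f"{name:10s} cosine={cosine(query, d):.4f} "
        f"dot={dot(query, d):.4f} l2={l2_distance(query, d):.4f}"
    )
```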
The Equivalence Under Normalization
For L2-normalized vectors (‖a‖ = ‖b‖ = 1):
cosine(a, b) = dot(a, b)
L2(a, b)² = 2 - 2 × cosine(a, b)
Many modern embedding models output L2-normalized vectors by default, or are meant to be normalized before indexing. In that case, the metric you choose does not change retrieval quality, only the score scale. HNSW with cosine and HNSW with dot product will return the same results in the same order.
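The identities above are easy to verify numerically. This sketch uses random unit vectors (an assumption made purely for illustration) to check that squared L2 distance equals 2 - 2 × cosine, and that ranking by descending dot product matches ranking by ascending L2 distance.

```python
import math
import random


def normalize(v):
    """Scale v to unit L2 norm."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


rng = random.Random(42)
query = normalize([rng.gauss(0, 1) for _ in range(64)])
docs = [normalize([rng.gauss(0, 1) for _ in range(64)]) for _ in range(200)]

# For unit vectors, dot product equals cosine and ||a - b||^2 = 2 - 2 * cosine.
max_error = max(abs(squared_l2(query, d) - (2 - 2 * dot(query, d))) for d in docs)

# Ranking by descending dot product matches ranking by ascending L2 distance.
by_dot = sorted(range(len(docs)), key=lambda i: -dot(query, docs[i]))
by_l2 = sorted(range(len(docs)), key=lambda i: squared_l2(query, docs[i]))

print(f"max identity error: {max_error:.2e}")
print("same ranking:", by_dot == by_l2)
```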
Where it matters: If you are comparing scores across models, or mixing normalized and unnormalized embeddings, or setting absolute score thresholds — then the metric choice determines whether your thresholds are meaningful.
The Hubness Problem
Hubness is the most underappreciated failure mode in vector search. In high dimensions, some vectors become "universal near neighbors" — they appear in the nearest-neighbor lists of many queries, even though they are not particularly relevant to any of them.
What Causes Hubness
Consider a dataset of 1 million document embeddings. Some embeddings land near the centroid (center of mass) of the distribution. In high dimensions, these central vectors end up close to a disproportionate number of other vectors — not because they are semantically similar to everything, but because the geometry of high-dimensional space concentrates distances around the mean.
A vector that is close to the center of the embedding distribution will:
How Hubness Affects Search
If your dataset contains a few "hub" documents, you will see them appear in results for unrelated queries. Symptoms:
Mitigation Strategies
Score centering: Subtract the mean similarity from each score. If a document has a high average similarity to everything, its centered score is lower — penalizing hubs.
centered_score(q, d) = cosine(q, d) - mean(cosine(*, d))
The mean is computed over a sample of queries, or approximated by the document's cosine similarity to the centroid of the dataset.
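A minimal sketch of score centering follows, assuming the per-document means are precomputed offline from a sample of queries. The 2-D vectors are toy values chosen so that the first document sits between two query clusters and therefore has the highest average similarity.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean_similarity_per_doc(sample_queries, docs):
    """Average cosine of each document to a sample of queries (computed offline)."""
    return [
        sum(cosine(q, d) for q in sample_queries) / len(sample_queries)
        for d in docs
    ]


def centered_scores(query, docs, doc_means):
    """cosine(q, d) minus the document's average similarity to the query sample."""
    return [cosine(query, d) - m for d, m in zip(docs, doc_means)]


# Toy 2-D data: docs[0] points between the two query clusters (a hub-like vector).
docs = [[1.0, 1.0], [1.0, 0.1], [0.1, 1.0]]
sample_queries = [[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.2, 0.9]]

doc_means = mean_similarity_per_doc(sample_queries, docs)
query = [1.0, 0.3]
raw = [cosine(query, d) for d in docs]
centered = centered_scores(query, docs, doc_means)
print("raw:     ", [round(s, 3) for s in raw])
print("centered:", [round(s, 3) for s in centered])
```

Centering does not change the order here, but it widens the margin between the specific document and the hub-like one, which is the intended effect.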
CSLS (Cross-Domain Similarity Local Scaling): Used originally for bilingual word embedding alignment, CSLS penalizes vectors that are neighbors of many other vectors:
CSLS(q, d) = 2 × cosine(q, d) - mean_kNN(q) - mean_kNN(d)
Where `mean_kNN(x)` is the average cosine similarity of x to its k nearest neighbors in the other set: the nearest documents for a query, the nearest queries for a document.
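The CSLS formula can be sketched in a few lines. This version assumes a set of reference queries is available for computing each document's neighborhood term; the toy 2-D vectors and `k=2` are illustrative choices, not recommended settings.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean_top_k(values, k):
    """Mean of the k largest values (or of all values if there are fewer than k)."""
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top)


def csls_scores(query, docs, reference_queries, k=2):
    """CSLS(q, d) = 2 * cosine(q, d) - mean_kNN(q) - mean_kNN(d) for every doc."""
    sims = [cosine(query, d) for d in docs]
    r_q = mean_top_k(sims, k)  # query's mean similarity to its k nearest docs
    r_d = [  # each doc's mean similarity to its k nearest reference queries
        mean_top_k([cosine(q, d) for q in reference_queries], k) for d in docs
    ]
    return [2 * s - r_q - rd for s, rd in zip(sims, r_d)]


# Toy 2-D data: docs[0] sits inside a dense cluster of reference queries.
docs = [[1.0, 0.5], [0.6, 1.0]]
reference_queries = [[1.0, 0.5], [1.0, 0.6], [1.0, 0.4], [0.0, 1.0]]
query = [1.0, 0.8]

raw = [cosine(query, d) for d in docs]
adjusted = csls_scores(query, docs, reference_queries)
print("raw: ", [round(s, 3) for s in raw])
print("CSLS:", [round(s, 3) for s in adjusted])
```

In this toy setup the document in the crowded region wins on raw cosine but loses after the CSLS adjustment, because its own neighborhood term is much larger.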
Inverted Softmax: Normalize scores not just by the query's distribution but by the document's distribution of scores across queries. This directly downweights documents that score well against everything.
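As a rough sketch of the idea, the function below applies a softmax down each column of a query-by-document similarity matrix, so every document's scores are normalized across queries. The inverse temperature `beta` and the matrix values are assumptions chosen for illustration.

```python
import math


def inverted_softmax(sim_matrix, beta=10.0):
    """Normalize each document's scores over the queries (column-wise softmax).

    sim_matrix[i][j] is the similarity of query i to document j.
    """
    n_queries = len(sim_matrix)
    n_docs = len(sim_matrix[0])
    column_totals = [
        sum(math.exp(beta * sim_matrix[i][j]) for i in range(n_queries))
        for j in range(n_docs)
    ]
    return [
        [math.exp(beta * sim_matrix[i][j]) / column_totals[j] for j in range(n_docs)]
        for i in range(n_queries)
    ]


# Rows are queries, columns are documents. Document 0 scores well for every query.
sims = [
    [0.80, 0.75, 0.20],
    [0.80, 0.30, 0.70],
    [0.80, 0.25, 0.15],
]
rescored = inverted_softmax(sims)
for raw_row, new_row in zip(sims, rescored):
    best_raw = max(range(len(raw_row)), key=raw_row.__getitem__)
    best_new = max(range(len(new_row)), key=new_row.__getitem__)
    print(f"raw top doc: {best_raw}  inverted-softmax top doc: {best_new}")
```

Document 0 is the raw winner for every query. After column-wise normalization it only wins for the query that has no better-targeted alternative.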
Practical note: Most teams never explicitly address hubness. At small scale (< 100K vectors), the problem is minor. At million-scale with generic content (stock photo libraries, product catalogs with many similar items), it becomes visible and worth addressing.
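One simple way to check whether hubness is visible in your own system is to count how often each document appears across the result lists of many unrelated queries (often called k-occurrence). The sketch below assumes you have logged top-k result IDs; the IDs and the 50% cutoff are hypothetical.

```python
from collections import Counter


def k_occurrence(top_k_lists):
    """Count how many result lists each document ID appears in."""
    counts = Counter()
    for result_ids in top_k_lists:
        counts.update(set(result_ids))  # count each doc once per query
    return counts


def find_hubs(top_k_lists, max_share=0.5):
    """Document IDs that appear in more than max_share of the result lists."""
    counts = k_occurrence(top_k_lists)
    limit = max_share * len(top_k_lists)
    return sorted(doc_id for doc_id, n in counts.items() if n > limit)


# Hypothetical top-3 result IDs logged for five unrelated queries.
logged_results = [
    ["doc_17", "doc_03", "doc_42"],
    ["doc_08", "doc_17", "doc_11"],
    ["doc_17", "doc_29", "doc_05"],
    ["doc_31", "doc_02", "doc_17"],
    ["doc_44", "doc_12", "doc_03"],
]
print(find_hubs(logged_results))
```

A heavily skewed count distribution, with a handful of documents appearing far more often than the rest, is a reasonable signal that one of the mitigations above is worth trying.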
Cross-Model Score Comparison
A cosine similarity of 0.82 from CLIP means something completely different from a cosine similarity of 0.82 from BGE-large. You cannot compare scores across models without calibration.
Why Scores Differ Between Models
Each model has its own embedding space geometry, shaped by factors such as its training objective, its training temperature, and its embedding dimension.
Calibration Techniques
Percentile mapping: Convert raw scores to percentile ranks within each model's distribution. A score at the 95th percentile means "more similar than 95% of pairs" regardless of the raw number. This requires a reference distribution (sample pairwise scores from your dataset).
Z-score normalization: Subtract the mean and divide by the standard deviation of the score distribution:
z_score = (raw_score - mean) / std
A z-score of 2.0 means "2 standard deviations above the mean similarity" — comparable across models.
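A minimal sketch of z-score normalization, assuming you have a reference sample of pairwise scores for each model. The reference values below are invented for illustration.

```python
import statistics


def z_scores(raw_scores, reference_scores):
    """Standardize raw scores against a model's reference score distribution."""
    mean = statistics.mean(reference_scores)
    std = statistics.stdev(reference_scores)
    return [(s - mean) / std for s in raw_scores]


# Hypothetical reference samples of pairwise scores for two models.
model_a_reference = [0.20, 0.22, 0.25, 0.27, 0.30, 0.24, 0.26, 0.23]
model_b_reference = [0.60, 0.66, 0.72, 0.78, 0.85, 0.70, 0.75, 0.68]

print(z_scores([0.32], model_a_reference))
print(z_scores([0.82], model_b_reference))
```

Z-scores assume the score distribution is roughly symmetric. For heavily skewed distributions, percentile ranks are usually the safer choice.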
Isotonic regression: If you have labeled relevance judgments, fit a monotonic function that maps raw scores to calibrated probabilities. This is the most accurate method but requires labeled data.
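For illustration, here is a small pure-Python sketch of isotonic regression using the pool-adjacent-violators algorithm. It assumes distinct raw scores and binary relevance labels; the data is made up, and a production system would more likely use a library implementation.

```python
def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function from raw score to P(relevant)."""
    pairs = sorted(zip(scores, labels))
    # Each block holds [sum_of_labels, count, max_score_in_block].
    blocks = []
    for score, label in pairs:
        blocks.append([float(label), 1, score])
        # Merge backwards while the previous block's mean exceeds this one's.
        while (
            len(blocks) > 1
            and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def predict(calibrator, score):
    """Calibrated probability for a raw score (step-function lookup)."""
    for upper, prob in calibrator:
        if score <= upper:
            return prob
    return calibrator[-1][1]


# Hypothetical labeled judgments: raw score and whether the result was relevant.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
labels = [0, 0, 1, 0, 1, 1]
calibrator = fit_isotonic(scores, labels)
print(calibrator)
print(predict(calibrator, 0.35))
```

The out-of-order pair (relevant at 0.3, not relevant at 0.4) is pooled into a single step with probability 0.5, which keeps the mapping monotonic.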
Practical Rule of Thumb
If you do not have labeled data for calibration, use percentile ranks. Compute pairwise cosine similarities over a random sample of 10,000 documents from your dataset. Store the resulting score distribution. At query time, map each result's score to its percentile in this distribution. A result at the 99th percentile scores higher than 99% of sampled pairs regardless of the model, which makes it a far more portable relevance signal than any raw score.
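The procedure above can be sketched with a sorted reference list and a binary search. The tiny reference samples here are hypothetical stand-ins for the much larger sample of pairwise scores you would store per model.

```python
import bisect


def percentile_rank(reference_sorted, score):
    """Percent of reference scores that are <= score (reference must be sorted)."""
    return 100.0 * bisect.bisect_right(reference_sorted, score) / len(reference_sorted)


# Hypothetical pairwise-score samples for two models.
clip_reference = sorted([0.18, 0.21, 0.22, 0.24, 0.25, 0.26, 0.27, 0.29, 0.31, 0.36])
bge_reference = sorted([0.55, 0.60, 0.62, 0.66, 0.70, 0.72, 0.75, 0.78, 0.83, 0.90])

print(percentile_rank(clip_reference, 0.32))
print(percentile_rank(bge_reference, 0.85))
```

In this toy data, a raw score of 0.32 from the first model and 0.85 from the second land at the same percentile, so they can be compared or thresholded on one scale.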
Temperature and Score Sharpness
The temperature parameter during contrastive training controls how "peaked" the similarity distribution is. Lower temperature means:
How Temperature Affects Retrieval
A model trained with low temperature (e.g., τ = 0.01) produces similarity scores that are very peaked: the top result might score 0.92 while the second result scores 0.45. This makes threshold-based filtering easy but can be brittle — minor changes in the query produce large score swings.
A model trained with high temperature (e.g., τ = 0.07) produces flatter distributions: the top result scores 0.38 while the tenth result scores 0.32. This makes threshold-based filtering harder but is more robust to query perturbations.
For agents: If your agent makes binary relevant/not-relevant decisions based on a score threshold, you want a model with lower temperature (sharper separation). If your agent uses relative ranking (top-k), temperature matters less.
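To see why temperature controls sharpness, it helps to look at the training-time mechanism: contrastive losses divide similarities by τ before a softmax. The sketch below applies that softmax to one fixed set of made-up similarities at different temperatures. It illustrates the mechanism only, not the cosine scores a trained model ends up producing.

```python
import math


def softmax_with_temperature(similarities, temperature):
    """Softmax over similarities / temperature, as used in contrastive losses."""
    logits = [s / temperature for s in similarities]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


similarities = [0.35, 0.32, 0.30, 0.25]
for tau in (0.01, 0.07, 1.0):
    probs = softmax_with_temperature(similarities, tau)
    print(f"tau={tau}: {[round(p, 3) for p in probs]}")
```

At the lowest temperature nearly all probability mass goes to the top item; at the highest the distribution is close to uniform.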
Practical Threshold Setting
Setting a similarity threshold — "only return results above 0.X" — is one of the most common mistakes in production search systems. Here is why it is hard and how to do it correctly.
Why Fixed Thresholds Fail
A threshold of 0.75 might work perfectly for short, specific queries ("red Nike running shoes") but filter out every valid result for broad queries ("athletic footwear"). The similarity distribution shifts depending on:
Better Approaches
Relative threshold (gap-based): Instead of an absolute threshold, look for the largest score drop between consecutive results. If the top 5 results score [0.82, 0.80, 0.79, 0.71, 0.65], the biggest gap is between 0.79 and 0.71 — a natural cutoff.
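A minimal sketch of the gap-based cutoff, assuming the scores arrive sorted in descending order. The function name is made up for this example.

```python
def gap_cutoff(scores):
    """Number of results to keep: cut at the largest drop between neighbors.

    Assumes scores are sorted in descending order.
    """
    if len(scores) < 2:
        return len(scores)
    gaps = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    largest = max(range(len(gaps)), key=gaps.__getitem__)
    return largest + 1


scores = [0.82, 0.80, 0.79, 0.71, 0.65]
keep = gap_cutoff(scores)
print(scores[:keep])  # keeps the first three results
```

In practice you may want to require a minimum gap size before cutting, so that a uniformly spaced result list is not truncated at an arbitrary point.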
Adaptive threshold per query: Compute the mean and standard deviation of scores for the top-k results. Set the threshold at mean - 1 standard deviation. This adapts to each query's score distribution.
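The per-query adaptive threshold can be sketched as follows. The use of the population standard deviation and the `n_std` parameter name are choices made for this example.

```python
import statistics


def adaptive_filter(scores, n_std=1.0):
    """Keep scores at or above mean - n_std * std of this query's top-k scores."""
    if len(scores) < 2:
        return list(scores)
    threshold = statistics.mean(scores) - n_std * statistics.pstdev(scores)
    return [s for s in scores if s >= threshold]


top_k_scores = [0.82, 0.80, 0.79, 0.71, 0.65]
print(adaptive_filter(top_k_scores))
```

Because the threshold is derived from the result list itself, it always keeps at least the highest-scoring result, so pair it with another signal if "no good results" is a possible outcome.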
Calibrated confidence: Use the percentile mapping described earlier. Set the threshold at a fixed percentile (e.g., "only return results in the top 5% of the similarity distribution") rather than a fixed raw score.
No threshold (top-k only): For many applications, especially agent-driven retrieval, returning the top-k results without a threshold is more robust. Let the agent or downstream reranker decide relevance.
The Agent Decision Problem
When an AI agent calls a search tool, it needs to decide: "Are these results good enough to answer the user's question, or should I search differently?"
Fixed score thresholds are unreliable for this decision. Better signals include the score gap between consecutive results and agreement across search strategies.
Dimensionality and Its Effects
Embedding dimension is not just a storage consideration — it directly affects the geometry of the similarity space.
Higher Dimensions
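More dimensions means tighter score concentration (random-pair spread shrinks roughly as \(1/\sqrt{d}\)) and more storage per vector.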
Matryoshka Embeddings
Modern embedding models (Nomic, Jina v4, some BGE variants) support Matryoshka representations — the first N dimensions of the full embedding are themselves a valid embedding. You can truncate from 768 to 256 dimensions with moderate quality loss.
The geometric implication: at 256 dimensions, scores spread wider (less concentration), so thresholds need adjustment. A score of 0.30 at 768-dim might correspond to 0.35 at 256-dim for the same pair. If you change embedding dimensions, recalibrate your thresholds.
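Mechanically, truncation means keeping the leading components and renormalizing, since the shortened vector is no longer unit length. The sketch below uses made-up 8-dimensional vectors as stand-ins for real Matryoshka embeddings, so the specific scores carry no meaning beyond showing that they change.

```python
import math


def truncate_and_renormalize(embedding, dim):
    """Keep the first dim components and rescale to unit length."""
    head = embedding[:dim]
    length = math.sqrt(sum(x * x for x in head))
    return [x / length for x in head]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical 8-dim embeddings standing in for full-size vectors.
a = truncate_and_renormalize([0.9, 0.1, 0.3, 0.2, 0.1, 0.05, 0.1, 0.02], 8)
b = truncate_and_renormalize([0.8, 0.3, 0.1, 0.4, 0.2, 0.10, 0.0, 0.05], 8)

full_score = dot(a, b)
short_score = dot(truncate_and_renormalize(a, 4), truncate_and_renormalize(b, 4))
print(f"8-dim cosine: {full_score:.4f}   4-dim cosine: {short_score:.4f}")
```

The same pair gets a different score at each dimension, which is why any stored threshold or reference distribution needs to be recomputed after truncation.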
Practical Dimension Selection
| Dimension | Use Case | Score Spread | Storage per 1M Vectors (float32) |
| --- | --- | --- | --- |
| 128 | Real-time autocomplete, memory-constrained | Widest | ~512 MB |
| 256 | Mobile/edge search, high-throughput | Wide | ~1 GB |
| 512 | General-purpose retrieval | Moderate | ~2 GB |
| 768 | Standard production search | Narrow | ~3 GB |
| 1536 | High-precision, legal/medical | Narrowest | ~6 GB |
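The storage column follows from simple arithmetic, assuming 4-byte float32 values and ignoring index overhead. A short sketch for estimating other configurations:

```python
def storage_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw vector storage, excluding index overhead (float32 by default)."""
    return num_vectors * dim * bytes_per_value


for dim in (128, 256, 512, 768, 1536):
    gb = storage_bytes(1_000_000, dim) / 1e9
    print(f"{dim:5d} dims: {gb:.2f} GB")
```

Passing `bytes_per_value=2` or `1` gives a rough estimate for float16 or int8 quantized vectors.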
How This Works on Mixpeek
Mixpeek's retriever pipeline gives you control over how similarity scores are computed, normalized, and used for ranking across multiple models and modalities.
When you configure a multi-stage retriever, each stage can use a different model with a different embedding dimension and score distribution. The fusion stage (RRF or weighted) operates on ranks rather than raw scores, sidestepping the cross-model calibration problem:
from mixpeek import Mixpeek

client = Mixpeek(api_key="YOUR_API_KEY")

results = client.retrievers.search(
    retriever_id="media-retriever",
    query="person explaining a chart",
    pipeline=[
        {
            "stage_type": "search",
            "stage_id": "visual_search",
            "model": "mixpeek://image_extractor@v1/openai_clip_large_v1",
            "limit": 100
        },
        {
            "stage_type": "search",
            "stage_id": "transcript_search",
            "model": "mixpeek://text_extractor@v1/baai_bge_large_v1",
            "limit": 100
        },
        {
            "stage_type": "fusion",
            "stage_id": "rrf",
            "method": "reciprocal_rank_fusion",
            "k": 60,
            "limit": 20
        }
    ]
)
RRF uses rank positions rather than raw scores, so CLIP's cosine score of 0.32 (which might be excellent for CLIP) and BGE's score of 0.85 (typical for text retrieval) are combined correctly without manual calibration.
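For intuition, here is a standalone sketch of the standard reciprocal rank fusion formula, where each document's fused score is the sum of 1 / (k + rank) over the lists it appears in. This is not Mixpeek's internal implementation, and the document IDs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60, limit=20):
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    fused = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    ordered = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return ordered[:limit]


visual = ["clip_a", "clip_b", "clip_c"]      # hypothetical ranking from a CLIP stage
transcript = ["clip_b", "clip_d", "clip_a"]  # hypothetical ranking from a BGE stage
for doc_id, score in reciprocal_rank_fusion([visual, transcript]):
    print(doc_id, round(score, 5))
```

Raw similarity scores never enter the calculation, so the two stages can use models with entirely different score ranges.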
For agents using Mixpeek's search tools, the retriever handles the geometric complexity internally. The agent receives ranked results and can use the score gap pattern to decide confidence:
results = client.retrievers.search(...)
scores = [r["score"] for r in results["results"]]

# Score gap heuristic for agent confidence
confident = False  # default when fewer than two results come back
if len(scores) >= 2:
    gap = scores[0] - scores[1]
    confident = gap > 0.05  # clear winner exists
Key Takeaways
1. Cosine similarity scores are not percentages. A score of 0.80 does not mean "80% similar." The useful range depends on the model, dimension, and training objective.
2. Scores concentrate in narrow bands. For a narrow-band model, the gap between 0.30 and 0.35 can carry more semantic weight than the gap between 0.70 and 0.90 does for a model with a wider score range.
3. Fixed thresholds are fragile. Use percentile ranks, score gaps, or relative thresholds instead of hard-coded score cutoffs.
4. Scores from different models are incomparable. Use rank-based fusion (RRF) or percentile calibration when combining results across models.
5. Hubness is real. At scale, some documents become universal near-neighbors. Score centering or CSLS can mitigate this.
6. Dimension affects score geometry. If you truncate Matryoshka embeddings, recalibrate your thresholds.
7. For agents, prefer rank-based decisions over score-based decisions. Score gaps and cross-strategy agreement are more reliable than absolute score thresholds.