Embedding Space Geometry: Why Cosine Similarity Doesn't Always Mean What You Think

The Similarity Score Illusion

You embed a query, search your vector index, and get back results with cosine similarity scores: 0.82, 0.79, 0.76, 0.74, 0.71. The top result is clearly the best match, right? And anything below 0.70 is probably irrelevant?

Neither assumption is safe. Cosine similarity scores are not probabilities. They are not percentages. A score of 0.80 does not mean "80% similar." And the difference between 0.82 and 0.79 may be meaningful for one model and meaningless for another.

This guide explains why by examining what actually happens in high-dimensional embedding spaces. Understanding the geometry changes how you set thresholds, evaluate retrieval quality, compare models, and build agents that make reliable decisions based on search scores.

The Narrow Band Problem

The first surprise: for most embedding models, cosine similarity scores concentrate in a narrow band rather than spreading across the full [-1, 1] range.

Run a simple experiment. Take CLIP ViT-L/14, embed 10,000 random images, and compute pairwise cosine similarities. You might expect scores spread evenly from -1 to 1. Instead, you'll find the scores clustered between 0.15 and 0.40, with a peak around 0.25.

This is not a bug in CLIP. It is a mathematical consequence of high-dimensional geometry.

Why Scores Cluster: The Concentration of Measure

In high dimensions, random vectors become approximately orthogonal. For two independent random vectors in \(d\) dimensions, their cosine similarity converges to a distribution with:

Mean: approximately 0

Standard deviation: approximately \(1/\sqrt{d}\)

For CLIP's 768-dimensional embedding space, that gives a standard deviation of about 0.036. So random (unrelated) items cluster tightly around a cosine similarity of 0 with very little spread.
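
The \(1/\sqrt{d}\) figure is easy to check empirically. The sketch below draws pairs of Gaussian random vectors (a stand-in for "unrelated" embeddings, not output from any real model) and compares the observed spread of their cosine similarities with the prediction. The sample size of 400 pairs and the fixed seed are arbitrary choices for illustration.

```python
import math
import random
import statistics

def random_cosine(dim, rng):
    """Cosine similarity between two independent Gaussian random vectors."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

rng = random.Random(0)  # fixed seed so the run is repeatable
dim = 768
cosines = [random_cosine(dim, rng) for _ in range(400)]

observed_mean = statistics.mean(cosines)
observed_std = statistics.pstdev(cosines)
predicted_std = 1.0 / math.sqrt(dim)  # about 0.036 for dim = 768

print(f"mean={observed_mean:.4f} std={observed_std:.4f} predicted={predicted_std:.4f}")
```

Changing `dim` to 128 or 1536 shows the spread widening or tightening accordingly.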

When the model is trained with contrastive learning (CLIP, SigLIP, CLAP), the distribution shifts upward for all items in the training domain, because the model learns to place semantically similar items in the same region of the sphere. But the concentration effect remains: the useful signal occupies a narrow band, not the full range.

Practical consequence: The difference between a cosine score of 0.30 and 0.35 with CLIP may represent a dramatic change in semantic relevance. Treating scores as intuitive percentages (where 0.30 means "30% similar") misses this entirely.

Distance Metrics: Cosine, L2, and Dot Product

Three distance metrics dominate vector search. They are mathematically related but behave differently in practice.

Cosine Similarity

Measures the angle between two vectors, ignoring magnitude:

cosine(a, b) = (a · b) / (||a|| × ||b||)

Range: [-1, 1]. Value of 1 means identical direction, 0 means orthogonal, -1 means opposite.

When to use: When embedding magnitudes are uninformative, which is the case for most contrastive models (CLIP, SigLIP, BGE, E5). These models are trained with normalized embeddings, so magnitude carries no semantic signal.

L2 (Euclidean) Distance

Measures the straight-line distance between two points:

L2(a, b) = sqrt(sum((a_i - b_i)²))

Range: [0, ∞). Lower is more similar.

When to use: When magnitudes matter. Some models (DINOv2, some autoencoders) produce embeddings where the vector length correlates with confidence or specificity. L2 preserves this signal; cosine discards it.

Maximum Inner Product Search (MIPS / Dot Product)

Just the raw dot product without normalization:

dot(a, b) = sum(a_i × b_i)

Range: (-∞, ∞). Higher is more similar.

When to use: When models are trained to directly predict relevance scores via dot product (some cross-encoder distillation methods, recommendation systems). Also equivalent to cosine similarity when vectors are L2-normalized.

The Equivalence Under Normalization

For L2-normalized vectors (||a|| = ||b|| = 1), all three metrics produce equivalent rankings:

cosine(a, b) = dot(a, b)
L2(a, b)² = 2 - 2 × cosine(a, b)

Most modern embedding models output L2-normalized vectors by default. In that case, the metric you choose does not change retrieval quality, only the score scale. HNSW with cosine and HNSW with dot product will return the same results in the same order.
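
The two identities can be verified numerically. This is a minimal sketch in plain Python; the helper names (`dot`, `norm`, `cosine`, `l2`, `normalize`) and the example vectors are illustrative, not part of any library.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

# Two arbitrary vectors, scaled to unit length.
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

cos_ab = cosine(a, b)
# On unit vectors, the dot product equals the cosine...
assert math.isclose(dot(a, b), cos_ab)
# ...and squared L2 distance is a decreasing linear function of it.
assert math.isclose(l2(a, b) ** 2, 2 - 2 * cos_ab)
```

Because squared L2 distance falls exactly as cosine rises, sorting by ascending L2 and sorting by descending cosine give the same order.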

Where it matters: If you are comparing scores across models, or mixing normalized and unnormalized embeddings, or setting absolute score thresholds, then the metric choice determines whether your thresholds are meaningful.

The Hubness Problem

Hubness is the most underappreciated failure mode in vector search. In high dimensions, some vectors become "universal near neighbors": they appear in the nearest-neighbor lists of many queries, even though they are not particularly relevant to any of them.

What Causes Hubness

Consider a dataset of 1 million document embeddings. Some embeddings land near the centroid (center of mass) of the distribution. In high dimensions, these central vectors end up close to a disproportionate number of other vectors, not because they are semantically similar to everything, but because the geometry of high-dimensional space concentrates distances around the mean.

A vector that is close to the center of the embedding distribution will:

Have a moderate cosine similarity (0.3-0.5) with most other vectors

Appear in the top-k results for many different queries

Displace genuinely relevant results that are further from the center

How Hubness Affects Search

If your dataset contains a few "hub" documents, you will see them appear in results for unrelated queries. Symptoms:

The same 5-10 documents appear in results for very different queries

These documents tend to be generic or ambiguous (e.g., a stock photo of "business people in an office")

Increasing top-k makes the problem worse, because hubs fill more slots
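
One way to check for the first symptom is to count how often each document shows up across the top-k lists of a varied query sample. The sketch below assumes you have already collected those lists; the document IDs and the "more than half of queries" cutoff are made up for illustration.

```python
from collections import Counter

def k_occurrence(top_k_lists):
    """Count how many queries return each document in their top-k."""
    counts = Counter()
    for doc_ids in top_k_lists:
        counts.update(set(doc_ids))  # count each doc once per query
    return counts

# Hypothetical top-3 results for four unrelated queries.
top_k_lists = [
    ["shoe_12", "stock_office", "shoe_40"],
    ["contract_7", "stock_office", "contract_9"],
    ["cat_3", "cat_8", "stock_office"],
    ["recipe_5", "recipe_1", "stock_office"],
]

counts = k_occurrence(top_k_lists)
n_queries = len(top_k_lists)
# Flag documents that appear for more than half of the queries.
hubs = [doc for doc, n in counts.items() if n / n_queries > 0.5]
print(hubs)
```

On real data, a long-tailed count distribution with a handful of very high values is the signature to look for.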

Mitigation Strategies

Score centering: Subtract the mean similarity from each score. If a document has a high average similarity to everything, its centered score is lower, penalizing hubs.

centered_score(q, d) = cosine(q, d) - mean(cosine(*, d))

The mean is computed over a sample of queries (or approximated by the document's similarity to the dataset centroid).
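
A minimal sketch of score centering, assuming the per-document mean similarities have been estimated offline. The function name, document IDs, and score values are hypothetical.

```python
def center_scores(query_scores, doc_mean_scores):
    """Subtract each document's average similarity from its score for this query."""
    return {
        doc_id: score - doc_mean_scores[doc_id]
        for doc_id, score in query_scores.items()
    }

# Hypothetical raw cosine scores for one query.
query_scores = {"generic_office": 0.46, "nike_red_runner": 0.44}
# Hypothetical per-document means, estimated offline from sample queries.
doc_mean_scores = {"generic_office": 0.40, "nike_red_runner": 0.22}

centered = center_scores(query_scores, doc_mean_scores)
ranked = sorted(centered, key=centered.get, reverse=True)
print(ranked)
```

The generic document wins on raw score but drops to second after centering, because most of its score is similarity it has with everything.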

CSLS (Cross-Domain Similarity Local Scaling): Used originally for bilingual word embedding alignment, CSLS penalizes vectors that are neighbors of many other vectors:

CSLS(q, d) = 2 × cosine(q, d) - mean_kNN(q) - mean_kNN(d)

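Where mean_kNN(q) is the average cosine similarity of the query to its k nearest documents, and mean_kNN(d) is the average cosine similarity of the document to its k nearest queries (in practice, a sample of representative queries).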
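
The CSLS formula can be sketched over a small query-by-document similarity matrix. This assumes a sample of queries is available to compute the document-side neighborhood term; the matrix values and k = 2 are illustrative only.

```python
def mean_top_k(values, k):
    """Average of the k largest values."""
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top)

def csls(sims, k):
    """sims[i][j] is cosine(query i, doc j); returns the CSLS-adjusted matrix."""
    n_queries, n_docs = len(sims), len(sims[0])
    r_query = [mean_top_k(row, k) for row in sims]
    r_doc = [mean_top_k([sims[i][j] for i in range(n_queries)], k)
             for j in range(n_docs)]
    return [[2 * sims[i][j] - r_query[i] - r_doc[j] for j in range(n_docs)]
            for i in range(n_queries)]

# Hypothetical scores: doc 0 is a hub that scores 0.50 against every query.
sims = [
    [0.50, 0.48, 0.10],
    [0.50, 0.12, 0.47],
    [0.50, 0.15, 0.11],
]
adjusted = csls(sims, k=2)
best_raw = max(range(3), key=lambda j: sims[0][j])
best_csls = max(range(3), key=lambda j: adjusted[0][j])
print(best_raw, best_csls)
```

For the first query, the hub wins on raw cosine, while CSLS promotes doc 1, whose high score is specific to that query.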

Inverted Softmax: Normalize scores not just by the query's distribution but by the document's distribution of scores across queries. This directly downweights documents that score well against everything.
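
One common formulation of the inverted softmax divides each exponentiated score by the sum over queries for the same document. The sketch below follows that formulation; the inverse temperature `beta = 10` and the matrix values are assumptions chosen for illustration, and in practice `beta` would be tuned.

```python
import math

def inverted_softmax(sims, beta):
    """sims[i][j] = cosine(query i, doc j). Normalizes each column over queries."""
    n_queries, n_docs = len(sims), len(sims[0])
    col_totals = [sum(math.exp(beta * sims[i][j]) for i in range(n_queries))
                  for j in range(n_docs)]
    return [[math.exp(beta * sims[i][j]) / col_totals[j] for j in range(n_docs)]
            for i in range(n_queries)]

# Hypothetical scores: doc 0 scores 0.50 against every query (a hub).
sims = [
    [0.50, 0.48, 0.10],
    [0.50, 0.12, 0.47],
    [0.50, 0.15, 0.11],
]
probs = inverted_softmax(sims, beta=10.0)
best_for_query_0 = max(range(3), key=lambda j: probs[0][j])
print(best_for_query_0)
```

A document that scores equally against all queries ends up with the same normalized value for each of them, so it no longer outranks documents with query-specific matches.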

Practical note: Most teams never explicitly address hubness. At small scale (< 100K vectors), the problem is minor. At million-scale with generic content (stock photo libraries, product catalogs with many similar items), it becomes visible and worth addressing.

Cross-Model Score Comparison

A cosine similarity of 0.82 from CLIP means something completely different from a cosine similarity of 0.82 from BGE-large. You cannot compare scores across models without calibration.

Why Scores Differ Between Models

Each model has its own embedding space geometry, shaped by:

Training data: CLIP trained on 400M image-text pairs has a different distribution than BGE trained on text-only retrieval datasets

Training objective: Contrastive loss (CLIP, CLAP) vs. self-supervised self-distillation (DINOv2) vs. knowledge distillation (MiniLM) each produce different score distributions

Embedding dimension: 768-dim vs. 384-dim vs. 1536-dim spaces have different concentration profiles

Normalization: Some models output L2-normalized vectors, others don't. Cosine scores are unaffected, but dot-product and L2 scores on unnormalized vectors depend on magnitude and spread over a wider range

Temperature: The softmax temperature used during contrastive training directly controls the "sharpness" of the similarity distribution

Calibration Techniques

Percentile mapping: Convert raw scores to percentile ranks within each model's distribution. A score at the 95th percentile means "more similar than 95% of pairs" regardless of the raw number. This requires a reference distribution (sample pairwise scores from your dataset).

Z-score normalization: Subtract the mean and divide by the standard deviation of the score distribution:

z_score = (raw_score - mean) / std

A z-score of 2.0 means "2 standard deviations above the mean similarity," which is comparable across models.
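
A small sketch of z-score normalization for two models. The reference samples are invented numbers standing in for pairwise scores you would sample from your own dataset; they are not measurements of CLIP or BGE.

```python
import statistics

def z_scores(raw_scores, reference_scores):
    """Express raw scores in standard deviations from the reference mean."""
    mean = statistics.mean(reference_scores)
    std = statistics.pstdev(reference_scores)
    return [(s - mean) / std for s in raw_scores]

# Hypothetical reference samples of pairwise scores for two models.
clip_reference = [0.18, 0.22, 0.25, 0.28, 0.32]
bge_reference = [0.55, 0.62, 0.68, 0.74, 0.81]

clip_z = z_scores([0.35], clip_reference)[0]
bge_z = z_scores([0.85], bge_reference)[0]
print(f"clip z={clip_z:.2f}, bge z={bge_z:.2f}")
```

In this made-up example, the raw 0.35 is the stronger match relative to its own model's distribution, even though 0.85 is the larger number.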

Isotonic regression: If you have labeled relevance judgments, fit a monotonic function that maps raw scores to calibrated probabilities. This is the most accurate method but requires labeled data.
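
For illustration, isotonic regression can be sketched with the pool-adjacent-violators algorithm. This version is deliberately simplified: it returns a step function rather than interpolating between points, and the scores and binary relevance labels are hypothetical. A maintained library implementation is a better choice in production.

```python
def isotonic_fit(scores, labels):
    """Pool-adjacent-violators: non-decreasing fit of labels against scores."""
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block: [sum_of_labels, count, max_score_in_block]
    for score, label in pairs:
        blocks.append([float(label), 1, score])
        # Merge while the previous block's mean exceeds the current one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [(top, total / count) for total, count, top in blocks]

def calibrated(score, fit):
    """Step lookup: value of the first block whose upper score is >= score."""
    for top, value in fit:
        if score <= top:
            return value
    return fit[-1][1]

# Hypothetical labeled pairs: raw score and whether a judge marked it relevant.
scores = [0.20, 0.25, 0.30, 0.35, 0.40, 0.45]
labels = [0, 0, 1, 0, 1, 1]
fit = isotonic_fit(scores, labels)
print(calibrated(0.33, fit))
```

The out-of-order labels at 0.30 and 0.35 are pooled into a single block with a calibrated value of 0.5, which keeps the mapping monotonic.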

Practical Rule of Thumb

If you do not have labeled data for calibration, use percentile ranks. Compute pairwise cosine similarities over a random sample of 10,000 documents from your dataset. Store the resulting score distribution. At query time, map each result's score to its percentile in this distribution. A result at the 99th percentile scores higher than 99% of sampled pairs, which is a far more portable signal than the raw number. At large scale you may need a stricter cutoff, since 1% of a million documents is still 10,000 candidates.
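
The percentile lookup can be sketched with a sorted reference list and binary search. The ten reference values below are placeholders; a real reference distribution would hold the sampled pairwise scores described above.

```python
import bisect

def build_reference(sample_scores):
    """Sort sampled pairwise scores once so lookups are O(log n)."""
    return sorted(sample_scores)

def percentile_rank(score, reference):
    """Percentage of reference scores that are <= the given score."""
    return 100.0 * bisect.bisect_right(reference, score) / len(reference)

# Hypothetical sample of pairwise cosine scores from one model.
reference = build_reference(
    [0.12, 0.18, 0.21, 0.22, 0.24, 0.25, 0.27, 0.29, 0.31, 0.36]
)
print(percentile_rank(0.30, reference))
```

Build one reference list per model, and rebuild it whenever the model, the embedding dimension, or the dataset's content mix changes.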

Temperature and Score Sharpness

The temperature parameter during contrastive training controls how "peaked" the similarity distribution is. Lower temperature means:

Higher scores for positive pairs

Lower scores for negative pairs

A wider gap between relevant and irrelevant results

But also more sensitivity to small perturbations

How Temperature Affects Retrieval

A model trained with low temperature (e.g., τ = 0.01) produces similarity scores that are very peaked: the top result might score 0.92 while the second result scores 0.45. This makes threshold-based filtering easy but can be brittle: minor changes in the query produce large score swings.

A model trained with high temperature (e.g., τ = 0.07) produces flatter distributions: the top result scores 0.38 while the tenth result scores 0.32. This makes threshold-based filtering harder but is more robust to query perturbations.
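
To see what temperature does inside the contrastive loss, the sketch below applies a softmax to the same three cosine scores at two temperatures. Note what it does and does not show: it illustrates how the training objective weighs a fixed set of cosines, not the cosine scores a trained model will emit. The function name and score values are illustrative.

```python
import math

def softmax_with_temperature(cosines, tau):
    """Softmax over cosine scores divided by temperature tau."""
    logits = [c / tau for c in cosines]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

cosines = [0.38, 0.32, 0.30]
sharp = softmax_with_temperature(cosines, tau=0.01)
flat = softmax_with_temperature(cosines, tau=0.07)
print(f"tau=0.01 top prob={sharp[0]:.3f}, tau=0.07 top prob={flat[0]:.3f}")
```

At the lower temperature nearly all probability mass lands on the top item, while at the higher temperature the same cosine gaps leave the distribution much flatter.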

For agents: If your agent makes binary relevant/not-relevant decisions based on a score threshold, you want a model with lower temperature (sharper separation). If your agent uses relative ranking (top-k), temperature matters less.

Practical Threshold Setting

Setting a fixed similarity threshold ("only return results above 0.X") is one of the most common sources of mistakes in production search systems. Here is why it is hard and how to do it correctly.

Why Fixed Thresholds Fail

A threshold of 0.75 might work perfectly for short, specific queries ("red Nike running shoes") but filter out every valid result for broad queries ("athletic footwear"). The similarity distribution shifts depending on:

Query specificity: Specific queries produce higher scores for their matches

Dataset density: Dense regions of the embedding space produce higher average scores

Content domain: Similarity distributions differ between product images, legal documents, and meeting transcripts

Better Approaches

Relative threshold (gap-based): Instead of an absolute threshold, look for the largest score drop between consecutive results. If the top 5 results score [0.82, 0.80, 0.79, 0.71, 0.65], the biggest gap is between 0.79 and 0.71, which makes a natural cutoff.
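
A minimal sketch of the gap-based cutoff, assuming the scores are already sorted in descending order. The function name is hypothetical.

```python
def cut_at_largest_gap(scores):
    """Keep results above the largest drop between consecutive sorted scores."""
    if len(scores) < 2:
        return list(scores)
    gaps = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)  # index of the biggest drop
    return scores[: cut + 1]

kept = cut_at_largest_gap([0.82, 0.80, 0.79, 0.71, 0.65])
print(kept)
```

In practice you may want a minimum gap size as well, so that a list with no real drop is not cut at an arbitrary point.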

Adaptive threshold per query: Compute the mean and standard deviation of scores for the top-k results. Set the threshold at mean - 1 standard deviation. This adapts to each query's score distribution.
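
The per-query adaptive threshold can be sketched as follows. Using the population standard deviation (`pstdev`) is an assumption here; the text does not specify which estimator to use, and the function name is hypothetical.

```python
import statistics

def adaptive_filter(scores):
    """Keep scores at or above (mean - 1 standard deviation) of this result set."""
    threshold = statistics.mean(scores) - statistics.pstdev(scores)
    kept = [s for s in scores if s >= threshold]
    return kept, threshold

kept, threshold = adaptive_filter([0.82, 0.80, 0.79, 0.71, 0.65])
print(kept, round(threshold, 3))
```

For these five scores the threshold works out to roughly 0.69, so only the last result is dropped.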

Calibrated confidence: Use the percentile mapping described earlier. Set the threshold at a fixed percentile (e.g., "only return results in the top 5% of the similarity distribution") rather than a fixed raw score.

No threshold (top-k only): For many applications, especially agent-driven retrieval, returning the top-k results without a threshold is more robust. Let the agent or downstream reranker decide relevance.

The Agent Decision Problem

When an AI agent calls a search tool, it needs to decide: "Are these results good enough to answer the user's question, or should I search differently?"

Fixed score thresholds are unreliable for this decision. Better signals:

Score gap: Large gap between #1 and #2 suggests a clear best match

Score variance: Low variance among top-k suggests no clear winner: consider refining the query

Percentile rank: Top result at the 99.5th percentile of the model's score distribution is a strong sign of relevance

Cross-validation: Search with multiple strategies (dense, sparse, hybrid) and check if the same documents appear across strategies
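
The cross-strategy signal in the last item can be sketched as the overlap between the top-k document IDs returned by each strategy. The function name, the IDs, and k = 3 are illustrative assumptions.

```python
def strategy_agreement(result_lists, k):
    """Fraction of top-k document IDs shared by every strategy (0.0 to 1.0)."""
    top_sets = [set(ids[:k]) for ids in result_lists]
    shared = set.intersection(*top_sets)
    return len(shared) / k

# Hypothetical ranked IDs from a dense search and a sparse search.
dense = ["d1", "d4", "d2", "d9"]
sparse = ["d4", "d1", "d7", "d2"]
agreement = strategy_agreement([dense, sparse], k=3)
print(round(agreement, 2))
```

An agent could treat high agreement as evidence that the results are stable, and low agreement as a cue to rephrase the query.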

Dimensionality and Its Effects

Embedding dimension is not just a storage consideration: it directly affects the geometry of the similarity space.

Higher Dimensions

More dimensions means:

Better representational capacity: The model can encode more nuanced distinctions

Tighter score concentration: Random vectors become more orthogonal (scores cluster closer to 0)

Sharper discrimination: The gap between relevant and irrelevant results widens

Higher storage and compute costs: 1536-dim vectors cost 2x the storage of 768-dim

Matryoshka Embeddings

Modern embedding models (Nomic, Jina v4, some BGE variants) support Matryoshka representations: the first N dimensions of the full embedding are themselves a valid embedding. You can truncate from 768 to 256 dimensions with moderate quality loss.

The geometric implication: at 256 dimensions, scores spread wider (less concentration), so thresholds need adjustment. A score of 0.30 at 768-dim might correspond to 0.35 at 256-dim for the same pair. If you change embedding dimensions, recalibrate your thresholds.
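
Truncating a Matryoshka embedding is a slice followed by re-normalization, since the shortened vector is no longer unit length. A minimal sketch, using a toy 4-dimensional vector in place of a real embedding:

```python
import math

def truncate_and_renormalize(vec, dims):
    """Keep the first `dims` components and rescale to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]  # toy unit vector standing in for an embedding
short = truncate_and_renormalize(full, 2)
print(short)
```

Skipping the re-normalization step would make dot-product and L2 scores depend on how much of the vector's length happened to fall in the kept dimensions.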

Practical Dimension Selection

Dimension	Use Case	Score Spread	Storage per 1M Vectors
128	Real-time autocomplete, memory-constrained	Widest	~512 MB
256	Mobile/edge search, high-throughput	Wide	~1 GB
512	General-purpose retrieval	Moderate	~2 GB
768	Standard production search	Narrow	~3 GB
1536	High-precision, legal/medical	Narrowest	~6 GB
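
The storage figures are consistent with 4 bytes per value (float32), which is an inference from the numbers rather than something the table states. The arithmetic, for raw vectors only and excluding index overhead:

```python
def storage_bytes(n_vectors, dims, bytes_per_value=4):
    """Raw vector storage, assuming float32 values and no index overhead."""
    return n_vectors * dims * bytes_per_value

for dims in (128, 256, 512, 768, 1536):
    gigabytes = storage_bytes(1_000_000, dims) / 1e9
    print(f"{dims:>5} dims: {gigabytes:.2f} GB")
```

Passing `bytes_per_value=2` or `1` gives the corresponding estimates for float16 or int8 quantized vectors.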

How This Works on Mixpeek

Mixpeek's retriever pipeline gives you control over how similarity scores are computed, normalized, and used for ranking across multiple models and modalities.

When you configure a multi-stage retriever, each stage can use a different model with a different embedding dimension and score distribution. The fusion stage (RRF or weighted) operates on ranks rather than raw scores, sidestepping the cross-model calibration problem:

from mixpeek import Mixpeek

client = Mixpeek(api_key="YOUR_API_KEY")

results = client.retrievers.execute(
    retriever_id="media-retriever",
    query="person explaining a chart",
    pipeline=[
        {
            "stage_type": "search",
            "stage_id": "visual_search",
            "model": "mixpeek://image_extractor@v1/openai_clip_large_v1",
            "limit": 100
        },
        {
            "stage_type": "search",
            "stage_id": "transcript_search",
            "model": "mixpeek://text_extractor@v1/baai_bge_large_v1",
            "limit": 100
        },
        {
            "stage_type": "fusion",
            "stage_id": "rrf",
            "method": "reciprocal_rank_fusion",
            "k": 60,
            "limit": 20
        }
    ]
)

RRF uses rank positions rather than raw scores, so CLIP's cosine score of 0.32 (which might be excellent for CLIP) and BGE's score of 0.85 (typical for text retrieval) are combined correctly without manual calibration.
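
For reference, the standard reciprocal rank fusion formula scores each document as the sum of 1 / (k + rank) over every list it appears in. The sketch below implements that textbook formula in plain Python with k = 60. It is not a description of how Mixpeek implements its fusion stage internally, and the document IDs are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Score each doc by the sum of 1 / (k + rank) over the lists containing it."""
    fused = defaultdict(float)
    for doc_ids in ranked_lists:
        for rank, doc_id in enumerate(doc_ids, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical ranked IDs from a visual stage and a transcript stage.
visual = ["vid_3", "vid_7", "vid_1"]
transcript = ["vid_7", "vid_9", "vid_3"]
fused = reciprocal_rank_fusion([visual, transcript])
print(fused)
```

No raw score enters the computation, which is why the two stages' different score scales do not need to be reconciled first.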

For agents using Mixpeek's search tools, the retriever handles the geometric complexity internally. The agent receives ranked results and can use the score gap pattern to decide confidence:

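results = client.retrievers.execute(...)
scores = [r["score"] for r in results["results"]]

# Score gap heuristic for agent confidence.
# Assumes results arrive sorted by descending score.
confident = False  # default when fewer than two results come back
if len(scores) >= 2:
    gap = scores[0] - scores[1]
    confident = gap > 0.05  # clear winner exists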

Key Takeaways

1. Cosine similarity scores are not percentages. A score of 0.80 does not mean "80% similar." The useful range depends on the model, dimension, and training objective.

2. Scores concentrate in narrow bands. Within a model's useful band, the step from 0.30 to 0.35 can mark a bigger change in relevance than its small absolute size suggests.

3. Fixed thresholds are fragile. Use percentile ranks, score gaps, or relative thresholds instead of hard-coded score cutoffs.

4. Scores from different models are incomparable. Use rank-based fusion (RRF) or percentile calibration when combining results across models.

5. Hubness is real. At scale, some documents become universal near-neighbors. Score centering or CSLS can mitigate this.

6. Dimension affects score geometry. If you truncate Matryoshka embeddings, recalibrate your thresholds.

7. For agents, prefer rank-based decisions over score-based decisions. Score gaps and cross-strategy agreement are more reliable than absolute score thresholds.