    Embeddings
    20 min read
    Updated 2026-05-31

    Embedding Space Geometry: Why Cosine Similarity Doesn't Always Mean What You Think

    A deep dive into the geometry of embedding spaces — why cosine similarity scores cluster in narrow ranges, how the hubness problem distorts retrieval, when to use cosine vs. L2 vs. dot product, and practical techniques for threshold setting, score calibration, and cross-model comparison.

    Embeddings
    Search
    Similarity
    Algorithms
    Vector Search

    The Similarity Score Illusion



    You embed a query, search your vector index, and get back results with cosine similarity scores: 0.82, 0.79, 0.76, 0.74, 0.71. The top result is clearly the best match, right? And anything below 0.70 is probably irrelevant?

    Neither assumption is safe. Cosine similarity scores are not probabilities. They are not percentages. A score of 0.80 does not mean "80% similar." And the difference between 0.82 and 0.79 may be meaningful for one model and meaningless for another.

    This guide explains why — by examining what actually happens in high-dimensional embedding spaces. Understanding the geometry changes how you set thresholds, evaluate retrieval quality, compare models, and build agents that make reliable decisions based on search scores.

    The Narrow Band Problem



    The first surprise: for most embedding models, cosine similarity scores concentrate in a narrow band rather than spreading across the full [-1, 1] range.

    Run a random experiment. Take CLIP ViT-L/14, embed 10,000 random images, and compute pairwise cosine similarities. You might expect scores uniformly distributed from -1 to 1. Instead, you'll find the scores clustered between 0.15 and 0.40, with a peak around 0.25.

    This is not a bug in CLIP. It is a mathematical consequence of high-dimensional geometry.

    Why Scores Cluster: The Concentration of Measure



    In high dimensions, random vectors become approximately orthogonal. For two independent random vectors in \(d\) dimensions, their cosine similarity converges to a distribution with:

  1. Mean: approximately 0
  2. Standard deviation: approximately \(1/\sqrt{d}\)


    For CLIP's 768-dimensional embedding space, that gives a standard deviation of about 0.036. So random (unrelated) items cluster tightly around a cosine similarity of 0 with very little spread.
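
A quick simulation makes the \(1/\sqrt{d}\) figure concrete. The sketch below draws independent Gaussian vectors as a stand-in for unrelated items and measures the spread of their cosine similarities. That stand-in is an assumption: real embeddings are not isotropic Gaussians, which is one reason trained models show a shifted mean. The function name `random_cosine_stats` is illustrative.

```python
import numpy as np

def random_cosine_stats(dim, n_pairs=5000, seed=0):
    """Return (mean, std) of cosine similarity between independent Gaussian vectors."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n_pairs, dim))
    b = rng.standard_normal((n_pairs, dim))
    # Normalize each row to unit length so the row-wise dot product is the cosine.
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    cos = np.sum(a * b, axis=1)
    return float(cos.mean()), float(cos.std())

if __name__ == "__main__":
    for dim in (128, 768, 1536):
        mean, std = random_cosine_stats(dim)
        print(f"d={dim}: mean={mean:+.4f} std={std:.4f} 1/sqrt(d)={dim ** -0.5:.4f}")
```

The measured standard deviation should track \(1/\sqrt{d}\) closely, shrinking as the dimension grows.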

    When the model is trained with contrastive learning (CLIP, SigLIP, CLAP), the distribution shifts upward for all items in the training domain, because the model learns to place semantically similar items in the same region of the sphere. But the concentration effect remains — the useful signal occupies a narrow band, not the full range.

    Practical consequence: The difference between a cosine score of 0.30 and 0.35 with CLIP may represent a dramatic change in semantic relevance. Treating scores as intuitive percentages (where 0.30 means "30% similar") misses this entirely.

    Distance Metrics: Cosine, L2, and Dot Product



    Three distance metrics dominate vector search. They are mathematically related but behave differently in practice.

    Cosine Similarity



    Measures the angle between two vectors, ignoring magnitude:

    cosine(a, b) = (a · b) / (‖a‖ × ‖b‖)


    Range: [-1, 1]. Value of 1 means identical direction, 0 means orthogonal, -1 means opposite.

    When to use: When embedding magnitudes are uninformative — which is the case for most contrastive models (CLIP, SigLIP, BGE, E5). These models are trained with normalized embeddings, so magnitude carries no semantic signal.

    L2 (Euclidean) Distance



    Measures the straight-line distance between two points:

    L2(a, b) = sqrt(sum((a_i - b_i)²))
    


    Range: [0, ∞). Lower is more similar.

    When to use: When magnitudes matter. Some models (DINOv2, some autoencoders) produce embeddings where the vector length correlates with confidence or specificity. L2 preserves this signal; cosine discards it.

    Maximum Inner Product Search (MIPS / Dot Product)



    Just the raw dot product without normalization:

    dot(a, b) = sum(a_i × b_i)
    


    Range: (-∞, ∞). Higher is more similar.

    When to use: When models are trained to directly predict relevance scores via dot product (some cross-encoder distillation methods, recommendation systems). Also equivalent to cosine similarity when vectors are L2-normalized.

    The Equivalence Under Normalization



    For L2-normalized vectors (‖a‖ = ‖b‖ = 1), all three metrics produce equivalent rankings:

    cosine(a, b) = dot(a, b)
    L2(a, b)² = 2 - 2 × cosine(a, b)
    


    Most modern embedding models output L2-normalized vectors by default. In that case, the metric you choose does not change retrieval quality — only the score scale. HNSW with cosine and HNSW with dot product will return the same results in the same order.
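
The two identities are easy to check numerically. This is a small sketch using random unit vectors in place of real embeddings (an assumption made purely for illustration); the helper `l2_normalize` is not from any particular library.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
query = l2_normalize(rng.standard_normal(64))
docs = l2_normalize(rng.standard_normal((1000, 64)))

dot = docs @ query
cosine = dot / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
l2 = np.linalg.norm(docs - query, axis=1)

# Top-10 under each metric: highest cosine, highest dot, lowest L2.
top_by_cosine = np.argsort(-cosine)[:10]
top_by_dot = np.argsort(-dot)[:10]
top_by_l2 = np.argsort(l2)[:10]

identity_holds = np.allclose(l2 ** 2, 2 - 2 * cosine)
same_ranking = (np.array_equal(top_by_cosine, top_by_dot)
                and np.array_equal(top_by_cosine, top_by_l2))
print(identity_holds, same_ranking)
```

This compares exact (brute-force) rankings. An approximate index may still return slightly different candidates per metric because of how its graph or partitions are built.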

    Where it matters: If you are comparing scores across models, or mixing normalized and unnormalized embeddings, or setting absolute score thresholds — then the metric choice determines whether your thresholds are meaningful.

    The Hubness Problem



    Hubness is the most underappreciated failure mode in vector search. In high dimensions, some vectors become "universal near neighbors" — they appear in the nearest-neighbor lists of many queries, even though they are not particularly relevant to any of them.

    What Causes Hubness



    Consider a dataset of 1 million document embeddings. Some embeddings land near the centroid (center of mass) of the distribution. In high dimensions, these central vectors end up close to a disproportionate number of other vectors — not because they are semantically similar to everything, but because the geometry of high-dimensional space concentrates distances around the mean.

    A vector that is close to the center of the embedding distribution will:
  1. Have a moderate cosine similarity (0.3–0.5) with most other vectors
  2. Appear in the top-k results for many different queries
  3. Displace genuinely relevant results that are further from the center


    How Hubness Affects Search



    If your dataset contains a few "hub" documents, you will see them appear in results for unrelated queries. Symptoms:

  1. The same 5–10 documents appear in results for very different queries
  2. These documents tend to be generic or ambiguous (e.g., a stock photo of "business people in an office")
  3. Increasing top-k makes the problem worse, because hubs fill more slots
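
One way to check for these symptoms is to count how often each document lands in the top-k across a sample of queries (often called the k-occurrence count). The sketch below assumes you can hold a sample of query vectors and the document vectors in memory and uses exact search; `k_occurrence` is an illustrative name, not a library function.

```python
import numpy as np

def k_occurrence(doc_vecs, query_vecs, k=10):
    """Count how many sampled queries have each document in their top-k."""
    docs = np.asarray(doc_vecs, dtype=float)
    queries = np.asarray(query_vecs, dtype=float)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    queries = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    sims = queries @ docs.T  # shape: (n_queries, n_docs)
    # argpartition on negated scores puts the k highest-scoring docs first (unordered).
    topk = np.argpartition(-sims, k - 1, axis=1)[:, :k]
    return np.bincount(topk.ravel(), minlength=docs.shape[0])

# Toy example: the third document sits between the other two and acts as a hub.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[1.0, 0.1], [0.1, 1.0]]
counts = k_occurrence(docs, queries, k=2)
print(counts)
```

On real data, the expected count for an average document is `n_queries * k / n_docs`. Documents whose counts are many times that value are hub candidates.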


    Mitigation Strategies



    Score centering: Subtract the mean similarity from each score. If a document has a high average similarity to everything, its centered score is lower — penalizing hubs.

    centered_score(q, d) = cosine(q, d) - mean(cosine(*, d))
    


    The mean is computed over a sample of queries. A cheaper approximation is the document's cosine similarity to the dataset centroid.
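
A minimal sketch of score centering, assuming a sample of past query vectors is available to estimate each document's average similarity. The function and argument names are hypothetical.

```python
import numpy as np

def _unit(x):
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def centered_scores(query_vec, doc_vecs, sample_query_vecs):
    """cosine(q, d) minus each document's mean cosine over a query sample."""
    q = _unit(query_vec)
    docs = _unit(doc_vecs)
    sample = _unit(sample_query_vecs)
    raw = docs @ q                             # cosine(q, d) for every document
    doc_mean = (sample @ docs.T).mean(axis=0)  # mean(cosine(*, d)) per document
    return raw - doc_mean

docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # third doc is close to everything
sample_queries = [[1.0, 0.0], [0.0, 1.0]]
scores = centered_scores([1.0, 0.0], docs, sample_queries)
print(scores)
```

In this toy case the third document has a raw cosine of about 0.71 against the query, but because it scores about 0.71 against every sampled query, its centered score drops to zero.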

    CSLS (Cross-Domain Similarity Local Scaling): Used originally for bilingual word embedding alignment, CSLS penalizes vectors that are neighbors of many other vectors:

    CSLS(q, d) = 2 × cosine(q, d) - mean_kNN(q) - mean_kNN(d)
    


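    Where `mean_kNN(x)` is the average cosine similarity of x to its k nearest neighbors. In the original formulation each vector's neighbors are taken from the other domain; applied to search, that means a query's nearest documents and a document's nearest queries.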
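
The formula can be sketched over a full query-by-document similarity matrix. This version assumes a sample of queries stands in for the "other domain" when computing each document's neighborhood, which is one reasonable reading rather than a prescribed setup. `csls_scores` is an illustrative name.

```python
import numpy as np

def csls_scores(query_vecs, doc_vecs, k=10):
    """Return a (n_queries, n_docs) matrix of CSLS-adjusted scores."""
    q = np.asarray(query_vecs, dtype=float)
    d = np.asarray(doc_vecs, dtype=float)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    sims = q @ d.T
    # mean_kNN(q): average similarity of each query to its k nearest documents.
    r_q = np.sort(sims, axis=1)[:, -k:].mean(axis=1)
    # mean_kNN(d): average similarity of each document to its k nearest queries.
    r_d = np.sort(sims, axis=0)[-k:, :].mean(axis=0)
    return 2 * sims - r_q[:, None] - r_d[None, :]

queries = [[1.0, 0.0], [0.0, 1.0]]
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adjusted = csls_scores(queries, docs, k=1)
print(adjusted)
```

If `k` exceeds the number of available neighbors, the slices simply average over all of them.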

    Inverted Softmax: Normalize scores not just by the query's distribution but by the document's distribution of scores across queries. This directly downweights documents that score well against everything.
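
As a rough sketch, the inverted softmax can be computed from a query-by-document similarity matrix by normalizing each document's column over the queries. The inverse-temperature value `beta=10.0` and the function name are assumptions for illustration; in practice `beta` would need tuning.

```python
import numpy as np

def inverted_softmax(sims, beta=10.0):
    """Normalize each document's scores over queries (columns sum to 1)."""
    logits = beta * np.asarray(sims, dtype=float)
    # Subtract the per-column max for numerical stability; the result is unchanged.
    logits = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=0, keepdims=True)

s = np.sqrt(0.5)
sims = np.array([[1.0, 0.0, s],   # query 0 vs docs 0, 1, 2
                 [0.0, 1.0, s]])  # query 1 vs docs 0, 1, 2
adjusted = inverted_softmax(sims)
print(adjusted)
```

The third document scores equally against both queries, so its mass is split evenly between them, while the first document keeps nearly all of its mass on query 0.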

    Practical note: Most teams never explicitly address hubness. At small scale (< 100K vectors), the problem is minor. At million-scale with generic content (stock photo libraries, product catalogs with many similar items), it becomes visible and worth addressing.

    Cross-Model Score Comparison



    A cosine similarity of 0.82 from CLIP means something completely different from a cosine similarity of 0.82 from BGE-large. You cannot compare scores across models without calibration.

    Why Scores Differ Between Models



    Each model has its own embedding space geometry, shaped by:

  1. Training data: CLIP trained on 400M image-text pairs has a different distribution than BGE trained on text-only retrieval datasets
  2. Training objective: Contrastive loss (CLIP, CLAP) vs. self-supervised self-distillation (DINOv2) vs. knowledge distillation (MiniLM) each produce different score distributions
  3. Embedding dimension: 768-dim vs. 384-dim vs. 1536-dim spaces have different concentration profiles
  4. Normalization: Some models output L2-normalized vectors, others don't. Cosine similarity ignores vector length, but dot-product and L2 scores on unnormalized vectors also vary with magnitude, so they spread wider
  5. Temperature: The softmax temperature used during contrastive training directly controls the "sharpness" of the similarity distribution


    Calibration Techniques



    Percentile mapping: Convert raw scores to percentile ranks within each model's distribution. A score at the 95th percentile means "more similar than 95% of pairs" regardless of the raw number. This requires a reference distribution (sample pairwise scores from your dataset).

    Z-score normalization: Subtract the mean and divide by the standard deviation of the score distribution:

    z_score = (raw_score - mean) / std
    


    A z-score of 2.0 means "2 standard deviations above the mean similarity" — comparable across models.
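
A minimal sketch of z-score normalization against a per-model reference sample. The reference values below are made up for illustration; in practice they would be pairwise scores sampled from your own dataset with that model.

```python
import numpy as np

def z_scores(raw_scores, reference_scores):
    """Express raw scores in standard deviations above the reference mean."""
    ref = np.asarray(reference_scores, dtype=float)
    return (np.asarray(raw_scores, dtype=float) - ref.mean()) / ref.std()

# Hypothetical reference sample for one model.
reference = [0.20, 0.25, 0.30]
z = z_scores([0.33], reference)
print(z)
```

The same function applied with another model's reference sample puts both models' scores on a common scale.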

    Isotonic regression: If you have labeled relevance judgments, fit a monotonic function that maps raw scores to calibrated probabilities. This is the most accurate method but requires labeled data.
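
For readers who want to see what the monotonic fit involves, here is a dependency-free sketch of the pool-adjacent-violators procedure that underlies isotonic regression. It assumes binary relevance labels and distinct scores, and it returns a simple step function; the names `fit_isotonic` and `predict` are illustrative. A maintained implementation such as scikit-learn's `IsotonicRegression` is likely a better choice in production.

```python
def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function from scores to mean relevance."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks = []  # each block: [label_sum, count, highest_score_in_block]
    for i in order:
        blocks.append([float(labels[i]), 1, scores[i]])
        # Merge backwards while an earlier block has a higher mean than the next one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            label_sum, count, high = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
            blocks[-1][2] = high
    return [(high, label_sum / count) for label_sum, count, high in blocks]

def predict(model, score):
    """Look up the calibrated value for a raw score."""
    for upper, value in model:
        if score <= upper:
            return value
    return model[-1][1]

# Made-up labeled judgments: 1 = relevant, 0 = not relevant.
model = fit_isotonic([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], [0, 0, 1, 0, 1, 1])
print(predict(model, 0.35))
```

The out-of-order labels at 0.3 and 0.4 are pooled into one block with a calibrated value of 0.5, which keeps the mapping monotonic.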

    Practical Rule of Thumb



    If you do not have labeled data for calibration, use percentile ranks. Compute pairwise cosine similarities over a random sample of 10,000 documents from your dataset. Store the resulting score distribution. At query time, map each result's score to its percentile in this distribution. A result at the 99th percentile scores higher than 99% of sampled pairs, which is a far more portable signal of relevance than any raw score.
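
The procedure above can be sketched as two small functions. To keep memory bounded, this version samples random document pairs instead of materializing every pairwise score; that shortcut, the default of 100,000 pairs, and the function names are assumptions, not part of the rule of thumb itself.

```python
import numpy as np

def sample_pairwise_scores(doc_vecs, n_pairs=100_000, seed=0):
    """Return a sorted sample of cosine similarities between random document pairs."""
    rng = np.random.default_rng(seed)
    vecs = np.asarray(doc_vecs, dtype=float)
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    i = rng.integers(0, len(unit), size=n_pairs)
    j = rng.integers(0, len(unit), size=n_pairs)
    keep = i != j  # drop self-pairs, which always score 1.0
    scores = np.einsum("nd,nd->n", unit[i[keep]], unit[j[keep]])
    return np.sort(scores)

def to_percentile(scores, sorted_reference):
    """Map raw scores to the percentage of reference scores at or below them."""
    ranks = np.searchsorted(sorted_reference, scores, side="right")
    return 100.0 * ranks / len(sorted_reference)

reference = np.array([0.1, 0.2, 0.3, 0.4])  # tiny stand-in reference distribution
print(to_percentile([0.05, 0.25, 0.40], reference))
```

Build the reference once per model and dataset, store the sorted array, and call `to_percentile` on each result set at query time.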

    Temperature and Score Sharpness



    The temperature parameter during contrastive training controls how "peaked" the similarity distribution is. Lower temperature means:

  1. Higher scores for positive pairs
  2. Lower scores for negative pairs
  3. A wider gap between relevant and irrelevant results
  4. But also more sensitivity to small perturbations


    How Temperature Affects Retrieval



    A model trained with low temperature (e.g., τ = 0.01) produces similarity scores that are very peaked: the top result might score 0.92 while the second result scores 0.45. This makes threshold-based filtering easy but can be brittle — minor changes in the query produce large score swings.

    A model trained with high temperature (e.g., τ = 0.07) produces flatter distributions: the top result scores 0.38 while the tenth result scores 0.32. This makes threshold-based filtering harder but is more robust to query perturbations.

    For agents: If your agent makes binary relevant/not-relevant decisions based on a score threshold, you want a model with lower temperature (sharper separation). If your agent uses relative ranking (top-k), temperature matters less.
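
To see why temperature changes sharpness, it helps to look at the softmax used in a contrastive loss, where similarities are divided by τ before normalization. The sketch below is not a training loop. It only shows how the same four similarity values (made up for illustration) turn into very different probability distributions at τ = 0.07 and τ = 0.01.

```python
import numpy as np

def softmax_with_temperature(sims, tau):
    """Softmax over similarities scaled by 1/tau, as in a contrastive objective."""
    logits = np.asarray(sims, dtype=float) / tau
    logits -= logits.max()  # numerical stability; does not change the result
    weights = np.exp(logits)
    return weights / weights.sum()

sims = [0.32, 0.30, 0.28, 0.20]  # hypothetical similarities: one positive, three negatives
p_high_tau = softmax_with_temperature(sims, tau=0.07)
p_low_tau = softmax_with_temperature(sims, tau=0.01)
print(p_high_tau.round(3), p_low_tau.round(3))
```

At the lower temperature the top candidate takes most of the probability mass, so small similarity differences are amplified during training.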

    Practical Threshold Setting



    Setting a similarity threshold — "only return results above 0.X" — is one of the most common mistakes in production search systems. Here is why it is hard and how to do it correctly.

    Why Fixed Thresholds Fail



    A threshold of 0.75 might work perfectly for short, specific queries ("red Nike running shoes") but filter out every valid result for broad queries ("athletic footwear"). The similarity distribution shifts depending on:

  1. Query specificity: Specific queries produce higher scores for their matches
  2. Dataset density: Dense regions of the embedding space produce higher average scores
  3. Content domain: Similarity distributions differ between product images, legal documents, and meeting transcripts


    Better Approaches



    Relative threshold (gap-based): Instead of an absolute threshold, look for the largest score drop between consecutive results. If the top 5 results score [0.82, 0.80, 0.79, 0.71, 0.65], the biggest gap is between 0.79 and 0.71 — a natural cutoff.
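
A minimal sketch of the gap-based cutoff, assuming the scores arrive sorted from highest to lowest. `gap_cutoff` is an illustrative name.

```python
def gap_cutoff(scores):
    """Return how many results to keep, cutting at the largest drop between neighbors."""
    if len(scores) < 2:
        return len(scores)
    drops = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    return drops.index(max(drops)) + 1

scores = [0.82, 0.80, 0.79, 0.71, 0.65]
keep = gap_cutoff(scores)
print(scores[:keep])
```

With the scores from the example above, the cut falls after the third result. In practice you may want to limit the search for the largest gap to the first few positions so a late drop deep in the list does not decide the cutoff.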

    Adaptive threshold per query: Compute the mean and standard deviation of scores for the top-k results. Set the threshold at mean - 1 standard deviation. This adapts to each query's score distribution.
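
The adaptive rule can be sketched with the standard library. The choice of population standard deviation (`pstdev`) over the sample version is an assumption, and the `n_std` parameter is added here for illustration.

```python
import statistics

def adaptive_filter(scores, n_std=1.0):
    """Keep results scoring at least mean - n_std * std of this query's top-k."""
    if len(scores) < 2:
        return list(scores)
    threshold = statistics.fmean(scores) - n_std * statistics.pstdev(scores)
    return [s for s in scores if s >= threshold]

kept = adaptive_filter([0.82, 0.80, 0.79, 0.71, 0.65])
print(kept)
```

For these scores the mean is 0.754 and the standard deviation is about 0.064, so the threshold lands near 0.69 and only the last result is dropped.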

    Calibrated confidence: Use the percentile mapping described earlier. Set the threshold at a fixed percentile (e.g., "only return results in the top 5% of the similarity distribution") rather than a fixed raw score.

    No threshold (top-k only): For many applications, especially agent-driven retrieval, returning the top-k results without a threshold is more robust. Let the agent or downstream reranker decide relevance.

    The Agent Decision Problem



    When an AI agent calls a search tool, it needs to decide: "Are these results good enough to answer the user's question, or should I search differently?"

    Fixed score thresholds are unreliable for this decision. Better signals:

  1. Score gap: Large gap between #1 and #2 suggests a clear best match
  2. Score variance: Low variance among top-k suggests no clear winner — consider refining the query
  3. Percentile rank: Top result at the 99.5th percentile of the model's score distribution is reliably relevant
  4. Cross-validation: Search with multiple strategies (dense, sparse, hybrid) and check if the same documents appear across strategies
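
Several of these signals can be computed together and handed to the agent as structured context. The sketch below is one possible shape, with hypothetical field names. It measures cross-strategy agreement as the Jaccard overlap between the document IDs returned by a dense and a sparse search, which is an assumption about how "the same documents appear" might be quantified.

```python
import statistics

def retrieval_signals(scores, dense_ids, sparse_ids):
    """Summarize a result set with gap, variance, and cross-strategy agreement."""
    top_gap = scores[0] - scores[1] if len(scores) >= 2 else 0.0
    variance = statistics.pvariance(scores) if len(scores) >= 2 else 0.0
    dense, sparse = set(dense_ids), set(sparse_ids)
    union = dense | sparse
    overlap = len(dense & sparse) / len(union) if union else 0.0
    return {
        "top_gap": top_gap,
        "score_variance": variance,
        "strategy_overlap": overlap,
    }

signals = retrieval_signals(
    scores=[0.82, 0.70, 0.69],
    dense_ids=["a", "b", "c"],
    sparse_ids=["a", "c", "d"],
)
print(signals)
```

How the agent weighs these values is left open; the point is to decide from several relative signals instead of one absolute cutoff.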


    Dimensionality and Its Effects



    Embedding dimension is not just a storage consideration — it directly affects the geometry of the similarity space.

    Higher Dimensions



    More dimensions means:
  1. Better representational capacity: The model can encode more nuanced distinctions
  2. Tighter score concentration: Random vectors become more orthogonal (scores cluster closer to 0)
  3. Sharper discrimination: The gap between relevant and irrelevant results widens
  4. Higher storage and compute costs: 1536-dim vectors cost 2x the storage of 768-dim


    Matryoshka Embeddings



    Modern embedding models (Nomic, Jina v4, some BGE variants) support Matryoshka representations — the first N dimensions of the full embedding are themselves a valid embedding. You can truncate from 768 to 256 dimensions with moderate quality loss.

    The geometric implication: at 256 dimensions, scores spread wider (less concentration), so thresholds need adjustment. A score of 0.30 at 768-dim might correspond to 0.35 at 256-dim for the same pair. If you change embedding dimensions, recalibrate your thresholds.
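
Truncation itself is a slice followed by re-normalization. The sketch below uses random Gaussian vectors as a stand-in for real Matryoshka embeddings, which is a simplifying assumption (a trained model packs more information into the leading dimensions), but it is enough to show the score spread widening at lower dimension. Function names are illustrative.

```python
import numpy as np

def truncate_and_renormalize(vecs, dim):
    """Keep the first `dim` components and rescale each vector to unit length."""
    cut = np.asarray(vecs, dtype=float)[..., :dim]
    return cut / np.linalg.norm(cut, axis=-1, keepdims=True)

def cosine_spread(vecs, dim):
    """Std of cosine similarity between the first and second half of `vecs`."""
    unit = truncate_and_renormalize(vecs, dim)
    half = len(unit) // 2
    return float(np.sum(unit[:half] * unit[half:2 * half], axis=1).std())

vecs = np.random.default_rng(0).standard_normal((4000, 768))
for dim in (768, 256):
    print(dim, round(cosine_spread(vecs, dim), 4))
```

Skipping the re-normalization step would leave truncated vectors shorter than unit length, so dot-product scores would no longer equal cosine similarity.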

    Practical Dimension Selection



    | Dimension | Use Case | Score Spread | Storage per 1M Vectors |
    |-----------|----------|--------------|------------------------|
    | 128 | Real-time autocomplete, memory-constrained | Widest | ~512 MB |
    | 256 | Mobile/edge search, high-throughput | Wide | ~1 GB |
    | 512 | General-purpose retrieval | Moderate | ~2 GB |
    | 768 | Standard production search | Narrow | ~3 GB |
    | 1536 | High-precision, legal/medical | Narrowest | ~6 GB |
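
The storage column is consistent with 4 bytes per component (float32), an inference from the numbers since the table does not state the precision. A short sketch of the arithmetic, covering raw vectors only and excluding any index overhead:

```python
def storage_bytes(dim, n_vectors=1_000_000, bytes_per_value=4):
    """Raw vector storage in bytes, assuming a fixed-width numeric type."""
    return dim * n_vectors * bytes_per_value

for dim in (128, 256, 512, 768, 1536):
    print(f"{dim:>5} dims: {storage_bytes(dim) / 1e9:.2f} GB")
```

Changing `bytes_per_value` to 2 or 1 gives the corresponding figures for float16 or int8 storage.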

    How This Works on Mixpeek



    Mixpeek's retriever pipeline gives you control over how similarity scores are computed, normalized, and used for ranking across multiple models and modalities.

    When you configure a multi-stage retriever, each stage can use a different model with a different embedding dimension and score distribution. The fusion stage (RRF or weighted) operates on ranks rather than raw scores, sidestepping the cross-model calibration problem:

    from mixpeek import Mixpeek

    client = Mixpeek(api_key="YOUR_API_KEY")

    results = client.retrievers.search(
        retriever_id="media-retriever",
        query="person explaining a chart",
        pipeline=[
            {
                "stage_type": "search",
                "stage_id": "visual_search",
                "model": "mixpeek://image_extractor@v1/openai_clip_large_v1",
                "limit": 100,
            },
            {
                "stage_type": "search",
                "stage_id": "transcript_search",
                "model": "mixpeek://text_extractor@v1/baai_bge_large_v1",
                "limit": 100,
            },
            {
                "stage_type": "fusion",
                "stage_id": "rrf",
                "method": "reciprocal_rank_fusion",
                "k": 60,
                "limit": 20,
            },
        ],
    )


    RRF uses rank positions rather than raw scores, so CLIP's cosine score of 0.32 (which might be excellent for CLIP) and BGE's score of 0.85 (typical for text retrieval) are combined correctly without manual calibration.
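
For intuition, reciprocal rank fusion can be written in a few lines: each document earns `1 / (k + rank)` from every ranking it appears in, and the sums decide the final order. This is a generic sketch of the technique, not Mixpeek's implementation. The function name and the toy document IDs are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60, limit=20):
    """Fuse ranked lists of document IDs; returns (doc_id, fused_score) pairs."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)[:limit]

visual_ranking = ["a", "b", "c"]      # e.g. ordered by CLIP cosine score
transcript_ranking = ["b", "c", "a"]  # e.g. ordered by BGE cosine score
fused = reciprocal_rank_fusion([visual_ranking, transcript_ranking])
print([doc_id for doc_id, _ in fused])
```

Raw scores never enter the computation, which is why the two models' different score scales do not matter.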

    For agents using Mixpeek's search tools, the retriever handles the geometric complexity internally. The agent receives ranked results and can use the score gap pattern to decide confidence:

    results = client.retrievers.search(...)
    scores = [r["score"] for r in results["results"]]

    # Score gap heuristic for agent confidence
    confident = False  # default when fewer than two results come back
    if len(scores) >= 2:
        gap = scores[0] - scores[1]
        confident = gap > 0.05  # clear winner exists


    Key Takeaways



    1. Cosine similarity scores are not percentages. A score of 0.80 does not mean "80% similar." The useful range depends on the model, dimension, and training objective.

    2. Scores concentrate in narrow bands. With a model such as CLIP, the step from 0.30 to 0.35 can mark a larger change in relevance than its absolute size suggests.

    3. Fixed thresholds are fragile. Use percentile ranks, score gaps, or relative thresholds instead of hard-coded score cutoffs.

    4. Scores from different models are incomparable. Use rank-based fusion (RRF) or percentile calibration when combining results across models.

    5. Hubness is real. At scale, some documents become universal near-neighbors. Score centering or CSLS can mitigate this.

    6. Dimension affects score geometry. If you truncate Matryoshka embeddings, recalibrate your thresholds.

    7. For agents, prefer rank-based decisions over score-based decisions. Score gaps and cross-strategy agreement are more reliable than absolute score thresholds.

    Further Reading



  1. Multi-Index Search Architecture: combining results from multiple indexes with score fusion
  2. Embedding Quantization & Compression: how quantization affects similarity scores and retrieval quality
  3. Approximate Nearest Neighbor Search: the algorithms behind fast vector retrieval and their accuracy tradeoffs
  4. Contrastive Learning: how CLIP, SigLIP, and CLAP training objectives shape embedding geometry
  5. Instruction-Tuned Embeddings: how task prompts shift score distributions for different retrieval scenarios