Calibrating Similarity Scores: What Cosine Similarity Actually Means for Retrieval

The Decision Hiding Behind Every Score

When an AI agent searches a video library, a document store, or an image index, it gets back results with similarity scores: 0.42, 0.38, 0.31. Then it has to make a decision. Is the top result good enough to answer the question? Should it cite it, act on it, or say "I could not find that"? Should it pull the top 3, or only results above some bar?

That decision is a threshold on a similarity score. And it is where most retrieval systems quietly go wrong, because the score the agent is thresholding does not mean what it looks like it means. A cosine similarity of 0.31 is a strong match in one model and noise in another. The same query can produce 0.8 against an easy corpus and 0.4 against a hard one while retrieving equally good results. An agent that hard-codes "relevant if score > 0.7" will be overconfident on some queries and blind on others.

This guide is about what a similarity score actually is, why raw scores are not comparable or interpretable on their own, and how to calibrate them so a threshold means something.

What Cosine Similarity Computes

Cosine similarity between two vectors is the cosine of the angle between them:

cos(a, b) = (a . b) / (||a|| * ||b||)

If both vectors are L2-normalized to length 1, which is what most embedding pipelines do, the denominator is 1 and cosine similarity reduces to a plain dot product. Either way the value lies in [-1, 1]. For the encoders used in semantic search it is usually positive, so in practice you see roughly [0, 1].

Geometrically it measures direction, not magnitude: two vectors pointing the same way score 1 regardless of length. That is exactly what you want for semantic similarity, where the "amount" of signal should not dominate the "kind" of signal.
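
A minimal sketch of the definition above, using NumPy. The helper names (cosine_similarity, l2_normalize) and the toy vectors are illustrative assumptions, not part of any particular embedding library. It shows that vector length does not change the score and that, after L2 normalization, a plain dot product gives the same value.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def l2_normalize(v):
    """Scale a vector to unit length so a dot product equals the cosine."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])    # same direction as a, twice the length
c = np.array([4.0, -3.0])   # perpendicular to a

same_direction = cosine_similarity(a, b)      # direction matches, length ignored
orthogonal = cosine_similarity(a, c)          # unrelated directions
dot_of_unit_vectors = float(l2_normalize(a) @ l2_normalize(b))
```

Here same_direction and dot_of_unit_vectors agree, which is why pipelines that store unit-length embeddings can use a dot-product index and still report cosine similarity.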

What cosine similarity is not:

It is not a probability. 0.5 does not mean "50% likely relevant."

It is not calibrated. There is no universal meaning to 0.4 across models, modalities, or even corpora.

It is not linear in relevance. The gap between 0.9 and 0.8 is not the same amount of "relevance" as the gap between 0.4 and 0.3.

A similarity score is an ordinal signal: within one query, against one index, with one model, higher is better. Everything beyond that ordering has to be earned through calibration.

Why Raw Scores Are Not Comparable

Four independent effects shape the absolute value of a score, and none of them is about whether the result is relevant.

1. Training temperature. Contrastive models (CLIP, SigLIP, CLAP, and their successors) train on logits of the form similarity / temperature. The temperature tau controls how sharply the model separates positives from negatives, and because the loss only ever sees the scaled logits, tau also determines how much raw cosine separation the model has to produce: depending on the temperature, the same quality of ranking can be spread over a wide range of cosines or squeezed into a narrow band. Two models can rank the same results identically while one reports them around 0.9 and the other around 0.4. (For the training side of this, see Contrastive Learning.)

2. The modality gap. In a joint text-image space like CLIP, text embeddings and image embeddings occupy systematically different regions -- there is a measurable offset between the two clouds even for semantically identical pairs. So cross-modal scores (text query to image) sit on a different, usually lower, scale than uni-modal scores (image to image). A 0.3 text-to-image match can be as strong as a 0.7 image-to-image match. Audio-text spaces (CLAP) show the same offset. The consequence: a threshold tuned on one modality pairing is wrong for another.

3. Score-distribution shape. Some models produce tightly clustered scores (everything between 0.25 and 0.45), others spread them out. The spread is a property of the model and corpus, not of relevance. A 0.45 in a tight distribution can be the best possible match; a 0.45 in a wide distribution can be mediocre.

4. Per-query difficulty. For an easy, well-represented query the top results sit high; for a rare or out-of-distribution query even the correct answer scores lower. The absolute top score therefore drifts query to query. This is why "the best result scored only 0.4, so there is nothing relevant" is a bug, not a conclusion.

The practical upshot: never compare raw scores across models, across modalities, or across query types, and never transfer a threshold from one of those to another.
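
To make the point concrete, here is a small illustration with made-up numbers. "Model B" is simulated as a compressed, shifted copy of "model A", which is an assumption chosen only to stand in for the scale effects listed above. The ranking is identical, yet a hard-coded cutoff behaves completely differently.

```python
import numpy as np

# Hypothetical scores from "model A" for five results of one query.
scores_a = np.array([0.91, 0.84, 0.77, 0.52, 0.40])

# "Model B" is simulated as the same scores on a narrower, lower scale.
# The transform is monotonic, so the order of results cannot change.
scores_b = 0.25 + 0.2 * scores_a

same_ranking = np.array_equal(np.argsort(-scores_a), np.argsort(-scores_b))

THRESHOLD = 0.7   # the kind of hard-coded cutoff the text warns against
kept_a = int((scores_a > THRESHOLD).sum())   # results model A would keep
kept_b = int((scores_b > THRESHOLD).sum())   # results model B would keep
```

With these numbers the two models agree on the order, but the shared cutoff keeps three results for A and none for B.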

The Right Mental Model: Look at the Distribution

The single most useful habit is to stop reading the number and start reading the distribution. For a given model and corpus, run a batch of queries with known relevant and irrelevant results, and plot two score histograms: scores of relevant pairs and scores of irrelevant pairs.

You will see two overlapping humps. The separation between them -- not the absolute position -- is what determines how well a threshold can work:

If the humps are well separated, a threshold between them cleanly splits relevant from irrelevant.

If they overlap heavily, no threshold works well, and you need a better model, a reranker, or multi-stage retrieval rather than a cleverer cutoff.

This plot is the diagnostic that tells you whether your problem is "pick a threshold" or "the embeddings cannot separate these classes." Most threshold debugging is really this question in disguise.
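
The diagnostic can also be summarized numerically. The sketch below is one possible way to do it, assuming you already have arrays of scores and 0/1 relevance labels. The function name separation_report, the synthetic Gaussian scores, and the choice of summary statistics (histogram overlap and AUC) are illustrative assumptions rather than a prescribed method.

```python
import numpy as np

def separation_report(scores, labels, bins=20):
    """Summarize how well relevant and irrelevant score humps separate."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    rel, irr = scores[labels], scores[~labels]

    # Shared bin edges so the two histograms are directly comparable.
    edges = np.linspace(scores.min(), scores.max(), bins + 1)
    rel_frac = np.histogram(rel, bins=edges)[0] / len(rel)
    irr_frac = np.histogram(irr, bins=edges)[0] / len(irr)

    # Overlap: 0 means disjoint humps, 1 means identical distributions.
    overlap = float(np.minimum(rel_frac, irr_frac).sum())

    # AUC: chance that a random relevant pair outscores a random irrelevant one.
    wins = (rel[:, None] > irr[None, :]).mean()
    ties = (rel[:, None] == irr[None, :]).mean()
    return {"overlap": overlap, "auc": float(wins + 0.5 * ties)}

# Synthetic scores only, to show the two regimes described above.
rng = np.random.default_rng(0)
labels = np.r_[np.ones(300), np.zeros(300)]
separated = separation_report(
    np.r_[rng.normal(0.45, 0.05, 300), rng.normal(0.30, 0.05, 300)], labels)
overlapping = separation_report(
    np.r_[rng.normal(0.33, 0.05, 300), rng.normal(0.30, 0.05, 300)], labels)
```

A low overlap and an AUC near 1 point to a threshold problem; a high overlap and an AUC near 0.5 point to an embedding problem that no cutoff will fix.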

Calibration Method 1: Pick an Operating Threshold From Labeled Data

The simplest calibration is to choose a threshold empirically for one specific (model, modality, task) combination.

1. Build a small labeled set: queries with known relevant and irrelevant results (a few hundred pairs is enough to start).

2. Score them all with your production model and pairing.

3. Sweep candidate thresholds and compute precision and recall at each.

4. Choose the threshold that hits your target -- precision-first for an agent that must not cite wrong evidence, recall-first for a discovery tool.

import numpy as np

def choose_threshold(scores, labels, target_precision=0.9):
    # scores: similarity scores; labels: 1 if relevant else 0 (same length)
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores)              # best score first
    scores, labels = scores[order], labels[order]
    tp = np.cumsum(labels)                   # relevant items kept at each cutoff
    fp = np.cumsum(1 - labels)               # irrelevant items kept at each cutoff
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / max(labels.sum(), 1)
    ok = precision >= target_precision
    if not ok.any():
        return None  # model can't hit target precision at any cutoff
    # among thresholds meeting precision, take the one with best recall
    best = np.argmax(np.where(ok, recall, -1))
    # returns (threshold, precision, recall); keep results with score >= threshold
    return float(scores[best]), float(precision[best]), float(recall[best])

This threshold is valid only for the exact model, modality pairing, and task you measured. Re-measure when any of those change -- including a model version bump, which is the most common silent breakage.

Calibration Method 2: Normalize Scores Before Fusing Them

When you combine results from multiple retrievers -- dense vectors, sparse/BM25, a late-interaction model -- their score scales are unrelated. A dense cosine of 0.31, a BM25 score of 14.2, and a late-interaction MaxSim of 0.72 are not comparable, so you cannot add or average them directly.

Two robust options:

Per-list normalization. Rescale each result list to a common range before combining. Min-max maps each list to [0, 1]; z-score (subtract mean, divide by std) is more robust to outliers. Do this per query, because the scale shifts query to query.

def minmax(scores):
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + 1e-9)
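
For completeness, here is a sketch of the z-score variant together with a simple weighted fusion of two result lists for a single query. The function names, the dict-based input format, the document ids, and the choice to give a document missing from one list that list's lowest normalized score are all assumptions made for illustration.

```python
import numpy as np

def zscore(scores):
    """Standardize one result list: subtract the mean, divide by the std."""
    scores = np.asarray(scores, dtype=float)
    std = scores.std()
    if std == 0:
        return np.zeros_like(scores)   # all scores equal: no information
    return (scores - scores.mean()) / std

def fuse_scores(dense, lexical, dense_weight=0.5):
    """dense, lexical: dict of doc_id -> raw score for ONE query."""
    all_ids = set(dense) | set(lexical)
    fused = {}
    for weight, result_list in ((dense_weight, dense),
                                (1 - dense_weight, lexical)):
        ids = list(result_list)
        z = zscore([result_list[d] for d in ids])
        normalized = dict(zip(ids, z))
        # Assumption: a doc absent from this list gets the list's lowest score.
        floor = float(z.min()) if len(z) else 0.0
        for doc_id in all_ids:
            fused[doc_id] = fused.get(doc_id, 0.0) + weight * normalized.get(doc_id, floor)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

dense = {"a": 0.31, "b": 0.29, "c": 0.12}      # cosine-scale scores
lexical = {"b": 14.2, "c": 9.1, "d": 3.0}      # BM25-scale scores
ranking = fuse_scores(dense, lexical)
fused_order = [doc_id for doc_id, _ in ranking]
```

Document "b" comes out on top because it is strong in both lists, even though its raw dense score is lower than that of "a".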

Rank-based fusion (skip calibration entirely). Reciprocal Rank Fusion combines lists by rank position, not score, so the incompatible scales never matter:

RRF(d) = sum over lists  1 / (k + rank_of_d_in_list)     # k ~ 60

RRF is the safe default for hybrid search precisely because it sidesteps the calibration problem. Reach for normalized score fusion only when you have a reason to weight by score magnitude, and validate it against RRF.
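
The RRF formula above fits in a few lines. This is a minimal sketch: the function name, the list-of-ids input format, and the example ids are assumptions, and k=60 follows the value given in the formula.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of doc ids (best first) using rank position only."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical rankings from two retrievers for the same query.
dense_ranking = ["clip_7", "clip_2", "clip_9"]
bm25_ranking = ["clip_2", "clip_4", "clip_7"]

fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

Note that the raw scores never enter the computation, so no per-retriever calibration is needed. The trade-off is that RRF cannot tell a narrow win from a decisive one.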

Calibration Method 3: Map Scores to Probabilities

When an agent needs an actual confidence -- "how likely is this result correct, on a 0-1 scale I can reason about and threshold consistently" -- map raw scores to probabilities with a calibration function learned on labeled data.

Platt scaling fits a logistic curve from score to probability:

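P(relevant | s) = 1 / (1 + exp(-(a * s + b)))

You fit the two parameters a and b on (score, label) pairs; with this sign convention a positive a means a higher score maps to a higher probability. It assumes a sigmoid-shaped relationship between score and relevance, which is often a reasonable approximation for similarity scores but should be verified on held-out data.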

Isotonic regression is non-parametric: it fits any monotonic (non-decreasing) mapping from score to probability. It is more flexible than Platt scaling but needs more data to avoid overfitting.

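from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
import numpy as np

# scores: 1-D float array of similarity scores; labels: 1-D array of 0/1 relevance.
# Fit on a labeled calibration set, then apply to held-out or production scores.

# Platt scaling (a large C effectively turns off the default L2 penalty,
# which would otherwise shrink the fitted slope)
platt = LogisticRegression(C=1e6).fit(scores.reshape(-1, 1), labels)
prob = platt.predict_proba(scores.reshape(-1, 1))[:, 1]

# Isotonic (needs more labeled data)
iso = IsotonicRegression(out_of_bounds="clip").fit(scores, labels)
prob_iso = iso.predict(scores)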

Once scores are probabilities, a threshold like "act if P >= 0.8" is stable and interpretable, and it transfers across queries (though still not across models -- refit per model). This is what lets an agent reason about confidence instead of guessing at raw cosines.

Designing the Threshold for an Agent

Calibration gives you a meaningful score; the agent still needs a policy. Three patterns that work in practice:

Abstain band. Define two thresholds. Above the high one, act. Below the low one, answer "not found." Between them, escalate -- rerank with a cross-encoder, ask for more context, or hand off to a human. The middle band is where naive single-threshold systems make their worst mistakes.

Top-k with a floor. Take the top k results but drop any below a calibrated floor, so a query with no good answer returns fewer (or zero) results instead of k bad ones. An agent reading "0 results above the floor" behaves far better than one handed k irrelevant clips.

Relative gap check. If the top result is not meaningfully above the second (a small score gap), treat the retrieval as ambiguous and widen context rather than committing to the top hit.

The common thread: an agent should be able to say "I do not have a good enough match," and that is only possible if the score it checks is calibrated.
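
The three patterns can be combined into one small policy function. The sketch below is an illustration under stated assumptions: the Hit class, the decide function, and every threshold value are hypothetical placeholders, and the scores are assumed to be calibrated probabilities rather than raw cosines.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    doc_id: str
    prob: float   # calibrated probability of relevance, not a raw cosine

def decide(hits, high=0.8, low=0.4, floor=0.3, k=5, min_gap=0.05):
    """Return (action, kept_hits). All thresholds are placeholders to be measured."""
    # Top-k with a floor: never pass along results below the calibrated floor.
    top_k = sorted(hits, key=lambda h: h.prob, reverse=True)[:k]
    kept = [h for h in top_k if h.prob >= floor]

    # Abstain band: below `low` is "not found", between low and high escalates.
    if not kept or kept[0].prob < low:
        return "not_found", kept
    if kept[0].prob < high:
        return "escalate", kept

    # Relative gap check: a confident but near-tied top hit is ambiguous.
    if len(kept) > 1 and kept[0].prob - kept[1].prob < min_gap:
        return "ambiguous", kept
    return "act", kept

clear = decide([Hit("a", 0.92), Hit("b", 0.55), Hit("c", 0.10)])
middle = decide([Hit("a", 0.60), Hit("b", 0.20)])
nothing = decide([Hit("a", 0.20)])
near_tie = decide([Hit("a", 0.90), Hit("b", 0.88)])
```

The order of the checks is a design choice: the floor runs first so that every later branch, including escalation, only ever sees results worth looking at.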

Evaluation: Measure Calibration, Not Just Ranking

Ranking metrics (nDCG, recall@k) tell you if the order is good. They say nothing about whether your thresholds are meaningful. Add calibration-specific checks:

Reliability curve. Bin predicted probabilities and plot predicted vs observed relevance rate. A well-calibrated system sits on the diagonal; systematic deviation means your mapping is off.

Precision/recall at your chosen threshold, measured on held-out data per model and per modality pairing.

Threshold stability across query classes. Re-run the precision/recall sweep separately for object, scene, action, and cross-modal queries. If the best threshold differs sharply by class, use per-class thresholds rather than one global cutoff.

Re-validate on every model version. A retrained or swapped encoder almost always shifts the score distribution; an inherited threshold is the most common cause of a silent quality regression after a model upgrade.
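
The reliability curve from the first check can be computed without plotting. The sketch below bins predicted probabilities and also reports expected calibration error (ECE), a common single-number summary that the text above does not name. The function name, the ten equal-width bins, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def reliability_curve(probs, labels, n_bins=10):
    """Return ([(mean predicted, observed rate, count)], ECE) over equal-width bins."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin indices 0 .. n_bins - 1 (a prob of 1.0 lands in the last bin).
    bin_ids = np.digitize(probs, edges[1:-1])
    rows, ece = [], 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        predicted = probs[mask].mean()    # what the calibrator claimed
        observed = labels[mask].mean()    # how often results were relevant
        rows.append((float(predicted), float(observed), int(mask.sum())))
        ece += mask.mean() * abs(predicted - observed)
    return rows, float(ece)

# Synthetic check: labels are drawn so that true_p is the real relevance rate.
rng = np.random.default_rng(0)
true_p = rng.uniform(0, 1, 5000)
labels = (rng.uniform(0, 1, 5000) < true_p).astype(int)

rows_good, ece_good = reliability_curve(true_p, labels)           # honest probabilities
rows_bad, ece_bad = reliability_curve(np.sqrt(true_p), labels)    # overconfident ones
```

Each row is one point on the curve: rows close to the diagonal (predicted roughly equal to observed) indicate good calibration, and the overconfident mapping shows up as predicted values consistently above observed ones.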

Doing This in Mixpeek

Retrievers in Mixpeek return scored, ranked results, and hybrid retrieval fuses dense, sparse, and BM25 stages -- which is exactly where score incomparability bites. The default fusion is rank-based (RRF), so you get robust hybrid results without hand-calibrating across stages, and you can set a score floor on the retriever so low-confidence matches are dropped rather than returned to the agent.

from mixpeek import Mixpeek

client = Mixpeek(api_key="mxp_sk_...")

results = client.retrievers.execute(
    retriever_id="ret_video_evidence",
    query="forklift entering the loading aisle",
    top_k=20,
    min_score=0.0,          # rely on rank fusion; apply a calibrated floor in your app
)

# Apply YOUR calibrated, per-model policy on top of the ranked results
HIGH, LOW = 0.62, 0.45      # thresholds you measured on labeled data for THIS retriever
top = results[0]
if top.score >= HIGH:
    act_on(top)
elif top.score < LOW:
    answer_not_found()
else:
    rerank_or_escalate(results)   # the abstain band

Keep the calibrated thresholds in your application, versioned alongside the retriever and embedding model they were measured against -- not hard-coded as universal constants. When you change the model, re-measure. If you bring your own embeddings to MVS, the same rule holds: the scores are only meaningful relative to the encoder that produced them.
