Why a Generic Model Is Not Enough
An off-the-shelf embedding model -- CLIP for images, CLAP for audio, a BGE or E5 text encoder -- was trained to be broadly useful across web-scale data. That generality is exactly its weakness inside a specific domain. The model learned the axes of variation that mattered for its training distribution, not the ones that matter for yours.
Consider what an agent has to do. It searches a catalog where two SKUs differ only by a stitching pattern, an inspection feed where a "hairline crack" and a "surface scratch" are different defect classes, or a support archive full of internal product names the model has never seen. In every case the distinctions your business cares about are *finer* than the distinctions the generic model was trained to make. The model maps your two near-identical items to nearly the same vector, the agent retrieves the wrong one, and downstream reasoning is built on a false premise.
Fine-tuning and distillation are the two levers that reshape an embedding space so the distinctions you care about become the dominant axes. This guide is about how they actually work, when each one is the right tool, and how to ship a new model without breaking what already works.
What Fine-Tuning Actually Changes
An embedding model is a function that maps an input to a point in a high-dimensional space. "Fine-tuning" means continuing to train that function on your data so that the geometry of the space changes: items you consider similar are pulled together, items you consider different are pushed apart.
The mechanism is almost always contrastive. You do not teach the model absolute coordinates; you teach it *relative* judgments through triples:
(anchor, positive, negative)
The anchor is a query or item, the positive is something that should rank high for it, and the negative is something that should rank low. Training adjusts the weights so the anchor ends up closer to the positive than to the negative by some margin. Repeat over thousands of triples and the entire space reorganizes around your notion of relevance.
A common loss is the triplet margin loss:
L = max(0, d(anchor, positive) - d(anchor, negative) + margin)
where \(d\) is a distance (often cosine distance, \(1 - \cos\)). The loss is zero only when the positive is closer than the negative by at least \(\text{margin}\); otherwise the gradient pulls the positive in and pushes the negative out. The more widely used modern variant is multiple-negatives ranking loss (the in-batch InfoNCE objective), where every other item in the batch acts as a negative for each anchor, giving you many negatives per step for free.
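As a minimal sketch of the triplet margin loss above, assuming cosine distance and toy 2-D vectors (the vector values and the 0.2 margin are illustrative choices, not recommendations):

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)

# Toy vectors, chosen only to illustrate the two regimes.
anchor = [1.0, 0.0]
positive = [1.0, 0.0]             # same direction as the anchor
far_negative = [0.0, 1.0]         # orthogonal to the anchor: already well separated
confusable_negative = [1.0, 0.1]  # nearly the same direction as the anchor

easy_loss = triplet_margin_loss(anchor, positive, far_negative)
hard_loss = triplet_margin_loss(anchor, positive, confusable_negative)
print(easy_loss, hard_loss)
```

The far negative already satisfies the margin, so it contributes zero loss and no gradient. The confusable negative sits well inside the margin, so it is the one that drives learning.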
The single most important fact about fine-tuning: it is only as good as your triples. The model learns the relevance definition encoded in your positives and negatives. If your positives are noisy, you teach noise. If your negatives are trivially easy, you teach almost nothing.
Where Training Pairs Come From
You rarely have a labeled triple dataset sitting around. The practical work of fine-tuning is mostly the work of mining pairs from signals you already have.
A few thousand to tens of thousands of clean domain triples is often enough to move recall substantially. You do not need millions; you need the *right* ones.
Hard-Negative Mining Is the Real Lever
If you train on random negatives -- pair each anchor with arbitrary other items -- the model plateaus fast. Random negatives are usually so unrelated to the anchor that the model separates them after a handful of steps, and there is no further signal to learn from. The whole game is in the hard negatives: items that are genuinely similar to the anchor but are not correct matches. Those are where the generic model is wrong, and those are what carry the gradient that fixes it.
The standard recipe is iterative:
1. Embed the corpus with the current model.
2. For each anchor, retrieve its top-k nearest neighbors.
3. Keep the neighbors that are NOT labeled positives -- these are hard negatives.
4. Add (anchor, positive, hard_negative) triples to the training set.
5. Fine-tune. Then repeat from step 1 with the improved model.
Each round surfaces the confusions that survived the last round, so the model keeps sharpening exactly the boundaries that still blur. This is the mechanism that teaches it your two near-identical SKUs are different: you feed it the pair it currently gets wrong, over and over, until it does not.
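Steps 1 to 4 of the recipe can be sketched in a few lines. This is a toy illustration: the function name, the dictionary-based inputs, and the SKU identifiers are all hypothetical, and a real pipeline would use an approximate nearest-neighbor index rather than a full sort.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mine_hard_negatives(anchor_vecs, corpus_vecs, positives, top_k=2):
    """Build (anchor, positive, hard_negative) triples.

    anchor_vecs: {anchor_id: vector} embedded with the current model
    corpus_vecs: {item_id: vector} embedded with the same model
    positives:   {anchor_id: set of item_ids labeled as correct matches}
    """
    triples = []
    for anchor_id, anchor_vec in anchor_vecs.items():
        # Rank the whole corpus by similarity to this anchor.
        ranked = sorted(
            corpus_vecs,
            key=lambda item_id: cosine(anchor_vec, corpus_vecs[item_id]),
            reverse=True,
        )
        # Top-k neighbors that are NOT labeled positives are the hard negatives.
        hard = [i for i in ranked[:top_k] if i not in positives[anchor_id]]
        for pos_id in sorted(positives[anchor_id]):
            for neg_id in hard:
                triples.append((anchor_id, pos_id, neg_id))
    return triples

corpus = {
    "sku_a": [1.0, 0.0],
    "sku_a_variant": [0.98, 0.2],  # near-duplicate the current model confuses with sku_a
    "sku_b": [0.0, 1.0],
    "sku_c": [-1.0, 0.0],
}
anchors = {"q1": [1.0, 0.05]}
labels = {"q1": {"sku_a"}}

triples = mine_hard_negatives(anchors, corpus, labels, top_k=2)
print(triples)
```

Step 5 then fine-tunes on these triples and re-runs the mining with the updated embeddings.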
The False-Negative Trap
Hard-negative mining has a dangerous failure mode. When you pull an anchor's nearest neighbors and label them all "negative," some of them may actually be *correct but unlabeled* positives. Training to push a true match away is directly harmful -- you are teaching the model the opposite of what you want. This is the false-negative problem, and it is the most common reason a fine-tune makes retrieval worse instead of better.
Mitigations: only mine negatives beyond the very top ranks (skip the top few, which are most likely to be unlabeled positives); use a score-margin threshold so an item too similar to the anchor is excluded from the negative pool; and where you can afford it, have a stronger model or a human verify mined negatives before they enter training. Treat any unexplained drop in eval recall after adding mined negatives as a false-negative signal first.
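The first two mitigations can be sketched as a filter over a ranked neighbor list. The function name, the `skip_top` and `min_gap` defaults, and the scores below are hypothetical; reasonable thresholds depend on your data.

```python
def select_negatives(ranked, positive_score, labeled_positives,
                     skip_top=2, min_gap=0.05, max_negatives=5):
    """Pick hard negatives while avoiding likely unlabeled positives.

    ranked: list of (item_id, similarity) sorted by descending similarity to the anchor
    positive_score: similarity of the anchor to its known positive
    """
    kept = []
    for rank, (item_id, score) in enumerate(ranked):
        if item_id in labeled_positives:
            continue  # never use a labeled positive as a negative
        if rank < skip_top:
            continue  # very top ranks are the most likely unlabeled positives
        if score > positive_score - min_gap:
            continue  # too similar to the anchor relative to the known positive
        kept.append(item_id)
        if len(kept) == max_negatives:
            break
    return kept

ranked = [("p1", 0.95), ("u1", 0.94), ("u2", 0.93), ("n1", 0.80), ("n2", 0.70)]
negatives = select_negatives(ranked, positive_score=0.95, labeled_positives={"p1"})
print(negatives)
```

Here `u1` is dropped by the rank rule and `u2` by the score-margin rule, leaving only candidates that are clearly less similar than the known positive.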
Distillation: Compressing a Smart Model into a Fast One
Fine-tuning reshapes a space. Distillation transfers the *knowledge* of one model (the teacher) into another (the student), usually because the teacher is too expensive to run at retrieval time.
The canonical example for retrieval is cross-encoder to bi-encoder distillation. A cross-encoder reads the query and a candidate *together* and scores their relevance with full attention between them; it is highly accurate but cannot be precomputed, because it needs both inputs at once. A bi-encoder embeds query and item *separately*, so item vectors can be indexed ahead of time and searched in milliseconds -- but it is less accurate because it never lets the two inputs interact.
Distillation gives you most of the cross-encoder's accuracy at the bi-encoder's speed:
1. Run the expensive cross-encoder on many (query, candidate) pairs to get soft relevance scores.
2. Train the bi-encoder so its similarity scores match the cross-encoder's scores. The loss is typically MSE or KL divergence between teacher and student scores, not a hard 0/1 label.
The key word is soft. The teacher does not just say "relevant / not relevant"; it says "0.87 relevant," "0.42 relevant." Those graded scores carry the teacher's nuanced sense of *how* relevant each candidate is, and that graded signal is far richer than a binary label. The student learns the teacher's ranking, not just its decisions. This is the same insight as classic knowledge distillation: the soft targets teach the student the structure the teacher discovered.
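A minimal sketch of the two distillation losses mentioned in step 2, computed on plain Python lists of scores for one query's candidates. The function names and the example scores are illustrative; in practice the student scores come from the bi-encoder's similarity and the loss is backpropagated through it.

```python
import math

def softmax(scores, temperature=1.0):
    """Turn raw scores into a probability distribution over candidates."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mse_distill_loss(teacher_scores, student_scores):
    """Mean squared error between teacher and student scores."""
    squared = [(t - s) ** 2 for t, s in zip(teacher_scores, student_scores)]
    return sum(squared) / len(squared)

def kl_distill_loss(teacher_scores, student_scores, temperature=1.0):
    """KL(teacher || student) over the candidate distributions for one query."""
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [0.87, 0.42, 0.10]         # soft relevance scores from the cross-encoder
student_matched = [0.87, 0.42, 0.10]
student_reversed = [0.10, 0.42, 0.87]  # student ranks the candidates backwards

print(mse_distill_loss(teacher, student_matched), kl_distill_loss(teacher, student_matched))
print(mse_distill_loss(teacher, student_reversed), kl_distill_loss(teacher, student_reversed))
```

Both losses are zero when the student reproduces the teacher's scores and grow as the student's ranking diverges from the teacher's.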
Distillation also goes model-to-model for size: a large, slow embedding model distilled into a small one that you can run on cheap hardware or at the edge, trading a small accuracy loss for a large latency and cost win.
Full Fine-Tuning vs. Parameter-Efficient Methods
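You do not always update every weight. There is a spectrum, from training only a small projection head on top of a frozen model, through LoRA adapters, to full fine-tuning of every weight.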
A useful default: start with a projection head to confirm your data carries signal, move to LoRA when you need more capacity, and reserve full fine-tuning for cases where the domain is genuinely far from the base distribution (medical imaging, satellite, niche industrial audio).
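To give a rough sense of scale, the sketch below counts trainable parameters for a single weight matrix under each option. The 768-dimensional width, the rank of 8, and the 256-dimensional head are illustrative assumptions, and a real model has many such matrices.

```python
def full_finetune_params(d_in, d_out):
    """Every entry of the d_out x d_in weight matrix is trainable."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """LoRA trains two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

def projection_head_params(d_in, d_out):
    """A single linear layer on top of a frozen encoder: weights plus bias."""
    return d_in * d_out + d_out

width = 768  # assumed hidden size, for illustration only
full = full_finetune_params(width, width)
lora = lora_params(width, width, rank=8)
head = projection_head_params(width, 256)

print(f"full matrix: {full}, LoRA rank 8: {lora} ({lora / full:.1%}), head 768->256: {head}")
```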
The Batch-Size Constraint
Contrastive fine-tuning has a non-obvious requirement: it wants large batches, because in in-batch contrastive losses the other items in the batch *are* your negatives. A batch of 16 gives each anchor 15 negatives; a batch of 1024 gives it 1023. Small-batch contrastive fine-tuning produces noticeably less discriminative embeddings. If hardware limits your batch size, use a memory bank to reuse embeddings from earlier steps as extra negatives, or gradient caching to compute a large-batch loss in memory-sized chunks -- do not just shrug and train at batch size 16.
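A minimal sketch of the in-batch objective makes the dependence on batch size concrete. It assumes L2-normalized vectors and a temperature of 0.05, both illustrative choices; the function name is hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_batch_info_nce(anchors, positives, temperature=0.05):
    """Mean cross-entropy where positives[i] is the match for anchors[i]
    and every other positives[j] in the batch serves as a negative.
    Assumes all vectors are already L2-normalized."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in positives]
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

# A perfectly separated batch: each anchor matches only its own positive.
separated = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
separated_loss = in_batch_info_nce(separated, separated)

# A collapsed batch: every item embeds to the same point, so the model is guessing.
collapsed = [[1.0, 0.0]] * 4
collapsed_loss = in_batch_info_nce(collapsed, collapsed)

print(separated_loss, collapsed_loss, math.log(4))
```

Each anchor is scored against `len(positives) - 1` negatives, and a model that cannot tell items apart sits at a loss of `log(batch_size)`, so a larger batch poses a harder discrimination task per step.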
The Deployment Trap: A New Model Is a New Space
Here is the part teams discover the expensive way. When you fine-tune or distill, you produce a new embedding space. Vectors from the new model are not comparable to vectors from the old model, even if the dimensionality is identical and the model name barely changed. Cosine similarity between an old-model item vector and a new-model query vector is meaningless noise.
This means you cannot hot-swap the model under a live index. If your index holds vectors from model v1 and you start embedding queries with model v2, every search silently returns garbage -- no error, just wrong results. The only correct path is to re-embed the entire corpus with the new model into a new index, validate it, then cut over.
The safe rollout pattern:
1. Freeze the model version as a versioned artifact (weights + config + preprocessing).
2. Re-embed the full corpus into a NEW index with the new model.
3. Run offline eval (recall@k, NDCG) on the new index vs. the current one.
4. Shadow / A-B the new index against live traffic.
5. Cut over only if eval and online metrics both improve. Keep the old index until you are confident, so rollback is instant.
Skipping the re-embed step, or comparing across versions, is the most common way a "model upgrade" turns into a silent recall outage.
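One way to turn the silent failure into a loud one is to tag every vector with the model version that produced it and refuse mismatches. The class below is a toy in-memory sketch; `VersionedIndex` and the version strings are hypothetical names, not part of any particular vector database.

```python
class VersionedIndex:
    """Toy in-memory index that rejects vectors from a different model version."""

    def __init__(self, model_version):
        self.model_version = model_version
        self._vectors = {}

    def _check(self, model_version):
        if model_version != self.model_version:
            raise ValueError(
                f"index holds {self.model_version!r} vectors, got {model_version!r}"
            )

    def add(self, item_id, vector, model_version):
        self._check(model_version)
        self._vectors[item_id] = vector

    def search(self, query_vector, model_version, top_k=3):
        self._check(model_version)
        # Rank stored items by dot product with the query vector.
        scored = sorted(
            self._vectors.items(),
            key=lambda pair: sum(q * x for q, x in zip(query_vector, pair[1])),
            reverse=True,
        )
        return [item_id for item_id, _ in scored[:top_k]]

index_v1 = VersionedIndex("encoder-v1")
index_v1.add("sku_a", [1.0, 0.0], model_version="encoder-v1")
index_v1.add("sku_b", [0.0, 1.0], model_version="encoder-v1")

same_version_hits = index_v1.search([0.9, 0.1], model_version="encoder-v1", top_k=1)

try:
    index_v1.search([0.9, 0.1], model_version="encoder-v2")
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(same_version_hits, mismatch_rejected)
```

With a guard like this, querying a v1 index with a v2 embedding raises an error instead of returning plausible-looking but meaningless results.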
How to Know It Worked
Never judge a fine-tune by training loss; judge it by retrieval metrics on a held-out set the model never trained on. Build an evaluation set of (query, relevant-item) pairs that reflect real agent queries, and measure recall@k, MRR, and NDCG before and after. Pay special attention to the confusable cases that motivated the work -- a fine-tune can lift average recall while *regressing* on the exact hard pairs you cared about, which is a sign of false negatives in training.
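The three metrics are short enough to sketch directly. This version assumes binary relevance (an item is either relevant or not) and a single query; MRR is the mean of the reciprocal rank over all queries in the eval set. The document IDs are made up for illustration.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top k results."""
    hits = sum(1 for item_id in ranked_ids[:k] if item_id in relevant_ids)
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, item_id in enumerate(ranked_ids, start=1):
        if item_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item_id in enumerate(ranked_ids[:k], start=1)
        if item_id in relevant_ids
    )
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

ranked = ["d3", "d1", "d7", "d2"]  # system output for one query
relevant = {"d1", "d2"}            # held-out ground truth for that query

print(recall_at_k(ranked, relevant, 3), reciprocal_rank(ranked, relevant), ndcg_at_k(ranked, relevant, 4))
```

Running the same functions on the base model and the fine-tuned model over the same held-out queries, and again on just the confusable subset, gives the before-and-after comparison described above.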
Watch for forgetting too: keep a slice of general queries in the eval set so you can see if domain gains came at the cost of broad capability. The deployable model is the one that wins on your domain set *without* collapsing on the general set, verified on data it has never seen.
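The deploy decision can be written down as an explicit gate. The sketch below is one possible rule, with a hypothetical tolerance of 0.01 on the general set and made-up recall numbers; the right tolerance is a product decision, not a constant.

```python
def should_cut_over(old, new, max_general_drop=0.01):
    """Deploy only if domain recall improves and general recall does not collapse.

    old / new: dicts with 'domain_recall' and 'general_recall' measured on
    held-out sets that neither model trained on.
    """
    domain_improved = new["domain_recall"] > old["domain_recall"]
    general_held = new["general_recall"] >= old["general_recall"] - max_general_drop
    return domain_improved and general_held

baseline = {"domain_recall": 0.62, "general_recall": 0.80}
candidate_ok = {"domain_recall": 0.74, "general_recall": 0.795}
candidate_forgetful = {"domain_recall": 0.78, "general_recall": 0.65}

print(should_cut_over(baseline, candidate_ok))
print(should_cut_over(baseline, candidate_forgetful))
```

The second candidate has the higher domain recall but fails the gate because its general-set recall dropped sharply, which is the forgetting pattern to watch for.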
How This Looks in Mixpeek
Mixpeek treats the embedding model as a versioned part of a collection's configuration rather than a global default, which makes the deployment trap above a managed operation instead of a manual one. You attach a feature extractor (the embedding model and its version) to a collection; when you change it, the platform re-processes the affected objects into the new space rather than mixing versions inside one index, so an agent never searches across incompatible vectors.
from mixpeek import Mixpeek

client = Mixpeek(api_key="YOUR_KEY")

# Pin the embedding model + version on the collection's feature extractor.
# Bumping the model version triggers re-processing into a new, self-consistent
# space -- old and new vectors are never compared.
collection = client.collections.create(
    namespace="product-catalog",
    collection_name="catalog-v2",
    feature_extractors=[
        {
            "feature_name": "image_embedding",
            "extractor": "image_embedder",
            "version": "domain-tuned-2026-06",  # the fine-tuned model artifact
        }
    ],
)

# Evaluate the new space on held-out (query, relevant-item) pairs before cutover.
eval_run = client.evaluations.run(
    collection_id=collection["collection_id"],
    metric="ndcg@10",
    ground_truth="catalog-relevance-set",
)
print(f"NDCG@10: {eval_run['score']:.3f}")
The point for an agent builder: the gain from fine-tuning only reaches the agent if the new space is rolled out *consistently* -- whole-corpus re-embed, eval gate, clean cutover. Wiring the model version into the collection makes that the default path instead of a footgun.
Key Takeaways
1. A generic embedding model blurs the distinctions your domain cares about; fine-tuning reshapes the space so those distinctions become the dominant axes.
2. Fine-tuning is contrastive -- it learns from (anchor, positive, negative) triples -- so the quality of your mined pairs, not the size of the model, determines the result.
3. Hard-negative mining is the real lever: feed the model the near-misses it currently gets wrong, iteratively. Guard against the false-negative trap where a mined "negative" is actually an unlabeled positive.
4. Distillation transfers a slow, accurate teacher (a cross-encoder) into a fast, indexable student (a bi-encoder) using the teacher's soft scores, buying most of the accuracy at a fraction of the latency.
5. Prefer parameter-efficient methods (projection head, LoRA) before full fine-tuning, and use large batches so in-batch negatives stay informative.
6. A new model is a new space. Re-embed the whole corpus, eval-gate, and cut over cleanly -- never compare vectors across versions or hot-swap under a live index.