The Portability Illusion
Teams building on vector search tend to treat embeddings like database records: store them, query them, move them between systems. This mental model breaks the moment you try to switch embedding models, mix vectors produced by different models in the same index, or hand vectors to another system for interpretation.
The root cause is simple: a vector only has meaning inside the specific model and embedding space that created it. Unlike a UUID, a hash, or even a pixel coordinate, an embedding is a learned representation. Two models that both produce 1024-dimensional vectors place the concept "golden retriever" at completely different coordinates. Concatenating, averaging, or comparing vectors across models produces nonsense.
This is not a theoretical concern. It shows up the moment your embedding ecosystem evolves, which it always does.
Why Embeddings Are Not Portable
Model-dependent coordinates
Every embedding model learns its own coordinate system during training. CLIP, SigLIP, BGE, Cohere Embed, and OpenAI's text-embedding-3 all produce floating-point arrays, but the numbers encode entirely different learned features. Dimension 47 in one model might correlate with "texture" while dimension 47 in another correlates with "sentiment."
This means that vectors from different models cannot be compared, averaged, or searched against the same index in any meaningful way, even when their dimensions match.
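As a minimal sketch of the problem, the snippet below embeds the same sentence with two different models and compares the results. It assumes the sentence-transformers package; the two checkpoints are illustrative examples of same-dimensionality models, not a recommendation. The cross-model similarity is a number you can compute, but not one that means anything.

```python
# Sketch: same text, two models, same dimensionality -- still incompatible spaces.
# Assumes sentence-transformers is installed; the checkpoints are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model_a = SentenceTransformer("all-MiniLM-L6-v2")         # 384-dim space A
model_b = SentenceTransformer("paraphrase-MiniLM-L6-v2")  # 384-dim space B

text = "a golden retriever playing in the park"
vec_a = model_a.encode(text, normalize_embeddings=True)
vec_b = model_b.encode(text, normalize_embeddings=True)

# Within one model, a text compared with itself scores 1.0 by construction.
same_space = float(np.dot(vec_a, model_a.encode(text, normalize_embeddings=True)))

# Across models the shapes match, so nothing stops you from computing this --
# but the value is meaningless: the dimensions encode different learned features.
cross_space = float(np.dot(vec_a, vec_b))

print(f"same-space similarity:  {same_space:.3f}")   # ~1.0
print(f"cross-space similarity: {cross_space:.3f}")  # arbitrary, not comparable
```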
The metadata problem
A raw vector is an opaque array of floats. Without metadata, you cannot determine which model produced it, which version of that model, whether the values are normalized, or which distance metric it was intended for.
This is the interoperability problem: for one party to use another party's vector, both sides need a shared envelope that specifies the model, version, embedding space, and metric. Without it, you are comparing coordinates from different maps.
Version drift within model families
Even within a single model family, versions are not guaranteed to be compatible. A provider can update weights, change tokenization, swap the training data distribution, or apply distillation. The model name stays the same, but the embedding space shifts.
OpenAI's transition from text-embedding-ada-002 to text-embedding-3-small is an obvious example (an entirely new model and an incompatible embedding space), but even patch-level updates can shift retrieval quality by several percentage points. If your pipeline does not track model versions, you have no way to detect or respond to this drift.
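One lightweight way to detect this kind of drift is to keep a small canary set of texts, store their embeddings alongside the model identifier, and periodically re-embed the same texts, alerting when similarity to the stored vectors drops. A minimal sketch, assuming a hypothetical `embed()` wrapper around whatever embedding API you use:

```python
# Sketch: canary-based drift detection for a hosted embedding API.
# `embed(texts) -> list[list[float]]` is a hypothetical wrapper around your provider.
import numpy as np

CANARY_TEXTS = [
    "a golden retriever playing in the park",
    "quarterly revenue grew 12% year over year",
    "how do I reset my password?",
]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def snapshot(embed) -> np.ndarray:
    """Embed the canary set and return the vectors as a matrix."""
    return np.asarray(embed(CANARY_TEXTS), dtype=np.float32)

def drift_check(baseline: np.ndarray, embed, threshold: float = 0.99) -> bool:
    """Re-embed the canaries and flag drift if mean self-similarity drops."""
    current = snapshot(embed)
    sims = [cosine(baseline[i], current[i]) for i in range(len(CANARY_TEXTS))]
    mean_sim = float(np.mean(sims))
    if mean_sim < threshold:
        print(f"Embedding drift detected: mean canary similarity {mean_sim:.4f}")
        return True
    return False
```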
The Upgrade Problem
The portability problem leads directly to the upgrade problem. When a better model arrives (and it always does), you face a choice:
Option A: Keep old vectors. Your new data gets encoded with the new model while your old data stays on the old model. You now have a fragmented index where queries only work well against data encoded by the same model as the query. Retrieval quality degrades in proportion to the share of your corpus still encoded with the old model.
Option B: Re-encode everything. You reprocess every document through the new model and rebuild your index. This is correct but expensive:
| Corpus size | Cost at $0.0001/embedding | Wall time (1,000 embeddings/sec) |
|---|---|---|
| 1M documents | $100 | 17 minutes |
| 10M documents | $1,000 | 2.8 hours |
| 100M documents | $10,000 | 28 hours |
| 1B documents | $100,000 | 11.5 days |
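The rows follow directly from the per-embedding price and throughput. A small helper (with illustrative defaults matching the table) makes it easy to plug in your own numbers:

```python
# Sketch: back-of-the-envelope re-encoding cost and wall time.
# Defaults mirror the table above; substitute your own price and throughput.
def reencode_estimate(num_docs: int,
                      cost_per_embedding: float = 0.0001,
                      embeddings_per_sec: float = 1_000) -> tuple[float, float]:
    """Return (total cost in dollars, wall time in hours) for a full re-encode."""
    cost = num_docs * cost_per_embedding
    hours = num_docs / embeddings_per_sec / 3600
    return cost, hours

for n in (1_000_000, 10_000_000, 100_000_000, 1_000_000_000):
    cost, hours = reencode_estimate(n)
    print(f"{n:>13,} docs -> ${cost:>9,.0f}, {hours:7.1f} h")
```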
Option C: Do nothing. Stay on the old model. This works until the model is deprecated, until your competitors get measurably better results from newer models, or until you need to combine your vectors with a partner's system that uses a different model.
None of these options are free. The question is which cost structure fits your constraints.
Interoperability: Making Vectors Readable Across Systems
Interoperability means that a vector produced by one system can be correctly interpreted by another. This requires agreement on four things:
1. Model identification
Every vector needs a tag specifying the exact model that produced it. Not just "CLIP" but the specific checkpoint: `openai/clip-vit-large-patch14` or `laion/CLIP-ViT-H-14-laion2B-s32B-b79K`. Different checkpoints, even within the same architecture, produce incompatible spaces.
2. Version pinning
Models get updated. Track the exact version (commit hash, API version date, or release tag) so that you can detect when the underlying space changes. This is especially important for hosted embedding APIs where the provider controls updates.
3. Space metadata
Include dimensionality, normalization status, and the intended distance metric. A system that expects cosine similarity on L2-normalized vectors will produce wrong results if given unnormalized vectors intended for dot-product search.
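A quick guard against exactly this mismatch is to check, and if necessary apply, L2 normalization before indexing or querying. A minimal numpy sketch:

```python
# Sketch: verify vectors are L2-normalized before cosine / dot-product search.
import numpy as np

def is_l2_normalized(vecs: np.ndarray, tol: float = 1e-3) -> bool:
    """True if every row has unit L2 norm (within tolerance)."""
    norms = np.linalg.norm(vecs, axis=-1)
    return bool(np.allclose(norms, 1.0, atol=tol))

def l2_normalize(vecs: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot product equals cosine similarity."""
    norms = np.linalg.norm(vecs, axis=-1, keepdims=True)
    return vecs / np.clip(norms, 1e-12, None)
```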
4. Provenance
Record the training data domain and any fine-tuning applied. A CLIP model fine-tuned on medical images occupies a different space than the same architecture trained on web data. The model name alone does not capture this.
Proposed envelope format
A practical vector envelope might look like this:
```json
{
  "model": "openai/clip-vit-large-patch14",
  "version": "2025-03-15",
  "dimensions": 768,
  "normalized": true,
  "metric": "cosine",
  "quantization": "none",
  "domain": "general-web",
  "vector": [0.023, -0.041, ...]
}
```
This is more overhead than storing a bare float array, but it is the minimum information needed for a receiving system to know whether it can use the vector.
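Before accepting a vector from another system, a receiving service can check the envelope fields against its own index configuration. A sketch of that gate, using the field names from the example envelope above (the dataclass itself is illustrative):

```python
# Sketch: decide whether a received vector envelope is usable against a local index.
# Field names follow the example envelope above; IndexConfig is illustrative.
from dataclasses import dataclass

@dataclass
class IndexConfig:
    model: str          # exact checkpoint, e.g. "openai/clip-vit-large-patch14"
    version: str        # pinned model version or API release date
    dimensions: int
    normalized: bool
    metric: str         # "cosine", "dot", "euclidean", ...

def is_compatible(envelope: dict, index: IndexConfig) -> tuple[bool, str]:
    """Return (ok, reason). A mismatch on any field means the vector is unusable."""
    if envelope.get("model") != index.model:
        return False, f"model mismatch: {envelope.get('model')} vs {index.model}"
    if envelope.get("version") != index.version:
        return False, f"version mismatch: {envelope.get('version')} vs {index.version}"
    if envelope.get("dimensions") != index.dimensions:
        return False, "dimensionality mismatch"
    if envelope.get("normalized") != index.normalized:
        return False, "normalization mismatch"
    if envelope.get("metric") != index.metric:
        return False, "distance metric mismatch"
    return True, "ok"
```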
Migration Strategies: Moving from Model v1 to v2
Shadow indexing (dual-write)
Maintain two indexes in parallel. New data is encoded by both models and written to both indexes. A background job re-encodes historical data from the old model into the new one, working through the backlog in priority order (most-queried documents first).
How it works:
1. Create a new namespace/collection for the v2 model
2. Update your ingestion pipeline to encode and write to both v1 and v2
3. Start a backfill job that reads source data, encodes with v2, and writes to the v2 index
4. Route queries to the v1 index during migration
5. When the v2 index reaches full coverage, validate quality and cut over
6. Decommission the v1 index
Trade-offs: Doubles write cost and storage during migration. Gives you a clean rollback path. Works well when the backfill job can run without competing for the same compute as real-time queries.
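The ingestion-side change is small: every new document is encoded twice and written to both indexes. A sketch, assuming hypothetical `encode_v1`/`encode_v2` functions and index clients with `upsert`/`upsert_batch` methods standing in for your actual vector store:

```python
# Sketch: dual-write ingestion and prioritized backfill for shadow indexing.
# encode_v1, encode_v2, and the index clients are hypothetical stand-ins.
def ingest(doc_id: str, text: str, encode_v1, encode_v2, index_v1, index_v2) -> None:
    """Encode a new document with both models and write to both indexes."""
    index_v1.upsert(id=doc_id, vector=encode_v1(text), metadata={"model": "v1"})
    index_v2.upsert(id=doc_id, vector=encode_v2(text), metadata={"model": "v2"})

def backfill(source_docs, encode_v2, index_v2, batch_size: int = 256) -> None:
    """Re-encode historical documents into the v2 index, most-queried first."""
    batch = []
    for doc in source_docs:  # assumed pre-sorted by query frequency
        batch.append((doc["id"], encode_v2(doc["text"])))
        if len(batch) >= batch_size:
            index_v2.upsert_batch(batch)
            batch.clear()
    if batch:
        index_v2.upsert_batch(batch)
```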
Blue-green migration
Build the v2 index completely offline, validate it, and perform an atomic cutover.
How it works:
1. Export all source data (not vectors, the original files/text)
2. Encode everything with the v2 model in a batch job
3. Load into a new index
4. Run your quality benchmark: compare recall@k, MRR, and nDCG against a golden test set
5. If quality meets your threshold, swap the query endpoint to the new index
6. Keep the old index available for rollback for a defined period
Trade-offs: No mixed-version queries during migration. Requires enough compute and storage to hold two complete indexes simultaneously. Best for corpora that can be re-encoded in a reasonable time window.
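The critical piece is the quality gate in step 4: measure the new index against a golden test set before flipping the query endpoint. A minimal recall@k sketch, assuming a hypothetical `search(index, query_text, k)` helper and a golden set of (query, relevant document ids) pairs:

```python
# Sketch: blue-green quality gate using recall@k on a golden test set.
# `search(index, query_text, k)` is a hypothetical helper returning ranked doc ids.
def recall_at_k(index, golden_set: list[tuple[str, set[str]]], search, k: int = 10) -> float:
    """Fraction of relevant documents retrieved in the top k, averaged over queries."""
    scores = []
    for query_text, relevant_ids in golden_set:
        retrieved = set(search(index, query_text, k))
        scores.append(len(retrieved & relevant_ids) / len(relevant_ids))
    return sum(scores) / len(scores)

def maybe_cutover(index_v1, index_v2, golden_set, search, min_ratio: float = 1.0) -> bool:
    """Swap to v2 only if its recall@k is at least `min_ratio` times the v1 baseline."""
    baseline = recall_at_k(index_v1, golden_set, search)
    candidate = recall_at_k(index_v2, golden_set, search)
    print(f"recall@10  v1={baseline:.3f}  v2={candidate:.3f}")
    return candidate >= min_ratio * baseline
```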
Progressive rollout
Re-encode data in priority order and gradually shift traffic to the new index.
How it works:
1. Rank documents by query frequency (or business importance)
2. Re-encode the top tier and add to the v2 index
3. Route a percentage of queries to v2 (starting at 5-10%)
4. Monitor quality metrics in production
5. Increase the v2 traffic share as more data is backfilled
6. Continue until 100% of data is in v2 and 100% of traffic is routed there
Trade-offs: Minimizes risk by catching quality regressions early. More operationally complex because you are running a partially migrated system. Requires query routing logic that can split traffic by index.
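The routing logic can be as simple as a deterministic hash of the query (or user) ID against the current rollout percentage, so the same caller consistently hits the same index while the v2 share is ramped up. A minimal sketch:

```python
# Sketch: deterministic percentage-based query routing for a progressive rollout.
import hashlib

def route_to_v2(query_id: str, v2_traffic_pct: float) -> bool:
    """Stable bucket per query/user id: True -> send to the v2 index."""
    digest = hashlib.sha256(query_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100  # 0..99
    return bucket < v2_traffic_pct

# Example: ramp from 5% toward 100% as the backfill completes.
index = "v2" if route_to_v2("user-1234", v2_traffic_pct=10) else "v1"
```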
Comparison
| Strategy | Downtime | Compute cost | Complexity | Rollback |
|---|---|---|---|---|
| Shadow indexing | None | High (2x writes) | Medium | Easy |
| Blue-green | Brief cutover | High (full rebuild) | Low | Easy |
| Progressive | None | Medium (prioritized) | High | Medium |
How Mixpeek Handles This
Mixpeek's architecture separates raw data storage from vector indexing. Every file ingested through a bucket is stored in its original form alongside the extracted features, so the source material is always available for re-encoding: vectors can be regenerated against a new model at any time without re-collecting data.
This design means that an embedding upgrade follows a predictable workflow:
1. Create a new namespace for v2
2. Create a new collection pointing at the same bucket, configured with the v2 model
3. Trigger reprocessing (the batch pipeline handles backfill automatically)
4. Validate quality on your test queries
5. Update your retriever to point at the v2 namespace
6. Archive the v1 namespace when you are confident in v2
No custom migration scripts. No mixed-version indexes. No orphaned vectors.
Organizational Implications
Embedding portability is not just an infrastructure problem. It affects how teams plan and budget:
Model evaluation becomes a migration planning exercise. You cannot evaluate a new embedding model without also estimating the cost and timeline to migrate your existing data. A 5% improvement in recall means nothing if the migration takes three months and costs $50,000.
Vendor lock-in has a new dimension. Choosing a hosted embedding API (OpenAI, Cohere, Google) means your vectors are tied to that provider's model lifecycle. If they deprecate a model version, you migrate on their timeline, not yours.
Multi-tenant systems need per-tenant versioning. If different customers onboarded at different times and their data was encoded with different model versions, you need to track which model version applies to which tenant's data.
Compliance and auditability require provenance. In regulated industries, you may need to demonstrate which model produced a specific vector and when. Without version tracking, this is impossible.
