The Portability Illusion
Teams building on vector search tend to treat embeddings like database records: store them, query them, move them between systems. This mental model breaks the moment you try to switch embedding models, mix vectors produced by different models in the same index, or hand vectors to another system for interpretation.
The root cause is simple: a vector only has meaning inside the specific model and embedding space that created it. Unlike a UUID, a hash, or even a pixel coordinate, an embedding is a learned representation. Two models that both produce 1024-dimensional vectors place the concept "golden retriever" at completely different coordinates. Concatenating, averaging, or comparing vectors across models produces nonsense.
This is not a theoretical concern. It shows up the moment your embedding ecosystem evolves, which it always does.
Why Embeddings Are Not Portable
Model-dependent coordinates
Every embedding model learns its own coordinate system during training. CLIP, SigLIP, BGE, Cohere Embed, and OpenAI's text-embedding-3 all produce floating-point arrays, but the numbers encode entirely different learned features. Dimension 47 in one model might correlate with "texture" while dimension 47 in another correlates with "sentiment."
This means that vectors from different models cannot be compared, averaged, or searched against the same index in any meaningful way, even when their dimensions match.
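As a minimal sketch of the problem, the snippet below embeds the same sentence with two different models and compares the results. It assumes the sentence-transformers package; the two checkpoints are illustrative examples of same-dimensionality models, not a recommendation. The cross-model similarity is a number you can compute, but not one that means anything.

```python
# Sketch: same text, two models, same dimensionality -- still incompatible spaces.
# Assumes sentence-transformers is installed; the checkpoints are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model_a = SentenceTransformer("all-MiniLM-L6-v2")         # 384-dim space A
model_b = SentenceTransformer("paraphrase-MiniLM-L6-v2")  # 384-dim space B

text = "a golden retriever playing in the park"
vec_a = model_a.encode(text, normalize_embeddings=True)
vec_b = model_b.encode(text, normalize_embeddings=True)

# Within one model, a text compared with itself scores 1.0 by construction.
same_space = float(np.dot(vec_a, model_a.encode(text, normalize_embeddings=True)))

# Across models the shapes match, so nothing stops you from computing this --
# but the value is meaningless: the dimensions encode different learned features.
cross_space = float(np.dot(vec_a, vec_b))

print(f"same-space similarity:  {same_space:.3f}")   # ~1.0
print(f"cross-space similarity: {cross_space:.3f}")  # arbitrary, not comparable
```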
The metadata problem
A raw vector is an opaque array of floats. Without metadata, you cannot determine which model produced it, which version of that model, whether the values are normalized, or which distance metric it was intended for.
This is the interoperability problem: for one party to use another party's vector, both sides need a shared envelope that specifies the model, version, embedding space, and metric. Without it, you are comparing coordinates from different maps.
Version drift within model families
Even within a single model family, versions are not guaranteed to be compatible. A provider can update weights, change tokenization, swap the training data distribution, or apply distillation. The model name stays the same, but the embedding space shifts.
OpenAI's transition from text-embedding-ada-002 to text-embedding-3-small is an obvious example (an entirely new model and an incompatible embedding space), but even patch-level updates can shift retrieval quality by several percentage points. If your pipeline does not track model versions, you have no way to detect or respond to this drift.
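One lightweight way to detect this kind of drift is to keep a small canary set of texts, store their embeddings alongside the model identifier, and periodically re-embed the same texts, alerting when similarity to the stored vectors drops. A minimal sketch, assuming a hypothetical `embed()` wrapper around whatever embedding API you use:

```python
# Sketch: canary-based drift detection for a hosted embedding API.
# `embed(texts) -> list[list[float]]` is a hypothetical wrapper around your provider.
import numpy as np

CANARY_TEXTS = [
    "a golden retriever playing in the park",
    "quarterly revenue grew 12% year over year",
    "how do I reset my password?",
]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def snapshot(embed) -> np.ndarray:
    """Embed the canary set and return the vectors as a matrix."""
    return np.asarray(embed(CANARY_TEXTS), dtype=np.float32)

def drift_check(baseline: np.ndarray, embed, threshold: float = 0.99) -> bool:
    """Re-embed the canaries and flag drift if mean self-similarity drops."""
    current = snapshot(embed)
    sims = [cosine(baseline[i], current[i]) for i in range(len(CANARY_TEXTS))]
    mean_sim = float(np.mean(sims))
    if mean_sim < threshold:
        print(f"Embedding drift detected: mean canary similarity {mean_sim:.4f}")
        return True
    return False
```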
The Upgrade Problem
The portability problem leads directly to the upgrade problem. When a better model arrives (and it always does), you face a choice:
Option A: Keep old vectors. Your new data gets encoded with the new model while your old data stays on the old model. You now have a fragmented index where queries only work well against data encoded by the same model as the query. Retrieval quality degrades in proportion to the share of your corpus still encoded with the old model.
Option B: Re-encode everything. You reprocess every document through the new model and rebuild your index. This is correct but expensive:
| Corpus size | Cost at $0.0001/embedding | Wall time (1,000 embeddings/sec) |
|---|---|---|
| 1M documents | $100 | 17 minutes |
| 10M documents | $1,000 | 2.8 hours |
| 100M documents | $10,000 | 28 hours |
| 1B documents | $100,000 | 11.5 days |
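The rows follow directly from the per-embedding price and throughput. A small helper (with illustrative defaults matching the table) makes it easy to plug in your own numbers:

```python
# Sketch: back-of-the-envelope re-encoding cost and wall time.
# Defaults mirror the table above; substitute your own price and throughput.
def reencode_estimate(num_docs: int,
                      cost_per_embedding: float = 0.0001,
                      embeddings_per_sec: float = 1_000) -> tuple[float, float]:
    """Return (total cost in dollars, wall time in hours) for a full re-encode."""
    cost = num_docs * cost_per_embedding
    hours = num_docs / embeddings_per_sec / 3600
    return cost, hours

for n in (1_000_000, 10_000_000, 100_000_000, 1_000_000_000):
    cost, hours = reencode_estimate(n)
    print(f"{n:>13,} docs -> ${cost:>9,.0f}, {hours:7.1f} h")
```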
Option C: Do nothing. Stay on the old model. This works until the model is deprecated, until your competitors get measurably better results from newer models, or until you need to combine your vectors with a partner's system that uses a different model.
None of these options are free. The question is which cost structure fits your constraints.
Interoperability: Making Vectors Readable Across Systems
Interoperability means that a vector produced by one system can be correctly interpreted by another. This requires agreement on four things:
1. Model identification
Every vector needs a tag specifying the exact model that produced it. Not just "CLIP" but the specific checkpoint: `openai/clip-vit-large-patch14` or `laion/CLIP-ViT-H-14-laion2B-s32B-b79K`. Different checkpoints, even within the same architecture, produce incompatible spaces.
2. Version pinning
Models get updated. Track the exact version (commit hash, API version date, or release tag) so that you can detect when the underlying space changes. This is especially important for hosted embedding APIs where the provider controls updates.
3. Space metadata
Include dimensionality, normalization status, and the intended distance metric. A system that expects cosine similarity on L2-normalized vectors will produce wrong results if given unnormalized vectors intended for dot-product search.
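A quick guard against exactly this mismatch is to check, and if necessary apply, L2 normalization before indexing or querying. A minimal numpy sketch:

```python
# Sketch: verify vectors are L2-normalized before cosine / dot-product search.
import numpy as np

def is_l2_normalized(vecs: np.ndarray, tol: float = 1e-3) -> bool:
    """True if every row has unit L2 norm (within tolerance)."""
    norms = np.linalg.norm(vecs, axis=-1)
    return bool(np.allclose(norms, 1.0, atol=tol))

def l2_normalize(vecs: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot product equals cosine similarity."""
    norms = np.linalg.norm(vecs, axis=-1, keepdims=True)
    return vecs / np.clip(norms, 1e-12, None)
```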
4. Provenance
Record the training data domain and any fine-tuning applied. A CLIP model fine-tuned on medical images occupies a different space than the same architecture trained on web data. The model name alone does not capture this.
Proposed envelope format
A practical vector envelope might look like this:
```json
{
  "model": "openai/clip-vit-large-patch14",
  "version": "2025-03-15",
  "dimensions": 768,
  "normalized": true,
  "metric": "cosine",
  "quantization": "none",
  "domain": "general-web",
  "vector": [0.023, -0.041, ...]
}
```
This is more overhead than storing a bare float array, but it is the minimum information needed for a receiving system to know whether it can use the vector.
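Before accepting a vector from another system, a receiving service can check the envelope fields against its own index configuration. A sketch of that gate, using the field names from the example envelope above (the dataclass itself is illustrative):

```python
# Sketch: decide whether a received vector envelope is usable against a local index.
# Field names follow the example envelope above; IndexConfig is illustrative.
from dataclasses import dataclass

@dataclass
class IndexConfig:
    model: str          # exact checkpoint, e.g. "openai/clip-vit-large-patch14"
    version: str        # pinned model version or API release date
    dimensions: int
    normalized: bool
    metric: str         # "cosine", "dot", "euclidean", ...

def is_compatible(envelope: dict, index: IndexConfig) -> tuple[bool, str]:
    """Return (ok, reason). A mismatch on any field means the vector is unusable."""
    if envelope.get("model") != index.model:
        return False, f"model mismatch: {envelope.get('model')} vs {index.model}"
    if envelope.get("version") != index.version:
        return False, f"version mismatch: {envelope.get('version')} vs {index.version}"
    if envelope.get("dimensions") != index.dimensions:
        return False, "dimensionality mismatch"
    if envelope.get("normalized") != index.normalized:
        return False, "normalization mismatch"
    if envelope.get("metric") != index.metric:
        return False, "distance metric mismatch"
    return True, "ok"
```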
Migration Strategies: Moving from Model v1 to v2
Shadow indexing (dual-write)
Maintain two indexes in parallel. New data is encoded by both models and written to both indexes. A background job re-encodes historical data from the old model into the new one, working through the backlog in priority order (most-queried documents first).
How it works:
1. Create a new namespace/collection for the v2 model
2. Update your ingestion pipeline to encode and write to both v1 and v2
3. Start a backfill job that reads source data, encodes with v2, and writes to the v2 index
4. Route queries to the v1 index during migration
5. When the v2 index reaches full coverage, validate quality and cut over
6. Decommission the v1 index
Trade-offs: Doubles write cost and storage during migration. Gives you a clean rollback path. Works well when the backfill job can run without competing for the same compute as real-time queries.
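The ingestion-side change is small: every new document is encoded twice and written to both indexes. A sketch, assuming hypothetical `encode_v1`/`encode_v2` functions and index clients with `upsert`/`upsert_batch` methods standing in for your actual vector store:

```python
# Sketch: dual-write ingestion and prioritized backfill for shadow indexing.
# encode_v1, encode_v2, and the index clients are hypothetical stand-ins.
def ingest(doc_id: str, text: str, encode_v1, encode_v2, index_v1, index_v2) -> None:
    """Encode a new document with both models and write to both indexes."""
    index_v1.upsert(id=doc_id, vector=encode_v1(text), metadata={"model": "v1"})
    index_v2.upsert(id=doc_id, vector=encode_v2(text), metadata={"model": "v2"})

def backfill(source_docs, encode_v2, index_v2, batch_size: int = 256) -> None:
    """Re-encode historical documents into the v2 index, most-queried first."""
    batch = []
    for doc in source_docs:  # assumed pre-sorted by query frequency
        batch.append((doc["id"], encode_v2(doc["text"])))
        if len(batch) >= batch_size:
            index_v2.upsert_batch(batch)
            batch.clear()
    if batch:
        index_v2.upsert_batch(batch)
```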
Blue-green migration
Build the v2 index completely offline, validate it, and perform an atomic cutover.
How it works:
1. Export all source data (not vectors, the original files/text)
2. Encode everything with the v2 model in a batch job
3. Load into a new index
4. Run your quality benchmark: compare recall@k, MRR, and nDCG against a golden test set
5. If quality meets your threshold, swap the query endpoint to the new index
6. Keep the old index available for rollback for a defined period
Trade-offs: No mixed-version queries during migration. Requires enough compute and storage to hold two complete indexes simultaneously. Best for corpora that can be re-encoded in a reasonable time window.
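The critical piece is the quality gate in step 4: measure the new index against a golden test set before flipping the query endpoint. A minimal recall@k sketch, assuming a hypothetical `search(index, query_text, k)` helper and a golden set of (query, relevant document ids) pairs:

```python
# Sketch: blue-green quality gate using recall@k on a golden test set.
# `search(index, query_text, k)` is a hypothetical helper returning ranked doc ids.
def recall_at_k(index, golden_set: list[tuple[str, set[str]]], search, k: int = 10) -> float:
    """Fraction of relevant documents retrieved in the top k, averaged over queries."""
    scores = []
    for query_text, relevant_ids in golden_set:
        retrieved = set(search(index, query_text, k))
        scores.append(len(retrieved & relevant_ids) / len(relevant_ids))
    return sum(scores) / len(scores)

def maybe_cutover(index_v1, index_v2, golden_set, search, min_ratio: float = 1.0) -> bool:
    """Swap to v2 only if its recall@k is at least `min_ratio` times the v1 baseline."""
    baseline = recall_at_k(index_v1, golden_set, search)
    candidate = recall_at_k(index_v2, golden_set, search)
    print(f"recall@10  v1={baseline:.3f}  v2={candidate:.3f}")
    return candidate >= min_ratio * baseline
```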
Progressive rollout
Re-encode data in priority order and gradually shift traffic to the new index.
How it works:
1. Rank documents by query frequency (or business importance)
2. Re-encode the top tier and add to the v2 index
3. Route a percentage of queries to v2 (starting at 5-10%)
4. Monitor quality metrics in production
5. Increase the v2 traffic share as more data is backfilled
6. Continue until 100% of data is in v2 and 100% of traffic is routed there
Trade-offs: Minimizes risk by catching quality regressions early. More operationally complex because you are running a partially migrated system. Requires query routing logic that can split traffic by index.
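The routing logic can be as simple as a deterministic hash of the query (or user) ID against the current rollout percentage, so the same caller consistently hits the same index while the v2 share is ramped up. A minimal sketch:

```python
# Sketch: deterministic percentage-based query routing for a progressive rollout.
import hashlib

def route_to_v2(query_id: str, v2_traffic_pct: float) -> bool:
    """Stable bucket per query/user id: True -> send to the v2 index."""
    digest = hashlib.sha256(query_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100  # 0..99
    return bucket < v2_traffic_pct

# Example: ramp from 5% toward 100% as the backfill completes.
index = "v2" if route_to_v2("user-1234", v2_traffic_pct=10) else "v1"
```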
Comparison
| Strategy | Downtime | Compute cost | Complexity | Rollback |
|---|---|---|---|---|
| Shadow indexing | None | High (2x writes) | Medium | Easy |
| Blue-green | Brief cutover | High (full rebuild) | Low | Easy |
| Progressive | None | Medium (prioritized) | High | Medium |
How Mixpeek Handles This
Mixpeek's architecture separates raw data storage from vector indexing. Every file ingested through a bucket is stored in its original form alongside the extracted features, so the source material is always available for re-encoding: vectors can be regenerated against a new model at any time without re-collecting data.
This design means that an embedding upgrade follows a predictable workflow:
1. Create a new namespace for v2
2. Create a new collection pointing at the same bucket, configured with the v2 model
3. Trigger reprocessing (the batch pipeline handles backfill automatically)
4. Validate quality on your test queries
5. Update your retriever to point at the v2 namespace
6. Archive the v1 namespace when you are confident in v2
No custom migration scripts. No mixed-version indexes. No orphaned vectors.
Organizational Implications
Embedding portability is not just an infrastructure problem. It affects how teams plan and budget:
Model evaluation becomes a migration planning exercise. You cannot evaluate a new embedding model without also estimating the cost and timeline to migrate your existing data. A 5% improvement in recall means nothing if the migration takes three months and costs $50,000.
Vendor lock-in has a new dimension. Choosing a hosted embedding API (OpenAI, Cohere, Google) means your vectors are tied to that provider's model lifecycle. If they deprecate a model version, you migrate on their timeline, not yours.
Multi-tenant systems need per-tenant versioning. If different customers onboarded at different times and their data was encoded with different model versions, you need to track which model version applies to which tenant's data.
Compliance and auditability require provenance. In regulated industries, you may need to demonstrate which model produced a specific vector and when. Without version tracking, this is impossible.
