The Vector Storage Cost Problem
Every embedding model you run creates vectors that need to live somewhere. A single CLIP embedding is 512 dimensions of float32, about 2 KB. That sounds small until you process a million images, and now you have 2 GB of vectors. Process 100 million documents across text, image, video, and audio modalities, and you are storing 200 GB of raw vector data before accounting for indexes, metadata, or replicas.
In-memory vector databases like Pinecone, Qdrant, and Weaviate store vectors in RAM or on fast SSDs to deliver sub-10ms query latency. That speed comes at a cost. Pinecone's serverless offering runs about $2 per GB per month for storage plus per-query charges. Qdrant Cloud charges roughly $0.045 per million dimensions stored. At 100 million 512-dimensional vectors, you are paying thousands of dollars per month just to keep vectors addressable, before you serve a single query.
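To sanity-check these numbers, raw storage cost is a function of vector count, dimensionality, and per-GB pricing. A minimal sketch, using the illustrative prices above rather than quoted rates:

```python
def monthly_storage_cost(num_vectors: int, dims: int, price_per_gb: float,
                         bytes_per_dim: int = 4) -> float:
    """Estimate monthly cost of raw float32 vector storage, excluding
    index overhead, metadata, and replicas."""
    gib = num_vectors * dims * bytes_per_dim / 1024**3
    return gib * price_per_gb

# 100M 512-dim float32 vectors: roughly 190 GiB of raw data
hot_cost = monthly_storage_cost(100_000_000, 512, 2.00)    # hot, ~$2/GB-month
cold_cost = monthly_storage_cost(100_000_000, 512, 0.023)  # cold, ~$0.023/GB-month
```

At these prices the raw-data bill is roughly $380 per month hot versus under $5 per month cold; the real hot-tier bill is higher once indexes, replicas, and per-query charges are added.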
The uncomfortable truth: most of those vectors are rarely queried. In a typical production deployment, fewer than 20% of stored vectors receive any queries in a given month. The other 80% sit in expensive hot storage doing nothing. This is the same pattern that drove the adoption of storage tiering in traditional databases decades ago, and the same economics apply to vectors.
What Is Vector Storage Tiering?
Vector storage tiering separates your vector data into access-frequency tiers, each with different latency and cost characteristics. The concept maps directly to the hot/warm/cold model that AWS, GCP, and Azure have used for object storage for years:
Hot tier stores vectors in memory or on NVMe SSDs. Query latency is under 10 milliseconds. This is where your actively searched data lives: the product catalog a user is browsing right now, the security camera feeds being monitored in real time, the document corpus powering a customer-facing RAG chatbot. Every major vector database operates exclusively in this tier.
Warm tier stores vectors on disk with optional caching layers. Latency ranges from 50 to 200 milliseconds. This is suitable for data that gets queried periodically but not constantly: monthly reports, seasonal product lines, historical customer interactions that a support agent might pull up.
Cold tier stores vectors in object storage like Amazon S3 Vectors, Google Cloud Storage, or Azure Blob Storage. Latency is 200 to 800 milliseconds for single queries, with batch queries amortizing the overhead. This tier costs pennies per GB per month instead of dollars. It is the right home for archival data, compliance-retained embeddings, training datasets, and any collection you query less than a few times per day.
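The three tiers above reduce to a routing rule on observed query frequency. The thresholds in this sketch are illustrative assumptions, not fixed standards:

```python
def choose_tier(queries_per_day: float) -> str:
    """Map a collection's observed query frequency to a storage tier.
    Thresholds are assumptions for illustration; tune them to your workload."""
    if queries_per_day >= 1_000:   # constantly searched: memory/NVMe
        return "hot"
    if queries_per_day >= 5:       # periodic lookups: on-disk with caching
        return "warm"
    return "cold"                  # a few queries per day or fewer: object storage
```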
The Economics: Why Tiering Saves 60-90%
The cost difference between tiers is not incremental. It is an order of magnitude.
| Tier | Example Service | Storage Cost (per GB/month) | Query Latency | Best For |
|------|-----------------|------------------------------|---------------|----------|
| Hot | Pinecone Serverless | ~$2.00 | < 10ms | Real-time search, live RAG |
| Hot | Qdrant Cloud | ~$1.50 | < 10ms | Production similarity search |
| Warm | Qdrant on-disk | ~$0.30 | 50-200ms | Periodic lookups, batch search |
| Cold | Amazon S3 Vectors | ~$0.023 | 200-800ms | Archival, compliance, batch |
| Cold | S3 Standard | ~$0.023 | N/A (no native ANN) | Raw embedding backup |
Moving a gigabyte from hot to warm storage cuts its cost by 80-85%; moving it to cold cuts it by nearly 99%. The savings compound when you factor in replica costs, backup storage, and the engineering time saved by not scaling hot infrastructure for data nobody queries.
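The headline 60-90% figure comes from blending tiers. A quick model, assuming some fraction of the data stays hot and the rest moves cold:

```python
def blended_savings(hot_fraction: float, hot_price: float = 2.00,
                    cold_price: float = 0.023) -> float:
    """Fraction of storage cost saved versus keeping everything hot,
    when only `hot_fraction` of the data remains in the hot tier."""
    blended = hot_fraction * hot_price + (1 - hot_fraction) * cold_price
    return 1 - blended / hot_price

# The typical deployment described earlier: 20% of vectors actively queried
savings = blended_savings(0.20)  # roughly 0.79, i.e. ~80% saved
```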
Four Architecture Patterns
Pattern 1: Hot-Only (Traditional Vector Database)
Every vector lives in a single Pinecone index, Qdrant collection, or Weaviate class. No tiering.
When it works: Your total vector count is under 10 million, all data is actively queried, and you need consistent sub-10ms latency for every request. This is the right starting point for most applications.
When it breaks: Data grows beyond what your budget can sustain in hot storage. You start deleting old vectors to control costs, losing data you might need later. Or you accept degraded query performance as indexes grow past what your instance can handle efficiently.
Pattern 2: Cold-Only (Object Storage)
All vectors live in S3 Vectors or equivalent. No in-memory database at all.
When it works: Batch analytics pipelines where you compute similarity offline. Compliance workloads where you must retain embeddings for N years but rarely query them. ML training pipelines that need access to historical embeddings for model retraining.
When it breaks: Any workload that requires real-time or near-real-time search. Object storage latency of 200-800ms per query is too slow for user-facing applications, RAG pipelines with strict latency budgets, or monitoring systems that need instant alerts.
Pattern 3: Hot + Cold Hybrid (Manual Tiering)
Active data lives in a vector database. Archival data lives in S3. You write application logic to promote data from cold to hot when it becomes active, and demote data from hot to cold after a period of inactivity.
When it works: You have a clear lifecycle for your data. New product listings stay hot for 90 days, then move to cold. Customer support transcripts are hot for the current quarter, then archived. Security camera embeddings are hot for 48 hours, then cold.
When it breaks: Promotion and demotion logic is complex to build and maintain. You need to handle partial queries that span both tiers. Monitoring two separate systems doubles your operational surface area. Schema changes must be coordinated across both stores.
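The promotion/demotion logic in this pattern is application code you own. A minimal sketch of a time-based policy, with the 90-day window as an assumed lifecycle rule:

```python
import time

HOT_TTL_SECONDS = 90 * 86_400  # assumed rule: demote after 90 days idle

def plan_transitions(collections, now=None):
    """Decide which collections to move between tiers.

    `collections` maps name -> {"tier": "hot" | "cold",
                                "last_queried": epoch_seconds}.
    Returns (to_demote, to_promote) lists of collection names.
    """
    now = time.time() if now is None else now
    cutoff = now - HOT_TTL_SECONDS
    demote = [n for n, m in collections.items()
              if m["tier"] == "hot" and m["last_queried"] < cutoff]
    promote = [n for n, m in collections.items()
               if m["tier"] == "cold" and m["last_queried"] >= cutoff]
    return demote, promote
```

This covers only the transition decision; the cross-tier query handling and dual monitoring mentioned above remain your responsibility in this pattern.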
Pattern 4: Unified Tiering (Managed Lifecycle)
A single system manages the full lifecycle: ingest, hot serving, cold archival, and automatic promotion/demotion. The application queries a single API and the system routes to the appropriate tier transparently.
This is the architecture that Mixpeek implements. S3 or GCS is the canonical store for all vectors. A hot serving layer (backed by vector indexes) handles real-time queries for active collections, and collections transition automatically between lifecycle states as their access patterns change.
The advantage is operational simplicity. You do not manage two separate systems, write promotion/demotion logic, or handle cross-tier queries. The disadvantage is vendor dependency on the platform providing unified tiering.
Choosing Your Tiering Strategy
The right pattern depends on three variables: query frequency, latency requirements, and total vector count.
| Your Workload | Recommended Pattern | Why |
|---------------|---------------------|-----|
| < 10M vectors, all actively queried | Hot-Only | Tiering adds complexity you do not need |
| > 50M vectors, < 20% actively queried | Hot + Cold or Unified | 80% of your data is wasting money in hot storage |
| Batch-only analytics, no real-time queries | Cold-Only | Pay pennies per GB and accept higher latency |
| Compliance retention (must keep, rarely query) | Cold-Only or Unified | Object storage costs are negligible for retention |
| Mixed workload (some real-time, some batch) | Unified Tiering | Single API surface, automatic lifecycle management |
| Data with clear temporal lifecycle | Hot + Cold Hybrid | Time-based rules make promotion/demotion simple |
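The table above can be restated mechanically. This sketch hard-codes the table's thresholds; the boolean inputs are simplifications of the workload descriptions:

```python
def recommend_pattern(total_vectors: int, active_fraction: float,
                      needs_realtime: bool) -> str:
    """Pick a tiering pattern from the decision table's rules."""
    if not needs_realtime:
        return "Cold-Only"
    if total_vectors < 10_000_000 and active_fraction >= 0.8:
        return "Hot-Only"
    if total_vectors > 50_000_000 and active_fraction < 0.2:
        return "Hot + Cold or Unified"
    return "Unified Tiering"  # mixed workloads fall through to here
```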
Common Mistakes
Over-provisioning hot storage. Teams default to keeping everything in their vector database because "it might be needed." Audit your query logs. If a collection has not been queried in 30 days, it belongs in cold storage.
Ignoring egress costs. Moving data between tiers incurs network transfer costs. S3 charges $0.09 per GB for data transferred out. If you are constantly promoting and demoting the same vectors, the transfer costs can offset your storage savings. Set a minimum dwell time (e.g., 30 days) before allowing demotion.
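The dwell-time recommendation can be grounded with a breakeven calculation. This sketch counts one $0.09/GB egress charge per promote/demote cycle (the figure cited above) and ignores request fees:

```python
def breakeven_dwell_days(egress_per_gb: float = 0.09,
                         hot_price: float = 2.00,
                         cold_price: float = 0.023) -> float:
    """Days a gigabyte must stay cold before the storage savings cover
    the transfer cost of one round trip back to the hot tier."""
    savings_per_day = (hot_price - cold_price) / 30
    return egress_per_gb / savings_per_day
```

At these prices the breakeven is under two days, so a 30-day minimum dwell leaves a wide margin; the dwell rule exists mainly to damp oscillation, not because transfer costs dominate at low churn.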
Not measuring actual query patterns. Many teams assume uniform query distribution. In reality, query traffic follows a power law. A small number of collections receive the vast majority of queries. Instrument your system to track per-collection query frequency before designing your tiering policy.
Implementation: Setting Up S3 Vectors as a Cold Tier
Amazon S3 Vectors (introduced in 2025) lets you store and query embeddings directly in S3 without a separate vector database. Here is how to set it up as a cold tier for vectors that have aged out of your hot database.
Step 1: Create a Vector Bucket
```bash
aws s3vectors create-vector-bucket \
  --vector-bucket-name my-cold-vectors
```
Step 2: Create an Index for Each Collection
```bash
aws s3vectors create-index \
  --vector-bucket-name my-cold-vectors \
  --index-name product-embeddings-archive \
  --dimension 512 \
  --distance-metric cosine \
  --data-type float32
```
Step 3: Write Vectors from Your Hot Database
```python
import boto3

s3v = boto3.client("s3vectors")

# Export from Qdrant/Pinecone, write to S3 Vectors
for batch in export_from_hot_database(collection="products"):
    vectors = [
        {
            "key": v["id"],
            "data": {"float32": v["embedding"]},
            "metadata": v["payload"],
        }
        for v in batch
    ]
    s3v.put_vectors(
        vectorBucketName="my-cold-vectors",
        indexName="product-embeddings-archive",
        vectors=vectors,
    )
```
Step 4: Query the Cold Tier
```python
results = s3v.query_vectors(
    vectorBucketName="my-cold-vectors",
    indexName="product-embeddings-archive",
    queryVector={"float32": query_embedding},
    topK=10,
)
```
Step 5: Promote Back to Hot When Needed
If a cold collection suddenly needs real-time access (e.g., a seasonal product line coming back into rotation), read vectors from S3 Vectors and bulk-insert into your hot database.
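A sketch of the read side, assuming the `list_vectors` pagination shape of the S3 Vectors API (the parameter and response field names here should be checked against your boto3 version). The hot-database upsert itself is left to your Qdrant/Pinecone client:

```python
def iter_cold_vectors(client, bucket: str, index: str):
    """Yield all vectors from a cold S3 Vectors index, following pagination.
    `client` is boto3.client("s3vectors") or a compatible stub."""
    token = None
    while True:
        kwargs = {"vectorBucketName": bucket, "indexName": index,
                  "returnData": True, "returnMetadata": True}
        if token:
            kwargs["nextToken"] = token
        page = client.list_vectors(**kwargs)
        yield from page["vectors"]
        token = page.get("nextToken")
        if token is None:
            return

# Promotion: stream from cold storage into the hot database in batches,
# feeding each page to your own bulk-insert helper (not shown here).
```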
Vendor Comparison: Tiered Vector Storage Options
| Feature | Mixpeek MVS | Milvus Tiered | Weaviate Offloading | S3 Vectors + Qdrant (DIY) | Pinecone (Hot Only) |
|---------|-------------|---------------|---------------------|---------------------------|---------------------|
| Automatic lifecycle | Yes | Partial (manual config) | Partial (tenant-level) | No (custom code) | No |
| Cold tier storage | S3/GCS native | MinIO/S3 | S3 | S3 Vectors | N/A |
| Cross-tier queries | Transparent | Requires config | Tenant-scoped | Application-managed | N/A |
| Multimodal support | Native (image, video, audio, text) | Vectors only | Vectors only | Vectors only | Vectors only |
| Promotion/demotion | Automatic or API | Manual | Manual | Custom code | N/A |
| Min cold query latency | 200ms | 300ms | 500ms | 200ms | N/A |
| Pricing model | Per-feature stored | Self-hosted + cloud | Per-dimension | S3 storage + Qdrant instance | Per-GB + per-query |
Monitoring Your Tiered Deployment
Three metrics tell you whether your tiering strategy is working:
Hot tier utilization. What percentage of vectors in your hot tier were queried in the last 30 days? If it is below 50%, you are over-provisioning hot storage and should demote more aggressively.
P99 latency by tier. Track query latency separately for hot and cold tiers. If cold-tier latency is creeping above 1 second, your cold indexes may need optimization or your query patterns suggest that data should be promoted.
Cost per query. Divide your total storage and compute costs by the number of queries served. If your cost per query is rising while your query volume is flat, data growth is outpacing your tiering policy. Tighten demotion rules or move more data to cold storage.
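These checks are straightforward to automate. A sketch, using the 50% utilization and one-second latency thresholds stated above:

```python
def tiering_health(hot_total: int, hot_queried_30d: int,
                   monthly_cost_usd: float, monthly_queries: int,
                   cold_p99_ms: float) -> dict:
    """Evaluate the three monitoring signals described above."""
    utilization = hot_queried_30d / hot_total
    return {
        "hot_utilization": utilization,
        "over_provisioned": utilization < 0.5,  # demote more aggressively
        "cold_tier_slow": cold_p99_ms > 1_000,  # optimize or promote
        "cost_per_query": monthly_cost_usd / monthly_queries,
    }
```

Track `cost_per_query` over time rather than as a point value: the signal is a rising trend against flat query volume.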
FAQ
Does S3 Vectors replace vector databases?
No. S3 Vectors is a cold-tier complement, not a hot-tier replacement. It excels at storing and querying vectors you access infrequently at a fraction of the cost of in-memory databases. For sub-10ms real-time search, you still need a vector database.
How much can tiering save?
60-90% of storage costs for deployments where less than 20% of data is actively queried. The exact savings depend on your hot/cold ratio, the pricing of your hot database, and your query volume.
Can I tier multimodal embeddings (video, image, audio)?
Yes. Embedding dimensions and data types are the same regardless of modality. The tiering decision is based on access patterns, not data type. Mixpeek supports tiered storage natively across all modalities.
What about query consistency across tiers?
Hot and cold tiers serve different copies of the same vectors. There is no consistency concern for immutable embeddings. If you update embeddings (e.g., after retraining a model), you need to update both tiers. Unified tiering systems handle this automatically since S3/GCS is the canonical store and the hot layer is derived from it.
When should I start tiering?
When your vector storage costs exceed $200/month or your total vector count exceeds 50 million. Below these thresholds, the operational complexity of tiering is not worth the savings. Above them, the savings compound quickly.
