The Vector Storage Cost Problem
Every embedding model you run creates vectors that need to live somewhere. A single CLIP embedding is 512 dimensions of float32, about 2 KB. That sounds small until you process a million images, and now you have 2 GB of vectors. Process 100 million documents across text, image, video, and audio modalities, and you are storing 200 GB of raw vector data before accounting for indexes, metadata, or replicas.
In-memory vector databases like Pinecone, Qdrant, and Weaviate store vectors in RAM or on fast SSDs to deliver sub-10ms query latency. That speed comes at a cost. Pinecone's serverless offering runs about $2 per GB per month for storage plus per-query charges. Qdrant Cloud charges roughly $0.045 per million dimensions stored. At 100 million 512-dimensional vectors, you are paying thousands of dollars per month just to keep vectors addressable, before you serve a single query.
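To sanity-check these numbers, raw storage cost is a function of vector count, dimensionality, and per-GB pricing. A minimal sketch, using the illustrative prices above rather than quoted rates:

```python
def monthly_storage_cost(num_vectors: int, dims: int, price_per_gb: float,
                         bytes_per_dim: int = 4) -> float:
    """Estimate monthly cost of raw float32 vector storage, excluding
    index overhead, metadata, and replicas."""
    gib = num_vectors * dims * bytes_per_dim / 1024**3
    return gib * price_per_gb

# 100M 512-dim float32 vectors: roughly 190 GiB of raw data
hot_cost = monthly_storage_cost(100_000_000, 512, 2.00)    # hot, ~$2/GB-month
cold_cost = monthly_storage_cost(100_000_000, 512, 0.023)  # cold, ~$0.023/GB-month
```

At these prices the raw-data bill is roughly $380 per month hot versus under $5 per month cold; the real hot-tier bill is higher once indexes, replicas, and per-query charges are added.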
The uncomfortable truth: most of those vectors are rarely queried. In a typical production deployment, fewer than 20% of stored vectors receive any queries in a given month. The other 80% sit in expensive hot storage doing nothing. This is the same pattern that drove the adoption of storage tiering in traditional databases decades ago, and the same economics apply to vectors.
What Is Vector Storage Tiering?
Vector storage tiering separates your vector data into access-frequency tiers, each with different latency and cost characteristics. The concept maps directly to the hot/warm/cold model that AWS, GCP, and Azure have used for object storage for years:
Hot tier stores vectors in memory or on NVMe SSDs. Query latency is under 10 milliseconds. This is where your actively searched data lives: the product catalog a user is browsing right now, the security camera feeds being monitored in real time, the document corpus powering a customer-facing RAG chatbot. Every major vector database operates exclusively in this tier.
Warm tier stores vectors on disk with optional caching layers. Latency ranges from 50 to 200 milliseconds. This is suitable for data that gets queried periodically but not constantly: monthly reports, seasonal product lines, historical customer interactions that a support agent might pull up.
Cold tier stores vectors in object storage like Amazon S3 Vectors, Google Cloud Storage, or Azure Blob Storage. Latency is 200 to 800 milliseconds for single queries, with batch queries amortizing the overhead. This tier costs pennies per GB per month instead of dollars. It is the right home for archival data, compliance-retained embeddings, training datasets, and any collection you query less than a few times per day.
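The three tiers above reduce to a routing rule on observed query frequency. The thresholds in this sketch are illustrative assumptions, not fixed standards:

```python
def choose_tier(queries_per_day: float) -> str:
    """Map a collection's observed query frequency to a storage tier.
    Thresholds are assumptions for illustration; tune them to your workload."""
    if queries_per_day >= 1_000:   # constantly searched: memory/NVMe
        return "hot"
    if queries_per_day >= 5:       # periodic lookups: on-disk with caching
        return "warm"
    return "cold"                  # a few queries per day or fewer: object storage
```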
The Economics: Why Tiering Saves 60-90%
The cost difference between tiers is not incremental. It is an order of magnitude.
| Tier | Example Service | Storage Cost (per GB/month) | Query Latency | Best For |
|------|-----------------|------------------------------|---------------|----------|
| Hot | Pinecone Serverless | ~$2.00 | < 10ms | Real-time search, live RAG |
| Hot | Qdrant Cloud | ~$1.50 | < 10ms | Production similarity search |
| Warm | Qdrant on-disk | ~$0.30 | 50-200ms | Periodic lookups, batch search |
| Cold | Amazon S3 Vectors | ~$0.023 | 200-800ms | Archival, compliance, batch |
| Cold | S3 Standard | ~$0.023 | N/A (no native ANN) | Raw embedding backup |
Moving a gigabyte from hot to warm storage cuts its cost by 80-85%; moving it to cold cuts it by nearly 99%. The savings compound when you factor in replica costs, backup storage, and the engineering time saved by not scaling hot infrastructure for data nobody queries.
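The headline 60-90% figure comes from blending tiers. A quick model, assuming some fraction of the data stays hot and the rest moves cold:

```python
def blended_savings(hot_fraction: float, hot_price: float = 2.00,
                    cold_price: float = 0.023) -> float:
    """Fraction of storage cost saved versus keeping everything hot,
    when only `hot_fraction` of the data remains in the hot tier."""
    blended = hot_fraction * hot_price + (1 - hot_fraction) * cold_price
    return 1 - blended / hot_price

# The typical deployment described earlier: 20% of vectors actively queried
savings = blended_savings(0.20)  # roughly 0.79, i.e. ~80% saved
```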
Four Architecture Patterns
Pattern 1: Hot-Only (Traditional Vector Database)
Every vector lives in a single Pinecone index, Qdrant collection, or Weaviate class. No tiering.
When it works: Your total vector count is under 10 million, all data is actively queried, and you need consistent sub-10ms latency for every request. This is the right starting point for most applications.
When it breaks: Data grows beyond what your budget can sustain in hot storage. You start deleting old vectors to control costs, losing data you might need later. Or you accept degraded query performance as indexes grow past what your instance can handle efficiently.
Pattern 2: Cold-Only (Object Storage)
All vectors live in S3 Vectors or equivalent. No in-memory database at all.
When it works: Batch analytics pipelines where you compute similarity offline. Compliance workloads where you must retain embeddings for N years but rarely query them. ML training pipelines that need access to historical embeddings for model retraining.
When it breaks: Any workload that requires real-time or near-real-time search. Object storage latency of 200-800ms per query is too slow for user-facing applications, RAG pipelines with strict latency budgets, or monitoring systems that need instant alerts.
Pattern 3: Hot + Cold Hybrid (Manual Tiering)
Active data lives in a vector database. Archival data lives in S3. You write application logic to promote data from cold to hot when it becomes active, and demote data from hot to cold after a period of inactivity.
When it works: You have a clear lifecycle for your data. New product listings stay hot for 90 days, then move to cold. Customer support transcripts are hot for the current quarter, then archived. Security camera embeddings are hot for 48 hours, then cold.
When it breaks: Promotion and demotion logic is complex to build and maintain. You need to handle partial queries that span both tiers. Monitoring two separate systems doubles your operational surface area. Schema changes must be coordinated across both stores.
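The promotion/demotion logic in this pattern is application code you own. A minimal sketch of a time-based policy, with the 90-day window as an assumed lifecycle rule:

```python
import time

HOT_TTL_SECONDS = 90 * 86_400  # assumed rule: demote after 90 days idle

def plan_transitions(collections, now=None):
    """Decide which collections to move between tiers.

    `collections` maps name -> {"tier": "hot" | "cold",
                                "last_queried": epoch_seconds}.
    Returns (to_demote, to_promote) lists of collection names.
    """
    now = time.time() if now is None else now
    cutoff = now - HOT_TTL_SECONDS
    demote = [n for n, m in collections.items()
              if m["tier"] == "hot" and m["last_queried"] < cutoff]
    promote = [n for n, m in collections.items()
               if m["tier"] == "cold" and m["last_queried"] >= cutoff]
    return demote, promote
```

This covers only the transition decision; the cross-tier query handling and dual monitoring mentioned above remain your responsibility in this pattern.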
Pattern 4: Unified Tiering (Managed Lifecycle)
A single system manages the full lifecycle: ingest, hot serving, cold archival, and automatic promotion/demotion. The application queries a single API and the system routes to the appropriate tier transparently.
This is the architecture that Mixpeek implements. S3 or GCS is the canonical store for all vectors. A hot serving layer (backed by vector indexes) handles real-time queries for active collections, and collections transition automatically between lifecycle states as their access patterns change.
The advantage is operational simplicity. You do not manage two separate systems, write promotion/demotion logic, or handle cross-tier queries. The disadvantage is vendor dependency on the platform providing unified tiering.
Choosing Your Tiering Strategy
The right pattern depends on three variables: query frequency, latency requirements, and total vector count.
| Your Workload | Recommended Pattern | Why |
|---------------|---------------------|-----|
| < 10M vectors, all actively queried | Hot-Only | Tiering adds complexity you do not need |
| > 50M vectors, < 20% actively queried | Hot + Cold or Unified | 80% of your data is wasting money in hot storage |
| Batch-only analytics, no real-time queries | Cold-Only | Pay pennies per GB and accept higher latency |
| Compliance retention (must keep, rarely query) | Cold-Only or Unified | Object storage costs are negligible for retention |
| Mixed workload (some real-time, some batch) | Unified Tiering | Single API surface, automatic lifecycle management |
| Data with clear temporal lifecycle | Hot + Cold Hybrid | Time-based rules make promotion/demotion simple |
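The table above can be restated mechanically. This sketch hard-codes the table's thresholds; the boolean inputs are simplifications of the workload descriptions:

```python
def recommend_pattern(total_vectors: int, active_fraction: float,
                      needs_realtime: bool) -> str:
    """Pick a tiering pattern from the decision table's rules."""
    if not needs_realtime:
        return "Cold-Only"
    if total_vectors < 10_000_000 and active_fraction >= 0.8:
        return "Hot-Only"
    if total_vectors > 50_000_000 and active_fraction < 0.2:
        return "Hot + Cold or Unified"
    return "Unified Tiering"  # mixed workloads fall through to here
```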
Common Mistakes
Over-provisioning hot storage. Teams default to keeping everything in their vector database because "it might be needed." Audit your query logs. If a collection has not been queried in 30 days, it belongs in cold storage.
Ignoring egress costs. Moving data between tiers incurs network transfer costs. S3 charges $0.09 per GB for data transferred out. If you are constantly promoting and demoting the same vectors, the transfer costs can offset your storage savings. Set a minimum dwell time (e.g., 30 days) before allowing demotion.
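The dwell-time recommendation can be grounded with a breakeven calculation. This sketch counts one $0.09/GB egress charge per promote/demote cycle (the figure cited above) and ignores request fees:

```python
def breakeven_dwell_days(egress_per_gb: float = 0.09,
                         hot_price: float = 2.00,
                         cold_price: float = 0.023) -> float:
    """Days a gigabyte must stay cold before the storage savings cover
    the transfer cost of one round trip back to the hot tier."""
    savings_per_day = (hot_price - cold_price) / 30
    return egress_per_gb / savings_per_day
```

At these prices the breakeven is under two days, so a 30-day minimum dwell leaves a wide margin; the dwell rule exists mainly to damp oscillation, not because transfer costs dominate at low churn.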
Not measuring actual query patterns. Many teams assume uniform query distribution. In reality, query traffic follows a power law. A small number of collections receive the vast majority of queries. Instrument your system to track per-collection query frequency before designing your tiering policy.
Implementation: Setting Up S3 Vectors as a Cold Tier
Amazon S3 Vectors (introduced in 2025) lets you store and query embeddings directly in S3 without a separate vector database. Here is how to set it up as a cold tier for vectors that have aged out of your hot database.
Step 1: Create a Vector Bucket
```bash
aws s3vectors create-vector-bucket \
  --vector-bucket-name my-cold-vectors
```
Step 2: Create an Index for Each Collection
```bash
aws s3vectors create-index \
  --vector-bucket-name my-cold-vectors \
  --index-name product-embeddings-archive \
  --dimension 512 \
  --distance-metric cosine \
  --data-type float32
```
Step 3: Write Vectors from Your Hot Database
```python
import boto3

s3v = boto3.client("s3vectors")

# Export from Qdrant/Pinecone, write to S3 Vectors
for batch in export_from_hot_database(collection="products"):
    vectors = [
        {
            "key": v["id"],
            "data": {"float32": v["embedding"]},
            "metadata": v["payload"],
        }
        for v in batch
    ]
    s3v.put_vectors(
        vectorBucketName="my-cold-vectors",
        indexName="product-embeddings-archive",
        vectors=vectors,
    )
```
Step 4: Query the Cold Tier
```python
results = s3v.query_vectors(
    vectorBucketName="my-cold-vectors",
    indexName="product-embeddings-archive",
    queryVector={"float32": query_embedding},
    topK=10,
)
```
Step 5: Promote Back to Hot When Needed
If a cold collection suddenly needs real-time access (e.g., a seasonal product line coming back into rotation), read vectors from S3 Vectors and bulk-insert into your hot database.
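A sketch of the read side, assuming the `list_vectors` pagination shape of the S3 Vectors API (the parameter and response field names here should be checked against your boto3 version). The hot-database upsert itself is left to your Qdrant/Pinecone client:

```python
def iter_cold_vectors(client, bucket: str, index: str):
    """Yield all vectors from a cold S3 Vectors index, following pagination.
    `client` is boto3.client("s3vectors") or a compatible stub."""
    token = None
    while True:
        kwargs = {"vectorBucketName": bucket, "indexName": index,
                  "returnData": True, "returnMetadata": True}
        if token:
            kwargs["nextToken"] = token
        page = client.list_vectors(**kwargs)
        yield from page["vectors"]
        token = page.get("nextToken")
        if token is None:
            return

# Promotion: stream from cold storage into the hot database in batches,
# feeding each page to your own bulk-insert helper (not shown here).
```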
Vendor Comparison: Tiered Vector Storage Options
| Feature | Mixpeek MVS | Milvus Tiered | Weaviate Offloading | S3 Vectors + Qdrant (DIY) | Pinecone (Hot Only) |
|---------|-------------|---------------|---------------------|---------------------------|---------------------|
| Automatic lifecycle | Yes | Partial (manual config) | Partial (tenant-level) | No (custom code) | No |
| Cold tier storage | S3/GCS native | MinIO/S3 | S3 | S3 Vectors | N/A |
| Cross-tier queries | Transparent | Requires config | Tenant-scoped | Application-managed | N/A |
| Multimodal support | Native (image, video, audio, text) | Vectors only | Vectors only | Vectors only | Vectors only |
| Promotion/demotion | Automatic or API | Manual | Manual | Custom code | N/A |
| Min cold query latency | 200ms | 300ms | 500ms | 200ms | N/A |
| Pricing model | Per-feature stored | Self-hosted + cloud | Per-dimension | S3 storage + Qdrant instance | Per-GB + per-query |
Monitoring Your Tiered Deployment
Three metrics tell you whether your tiering strategy is working:
Hot tier utilization. What percentage of vectors in your hot tier were queried in the last 30 days? If it is below 50%, you are over-provisioning hot storage and should demote more aggressively.
P99 latency by tier. Track query latency separately for hot and cold tiers. If cold-tier latency is creeping above 1 second, your cold indexes may need optimization or your query patterns suggest that data should be promoted.
Cost per query. Divide your total storage and compute costs by the number of queries served. If your cost per query is rising while your query volume is flat, data growth is outpacing your tiering policy. Tighten demotion rules or move more data to cold storage.
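These checks are straightforward to automate. A sketch, using the 50% utilization and one-second latency thresholds stated above:

```python
def tiering_health(hot_total: int, hot_queried_30d: int,
                   monthly_cost_usd: float, monthly_queries: int,
                   cold_p99_ms: float) -> dict:
    """Evaluate the three monitoring signals described above."""
    utilization = hot_queried_30d / hot_total
    return {
        "hot_utilization": utilization,
        "over_provisioned": utilization < 0.5,  # demote more aggressively
        "cold_tier_slow": cold_p99_ms > 1_000,  # optimize or promote
        "cost_per_query": monthly_cost_usd / monthly_queries,
    }
```

Track `cost_per_query` over time rather than as a point value: the signal is a rising trend against flat query volume.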
FAQ
Does S3 Vectors replace vector databases?
No. S3 Vectors is a cold-tier complement, not a hot-tier replacement. It excels at storing and querying vectors you access infrequently at a fraction of the cost of in-memory databases. For sub-10ms real-time search, you still need a vector database.
How much can tiering save?
60-90% of storage costs for deployments where less than 20% of data is actively queried. The exact savings depend on your hot/cold ratio, the pricing of your hot database, and your query volume.
Can I tier multimodal embeddings (video, image, audio)?
Yes. Embedding dimensions and data types are the same regardless of modality. The tiering decision is based on access patterns, not data type. Mixpeek supports tiered storage natively across all modalities.
What about query consistency across tiers?
Hot and cold tiers serve different copies of the same vectors. There is no consistency concern for immutable embeddings. If you update embeddings (e.g., after retraining a model), you need to update both tiers. Unified tiering systems handle this automatically since S3/GCS is the canonical store and the hot layer is derived from it.
When should I start tiering?
When your vector storage costs exceed $200/month or your total vector count exceeds 50 million. Below these thresholds, the operational complexity of tiering is not worth the savings. Above them, the savings compound quickly.
