Build a Model Context Protocol (MCP) Server on S3 using Lambda, Temporal, Ray, and Qdrant
Build a scalable MCP pipeline on S3 using AWS Lambda, Temporal, Ray, and Qdrant to process and index unstructured data like video, audio, and PDFs for real-time AI search and retrieval.

Learn how to build an MCP pipeline directly on top of S3 using AWS Lambda (for change detection), Temporal (for orchestration), Ray (for scalable compute), and Qdrant (for vector search).
🧩 Architecture Overview
To build a fully event-driven MCP pipeline that indexes unstructured content from S3 (videos, PDFs, images, logs), here’s what you’ll need:
- Amazon S3 — your raw unstructured data source
- AWS Lambda — triggers feature extraction on new data (CDC style)
- Temporal — orchestration and retries across modalities
- Ray — distributed execution for compute-heavy tasks (e.g., video segmentation, embedding)
- Qdrant — vector DB to store indexed representations
📥 Example: From Upload to Retrieval
Once this pipeline is in place, the flow becomes simple:
- Upload a file (e.g. customer-incident.mp4) to S3
- A trigger runs via Lambda → Temporal → Ray to extract embeddings
- Store outputs in Qdrant (vectors) and Postgres (metadata)
Query Qdrant with a prompt like:
“Show me all videos where someone slips and falls indoors”
The system surfaces relevant clips — no manual tagging required.
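Under the hood, that prompt is embedded and compared against the stored vectors. Here is a minimal sketch of the retrieval call, assuming a hypothetical text_embedder that maps the prompt into the same embedding space used at index time and a collection named clips:

from qdrant_client import QdrantClient

qdrant = QdrantClient(url="http://localhost:6333")  # placeholder URL

# Embed the natural-language prompt with the same model family used for indexing
query_vector = text_embedder("Show me all videos where someone slips and falls indoors")

# Nearest-neighbor search over indexed clips; payloads carry S3 keys, timestamps, etc.
hits = qdrant.search(collection_name="clips", query_vector=query_vector, limit=10)
for hit in hits:
    print(hit.score, hit.payload)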
🔗 System Flow
- Change Detection (CDC): S3 → EventBridge → Lambda
- Orchestration: Lambda kicks off a Temporal workflow
- Distributed Processing: Temporal starts Ray tasks
- Indexing: Outputs pushed to Qdrant (for vector search) and optionally Postgres (for metadata)
⚙️ Infrastructure Breakdown
Step 1: S3 CDC via Lambda
import asyncio

def lambda_handler(event, context):
    s3_key = event["Records"][0]["s3"]["object"]["key"]
    # Hand the new object off to Temporal (async client setup omitted;
    # the workflow id and task-queue name below are illustrative)
    asyncio.run(temporal_client.start_workflow(
        "ProcessMultimodalFile", s3_key,
        id=f"process-{s3_key}", task_queue="multimodal-processing"))
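The S3 → EventBridge → Lambda hookup itself is just a few API calls. Below is a sketch using boto3, where the bucket name, rule name, and lambda_arn are placeholders; granting EventBridge permission to invoke the function (lambda add-permission) is omitted:

import json
import boto3

s3 = boto3.client("s3")
events = boto3.client("events")

lambda_arn = "arn:aws:lambda:<region>:<account>:function:mcp-trigger"  # placeholder

# Send this bucket's object-level events to EventBridge
s3.put_bucket_notification_configuration(
    Bucket="raw-uploads",
    NotificationConfiguration={"EventBridgeConfiguration": {}},
)

# Match "Object Created" events for the bucket and point them at the handler Lambda
events.put_rule(
    Name="mcp-object-created",
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": ["raw-uploads"]}},
    }),
)
events.put_targets(
    Rule="mcp-object-created",
    Targets=[{"Id": "mcp-lambda", "Arn": lambda_arn}],
)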
Step 2: Temporal Workflow
from datetime import timedelta
from temporalio import workflow

@workflow.defn
class ProcessMultimodalFile:
    @workflow.run
    async def run(self, key: str):
        # All I/O happens in activities so the workflow itself stays deterministic
        video_path = await workflow.execute_activity(
            download_from_s3, key, start_to_close_timeout=timedelta(minutes=10))
        await workflow.execute_activity(
            extract_and_embed, video_path, start_to_close_timeout=timedelta(hours=1))
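Neither the workflow nor its activities run until a Temporal worker is polling the task queue from somewhere (a container, an EC2 instance, or the Ray head node). A minimal sketch of that worker, reusing the illustrative multimodal-processing queue from Step 1; the server address and executor settings are assumptions, not prescriptions:

import asyncio
from concurrent.futures import ThreadPoolExecutor
from temporalio.client import Client
from temporalio.worker import Worker

async def main():
    client = await Client.connect("temporal.internal:7233")  # placeholder address
    worker = Worker(
        client,
        task_queue="multimodal-processing",
        workflows=[ProcessMultimodalFile],
        # Activity functions the workflow calls (extract_and_embed is covered in Step 3)
        activities=[download_from_s3, extract_and_embed],
        activity_executor=ThreadPoolExecutor(max_workers=4),  # needed for non-async activities
    )
    await worker.run()

if __name__ == "__main__":
    asyncio.run(main())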
Step 3: Ray Tasks for Feature Extraction
import ray

@ray.remote
def extract_and_embed(path):
    # One embedding per modality; embedder setup omitted
    vision = vision_embedder(path)
    audio = audio_embedder(path)
    # Upsert both vectors into Qdrant (collection name is illustrative;
    # real code wraps vectors in points with ids and payloads)
    qdrant_client.upsert(collection_name="clips", points=[vision, audio])
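One detail worth calling out: Step 2 invokes extract_and_embed as a Temporal activity, while the function above is a Ray task. A common way to bridge the two is a thin activity that submits the Ray task and waits for it; here is a sketch, assuming the Ray task lives in a separate module (called ray_tasks here, purely hypothetical) so the names don't collide:

from temporalio import activity
import ray

import ray_tasks  # hypothetical module holding the @ray.remote extract_and_embed above

@activity.defn
def extract_and_embed(path: str) -> None:
    # Connect to the running Ray cluster, then block until the remote task finishes
    ray.init(address="auto", ignore_reinit_error=True)
    ray.get(ray_tasks.extract_and_embed.remote(path))

This keeps the division of labor clean: Temporal owns retries and visibility, Ray owns the heavy compute.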
🧠 What’s Indexed?
| Modality | Examples Extracted |
|---|---|
| Vision | Object detection, OCR, visual style |
| Audio | Speech-to-text, speaker ID |
| Text | Entities, sentiment, topic modeling |
| PDF/Image | Layouts, diagrams, handwriting |
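On the storage side, one way to model these modalities in Qdrant is a single collection with a named vector per modality. A sketch follows; the collection name and vector sizes are arbitrary and depend on the embedders you pick:

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

qdrant = QdrantClient(url="http://localhost:6333")  # placeholder URL
qdrant.create_collection(
    collection_name="clips",
    vectors_config={
        "vision": VectorParams(size=512, distance=Distance.COSINE),
        "audio": VectorParams(size=768, distance=Distance.COSINE),
        "text": VectorParams(size=384, distance=Distance.COSINE),
    },
)

With named vectors, a search can target one modality at a time (for example, query_vector=("vision", embedding)), while payload fields carry the S3 key, timestamps, and other metadata alongside whatever lands in Postgres.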
🏗️ Mixpeek’s Fully Managed MCP Stack



🔥 Long-Tail Use Cases
| Industry | Use Case Example |
|---|---|
| Insurance | Detect slip-and-fall claims via video ingestion from branch cameras |
| Healthcare | Search for similar MRI results across a decade of unstructured imaging files |
| Education | Auto-index and summarize lecture videos for semantic search |
| Security | Tag suspicious behavior patterns across thousands of archived CCTV feeds |
| Media | Find moments of laughter or applause in podcast audio archives |
| Logistics | Scan forklift usage across warehouse footage to predict operator burnout |
🧩 Skip the Plumbing
You could glue this all together — or you can use Mixpeek, which handles:
- Multimodal ingestion pipelines
- Zero-config embedding + indexing
- Query-ready APIs for search, alerts, and retrieval
- Event-based workflows with zero DevOps overhead
Built for developers who don’t want to reinvent multimodal infra.