Learn how to build an MCP (Multimodal Context Protocol) pipeline directly on top of S3 using Lambda (for change detection), Temporal (for orchestration), Ray (for scalable compute), and Qdrant (for vector search).
🧩 Architecture Overview

To build a fully event-driven MCP pipeline that indexes unstructured content from S3 (videos, PDFs, images, logs), here’s what you’ll need:

- Amazon S3 — your raw unstructured data source
- AWS Lambda — triggers feature extraction on new data (CDC style)
- Temporal — orchestration and retries across modalities
- Ray — distributed execution for compute-heavy tasks (e.g., video segmentation, embedding)
- Qdrant — vector DB to store indexed representations

📥 Example: From Upload to Retrieval

Once this pipeline is in place, the flow becomes simple:
1. Upload a file (e.g. customer-incident.mp4) to S3
2. Trigger runs via Lambda → Temporal → Ray to extract embeddings
3. Store outputs in Qdrant (vectors) and Postgres (metadata)
4. Query Qdrant with a prompt like: “Show me all videos where someone slips and falls indoors” (see the query sketch below)

The system surfaces relevant clips — no manual tagging required.
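Here is roughly what that retrieval step could look like against Qdrant. This is a minimal sketch, assuming a "videos" collection, an `embed_text` helper that produces vectors in the same space used at indexing time, and a local Qdrant endpoint; none of these names come from the pipeline itself:

```python
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")  # placeholder endpoint

prompt = "Show me all videos where someone slips and falls indoors"
hits = client.search(
    collection_name="videos",         # assumed collection name
    query_vector=embed_text(prompt),  # assumed text/multimodal embedder
    limit=10,
)
for hit in hits:
    print(hit.payload.get("s3_key"), hit.score)
```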
🔗 System Flow

1. Change Detection (CDC): S3 → EventBridge → Lambda (wiring sketch below)
2. Orchestration: Lambda kicks off a Temporal workflow
3. Distributed Processing: Temporal starts Ray tasks
4. Indexing: Outputs pushed to Qdrant (for vector search) and optionally Postgres (for metadata)
```mermaid
graph TD
    A[S3 - New File] --> B[EventBridge Trigger]
    B --> C[AWS Lambda - CDC]
    C --> D[Temporal Workflow - Orchestration]
    D --> E[Ray Tasks - Feature Extraction]
    E --> F[Qdrant - Vector Index]
    E --> G[Postgres - Metadata Store]
```
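Wiring up the change-detection leg of that flow is mostly configuration. Below is a hedged boto3 sketch of the S3 → EventBridge → Lambda hookup; the bucket name, rule name, and Lambda ARN are placeholders, not values from this post:

```python
import json
import boto3

s3 = boto3.client("s3")
events = boto3.client("events")

# 1. Have the bucket emit object-level events to EventBridge
s3.put_bucket_notification_configuration(
    Bucket="raw-multimodal-data",  # placeholder bucket
    NotificationConfiguration={"EventBridgeConfiguration": {}},
)

# 2. Route "Object Created" events for that bucket to the CDC Lambda
events.put_rule(
    Name="s3-new-object",
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": ["raw-multimodal-data"]}},
    }),
    State="ENABLED",
)
events.put_targets(
    Rule="s3-new-object",
    Targets=[{"Id": "cdc-lambda", "Arn": "arn:aws:lambda:us-east-1:123456789012:function:mcp-cdc"}],
)
```

You would also grant EventBridge permission to invoke the function (via Lambda's add_permission) before events start flowing.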
⚙️ Infrastructure Breakdown

Step 1: S3 CDC via Lambda

```python
def lambda_handler(event, context):
    # EventBridge delivers S3 "Object Created" events with the key under "detail"
    s3_key = event["detail"]["object"]["key"]
    # Kick off the Temporal workflow for this file (simplified client call;
    # see the temporalio sketch below)
    temporal_client.start_workflow("ProcessMultimodalFile", args={"key": s3_key})
```
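The `temporal_client` object above is shorthand. With the official temporalio Python SDK, the handler would connect a client and start the workflow roughly like this; the server address, task queue, and workflow ID scheme are assumptions for illustration:

```python
import asyncio
from temporalio.client import Client

def lambda_handler(event, context):
    s3_key = event["detail"]["object"]["key"]
    asyncio.run(_start_workflow(s3_key))

async def _start_workflow(s3_key: str) -> None:
    # Address, task queue, and ID scheme are illustrative placeholders
    client = await Client.connect("temporal.internal:7233")
    await client.start_workflow(
        "ProcessMultimodalFile",
        s3_key,
        id=f"process-{s3_key}",
        task_queue="multimodal-pipeline",
    )
```

Reconnecting per invocation keeps the sketch simple; a real Lambda would typically cache the client across warm invocations.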
Step 2: Temporal Workflow

```python
from datetime import timedelta
from temporalio import workflow

@workflow.defn
class ProcessMultimodalFile:
    @workflow.run
    async def run(self, key):
        # S3 download and embedding run as activities so the workflow itself stays deterministic
        video_path = await workflow.execute_activity(download_from_s3, key, start_to_close_timeout=timedelta(minutes=10))
        await workflow.execute_activity(extract_and_embed, video_path, start_to_close_timeout=timedelta(minutes=60))
```
Step 3: Ray Feature Extraction

```python
@ray.remote
def extract_and_embed(path):
    # Fan out per-modality embedders as a distributed Ray task
    vision = vision_embedder(path)
    audio = audio_embedder(path)
    # Simplified write; see the upsert sketch below for the real qdrant-client call
    qdrant_client.insert([vision, audio])
```
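One detail the snippets gloss over: a `@ray.remote` function is not itself a Temporal activity, and `qdrant_client.insert` is shorthand for Qdrant's real write API. A thin activity can bridge the two, dispatching to Ray and then upserting the vectors. This is a sketch, not the post's implementation: the wrapper name, the hypothetical `extract_embeddings` Ray task (a variant of the one above that returns vectors instead of writing them), the collection name, and the endpoint are all assumptions.

```python
import uuid

import ray
from temporalio import activity
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

qdrant = QdrantClient(url="http://localhost:6333")  # placeholder endpoint

@activity.defn
def extract_and_embed_activity(path: str) -> None:
    # Hypothetical Ray task that returns (vision_vec, audio_vec) instead of writing to Qdrant itself
    vision_vec, audio_vec = ray.get(extract_embeddings.remote(path))

    # Store one point per modality, tagged so queries can filter on it
    qdrant.upsert(
        collection_name="videos",
        points=[
            PointStruct(id=str(uuid.uuid4()), vector=vision_vec,
                        payload={"s3_key": path, "modality": "vision"}),
            PointStruct(id=str(uuid.uuid4()), vector=audio_vec,
                        payload={"s3_key": path, "modality": "audio"}),
        ],
    )
```

The workflow's `execute_activity` call would then reference this wrapper, and a Temporal worker would register both the workflow and the activity on the same task queue.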
🧠 What’s Indexed?

| Modality | Examples Extracted |
| --- | --- |
| Vision | Object detection, OCR, visual style |
| Audio | Speech-to-text, speaker ID |
| Text | Entities, sentiment, topic modeling |
| PDF/Image | Layouts, diagrams, handwriting |
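Because each point can carry a modality tag in its payload, queries can be scoped to a single row of that table. Here is a sketch using qdrant-client's payload filtering; the collection name, tag values, and `embed_text` helper are the same assumptions as in the earlier sketches:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://localhost:6333")  # placeholder endpoint

# Restrict a semantic query to points tagged with the "vision" modality
hits = client.search(
    collection_name="videos",
    query_vector=embed_text("person slipping on a wet floor"),  # assumed embedder
    query_filter=Filter(
        must=[FieldCondition(key="modality", match=MatchValue(value="vision"))]
    ),
    limit=5,
)
```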
🏗️ Mixpeek’s Fully Managed MCP Stack

See Architecture - Mixpeek for an overview of Mixpeek’s core architecture and data flow.
🔥 Long-Tail Use Cases

| Industry | Use Case Example |
| --- | --- |
| Insurance | Detect slip-and-fall claims via video ingestion from branch cameras |
| Healthcare | Search for similar MRI results across a decade of unstructured imaging files |
| Education | Auto-index and summarize lecture videos for semantic search |
| Security | Tag suspicious behavior patterns across thousands of archived CCTV feeds |
| Media | Find moments of laughter or applause in podcast audio archives |
| Logistics | Scan forklift usage across warehouse footage to predict operator burnout |
🧩 Skip the Plumbing

You could glue this all together — or you can use Mixpeek, which handles:

- Multimodal ingestion pipelines
- Zero-config embedding + indexing
- Query-ready APIs for search, alerts, and retrieval
- Event-based workflows with zero devops

Built for developers who don’t want to reinvent multimodal infra.