Multimodal RAG
Retrieval-augmented generation across video, images, and text. Retrieve relevant multimodal context, then pass it to your LLM with citations back to source timestamps and frames.
"How did the product launch go? Cite specific video clips and document timestamps"
Why This Matters
RAG quality depends on retrieval quality. Mixpeek handles the multimodal retrieval infrastructure while you bring your preferred generation model.
```python
import requests
from openai import OpenAI

API_URL = "https://api.mixpeek.com"
headers = {"Authorization": "Bearer YOUR_API_KEY", "X-Namespace": "your-namespace"}
openai = OpenAI(api_key="your-openai-key")

# Retrieve multimodal context with citations
results = requests.post(
    f"{API_URL}/v1/retrievers/rag-retriever/execute",
    headers=headers,
    json={"query": {"text": "How did the product launch go?"}},
).json()

# Format context with source citations
context_str = "\n".join([
    f"[{i+1}] {doc['text']} (Source: {doc['root_object_id']} @ {doc['start_time']}s)"
    for i, doc in enumerate(results["documents"])
])

# Generate with your preferred LLM
response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": f"Answer based on this context:\n{context_str}"},
        {"role": "user", "content": "Summarize the product launch feedback with citations"},
    ],
)
print(response.choices[0].message.content)
```
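Because each retrieved document carries its source object ID and start time, you can surface the model's citations as deep links back to the original footage. The helper below is a minimal sketch: the `?t=<seconds>` query parameter and the mapping from `root_object_id` to a playback URL are assumptions about your own player, not part of the Mixpeek response.

```python
# Sketch: turn retrieved documents into citation links.
# The base URL and ?t= time parameter are hypothetical; map root_object_id
# to your own asset URLs however your video player expects.
def citation_links(documents, base_url="https://videos.example.com"):
    links = []
    for i, doc in enumerate(documents):
        url = f"{base_url}/{doc['root_object_id']}?t={int(doc['start_time'])}"
        links.append(f"[{i+1}] {url}")
    return links

for line in citation_links(results["documents"]):
    print(line)
```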
Feature Extractors
Text Embedding
Extract semantic embeddings from documents, transcripts, and text content
Image Embedding
Generate visual embeddings for similarity search and clustering
Video Embedding
Generate vector embeddings for video content
Audio Transcription
Transcribe audio content to text
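These extractors run at ingestion time so that transcripts, frames, and full clips are all searchable against the same query. The sketch below shows how enabling them on a collection might look; the endpoint path, payload shape, and extractor identifiers are assumptions for illustration, not the documented Mixpeek schema, so check the API reference for the actual names.

```python
import requests

API_URL = "https://api.mixpeek.com"
headers = {"Authorization": "Bearer YOUR_API_KEY", "X-Namespace": "your-namespace"}

# Hypothetical sketch: endpoint and field names below are illustrative only.
collection = requests.post(
    f"{API_URL}/v1/collections",  # assumed endpoint
    headers=headers,
    json={
        "name": "product-launch-footage",
        "feature_extractors": [   # assumed identifiers for the extractors above
            "text_embedding",
            "image_embedding",
            "video_embedding",
            "audio_transcription",
        ],
    },
).json()
print(collection)
```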
Retriever Stages
Feature Search
Search and filter documents by vector similarity using feature embeddings
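Conceptually, this stage applies any filters and then ranks candidate documents by vector similarity between the query embedding and each document's stored feature embedding. The local sketch below illustrates that ranking with toy vectors; it is not the Mixpeek implementation, and the document fields are made up.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def feature_search(query_vec, documents, top_k=3, filter_fn=None):
    """Filter documents, then rank them by embedding similarity to the query."""
    candidates = [d for d in documents if filter_fn is None or filter_fn(d)]
    ranked = sorted(candidates,
                    key=lambda d: cosine(query_vec, d["embedding"]),
                    reverse=True)
    return ranked[:top_k]

# Toy data: three documents with 3-d embeddings and a modality tag (illustrative).
docs = [
    {"id": "clip-1", "modality": "video", "embedding": [0.9, 0.1, 0.0]},
    {"id": "doc-7",  "modality": "text",  "embedding": [0.2, 0.8, 0.1]},
    {"id": "clip-4", "modality": "video", "embedding": [0.7, 0.3, 0.1]},
]
print(feature_search([1.0, 0.0, 0.0], docs, top_k=2,
                     filter_fn=lambda d: d["modality"] == "video"))
```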
