Your AI agent reads text and calls APIs, but it cannot watch a video or listen to a recording. Mixpeek adds that capability: it decomposes video into searchable features (transcripts, scene embeddings, keyframes) and exposes them through a retrieval API your agent can call like any other tool. In this guide you build a LangChain agent that indexes a video and answers natural-language questions about its contents.

What You’ll Build

A LangChain agent with a video_search tool backed by Mixpeek. The agent accepts a plain-English question, searches indexed video by transcript and visual similarity, and returns timestamped results.

Prerequisites

pip install mixpeek langchain langchain-openai
You also need:
  • A Mixpeek API key — get one at mixpeek.com/start
  • An OpenAI API key (for the LangChain agent LLM)
export MIXPEEK_API_KEY="sk_live_replace_me"
export OPENAI_API_KEY="sk-replace_me"

Step 1: Create a namespace

A namespace isolates all storage and compute for a project. Create one with the multimodal_extractor feature extractor enabled.
import os

from mixpeek import Mixpeek

# Reads the MIXPEEK_API_KEY you exported in Prerequisites
client = Mixpeek(api_key=os.environ["MIXPEEK_API_KEY"])

ns = client.namespaces.create(
    namespace_name="agent-video-demo",
    description="Video search for LangChain agent",
    feature_extractors=[
        {"feature_extractor_name": "multimodal_extractor", "version": "v1"}
    ]
)

namespace_id = ns.namespace_id
print(f"Namespace: {namespace_id}")
Save the returned namespace_id — every subsequent call requires it.

Step 2: Create a bucket and upload a video

Buckets define the schema for incoming objects. Create one that accepts a video URL, then register an object pointing to a sample video.
bucket = client.buckets.create(
    bucket_name="demo-videos",
    namespace_id=namespace_id,
    schema={
        "properties": {
            "video_url": {"type": "url", "required": True}
        }
    }
)

bucket_id = bucket.bucket_id

obj = client.objects.create(
    bucket_id=bucket_id,
    namespace_id=namespace_id,
    key_prefix="/samples",
    blobs=[
        {
            "property": "video_url",
            "type": "video",
            "url": "https://storage.googleapis.com/mixpeek-public-demo/videos/sample-product-demo.mp4"
        }
    ]
)

object_id = obj.object_id
print(f"Object: {object_id}")

Step 3: Create a collection and process the video

A collection binds a bucket to a feature extractor. When you submit a batch, the engine decomposes the video into scene embeddings, keyframes, and transcripts.
col = client.collections.create(
    collection_name="video-scenes",
    namespace_id=namespace_id,
    source={"type": "bucket", "bucket_id": bucket_id},
    feature_extractor={
        "feature_extractor_name": "multimodal_extractor",
        "version": "v1",
        "input_mappings": {"video": "payload.video_url"},
        "parameters": {
            "split_method": "scene",
            "scene_detection_threshold": 0.5,
            "run_transcription": True,
            "run_multimodal_embedding": True,
            "run_video_description": True,
        }
    }
)

collection_id = col.collection_id

# Create and submit a batch
batch = client.batches.create(
    bucket_id=bucket_id,
    namespace_id=namespace_id,
    object_ids=[object_id]
)

result = client.batches.submit(
    bucket_id=bucket_id,
    batch_id=batch.batch_id,
    namespace_id=namespace_id
)

task_id = result.task_id
print(f"Task: {task_id}")

Step 4: Wait for processing

Video processing takes 1-5 minutes depending on length. Poll the task endpoint until status is COMPLETED.
import time

deadline = time.time() + 600  # give up after 10 minutes

while True:
    task = client.tasks.get(task_id=task_id, namespace_id=namespace_id)
    print(f"Status: {task.status}")
    if task.status == "COMPLETED":
        break
    if task.status == "FAILED":
        raise RuntimeError(f"Processing failed: {task.error}")
    if time.time() > deadline:
        raise TimeoutError("Video processing did not finish within 10 minutes")
    time.sleep(5)

print("Video processed — documents ready for search")
For production use, register a webhook instead of polling. Mixpeek sends a batch.completed event when processing finishes.

Step 5: Create a retriever

A retriever defines how search queries map to indexed features. This one performs semantic search over scene embeddings extracted by multimodal_extractor.
ret = client.retrievers.create(
    retriever_name="agent-video-search",
    namespace_id=namespace_id,
    description="Semantic search over video scenes",
    input_schema={
        "properties": {
            "query_text": {"type": "text", "required": True}
        }
    },
    collection_ids=[collection_id],
    stages=[
        {
            "stage_type": "filter",
            "stage_id": "feature_search",
            "parameters": {
                "feature_uri": "mixpeek://multimodal_extractor@v1/multimodal_embedding",
                "input": {"text": "{{INPUT.query_text}}"},
                "limit": 20
            }
        }
    ],
    cache_config={"enabled": True, "ttl_seconds": 300}
)

retriever_id = ret.retriever_id
print(f"Retriever: {retriever_id}")

Step 6: Wire it as a LangChain tool

Wrap the retriever’s execute method as a LangChain Tool. The agent calls this tool whenever it needs to search video content.
from langchain.tools import Tool
from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder


def search_video(query: str) -> str:
    """Search indexed video content by natural language query."""
    results = client.retrievers.execute(
        retriever_id=retriever_id,
        namespace_id=namespace_id,
        inputs={"query_text": query},
        limit=5
    )
    # Format results with timestamps for the agent
    formatted = []
    for r in results.results:
        start = r.metadata.get("start_time", "?")
        end = r.metadata.get("end_time", "?")
        score = r.score
        formatted.append(
            f"[{start}s - {end}s] (score: {score:.3f}) {r.metadata.get('description', '')}"
        )
    return "\n".join(formatted) if formatted else "No results found."


video_search_tool = Tool(
    name="video_search",
    description=(
        "Search indexed video content by natural language. "
        "Returns timestamped scene matches with relevance scores. "
        "Use this when the user asks about what happens in a video."
    ),
    func=search_video
)

# Build the agent
llm = ChatOpenAI(model="gpt-4o", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You answer questions about video content. Use the video_search tool to find relevant moments."),
    ("human", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad"),
])

agent = create_openai_tools_agent(llm, [video_search_tool], prompt)
executor = AgentExecutor(agent=agent, tools=[video_search_tool], verbose=True)

# Run a query
response = executor.invoke({
    "input": "What product features are demonstrated in the video?"
})

print(response["output"])
Run the script. The agent calls video_search, receives timestamped scene matches, and synthesizes a natural-language answer.

What Just Happened

Here is the pipeline you built:
  1. Namespace created an isolated environment with multimodal_extractor enabled
  2. Bucket + Object registered a video URL with a defined schema
  3. Collection + Batch triggered the engine to decompose the video into scene segments, each with embeddings, keyframes, and timestamps
  4. Retriever defined a search interface over those scene embeddings
  5. LangChain Tool wrapped the retriever so your agent can query video content in plain English
The agent does not download or process video itself. It calls the Mixpeek retriever, which returns pre-indexed results in milliseconds. You can add more videos to the bucket and reprocess — the retriever automatically searches across all indexed content.

Next Steps

Add transcript search

Add transcript embeddings to your retriever for hybrid visual + spoken-content search.
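One way to sketch this is a stage list with two feature-search stages, one per embedding space. The transcript feature URI (`transcript_embedding`) and the idea of stacking two search stages are assumptions for illustration — confirm the real feature URIs exposed by multimodal_extractor in the Mixpeek docs:

```python
# Hypothetical hybrid stage list: visual search plus transcript search.
# The transcript feature URI is an assumption, not confirmed API.
hybrid_stages = [
    {
        "stage_type": "filter",
        "stage_id": "visual_search",
        "parameters": {
            "feature_uri": "mixpeek://multimodal_extractor@v1/multimodal_embedding",
            "input": {"text": "{{INPUT.query_text}}"},
            "limit": 20,
        },
    },
    {
        "stage_type": "filter",
        "stage_id": "transcript_search",
        "parameters": {
            # assumed URI for transcript embeddings
            "feature_uri": "mixpeek://multimodal_extractor@v1/transcript_embedding",
            "input": {"text": "{{INPUT.query_text}}"},
            "limit": 20,
        },
    },
]

print([s["stage_id"] for s in hybrid_stages])
```

You would pass a list like this as the `stages` argument to `client.retrievers.create`, in place of the single-stage list used earlier.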

Retriever stages

Add reranking, filtering, and enrichment stages to your retriever pipeline.
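The retriever built earlier used a single stage; a multi-stage pipeline might look like the following sketch, built as plain data. The `rerank` stage type and its parameters are assumptions for illustration, not confirmed Mixpeek API:

```python
# Hypothetical two-stage pipeline: broad feature search, then a rerank
# pass that narrows the candidates. Only the feature-search stage shape
# matches what this guide used; the rerank stage is an assumption.
rerank_stages = [
    {
        "stage_type": "filter",
        "stage_id": "feature_search",
        "parameters": {
            "feature_uri": "mixpeek://multimodal_extractor@v1/multimodal_embedding",
            "input": {"text": "{{INPUT.query_text}}"},
            "limit": 50,  # over-fetch, then rerank down
        },
    },
    {
        "stage_type": "rerank",  # assumed stage type
        "stage_id": "rerank_top",
        "parameters": {"limit": 5},
    },
]

print([s["stage_id"] for s in rerank_stages])
```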

Webhooks

Replace polling with event-driven processing notifications.