Why Agents Need Multimodal Search Tools
Large language models can reason, plan, and write code. What they cannot do is perceive the world directly. An LLM has no eyes, no ears, no access to your video library, your document archive, or your audio recordings. It works entirely from text passed into its context window.
This is the perception gap. An agent that needs to answer "find the meeting where the CFO discussed the budget cut" must somehow search across video frames, audio transcripts, and slide decks -- modalities the LLM cannot natively process. The agent needs tools that translate its text-based reasoning into multimodal search operations and return structured results it can act on.
The Model Context Protocol (MCP) has emerged as a de facto standard interface between agents and external capabilities. Introduced by Anthropic in late 2024 and now supported, natively or through adapters, by most major agent frameworks (LangChain, LlamaIndex, OpenAI Agents SDK, AutoGen), MCP defines a JSON-RPC 2.0 protocol through which a server exposes tools, resources, and prompts that any conforming host can discover and invoke.
But most MCP tool implementations are simple wrappers around text APIs. Exposing multimodal search -- where inputs and outputs span images, audio, video, and documents -- requires careful schema design, efficient embedding handoff, and result formats that give the agent enough context to reason about visual and auditory content it cannot directly see or hear.
This guide covers the architecture of MCP tools for multimodal search, from protocol fundamentals through production implementation patterns.
MCP Protocol Fundamentals
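MCP uses JSON-RPC 2.0 over one of two standard transports: stdio for local servers, or Streamable HTTP for remote ones (which can stream responses via Server-Sent Events and replaced the earlier HTTP+SSE transport). A server exposes three primitive types: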
Tools are callable functions with typed input schemas. The agent decides when to call them based on its reasoning. A search tool takes a query and returns results.
Resources are read-only data the agent can access on demand. A resource might be a collection listing, a schema description, or a cached search result. Resources have URIs and can be subscribed to for updates.
Prompts are reusable templates that guide the agent on how to use the tools effectively. A multimodal search prompt might teach the agent how to decompose a complex query into modality-specific sub-queries.
The key architectural insight is that MCP separates discovery from invocation. The agent first discovers available tools via `tools/list`, inspects their schemas, and then decides which tools to call based on the user's request. This means your tool schemas are the primary interface -- they must be expressive enough for the agent to understand what the tool can do without documentation.
Designing Tool Schemas for Search
The most critical design decision is schema granularity: should you expose one `search` tool that handles all modalities, or separate tools per modality?
The Monolithic Approach
A single `multimodal_search` tool accepts a text query and optional filters:
{
  "name": "multimodal_search",
  "description": "Search across video, images, audio, and documents using natural language. Returns ranked results with timestamps, thumbnails, and relevance scores.",
  "inputSchema": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "Natural language search query"
      },
      "collections": {
        "type": "array",
        "items": { "type": "string" },
        "description": "Collection IDs to search. Omit to search all."
      },
      "modalities": {
        "type": "array",
        "items": { "enum": ["video", "image", "audio", "document"] },
        "description": "Filter results to specific content types"
      },
      "top_k": {
        "type": "integer",
        "default": 10,
        "description": "Number of results to return"
      }
    },
    "required": ["query"]
  }
}
This works when the agent's query maps cleanly to a single retrieval operation. The agent writes "find the meeting where the CFO discussed budget cuts" and the tool handles modality routing internally.
The Composable Approach
Separate tools give the agent explicit control over the retrieval strategy:
[
  {
    "name": "search_video_frames",
    "description": "Find video frames matching a visual description. Returns frame timestamps, thumbnails, and similarity scores.",
    "inputSchema": {
      "type": "object",
      "properties": {
        "visual_query": { "type": "string" },
        "collection_id": { "type": "string" },
        "top_k": { "type": "integer", "default": 10 }
      },
      "required": ["visual_query", "collection_id"]
    }
  },
  {
    "name": "search_transcripts",
    "description": "Search audio/video transcripts by text content. Returns transcript segments with timestamps and speaker labels.",
    "inputSchema": {
      "type": "object",
      "properties": {
        "text_query": { "type": "string" },
        "collection_id": { "type": "string" },
        "speaker_filter": { "type": "string" },
        "top_k": { "type": "integer", "default": 10 }
      },
      "required": ["text_query", "collection_id"]
    }
  },
  {
    "name": "search_documents",
    "description": "Semantic search over document text and OCR content. Returns document sections with page numbers and extracted text.",
    "inputSchema": {
      "type": "object",
      "properties": {
        "query": { "type": "string" },
        "collection_id": { "type": "string" },
        "doc_type_filter": { "type": "string" },
        "top_k": { "type": "integer", "default": 10 }
      },
      "required": ["query", "collection_id"]
    }
  }
]
The composable approach is more powerful in practice. When the agent needs to find "the meeting where the CFO discussed budget cuts," it can decompose this into: (1) search transcripts for "budget cuts" filtered to the CFO's speaker label, (2) search video frames for a person at a conference table with slides, (3) cross-reference the timestamps. This multi-step reasoning is what agents excel at -- but only if the tools expose enough modality-specific controls.
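The cross-referencing step in that decomposition can be pictured as a join on asset and time window. The sketch below is one way to do it, under assumptions of my own: the `Hit` shape, the 5-second tolerance, and the score product used for ranking are illustrative choices, not something the tools above prescribe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    asset_id: str
    start: float  # seconds
    end: float    # seconds
    score: float


def overlaps(a: Hit, b: Hit, tolerance: float = 5.0) -> bool:
    """True when both hits come from the same asset and their time windows
    overlap or fall within `tolerance` seconds of each other."""
    return (
        a.asset_id == b.asset_id
        and a.start <= b.end + tolerance
        and b.start <= a.end + tolerance
    )


def cross_reference(transcript_hits, frame_hits, tolerance: float = 5.0):
    """Pair transcript and frame hits that agree on asset and time, ranked by
    the product of their scores (an arbitrary choice for this sketch)."""
    pairs = [
        (t, f, t.score * f.score)
        for t in transcript_hits
        for f in frame_hits
        if overlaps(t, f, tolerance)
    ]
    return sorted(pairs, key=lambda p: p[2], reverse=True)


transcript_hits = [
    Hit("vid_abc123", 142.5, 156.3, 0.92),
    Hit("vid_xyz789", 30.0, 41.0, 0.80),
]
frame_hits = [
    Hit("vid_abc123", 150.0, 150.0, 0.85),   # frame inside the CFO segment
    Hit("vid_xyz789", 900.0, 900.0, 0.88),   # same video, different moment
]

matches = cross_reference(transcript_hits, frame_hits)
for t, f, combined in matches:
    print(t.asset_id, t.start, f.start, round(combined, 3))
```

Only the pair from `vid_abc123` survives, because the two hits in `vid_xyz789` are nearly fifteen minutes apart.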
Rule of thumb: start with the composable approach. You can always add a convenience `multimodal_search` tool later that internally calls the composable tools. Going the other direction -- decomposing a monolithic tool into composable ones -- requires the agent to learn a new tool set.
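A convenience wrapper of that kind might look like the following sketch. The three inner functions are stand-ins that return canned data; in a real server they would be the composable tools themselves. Grouping the output by tool rather than merging it is an assumption made here to keep the example small.

```python
import asyncio


# Stand-ins for the composable tools. Real versions would query the backend.
async def search_video_frames(visual_query: str, collection_id: str, top_k: int = 10) -> list[dict]:
    return [{"asset_id": "vid_abc123", "score": 0.85}][:top_k]


async def search_transcripts(text_query: str, collection_id: str, top_k: int = 10) -> list[dict]:
    return [{"asset_id": "vid_abc123", "score": 0.92}][:top_k]


async def search_documents(query: str, collection_id: str, top_k: int = 10) -> list[dict]:
    return [{"asset_id": "doc_q4_deck", "score": 0.77}][:top_k]


async def multimodal_search(query: str, collection_id: str, top_k: int = 10) -> dict[str, list[dict]]:
    """Fan the same query out to every composable tool concurrently and
    return the results grouped by the tool that produced them."""
    frames, transcripts, documents = await asyncio.gather(
        search_video_frames(query, collection_id, top_k),
        search_transcripts(query, collection_id, top_k),
        search_documents(query, collection_id, top_k),
    )
    return {"video_frames": frames, "transcripts": transcripts, "documents": documents}


grouped = asyncio.run(multimodal_search("CFO budget cuts", "executive-meetings", top_k=5))
print(grouped)
```

Because the wrapper only delegates, the composable tools remain the source of truth and the agent can still call them directly when it needs finer control.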
The Embedding Handoff Problem
When an agent calls a search tool, it sends a text query. The tool must convert that text into the right kind of embedding for the target modality. This is the embedding handoff -- and it is where most naive MCP implementations break down.
Consider what happens when an agent searches for "a red sports car parked in front of a glass building." The tool needs to:
1. Decide whether this is a visual query (search by image similarity) or a text query (search by caption/transcript match)
2. Select the right embedding model for the target modality
3. Encode the query into the model's vector space
4. Run the vector search against the correct index
The naive approach is to embed the query text with a text encoder and search a text embedding index. But the user's query describes a visual scene. The correct approach is to use a cross-modal embedding model (like CLIP or SigLIP) that maps text into the same vector space as image embeddings, so the text query "red sports car" retrieves images that visually match even if no caption mentions cars.
This means the MCP tool must know which embedding model was used at ingest time and route the query through the corresponding encoder. A search over CLIP-indexed video frames requires a CLIP text encoder. A search over transcripts indexed with BGE requires a BGE query encoder. Mixing encoders produces garbage results because the vector spaces are incompatible.
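One simple way to enforce this is a registry that records which model produced each indexed feature and refuses to encode a query with anything else. The sketch below uses toy encoders in place of real CLIP and BGE models; the registry names and feature keys are assumptions for illustration.

```python
from typing import Callable

Encoder = Callable[[str], list[float]]


# Toy encoders with different output sizes, standing in for real models.
def fake_clip_text_encoder(text: str) -> list[float]:
    return [float(len(text))] * 4


def fake_bge_query_encoder(text: str) -> list[float]:
    return [float(len(text))] * 8


# Recorded at ingest time: which model produced each indexed feature.
INGEST_MODELS = {"image_embedding": "clip", "transcript": "bge"}

# Query-side encoder for each model family.
QUERY_ENCODERS: dict[str, Encoder] = {
    "clip": fake_clip_text_encoder,
    "bge": fake_bge_query_encoder,
}


def encode_query(text: str, feature: str) -> tuple[str, list[float]]:
    """Encode `text` with the query encoder that matches the feature's ingest model."""
    try:
        model = INGEST_MODELS[feature]
    except KeyError:
        raise ValueError(
            f"Unknown feature {feature!r}; known features: {sorted(INGEST_MODELS)}"
        ) from None
    return model, QUERY_ENCODERS[model](text)


model, vector = encode_query("a red sports car", "image_embedding")
print(model, len(vector))
```

Failing loudly on an unknown feature is deliberate: a silent fallback to a default encoder is exactly how incompatible vector spaces end up being compared.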
The Retriever Abstraction
The cleanest pattern is to abstract this behind a retriever that knows its own embedding configuration:
class MultimodalSearchTool:
    def __init__(self, mixpeek_client):
        self.client = mixpeek_client

    async def search(self, query: str, collection_id: str, top_k: int = 10):
        # The retriever knows which models were used at ingest
        # and routes the query through the correct encoders
        results = await self.client.retrievers.retrieve(
            queries=[{"type": "text", "value": query}],
            collection_ids=[collection_id],
            stages=[
                {
                    "type": "feature_search",
                    "feature": "image_embedding",
                    "top_k": top_k * 3  # over-retrieve for reranking
                },
                {
                    "type": "feature_search",
                    "feature": "transcript",
                    "top_k": top_k * 3
                },
                {
                    "type": "rerank",
                    "model": "Qwen/Qwen3-VL-Reranker-2B",
                    "top_k": top_k
                }
            ]
        )
        return results
The retriever handles model selection, multi-stage fusion, and reranking internally. The MCP tool exposes a clean text-in, results-out interface. The agent never needs to know about embedding models or vector spaces -- it just describes what it is looking for.
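How the retriever fuses its stages is not specified here. As one illustration of what fusion can mean, the sketch below implements reciprocal rank fusion (RRF), a common rank-based method that sidesteps the problem of comparing raw scores across different embedding spaces. It is a swapped-in technique for illustration, not a description of the retriever's internals, and `k = 60` is simply the conventional default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of asset ids.

    Each appearance contributes 1 / (k + rank), with rank starting at 1,
    so assets ranked highly by several lists rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, asset_id in enumerate(ranking, start=1):
            scores[asset_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical ranked output from two feature searches over the same collection.
image_hits = ["vid_a", "vid_b", "vid_c"]
transcript_hits = ["vid_b", "vid_d", "vid_a"]

fused = reciprocal_rank_fusion([image_hits, transcript_hits])
for asset_id, score in fused:
    print(asset_id, round(score, 5))
```

`vid_b` comes out first because it ranks near the top of both lists, even though it tops only one of them. A reranking stage would then reorder the fused candidates.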
Structuring Results for Agent Reasoning
The results your MCP tool returns determine how effectively the agent can reason about multimodal content. Text results are straightforward -- the agent can read them directly. But for images, audio, and video, you need a representation that conveys enough information for the agent to make decisions without actually seeing or hearing the content.
Result Schema Design
{
  "results": [
    {
      "asset_id": "vid_abc123",
      "asset_type": "video",
      "relevance_score": 0.92,
      "timestamp": {
        "start_seconds": 142.5,
        "end_seconds": 156.3
      },
      "context": {
        "transcript_segment": "[Speaker: CFO] We need to cut the Q4 budget by fifteen percent. The board approved the reduction last Tuesday.",
        "scene_description": "Conference room, 6 people seated around table, presentation slide visible showing bar chart labeled 'Q4 Budget Projections'",
        "detected_objects": ["person", "laptop", "presentation_screen", "chart"]
      },
      "source": {
        "filename": "board-meeting-2026-05-12.mp4",
        "collection": "executive-meetings",
        "duration_seconds": 3842
      }
    }
  ]
}
The critical field is `context`. For video results, this includes the transcript segment, a generated scene description, and detected objects. The agent can read all of these and reason about the content without needing to "see" the video frame. For audio results, the context includes the transcript with speaker labels and timestamps. For document results, it includes the extracted text with page numbers and section headings.
Do not return raw embedding vectors. They are meaningless to the agent. Do not return base64-encoded images in most cases -- they consume context window tokens without adding reasoning value. Instead, return descriptive metadata that the agent can use for filtering, ranking, and presenting to the user.
The exception is when the agent is building a response that includes visual content -- for example, generating a report with embedded images. In that case, return a URL or reference ID that the agent can pass to a rendering tool, not the raw bytes.
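A small post-processing step can enforce both rules before results leave the tool. In the sketch below, the field names `embedding` and `thumbnail_base64`, the `thumbnail_url` key, and the `media.example.com` base URL are all hypothetical; substitute whatever your backend actually returns.

```python
def to_agent_result(raw: dict, media_base_url: str = "https://media.example.com") -> dict:
    """Drop payloads the agent cannot reason over and replace inline image
    bytes with a URL reference that a rendering tool can resolve later."""
    heavy_fields = {"embedding", "thumbnail_base64"}  # hypothetical field names
    result = {key: value for key, value in raw.items() if key not in heavy_fields}
    if "thumbnail_base64" in raw:
        result["thumbnail_url"] = f"{media_base_url}/thumbnails/{raw['asset_id']}.jpg"
    return result


raw_hit = {
    "asset_id": "vid_abc123",
    "asset_type": "video",
    "relevance_score": 0.92,
    "embedding": [0.01] * 512,                 # meaningless to the agent
    "thumbnail_base64": "/9j/4AAQSkZJRg...",  # truncated placeholder bytes
    "context": {"scene_description": "Conference room, 6 people seated around table"},
}

clean = to_agent_result(raw_hit)
print(sorted(clean))
```

The descriptive `context` block passes through untouched, since that is the part the agent actually reads.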
Pagination and Streaming
Multimodal search results can be large. A single video result with transcript, scene description, and metadata might be 500+ tokens. Returning 50 such results would consume at least 25,000 tokens of context (50 results at 500 tokens each) -- a large share of the usable window for many models, spent before the agent has done any reasoning.
MCP transports can stream messages via Server-Sent Events (SSE), but most agent frameworks still hand a tool result to the model as a single block. The practical pattern is progressive disclosure:
1. Return a summary first (top 5 results with compact metadata)
2. Expose a `get_result_details` tool that returns full context for a specific result
3. Let the agent decide which results warrant deeper inspection
This mirrors how humans search: scan titles, click into promising results, read details. The agent does the same -- it scans the summary, identifies the most relevant result, and calls `get_result_details` for the full transcript and scene description.
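A minimal sketch of the progressive-disclosure pattern follows. It assumes results shaped like the schema shown earlier, keeps full records in a plain in-memory dict (a real server would likely want a cache with expiry), and invents a `result_id` format and a `RESULT_NOT_FOUND` error code for illustration.

```python
# Full results live server-side, keyed by a result id.
_FULL_RESULTS: dict[str, dict] = {}


def _result_id(result: dict) -> str:
    # Hypothetical id format: asset id plus segment start time.
    return f"{result['asset_id']}@{result['timestamp']['start_seconds']}"


def search_summary(results: list[dict], limit: int = 5, max_chars: int = 80) -> list[dict]:
    """Return compact summaries for the top `limit` results and cache the full records."""
    top = sorted(results, key=lambda r: r["relevance_score"], reverse=True)[:limit]
    summaries = []
    for result in top:
        rid = _result_id(result)
        _FULL_RESULTS[rid] = result
        snippet = result["context"]["transcript_segment"]
        if len(snippet) > max_chars:
            snippet = snippet[:max_chars].rstrip() + "..."
        summaries.append({
            "result_id": rid,
            "asset_type": result["asset_type"],
            "relevance_score": result["relevance_score"],
            "snippet": snippet,
        })
    return summaries


def get_result_details(result_id: str) -> dict:
    """Return the full cached record, or a structured error the agent can act on."""
    try:
        return _FULL_RESULTS[result_id]
    except KeyError:
        return {"error": {"code": "RESULT_NOT_FOUND",
                          "message": f"No cached result {result_id!r}; run the search again."}}


# Seven fabricated results, purely to exercise the functions.
raw = [
    {"asset_id": f"vid_{i}", "asset_type": "video", "relevance_score": 0.5 + i / 20,
     "timestamp": {"start_seconds": 10.0 * i, "end_seconds": 10.0 * i + 8},
     "context": {"transcript_segment": "We need to cut the Q4 budget. " * 5,
                 "scene_description": "Conference room"}}
    for i in range(7)
]

summaries = search_summary(raw)
details = get_result_details(summaries[0]["result_id"])
print(len(summaries), summaries[0]["result_id"], details["context"]["scene_description"])
```

The summary carries just enough for the agent to choose, and the expensive fields are fetched only for the results it decides to open.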
MCP Resources for Collection Metadata
Beyond search tools, MCP resources let you expose collection metadata that helps the agent decide where and how to search.
{
  "uri": "mixpeek://collections/executive-meetings",
  "name": "Executive Meetings",
  "description": "Board meetings and executive briefings from 2024-2026",
  "mimeType": "application/json",
  "metadata": {
    "asset_count": 342,
    "modalities": ["video", "audio"],
    "features": ["transcript", "scene_caption", "face_identity", "image_embedding"],
    "date_range": "2024-01-15 to 2026-05-12",
    "total_duration_hours": 512
  }
}
When the agent needs to answer "find the meeting where the CFO discussed budget cuts," it can first read the collection resources to discover that the `executive-meetings` collection contains video with transcripts and face identity features. This informs its search strategy: search transcripts for "budget cuts" and use face identity to verify the CFO is present.
Without collection metadata, the agent would have to guess which collection to search and which features are available -- leading to failed searches or suboptimal query strategies.
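The decision the agent makes from this metadata can be pictured as a lookup from available features to applicable tools. The sketch below hard-codes that mapping as an assumption; in practice the agent arrives at the same conclusion by reading the resource and the tool descriptions rather than by running code like this.

```python
# Assumed mapping from indexed features to the tool that can search them.
FEATURE_TO_TOOL = {
    "transcript": "search_transcripts",
    "image_embedding": "search_video_frames",
    "text_embedding": "search_documents",
}


def plan_tools(collection: dict) -> list[str]:
    """Return the search tools that are usable for this collection's features."""
    available = set(collection["metadata"]["features"])
    return [tool for feature, tool in FEATURE_TO_TOOL.items() if feature in available]


executive_meetings = {
    "uri": "mixpeek://collections/executive-meetings",
    "metadata": {
        "modalities": ["video", "audio"],
        "features": ["transcript", "scene_caption", "face_identity", "image_embedding"],
    },
}

plan = plan_tools(executive_meetings)
print(plan)
```

For the collection above, document search is ruled out before any query runs, which saves a failed call and the tokens spent recovering from it.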
Error Handling and Graceful Degradation
Multimodal search can fail in ways that text search does not. An embedding model might not support the query language. A video might not have been transcribed yet. A collection might have image embeddings but no OCR text.
The MCP tool should return structured errors that help the agent adapt:
{
  "error": {
    "code": "FEATURE_NOT_AVAILABLE",
    "message": "Collection 'product-images' does not have transcript features. Available features: image_embedding, object_detection, ocr_text.",
    "suggestion": "Use search_by_visual_similarity or search_documents for OCR text search instead."
  }
}
The `suggestion` field is key. It tells the agent how to recover without human intervention. A well-designed MCP tool anticipates common failure modes and provides actionable alternatives.
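A helper along the following lines can generate such errors consistently, deriving the suggestion from whatever features the collection does have. The feature-to-tool mapping and tool names are assumptions for this sketch. In MCP, tool execution failures are typically reported inside the tool result with `isError` set to true, so a payload like this would usually travel as the content of that result.

```python
# Assumed mapping from features to the tools that can search them.
FEATURE_TOOLS = {
    "image_embedding": "search_visual",
    "transcript": "search_transcript",
    "ocr_text": "search_documents",
    "text_embedding": "search_documents",
}


def feature_not_available(collection_id: str, requested: str, available: list[str]) -> dict:
    """Build a structured error that names the missing feature and points
    the agent at tools that work with the features that do exist."""
    alternatives = sorted({FEATURE_TOOLS[f] for f in available if f in FEATURE_TOOLS})
    if alternatives:
        suggestion = "Use " + " or ".join(alternatives) + " instead."
    else:
        suggestion = "No alternative search tool is available for this collection."
    return {
        "error": {
            "code": "FEATURE_NOT_AVAILABLE",
            "message": (
                f"Collection {collection_id!r} does not have {requested} features. "
                f"Available features: {', '.join(available)}."
            ),
            "suggestion": suggestion,
        }
    }


err = feature_not_available(
    "product-images", "transcript", ["image_embedding", "object_detection", "ocr_text"]
)
print(err["error"]["suggestion"])
```

Deriving the suggestion from live metadata keeps it truthful: the agent is never pointed at a tool that would fail for the same reason.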
Putting It Together: A Complete MCP Server
Here is a skeleton of an MCP server for multimodal search, using the FastMCP helper from the MCP Python SDK (which provides the `tool()` and `resource()` decorators). The Mixpeek client calls follow the earlier examples, and `format_results` is a placeholder helper you would adapt to your client's response shape:
import json

from mcp.server.fastmcp import FastMCP
from mixpeek import Mixpeek

server = FastMCP("multimodal-search")
client = Mixpeek(api_key="API_KEY")  # placeholder; load the real key from configuration

def format_results(results) -> list[dict]:
    # Placeholder helper (not part of any SDK): assumes an iterable of
    # dict-like hits and returns plain dicts the agent can read.
    return [dict(hit) for hit in results]

@server.tool()
async def search_visual(
    query: str,
    collection_id: str,
    top_k: int = 10
) -> list[dict]:
    """Search video frames and images by visual description."""
    results = await client.retrievers.retrieve(
        queries=[{"type": "text", "value": query}],
        collection_ids=[collection_id],
        stages=[
            {"type": "feature_search", "feature": "image_embedding", "top_k": top_k}
        ]
    )
    return format_results(results)

@server.tool()
async def search_transcript(
    query: str,
    collection_id: str,
    speaker: str | None = None,
    top_k: int = 10
) -> list[dict]:
    """Search audio/video transcripts. Optionally filter by speaker."""
    # Over-retrieve so the reranker has candidates to choose from
    stages = [
        {"type": "feature_search", "feature": "transcript", "top_k": top_k * 2}
    ]
    if speaker:
        # Apply the speaker filter before the feature search
        stages.insert(0, {
            "type": "filter",
            "feature": "speaker_diarization",
            "filter": {"speaker_label": speaker}
        })
    stages.append({"type": "rerank", "model": "Qwen/Qwen3-Reranker-8B", "top_k": top_k})
    results = await client.retrievers.retrieve(
        queries=[{"type": "text", "value": query}],
        collection_ids=[collection_id],
        stages=stages
    )
    return format_results(results)

@server.tool()
async def search_documents(
    query: str,
    collection_id: str,
    top_k: int = 10
) -> list[dict]:
    """Semantic search over document text and OCR content."""
    results = await client.retrievers.retrieve(
        queries=[{"type": "text", "value": query}],
        collection_ids=[collection_id],
        stages=[
            {"type": "feature_search", "feature": "text_embedding", "top_k": top_k * 2},
            {"type": "rerank", "model": "Qwen/Qwen3-Reranker-8B", "top_k": top_k}
        ]
    )
    return format_results(results)

@server.resource("mixpeek://collections")
async def list_collections() -> str:
    """List available collections with their modalities and features."""
    collections = await client.collections.list()
    return json.dumps([{
        "id": c.id,
        "name": c.name,
        "asset_count": c.asset_count,
        "modalities": c.modalities,
        "features": c.features
    } for c in collections])

if __name__ == "__main__":
    server.run()  # stdio transport by default
The server exposes three search tools (visual, transcript, document) and one resource (collection listing). Each tool handles embedding routing and multi-stage retrieval internally. The agent discovers the tools, reads the collection metadata, and composes search strategies by calling the right tools in sequence.
Schema Design Principles
After building several MCP servers for multimodal search, these principles consistently produce better agent behavior:
1. Tool descriptions are documentation. Agents read the `description` field to decide when to use a tool. Write descriptions that explain what the tool searches over, what filters are available, and what the result format looks like. One detailed sentence is worth more than a paragraph of vague marketing copy.
2. Enumerate valid values. If a filter accepts a fixed set of values (modalities, feature types, speaker labels), use `enum` in the schema. Agents handle enums much more reliably than freeform strings.
3. Default to broad, narrow on request. If the agent does not specify a filter, search everything. If it specifies a speaker, filter to that speaker. Do not require the agent to know about every filter option to get useful results.
4. Return provenance. Every result should include enough metadata for the agent to cite its source: filename, timestamp, page number, collection name. Agents that can cite sources are dramatically more useful than agents that just return answers.
5. Separate search from retrieval. The search tool returns ranked references. A separate `get_asset_details` tool returns full content for a specific asset. This keeps search results compact and lets the agent decide what to inspect in detail.
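Principle 2 is easy to apply when the valid values are known at startup. The sketch below generates an input schema with an `enum` built from the speaker labels found in a collection and checks incoming arguments against it. The label list, the `top_k` bounds, and the helper names are assumptions for illustration, and the check covers only `enum` constraints rather than full JSON Schema validation.

```python
def build_transcript_schema(speaker_labels: list[str]) -> dict:
    """Build an input schema whose speaker filter only admits known labels."""
    return {
        "type": "object",
        "properties": {
            "text_query": {
                "type": "string",
                "description": "Words or phrases to find in the transcript",
            },
            "speaker_filter": {
                "type": "string",
                "enum": sorted(speaker_labels),
                "description": "Restrict results to one speaker. Omit to search all speakers.",
            },
            "top_k": {"type": "integer", "default": 10, "minimum": 1, "maximum": 50},
        },
        "required": ["text_query"],
    }


def check_enum_args(schema: dict, arguments: dict) -> list[str]:
    """Return one message per argument whose value falls outside its enum."""
    errors = []
    for name, value in arguments.items():
        allowed = schema["properties"].get(name, {}).get("enum")
        if allowed is not None and value not in allowed:
            errors.append(f"{name}={value!r} is not one of {allowed}")
    return errors


schema = build_transcript_schema(["CFO", "CEO", "Board Chair"])
print(check_enum_args(schema, {"text_query": "budget cuts", "speaker_filter": "CFO"}))
print(check_enum_args(schema, {"text_query": "budget cuts", "speaker_filter": "cfo"}))
```

The second call fails on letter case alone, which is the kind of mismatch a freeform string would let through silently and an `enum` surfaces immediately.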
Where This Is Heading
The MCP ecosystem is evolving rapidly. Server-to-server composition (one MCP server calling tools on another) is being standardized, which will enable hierarchical search architectures -- an agent's search tool could itself be an agent that reasons about which sub-searches to run.
Multimodal search is a foundational capability that makes agents useful for real-world work. An agent that can search across video, audio, images, and documents can answer questions that no text-only system can handle. MCP gives these search capabilities a standard interface that works across agent frameworks.