Why Computer-Use Agents Need Searchable Memory
Computer-use agents operate through a loop:
1. Observe the screen.
2. Choose an action.
3. Click, type, scroll, or call a tool.
4. Observe the new screen.
5. Decide what changed.
That loop creates a rich trace, but most systems store it as logs and screenshots. Logs capture actions but miss visual state. Screenshots capture pixels but are hard to search. Browser events capture URLs but not intent. Tool outputs capture results but not whether the agent used them correctly.
If the agent fails, the debugging question is rarely "what did the model say?" It is:
Those are retrieval questions over unstructured agent traces. A computer-use memory system turns every step into searchable evidence.
The core claim is direct: searchable computer-use memory helps AI agents see screens, inspect UI state, and search unstructured task traces.
The Core Architecture
Treat each agent step as an observation packet.
computer-use run
-> state transition
-> observation packet
-> multimodal indexes
-> bounded retrieval tool
-> evidence for the next agent step
An observation packet is not just a screenshot. It combines what the agent saw, what it did, what changed, and how to cite the evidence.
{
  "observation_id": "obs_run_42_step_08",
  "run_id": "run_42",
  "step": 8,
  "timestamp": "2026-06-06T15:20:11Z",
  "task": "reset account password",
  "environment": "browser",
  "source_uri": "s3://agent-runs/run_42/step_08.png",
  "url": "https://app.example.com/settings",
  "action_before": {
    "type": "click",
    "target": "Save",
    "x": 812,
    "y": 718
  },
  "visual_state": {
    "caption": "Settings page with password form and disabled save button",
    "ocr": ["Current password", "New password", "Save"],
    "ui_elements": [
      {"role": "button", "text": "Save", "bbox": [760, 690, 860, 735], "enabled": false}
    ]
  },
  "tool_outputs": [
    {"tool": "validate_password_policy", "result": "missing_special_character"}
  ],
  "provenance": {
    "screenshot_model": "Hcompany/Holo-3.1-4B",
    "ocr_model": "PaddleOCR-VL",
    "extractor_version": "2026-06-06"
  }
}
This packet gives future agents something concrete to retrieve. It can be searched by visible text, UI layout, action type, tool result, task, URL, or visual similarity.
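One way to keep packets consistent across runs is to give them a typed shape in code. The sketch below is illustrative only: the class names `UIElement` and `ObservationPacket`, the `searchable_text` helper, and the assumed `bbox` order are not part of any library, and the fields are a subset of the JSON packet above.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class UIElement:
    role: str
    text: str
    bbox: list[int]  # assumed order: [x_min, y_min, x_max, y_max] in pixels
    enabled: bool = True


@dataclass
class ObservationPacket:
    observation_id: str
    run_id: str
    step: int
    task: str
    environment: str
    source_uri: str
    caption: str = ""
    ocr: list[str] = field(default_factory=list)
    ui_elements: list[UIElement] = field(default_factory=list)
    tool_outputs: list[dict[str, Any]] = field(default_factory=list)

    def searchable_text(self) -> str:
        """Join the text-bearing channels for a simple lexical index."""
        parts = [self.caption, *self.ocr]
        parts += [el.text for el in self.ui_elements]
        parts += [str(out.get("result", "")) for out in self.tool_outputs]
        return " ".join(p for p in parts if p)


packet = ObservationPacket(
    observation_id="obs_run_42_step_08",
    run_id="run_42",
    step=8,
    task="reset account password",
    environment="browser",
    source_uri="s3://agent-runs/run_42/step_08.png",
    caption="Settings page with password form and disabled save button",
    ocr=["Current password", "New password", "Save"],
    ui_elements=[UIElement("button", "Save", [760, 690, 860, 735], enabled=False)],
    tool_outputs=[{"tool": "validate_password_policy",
                   "result": "missing_special_character"}],
)
```

Keeping the text channels reachable through one helper makes it easy to feed a lexical index while the structured fields remain available for filters.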
Segment Runs by State Transitions
Document RAG chunks text. Video RAG chunks time. Computer-use memory chunks state transitions.
A state transition is the interval between one observation and the next:
screen_before + action + tool_output + screen_after
This is the useful retrieval unit because it preserves causality:
Store both single-state packets and transition packets.
Single-state packets answer:
Transition packets answer:
Most failures are transition failures, not state failures. The screen looked plausible, but the next action was wrong.
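Transition packets can be derived from the single-state packets rather than captured separately. The following is a minimal sketch, assuming each packet is a dict shaped like the JSON example earlier and that `action_before` on a packet describes the action that produced it. The `Transition` class and `build_transitions` function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Transition:
    run_id: str
    from_step: int
    to_step: int
    action: dict
    screen_before: str
    screen_after: str
    tool_outputs: list = field(default_factory=list)


def build_transitions(packets: list[dict]) -> list[Transition]:
    """Pair each observation with the next observation in the same run."""
    ordered = sorted(packets, key=lambda p: (p["run_id"], p["step"]))
    transitions = []
    for before, after in zip(ordered, ordered[1:]):
        if before["run_id"] != after["run_id"]:
            continue  # never link steps across different runs
        transitions.append(
            Transition(
                run_id=after["run_id"],
                from_step=before["step"],
                to_step=after["step"],
                # Assumption: the later packet records the action that led to it.
                action=after.get("action_before", {}),
                screen_before=before["source_uri"],
                screen_after=after["source_uri"],
                tool_outputs=after.get("tool_outputs", []),
            )
        )
    return transitions
```

Sorting by run and step first means the function tolerates packets that arrive out of order, which is common when screenshots are processed asynchronously.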
Extract Multiple Channels
Do not force screenshots, actions, OCR, DOM, and tool calls into one vector. Use separate channels and fuse results later.
Screenshot Caption Channel
A vision-language model can summarize the screen:
This channel supports natural language queries like "modal open with a disabled submit button" or "pricing table where annual billing is selected."
OCR and UI Text Channel
OCR and accessibility text preserve exact strings. This matters for:
Dense visual embeddings are weak at exact strings. Keep OCR searchable with lexical retrieval and filters.
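To illustrate why lexical retrieval suits exact strings, here is a small inverted index over OCR lines. It is a sketch under simple assumptions: it does plain token matching with no BM25 scoring, and `OcrIndex` and `tokenize` are hypothetical names.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


class OcrIndex:
    def __init__(self) -> None:
        # token -> set of observation ids whose OCR text contains it
        self._postings: dict[str, set[str]] = defaultdict(set)

    def add(self, observation_id: str, ocr_lines: list[str]) -> None:
        for line in ocr_lines:
            for token in tokenize(line):
                self._postings[token].add(observation_id)

    def search_all(self, query: str) -> set[str]:
        """Return observations whose OCR text contains every query token."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        result = set(self._postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self._postings.get(token, set())
        return result


index = OcrIndex()
index.add("obs_run_42_step_08", ["Current password", "New password", "Save"])
index.add("obs_run_42_step_09", ["Error code E1042", "Try again"])
matches = index.search_all("E1042")
```

An error code such as the made-up `E1042` either appears in the OCR text or it does not, which is the kind of query where an exact-match index is more dependable than embedding similarity.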
UI Element Channel
When possible, extract structured UI elements:
For web and desktop agents, this channel is often more reliable than pixels alone. For custom canvases, mobile screenshots, and older apps, you may need OCR and vision grounding as the fallback.
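Structured elements can be queried with plain filters instead of similarity search. The sketch below assumes packets shaped like the JSON example earlier, with elements under `visual_state.ui_elements`. The `find_elements` function is a hypothetical helper.

```python
from typing import Optional


def find_elements(
    packet: dict,
    role: Optional[str] = None,
    text: Optional[str] = None,
    enabled: Optional[bool] = None,
) -> list[dict]:
    """Return UI elements matching every filter that was supplied."""
    matches = []
    for el in packet.get("visual_state", {}).get("ui_elements", []):
        if role is not None and el.get("role") != role:
            continue
        if text is not None and text.lower() not in el.get("text", "").lower():
            continue
        if enabled is not None and el.get("enabled") != enabled:
            continue
        matches.append(el)
    return matches


packet = {
    "visual_state": {
        "ui_elements": [
            {"role": "button", "text": "Save",
             "bbox": [760, 690, 860, 735], "enabled": False},
            {"role": "textbox", "text": "New password",
             "bbox": [400, 500, 860, 540], "enabled": True},
        ]
    }
}
disabled_buttons = find_elements(packet, role="button", enabled=False)
```

Filters like this give a yes-or-no answer, so they are better used to narrow candidates than to rank them.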
Action and Tool Channel
Actions and tool outputs form the agent's operational memory:
This channel explains why a state exists. It is the difference between "the page showed an error" and "the page showed an error after the agent submitted a password that failed policy validation."
Visual Embedding Channel
Visual embeddings support fuzzy search over layouts and states:
Use visual embeddings for recall. Use OCR, UI elements, and reranking for precision.
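As a rough illustration of the recall step, the sketch below ranks stored screenshot vectors by cosine similarity to a query vector. It assumes the embeddings already exist and fit in memory; a production system would use a vector index instead of a linear scan. The `cosine` and `nearest` functions and the toy vectors are illustrative only.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(
    query_vec: list[float],
    embeddings: dict[str, list[float]],
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Linear scan over all stored vectors, highest similarity first."""
    scored = [(obs_id, cosine(query_vec, vec)) for obs_id, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 2-dimensional vectors standing in for real screenshot embeddings.
embeddings = {
    "settings_page": [1.0, 0.0],
    "login_page": [0.0, 1.0],
    "settings_modal": [0.9, 0.1],
}
candidates = nearest([1.0, 0.0], embeddings, top_k=2)
```

The candidates returned here are deliberately loose. They are meant to be narrowed afterwards by the OCR and UI element channels.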
Use Multi-Index Retrieval
A computer-use memory search should query several indexes independently:
agent task or debug query
-> screenshot caption search
-> OCR/BM25 search
-> UI element filter
-> action/tool search
-> visual embedding search
-> fusion
-> reranking
-> evidence packet
The score scales are not comparable. BM25 over OCR, cosine similarity over screenshots, and a UI element filter do not produce the same kind of confidence. Start with rank-based fusion such as Reciprocal Rank Fusion, then rerank the top candidates with the full query and packet context.
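Reciprocal Rank Fusion needs only the rank of each result within its channel, which is why it sidesteps the score-scale problem. The sketch below scores each observation as the sum of `1 / (k + rank)` over the channels that returned it, with the commonly used constant `k = 60`. The function name and the sample rankings are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: dict[str, list[str]],
    k: int = 60,
    top_k: int = 10,
) -> list[tuple[str, float]]:
    """Fuse per-channel ranked id lists into one list ordered by RRF score."""
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, obs_id in enumerate(ranked_ids, start=1):
            scores[obs_id] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused[:top_k]


rankings = {
    "ocr_bm25": ["step_08", "step_03", "step_05"],
    "visual_embedding": ["step_03", "step_08", "step_09"],
    "ui_element_filter": ["step_08", "step_09"],
}
fused = reciprocal_rank_fusion(rankings)
```

In this example `step_08` ranks first because three channels returned it near the top, even though no raw scores were compared.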
The result should tell the agent why it matched:
{
  "source_uri": "s3://agent-runs/run_42/step_08.png",
  "run_id": "run_42",
  "step": 8,
  "summary": "Password settings page with disabled Save button",
  "matched_channels": ["ocr", "ui_element", "visual_embedding"],
  "matched_evidence": [
    {"channel": "ocr", "text": "Save"},
    {"channel": "ui_element", "role": "button", "enabled": false},
    {"channel": "tool", "result": "missing_special_character"}
  ],
  "next_actions": ["inspect_neighboring_steps", "compare_successful_transition"]
}
The agent should not receive a vague memory blob. It should receive evidence with a screen, a step, a reason, and allowed follow-up actions.
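A result of this shape can be assembled from whatever each channel matched. The sketch below assumes per-channel hits arrive as lists of small dicts. The `build_evidence` helper and its input format are hypothetical, and the fixed `next_actions` list simply reuses the two actions from the example above.

```python
def build_evidence(packet: dict, channel_hits: dict[str, list[dict]]) -> dict:
    """Combine a packet and its per-channel hits into a citable result."""
    # Keep only channels that actually matched something.
    matched = {ch: hits for ch, hits in channel_hits.items() if hits}
    return {
        "source_uri": packet["source_uri"],
        "run_id": packet["run_id"],
        "step": packet["step"],
        "summary": packet.get("visual_state", {}).get("caption", ""),
        "matched_channels": sorted(matched),
        "matched_evidence": [
            {"channel": ch, **hit}
            for ch in sorted(matched)
            for hit in matched[ch]
        ],
        "next_actions": ["inspect_neighboring_steps", "compare_successful_transition"],
    }


packet = {
    "source_uri": "s3://agent-runs/run_42/step_08.png",
    "run_id": "run_42",
    "step": 8,
    "visual_state": {"caption": "Password settings page with disabled Save button"},
}
evidence = build_evidence(
    packet,
    {
        "ocr": [{"text": "Save"}],
        "ui_element": [{"role": "button", "enabled": False}],
        "visual_embedding": [],  # no match, so it is left out of the result
    },
)
```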
Design the Agent Tool
Expose computer-use memory as a bounded tool.
{
  "tool": "search_computer_use_memory",
  "input_schema": {
    "query": "string",
    "run_filters": {
      "app": "string",
      "environment": "browser | desktop | mobile",
      "task": "string",
      "url_contains": "string"
    },
    "channels": ["screenshot", "ocr", "ui_element", "action", "tool_output"],
    "top_k": 20,
    "include_neighbor_steps": true,
    "budget_ms": 2000
  }
}
Make the tool return observations, not conclusions. The agent can use the observations to decide whether to retry, inspect a neighbor step, compare against a successful run, or ask a human for review.
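"Bounded" can be enforced in the tool wrapper itself. The sketch below is one possible shape, not a reference implementation: it rejects unknown channels, clamps `top_k`, and stops calling channel searchers once the time budget is spent. `MAX_TOP_K`, the `channel_searchers` mapping, and the returned `skipped_channels` field are assumptions made for illustration.

```python
import time

ALLOWED_CHANNELS = {"screenshot", "ocr", "ui_element", "action", "tool_output"}
MAX_TOP_K = 50  # hypothetical upper bound on results per call


def search_computer_use_memory(request: dict, channel_searchers: dict) -> dict:
    """Run the requested channel searchers within top_k and time limits."""
    channels = request.get("channels") or sorted(ALLOWED_CHANNELS)
    unknown = set(channels) - ALLOWED_CHANNELS
    if unknown:
        raise ValueError(f"unknown channels: {sorted(unknown)}")

    top_k = max(1, min(int(request.get("top_k", 20)), MAX_TOP_K))
    deadline = time.monotonic() + request.get("budget_ms", 2000) / 1000.0

    observations, skipped = [], []
    for channel in channels:
        searcher = channel_searchers.get(channel)
        # Skip channels with no searcher or when the budget is exhausted.
        if searcher is None or time.monotonic() >= deadline:
            skipped.append(channel)
            continue
        observations.extend(searcher(request["query"], top_k))

    # Return raw observations; interpretation is left to the agent.
    return {"observations": observations[:top_k], "skipped_channels": skipped}


searchers = {"ocr": lambda query, k: [{"channel": "ocr", "text": "Save"}][:k]}
result = search_computer_use_memory(
    {"query": "disabled save button", "channels": ["ocr", "action"], "top_k": 500},
    searchers,
)
```

Reporting skipped channels tells the agent that an empty result may mean "not searched" rather than "not found".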
Good tool outputs include:
Bad tool outputs only say "similar screen found." That forces the model to infer too much.
Evaluate Memory Quality
Computer-use memory needs retrieval evals and trace evals.
Retrieval eval examples:
Trace eval examples:
Useful metrics:
The key is separating "retrieved the right packet" from "used the packet correctly." A memory system can retrieve the right evidence while the agent ignores it.
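The distinction can be made measurable by scoring retrieval and usage separately. The sketch below assumes each eval case records the retrieved packet ids, the ids labeled relevant, and the ids the agent actually cited. Those fields and the outcome labels are hypothetical.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: list[str], k: int) -> float:
    """Fraction of relevant packets that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def classify_case(
    retrieved_ids: list[str],
    relevant_ids: list[str],
    cited_ids: list[str],
    k: int = 10,
) -> str:
    """Label a case as a retrieval miss, an ignored hit, or a used hit."""
    relevant = set(relevant_ids)
    if not set(retrieved_ids[:k]) & relevant:
        return "retrieval_miss"
    if set(cited_ids) & relevant:
        return "retrieved_and_used"
    return "retrieved_but_ignored"


score = recall_at_k(["step_03", "step_08", "step_05"], ["step_08", "step_11"], k=2)
outcome = classify_case(
    retrieved_ids=["step_03", "step_08"],
    relevant_ids=["step_08"],
    cited_ids=["step_03"],
)
```

A high share of `retrieved_but_ignored` cases points at the agent's prompting or tool-result formatting rather than at the index.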
Security and Privacy Controls
Computer-use traces can contain sensitive data. Treat them as production data, not debug leftovers.
Design controls before indexing:
Memory is useful because it is searchable. That also makes access control more important.
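Redaction is one control that can run before anything reaches an index. The sketch below masks email addresses and long digit runs in the caption and OCR text of a packet. The two patterns are deliberately simple assumptions and will not catch every kind of sensitive value. The function names are hypothetical.

```python
import copy
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 12 to 19 digits, optionally separated by single spaces or hyphens.
LONG_DIGITS = re.compile(r"\b\d(?:[ -]?\d){11,18}\b")


def redact(text: str) -> str:
    """Mask emails and card-like digit runs in a string."""
    text = EMAIL.sub("[REDACTED_EMAIL]", text)
    return LONG_DIGITS.sub("[REDACTED_NUMBER]", text)


def redact_packet(packet: dict) -> dict:
    """Return a copy of the packet with its text channels redacted."""
    clean = copy.deepcopy(packet)
    state = clean.setdefault("visual_state", {})
    state["caption"] = redact(state.get("caption", ""))
    state["ocr"] = [redact(line) for line in state.get("ocr", [])]
    return clean


raw = {
    "visual_state": {
        "caption": "Billing page for jane@example.com",
        "ocr": ["Card 4111 1111 1111 1111", "Save"],
    }
}
safe = redact_packet(raw)
```

Text redaction does not touch the screenshot pixels, so the stored image needs its own masking or access restrictions.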
Mixpeek Implementation Pattern
Use object storage as the source of truth for screenshots, screen recordings, and trace artifacts. Then index each observation packet with multiple feature channels.
from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_API_KEY")

mx.ingest.images(
    collection="computer_use_memory",
    source={"type": "s3", "bucket": "agent-runs", "prefix": "run_42/"},
    metadata={
        "run_id": "run_42",
        "task": "reset account password",
        "environment": "browser"
    },
    pipeline={
        "captioning": {"model": "mixpeek://image_extractor@v1/hcompany_holo_31_4b_v1"},
        "ocr": {"model": "mixpeek://image_extractor@v1/paddle_ocr_vl_16_v1"},
        "visual_embedding": {"model": "mixpeek://image_extractor@v1/qwen3_vl_embedding_2b_v1"}
    }
)
Then expose a retrieval tool to the agent:
results = mx.retrievers.retrieve(
    retriever_id="computer-use-memory",
    queries=[
        {
            "type": "text",
            "value": "disabled save button after password policy validation error"
        }
    ],
    filters={
        "environment": {"eq": "browser"},
        "task": {"eq": "reset account password"}
    },
    stages=[
        {"name": "ocr_bm25", "top_k": 100},
        {"name": "ui_element_filter", "top_k": 100},
        {"name": "visual_embedding", "top_k": 100},
        {"name": "rrf_fusion", "top_k": 40},
        {"name": "vlm_rerank", "top_k": 10}
    ],
    include_context=True,
    budget_ms=2000
)
If you bring your own screenshot embeddings, MVS can store and search them on object storage. If you want Mixpeek to run the perception layer, Managed indexing can populate captions, OCR, visual embeddings, and trace features from the objects you already keep in storage.
Design Checklist
Key Takeaways
1. Computer-use traces are multimodal evidence, not just logs.
2. The best retrieval unit is the observation packet: screen state, action, tool output, and provenance.
3. Screenshots need multiple indexes. Visual embeddings, OCR, UI elements, and action traces answer different questions.
4. Agent memory tools should return bounded evidence with citations and follow-up actions.
5. Evaluation must test retrieval, step localization, and whether the agent used the memory correctly.
6. Searchable memory makes agents more capable, but only if access control and redaction are designed into the index.