Computer-Use Agent Memory: How to Search Screens, Tools, and UI State

Why Computer-Use Agents Need Searchable Memory

Computer-use agents operate through a loop:

1. Observe the screen.

2. Choose an action.

3. Click, type, scroll, or call a tool.

4. Observe the new screen.

5. Decide what changed.

That loop creates a rich trace, but most systems store it as logs and screenshots. Logs capture actions but miss visual state. Screenshots capture pixels but are hard to search. Browser events capture URLs but not intent. Tool outputs capture results but not whether the agent used them correctly.

If the agent fails, the debugging question is rarely "what did the model say?" It is:

Which screen state caused the bad action?

Did the expected button exist before the click?

Was the agent looking at stale context?

Did a tool result contradict the visible UI?

Has this failure happened before in a similar workflow?

Which prior run contains the closest successful state?

Those are retrieval questions over unstructured agent traces. A computer-use memory system turns every step into searchable evidence.

The premise is direct: searchable computer-use memory helps AI agents see screens, inspect UI state, and search unstructured task traces.

The Core Architecture

Treat each agent step as an observation packet.

computer-use run
  -> state transition
  -> observation packet
  -> multimodal indexes
  -> bounded retrieval tool
  -> evidence for the next agent step

An observation packet is not just a screenshot. It combines what the agent saw, what it did, what changed, and how to cite the evidence.

{
  "observation_id": "obs_run_42_step_08",
  "run_id": "run_42",
  "step": 8,
  "timestamp": "2026-06-06T15:20:11Z",
  "task": "reset account password",
  "environment": "browser",
  "source_uri": "s3://agent-runs/run_42/step_08.png",
  "url": "https://app.example.com/settings",
  "action_before": {
    "type": "click",
    "target": "Save",
    "x": 812,
    "y": 718
  },
  "visual_state": {
    "caption": "Settings page with password form and disabled save button",
    "ocr": ["Current password", "New password", "Save"],
    "ui_elements": [
      {"role": "button", "text": "Save", "bbox": [760, 690, 860, 735], "enabled": false}
    ]
  },
  "tool_outputs": [
    {"tool": "validate_password_policy", "result": "missing_special_character"}
  ],
  "provenance": {
    "screenshot_model": "Hcompany/Holo-3.1-4B",
    "ocr_model": "PaddleOCR-VL",
    "extractor_version": "2026-06-06"
  }
}

This packet gives future agents something concrete to retrieve. It can be searched by visible text, UI layout, action type, tool result, task, URL, or visual similarity.
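As a rough sketch of how the packet above could be typed in code, the dataclasses below mirror a subset of its fields. The class names and the `search_fields` helper are illustrative assumptions, not part of any library; the helper only shows how each channel might contribute its own text to a separate index.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class UIElement:
    role: str
    text: str
    bbox: tuple[int, int, int, int]
    enabled: bool = True


@dataclass
class ObservationPacket:
    observation_id: str
    run_id: str
    step: int
    task: str
    url: str
    source_uri: str
    caption: str = ""
    ocr: list[str] = field(default_factory=list)
    ui_elements: list[UIElement] = field(default_factory=list)
    tool_outputs: list[dict[str, Any]] = field(default_factory=list)

    def search_fields(self) -> dict[str, list[str]]:
        """Return the text each channel would contribute to its own index."""
        return {
            "caption": [self.caption],
            "ocr": list(self.ocr),
            "ui_element": [
                f"{e.role}:{e.text}:{'enabled' if e.enabled else 'disabled'}"
                for e in self.ui_elements
            ],
            "tool_output": [f"{t['tool']}={t['result']}" for t in self.tool_outputs],
        }


# Same values as the JSON packet above.
packet = ObservationPacket(
    observation_id="obs_run_42_step_08",
    run_id="run_42",
    step=8,
    task="reset account password",
    url="https://app.example.com/settings",
    source_uri="s3://agent-runs/run_42/step_08.png",
    caption="Settings page with password form and disabled save button",
    ocr=["Current password", "New password", "Save"],
    ui_elements=[UIElement("button", "Save", (760, 690, 860, 735), enabled=False)],
    tool_outputs=[{"tool": "validate_password_policy", "result": "missing_special_character"}],
)
fields = packet.search_fields()
```

Keeping the per-channel text separate at this stage is what lets each index be queried and scored on its own later.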

Segment Runs by State Transitions

Document RAG chunks text. Video RAG chunks time. Computer-use memory chunks state transitions.

A state transition is the interval between one observation and the next:

screen_before + action + tool_output + screen_after

This is the useful retrieval unit because it preserves causality:

what was visible before the action

what action the agent chose

what tool or browser event executed

what changed on the screen

whether the result matched the intent

Store both single-state packets and transition packets.

Single-state packets answer:

"Find screens where the Save button was disabled."

"Find prior runs on the billing page."

"Find screenshots with error code AUTH-429."

Transition packets answer:

"Find cases where clicking Save did nothing."

"Find successful transitions from checkout to payment confirmation."

"Find runs where the agent clicked a hidden or disabled control."

Most failures are transition failures, not state failures. The screen looked plausible, but the next action was wrong.
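One way to derive transition packets is to pair consecutive observations within a run. The sketch below assumes each observation's `action_before` is the action that produced it, as in the packet shown earlier. The `Transition` class and the `changed` check are hypothetical, and comparing `visual_state` for equality is a deliberate simplification; a real system would likely use a visual or DOM diff.

```python
from dataclasses import dataclass


@dataclass
class Transition:
    run_id: str
    step: int
    action: dict
    before: dict
    after: dict

    @property
    def changed(self) -> bool:
        """True if the recorded visual state differs after the action."""
        return self.before["visual_state"] != self.after["visual_state"]


def build_transitions(observations: list[dict]) -> list[Transition]:
    """Pair each observation with the next one in the same run."""
    ordered = sorted(observations, key=lambda o: (o["run_id"], o["step"]))
    transitions = []
    for before, after in zip(ordered, ordered[1:]):
        if before["run_id"] != after["run_id"]:
            continue  # never link the end of one run to the start of another
        transitions.append(
            Transition(after["run_id"], after["step"], after["action_before"], before, after)
        )
    return transitions


observations = [
    {"run_id": "run_42", "step": 7, "action_before": {"type": "type", "target": "New password"},
     "visual_state": {"caption": "password form, Save disabled"}},
    {"run_id": "run_42", "step": 8, "action_before": {"type": "click", "target": "Save"},
     "visual_state": {"caption": "password form, Save disabled"}},
    {"run_id": "run_43", "step": 1, "action_before": {"type": "click", "target": "Save"},
     "visual_state": {"caption": "billing page"}},
]
transitions = build_transitions(observations)

# "Find cases where clicking Save did nothing."
no_op_saves = [t for t in transitions if t.action.get("target") == "Save" and not t.changed]
```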

Extract Multiple Channels

Do not force screenshots, actions, OCR, DOM, and tool calls into one vector. Use separate channels and fuse results later.

Screenshot Caption Channel

A vision-language model can summarize the screen:

page or app type

visible controls

layout

error messages

affordances

apparent state

This channel supports natural language queries like "modal open with a disabled submit button" or "pricing table where annual billing is selected."

OCR and UI Text Channel

OCR and accessibility text preserve exact strings. This matters for:

error codes

button labels

form labels

product names

amounts

table values

legal or compliance text

Dense visual embeddings are weak at exact strings. Keep OCR searchable with lexical retrieval and filters.
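To illustrate why exact strings need a lexical path, here is a minimal inverted index over OCR lines. It is a plain boolean AND match, not BM25, and the `OcrIndex` class is a hypothetical stand-in for whatever lexical engine is actually used. The tokenizer keeps hyphenated codes such as AUTH-429 as a single token so that a partial match on "auth" does not count.

```python
import re
from collections import defaultdict

# Keep hyphenated identifiers (for example AUTH-429) together as one token.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenize(text: str) -> list[str]:
    return [t.lower() for t in TOKEN.findall(text)]


class OcrIndex:
    def __init__(self) -> None:
        self._postings: dict[str, set[str]] = defaultdict(set)

    def add(self, observation_id: str, ocr_lines: list[str]) -> None:
        for line in ocr_lines:
            for token in tokenize(line):
                self._postings[token].add(observation_id)

    def search(self, query: str) -> set[str]:
        """Return observation ids whose OCR text contains every query token."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        return set.intersection(*(self._postings.get(t, set()) for t in tokens))


index = OcrIndex()
index.add("obs_1", ["Error AUTH-429", "Too many requests"])
index.add("obs_2", ["Error AUTH-401"])

exact_hits = index.search("AUTH-429")
shared_hits = index.search("error")
partial_hits = index.search("auth")
```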

UI Element Channel

When possible, extract structured UI elements:

role: button, input, link, menu, checkbox

text or accessible name

bounding box

enabled or disabled state

selected state

hierarchy

DOM selector or accessibility path

For web and desktop agents, this channel is often more reliable than pixels alone. For custom canvases, mobile screenshots, and older apps, you may need OCR and vision grounding as the fallback.
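Structured elements can be filtered directly, without any embedding. The helper below is a small sketch over packets shaped like the earlier example; `find_elements` is an illustrative name, and the second packet is made-up sample data.

```python
from typing import Optional


def find_elements(
    packets: list[dict],
    role: Optional[str] = None,
    text: Optional[str] = None,
    enabled: Optional[bool] = None,
) -> list[tuple[str, dict]]:
    """Return (observation_id, element) pairs matching every filter that is set."""
    matches = []
    for packet in packets:
        for element in packet["visual_state"]["ui_elements"]:
            if role is not None and element.get("role") != role:
                continue
            if text is not None and element.get("text", "").lower() != text.lower():
                continue
            if enabled is not None and element.get("enabled") != enabled:
                continue
            matches.append((packet["observation_id"], element))
    return matches


packets = [
    {"observation_id": "obs_run_42_step_08", "visual_state": {"ui_elements": [
        {"role": "button", "text": "Save", "bbox": [760, 690, 860, 735], "enabled": False},
    ]}},
    {"observation_id": "obs_run_43_step_03", "visual_state": {"ui_elements": [
        {"role": "button", "text": "Save", "bbox": [760, 690, 860, 735], "enabled": True},
        {"role": "link", "text": "Cancel", "bbox": [640, 690, 740, 735], "enabled": True},
    ]}},
]

# "Find screens where the Save button was disabled."
disabled_saves = find_elements(packets, role="button", text="save", enabled=False)
```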

Action and Tool Channel

Actions and tool outputs form the agent's operational memory:

click coordinates

typed text, redacted when sensitive

scroll direction

selected tool

tool arguments

tool output

safety checks

retry count

cancellation reason

This channel explains why a state exists. It is the difference between "the page showed an error" and "the page showed an error after the agent submitted a password that failed policy validation."
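Because this channel stores typed text, it is worth normalizing action records before they reach an index. The sketch below assumes actions carry `type`, `target`, and `text` keys; the hint list and the `[REDACTED]` placeholder are arbitrary choices for illustration, and matching on field names is only a first line of defense.

```python
# Hypothetical hints; tune these to the fields your applications actually use.
SENSITIVE_FIELD_HINTS = ("password", "token", "card", "cvv", "ssn")


def redact_action(action: dict) -> dict:
    """Return a copy of the action that is safe to index."""
    safe = dict(action)
    target = str(action.get("target", "")).lower()
    is_typing = action.get("type") == "type"
    if is_typing and any(hint in target for hint in SENSITIVE_FIELD_HINTS):
        safe["text"] = "[REDACTED]"
        safe["redacted"] = True
    return safe


typed = redact_action({"type": "type", "target": "New password", "text": "hunter2"})
clicked = redact_action({"type": "click", "target": "Save", "x": 812, "y": 718})
```

The original record is left untouched, so the raw trace can stay in restricted storage while only the redacted copy is indexed.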

Visual Embedding Channel

Visual embeddings support fuzzy search over layouts and states:

similar modal layouts

similar checkout failures

similar visual diff patterns

same UI in different themes or languages

repeated app states with different text

Use visual embeddings for recall. Use OCR, UI elements, and reranking for precision.
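For intuition, fuzzy screen search reduces to nearest-neighbor lookup over embedding vectors. The brute-force sketch below uses tiny made-up vectors and cosine similarity in pure Python; a production system would use a vector index, and the vectors would come from an embedding model rather than being typed by hand.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query: list[float], index: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Return the k most similar observation ids with their similarity."""
    scored = [(obs_id, cosine(query, vec)) for obs_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional embeddings, for illustration only.
index = {
    "modal_light_theme": [0.9, 0.1, 0.0],
    "modal_dark_theme": [0.8, 0.2, 0.1],
    "checkout_page": [0.0, 0.1, 0.9],
}
top = nearest([1.0, 0.1, 0.0], index, k=2)
top_ids = [obs_id for obs_id, _ in top]
```

The two modal screens rank above the checkout page even though their vectors differ, which is the recall behavior this channel is meant to provide.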

Use Multi-Index Retrieval

A computer-use memory search should query several indexes independently:

agent task or debug query
  -> screenshot caption search
  -> OCR/BM25 search
  -> UI element filter
  -> action/tool search
  -> visual embedding search
  -> fusion
  -> reranking
  -> evidence packet

The score scales are not comparable. BM25 over OCR, cosine similarity over screenshots, and a UI element filter do not produce the same kind of confidence. Start with rank-based fusion such as Reciprocal Rank Fusion, then rerank the top candidates with the full query and packet context.
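Reciprocal Rank Fusion needs only the rank of each result within its channel, which is why it sidesteps the score-scale problem. A minimal sketch follows; the channel names and observation ids are sample data, and `k=60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: dict[str, list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse per-channel ranked id lists into one list, best first.

    Each channel contributes 1 / (k + rank) for every id it returned.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, obs_id in enumerate(ranked_ids, start=1):
            scores[obs_id] += 1.0 / (k + rank)
    # Sort by descending score, then by id so ties are deterministic.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


rankings = {
    "ocr": ["obs_8", "obs_3"],
    "ui_element": ["obs_8", "obs_5"],
    "visual_embedding": ["obs_3", "obs_8", "obs_5"],
}
fused = reciprocal_rank_fusion(rankings)
fused_ids = [obs_id for obs_id, _ in fused]
```

An observation that appears near the top of several channels outranks one that is first in a single channel, and the fused list is then short enough to hand to a reranker.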

The result should tell the agent why it matched:

{
  "source_uri": "s3://agent-runs/run_42/step_08.png",
  "run_id": "run_42",
  "step": 8,
  "summary": "Password settings page with disabled Save button",
  "matched_channels": ["ocr", "ui_element", "tool_output", "visual_embedding"],
  "matched_evidence": [
    {"channel": "ocr", "text": "Save"},
    {"channel": "ui_element", "role": "button", "enabled": false},
    {"channel": "tool_output", "result": "missing_special_character"}
  ],
  "next_actions": ["inspect_neighboring_steps", "compare_successful_transition"]
}

The agent should not receive a vague memory blob. It should receive evidence with a screen, a step, a reason, and allowed follow-up actions.
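A result in that shape could be assembled from a packet plus the per-channel hits that matched it. The `build_evidence` function below is a hypothetical helper, and the fixed `next_actions` list simply copies the two follow-ups used in the example result.

```python
def build_evidence(packet: dict, channel_hits: dict[str, dict]) -> dict:
    """Combine a packet with the evidence each matching channel produced.

    channel_hits maps a channel name to the fields that matched in that channel.
    """
    channels = sorted(channel_hits)
    return {
        "source_uri": packet["source_uri"],
        "run_id": packet["run_id"],
        "step": packet["step"],
        "summary": packet["visual_state"]["caption"],
        "matched_channels": channels,
        "matched_evidence": [{"channel": c, **channel_hits[c]} for c in channels],
        "next_actions": ["inspect_neighboring_steps", "compare_successful_transition"],
    }


packet = {
    "source_uri": "s3://agent-runs/run_42/step_08.png",
    "run_id": "run_42",
    "step": 8,
    "visual_state": {"caption": "Settings page with password form and disabled save button"},
}
evidence = build_evidence(packet, {
    "ocr": {"text": "Save"},
    "ui_element": {"role": "button", "enabled": False},
    "tool_output": {"result": "missing_special_character"},
})
```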

Design the Agent Tool

Expose computer-use memory as a bounded tool.

{
  "tool": "search_computer_use_memory",
  "input_schema": {
    "query": "string",
    "run_filters": {
      "app": "string",
      "environment": "browser | desktop | mobile",
      "task": "string",
      "url_contains": "string"
    },
    "channels": ["screenshot", "ocr", "ui_element", "action", "tool_output"],
    "top_k": 20,
    "include_neighbor_steps": true,
    "budget_ms": 2000
  }
}

Make the tool return observations, not conclusions. The agent can use the observations to decide whether to retry, inspect a neighbor step, compare against a successful run, or ask a human for review.

Good tool outputs include:

screenshot URI or signed URL

step number and run ID

matched channels

UI element coordinates or selectors

action before and after

tool outputs that influenced the state

confidence and rank

provenance for every extracted feature

Bad tool outputs only say "similar screen found." That forces the model to infer too much.
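"Bounded" can be enforced on the server side of the tool. The sketch below validates a request shaped like the schema above, clamps `top_k` and `budget_ms`, and stops querying channels once the time budget is spent. The ceilings and the `searchers` mapping of channel name to callable are assumptions made for illustration.

```python
import time

ALLOWED_CHANNELS = ("screenshot", "ocr", "ui_element", "action", "tool_output")
MAX_TOP_K = 50        # hypothetical ceiling
MAX_BUDGET_MS = 5000  # hypothetical ceiling


def normalize_request(raw: dict) -> dict:
    """Validate the request and clamp its limits."""
    query = str(raw.get("query", "")).strip()
    if not query:
        raise ValueError("query is required")
    channels = list(raw.get("channels") or ALLOWED_CHANNELS)
    unknown = sorted(set(channels) - set(ALLOWED_CHANNELS))
    if unknown:
        raise ValueError(f"unknown channels: {unknown}")
    return {
        "query": query,
        "channels": channels,
        "top_k": max(1, min(int(raw.get("top_k", 20)), MAX_TOP_K)),
        "budget_ms": max(1, min(int(raw.get("budget_ms", 2000)), MAX_BUDGET_MS)),
    }


def search_with_budget(request: dict, searchers: dict) -> dict:
    """Query channels in order; skip the rest once the time budget is spent."""
    deadline = time.monotonic() + request["budget_ms"] / 1000
    results, skipped = {}, []
    for channel in request["channels"]:
        if channel not in searchers or time.monotonic() >= deadline:
            skipped.append(channel)
            continue
        results[channel] = searchers[channel](request["query"])[: request["top_k"]]
    return {"results": results, "skipped_channels": skipped}


request = normalize_request(
    {"query": "disabled save button", "channels": ["ocr", "action"], "top_k": 500}
)
response = search_with_budget(request, {"ocr": lambda query: ["obs_run_42_step_08"]})
```

Reporting skipped channels tells the agent that a missing match may mean "not searched" rather than "not found".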

Evaluate Memory Quality

Computer-use memory needs retrieval evals and trace evals.

Retrieval eval examples:

Find the prior run where the billing modal failed after submit.

Find the screenshot with error code AUTH-429.

Find successful states that look like this failed state.

Find the step before the agent clicked a disabled button.

Find cases where the tool result said success but the UI still showed pending.

Trace eval examples:

Did the agent call memory before retrying the same failed action?

Did it search the right channel: OCR for an error code, visual similarity for a layout?

Did it cite the exact step it used?

Did it inspect neighbor steps before concluding cause?

Did it stop when memory returned no high-confidence match?

Useful metrics:

Recall@k for known prior states.

nDCG by task class.

UI element match accuracy.

Step localization error.

Screenshot citation rate.

Unsupported visual claim rate.

Stale-context reuse rate.

Cost per successful retrieved observation.

P95 tool latency.

Cancellation rate for superseded searches.

The key is separating "retrieved the right packet" from "used the packet correctly." A memory system can retrieve the right evidence while the agent ignores it.
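The retrieval-side metrics are straightforward to compute once you have labeled queries. The functions below are a minimal sketch of Recall@k, nDCG@k, and step localization error; the ids and relevance labels are made-up sample data.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def ndcg_at_k(retrieved: list[str], gains: dict[str, float], k: int) -> float:
    """Discounted cumulative gain of the ranking, normalized by the ideal ordering."""
    dcg = sum(
        gains.get(obs_id, 0.0) / math.log2(rank + 1)
        for rank, obs_id in enumerate(retrieved[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 1) for rank, gain in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0


def step_localization_error(predicted_step: int, true_step: int) -> int:
    """How many steps away the cited step is from the labeled one."""
    return abs(predicted_step - true_step)


retrieved = ["obs_3", "obs_8", "obs_5", "obs_9"]
recall = recall_at_k(retrieved, {"obs_8", "obs_9", "obs_1"}, k=3)
ndcg = ndcg_at_k(retrieved, {"obs_8": 1.0}, k=3)
error = step_localization_error(predicted_step=7, true_step=8)
```

Trace-side checks, such as whether the agent cited the step it used, need the agent's transcript as well and are not covered by these functions.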

Security and Privacy Controls

Computer-use traces can contain sensitive data. Treat them as production data, not debug leftovers.

Design controls before indexing:

redact passwords, tokens, addresses, and payment fields

store screenshot crops instead of full screens when possible

attach tenant, user, run, and app ACLs to every packet

separate raw screenshots from derived features

expire temporary task memory

preserve audit lineage for who searched what

block retrieval across tenants by default

Memory is useful because it is searchable. That also makes access control more important.
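Two of the controls above, blocking cross-tenant retrieval by default and preserving audit lineage, can be sketched as a post-retrieval filter. The ACL shape (`tenant_id`, optional `user_ids`) and the audit record fields are assumptions; in practice the tenant filter should also be pushed into the index query itself so that other tenants' packets are never fetched.

```python
from datetime import datetime, timezone


def can_read(acl: dict, caller: dict) -> bool:
    """Deny by default: the tenant must match, then the optional user allowlist."""
    tenant = acl.get("tenant_id")
    if tenant is None or tenant != caller.get("tenant_id"):
        return False
    allowed_users = acl.get("user_ids")  # None means any user in the tenant
    return allowed_users is None or caller.get("user_id") in allowed_users


def filter_results(results: list[dict], caller: dict, query: str, audit_log: list[dict]) -> list[dict]:
    """Drop packets the caller may not see and record who searched what."""
    visible = [r for r in results if can_read(r.get("acl", {}), caller)]
    audit_log.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "tenant_id": caller.get("tenant_id"),
        "user_id": caller.get("user_id"),
        "query": query,
        "returned": [r["observation_id"] for r in visible],
    })
    return visible


results = [
    {"observation_id": "obs_1", "acl": {"tenant_id": "tenant_a", "user_ids": ["u1"]}},
    {"observation_id": "obs_2", "acl": {"tenant_id": "tenant_b"}},
    {"observation_id": "obs_3"},  # no ACL attached, so it is never returned
]
audit_log: list[dict] = []
visible = filter_results(results, {"tenant_id": "tenant_a", "user_id": "u1"}, "AUTH-429", audit_log)
```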

Mixpeek Implementation Pattern

Use object storage as the source of truth for screenshots, screen recordings, and trace artifacts. Then index each observation packet with multiple feature channels.

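from mixpeek import Mixpeek

# Replace the placeholder with your own API key.
mx = Mixpeek(api_key="YOUR_API_KEY")

# Index the screenshots stored under the run prefix, tagged with run metadata.
mx.ingest.images(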
    collection="computer_use_memory",
    source={"type": "s3", "bucket": "agent-runs", "prefix": "run_42/"},
    metadata={
        "run_id": "run_42",
        "task": "reset account password",
        "environment": "browser"
    },
    pipeline={
        "captioning": {"model": "mixpeek://image_extractor@v1/hcompany_holo_31_4b_v1"},
        "ocr": {"model": "mixpeek://image_extractor@v1/paddle_ocr_vl_16_v1"},
        "visual_embedding": {"model": "mixpeek://image_extractor@v1/qwen3_vl_embedding_2b_v1"}
    }
)

Then expose a retrieval tool to the agent:

results = mx.retrievers.execute(
    retriever_id="your-retriever-id",
    query="disabled save button after password policy validation error",
)

If you bring your own screenshot embeddings, MVS can store and search them on object storage. If you want Mixpeek to run the perception layer, Managed indexing can populate captions, OCR, visual embeddings, and trace features from the objects you already keep in storage.

Design Checklist

Store one observation packet per screen state.

Store transition packets for before-action-after sequences.

Keep screenshot, OCR, UI element, action, tool, and visual embedding channels separate.

Preserve run ID, step number, timestamp, source URI, model ID, and extractor version.

Use lexical retrieval for exact visible text and error codes.

Use visual embeddings for fuzzy screen similarity.

Fuse multiple channels before reranking.

Return evidence packets, not final answers.

Let agents request neighbor steps and successful comparable transitions.

Evaluate both retrieval quality and agent tool behavior.

Enforce ACLs, redaction, and retention before indexing.

Track latency, budget, cancellation, and stale-context reuse in production.

Key Takeaways

1. Computer-use traces are multimodal evidence, not just logs.

2. The core retrieval unit is the observation packet (screen state, action, tool output, and provenance), paired with transition packets that link consecutive states.

3. Screenshots need multiple indexes. Visual embeddings, OCR, UI elements, and action traces answer different questions.

4. Agent memory tools should return bounded evidence with citations and follow-up actions.

5. Evaluation must test retrieval, step localization, and whether the agent used the memory correctly.

6. Searchable memory makes agents more capable, but only if access control and redaction are designed into the index.
