    Agent Perception
    18 min read
    Updated 2026-06-06

    Computer-Use Agent Memory: How to Search Screens, Tools, and UI State

    A practical architecture guide for indexing computer-use agent traces. Learn how to turn screenshots, UI elements, actions, tool outputs, and state transitions into searchable memory with evidence, provenance, budgets, and evals.

    AI Agents
    Computer Use
    Agent Memory
    Screenshots
    Multimodal Search

    Why Computer-Use Agents Need Searchable Memory



    Computer-use agents operate through a loop:

    1. Observe the screen.
    2. Choose an action.
    3. Click, type, scroll, or call a tool.
    4. Observe the new screen.
    5. Decide what changed.

    That loop creates a rich trace, but most systems store it as logs and screenshots. Logs capture actions but miss visual state. Screenshots capture pixels but are hard to search. Browser events capture URLs but not intent. Tool outputs capture results but not whether the agent used them correctly.

    If the agent fails, the debugging question is rarely "what did the model say?" It is:

  1. Which screen state caused the bad action?
  2. Did the expected button exist before the click?
  3. Was the agent looking at stale context?
  4. Did a tool result contradict the visible UI?
  5. Has this failure happened before in a similar workflow?
  6. Which prior run contains the closest successful state?


    Those are retrieval questions over unstructured agent traces. A computer-use memory system turns every step into searchable evidence.

    The premise of this guide is direct: searchable computer-use memory lets AI agents see screens, inspect UI state, and search unstructured task traces.

    The Core Architecture



    Treat each agent step as an observation packet.

    computer-use run
      -> state transition
      -> observation packet
      -> multimodal indexes
      -> bounded retrieval tool
      -> evidence for the next agent step
    


    An observation packet is not just a screenshot. It combines what the agent saw, what it did, what changed, and how to cite the evidence.

    {
      "observation_id": "obs_run_42_step_08",
      "run_id": "run_42",
      "step": 8,
      "timestamp": "2026-06-06T15:20:11Z",
      "task": "reset account password",
      "environment": "browser",
      "source_uri": "s3://agent-runs/run_42/step_08.png",
      "url": "https://app.example.com/settings",
      "action_before": {
        "type": "click",
        "target": "Save",
        "x": 812,
        "y": 718
      },
      "visual_state": {
        "caption": "Settings page with password form and disabled save button",
        "ocr": ["Current password", "New password", "Save"],
        "ui_elements": [
          {"role": "button", "text": "Save", "bbox": [760, 690, 860, 735], "enabled": false}
        ]
      },
      "tool_outputs": [
        {"tool": "validate_password_policy", "result": "missing_special_character"}
      ],
      "provenance": {
        "screenshot_model": "Hcompany/Holo-3.1-4B",
        "ocr_model": "PaddleOCR-VL",
        "extractor_version": "2026-06-06"
      }
    }
    


    This packet gives future agents something concrete to retrieve. It can be searched by visible text, UI layout, action type, tool result, task, URL, or visual similarity.
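
A packet like the one above can be modeled as a typed record before it is indexed. The following is a minimal sketch, not a definitive schema: the class names `UIElement` and `ObservationPacket` and the `channel_texts` helper are assumptions made for illustration, and only a subset of the JSON fields is included.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class UIElement:
    role: str
    text: str
    bbox: tuple[int, int, int, int]  # x_min, y_min, x_max, y_max in pixels
    enabled: bool = True


@dataclass
class ObservationPacket:
    observation_id: str
    run_id: str
    step: int
    task: str
    source_uri: str
    caption: str = ""
    ocr: list[str] = field(default_factory=list)
    ui_elements: list[UIElement] = field(default_factory=list)
    tool_outputs: list[dict[str, Any]] = field(default_factory=list)

    def channel_texts(self) -> dict[str, str]:
        """Return one text blob per channel so each can be indexed separately."""
        return {
            "caption": self.caption,
            "ocr": " ".join(self.ocr),
            "ui_element": " ".join(f"{e.role}:{e.text}" for e in self.ui_elements),
            "tool_output": " ".join(str(t.get("result", "")) for t in self.tool_outputs),
        }


# Hypothetical usage: the same values as the JSON packet above.
packet = ObservationPacket(
    observation_id="obs_run_42_step_08",
    run_id="run_42",
    step=8,
    task="reset account password",
    source_uri="s3://agent-runs/run_42/step_08.png",
    caption="Settings page with password form and disabled save button",
    ocr=["Current password", "New password", "Save"],
    ui_elements=[UIElement("button", "Save", (760, 690, 860, 735), enabled=False)],
    tool_outputs=[{"tool": "validate_password_policy", "result": "missing_special_character"}],
)
texts = packet.channel_texts()
```

Keeping the per-channel text separate at this stage makes it easier to route each field to its own index later.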

    Segment Runs by State Transitions



    Document RAG chunks text. Video RAG chunks time. Computer-use memory chunks state transitions.

    A state transition is the interval between one observation and the next:

    screen_before + action + tool_output + screen_after
    


    This is the useful retrieval unit because it preserves causality:

  1. what was visible before the action
  2. what action the agent chose
  3. what tool or browser event executed
  4. what changed on the screen
  5. whether the result matched the intent

    Store both single-state packets and transition packets.

    Single-state packets answer:

  1. "Find screens where the Save button was disabled."
  2. "Find prior runs on the billing page."
  3. "Find screenshots with error code AUTH-429."

    Transition packets answer:

  1. "Find cases where clicking Save did nothing."
  2. "Find successful transitions from checkout to payment confirmation."
  3. "Find runs where the agent clicked a hidden or disabled control."

    Most failures are transition failures, not state failures. The screen looked plausible, but the next action was wrong.
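
One way to derive transition packets is to pair consecutive observations within a run. The sketch below is an illustration under simplifying assumptions: the `Observation` and `Transition` classes and the `build_transitions` function are hypothetical names, and "the screen changed" is approximated by comparing OCR text only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    run_id: str
    step: int
    ocr: list[str]
    action_before: Optional[dict] = None  # action that led to this screen


@dataclass
class Transition:
    run_id: str
    step_before: int
    step_after: int
    action: Optional[dict]
    screen_changed: bool


def build_transitions(observations: list[Observation]) -> list[Transition]:
    """Pair consecutive observations of one run into transition records."""
    ordered = sorted(observations, key=lambda o: o.step)
    transitions = []
    for before, after in zip(ordered, ordered[1:]):
        transitions.append(Transition(
            run_id=before.run_id,
            step_before=before.step,
            step_after=after.step,
            action=after.action_before,
            # Crude change signal: compare visible text only. A fuller
            # version would also diff UI elements or pixels.
            screen_changed=before.ocr != after.ocr,
        ))
    return transitions


# Hypothetical usage: find cases where clicking Save did nothing.
run = [
    Observation("run_42", 7, ["New password", "Save"]),
    Observation("run_42", 8, ["New password", "Save"],
                action_before={"type": "click", "target": "Save"}),
]
no_ops = [t for t in build_transitions(run) if t.action and not t.screen_changed]
```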

    Extract Multiple Channels



    Do not force screenshots, actions, OCR, DOM, and tool calls into one vector. Use separate channels and fuse results later.

    Screenshot Caption Channel



    A vision-language model can summarize the screen:

  1. page or app type
  2. visible controls
  3. layout
  4. error messages
  5. affordances
  6. apparent state

    This channel supports natural language queries like "modal open with a disabled submit button" or "pricing table where annual billing is selected."

    OCR and UI Text Channel



    OCR and accessibility text preserve exact strings. This matters for:

  1. error codes
  2. button labels
  3. form labels
  4. product names
  5. amounts
  6. table values
  7. legal or compliance text

    Dense visual embeddings are weak at exact strings. Keep OCR searchable with lexical retrieval and filters.
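
To make the exact-string point concrete, here is a minimal sketch of a lexical index over OCR lines. It does boolean AND matching with a small inverted index rather than BM25 ranking, and the `OcrIndex` class and `tokenize` function are hypothetical names for this example; a production system would more likely use an existing search engine.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    # Keep hyphenated codes such as "AUTH-429" as a single token.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


class OcrIndex:
    """Tiny inverted index: token -> set of observation IDs."""

    def __init__(self) -> None:
        self.postings: dict[str, set[str]] = defaultdict(set)

    def add(self, observation_id: str, ocr_lines: list[str]) -> None:
        for line in ocr_lines:
            for token in tokenize(line):
                self.postings[token].add(observation_id)

    def search(self, query: str) -> set[str]:
        """Return observations whose OCR text contains every query token."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        hits = [self.postings.get(token, set()) for token in tokens]
        return set.intersection(*hits)


# Hypothetical usage with two made-up observations.
index = OcrIndex()
index.add("obs_a", ["Sign in failed", "Error code AUTH-429"])
index.add("obs_b", ["Error code AUTH-430"])
exact = index.search("AUTH-429")
```

An exact token lookup distinguishes `AUTH-429` from `AUTH-430`, which is the kind of distinction a dense embedding tends to blur.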

    UI Element Channel



    When possible, extract structured UI elements:

  1. role: button, input, link, menu, checkbox
  2. text or accessible name
  3. bounding box
  4. enabled or disabled state
  5. selected state
  6. hierarchy
  7. DOM selector or accessibility path

    For web and desktop agents, this channel is often more reliable than pixels alone. For custom canvases, mobile screenshots, and older apps, you may need OCR and vision grounding as the fallback.
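
Structured elements can be filtered directly rather than searched semantically. The sketch below assumes packets are plain dictionaries shaped like the JSON example earlier in this guide; `find_elements` is a hypothetical helper, not an API from any library.

```python
from typing import Iterator, Optional


def find_elements(
    packets: list[dict],
    role: Optional[str] = None,
    text: Optional[str] = None,
    enabled: Optional[bool] = None,
) -> Iterator[tuple[str, dict]]:
    """Yield (observation_id, element) pairs matching every given constraint."""
    for packet in packets:
        elements = packet.get("visual_state", {}).get("ui_elements", [])
        for element in elements:
            if role is not None and element.get("role") != role:
                continue
            if text is not None and element.get("text") != text:
                continue
            if enabled is not None and element.get("enabled") != enabled:
                continue
            yield packet["observation_id"], element


# Hypothetical usage: find screens where the Save button was disabled.
packets = [
    {"observation_id": "obs_run_42_step_08", "visual_state": {"ui_elements": [
        {"role": "button", "text": "Save", "bbox": [760, 690, 860, 735], "enabled": False}]}},
    {"observation_id": "obs_run_42_step_09", "visual_state": {"ui_elements": [
        {"role": "button", "text": "Save", "bbox": [760, 690, 860, 735], "enabled": True}]}},
]
disabled_save = list(find_elements(packets, role="button", text="Save", enabled=False))
```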

    Action and Tool Channel



    Actions and tool outputs form the agent's operational memory:

  1. click coordinates
  2. typed text, redacted when sensitive
  3. scroll direction
  4. selected tool
  5. tool arguments
  6. tool output
  7. safety checks
  8. retry count
  9. cancellation reason

    This channel explains why a state exists. It is the difference between "the page showed an error" and "the page showed an error after the agent submitted a password that failed policy validation."

    Visual Embedding Channel



    Visual embeddings support fuzzy search over layouts and states:

  1. similar modal layouts
  2. similar checkout failures
  3. similar visual diff patterns
  4. same UI in different themes or languages
  5. repeated app states with different text

    Use visual embeddings for recall. Use OCR, UI elements, and reranking for precision.
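
As a rough illustration of the recall step, the sketch below ranks stored screenshot embeddings by cosine similarity to a query embedding. The three-dimensional vectors are made-up toy values, and `cosine` and `nearest` are hypothetical helpers; real embeddings come from a vision model and are usually searched with an approximate nearest-neighbor index rather than a full scan.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(query: list[float], index: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Return the k stored observations most similar to the query vector."""
    scored = [(cosine(query, vector), obs_id) for obs_id, vector in index.items()]
    scored.sort(reverse=True)
    return [(obs_id, score) for score, obs_id in scored[:k]]


# Toy embeddings: the two modal screens point in a similar direction.
embeddings = {
    "modal_light_theme": [0.9, 0.1, 0.0],
    "modal_dark_theme": [0.8, 0.2, 0.1],
    "checkout_page": [0.0, 0.2, 0.9],
}
candidates = nearest([1.0, 0.1, 0.0], embeddings, k=2)
```

The candidates returned here would then be narrowed by OCR, UI element filters, or a reranker.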

    Use Multi-Index Retrieval



    A computer-use memory search should query several indexes independently:

    agent task or debug query
      -> screenshot caption search
      -> OCR/BM25 search
      -> UI element filter
      -> action/tool search
      -> visual embedding search
      -> fusion
      -> reranking
      -> evidence packet
    


    The score scales are not comparable. BM25 over OCR, cosine similarity over screenshots, and a UI element filter do not produce the same kind of confidence. Start with rank-based fusion such as Reciprocal Rank Fusion, then rerank the top candidates with the full query and packet context.
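
Reciprocal Rank Fusion scores each candidate as the sum of `1 / (k + rank)` over every ranked list it appears in, so only rank positions matter and raw scores are never compared. The sketch below is a minimal version; the channel names and observation IDs are illustrative, and `k = 60` is the commonly used default constant rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: dict[str, list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse per-channel ranked ID lists into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, obs_id in enumerate(ranked_ids, start=1):
            scores[obs_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for deterministic output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical ranked results from three channels.
rankings = {
    "ocr_bm25": ["obs_08", "obs_03", "obs_11"],
    "ui_element": ["obs_08", "obs_11"],
    "visual_embedding": ["obs_03", "obs_08", "obs_20"],
}
fused = reciprocal_rank_fusion(rankings)
```

In this example `obs_08` ranks first because it appears near the top of all three lists, even though no channel's raw score was consulted.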

    The result should tell the agent why it matched:

    {
      "source_uri": "s3://agent-runs/run_42/step_08.png",
      "run_id": "run_42",
      "step": 8,
      "summary": "Password settings page with disabled Save button",
      "matched_channels": ["ocr", "ui_element", "tool_output"],
      "matched_evidence": [
        {"channel": "ocr", "text": "Save"},
        {"channel": "ui_element", "role": "button", "enabled": false},
        {"channel": "tool_output", "result": "missing_special_character"}
      ],
      "next_actions": ["inspect_neighboring_steps", "compare_successful_transition"]
    }
    


    The agent should not receive a vague memory blob. It should receive evidence with a screen, a step, a reason, and allowed follow-up actions.

    Design the Agent Tool



    Expose computer-use memory as a bounded tool.

    {
      "tool": "search_computer_use_memory",
      "input_schema": {
        "query": "string",
        "run_filters": {
          "app": "string",
          "environment": "browser | desktop | mobile",
          "task": "string",
          "url_contains": "string"
        },
        "channels": ["screenshot", "ocr", "ui_element", "action", "tool_output"],
        "top_k": 20,
        "include_neighbor_steps": true,
        "budget_ms": 2000
      }
    }


    Make the tool return observations, not conclusions. The agent can use the observations to decide whether to retry, inspect a neighbor step, compare against a successful run, or ask a human for review.

    Good tool outputs include:

  1. screenshot URI or signed URL
  2. step number and run ID
  3. matched channels
  4. UI element coordinates or selectors
  5. action before and after
  6. tool outputs that influenced the state
  7. confidence and rank
  8. provenance for every extracted feature

    Bad tool outputs only say "similar screen found." That forces the model to infer too much.
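
What "bounded" can mean in practice is sketched below: the tool clamps `top_k`, checks a time budget before each channel, and reports which channels it skipped instead of silently returning less. This is an illustration under assumptions: `MAX_TOP_K`, the `channel_searchers` argument, and the response fields `skipped_channels` and `partial` are hypothetical, and the budget check happens only between channels, so it does not interrupt a slow search already in progress.

```python
import time
from typing import Callable

MAX_TOP_K = 50  # hypothetical hard cap, regardless of what the agent requests


def search_computer_use_memory(
    query: str,
    channel_searchers: dict[str, Callable[[str], list[str]]],
    top_k: int = 20,
    budget_ms: int = 2000,
) -> dict:
    """Run each channel search until the time budget is spent."""
    top_k = max(1, min(top_k, MAX_TOP_K))
    deadline = time.monotonic() + budget_ms / 1000.0
    results: dict[str, list[str]] = {}
    skipped: list[str] = []
    for channel, searcher in channel_searchers.items():
        if time.monotonic() >= deadline:
            skipped.append(channel)  # out of budget: report it, do not hide it
            continue
        results[channel] = searcher(query)[:top_k]
    return {
        "query": query,
        "results": results,
        "skipped_channels": skipped,
        "partial": bool(skipped),
    }


# Hypothetical usage with a stub searcher that returns 100 IDs.
stub = {"ocr": lambda q: [f"obs_{i}" for i in range(100)]}
bounded = search_computer_use_memory("disabled save button", stub, top_k=1000)
exhausted = search_computer_use_memory("disabled save button", stub, budget_ms=0)
```

Returning the `partial` flag lets the agent decide whether to retry with a larger budget rather than treating an empty result as "no match."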

    Evaluate Memory Quality



    Computer-use memory needs retrieval evals and trace evals.

    Retrieval eval examples:

  1. Find the prior run where the billing modal failed after submit.
  2. Find the screenshot with error code AUTH-429.
  3. Find successful states that look like this failed state.
  4. Find the step before the agent clicked a disabled button.
  5. Find cases where the tool result said success but the UI still showed pending.

    Trace eval examples:

  1. Did the agent call memory before retrying the same failed action?
  2. Did it search the right channel, such as OCR for an error code or visual similarity for a layout?
  3. Did it cite the exact step it used?
  4. Did it inspect neighbor steps before concluding cause?
  5. Did it stop when memory returned no high-confidence match?

    Useful metrics:

  1. Recall@k for known prior states.
  2. nDCG by task class.
  3. UI element match accuracy.
  4. Step localization error.
  5. Screenshot citation rate.
  6. Unsupported visual claim rate.
  7. Stale-context reuse rate.
  8. Cost per successful retrieved observation.
  9. P95 tool latency.
  10. Cancellation rate for superseded searches.

    The key is separating "retrieved the right packet" from "used the packet correctly." A memory system can retrieve the right evidence while the agent ignores it.
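
A few of the retrieval metrics above can be computed with short functions. The sketch below shows Recall@k and nDCG@k in their standard forms; `step_localization_error` is defined here as the absolute distance between the cited step and the labeled step, which is an assumption about how that metric would be measured. The IDs and labels are made up.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def ndcg_at_k(retrieved: list[str], gains: dict[str, float], k: int) -> float:
    """Discounted cumulative gain of the top k, normalized by the ideal ordering."""
    dcg = sum(
        gains.get(obs_id, 0.0) / math.log2(rank + 1)
        for rank, obs_id in enumerate(retrieved[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0


def step_localization_error(cited_step: int, labeled_step: int) -> int:
    """Assumed definition: how many steps away the cited step is from the label."""
    return abs(cited_step - labeled_step)


# Hypothetical eval case: two relevant packets, one retrieved at rank 2.
retrieved = ["obs_03", "obs_08", "obs_20"]
recall = recall_at_k(retrieved, {"obs_08", "obs_11"}, k=3)
ndcg = ndcg_at_k(retrieved, {"obs_08": 1.0, "obs_11": 1.0}, k=3)
```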

    Security and Privacy Controls



    Computer-use traces can contain sensitive data. Treat them as production data, not debug leftovers.

    Design controls before indexing:

  1. redact passwords, tokens, addresses, and payment fields
  2. store screenshot crops instead of full screens when possible
  3. attach tenant, user, run, and app ACLs to every packet
  4. separate raw screenshots from derived features
  5. expire temporary task memory
  6. preserve audit lineage for who searched what
  7. block retrieval across tenants by default

    Memory is useful because it is searchable. That also makes access control more important.
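
Two of these controls, redaction before indexing and tenant isolation at query time, can be sketched in a few lines. The field-name hints, the `acl` layout, and both function names below are assumptions for illustration; real redaction usually needs more than keyword matching on field names.

```python
SENSITIVE_FIELD_HINTS = ("password", "token", "card", "cvv")  # illustrative, not exhaustive


def redact_typed_text(action: dict) -> dict:
    """Mask typed text when the target field name looks sensitive."""
    target = str(action.get("target", "")).lower()
    is_sensitive = any(hint in target for hint in SENSITIVE_FIELD_HINTS)
    if action.get("type") == "type" and is_sensitive:
        return {**action, "text": "[REDACTED]"}
    return action


def visible_to(packet: dict, tenant_id: str, user_id: str) -> bool:
    """Deny by default: a packet without a matching tenant ACL is never returned."""
    acl = packet.get("acl", {})
    if acl.get("tenant_id") != tenant_id:
        return False
    allowed_users = acl.get("user_ids")  # None means any user in the tenant
    return allowed_users is None or user_id in allowed_users


# Hypothetical usage.
safe_action = redact_typed_text({"type": "type", "target": "New password", "text": "hunter2"})
packet = {"observation_id": "obs_1", "acl": {"tenant_id": "tenant_a", "user_ids": ["u1"]}}
```

Because `visible_to` fails closed, a packet indexed without an ACL is hidden from every caller rather than exposed to all of them.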

    Mixpeek Implementation Pattern



    Use object storage as the source of truth for screenshots, screen recordings, and trace artifacts. Then index each observation packet with multiple feature channels.

    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="YOUR_API_KEY")

    mx.ingest.images(
        collection="computer_use_memory",
        source={"type": "s3", "bucket": "agent-runs", "prefix": "run_42/"},
        metadata={
            "run_id": "run_42",
            "task": "reset account password",
            "environment": "browser"
        },
        pipeline={
            "captioning": {"model": "mixpeek://image_extractor@v1/hcompany_holo_31_4b_v1"},
            "ocr": {"model": "mixpeek://image_extractor@v1/paddle_ocr_vl_16_v1"},
            "visual_embedding": {"model": "mixpeek://image_extractor@v1/qwen3_vl_embedding_2b_v1"}
        }
    )


    Then expose a retrieval tool to the agent:

    results = mx.retrievers.retrieve(
        retriever_id="computer-use-memory",
        queries=[
            {
                "type": "text",
                "value": "disabled save button after password policy validation error"
            }
        ],
        filters={
            "environment": {"eq": "browser"},
            "task": {"eq": "reset account password"}
        },
        stages=[
            {"name": "ocr_bm25", "top_k": 100},
            {"name": "ui_element_filter", "top_k": 100},
            {"name": "visual_embedding", "top_k": 100},
            {"name": "rrf_fusion", "top_k": 40},
            {"name": "vlm_rerank", "top_k": 10}
        ],
        include_context=True,
        budget_ms=2000
    )
    


    If you bring your own screenshot embeddings, MVS can store and search them on object storage. If you want Mixpeek to run the perception layer, Managed indexing can populate captions, OCR, visual embeddings, and trace features from the objects you already keep in storage.

    Design Checklist



  1. Store one observation packet per screen state.
  2. Store transition packets for before-action-after sequences.
  3. Keep screenshot, OCR, UI element, action, tool, and visual embedding channels separate.
  4. Preserve run ID, step number, timestamp, source URI, model ID, and extractor version.
  5. Use lexical retrieval for exact visible text and error codes.
  6. Use visual embeddings for fuzzy screen similarity.
  7. Fuse multiple channels before reranking.
  8. Return evidence packets, not final answers.
  9. Let agents request neighbor steps and successful comparable transitions.
  10. Evaluate both retrieval quality and agent tool behavior.
  11. Enforce ACLs, redaction, and retention before indexing.
  12. Track latency, budget, cancellation, and stale-context reuse in production.

    Key Takeaways



    1. Computer-use traces are multimodal evidence, not just logs.

    2. The best retrieval unit is the observation packet: screen state, action, tool output, and provenance.

    3. Screenshots need multiple indexes. Visual embeddings, OCR, UI elements, and action traces answer different questions.

    4. Agent memory tools should return bounded evidence with citations and follow-up actions.

    5. Evaluation must test retrieval, step localization, and whether the agent used the memory correctly.

    6. Searchable memory makes agents more capable, but only if access control and redaction are designed into the index.

    Further Reading



  1. Object Decomposition and Layered Indexing
  2. Agent Perception Evals
  3. Retrieval Control Planes for AI Agents
  4. MCP Tool Design for Multimodal Search
  5. OpenAI Computer Use
  6. OpenAI Agent Evals
  7. Model Context Protocol tools
  8. Holo-3.1-4B on Hugging Face