A curated research index for visual agents that perceive, ground, plan, act, create, and evaluate in visually grounded environments.
Visual agents occupy the intersection of multimodal perception, grounded reasoning, tool use, interaction, and control. This repository curates papers, benchmarks, datasets, runtimes, and engineering resources for systems that close the loop between visual observation and purposeful action.
The list is selective rather than exhaustive. It prioritizes works that introduce a clear agent loop, action space, evaluation protocol, data engine, safety finding, or reusable implementation artifact, while excluding generic multimodal models and one-shot visual-generation systems without an agentic mechanism.
- Selection Boundary
- Research Taxonomy
- Curation Rubric
- Reading Pathways
- Recent Additions
- Research Map
- Benchmarks and Environments
- Skills, Tools, and Engineering Resources
- Workflow Stacks
- Official Docs and Engineering Notes
- Related Lists
- Contributing
- Maintenance Policy
- Citation
Included areas:
- GUI, web, desktop, and mobile agents that perceive screens and produce executable actions.
- Visual grounding work that is clearly tied to downstream agent control.
- Embodied vision-language-action systems for robot manipulation, navigation, and physical-world interaction.
- Agentic visual reasoning and generation systems with search, planning, memory, tools, critique, or iterative refinement.
- Benchmarks, data engines, simulators, safety suites, and toolchains that support visual-agent construction and evaluation.
Excluded by default:
- Broad multimodal foundation models with no visual-agent evaluation.
- Generic OCR, captioning, visual question answering, or layout parsing without an action or agent setting.
- Image, video, or 3D generators that are only prompt-in/artifact-out.
- Unverified arXiv IDs, placeholder-looking entries, product rumors, and duplicate rows.
| Track | Research question | Representative works |
|---|---|---|
| Screen grounding | Can the model localize text, widgets, controls, and regions well enough to act? | Set-of-Mark, SeeClick, OmniParser, UGround, ScreenSpot-Pro, GUI-Eyes |
| Computer use | Can the agent complete tasks in real websites, desktops, or phones over multiple steps? | WebArena, WebLINX, AppAgent, Mobile-Agent, OSWorld, AndroidWorld, Agent S, UI-TARS |
| Embodied VLA | Can visual observations and language be converted into safe physical actions? | PerAct, VIMA, RT-1, RT-2, Open X-Embodiment, OpenVLA, Pi-Zero, Magma |
| Agentic reasoning and creation | Can the system plan, search, critique, edit, or generate visual artifacts through a loop? | VISPROG, ViperGPT, GenArtist, DeepEyes, Agent Banana, VisionCreator, GEMS |
| Reliability and safety | Can we measure brittleness, privacy risk, prompt injection, unsafe actions, and deployment readiness? | VPI-Bench, OpenAgentSafety, OS-BLIND, HazardArena, UI-CUBE |
| Infrastructure | Which tools and environments support reproducible training, deployment, and evaluation? | BrowserGym, AgentLab, Stagehand, Playwright MCP, Agent S, Cua, OpenCUA, ScaleCUA, LeRobot |
An item should usually satisfy at least one of these conditions:
- It defines a new visual-agent capability, benchmark, data engine, training recipe, runtime, or safety evaluation.
- It evaluates closed-loop behavior rather than only static recognition or one-shot generation.
- It is widely used as a baseline, benchmark, dataset, environment, or builder tool.
- It has a stable paper, official code, project page, or documentation that readers can inspect.
An item is removed or left out when the visual-agent connection is weak, the link is unverifiable, the arXiv ID is wrong, the row duplicates a better entry, or the contribution is mostly a product announcement without enough technical detail.
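The rubric above can be read as a simple decision rule: at least one inclusion condition, and no removal reason. The sketch below is one hypothetical encoding of that rule; the `Candidate` record and its field names are illustrative assumptions, not part of any tooling in this repository, and the rubric itself says "usually", so real curation still involves judgment.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical review record; every field name here is illustrative."""
    name: str
    # Inclusion conditions (any one is usually enough).
    defines_new_artifact: bool = False   # capability, benchmark, data engine, recipe, runtime, safety eval
    evaluates_closed_loop: bool = False  # closed-loop behavior, not only static recognition
    widely_used: bool = False            # baseline, benchmark, dataset, environment, builder tool
    has_stable_source: bool = False      # stable paper, official code, project page, or docs
    # Removal reasons (any one excludes the item).
    weak_agent_connection: bool = False
    link_unverifiable: bool = False
    wrong_arxiv_id: bool = False
    duplicates_better_entry: bool = False
    announcement_only: bool = False


def should_include(c: Candidate) -> bool:
    """True when some inclusion condition holds and no removal reason applies."""
    meets_any = any([
        c.defines_new_artifact,
        c.evaluates_closed_loop,
        c.widely_used,
        c.has_stable_source,
    ])
    has_removal_reason = any([
        c.weak_agent_connection,
        c.link_unverifiable,
        c.wrong_arxiv_id,
        c.duplicates_better_entry,
        c.announcement_only,
    ])
    return meets_any and not has_removal_reason
```

Treating the removal reasons as a hard veto mirrors how the rubric is worded: a widely used item is still dropped if its row duplicates a better entry.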
GUI and computer use. Start with SeeClick, OmniParser, OSWorld, UI-TARS, Agent S2, OpenCUA, and UI-Copilot.
Mobile GUI agents. Read Android in the Wild, MM-Navigator, AppAgent, Mobile-Agent, Mobile-Agent-v2, A3, and MemGUI-Bench.
Grounding and perception. Read ScreenAI, Ferret-UI, UGround, ScreenSpot-Pro, GUI-Actor, Phi-Ground, GUI-Eyes, and UI-Zoomer.
Embodied VLA. Start with PerAct, VIMA, RT-1, PaLM-E, RT-2, Open X-Embodiment, OpenVLA, Pi-Zero, Magma, and World-Value-Action.
Agentic visual reasoning and creation. Read VISPROG, Visual ChatGPT, ViperGPT, LLaVA-Plus, GenArtist, DeepEyes, Agent Banana, VisionCreator, and GEMS.
| Work | Date | Contribution / Relevance |
|---|---|---|
| GUI-Eyes | 2026-01 | Active visual perception for GUI grounding with learned crop/zoom tool use. |
| ShowUI-Aloha | 2026-01 | Converts human screen recordings into structured GUI-agent supervision. |
| OS-Symphony | 2026-01 | Holistic framework for robust computer-using agents. |
| ActionEngine | 2026-02 | Uses state-machine memory to make GUI agents more programmatic and recoverable. |
| Agent Banana | 2026-02 | Agentic image editing with planning and tool execution rather than one-shot editing. |
| SAGE | 2026-02 | Agentic 3D scene generation for embodied-AI policy training. |
| CUA-Suite | 2026-03 | Large human-annotated video demonstrations for computer-use agents. |
| GEMS | 2026-03 | Multimodal generation loop with memory, skills, and iterative agent refinement. |
| UI-Copilot | 2026-04 | Long-horizon GUI automation with tool-integrated policy optimization. |
| UI-Zoomer | 2026-04 | Uncertainty-driven zoom-in for hard GUI grounding cases. |
| DynamicGUIBench | 2026-04 | Evaluates GUI agents in high-dynamic interfaces rather than static screenshots. |
| Video2GUI | 2026-05 | Synthesizes GUI interaction trajectories from instructional videos. |
| UI-Verse | 2026-05 | Studies interface design heuristics that improve computer-use-agent reliability. |
| Securing Computer-Use Agents | 2026-05 | Connects CUA architecture, lifecycle, permission scope, and runtime reliability. |
| Don't Click That | 2026-05 | Deception-aware web-agent benchmark and defense for misleading interface elements. |
| VLAs-as-Tools | 2026-05-13 | Long-horizon embodied-agent strategy that delegates bounded physical subtasks to specialized VLA tools. |
| SaaS-Bench | 2026-05-15 | Real-world SaaS workflow benchmark for long-horizon computer-use agents. |
| Work | Year | Links | Contribution / Relevance |
|---|---|---|---|
| A Comprehensive Survey of Agents for Computer Use | 2025 | paper | Broad map of computer-use-agent domains, agent loops, and evaluation bottlenecks. |
| GUI Agents: A Survey | 2024 | paper | Practical survey of GUI-agent architectures, datasets, benchmarks, and failure modes. |
| A Survey on (M)LLM-Based GUI Agents | 2025 | paper | Focused entry point for planning, grounding, memory, and GUI-agent evaluation. |
| Towards Trustworthy GUI Agents | 2025 | paper | Reliability and safety framing for deployment-facing GUI agents. |
| Large Multimodal Agents: A Survey | 2024 | paper | Contextual background on LLM-driven multimodal agent components. |
| A Survey on Vision-Language-Action Models for Embodied AI | 2024 | paper | Early VLA survey covering embodied perception, planning, and action. |
| Vision-Language-Action in Robotics | 2026 | paper | Data-centric survey of VLA datasets, benchmarks, and data engines. |
| Vision-Language-Action Safety | 2026 | paper | Focused taxonomy of threats, evaluations, and defenses for VLA systems. |
| Safety in Embodied AI | 2026 | paper | Wider safety survey across perception, planning, action, and interaction. |
| Visual Generation in the New Era | 2026 | paper | Conceptual lens for when visual generation becomes agentic world modeling. |
| Securing Computer-Use Agents | 2026 | paper | Deployment-grounded view of CUA reliability across architecture, lifecycle, permissions, and oversight. |
| Work | Year | Links | Contribution / Relevance |
|---|---|---|---|
| CogAgent | 2023 | paper, code | Early high-resolution VLM built explicitly for GUI understanding and navigation. |
| Set-of-Mark Prompting | 2023 | paper, code | Simple visual marking strategy that became a practical grounding primitive for LMM agents. |
| SeeClick | 2024 | paper, code | Shows that GUI grounding is a core bottleneck for visual GUI agents. |
| ScreenAI | 2024 | paper | Strong foundation for screen, document, infographic, and layout-heavy visual understanding. |
| Ferret-UI | 2024 | paper, code | Region-aware mobile UI understanding with explicit grounding. |
| OmniParser | 2024 | paper, code | Practical screenshot-to-interactable-region parser for pure-vision GUI agents. |
| UGround | 2024 | paper, code | Strong pure-vision grounding baseline without relying on accessibility trees. |
| OS-ATLAS | 2024 | paper | Foundation action model for generalist GUI agents. |
| ShowUI | 2024 | paper, code | Unifies screenshot-conditioned GUI perception and action modeling. |
| Aguvis | 2024 | paper | Pure-vision GUI agent direction with autonomous interface interaction. |
| UI-E2I-Synth | 2025 | paper | Synthetic instruction pipeline for scaling GUI grounding supervision. |
| ScreenSpot-Pro | 2025 | paper | Hard high-resolution grounding benchmark for professional computer-use screens. |
| GUI-G1 | 2025 | paper, code | Careful analysis of RL pitfalls in GUI grounding. |
| Enhancing Visual Grounding via Self-Evolutionary RL | 2025 | paper | Data-efficient RL recipe for high-resolution GUI grounding. |
| GUI-Actor | 2025 | paper | Coordinate-free grounding with an action head and verifier. |
| Phi-Ground | 2025 | paper | Strong empirical report on training compact GUI grounding models. |
| Test-Time RL for GUI Grounding | 2025 | paper | Test-time adaptation using region consistency. |
| Explicit Position-to-Coordinate Mapping | 2025 | paper | Addresses coordinate generation as a concrete grounding bottleneck. |
| GUI-Eyes | 2026 | paper | Learns when and how to call visual tools such as crop and zoom. |
| UI-Zoomer | 2026 | paper, code | Uses uncertainty to decide where to zoom for GUI grounding. |
| Work | Year | Links | Contribution / Relevance |
|---|---|---|---|
| Mind2Web | 2023 | paper | Foundational benchmark for generalist web agents. |
| Android in the Wild | 2023 | paper | Large-scale Android device-control dataset with realistic gestures. |
| WebArena | 2023 | paper, code | Realistic web-agent environment with execution-based tasks. |
| AutoDroid | 2023 | paper | Early Android task-automation system and benchmark that remains relevant as a mobile-agent baseline. |
| MM-Navigator | 2023 | paper | Early GPT-4V smartphone GUI navigation agent with zero-shot screen interaction. |
| AppAgent | 2023 | paper | Smartphone agent that learns app operation from autonomous exploration or demonstrations. |
| SeeAct | 2024 | paper | Web agent showing why grounding matters for GPT-4V-style agents. |
| Mobile-Agent | 2024 | paper, code | Vision-centric mobile device agent using visual perception tools and stepwise planning. |
| VisualWebArena | 2024 | paper, code | Adds visually grounded tasks to realistic web-agent evaluation. |
| WebVoyager | 2024 | paper | End-to-end multimodal web agent evaluated on live websites. |
| WebLINX | 2024 | paper, project | Large benchmark of multi-turn conversational web navigation with screenshots and action history. |
| OmniACT | 2024 | paper | Desktop and web benchmark where agents generate executable automation scripts. |
| WorkArena | 2024 | paper, code | Enterprise workflow benchmark for knowledge-work agents. |
| MMInA | 2024 | paper, code | Multihop multimodal Internet-agent benchmark on evolving real websites. |
| B-MoCA | 2024 | paper | Mobile device-control benchmark across diverse configurations. |
| OSWorld | 2024 | paper, code | Flagship benchmark for open-ended tasks in real desktop environments. |
| AndroidWorld | 2024 | paper, code | Dynamic Android benchmark with broad task diversity. |
| Mobile-Agent-v2 | 2024 | paper, code | Multi-agent mobile operation assistant with planning, decision, and reflection roles. |
| MobileAgentBench | 2024 | paper | Practical benchmark for mobile LLM agents. |
| WebCanvas | 2024 | paper | Online web-agent benchmark and framework built around Mind2Web-Live. |
| Agent S | 2024 | paper, code | Open agentic framework for using computers through GUI actions. |
| Windows Agent Arena | 2024 | paper, code | Scalable evaluation environment for Windows OS agents. |
| SPA-Bench | 2024 | paper | Comprehensive smartphone-agent evaluation benchmark. |
| AndroidLab | 2024 | paper | Android training and benchmarking environment with virtual devices and task suites. |
| VideoWebArena | 2024 | paper | Long-context video understanding inside web-agent workflows. |
| MageBench | 2024 | paper, code | Lightweight visual-agent benchmark covering WebUI, Sokoban, and Football environments. |
| UI-TARS | 2025 | paper | Native GUI-agent model trained for perception, grounding, and action. |
| A3 | 2025 | paper, project | Android Agent Arena for online mobile GUI-agent evaluation across real apps. |
| Agent S2 | 2025 | paper, code | Generalist-specialist framework for computer-use agents. |
| UI-Evol | 2025 | paper | Plug-in knowledge-evolution module that improves OSWorld execution reliability for CUAs. |
| ZeroGUI | 2025 | paper | Online GUI-agent learning with task generation and reward estimation. |
| OpenCUA | 2025 | paper, code | Open foundation stack for computer-use agents. |
| ScaleCUA | 2025 | paper, code | Cross-platform data scaling for open-source computer-use agents. |
| OmegaUse | 2026 | paper | General-purpose GUI agent for autonomous task execution. |
| MemGUI-Bench | 2026 | paper | Evaluates memory across mobile GUI sessions and changing environments. |
| SecAgent | 2026 | paper | Efficient 3B mobile GUI agent with semantic context compression and Chinese mobile data. |
| OS-Symphony | 2026 | paper, code | Framework for robust and generalist computer-use agents. |
| ActionEngine | 2026 | paper | State-machine memory for more structured GUI automation. |
| ContractSkill | 2026 | paper | Treats web-agent skills as repairable contracts that can be verified and reused. |
| UI-Copilot | 2026 | paper, code | Long-horizon GUI automation with tool-integrated policy optimization. |
| ClawGUI | 2026 | paper | Unified framework for training, evaluating, and deploying GUI agents. |
| RiskWebWorld | 2026 | paper | Realistic interactive benchmark for e-commerce risk-management GUI agents. |
| DynamicGUIBench | 2026 | paper | Stress-tests agents in dynamic, evolving GUI environments. |
| UI-Verse | 2026 | paper | Interface-design perspective on making CUAs more reliable. |
| SaaS-Bench | 2026 | paper, code | Long-horizon benchmark over real deployable SaaS systems and professional workflows. |
| Work | Year | Links | Contribution / Relevance |
|---|---|---|---|
| PerAct | 2022 | paper, project | Language-conditioned RGB-D manipulation agent that predicts voxel actions directly. |
| VIMA | 2022 | paper, project | Multimodal-prompt robot manipulation benchmark and transformer agent. |
| RT-1 | 2022 | paper, project | Large-scale real-robot action model that anchors later RT/VLA work. |
| PaLM-E | 2023 | paper | Embodied multimodal language model connecting visual input to robot tasks. |
| RT-2 | 2023 | paper | Canonical VLA model transferring web-scale vision-language knowledge to robot control. |
| Open X-Embodiment / RT-X | 2023 | paper | Large robot-learning dataset and RT-X model family. |
| Octo | 2024 | paper | Open-source generalist robot policy. |
| OpenVLA | 2024 | paper, code | Open-source VLA model and a common baseline for robot manipulation. |
| Pi-Zero | 2024 | paper | Flow-based VLA model for general robot control. |
| Magma | 2025 | paper, code | Bridges multimodal agents across digital and physical actions. |
| SafeVLA | 2025 | paper | Safety alignment for VLA models via constrained learning. |
| Interleave-VLA | 2025 | paper | Robot manipulation with interleaved image-text instructions. |
| ChatVLA-2 | 2025 | paper | Open-world embodied reasoning from pretrained knowledge. |
| VLA^2 | 2025 | paper | Agentic framework for unseen-concept manipulation. |
| World-Value-Action | 2026 | paper | Uses implicit planning and future-state value estimation for VLA systems. |
| VLAs-as-Tools | 2026 | paper | Splits long-horizon embodied tasks between a high-level VLM planner and specialized VLA tools. |
| SAGE | 2026 | paper, code | Agentically generates simulator-ready 3D scenes for embodied policy training. |
| Work | Year | Links | Contribution / Relevance |
|---|---|---|---|
| VISPROG | 2022 | paper, project | Foundational visual-programming approach for tool-composed visual reasoning and editing. |
| Visual ChatGPT | 2023 | paper, code | Early system connecting ChatGPT with visual foundation models for multi-step visual tasks. |
| ViperGPT | 2023 | paper, code | Uses Python execution to compose vision modules for interpretable visual reasoning. |
| LLaVA-Plus | 2023 | paper | Trains multimodal agents to select and use visual tools across understanding and generation. |
| GenArtist | 2024 | paper, code | MLLM-as-agent for image generation and editing through planning and tool use. |
| CIGEval | 2025 | paper | Agentic evaluation framework for conditional image generation. |
| DeepEyes | 2025 | paper | Reinforcement learning for active visual reasoning, grounding, and "thinking with images." |
| ImAgent | 2025 | paper | Test-time scalable multimodal agent framework for image generation. |
| GenAgent | 2026 | paper | Scales text-to-image generation through agentic multimodal reasoning. |
| Mind-Brush | 2026 | paper | Adds cognitive search and reasoning loops to image generation. |
| Agent Banana | 2026 | paper, code | High-fidelity image editing with planner-executor tooling. |
| M3 | 2026 | paper | Multi-modal, multi-agent, multi-round reasoning for high-fidelity text-to-image generation. |
| VisionCreator | 2026 | paper | Native visual-generation agentic model with understanding, planning, and creation. |
| Gen-Searcher | 2026 | paper, project | Reinforces agentic search for image generation. |
| GEMS | 2026 | paper, project | Multimodal generation with memory, skills, and iterative agent loops. |
| Visual Generation in the New Era | 2026 | paper | Helpful taxonomy for agentic world modeling and generation. |
| Work | Year | Links | Contribution / Relevance |
|---|---|---|---|
| AGENTSAFE | 2025 | paper | Safety benchmark for embodied agents under hazardous instructions. |
| IS-Bench | 2025 | paper | Interactive safety benchmark for VLM-driven household agents. |
| VPI-Bench | 2025 | paper, code | Visual prompt-injection benchmark for computer-use agents. |
| OpenAgentSafety | 2025 | paper | Framework for evaluating real-world agent safety across risk categories. |
| OS-Sentinel | 2025 | paper | Hybrid validation for safer mobile GUI agents. |
| UI-CUBE | 2025 | paper | Enterprise CUA benchmark that measures operational reliability beyond task accuracy. |
| GUIGuard-Bench | 2026 | paper | Privacy-preserving GUI-agent evaluation. |
| CUAAudit | 2026 | paper | Tests whether VLMs can audit autonomous computer-use agents. |
| OS-BLIND | 2026 | paper | Shows how benign-looking user instructions expose CUA vulnerabilities. |
| HazardArena | 2026 | paper | Semantic safety evaluation for VLA systems. |
| RedVLA | 2026 | paper | Physical red-teaming benchmark for VLA models. |
| GUI-Perturbed | 2026 | paper | Domain-randomization study exposing GUI-grounding brittleness. |
| OS-SPEAR | 2026 | paper | Toolkit for safety, performance, efficiency, and robustness analysis of OS agents. |
| Don't Click That | 2026 | paper | Benchmarks and mitigates deceptive UI elements for VLM-based web agents. |
| ProjGuard | 2026 | paper | Safety monitoring for computer-use agents via low-dimensional projections. |
| Area | Resource | Link | Primary Use |
|---|---|---|---|
| Web | MiniWoB++ | code | Compact browser-interaction environments for controlled RL-style experiments. |
| Web | Mind2Web | paper | Offline web-agent action prediction and grounding. |
| Web | WebArena | paper, code | Realistic web navigation with execution-based grading. |
| Web | WebArena-Verified | code | Audited WebArena task set with deterministic offline evaluation. |
| Web | VisualWebArena | paper, code | Visually grounded web tasks where screenshots matter. |
| Web | WebLINX | paper, project | Conversational web navigation from expert demonstrations. |
| Web | WorkArena | paper, code | Enterprise workflow automation in ServiceNow-style environments. |
| Web | MMInA | paper, code | Multihop multimodal tasks over evolving real websites. |
| Web | WebCanvas | paper | Online web-agent evaluation with Mind2Web-Live. |
| Web | RiskWebWorld | paper | Realistic e-commerce risk-management tasks for GUI agents. |
| Web | SaaS-Bench | paper, code | Long-horizon professional workflows across deployable SaaS systems. |
| Desktop | OSWorld | paper, code | Open-ended desktop tasks in real operating systems. |
| Desktop | Windows Agent Arena | paper, code | Windows-specific scaling and reproducible OS-agent evaluation. |
| Desktop | OmniACT | paper | Evaluating executable automation rather than only low-level clicks. |
| Mobile | Android in the Wild | paper | Large-scale Android device-control demonstrations with screen observations. |
| Mobile | B-MoCA | paper | Mobile control across diverse device configurations. |
| Mobile | AndroidWorld | paper, code | Dynamic Android tasks with broad app coverage. |
| Mobile | AndroidControl | paper | Diverse Android control dataset for studying scale and generalization. |
| Mobile | MobileAgentBench | paper | Efficient mobile-agent evaluation across open-source apps. |
| Mobile | SPA-Bench | paper | Smartphone-agent testing with comprehensive task coverage. |
| Mobile | AndroidLab | paper | Training and systematic benchmarking on Android virtual devices. |
| Mobile | A3 | paper, project | Real-app online evaluation for mobile GUI agents. |
| Mobile | SecAgent | paper | Chinese mobile GUI dataset, benchmark, and compact semantic-context agent. |
| Grounding | ScreenSpot-Pro | paper | High-resolution professional-screen grounding. |
| Visual-agent reasoning | MageBench | paper, code | Lightweight environments for vision-in-the-chain agent reasoning. |
| Memory | MemGUI-Bench | paper | Cross-session and cross-temporal mobile GUI memory. |
| Dynamic GUI | DynamicGUIBench | paper | Robustness under evolving interfaces and dynamic UI changes. |
| Enterprise reliability | UI-CUBE | paper | Deployment-readiness diagnostics beyond simple task success. |
| Security | VPI-Bench | paper, code | Visual prompt injection for GUI and computer-use agents. |
| Safety | AGENTSAFE | paper | Hazardous-instruction safety for embodied agents. |
| Safety | HazardArena | paper | Semantic safety evaluation for VLA systems. |
| Embodied | LIBERO | code | Lifelong robot manipulation tasks. |
| Embodied | RLBench | code | Simulation-based manipulation benchmark. |
These resources are intentionally separated from research papers: they are implementation and evaluation artifacts, and not every one is a standalone research contribution.
| Resource | Type | Link | Primary Use |
|---|---|---|---|
| OpenAI Skills guide | docs | Docs | Understanding skill-style packaging for reusable agent capabilities. |
| awesome-agent-skills | collection | GitHub | Finding reusable agent skills across browsing, coding, documents, and visual tasks. |
| awesome-gpt-image-2 | collection | GitHub | Tracking prompt patterns and workflows around modern image generation. |
| gpt_image_2_skill | skill package | GitHub | Example of packaging image-generation workflows as reusable skills. |
| ToDiagram skills | skill collection | GitHub | Diagram and visual-communication skills that pair well with visual agents. |
| Resource | Type | Link | Primary Use |
|---|---|---|---|
| OmniParser | parser | GitHub | Converting screenshots into candidate interactable regions. |
| ShowUI | GUI model | GitHub | Screenshot-conditioned GUI action modeling and demonstration pipelines. |
| UGround | grounding model | GitHub | Pure-vision GUI grounding without accessibility trees. |
| OS-ATLAS | action model | Paper | Cross-platform GUI action grounding. |
| GUI-G1 | grounding model | GitHub | Studying RL recipes and evaluation pitfalls for GUI grounding. |
| UI-Zoomer | grounding tool | GitHub | Adaptive zoom-in when the target UI element is hard to localize. |
| Phi-Ground | grounding model | Paper | Compact GUI grounding baseline for resource-constrained settings. |
| Resource | Type | Link | Primary Use |
|---|---|---|---|
| UI-TARS Desktop | desktop agent | GitHub | Running multimodal desktop agents locally. |
| Agent S | runtime | GitHub | General computer-use experiments with a practical open framework. |
| Cua | operator stack | GitHub | Infrastructure for running and evaluating computer-use agents. |
| OpenAdapt | generative RPA stack | GitHub | Recording GUI demonstrations, training models, and evaluating agents from a unified CLI. |
| browser-use | browser runtime | GitHub | Browser automation workflows when DOM/tool access is acceptable. |
| Stagehand | browser runtime | GitHub | Hybrid code-plus-natural-language browser automation for production workflows. |
| Playwright MCP | browser MCP server | GitHub | Gives agents browser automation tools through the Model Context Protocol. |
| BrowserGym | browser harness | GitHub | Reproducible browser-agent experiments and benchmark orchestration. |
| AgentLab | experiment framework | GitHub | Running, comparing, and analyzing web-agent experiments. |
| OpenAdapt Desktop | desktop capture/runtime | GitHub | Capturing human demonstrations and replaying desktop workflows. |
| ScreenPipe | local data capture | GitHub | Recording local screen/audio context for personal or research agents. |
| Resource | Type | Link | Primary Use |
|---|---|---|---|
| OSWorld | desktop environment | GitHub | Standard desktop benchmark and environment. |
| AndroidWorld | mobile environment | GitHub | Dynamic Android environment for mobile agents. |
| AndroidControl | mobile dataset | Paper | Large Android control demonstrations for training and data-scaling studies. |
| Windows Agent Arena | desktop environment | GitHub | Windows-specific OS-agent evaluation. |
| WebArena | web benchmark | GitHub | Realistic web tasks with execution-based grading. |
| WebArena-Verified | web benchmark | GitHub | Audited and deterministic WebArena evaluation. |
| VisualWebArena | visual web benchmark | GitHub | Web tasks where screenshots and visual grounding matter. |
| WorkArena | enterprise benchmark | GitHub | Enterprise-style workflow automation. |
| OpenCUA | open CUA stack | GitHub | Data, models, and evaluation foundations for computer-use agents. |
| ScaleCUA | scaling stack | GitHub | Cross-platform CUA data scaling and evaluation. |
| CUA-Suite | data suite | Paper | Large human-annotated video demonstrations for CUA research. |
| ShowUI-Aloha | data pipeline | Paper, code | Turning screen recordings into GUI-agent training trajectories. |
| Video2GUI | data pipeline | Paper | Synthesizing GUI trajectories from instructional videos. |
| lmms-eval | eval toolkit | GitHub | Static multimodal evaluation that can complement closed-loop agent tests. |
| Resource | Type | Link | Primary Use |
|---|---|---|---|
| OpenVLA | VLA model | GitHub | Common open baseline for VLA robot manipulation. |
| LeRobot | robotics toolkit | GitHub | Robot-learning datasets, policies, training, and deployment tooling. |
| LIBERO | robotics benchmark | GitHub | Lifelong robot manipulation tasks. |
| RLBench | robotics benchmark | GitHub | Simulation-based manipulation evaluation. |
| SAGE | 3D scene engine | GitHub | Agentic 3D scene generation for embodied policy training. |
| Magma | foundation model | GitHub | Bridging digital computer use and physical action. |
| Workflow | Practical stack |
|---|---|
| GUI grounding research | ScreenSpot-Pro + OmniParser + UGround + GUI-G1 + UI-Zoomer |
| Browser-agent experiments | BrowserGym + AgentLab + WebArena + WebArena-Verified + VisualWebArena + WebLINX + MMInA + SaaS-Bench |
| Desktop computer-use agents | UI-TARS Desktop + Agent S + Cua + OSWorld + Windows Agent Arena |
| Mobile GUI agents | Android in the Wild + AndroidControl + AndroidWorld + A3 + MobileAgentBench + SPA-Bench + MemGUI-Bench |
| Demonstration and data pipelines | OpenAdapt + OpenAdapt Desktop + ScreenPipe + ShowUI-Aloha + CUA-Suite + Video2GUI |
| Agentic visual creation | gpt_image_2_skill + GenArtist + DeepEyes + Agent Banana + VisionCreator + GEMS |
| Embodied VLA research | OpenVLA + LeRobot + LIBERO + RLBench + SAGE + Magma + VLAs-as-Tools |
| Reliability and security testing | VPI-Bench + OpenAgentSafety + UI-CUBE + OS-BLIND + HazardArena + OS-SPEAR |
| Resource | Link | Why read it |
|---|---|---|
| OpenAI Computer Use guide | Docs | Developer-facing guide for building with computer-use tooling. |
| OpenAI Computer-Using Agent | Article | Product and research framing for modern CUAs. |
| OpenAI Skills guide | Docs | Practical reference for reusable agent skills. |
| OpenAI MCP and Connectors guide | Docs | Reference for connecting external tools and services to agents. |
| Anthropic: Developing a computer use model | Article | Strong public engineering writeup on GUI-agent training and evaluation. |
| Anthropic: Introducing computer use | Article | System framing and deployment context for computer-use models. |
| Google DeepMind: Gemini Robotics | Article | Industry view on embodied visual agents. |
| Google DeepMind: Gemini Robotics On-Device | Article | Notes on low-latency, local VLA deployment. |
| Repository | Link | Notes |
|---|---|---|
| Awesome-GUI-Agents | GitHub | Focused companion index for GUI grounding and automation papers. |
| GUI-Agents-Paper-List | GitHub | Systematic paper index focused on GUI agents. |
| awesome-ui-agents | GitHub | Neighboring index for UI-agent papers and projects. |
| Evolving Visual Generation | GitHub | Adjacent map for visual-generation systems. |
| Awesome Multimodal Modeling | GitHub | Broader multimodal modeling list beyond the stricter agent boundary here. |
Pull requests are welcome when they improve precision rather than volume.
Recommended metadata:
- The paper title or project name.
- Official paper, code, project page, or documentation link.
- The best category for the item.
- One sentence explaining the visual-agent loop, benchmark role, or builder value.
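As a rough illustration of how the recommended metadata maps onto a table row, the sketch below renders one entry in the `Work | Year | Links | Contribution / Relevance` layout. The function name, the Markdown link syntax in the Links cell, and the `example.org` URL are assumptions for illustration; the track names are copied from the taxonomy table.

```python
# Track names taken from the taxonomy table in this README.
TRACKS = {
    "Screen grounding",
    "Computer use",
    "Embodied VLA",
    "Agentic reasoning and creation",
    "Reliability and safety",
    "Infrastructure",
}


def format_entry(title: str, year: int, links: dict[str, str],
                 track: str, summary: str) -> str:
    """Validate the recommended metadata and render one Markdown table row.

    Hypothetical helper: `track` is checked but not rendered, because the
    row is meant to be pasted into the table for that track.
    """
    if not title.strip():
        raise ValueError("a paper title or project name is required")
    if not links:
        raise ValueError("at least one official link is required")
    if track not in TRACKS:
        raise ValueError(f"unknown track: {track!r}")
    if not summary.strip():
        raise ValueError("a one-sentence summary is required")

    link_cell = ", ".join(f"[{label}]({url})" for label, url in links.items())
    # Escape pipes so the summary cannot break the table layout.
    clean_summary = summary.strip().replace("|", r"\|")
    return f"| {title.strip()} | {year} | {link_cell} | {clean_summary} |"


row = format_entry(
    "Example Agent",                          # placeholder title
    2026,
    {"paper": "https://example.org/paper"},   # placeholder URL
    "Computer use",
    "Closed-loop desktop benchmark.",
)
print(row)
```

A check like this only catches missing fields; whether the summary actually describes a visual-agent loop still needs a human reviewer.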
Out of scope:
- Generic multimodal model releases with no visual-agent evaluation.
- One-shot generation papers without planning, tools, search, critique, or interaction.
- Duplicate benchmark rows unless the new row adds a distinct environment or protocol.
- Unverified arXiv IDs, placeholder links, and marketing-only announcements.
This repository is maintained as a precision-oriented research map:
- Prefer primary sources: official papers, project pages, code repositories, datasets, benchmarks, and technical documentation.
- Keep research entries, benchmarks, and engineering resources separated when their roles differ.
- Add recent work only when it improves the conceptual coverage, empirical coverage, or builder utility of the map.
- Verify arXiv identifiers, project links, and benchmark names before adding new entries.
- Prune duplicate, weakly scoped, or marketing-only entries even when they are recent.
- Preserve a strict visual-agent boundary: perception alone is not sufficient without grounding, planning, tool use, interaction, control, or agent-oriented evaluation.
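Two of the checks above, identifier verification and duplicate pruning, are easy to partially automate. The sketch below is a minimal example under stated assumptions: the function names are hypothetical, the pattern covers only new-style arXiv identifiers (`YYMM.NNNN` or `YYMM.NNNNN` with an optional version suffix), and a format match says nothing about whether the ID resolves to the intended paper.

```python
import re
from collections import Counter

# New-style arXiv identifier: two-digit year, month 01-12, 4 or 5 digit
# sequence number, optional version suffix such as "v2".
ARXIV_ID = re.compile(r"(\d{2})(0[1-9]|1[0-2])\.(\d{4,5})(v\d+)?")


def is_plausible_arxiv_id(identifier: str) -> bool:
    """Format check only; it does not confirm the paper exists."""
    return ARXIV_ID.fullmatch(identifier.strip()) is not None


def duplicate_names(names: list[str]) -> list[str]:
    """Return case-insensitively repeated names, normalized and sorted."""
    counts = Counter(name.strip().casefold() for name in names)
    return sorted(name for name, count in counts.items() if count > 1)


print(is_plausible_arxiv_id("2501.12345"))   # well-formed placeholder ID
print(is_plausible_arxiv_id("2513.12345"))   # month 13 is rejected
print(duplicate_names(["OSWorld", "osworld ", "WebArena"]))
```

Duplicate detection is best run per table, since this list deliberately repeats some works across tables that serve different roles (for example, a benchmark listed both as a paper and as an environment).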
If you use this curated index in research or engineering work, please cite it as:
```bibtex
@misc{awesome-visual-agent,
  title        = {Awesome Visual Agent},
  author       = {OpenEnvision and contributors},
  year         = {2026},
  howpublished = {\url{https://github.com/OpenEnvision/Awesome-Visual-Agent}},
  note         = {Curated list of visual-agent papers, benchmarks, and tooling}
}
```