OpenEnvision/Awesome-Visual-Agent

Awesome Visual Agent

A curated research index for visual agents that perceive, ground, plan, act, create, and evaluate in visually grounded environments.

Visual agents occupy the intersection of multimodal perception, grounded reasoning, tool use, interaction, and control. This repository curates papers, benchmarks, datasets, runtimes, and engineering resources for systems that close the loop between visual observation and purposeful action.

The list is selective rather than exhaustive. It prioritizes works that introduce a clear agent loop, action space, evaluation protocol, data engine, safety finding, or reusable implementation artifact, while excluding generic multimodal models and one-shot visual-generation systems without an agentic mechanism.
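
The "agent loop" used as a selection criterion above can be pictured with a toy example. The sketch below is illustrative only and assumes a deliberately simplified setting: `ToyScreen`, `ground`, and `run_episode` are hypothetical names that do not come from any listed work, and the "screen" is a dictionary of labelled widgets rather than pixels.

```python
from dataclasses import dataclass, field


@dataclass
class ToyScreen:
    """Hypothetical stand-in for a GUI: widget label -> (x, y) centre."""
    widgets: dict
    clicked: list = field(default_factory=list)

    def observe(self):
        # Perception: return a copy of what is currently visible.
        return dict(self.widgets)

    def click(self, x, y):
        # Action: record which widget, if any, sits at the clicked point.
        for label, pos in self.widgets.items():
            if pos == (x, y):
                self.clicked.append(label)


def ground(observation, target_label):
    """Grounding: map a label named in the plan to screen coordinates."""
    return observation.get(target_label)


def run_episode(screen, plan, max_steps=10):
    """Closed loop: observe, ground the next step, act, then observe again."""
    for step, target in enumerate(plan):
        if step >= max_steps:
            break
        observation = screen.observe()
        point = ground(observation, target)
        if point is None:
            return False  # grounding failed; a real agent might replan here
        screen.click(*point)
    # Success means every planned widget was clicked, in order.
    return screen.clicked == list(plan)
```

A one-shot system would stop after a single `observe` call; the loop is what distinguishes an agentic mechanism in the sense used here.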

Selection Boundary

Included areas:

  • GUI, web, desktop, and mobile agents that perceive screens and produce executable actions.
  • Visual grounding work that is clearly tied to downstream agent control.
  • Embodied vision-language-action systems for robot manipulation, navigation, and physical-world interaction.
  • Agentic visual reasoning and generation systems with search, planning, memory, tools, critique, or iterative refinement.
  • Benchmarks, data engines, simulators, safety suites, and toolchains that support visual-agent construction and evaluation.

Excluded by default:

  • Broad multimodal foundation models with no visual-agent evaluation.
  • Generic OCR, captioning, visual question answering, or layout parsing without an action or agent setting.
  • Image, video, or 3D generators that are only prompt-in/artifact-out.
  • Unverified arXiv IDs, placeholder-looking entries, product rumors, and duplicate rows.

Research Taxonomy

| Track | Research question | Representative works |
| --- | --- | --- |
| Screen grounding | Can the model localize text, widgets, controls, and regions well enough to act? | Set-of-Mark, SeeClick, OmniParser, UGround, ScreenSpot-Pro, GUI-Eyes |
| Computer use | Can the agent complete tasks in real websites, desktops, or phones over multiple steps? | WebArena, WebLINX, AppAgent, Mobile-Agent, OSWorld, AndroidWorld, Agent S, UI-TARS |
| Embodied VLA | Can visual observations and language be converted into safe physical actions? | PerAct, VIMA, RT-1, RT-2, Open X-Embodiment, OpenVLA, Pi-Zero, Magma |
| Agentic reasoning and creation | Can the system plan, search, critique, edit, or generate visual artifacts through a loop? | VISPROG, ViperGPT, GenArtist, DeepEyes, Agent Banana, VisionCreator, GEMS |
| Reliability and safety | Can we measure brittleness, privacy risk, prompt injection, unsafe actions, and deployment readiness? | VPI-Bench, OpenAgentSafety, OS-BLIND, HazardArena, UI-CUBE |
| Infrastructure | Which tools and environments support reproducible training, deployment, and evaluation? | BrowserGym, AgentLab, Stagehand, Playwright MCP, Agent S, Cua, OpenCUA, ScaleCUA, LeRobot |

Curation Rubric

An item should usually satisfy at least one of these conditions:

  • It defines a new visual-agent capability, benchmark, data engine, training recipe, runtime, or safety evaluation.
  • It evaluates closed-loop behavior rather than only static recognition or one-shot generation.
  • It is widely used as a baseline, benchmark, dataset, environment, or builder tool.
  • It has a stable paper, official code, project page, or documentation that readers can inspect.

An item is removed or left out when the visual-agent connection is weak, the link is unverifiable, the arXiv ID is wrong, the row duplicates a better entry, or the contribution is mostly a product announcement without enough technical detail.
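
Read together, the inclusion conditions and the removal conditions behave like a simple decision rule. The sketch below is one possible encoding, assuming that any removal condition overrides the inclusion conditions; the `Candidate` fields and the `decide` function are hypothetical names, and the actual decision remains an editorial judgment (the rubric says "usually").

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    # Inclusion signals (at least one is usually expected).
    defines_new_artifact: bool = False     # capability, benchmark, data engine, recipe, runtime, safety eval
    evaluates_closed_loop: bool = False    # not only static recognition or one-shot generation
    widely_used: bool = False              # baseline, benchmark, dataset, environment, or builder tool
    has_stable_source: bool = False        # paper, official code, project page, or documentation
    # Removal signals (assumed here to override inclusion).
    weak_agent_connection: bool = False
    unverifiable_link: bool = False
    wrong_arxiv_id: bool = False
    duplicates_better_entry: bool = False
    announcement_only: bool = False


def decide(c):
    """Return 'include' or 'exclude' for a candidate under the assumed rule."""
    excluded = any([
        c.weak_agent_connection,
        c.unverifiable_link,
        c.wrong_arxiv_id,
        c.duplicates_better_entry,
        c.announcement_only,
    ])
    if excluded:
        return "exclude"
    included = any([
        c.defines_new_artifact,
        c.evaluates_closed_loop,
        c.widely_used,
        c.has_stable_source,
    ])
    return "include" if included else "exclude"
```

Under this reading, a widely used benchmark with a wrong arXiv ID is still left out until the identifier is fixed.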

Reading Pathways

GUI and computer use. Start with SeeClick, OmniParser, OSWorld, UI-TARS, Agent S2, OpenCUA, and UI-Copilot.

Mobile GUI agents. Read Android in the Wild, MM-Navigator, AppAgent, Mobile-Agent, Mobile-Agent-v2, A3, and MemGUI-Bench.

Grounding and perception. Read ScreenAI, Ferret-UI, UGround, ScreenSpot-Pro, GUI-Actor, Phi-Ground, GUI-Eyes, and UI-Zoomer.

Embodied VLA. Start with PerAct, VIMA, RT-1, PaLM-E, RT-2, Open X-Embodiment, OpenVLA, Pi-Zero, Magma, and World-Value-Action.

Agentic visual reasoning and creation. Read VISPROG, Visual ChatGPT, ViperGPT, LLaVA-Plus, GenArtist, DeepEyes, Agent Banana, VisionCreator, and GEMS.

Recent Additions

| Work | Date | Contribution / Relevance |
| --- | --- | --- |
| GUI-Eyes | 2026-01 | Active visual perception for GUI grounding with learned crop/zoom tool use. |
| ShowUI-Aloha | 2026-01 | Converts human screen recordings into structured GUI-agent supervision. |
| OS-Symphony | 2026-01 | Holistic framework for robust computer-using agents. |
| ActionEngine | 2026-02 | Uses state-machine memory to make GUI agents more programmatic and recoverable. |
| Agent Banana | 2026-02 | Agentic image editing with planning and tool execution rather than one-shot editing. |
| SAGE | 2026-02 | Agentic 3D scene generation for embodied-AI policy training. |
| CUA-Suite | 2026-03 | Large human-annotated video demonstrations for computer-use agents. |
| GEMS | 2026-03 | Multimodal generation loop with memory, skills, and iterative agent refinement. |
| UI-Copilot | 2026-04 | Long-horizon GUI automation with tool-integrated policy optimization. |
| UI-Zoomer | 2026-04 | Uncertainty-driven zoom-in for hard GUI grounding cases. |
| DynamicGUIBench | 2026-04 | Evaluates GUI agents in high-dynamic interfaces rather than static screenshots. |
| Video2GUI | 2026-05 | Synthesizes GUI interaction trajectories from instructional videos. |
| UI-Verse | 2026-05 | Studies interface design heuristics that improve computer-use-agent reliability. |
| Securing Computer-Use Agents | 2026-05 | Connects CUA architecture, lifecycle, permission scope, and runtime reliability. |
| Don't Click That | 2026-05 | Deception-aware web-agent benchmark and defense for misleading interface elements. |
| VLAs-as-Tools | 2026-05-13 | Long-horizon embodied-agent strategy that delegates bounded physical subtasks to specialized VLA tools. |
| SaaS-Bench | 2026-05-15 | Real-world SaaS workflow benchmark for long-horizon computer-use agents. |

Research Map

Surveys and Landscape

| Work | Year | Links | Contribution / Relevance |
| --- | --- | --- | --- |
| A Comprehensive Survey of Agents for Computer Use | 2025 | paper | Broad map of computer-use-agent domains, agent loops, and evaluation bottlenecks. |
| GUI Agents: A Survey | 2024 | paper | Practical survey of GUI-agent architectures, datasets, benchmarks, and failure modes. |
| A Survey on (M)LLM-Based GUI Agents | 2025 | paper | Focused entry point for planning, grounding, memory, and GUI-agent evaluation. |
| Towards Trustworthy GUI Agents | 2025 | paper | Reliability and safety framing for deployment-facing GUI agents. |
| Large Multimodal Agents: A Survey | 2024 | paper | Contextual background on LLM-driven multimodal agent components. |
| A Survey on Vision-Language-Action Models for Embodied AI | 2024 | paper | Early VLA survey covering embodied perception, planning, and action. |
| Vision-Language-Action in Robotics | 2026 | paper | Data-centric survey of VLA datasets, benchmarks, and data engines. |
| Vision-Language-Action Safety | 2026 | paper | Focused taxonomy of threats, evaluations, and defenses for VLA systems. |
| Safety in Embodied AI | 2026 | paper | Wider safety survey across perception, planning, action, and interaction. |
| Visual Generation in the New Era | 2026 | paper | Conceptual lens for when visual generation becomes agentic world modeling. |
| Securing Computer-Use Agents | 2026 | paper | Deployment-grounded view of CUA reliability across architecture, lifecycle, permissions, and oversight. |

GUI Grounding and Screen Perception

| Work | Year | Links | Contribution / Relevance |
| --- | --- | --- | --- |
| CogAgent | 2023 | paper, code | Early high-resolution VLM built explicitly for GUI understanding and navigation. |
| Set-of-Mark Prompting | 2023 | paper, code | Simple visual marking strategy that became a practical grounding primitive for LMM agents. |
| SeeClick | 2024 | paper, code | Shows that GUI grounding is a core bottleneck for visual GUI agents. |
| ScreenAI | 2024 | paper | Strong foundation for screen, document, infographic, and layout-heavy visual understanding. |
| Ferret-UI | 2024 | paper, code | Region-aware mobile UI understanding with explicit grounding. |
| OmniParser | 2024 | paper, code | Practical screenshot-to-interactable-region parser for pure-vision GUI agents. |
| UGround | 2024 | paper, code | Strong pure-vision grounding baseline without relying on accessibility trees. |
| OS-ATLAS | 2024 | paper | Foundation action model for generalist GUI agents. |
| ShowUI | 2024 | paper, code | Unifies screenshot-conditioned GUI perception and action modeling. |
| Aguvis | 2024 | paper | Pure-vision GUI agent direction with autonomous interface interaction. |
| UI-E2I-Synth | 2025 | paper | Synthetic instruction pipeline for scaling GUI grounding supervision. |
| ScreenSpot-Pro | 2025 | paper | Hard high-resolution grounding benchmark for professional computer-use screens. |
| GUI-G1 | 2025 | paper, code | Careful analysis of RL pitfalls in GUI grounding. |
| Enhancing Visual Grounding via Self-Evolutionary RL | 2025 | paper | Data-efficient RL recipe for high-resolution GUI grounding. |
| GUI-Actor | 2025 | paper | Coordinate-free grounding with an action head and verifier. |
| Phi-Ground | 2025 | paper | Strong empirical report on training compact GUI grounding models. |
| Test-Time RL for GUI Grounding | 2025 | paper | Test-time adaptation using region consistency. |
| Explicit Position-to-Coordinate Mapping | 2025 | paper | Addresses coordinate generation as a concrete grounding bottleneck. |
| GUI-Eyes | 2026 | paper | Learns when and how to call visual tools such as crop and zoom. |
| UI-Zoomer | 2026 | paper, code | Uses uncertainty to decide where to zoom for GUI grounding. |

Computer-Use Agents and Environments

| Work | Year | Links | Contribution / Relevance |
| --- | --- | --- | --- |
| Mind2Web | 2023 | paper | Foundational benchmark for generalist web agents. |
| Android in the Wild | 2023 | paper | Large-scale Android device-control dataset with realistic gestures. |
| WebArena | 2023 | paper, code | Realistic web-agent environment with execution-based tasks. |
| AutoDroid | 2023 | paper | Early Android task-automation system and benchmark that remains relevant as a mobile-agent baseline. |
| MM-Navigator | 2023 | paper | Early GPT-4V smartphone GUI navigation agent with zero-shot screen interaction. |
| AppAgent | 2023 | paper | Smartphone agent that learns app operation from autonomous exploration or demonstrations. |
| SeeAct | 2024 | paper | Web agent showing why grounding matters for GPT-4V-style agents. |
| Mobile-Agent | 2024 | paper, code | Vision-centric mobile device agent using visual perception tools and stepwise planning. |
| VisualWebArena | 2024 | paper, code | Adds visually grounded tasks to realistic web-agent evaluation. |
| WebVoyager | 2024 | paper | End-to-end multimodal web agent evaluated on live websites. |
| WebLINX | 2024 | paper, project | Large benchmark of multi-turn conversational web navigation with screenshots and action history. |
| OmniACT | 2024 | paper | Desktop and web benchmark where agents generate executable automation scripts. |
| WorkArena | 2024 | paper, code | Enterprise workflow benchmark for knowledge-work agents. |
| MMInA | 2024 | paper, code | Multihop multimodal Internet-agent benchmark on evolving real websites. |
| B-MoCA | 2024 | paper | Mobile device-control benchmark across diverse configurations. |
| OSWorld | 2024 | paper, code | Flagship benchmark for open-ended tasks in real desktop environments. |
| AndroidWorld | 2024 | paper, code | Dynamic Android benchmark with broad task diversity. |
| Mobile-Agent-v2 | 2024 | paper, code | Multi-agent mobile operation assistant with planning, decision, and reflection roles. |
| MobileAgentBench | 2024 | paper | Practical benchmark for mobile LLM agents. |
| WebCanvas | 2024 | paper | Online web-agent benchmark and framework built around Mind2Web-Live. |
| Agent S | 2024 | paper, code | Open agentic framework for using computers through GUI actions. |
| Windows Agent Arena | 2024 | paper, code | Scalable evaluation environment for Windows OS agents. |
| SPA-Bench | 2024 | paper | Comprehensive smartphone-agent evaluation benchmark. |
| AndroidLab | 2024 | paper | Android training and benchmarking environment with virtual devices and task suites. |
| VideoWebArena | 2024 | paper | Long-context video understanding inside web-agent workflows. |
| MageBench | 2024 | paper, code | Lightweight visual-agent benchmark covering WebUI, Sokoban, and Football environments. |
| UI-TARS | 2025 | paper | Native GUI-agent model trained for perception, grounding, and action. |
| A3 | 2025 | paper, project | Android Agent Arena for online mobile GUI-agent evaluation across real apps. |
| Agent S2 | 2025 | paper, code | Generalist-specialist framework for computer-use agents. |
| UI-Evol | 2025 | paper | Plug-in knowledge-evolution module that improves OSWorld execution reliability for CUAs. |
| ZeroGUI | 2025 | paper | Online GUI-agent learning with task generation and reward estimation. |
| OpenCUA | 2025 | paper, code | Open foundation stack for computer-use agents. |
| ScaleCUA | 2025 | paper, code | Cross-platform data scaling for open-source computer-use agents. |
| OmegaUse | 2026 | paper | General-purpose GUI agent for autonomous task execution. |
| MemGUI-Bench | 2026 | paper | Evaluates memory across mobile GUI sessions and changing environments. |
| SecAgent | 2026 | paper | Efficient 3B mobile GUI agent with semantic context compression and Chinese mobile data. |
| OS-Symphony | 2026 | paper, code | Framework for robust and generalist computer-use agents. |
| ActionEngine | 2026 | paper | State-machine memory for more structured GUI automation. |
| ContractSkill | 2026 | paper | Treats web-agent skills as repairable contracts that can be verified and reused. |
| UI-Copilot | 2026 | paper, code | Long-horizon GUI automation with tool-integrated policy optimization. |
| ClawGUI | 2026 | paper | Unified framework for training, evaluating, and deploying GUI agents. |
| RiskWebWorld | 2026 | paper | Realistic interactive benchmark for e-commerce risk-management GUI agents. |
| DynamicGUIBench | 2026 | paper | Stress-tests agents in dynamic, evolving GUI environments. |
| UI-Verse | 2026 | paper | Interface-design perspective on making CUAs more reliable. |
| SaaS-Bench | 2026 | paper, code | Long-horizon benchmark over real deployable SaaS systems and professional workflows. |

Embodied Vision-Language-Action Agents

| Work | Year | Links | Contribution / Relevance |
| --- | --- | --- | --- |
| PerAct | 2022 | paper, project | Language-conditioned RGB-D manipulation agent that predicts voxel actions directly. |
| VIMA | 2022 | paper, project | Multimodal-prompt robot manipulation benchmark and transformer agent. |
| RT-1 | 2022 | paper, project | Large-scale real-robot action model that anchors later RT/VLA work. |
| PaLM-E | 2023 | paper | Embodied multimodal language model connecting visual input to robot tasks. |
| RT-2 | 2023 | paper | Canonical VLA model transferring web-scale vision-language knowledge to robot control. |
| Open X-Embodiment / RT-X | 2023 | paper | Large robot-learning dataset and RT-X model family. |
| Octo | 2024 | paper | Open-source generalist robot policy. |
| OpenVLA | 2024 | paper, code | Open-source VLA model and a common baseline for robot manipulation. |
| Pi-Zero | 2024 | paper | Flow-based VLA model for general robot control. |
| Magma | 2025 | paper, code | Bridges multimodal agents across digital and physical actions. |
| SafeVLA | 2025 | paper | Safety alignment for VLA models via constrained learning. |
| Interleave-VLA | 2025 | paper | Robot manipulation with interleaved image-text instructions. |
| ChatVLA-2 | 2025 | paper | Open-world embodied reasoning from pretrained knowledge. |
| VLA^2 | 2025 | paper | Agentic framework for unseen-concept manipulation. |
| World-Value-Action | 2026 | paper | Uses implicit planning and future-state value estimation for VLA systems. |
| VLAs-as-Tools | 2026 | paper | Splits long-horizon embodied tasks between a high-level VLM planner and specialized VLA tools. |
| SAGE | 2026 | paper, code | Agentically generates simulator-ready 3D scenes for embodied policy training. |

Agentic Visual Reasoning, Generation, and World Building

| Work | Year | Links | Contribution / Relevance |
| --- | --- | --- | --- |
| VISPROG | 2022 | paper, project | Foundational visual-programming approach for tool-composed visual reasoning and editing. |
| Visual ChatGPT | 2023 | paper, code | Early system connecting ChatGPT with visual foundation models for multi-step visual tasks. |
| ViperGPT | 2023 | paper, code | Uses Python execution to compose vision modules for interpretable visual reasoning. |
| LLaVA-Plus | 2023 | paper | Trains multimodal agents to select and use visual tools across understanding and generation. |
| GenArtist | 2024 | paper, code | MLLM-as-agent for image generation and editing through planning and tool use. |
| CIGEval | 2025 | paper | Agentic evaluation framework for conditional image generation. |
| DeepEyes | 2025 | paper | Reinforcement learning for active visual reasoning, grounding, and "thinking with images." |
| ImAgent | 2025 | paper | Test-time scalable multimodal agent framework for image generation. |
| GenAgent | 2026 | paper | Scales text-to-image generation through agentic multimodal reasoning. |
| Mind-Brush | 2026 | paper | Adds cognitive search and reasoning loops to image generation. |
| Agent Banana | 2026 | paper, code | High-fidelity image editing with planner-executor tooling. |
| M3 | 2026 | paper | Multi-modal, multi-agent, multi-round reasoning for high-fidelity text-to-image generation. |
| VisionCreator | 2026 | paper | Native visual-generation agentic model with understanding, planning, and creation. |
| Gen-Searcher | 2026 | paper, project | Reinforces agentic search for image generation. |
| GEMS | 2026 | paper, project | Multimodal generation with memory, skills, and iterative agent loops. |
| Visual Generation in the New Era | 2026 | paper | Helpful taxonomy for agentic world modeling and generation. |

Safety, Robustness, and Evaluation

| Work | Year | Links | Contribution / Relevance |
| --- | --- | --- | --- |
| AGENTSAFE | 2025 | paper | Safety benchmark for embodied agents under hazardous instructions. |
| IS-Bench | 2025 | paper | Interactive safety benchmark for VLM-driven household agents. |
| VPI-Bench | 2025 | paper, code | Visual prompt-injection benchmark for computer-use agents. |
| OpenAgentSafety | 2025 | paper | Framework for evaluating real-world agent safety across risk categories. |
| OS-Sentinel | 2025 | paper | Hybrid validation for safer mobile GUI agents. |
| UI-CUBE | 2025 | paper | Enterprise CUA benchmark that measures operational reliability beyond task accuracy. |
| GUIGuard-Bench | 2026 | paper | Privacy-preserving GUI-agent evaluation. |
| CUAAudit | 2026 | paper | Tests whether VLMs can audit autonomous computer-use agents. |
| OS-BLIND | 2026 | paper | Shows how benign-looking user instructions expose CUA vulnerabilities. |
| HazardArena | 2026 | paper | Semantic safety evaluation for VLA systems. |
| RedVLA | 2026 | paper | Physical red-teaming benchmark for VLA models. |
| GUI-Perturbed | 2026 | paper | Domain-randomization study exposing GUI-grounding brittleness. |
| OS-SPEAR | 2026 | paper | Toolkit for safety, performance, efficiency, and robustness analysis of OS agents. |
| Don't Click That | 2026 | paper | Benchmarks and mitigates deceptive UI elements for VLM-based web agents. |
| ProjGuard | 2026 | paper | Safety monitoring for computer-use agents via low-dimensional projections. |

Benchmarks and Environments

| Area | Resource | Link | Primary Use |
| --- | --- | --- | --- |
| Web | MiniWoB++ | code | Compact browser-interaction environments for controlled RL-style experiments. |
| Web | Mind2Web | paper | Offline web-agent action prediction and grounding. |
| Web | WebArena | paper, code | Realistic web navigation with execution-based grading. |
| Web | WebArena-Verified | code | Audited WebArena task set with deterministic offline evaluation. |
| Web | VisualWebArena | paper, code | Visually grounded web tasks where screenshots matter. |
| Web | WebLINX | paper, project | Conversational web navigation from expert demonstrations. |
| Web | WorkArena | paper, code | Enterprise workflow automation in ServiceNow-style environments. |
| Web | MMInA | paper, code | Multihop multimodal tasks over evolving real websites. |
| Web | WebCanvas | paper | Online web-agent evaluation with Mind2Web-Live. |
| Web | RiskWebWorld | paper | Realistic e-commerce risk-management tasks for GUI agents. |
| Web | SaaS-Bench | paper, code | Long-horizon professional workflows across deployable SaaS systems. |
| Desktop | OSWorld | paper, code | Open-ended desktop tasks in real operating systems. |
| Desktop | Windows Agent Arena | paper, code | Windows-specific scaling and reproducible OS-agent evaluation. |
| Desktop | OmniACT | paper | Evaluating executable automation rather than only low-level clicks. |
| Mobile | Android in the Wild | paper | Large-scale Android device-control demonstrations with screen observations. |
| Mobile | B-MoCA | paper | Mobile control across diverse device configurations. |
| Mobile | AndroidWorld | paper, code | Dynamic Android tasks with broad app coverage. |
| Mobile | AndroidControl | paper | Diverse Android control dataset for studying scale and generalization. |
| Mobile | MobileAgentBench | paper | Efficient mobile-agent evaluation across open-source apps. |
| Mobile | SPA-Bench | paper | Smartphone-agent testing with comprehensive task coverage. |
| Mobile | AndroidLab | paper | Training and systematic benchmarking on Android virtual devices. |
| Mobile | A3 | paper, project | Real-app online evaluation for mobile GUI agents. |
| Mobile | SecAgent | paper | Chinese mobile GUI dataset, benchmark, and compact semantic-context agent. |
| Grounding | ScreenSpot-Pro | paper | High-resolution professional-screen grounding. |
| Visual-agent reasoning | MageBench | paper, code | Lightweight environments for vision-in-the-chain agent reasoning. |
| Memory | MemGUI-Bench | paper | Cross-session and cross-temporal mobile GUI memory. |
| Dynamic GUI | DynamicGUIBench | paper | Robustness under evolving interfaces and dynamic UI changes. |
| Enterprise reliability | UI-CUBE | paper | Deployment-readiness diagnostics beyond simple task success. |
| Security | VPI-Bench | paper, code | Visual prompt injection for GUI and computer-use agents. |
| Safety | AGENTSAFE | paper | Hazardous-instruction safety for embodied agents. |
| Safety | HazardArena | paper | Semantic safety evaluation for VLA systems. |
| Embodied | LIBERO | code | Lifelong robot manipulation tasks. |
| Embodied | RLBench | code | Simulation-based manipulation benchmark. |

Skills, Tools, and Engineering Resources

These resources are intentionally separated from research papers. They are implementation and evaluation artifacts, and not all of them are standalone research contributions.

Skill and Prompt Libraries

| Resource | Type | Link | Primary Use |
| --- | --- | --- | --- |
| OpenAI Skills guide | docs | Docs | Understanding skill-style packaging for reusable agent capabilities. |
| awesome-agent-skills | collection | GitHub | Finding reusable agent skills across browsing, coding, documents, and visual tasks. |
| awesome-gpt-image-2 | collection | GitHub | Tracking prompt patterns and workflows around modern image generation. |
| gpt_image_2_skill | skill package | GitHub | Example of packaging image-generation workflows as reusable skills. |
| ToDiagram skills | skill collection | GitHub | Diagram and visual-communication skills that pair well with visual agents. |

Models, Parsers, and Grounding Tools

| Resource | Type | Link | Primary Use |
| --- | --- | --- | --- |
| OmniParser | parser | GitHub | Converting screenshots into candidate interactable regions. |
| ShowUI | GUI model | GitHub | Screenshot-conditioned GUI action modeling and demonstration pipelines. |
| UGround | grounding model | GitHub | Pure-vision GUI grounding without accessibility trees. |
| OS-ATLAS | action model | Paper | Cross-platform GUI action grounding. |
| GUI-G1 | grounding model | GitHub | Studying RL recipes and evaluation pitfalls for GUI grounding. |
| UI-Zoomer | grounding tool | GitHub | Adaptive zoom-in when the target UI element is hard to localize. |
| Phi-Ground | grounding model | Paper | Compact GUI grounding baseline for resource-constrained settings. |

Agent Runtimes and Operator Stacks

| Resource | Type | Link | Primary Use |
| --- | --- | --- | --- |
| UI-TARS Desktop | desktop agent | GitHub | Running multimodal desktop agents locally. |
| Agent S | runtime | GitHub | General computer-use experiments with a practical open framework. |
| Cua | operator stack | GitHub | Infrastructure for running and evaluating computer-use agents. |
| OpenAdapt | generative RPA stack | GitHub | Recording GUI demonstrations, training models, and evaluating agents from a unified CLI. |
| browser-use | browser runtime | GitHub | Browser automation workflows when DOM/tool access is acceptable. |
| Stagehand | browser runtime | GitHub | Hybrid code-plus-natural-language browser automation for production workflows. |
| Playwright MCP | browser MCP server | GitHub | Gives agents browser automation tools through the Model Context Protocol. |
| BrowserGym | browser harness | GitHub | Reproducible browser-agent experiments and benchmark orchestration. |
| AgentLab | experiment framework | GitHub | Running, comparing, and analyzing web-agent experiments. |
| OpenAdapt Desktop | desktop capture/runtime | GitHub | Capturing human demonstrations and replaying desktop workflows. |
| ScreenPipe | local data capture | GitHub | Recording local screen/audio context for personal or research agents. |

Data Capture, Training, and Evaluation Stacks

| Resource | Type | Link | Primary Use |
| --- | --- | --- | --- |
| OSWorld | desktop environment | GitHub | Standard desktop benchmark and environment. |
| AndroidWorld | mobile environment | GitHub | Dynamic Android environment for mobile agents. |
| AndroidControl | mobile dataset | Paper | Large Android control demonstrations for training and data-scaling studies. |
| Windows Agent Arena | desktop environment | GitHub | Windows-specific OS-agent evaluation. |
| WebArena | web benchmark | GitHub | Realistic web tasks with execution-based grading. |
| WebArena-Verified | web benchmark | GitHub | Audited and deterministic WebArena evaluation. |
| VisualWebArena | visual web benchmark | GitHub | Web tasks where screenshots and visual grounding matter. |
| WorkArena | enterprise benchmark | GitHub | Enterprise-style workflow automation. |
| OpenCUA | open CUA stack | GitHub | Data, models, and evaluation foundations for computer-use agents. |
| ScaleCUA | scaling stack | GitHub | Cross-platform CUA data scaling and evaluation. |
| CUA-Suite | data suite | Paper | Large human-annotated video demonstrations for CUA research. |
| ShowUI-Aloha | data pipeline | Paper, code | Turning screen recordings into GUI-agent training trajectories. |
| Video2GUI | data pipeline | Paper | Synthesizing GUI trajectories from instructional videos. |
| lmms-eval | eval toolkit | GitHub | Static multimodal evaluation that can complement closed-loop agent tests. |

Embodied and Robotics Tooling

| Resource | Type | Link | Primary Use |
| --- | --- | --- | --- |
| OpenVLA | VLA model | GitHub | Common open baseline for VLA robot manipulation. |
| LeRobot | robotics toolkit | GitHub | Robot-learning datasets, policies, training, and deployment tooling. |
| LIBERO | robotics benchmark | GitHub | Lifelong robot manipulation tasks. |
| RLBench | robotics benchmark | GitHub | Simulation-based manipulation evaluation. |
| SAGE | 3D scene engine | GitHub | Agentic 3D scene generation for embodied policy training. |
| Magma | foundation model | GitHub | Bridging digital computer use and physical action. |

Workflow Stacks

| Workflow | Practical stack |
| --- | --- |
| GUI grounding research | ScreenSpot-Pro + OmniParser + UGround + GUI-G1 + UI-Zoomer |
| Browser-agent experiments | BrowserGym + AgentLab + WebArena + WebArena-Verified + VisualWebArena + WebLINX + MMInA + SaaS-Bench |
| Desktop computer-use agents | UI-TARS Desktop + Agent S + Cua + OSWorld + Windows Agent Arena |
| Mobile GUI agents | Android in the Wild + AndroidControl + AndroidWorld + A3 + MobileAgentBench + SPA-Bench + MemGUI-Bench |
| Demonstration and data pipelines | OpenAdapt + OpenAdapt Desktop + ScreenPipe + ShowUI-Aloha + CUA-Suite + Video2GUI |
| Agentic visual creation | gpt_image_2_skill + GenArtist + DeepEyes + Agent Banana + VisionCreator + GEMS |
| Embodied VLA research | OpenVLA + LeRobot + LIBERO + RLBench + SAGE + Magma + VLAs-as-Tools |
| Reliability and security testing | VPI-Bench + OpenAgentSafety + UI-CUBE + OS-BLIND + HazardArena + OS-SPEAR |

Official Docs and Engineering Notes

| Resource | Link | Why read it |
| --- | --- | --- |
| OpenAI Computer Use guide | Docs | Developer-facing guide for building with computer-use tooling. |
| OpenAI Computer-Using Agent | Article | Product and research framing for modern CUAs. |
| OpenAI Skills guide | Docs | Practical reference for reusable agent skills. |
| OpenAI MCP and Connectors guide | Docs | Reference for connecting external tools and services to agents. |
| Anthropic: Developing a computer use model | Article | Strong public engineering writeup on GUI-agent training and evaluation. |
| Anthropic: Introducing computer use | Article | System framing and deployment context for computer-use models. |
| Google DeepMind: Gemini Robotics | Article | Industry view on embodied visual agents. |
| Google DeepMind: Gemini Robotics On-Device | Article | Notes on low-latency, local VLA deployment. |

Related Lists

| Repository | Link | Notes |
| --- | --- | --- |
| Awesome-GUI-Agents | GitHub | Focused companion index for GUI grounding and automation papers. |
| GUI-Agents-Paper-List | GitHub | Systematic paper index focused on GUI agents. |
| awesome-ui-agents | GitHub | Neighboring index for UI-agent papers and projects. |
| Evolving Visual Generation | GitHub | Adjacent map for visual-generation systems. |
| Awesome Multimodal Modeling | GitHub | Broader multimodal modeling list beyond the stricter agent boundary here. |

Contributing

Pull requests are welcome when they improve precision rather than volume.

Recommended metadata:

  • The paper title or project name.
  • Official paper, code, project page, or documentation link.
  • The best category for the item.
  • One sentence explaining the visual-agent loop, benchmark role, or builder value.

Out of scope:

  • Generic multimodal model releases with no visual-agent evaluation.
  • One-shot generation papers without planning, tools, search, critique, or interaction.
  • Duplicate benchmark rows unless the new row adds a distinct environment or protocol.
  • Unverified arXiv IDs, placeholder links, and marketing-only announcements.
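
As a rough illustration, the recommended metadata can be treated as a four-field record and checked before opening a pull request. This is a sketch under stated assumptions, not repository tooling: `Entry`, `CATEGORIES`, and `missing_metadata` are hypothetical names, the category list simply mirrors the section titles of this index, and the one-sentence check is a crude heuristic.

```python
from dataclasses import dataclass

# Assumption: categories mirror the section titles used in this index.
CATEGORIES = (
    "Surveys and Landscape",
    "GUI Grounding and Screen Perception",
    "Computer-Use Agents and Environments",
    "Embodied Vision-Language-Action Agents",
    "Agentic Visual Reasoning, Generation, and World Building",
    "Safety, Robustness, and Evaluation",
    "Benchmarks and Environments",
    "Skills, Tools, and Engineering Resources",
)


@dataclass
class Entry:
    name: str       # paper title or project name
    link: str       # official paper, code, project page, or documentation
    category: str   # best-fitting category
    summary: str    # one sentence on the agent loop, benchmark role, or builder value


def missing_metadata(entry):
    """Return the names of fields that look absent or malformed."""
    problems = []
    if not entry.name.strip():
        problems.append("name")
    if not entry.link.startswith(("https://", "http://")):
        problems.append("link")
    if entry.category not in CATEGORIES:
        problems.append("category")
    text = entry.summary.strip()
    # Heuristic: empty, or a period followed by more text, is not "one sentence".
    if not text or ". " in text.rstrip("."):
        problems.append("summary")
    return problems
```

An empty result only means the record is well-formed; whether the item belongs in the list is still decided against the out-of-scope rules above.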

Maintenance Policy

This repository is maintained as a precision-oriented research map:

  • Prefer primary sources: official papers, project pages, code repositories, datasets, benchmarks, and technical documentation.
  • Keep research entries, benchmarks, and engineering resources separated when their roles differ.
  • Add recent work only when it improves the conceptual coverage, empirical coverage, or builder utility of the map.
  • Verify arXiv identifiers, project links, and benchmark names before adding new entries.
  • Prune duplicate, weakly scoped, or marketing-only entries even when they are recent.
  • Preserve a strict visual-agent boundary: perception alone is not sufficient without grounding, planning, tool use, interaction, control, or agent-oriented evaluation.
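
Part of the identifier check in the policy above can be automated. The sketch below validates only the syntax of new-style arXiv identifiers (`YYMM.NNNN` for April 2007 through December 2014, `YYMM.NNNNN` from January 2015, with an optional `vN` suffix); `looks_like_arxiv_id` is a hypothetical helper name. It cannot confirm that a paper exists or that the ID matches the listed title, so a manual check of the abstract page is still needed.

```python
import re

# New-style arXiv identifier: two-digit year, two-digit month, a dot,
# a 4- or 5-digit sequence number, and an optional version suffix.
_ARXIV_NEW = re.compile(r"^(\d{2})(\d{2})\.(\d{4,5})(v\d+)?$")


def looks_like_arxiv_id(identifier):
    """Syntactic check only; does not query arXiv."""
    match = _ARXIV_NEW.match(identifier.strip())
    if not match:
        return False
    year, month, number = int(match.group(1)), int(match.group(2)), match.group(3)
    if not 1 <= month <= 12:
        return False
    if (year, month) < (7, 4):
        return False  # the new-style scheme began in April 2007
    # Sequence numbers have 4 digits through 2014 and 5 digits from 2015.
    expected_digits = 4 if year <= 14 else 5
    return len(number) == expected_digits
```

For example, `looks_like_arxiv_id("2413.00001")` is rejected because 13 is not a valid month, which catches one common kind of placeholder-looking ID.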

Citation

If you use this curated index in research or engineering work, please cite it as:

@misc{awesome-visual-agent,
  title        = {Awesome Visual Agent},
  author       = {OpenEnvision and contributors},
  year         = {2026},
  howpublished = {\url{https://github.com/OpenEnvision/Awesome-Visual-Agent}},
  note         = {Curated list of visual-agent papers, benchmarks, and tooling}
}
