Awesome Visual Agent

A curated research index for visual agents that perceive, ground, plan, act, create, and evaluate in visually grounded environments.

Visual agents occupy the intersection of multimodal perception, grounded reasoning, tool use, interaction, and control. This repository curates papers, benchmarks, datasets, runtimes, and engineering resources for systems that close the loop between visual observation and purposeful action.

The list is selective rather than exhaustive. It prioritizes works that introduce a clear agent loop, action space, evaluation protocol, data engine, safety finding, or reusable implementation artifact, while excluding generic multimodal models and one-shot visual-generation systems without an agentic mechanism.

Selection Boundary

Included areas:

GUI, web, desktop, and mobile agents that perceive screens and produce executable actions.
Visual grounding work that is clearly tied to downstream agent control.
Embodied vision-language-action systems for robot manipulation, navigation, and physical-world interaction.
Agentic visual reasoning and generation systems with search, planning, memory, tools, critique, or iterative refinement.
Benchmarks, data engines, simulators, safety suites, and toolchains that support visual-agent construction and evaluation.

Excluded by default:

Broad multimodal foundation models with no visual-agent evaluation.
Generic OCR, captioning, visual question answering, or layout parsing without an action or agent setting.
Image, video, or 3D generators that are only prompt-in/artifact-out.
Unverified arXiv IDs, placeholder-looking entries, product rumors, and duplicate rows.

Research Taxonomy

Track	Research question	Representative works
Screen grounding	Can the model localize text, widgets, controls, and regions well enough to act?	Set-of-Mark, SeeClick, OmniParser, UGround, ScreenSpot-Pro, GUI-Eyes
Computer use	Can the agent complete tasks in real websites, desktops, or phones over multiple steps?	WebArena, WebLINX, AppAgent, Mobile-Agent, OSWorld, AndroidWorld, Agent S, UI-TARS
Embodied VLA	Can visual observations and language be converted into safe physical actions?	PerAct, VIMA, RT-1, RT-2, Open X-Embodiment, OpenVLA, Pi-Zero, Magma
Agentic reasoning and creation	Can the system plan, search, critique, edit, or generate visual artifacts through a loop?	VISPROG, ViperGPT, GenArtist, DeepEyes, Agent Banana, VisionCreator, GEMS
Reliability and safety	Can we measure brittleness, privacy risk, prompt injection, unsafe actions, and deployment readiness?	VPI-Bench, OpenAgentSafety, OS-BLIND, HazardArena, UI-CUBE
Infrastructure	Which tools and environments support reproducible training, deployment, and evaluation?	BrowserGym, AgentLab, Stagehand, Playwright MCP, Agent S, Cua, OpenCUA, ScaleCUA, LeRobot

Recent Additions

Work	Date	Contribution / Relevance
GUI-Eyes	2026-01	Active visual perception for GUI grounding with learned crop/zoom tool use.
ShowUI-Aloha	2026-01	Converts human screen recordings into structured GUI-agent supervision.
OS-Symphony	2026-01	Holistic framework for robust computer-using agents.
ActionEngine	2026-02	Uses state-machine memory to make GUI agents more programmatic and recoverable.
Agent Banana	2026-02	Agentic image editing with planning and tool execution rather than one-shot editing.
SAGE	2026-02	Agentic 3D scene generation for embodied-AI policy training.
CUA-Suite	2026-03	Large human-annotated video demonstrations for computer-use agents.
GEMS	2026-03	Multimodal generation loop with memory, skills, and iterative agent refinement.
UI-Copilot	2026-04	Long-horizon GUI automation with tool-integrated policy optimization.
UI-Zoomer	2026-04	Uncertainty-driven zoom-in for hard GUI grounding cases.
DynamicGUIBench	2026-04	Evaluates GUI agents in high-dynamic interfaces rather than static screenshots.
Video2GUI	2026-05	Synthesizes GUI interaction trajectories from instructional videos.
UI-Verse	2026-05	Studies interface design heuristics that improve computer-use-agent reliability.
Securing Computer-Use Agents	2026-05	Connects CUA architecture, lifecycle, permission scope, and runtime reliability.
Don't Click That	2026-05	Deception-aware web-agent benchmark and defense for misleading interface elements.
VLAs-as-Tools	2026-05-13	Long-horizon embodied-agent strategy that delegates bounded physical subtasks to specialized VLA tools.
SaaS-Bench	2026-05-15	Real-world SaaS workflow benchmark for long-horizon computer-use agents.

Research Map

Surveys and Landscape

Work	Year	Links	Contribution / Relevance
A Comprehensive Survey of Agents for Computer Use	2025	paper	Broad map of computer-use-agent domains, agent loops, and evaluation bottlenecks.
GUI Agents: A Survey	2024	paper	Practical survey of GUI-agent architectures, datasets, benchmarks, and failure modes.
A Survey on (M)LLM-Based GUI Agents	2025	paper	Focused entry point for planning, grounding, memory, and GUI-agent evaluation.
Towards Trustworthy GUI Agents	2025	paper	Reliability and safety framing for deployment-facing GUI agents.
Large Multimodal Agents: A Survey	2024	paper	Contextual background on LLM-driven multimodal agent components.
A Survey on Vision-Language-Action Models for Embodied AI	2024	paper	Early VLA survey covering embodied perception, planning, and action.
Vision-Language-Action in Robotics	2026	paper	Data-centric survey of VLA datasets, benchmarks, and data engines.
Vision-Language-Action Safety	2026	paper	Focused taxonomy of threats, evaluations, and defenses for VLA systems.
Safety in Embodied AI	2026	paper	Wider safety survey across perception, planning, action, and interaction.
Visual Generation in the New Era	2026	paper	Conceptual lens for when visual generation becomes agentic world modeling.
Securing Computer-Use Agents	2026	paper	Deployment-grounded view of CUA reliability across architecture, lifecycle, permissions, and oversight.

GUI Grounding and Screen Perception

Work	Year	Links	Contribution / Relevance
CogAgent	2023	paper, code	Early high-resolution VLM built explicitly for GUI understanding and navigation.
Set-of-Mark Prompting	2023	paper, code	Simple visual marking strategy that became a practical grounding primitive for LMM agents.
SeeClick	2024	paper, code	Shows that GUI grounding is a core bottleneck for visual GUI agents.
ScreenAI	2024	paper	Strong foundation for screen, document, infographic, and layout-heavy visual understanding.
Ferret-UI	2024	paper, code	Region-aware mobile UI understanding with explicit grounding.
OmniParser	2024	paper, code	Practical screenshot-to-interactable-region parser for pure-vision GUI agents.
UGround	2024	paper, code	Strong pure-vision grounding baseline without relying on accessibility trees.
OS-ATLAS	2024	paper	Foundation action model for generalist GUI agents.
ShowUI	2024	paper, code	Unifies screenshot-conditioned GUI perception and action modeling.
Aguvis	2024	paper	Pure-vision GUI agent direction with autonomous interface interaction.
UI-E2I-Synth	2025	paper	Synthetic instruction pipeline for scaling GUI grounding supervision.
ScreenSpot-Pro	2025	paper	Hard high-resolution grounding benchmark for professional computer-use screens.
GUI-G1	2025	paper, code	Careful analysis of RL pitfalls in GUI grounding.
Enhancing Visual Grounding via Self-Evolutionary RL	2025	paper	Data-efficient RL recipe for high-resolution GUI grounding.
GUI-Actor	2025	paper	Coordinate-free grounding with an action head and verifier.
Phi-Ground	2025	paper	Strong empirical report on training compact GUI grounding models.
Test-Time RL for GUI Grounding	2025	paper	Test-time adaptation using region consistency.
Explicit Position-to-Coordinate Mapping	2025	paper	Addresses coordinate generation as a concrete grounding bottleneck.
GUI-Eyes	2026	paper	Learns when and how to call visual tools such as crop and zoom.
UI-Zoomer	2026	paper, code	Uses uncertainty to decide where to zoom for GUI grounding.

Computer-Use Agents and Environments

Work	Year	Links	Contribution / Relevance
Mind2Web	2023	paper	Foundational benchmark for generalist web agents.
Android in the Wild	2023	paper	Large-scale Android device-control dataset with realistic gestures.
WebArena	2023	paper, code	Realistic web-agent environment with execution-based tasks.
AutoDroid	2023	paper	Early Android task-automation system and benchmark that remains relevant as a mobile-agent baseline.
MM-Navigator	2023	paper	Early GPT-4V smartphone GUI navigation agent with zero-shot screen interaction.
AppAgent	2023	paper	Smartphone agent that learns app operation from autonomous exploration or demonstrations.
SeeAct	2024	paper	Web agent showing why grounding matters for GPT-4V-style agents.
Mobile-Agent	2024	paper, code	Vision-centric mobile device agent using visual perception tools and stepwise planning.
VisualWebArena	2024	paper, code	Adds visually grounded tasks to realistic web-agent evaluation.
WebVoyager	2024	paper	End-to-end multimodal web agent evaluated on live websites.
WebLINX	2024	paper, project	Large benchmark of multi-turn conversational web navigation with screenshots and action history.
OmniACT	2024	paper	Desktop and web benchmark where agents generate executable automation scripts.
WorkArena	2024	paper, code	Enterprise workflow benchmark for knowledge-work agents.
MMInA	2024	paper, code	Multihop multimodal Internet-agent benchmark on evolving real websites.
B-MoCA	2024	paper	Mobile device-control benchmark across diverse configurations.
OSWorld	2024	paper, code	Flagship benchmark for open-ended tasks in real desktop environments.
AndroidWorld	2024	paper, code	Dynamic Android benchmark with broad task diversity.
Mobile-Agent-v2	2024	paper, code	Multi-agent mobile operation assistant with planning, decision, and reflection roles.
MobileAgentBench	2024	paper	Practical benchmark for mobile LLM agents.
WebCanvas	2024	paper	Online web-agent benchmark and framework built around Mind2Web-Live.
Agent S	2024	paper, code	Open agentic framework for using computers through GUI actions.
Windows Agent Arena	2024	paper, code	Scalable evaluation environment for Windows OS agents.
SPA-Bench	2024	paper	Comprehensive smartphone-agent evaluation benchmark.
AndroidLab	2024	paper	Android training and benchmarking environment with virtual devices and task suites.
VideoWebArena	2024	paper	Long-context video understanding inside web-agent workflows.
MageBench	2024	paper, code	Lightweight visual-agent benchmark covering WebUI, Sokoban, and Football environments.
UI-TARS	2025	paper	Native GUI-agent model trained for perception, grounding, and action.
A3	2025	paper, project	Android Agent Arena for online mobile GUI-agent evaluation across real apps.
Agent S2	2025	paper, code	Generalist-specialist framework for computer-use agents.
UI-Evol	2025	paper	Plug-in knowledge-evolution module that improves OSWorld execution reliability for CUAs.
ZeroGUI	2025	paper	Online GUI-agent learning with task generation and reward estimation.
OpenCUA	2025	paper, code	Open foundation stack for computer-use agents.
ScaleCUA	2025	paper, code	Cross-platform data scaling for open-source computer-use agents.
OmegaUse	2026	paper	General-purpose GUI agent for autonomous task execution.
MemGUI-Bench	2026	paper	Evaluates memory across mobile GUI sessions and changing environments.
SecAgent	2026	paper	Efficient 3B mobile GUI agent with semantic context compression and Chinese mobile data.
OS-Symphony	2026	paper, code	Framework for robust and generalist computer-use agents.
ActionEngine	2026	paper	State-machine memory for more structured GUI automation.
ContractSkill	2026	paper	Treats web-agent skills as repairable contracts that can be verified and reused.
UI-Copilot	2026	paper, code	Long-horizon GUI automation with tool-integrated policy optimization.
ClawGUI	2026	paper	Unified framework for training, evaluating, and deploying GUI agents.
RiskWebWorld	2026	paper	Realistic interactive benchmark for e-commerce risk-management GUI agents.
DynamicGUIBench	2026	paper	Stress-tests agents in dynamic, evolving GUI environments.
UI-Verse	2026	paper	Interface-design perspective on making CUAs more reliable.
SaaS-Bench	2026	paper, code	Long-horizon benchmark over real deployable SaaS systems and professional workflows.

Embodied Vision-Language-Action Agents

Work	Year	Links	Contribution / Relevance
PerAct	2022	paper, project	Language-conditioned RGB-D manipulation agent that predicts voxel actions directly.
VIMA	2022	paper, project	Multimodal-prompt robot manipulation benchmark and transformer agent.
RT-1	2022	paper, project	Large-scale real-robot action model that anchors later RT/VLA work.
PaLM-E	2023	paper	Embodied multimodal language model connecting visual input to robot tasks.
RT-2	2023	paper	Canonical VLA model transferring web-scale vision-language knowledge to robot control.
Open X-Embodiment / RT-X	2023	paper	Large robot-learning dataset and RT-X model family.
Octo	2024	paper	Open-source generalist robot policy.
OpenVLA	2024	paper, code	Open-source VLA model and a common baseline for robot manipulation.
Pi-Zero	2024	paper	Flow-based VLA model for general robot control.
Magma	2025	paper, code	Bridges multimodal agents across digital and physical actions.
SafeVLA	2025	paper	Safety alignment for VLA models via constrained learning.
Interleave-VLA	2025	paper	Robot manipulation with interleaved image-text instructions.
ChatVLA-2	2025	paper	Open-world embodied reasoning from pretrained knowledge.
VLA^2	2025	paper	Agentic framework for unseen-concept manipulation.
World-Value-Action	2026	paper	Uses implicit planning and future-state value estimation for VLA systems.
VLAs-as-Tools	2026	paper	Splits long-horizon embodied tasks between a high-level VLM planner and specialized VLA tools.
SAGE	2026	paper, code	Agentically generates simulator-ready 3D scenes for embodied policy training.

Agentic Visual Reasoning, Generation, and World Building

Work	Year	Links	Contribution / Relevance
VISPROG	2022	paper, project	Foundational visual-programming approach for tool-composed visual reasoning and editing.
Visual ChatGPT	2023	paper, code	Early system connecting ChatGPT with visual foundation models for multi-step visual tasks.
ViperGPT	2023	paper, code	Uses Python execution to compose vision modules for interpretable visual reasoning.
LLaVA-Plus	2023	paper	Trains multimodal agents to select and use visual tools across understanding and generation.
GenArtist	2024	paper, code	MLLM-as-agent for image generation and editing through planning and tool use.
CIGEval	2025	paper	Agentic evaluation framework for conditional image generation.
DeepEyes	2025	paper	Reinforcement learning for active visual reasoning, grounding, and "thinking with images."
ImAgent	2025	paper	Test-time scalable multimodal agent framework for image generation.
GenAgent	2026	paper	Scales text-to-image generation through agentic multimodal reasoning.
Mind-Brush	2026	paper	Adds cognitive search and reasoning loops to image generation.
Agent Banana	2026	paper, code	High-fidelity image editing with planner-executor tooling.
M3	2026	paper	Multi-modal, multi-agent, multi-round reasoning for high-fidelity text-to-image generation.
VisionCreator	2026	paper	Native visual-generation agentic model with understanding, planning, and creation.
Gen-Searcher	2026	paper, project	Reinforces agentic search for image generation.
GEMS	2026	paper, project	Multimodal generation with memory, skills, and iterative agent loops.
Visual Generation in the New Era	2026	paper	Helpful taxonomy for agentic world modeling and generation.

Safety, Robustness, and Evaluation

Work	Year	Links	Contribution / Relevance
AGENTSAFE	2025	paper	Safety benchmark for embodied agents under hazardous instructions.
IS-Bench	2025	paper	Interactive safety benchmark for VLM-driven household agents.
VPI-Bench	2025	paper, code	Visual prompt-injection benchmark for computer-use agents.
OpenAgentSafety	2025	paper	Framework for evaluating real-world agent safety across risk categories.
OS-Sentinel	2025	paper	Hybrid validation for safer mobile GUI agents.
UI-CUBE	2025	paper	Enterprise CUA benchmark that measures operational reliability beyond task accuracy.
GUIGuard-Bench	2026	paper	Privacy-preserving GUI-agent evaluation.
CUAAudit	2026	paper	Tests whether VLMs can audit autonomous computer-use agents.
OS-BLIND	2026	paper	Shows how benign-looking user instructions expose CUA vulnerabilities.
HazardArena	2026	paper	Semantic safety evaluation for VLA systems.
RedVLA	2026	paper	Physical red-teaming benchmark for VLA models.
GUI-Perturbed	2026	paper	Domain-randomization study exposing GUI-grounding brittleness.
OS-SPEAR	2026	paper	Toolkit for safety, performance, efficiency, and robustness analysis of OS agents.
Don't Click That	2026	paper	Benchmarks and mitigates deceptive UI elements for VLM-based web agents.
ProjGuard	2026	paper	Safety monitoring for computer-use agents via low-dimensional projections.

Benchmarks and Environments

Area	Resource	Link	Primary Use
Web	MiniWoB++	code	Compact browser-interaction environments for controlled RL-style experiments.
Web	Mind2Web	paper	Offline web-agent action prediction and grounding.
Web	WebArena	paper, code	Realistic web navigation with execution-based grading.
Web	WebArena-Verified	code	Audited WebArena task set with deterministic offline evaluation.
Web	VisualWebArena	paper, code	Visually grounded web tasks where screenshots matter.
Web	WebLINX	paper, project	Conversational web navigation from expert demonstrations.
Web	WorkArena	paper, code	Enterprise workflow automation in ServiceNow-style environments.
Web	MMInA	paper, code	Multihop multimodal tasks over evolving real websites.
Web	WebCanvas	paper	Online web-agent evaluation with Mind2Web-Live.
Web	RiskWebWorld	paper	Realistic e-commerce risk-management tasks for GUI agents.
Web	SaaS-Bench	paper, code	Long-horizon professional workflows across deployable SaaS systems.
Desktop	OSWorld	paper, code	Open-ended desktop tasks in real operating systems.
Desktop	Windows Agent Arena	paper, code	Windows-specific scaling and reproducible OS-agent evaluation.
Desktop	OmniACT	paper	Evaluating executable automation rather than only low-level clicks.
Mobile	Android in the Wild	paper	Large-scale Android device-control demonstrations with screen observations.
Mobile	B-MoCA	paper	Mobile control across diverse device configurations.
Mobile	AndroidWorld	paper, code	Dynamic Android tasks with broad app coverage.
Mobile	AndroidControl	paper	Diverse Android control dataset for studying scale and generalization.
Mobile	MobileAgentBench	paper	Efficient mobile-agent evaluation across open-source apps.
Mobile	SPA-Bench	paper	Smartphone-agent testing with comprehensive task coverage.
Mobile	AndroidLab	paper	Training and systematic benchmarking on Android virtual devices.
Mobile	A3	paper, project	Real-app online evaluation for mobile GUI agents.
Mobile	SecAgent	paper	Chinese mobile GUI dataset, benchmark, and compact semantic-context agent.
Grounding	ScreenSpot-Pro	paper	High-resolution professional-screen grounding.
Visual-agent reasoning	MageBench	paper, code	Lightweight environments for vision-in-the-chain agent reasoning.
Memory	MemGUI-Bench	paper	Cross-session and cross-temporal mobile GUI memory.
Dynamic GUI	DynamicGUIBench	paper	Robustness under evolving interfaces and dynamic UI changes.
Enterprise reliability	UI-CUBE	paper	Deployment-readiness diagnostics beyond simple task success.
Security	VPI-Bench	paper, code	Visual prompt injection for GUI and computer-use agents.
Safety	AGENTSAFE	paper	Hazardous-instruction safety for embodied agents.
Safety	HazardArena	paper	Semantic safety evaluation for VLA systems.
Embodied	LIBERO	code	Lifelong robot manipulation tasks.
Embodied	RLBench	code	Simulation-based manipulation benchmark.

Skills, Tools, and Engineering Resources

Skill and Prompt Libraries

Resource	Type	Link	Primary Use
OpenAI Skills guide	docs	Docs	Understanding skill-style packaging for reusable agent capabilities.
awesome-agent-skills	collection	GitHub	Finding reusable agent skills across browsing, coding, documents, and visual tasks.
awesome-gpt-image-2	collection	GitHub	Tracking prompt patterns and workflows around modern image generation.
gpt_image_2_skill	skill package	GitHub	Example of packaging image-generation workflows as reusable skills.
ToDiagram skills	skill collection	GitHub	Diagram and visual-communication skills that pair well with visual agents.

Models, Parsers, and Grounding Tools

Resource	Type	Link	Primary Use
OmniParser	parser	GitHub	Converting screenshots into candidate interactable regions.
ShowUI	GUI model	GitHub	Screenshot-conditioned GUI action modeling and demonstration pipelines.
UGround	grounding model	GitHub	Pure-vision GUI grounding without accessibility trees.
OS-ATLAS	action model	Paper	Cross-platform GUI action grounding.
GUI-G1	grounding model	GitHub	Studying RL recipes and evaluation pitfalls for GUI grounding.
UI-Zoomer	grounding tool	GitHub	Adaptive zoom-in when the target UI element is hard to localize.
Phi-Ground	grounding model	Paper	Compact GUI grounding baseline for resource-constrained settings.

Agent Runtimes and Operator Stacks

Resource	Type	Link	Primary Use
UI-TARS Desktop	desktop agent	GitHub	Running multimodal desktop agents locally.
Agent S	runtime	GitHub	General computer-use experiments with a practical open framework.
Cua	operator stack	GitHub	Infrastructure for running and evaluating computer-use agents.
OpenAdapt	generative RPA stack	GitHub	Recording GUI demonstrations, training models, and evaluating agents from a unified CLI.
browser-use	browser runtime	GitHub	Browser automation workflows when DOM/tool access is acceptable.
Stagehand	browser runtime	GitHub	Hybrid code-plus-natural-language browser automation for production workflows.
Playwright MCP	browser MCP server	GitHub	Gives agents browser automation tools through the Model Context Protocol.
BrowserGym	browser harness	GitHub	Reproducible browser-agent experiments and benchmark orchestration.
AgentLab	experiment framework	GitHub	Running, comparing, and analyzing web-agent experiments.
OpenAdapt Desktop	desktop capture/runtime	GitHub	Capturing human demonstrations and replaying desktop workflows.
ScreenPipe	local data capture	GitHub	Recording local screen/audio context for personal or research agents.

Data Capture, Training, and Evaluation Stacks

Resource	Type	Link	Primary Use
OSWorld	desktop environment	GitHub	Standard desktop benchmark and environment.
AndroidWorld	mobile environment	GitHub	Dynamic Android environment for mobile agents.
AndroidControl	mobile dataset	Paper	Large Android control demonstrations for training and data-scaling studies.
Windows Agent Arena	desktop environment	GitHub	Windows-specific OS-agent evaluation.
WebArena	web benchmark	GitHub	Realistic web tasks with execution-based grading.
WebArena-Verified	web benchmark	GitHub	Audited and deterministic WebArena evaluation.
VisualWebArena	visual web benchmark	GitHub	Web tasks where screenshots and visual grounding matter.
WorkArena	enterprise benchmark	GitHub	Enterprise-style workflow automation.
OpenCUA	open CUA stack	GitHub	Data, models, and evaluation foundations for computer-use agents.
ScaleCUA	scaling stack	GitHub	Cross-platform CUA data scaling and evaluation.
CUA-Suite	data suite	Paper	Large human-annotated video demonstrations for CUA research.
ShowUI-Aloha	data pipeline	Paper, code	Turning screen recordings into GUI-agent training trajectories.
Video2GUI	data pipeline	Paper	Synthesizing GUI trajectories from instructional videos.
lmms-eval	eval toolkit	GitHub	Static multimodal evaluation that can complement closed-loop agent tests.

Embodied and Robotics Tooling

Resource	Type	Link	Primary Use
OpenVLA	VLA model	GitHub	Common open baseline for VLA robot manipulation.
LeRobot	robotics toolkit	GitHub	Robot-learning datasets, policies, training, and deployment tooling.
LIBERO	robotics benchmark	GitHub	Lifelong robot manipulation tasks.
RLBench	robotics benchmark	GitHub	Simulation-based manipulation evaluation.
SAGE	3D scene engine	GitHub	Agentic 3D scene generation for embodied policy training.
Magma	foundation model	GitHub	Bridging digital computer use and physical action.

Workflow Stacks

Workflow	Practical stack
GUI grounding research	ScreenSpot-Pro + OmniParser + UGround + GUI-G1 + UI-Zoomer
Browser-agent experiments	BrowserGym + AgentLab + WebArena + WebArena-Verified + VisualWebArena + WebLINX + MMInA + SaaS-Bench
Desktop computer-use agents	UI-TARS Desktop + Agent S + Cua + OSWorld + Windows Agent Arena
Mobile GUI agents	Android in the Wild + AndroidControl + AndroidWorld + A3 + MobileAgentBench + SPA-Bench + MemGUI-Bench
Demonstration and data pipelines	OpenAdapt + OpenAdapt Desktop + ScreenPipe + ShowUI-Aloha + CUA-Suite + Video2GUI
Agentic visual creation	gpt_image_2_skill + GenArtist + DeepEyes + Agent Banana + VisionCreator + GEMS
Embodied VLA research	OpenVLA + LeRobot + LIBERO + RLBench + SAGE + Magma + VLAs-as-Tools
Reliability and security testing	VPI-Bench + OpenAgentSafety + UI-CUBE + OS-BLIND + HazardArena + OS-SPEAR

Official Docs and Engineering Notes

Resource	Link	Why read it
OpenAI Computer Use guide	Docs	Developer-facing guide for building with computer-use tooling.
OpenAI Computer-Using Agent	Article	Product and research framing for modern CUAs.
OpenAI Skills guide	Docs	Practical reference for reusable agent skills.
OpenAI MCP and Connectors guide	Docs	Reference for connecting external tools and services to agents.
Anthropic: Developing a computer use model	Article	Strong public engineering writeup on GUI-agent training and evaluation.
Anthropic: Introducing computer use	Article	System framing and deployment context for computer-use models.
Google DeepMind: Gemini Robotics	Article	Industry view on embodied visual agents.
Google DeepMind: Gemini Robotics On-Device	Article	Notes on low-latency, local VLA deployment.

Related Lists

Repository	Link	Notes
Awesome-GUI-Agents	GitHub	Focused companion index for GUI grounding and automation papers.
GUI-Agents-Paper-List	GitHub	Systematic paper index focused on GUI agents.
awesome-ui-agents	GitHub	Neighboring index for UI-agent papers and projects.
Evolving Visual Generation	GitHub	Adjacent map for visual-generation systems.
Awesome Multimodal Modeling	GitHub	Broader multimodal modeling list beyond the stricter agent boundary here.

Contents

Selection Boundary
Research Taxonomy
Curation Rubric
Reading Pathways
Recent Additions
Research Map
Surveys and Landscape
GUI Grounding and Screen Perception
Computer-Use Agents and Environments
Embodied Vision-Language-Action Agents
Agentic Visual Reasoning, Generation, and World Building
Safety, Robustness, and Evaluation
Benchmarks and Environments
Skills, Tools, and Engineering Resources
Skill and Prompt Libraries
Models, Parsers, and Grounding Tools
Agent Runtimes and Operator Stacks
Data Capture, Training, and Evaluation Stacks
Embodied and Robotics Tooling
Workflow Stacks
Official Docs and Engineering Notes
Related Lists
Contributing
Maintenance Policy
Citation
