A curated list of papers, projects, and resources for multi-modal Graphical User Interface (GUI) agents.
*Build a digital assistant on your screen. (Banner image generated by DALL·E 3.)*
CONTRIBUTIONS WELCOME!
🔥 This project is actively maintained, and we welcome your contributions. If you have any suggestions, such as missing papers or incorrect information, please feel free to open an issue or submit a pull request.
🤖 Try our Awesome-Paper-Agent: provide an arXiv URL, and it automatically returns the formatted entry, like this:
User:
https://arxiv.org/abs/2312.13108
GPT:
+ [AssistGUI: Task-Oriented Desktop Graphical User Interface Automation](https://arxiv.org/abs/2312.13108) (Dec. 2023)
[![Star](https://img.shields.io/github/stars/showlab/assistgui.svg?style=social&label=Star)](https://github.com/showlab/assistgui)
[![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2312.13108)
[![Website](https://img.shields.io/badge/Website-9cf)](https://showlab.github.io/assistgui/)
You can then copy the formatted entry directly into your pull request.
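For offline use, below is a minimal sketch (not part of this repository) of the same formatting step, assuming only the public arXiv Atom API: it fetches a paper's title and submission date and prints an entry in this list's format. The `format_entry` helper is a hypothetical name, and the Star/Website badges are omitted because arXiv metadata carries no GitHub or project-page URLs, so those must still be added by hand.

```python
import re
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace used by the arXiv API


def format_entry(arxiv_url: str) -> str:
    """Fetch a paper's title and date from arXiv and build an awesome-list entry.

    Hypothetical helper for illustration; only the title line and arXiv badge
    are generated, since arXiv metadata has no GitHub/website links.
    """
    arxiv_id = arxiv_url.rstrip("/").split("/")[-1]  # e.g. "2312.13108"
    api = f"https://export.arxiv.org/api/query?id_list={arxiv_id}"
    with urllib.request.urlopen(api) as resp:
        entry = ET.parse(resp).getroot().find(f"{ATOM}entry")
    # arXiv wraps long titles across lines; collapse the whitespace.
    title = re.sub(r"\s+", " ", entry.find(f"{ATOM}title").text).strip()
    published = entry.find(f"{ATOM}published").text  # e.g. "2023-12-20T..."
    months = ["Jan.", "Feb.", "Mar.", "Apr.", "May", "Jun.",
              "Jul.", "Aug.", "Sep.", "Oct.", "Nov.", "Dec."]
    date = f"{months[int(published[5:7]) - 1]} {published[:4]}"
    return (
        f"+ [{title}](https://arxiv.org/abs/{arxiv_id}) ({date})\n"
        f"  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)]"
        f"(https://arxiv.org/abs/{arxiv_id})"
    )


if __name__ == "__main__":
    print(format_entry("https://arxiv.org/abs/2312.13108"))
```

Run on the example URL above, this should reproduce the title line and arXiv badge of the AssistGUI entry.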
⭐ If you find this repository useful, please give it a star.
Quick Navigation: [Datasets / Benchmarks](#datasets--benchmarks) · [Models / Agents](#models--agents) · [Surveys](#surveys) · [Projects](#projects) · [Safety](#safety) · [Related Repositories](#related-repositories)
## Datasets / Benchmarks

- World of Bits: An Open-Domain Platform for Web-Based Agents (Aug. 2017, ICML 2017)
- A Unified Solution for Structured Web Data Extraction (Jul. 2011, SIGIR 2011)
- Rico: A Mobile App Dataset for Building Data-Driven Design Applications (Oct. 2017)
- Reinforcement Learning on Web Interfaces using Workflow-Guided Exploration (Feb. 2018, ICLR 2018)
- Mapping Natural Language Instructions to Mobile UI Action Sequences (May. 2020, ACL 2020)
- WebSRC: A Dataset for Web-Based Structural Reading Comprehension (Jan. 2021, EMNLP 2021)
- AndroidEnv: A Reinforcement Learning Platform for Android (May. 2021)
- A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility (Feb. 2022)
- META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI (May. 2022)
- WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents (Jul. 2022)
- Language Models can Solve Computer Tasks (Mar. 2023)
- Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction (May. 2023)
- Mind2Web: Towards a Generalist Agent for the Web (Jun. 2023)
- Android in the Wild: A Large-Scale Dataset for Android Device Control (Jul. 2023)
- WebArena: A Realistic Web Environment for Building Autonomous Agents (Jul. 2023)
- Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models (Nov. 2023)
- AssistGUI: Task-Oriented Desktop Graphical User Interface Automation (Dec. 2023, CVPR 2024)
- VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks (Jan. 2024, ACL 2024)
- OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web (Feb. 2024)
- WebLINX: Real-World Website Navigation with Multi-Turn Dialogue (Feb. 2024)
- On the Multi-turn Instruction Following for Conversational Web Agents (Feb. 2024)
- AgentStudio: A Toolkit for Building General Virtual Agents (Mar. 2024)
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (Apr. 2024)
- Benchmarking Mobile Device Control Agents across Diverse Configurations (Apr. 2024, ICLR 2024)
- MMInA: Benchmarking Multihop Multimodal Internet Agents (Apr. 2024)
- Autonomous Evaluation and Refinement of Digital Agents (Apr. 2024)
- LlamaTouch: A Faithful and Scalable Testbed for Mobile UI Automation Task Evaluation (Apr. 2024)
- VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? (Apr. 2024)
- GUICourse: From General Vision Language Models to Versatile GUI Agents (Jun. 2024)
- GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents (Jun. 2024)
- GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices (Jun. 2024)
- VideoGUI: A Benchmark for GUI Automation from Instructional Videos (Jun. 2024)
- Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding (Jun. 2024)
- MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents (Jun. 2024)
- AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents (Jun. 2024)
- Practical, Automated Scenario-based Mobile App Testing (Jun. 2024)
- WebCanvas: Benchmarking Web Agents in Online Environments (Jun. 2024)
- On the Effects of Data Scale on Computer Control Agents (Jun. 2024)
- CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents (Jul. 2024)
- WebVLN: Vision-and-Language Navigation on Websites (AAAI 2024)
- Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? (Jul. 2024)
- AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents (Jul. 2024)
- Harnessing Webpage UIs for Text-Rich Visual Understanding (Oct. 2024)
## Models / Agents

- Grounding Open-Domain Instructions to Automate Web Support Tasks (Mar. 2021)
- Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning (Aug. 2021)
- A Data-Driven Approach for Learning to Control Computers (Feb. 2022)
- Augmenting Autotelic Agents with Large Language Models (May. 2023)
- Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control (Jun. 2023, ICLR 2024)
- A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis (Jul. 2023, ICLR 2024)
- LASER: LLM Agent with State-Space Exploration for Web Navigation (Sep. 2023)
- CogAgent: A Visual Language Model for GUI Agents (Dec. 2023, CVPR 2024)
- WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models (Jan. 2024)
- OS-Copilot: Towards Generalist Computer Agents with Self-Improvement (Feb. 2024)
- UFO: A UI-Focused Agent for Windows OS Interaction (Feb. 2024)
- Comprehensive Cognitive LLM Agent for Smartphone GUI Automation (Feb. 2024)
- Improving Language Understanding from Screenshots (Feb. 2024)
- AutoWebGLM: Bootstrap and Reinforce a Large Language Model-based Web Navigating Agent (Apr. 2024, KDD 2024)
- SheetCopilot: Bringing Software Productivity to the Next Level through Large Language Models (May. 2023, NeurIPS 2023)
- You Only Look at Screens: Multimodal Chain-of-Action Agents (Sep. 2023)
- Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API (Oct. 2023)
- OpenAgents: An Open Platform for Language Agents in the Wild (Oct. 2023)
- GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation (Nov. 2023)
- AppAgent: Multimodal Agents as Smartphone Users (Dec. 2023)
- SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents (Jan. 2024, ACL 2024)
- GPT-4V(ision) is a Generalist Web Agent, if Grounded (Jan. 2024, ICML 2024)
- Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception (Jan. 2024)
- Dual-View Visual Contextualization for Web Navigation (Feb. 2024, CVPR 2024)
- DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning (Jun. 2024)
- Visual Grounding for User Interfaces (NAACL 2024)
- ScreenAgent: A Computer Control Agent Driven by Visual Language Large Model (Feb. 2024)
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding (Feb. 2024)
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (Apr. 2024)
- Octopus: On-device language model for function calling of software APIs (Apr. 2024)
- Octopus v2: On-device language model for super agent (Apr. 2024)
- Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent (Apr. 2024)
- Octopus v4: Graph of language models (Apr. 2024)
- Search Beyond Queries: Training Smaller Language Models for Web Interactions via Reinforcement Learning (Apr. 2024)
- Enhancing Mobile "How-to" Queries with Automated Search Results Verification and Reranking (Apr. 2024, SIGIR 2024)
- Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation (Dec. 2023, MobiCom 2024)
- Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study (Mar. 2024)
- Android in the Zoo: Chain-of-Action-Thought for GUI Agents (Mar. 2024)
- GUI Action Narrator: Where and When Did That Action Take Place? (Jun. 2024)
- Identifying User Goals from UI Trajectories (Jun. 2024)
- VGA: Vision GUI Assistant -- Minimizing Hallucinations through Image-Centric Fine-Tuning (Jun. 2024)
- Octo-planner: On-device Language Model for Planner-Action Agents (Jun. 2024)
- E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion (Jun. 2024)
- Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration (Jun. 2024)
- MobileFlow: A Multimodal LLM For Mobile GUI Agent (Jul. 2024)
- Vision-driven Automated Mobile GUI Testing via Multimodal Large Language Model (Jul. 2024)
- Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence (Jul. 2024)
- MobileExperts: A Dynamic Tool-Enabled Agent Team in Mobile Devices (Jul. 2024)
- AUITestAgent: Automatic Requirements Oriented GUI Function Testing (Jul. 2024)
- Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems (Jul. 2024)
- OmniParser for Pure Vision Based GUI Agent (Aug. 2024)
- VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents (Aug. 2024)
- Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents (Aug. 2024)
- MindSearch: Mimicking Human Minds Elicits Deep AI Searcher (Jul. 2024)
- AppAgent v2: Advanced Agent for Flexible Mobile Interactions (Aug. 2024)
- Agent Workflow Memory (Sep. 2024)
- MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding (Sep. 2024)
- Agent S: An Open Agentic Framework that Uses Computers Like a Human (Oct. 2024)
- MobA: A Two-Level Agent System for Efficient Mobile Task Automation (Oct. 2024)
- Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents (Oct. 2024)
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents (Oct. 2024)
- Attacking Vision-Language Computer Agents via Pop-ups (Nov. 2024)
- AutoGLM: Autonomous Foundation Agents for GUIs (Nov. 2024)
- AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations (Nov. 2024)
- ShowUI: One Vision-Language-Action Model for Generalist GUI Agent (Nov. 2024)
- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction (Dec. 2024)
- Falcon-UI: Understanding GUI Before Following User Instructions (Dec. 2024)
## Surveys

- GUI Agents with Foundation Models: A Comprehensive Survey (Nov. 2024)
- Large Language Model-Brained GUI Agents: A Survey (Nov. 2024)
- GUI Agents: A Survey (Dec. 2024)
## Projects

- GPT-4V-Act: AI agent using GPT-4V(ision) for web UI interaction
- Mobile-Agent: The Powerful Mobile Device Operation Assistant Family
- LaVague: Large Action Model Framework to Develop AI Web Agents
- OpenAdapt: AI-First Process Automation with Large Multimodal Models
- Surfkit: A toolkit for building and sharing AI agents that operate on devices
- WebMarker: Mark web pages for use with vision-language models
## Safety

- AdvWeb: Controllable Black-box Attacks on VLM-powered Web Agents
- MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control
- EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage
- Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents
- Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions (Aug. 2024)
- Security Matrix for Multimodal Agents on Mobile Devices: A Systematic and Proof of Concept Study

## Related Repositories
- awesome-llm-powered-agent
- Awesome-LLM-based-Web-Agent-and-Tools
- awesome-ui-agents
- computer-control-agent-knowledge-base
- Awesome GUI Agent Paper List
This template is provided by Awesome-Video-Diffusion and Awesome-MLLM-Hallucination.