diff --git a/README.md b/README.md index 3452ccf..d1a5e7d 100644 --- a/README.md +++ b/README.md @@ -18,7 +18,7 @@ A variety of papers related to GUI Agents, including but not limited to: ## All Papers (from most recent to oldest)
-# Papers +Papers - [The BrowserGym Ecosystem for Web Agent Research](https://arxiv.org/abs/2412.05467) - Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, Alexandre Lacoste @@ -119,15 +119,6 @@ A variety of papers related to GUI Agents, including but not limited to: - 🔑 Key: [framework], [reinforcement learning], [RL], [self-evolving curriculum], [WebRL], [outcome-supervised reward model] - 📖 TLDR: This paper introduces *WebRL*, a self-evolving online curriculum reinforcement learning framework designed to train high-performance web agents using open large language models (LLMs). WebRL addresses challenges such as the scarcity of training tasks, sparse feedback signals, and policy distribution drift in online learning. It incorporates a self-evolving curriculum that generates new tasks from unsuccessful attempts, a robust outcome-supervised reward model (ORM), and adaptive reinforcement learning strategies to ensure consistent improvements. Applied to Llama-3.1 and GLM-4 models, WebRL significantly enhances their performance on web-based tasks, surpassing existing state-of-the-art web agents. -- [AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents](https://arxiv.org/abs/2410.24024) - - Yifan Xu, Xiao Liu, Xueqiao Sun, Siyi Cheng, Hao Yu, Hanyu Lai, Shudan Zhang, Dan Zhang, Jie Tang, Yuxiao Dong - - 🏛️ Institutions: Tsinghua University, Peking University - - 📅 Date: October 31, 2024 - - 📑 Publisher: arXiv - - 💻 Env: [Mobile] - - 🔑 Key: [framework], [dataset], [benchmark], [AndroidLab] - - 📖 TLDR: This paper introduces **AndroidLab**, a comprehensive framework for training and systematically benchmarking Android autonomous agents. It provides an operational environment with diverse modalities and action spaces, supporting both large language models (LLMs) and multimodal models (LMMs). The benchmark includes 138 tasks across nine apps on predefined Android virtual devices. Utilizing AndroidLab, the authors developed an Android Instruction dataset and trained six open-source LLMs and LMMs, significantly improving their average success rates. - - [From Context to Action: Analysis of the Impact of State Representation and Context on the Generalization of Multi-Turn Web Navigation Agents](https://arxiv.org/abs/2409.13701) - Nalin Tiwary, Vardhan Dongre, Sanil Arun Chawla, Ashwin Lamani, Dilek Hakkani-Tür - 🏛️ Institutions: UIUC @@ -137,6 +128,15 @@ A variety of papers related to GUI Agents, including but not limited to: - 🔑 Key: [framework], [context management], [generalization], [multi-turn navigation], [CWA] - 📖 TLDR: This study examines how different contextual elements affect the performance and generalization of Conversational Web Agents (CWAs) in multi-turn web navigation tasks. By optimizing context management—specifically interaction history and web page representation—the research demonstrates enhanced agent performance across various out-of-distribution scenarios, including unseen websites, categories, and geographic locations.
+- [AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents](https://arxiv.org/abs/2410.24024) + - Yifan Xu, Xiao Liu, Xueqiao Sun, Siyi Cheng, Hao Yu, Hanyu Lai, Shudan Zhang, Dan Zhang, Jie Tang, Yuxiao Dong + - πŸ›οΈ Institutions: Tsinghua University, Peking University + - πŸ“… Date: October 31, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [Mobile] + - πŸ”‘ Key: [framework], [dataset], [benchmark], [AndroidLab] + - πŸ“– TLDR: This paper introduces **AndroidLab**, a comprehensive framework for training and systematically benchmarking Android autonomous agents. It provides an operational environment with diverse modalities and action spaces, supporting both large language models (LLMs) and multimodal models (LMMs). The benchmark includes 138 tasks across nine apps on predefined Android virtual devices. Utilizing AndroidLab, the authors developed an Android Instruction dataset and trained six open-source LLMs and LMMs, significantly improving their average success rates. + - [Evaluating Cultural and Social Awareness of LLM Web Agents](https://arxiv.org/abs/2410.23252) - Haoyi Qiu, Alexander R. Fabbri, Divyansh Agarwal, Kung-Hsiang Huang, Sarah Tan, Nanyun Peng, Chien-Sheng Wu - πŸ›οΈ Institutions: UCLA, Salesforce AI Research @@ -191,15 +191,6 @@ A variety of papers related to GUI Agents, including but not limited to: - πŸ”‘ Key: [dataset], [framework], [synthetic data] - πŸ“– TLDR: The *EDGE* framework proposes an innovative approach to improve GUI understanding and interaction capabilities in vision-language models through large-scale, multi-granularity synthetic data generation. By leveraging webpage data, EDGE minimizes the need for manual annotations and enhances the adaptability of models across desktop and mobile GUI environments. Evaluations show its effectiveness in diverse GUI-related tasks, contributing significantly to autonomous agent development in GUI navigation and interaction. -- [VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks](https://doi.org/10.48550/arXiv.2410.19100) - - Lawrence Jang, Yinheng Li, Charles Ding, Justin Lin, Paul Pu Liang, Dan Zhao, Rogerio Bonatti, Kazuhito Koishida - - πŸ›οΈ Institutions: CMU, MIT, NYU, Microsoft - - πŸ“… Date: October 24, 2024 - - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Web] - - πŸ”‘ Key: [benchmark], [dataset], [video understanding], [long-context], [VideoWA] - - πŸ“– TLDR: This paper introduces **VideoWebArena (VideoWA)**, a benchmark assessing multimodal agents in video-based tasks. It features over 2,000 tasks focused on skill and factual retention, using video tutorials to simulate long-context environments. Results highlight current challenges in agentic abilities, providing a critical testbed for long-context video understanding improvements. - - [AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant](https://arxiv.org/abs/2410.18603) - Chengyou Jia, Minnan Luo, Zhuohang Dang, Qiushi Sun, Fangzhi Xu, Junlin Hu, Tianbao Xie, Zhiyong Wu - πŸ›οΈ Institutions: XJTU, Shanghai AI Lab, HKU @@ -218,14 +209,14 @@ A variety of papers related to GUI Agents, including but not limited to: - πŸ”‘ Key: [API-based agent], [hybrid agent], [benchmark], [WebArena], [SOTA performance] - πŸ“– TLDR: This paper introduces API-based and hybrid agents designed to execute online tasks by accessing both APIs and traditional web browsing interfaces. 
In evaluations using WebArena, a benchmark for web navigation, the API-based agent achieves higher performance than browser-based agents, and the hybrid model achieves a success rate of 35.8%, setting a new state-of-the-art (SOTA) in task-agnostic web navigation. The findings highlight the efficiency and reliability gains of API interactions for web agents. -- [MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control](https://arxiv.org/abs/2410.17520) - - Juyong Lee, Dongyoon Hahm, June Suk Choi, W. Bradley Knox, Kimin Lee - - πŸ›οΈ Institutions: KAIST, UT at Austin - - πŸ“… Date: October 23, 2024 +- [VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks](https://doi.org/10.48550/arXiv.2410.19100) + - Lawrence Jang, Yinheng Li, Charles Ding, Justin Lin, Paul Pu Liang, Dan Zhao, Rogerio Bonatti, Kazuhito Koishida + - πŸ›οΈ Institutions: CMU, MIT, NYU, Microsoft + - πŸ“… Date: October 24, 2024 - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Mobile] - - πŸ”‘ Key: [benchmark], [safety], [evaluation], [Android emulator] - - πŸ“– TLDR: *MobileSafetyBench* introduces a benchmark for evaluating the safety of large language model (LLM)-based autonomous agents in mobile device control. Using Android emulators, the benchmark simulates real-world tasks in apps such as messaging and banking to assess agents' safety and helpfulness. The safety-focused tasks test for privacy risk management and robustness against adversarial prompt injections. Experiments show agents perform well in helpful tasks but struggle with safety-related challenges, underscoring the need for continued advancements in mobile safety mechanisms for autonomous agents. + - πŸ’» Env: [Web] + - πŸ”‘ Key: [benchmark], [dataset], [video understanding], [long-context], [VideoWA] + - πŸ“– TLDR: This paper introduces **VideoWebArena (VideoWA)**, a benchmark assessing multimodal agents in video-based tasks. It features over 2,000 tasks focused on skill and factual retention, using video tutorials to simulate long-context environments. Results highlight current challenges in agentic abilities, providing a critical testbed for long-context video understanding improvements. - [Lightweight Neural App Control](https://arxiv.org/abs/2410.17883) - Filippos Christianos, Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao @@ -236,6 +227,15 @@ A variety of papers related to GUI Agents, including but not limited to: - πŸ”‘ Key: [framework], [vision-language model], [Action Transformer], [app agent], [Android control], [multi-modal] - πŸ“– TLDR: This paper introduces LiMAC, a mobile control framework for Android that integrates an Action Transformer and fine-tuned vision-language models to execute precise actions in mobile apps. Tested on open-source datasets, LiMAC improves action accuracy by up to 42% over traditional prompt engineering baselines, demonstrating enhanced efficiency and accuracy in mobile app control tasks. +- [MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control](https://arxiv.org/abs/2410.17520) + - Juyong Lee, Dongyoon Hahm, June Suk Choi, W. Bradley Knox, Kimin Lee + - πŸ›οΈ Institutions: KAIST, UT at Austin + - πŸ“… Date: October 23, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [Mobile] + - πŸ”‘ Key: [benchmark], [safety], [evaluation], [Android emulator] + - πŸ“– TLDR: *MobileSafetyBench* introduces a benchmark for evaluating the safety of large language model (LLM)-based autonomous agents in mobile device control. 
Using Android emulators, the benchmark simulates real-world tasks in apps such as messaging and banking to assess agents' safety and helpfulness. The safety-focused tasks test for privacy risk management and robustness against adversarial prompt injections. Experiments show agents perform well in helpful tasks but struggle with safety-related challenges, underscoring the need for continued advancements in mobile safety mechanisms for autonomous agents. + - [Large Language Models Empowered Personalized Web Agents](https://ar5iv.org/abs/2410.17236) - Hongru Cai, Yongqi Li, Wenjie Wang, Fengbin Zhu, Xiaoyu Shen, Wenjie Li, Tat-Seng Chua - πŸ›οΈ Institutions: HK PolyU, NTU Singapore @@ -290,6 +290,15 @@ A variety of papers related to GUI Agents, including but not limited to: - πŸ”‘ Key: [framework], [autonomous GUI interaction], [experience-augmented hierarchical planning] - πŸ“– TLDR: This paper introduces Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI). The system addresses key challenges in automating computer tasks through experience-augmented hierarchical planning and an Agent-Computer Interface (ACI). Agent S demonstrates significant improvements over baselines on the OSWorld benchmark, achieving a 20.58% success rate (83.6% relative improvement). The framework shows generalizability across different operating systems and provides insights for developing more effective GUI agents. +- [ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents](https://sites.google.com/view/st-webagentbench/home) + - Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov + - πŸ›οΈ Institutions: IBM Research + - πŸ“… Date: October 9, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [Web] + - πŸ”‘ Key: [benchmark], [safety], [trustworthiness], [ST-WebAgentBench] + - πŸ“– TLDR: This paper introduces **ST-WebAgentBench**, a benchmark designed to evaluate the safety and trustworthiness of web agents in enterprise contexts. It defines safe and trustworthy agent behavior, outlines the structure of safety policies, and introduces the "Completion under Policies" metric to assess agent performance. The study reveals that current state-of-the-art agents struggle with policy adherence, highlighting the need for improved policy awareness and compliance in web agents. + - [TinyClick: Single-Turn Agent for Empowering GUI Automation](https://arxiv.org/abs/2410.11871) - Pawel Pawlowski, Krystian Zawistowski, Wojciech Lapacz, Marcin Skorupa, Adam Wiacek, Sebastien Postansque, Jakub Hoscilowicz - πŸ›οΈ Institutions: Samsung R&D Poland, Warsaw University of Technology @@ -308,15 +317,6 @@ A variety of papers related to GUI Agents, including but not limited to: - πŸ”‘ Key: [framework], [model], [SeeClick], [AITW benchmark] - πŸ“– TLDR: The paper introduces *ClickAgent*, a framework that enhances autonomous agents' interaction with mobile UIs by improving their ability to locate interface elements accurately. This is achieved through a dual-component system where an MLLM performs reasoning and action planning, while a dedicated UI location model (e.g., SeeClick) handles element identification. ClickAgent, evaluated on the AITW benchmark and tested on both emulators and real Android devices, surpasses other agents like CogAgent and AppAgent in task success rate, advancing automation reliability on mobile platforms. 
-- [ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents](https://sites.google.com/view/st-webagentbench/home) - - Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov - - πŸ›οΈ Institutions: IBM Research - - πŸ“… Date: October 9, 2024 - - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Web] - - πŸ”‘ Key: [benchmark], [safety], [trustworthiness], [ST-WebAgentBench] - - πŸ“– TLDR: This paper introduces **ST-WebAgentBench**, a benchmark designed to evaluate the safety and trustworthiness of web agents in enterprise contexts. It defines safe and trustworthy agent behavior, outlines the structure of safety policies, and introduces the "Completion under Policies" metric to assess agent performance. The study reveals that current state-of-the-art agents struggle with policy adherence, highlighting the need for improved policy awareness and compliance in web agents. - - [Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents](https://osu-nlp-group.github.io/UGround/) - Boyu Gou, Ruochen Wang, Boyuan Zheng, Yucheng Xie, Cheng Chang, Yiheng Shu, Haotian Sun, Yu Su - πŸ›οΈ Institutions: OSU, Orby AI @@ -470,6 +470,24 @@ A variety of papers related to GUI Agents, including but not limited to: - πŸ”‘ Key: [framework], [AppAgent v2] - πŸ“– TLDR: This work presents *AppAgent v2*, a novel LLM-based multimodal agent framework for mobile devices capable of navigating applications by emulating human-like interactions such as tapping and swiping. The agent constructs a flexible action space that enhances adaptability across various applications, including parsing text and vision descriptions. It operates through two main phases: exploration and deployment, utilizing retrieval-augmented generation (RAG) technology to efficiently retrieve and update information from a knowledge base, thereby empowering the agent to perform tasks effectively and accurately. +- [OmniParser for Pure Vision Based GUI Agent](https://microsoft.github.io/OmniParser/) + - Yadong Lu, Jianwei Yang, Yelong Shen, Ahmed Awadallah + - πŸ›οΈ Institutions: Microsoft Research, Microsoft Gen AI + - πŸ“… Date: August 1, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [GUI] + - πŸ”‘ Key: [framework], [dataset], [OmniParser] + - πŸ“– TLDR: This paper introduces **OmniParser**, a method for parsing user interface screenshots into structured elements, enhancing the ability of models like GPT-4V to generate actions accurately grounded in corresponding UI regions. The authors curated datasets for interactable icon detection and icon description, fine-tuning models to parse interactable regions and extract functional semantics of UI elements. + +- [CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation](https://aclanthology.org/2024.findings-acl.539) + - Xinbei Ma, Zhuosheng Zhang, Hai Zhao + - πŸ›οΈ Institutions: SJTU + - πŸ“… Date: August 2024 + - πŸ“‘ Publisher: ACL 2024 + - πŸ’» Env: [Mobile] + - πŸ”‘ Key: [model], [framework], [benchmark] + - πŸ“– TLDR: This paper presents CoCo-Agent, a multimodal large language model (MLLM) designed for smartphone GUI automation. It introduces two novel approaches: Comprehensive Environment Perception (CEP) for enhanced GUI understanding, and Conditional Action Prediction (CAP) to improve action response accuracy. 
The proposed agent achieves state-of-the-art performance on GUI automation benchmarks such as AITW and META-GUI, showcasing its capabilities in realistic scenarios. + - [Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents](https://arxiv.org/abs/2408.07199) - [Author information not available] - 🏛️ Institutions: MultiOn, Stanford @@ -488,24 +506,6 @@ - 🔑 Key: [multimodal agents], [environmental distractions], [robustness] - 📖 TLDR: This paper highlights the vulnerability of multimodal agents to environmental distractions. The researchers demonstrate that these agents, which process multiple types of input (e.g., text, images, audio), can be significantly impacted by irrelevant or misleading environmental cues. The study provides insights into the limitations of current multimodal systems and emphasizes the need for more robust architectures that can filter out distractions and maintain focus on relevant information in complex, real-world environments. -- [CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation](https://aclanthology.org/2024.findings-acl.539) - - Xinbei Ma, Zhuosheng Zhang, Hai Zhao - - 🏛️ Institutions: SJTU - - 📅 Date: August 2024 - - 📑 Publisher: ACL 2024 - - 💻 Env: [Mobile] - - 🔑 Key: [model], [framework], [benchmark] - - 📖 TLDR: This paper presents CoCo-Agent, a multimodal large language model (MLLM) designed for smartphone GUI automation. It introduces two novel approaches: Comprehensive Environment Perception (CEP) for enhanced GUI understanding, and Conditional Action Prediction (CAP) to improve action response accuracy. The proposed agent achieves state-of-the-art performance on GUI automation benchmarks such as AITW and META-GUI, showcasing its capabilities in realistic scenarios:contentReference[oaicite:0]{index=0}:contentReference[oaicite:1]{index=1}:contentReference[oaicite:2]{index=2}:contentReference[oaicite:3]{index=3}. - -- [OmniParser for Pure Vision Based GUI Agent](https://microsoft.github.io/OmniParser/) - - Yadong Lu, Jianwei Yang, Yelong Shen, Ahmed Awadallah - - 🏛️ Institutions: Microsoft Research, Microsoft Gen AI - - 📅 Date: August 1, 2024 - - 📑 Publisher: arXiv - - 💻 Env: [GUI] - - 🔑 Key: [framework], [dataset], [OmniParser] - - 📖 TLDR: This paper introduces **OmniParser**, a method for parsing user interface screenshots into structured elements, enhancing the ability of models like GPT-4V to generate actions accurately grounded in corresponding UI regions. The authors curated datasets for interactable icon detection and icon description, fine-tuning models to parse interactable regions and extract functional semantics of UI elements.
- - [OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation](https://arxiv.org/abs/2407.19056) - Zilong Wang, Yuedong Cui, Li Zhong, Zimin Zhang, Da Yin, Bill Yuchen Lin, Jingbo Shang - πŸ›οΈ Institutions: UCSD, UCLA, AI2 @@ -569,6 +569,15 @@ A variety of papers related to GUI Agents, including but not limited to: - πŸ”‘ Key: [framework], [tool formulation], [multi-agent collaboration], [MobileExperts] - πŸ“– TLDR: This paper introduces *MobileExperts*, a framework that enhances autonomous operations on mobile devices by dynamically assembling agent teams based on user requirements. Each agent independently explores and formulates tools to evolve into an expert, improving efficiency and reducing reasoning costs. +- [AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents](https://yuxiangchai.github.io/AMEX/) + - Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Dingyu Zhang, Peng Gao, Shuai Ren, Hongsheng Li + - πŸ›οΈ Institutions: CUHK, SJTU, Shanghai AI Lab, vivo AI Lab + - πŸ“… Date: July 3, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [Mobile] + - πŸ”‘ Key: [dataset], [benchmark], [AMEX] + - πŸ“– TLDR: This paper introduces the **Android Multi-annotation EXpo (AMEX)**, a comprehensive dataset designed for training and evaluating mobile GUI-control agents. AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, annotated at multiple levels, including GUI interactive element grounding, functionality descriptions, and complex natural language instructions. The dataset aims to advance research on AI agents capable of completing complex tasks by interacting directly with mobile device GUIs. + - [CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents](https://arxiv.org/abs/2407.01511) - Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Philip Torr, Bernard Ghanem, Guohao Li - πŸ›οΈ Institutions: KAUST, UTokyo, CMU, Stanford, Harvard, Tsinghua, SUSTech, Oxford @@ -587,15 +596,6 @@ A variety of papers related to GUI Agents, including but not limited to: - πŸ”‘ Key: [framework], [VisionDroid] - πŸ“– TLDR: The paper presents **VisionDroid**, a vision-driven automated GUI testing approach utilizing Multimodal Large Language Models (MLLM) to detect non-crash functional bugs in mobile applications. By extracting GUI text information and aligning it with screenshots, VisionDroid enables MLLM to understand GUI context, facilitating deeper and function-oriented exploration. The approach segments exploration history into logically cohesive parts, prompting MLLM for bug detection, demonstrating superior performance over existing methods. -- [AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents](https://yuxiangchai.github.io/AMEX/) - - Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Dingyu Zhang, Peng Gao, Shuai Ren, Hongsheng Li - - πŸ›οΈ Institutions: CUHK, SJTU, Shanghai AI Lab, vivo AI Lab - - πŸ“… Date: July 3, 2024 - - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Mobile] - - πŸ”‘ Key: [dataset], [benchmark], [AMEX] - - πŸ“– TLDR: This paper introduces the **Android Multi-annotation EXpo (AMEX)**, a comprehensive dataset designed for training and evaluating mobile GUI-control agents. 
AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, annotated at multiple levels, including GUI interactive element grounding, functionality descriptions, and complex natural language instructions. The dataset aims to advance research on AI agents capable of completing complex tasks by interacting directly with mobile device GUIs. - - [Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding](https://screen-point-and-read.github.io/) - Yue Fan, Lei Ding, Ching-Chen Kuo, Shan Jiang, Yang Zhao, Xinze Guan, Jie Yang, Yi Zhang, Xin Eric Wang - πŸ›οΈ Institutions: UCSC, Microsoft Research @@ -740,15 +740,6 @@ A variety of papers related to GUI Agents, including but not limited to: - πŸ”‘ Key: [framework], [multi-agent], [planning], [decision-making], [reflection] - πŸ“– TLDR: The paper presents **Mobile-Agent-v2**, a multi-agent architecture designed to assist with mobile device operations. It comprises three agents: a planning agent that generates task progress, a decision agent that navigates tasks using a memory unit, and a reflection agent that corrects erroneous operations. This collaborative approach addresses challenges in navigation and long-context input scenarios, achieving over a 30% improvement in task completion compared to single-agent architectures. -- [Visual Grounding for User Interfaces](https://aclanthology.org/2024.naacl-industry.9/) - - Yijun Qian, Yujie Lu, Alexander Hauptmann, Oriana Riva - - πŸ›οΈ Institutions: CMU, UCSB - - πŸ“… Date: June 2024 - - πŸ“‘ Publisher: NAACL 2024 - - πŸ’» Env: [GUI] - - πŸ”‘ Key: [framework], [visual grounding], [UI element localization], [LVG] - - πŸ“– TLDR: This work introduces the task of visual UI grounding, which unifies detection and grounding by enabling models to identify UI elements referenced by natural language commands solely from visual input. The authors propose **LVG**, a model that outperforms baselines pre-trained on larger datasets by over 4.9 points in top-1 accuracy, demonstrating its effectiveness in localizing referenced UI elements without relying on UI metadata. - - [WebSuite: Systematically Evaluating Why Web Agents Fail](https://arxiv.org/abs/2406.01623) - Eric Li, Jim Waldo - πŸ›οΈ Institutions: Harvard @@ -767,6 +758,15 @@ A variety of papers related to GUI Agents, including but not limited to: - πŸ”‘ Key: [benchmark], [instructional videos], [visual planning], [hierarchical task decomposition], [complex software interaction] - πŸ“– TLDR: VideoGUI presents a benchmark for evaluating GUI automation on tasks derived from instructional videos, focusing on visually intensive applications like Adobe Photoshop and video editing software. The benchmark includes 178 tasks, with a hierarchical evaluation method distinguishing high-level planning, mid-level procedural steps, and precise action execution. VideoGUI reveals current model limitations in complex visual tasks, marking a significant step toward improved visual planning in GUI automation. 
+- [Visual Grounding for User Interfaces](https://aclanthology.org/2024.naacl-industry.9/) + - Yijun Qian, Yujie Lu, Alexander Hauptmann, Oriana Riva + - πŸ›οΈ Institutions: CMU, UCSB + - πŸ“… Date: June 2024 + - πŸ“‘ Publisher: NAACL 2024 + - πŸ’» Env: [GUI] + - πŸ”‘ Key: [framework], [visual grounding], [UI element localization], [LVG] + - πŸ“– TLDR: This work introduces the task of visual UI grounding, which unifies detection and grounding by enabling models to identify UI elements referenced by natural language commands solely from visual input. The authors propose **LVG**, a model that outperforms baselines pre-trained on larger datasets by over 4.9 points in top-1 accuracy, demonstrating its effectiveness in localizing referenced UI elements without relying on UI metadata. + - [Large Language Models Can Self-Improve At Web Agent Tasks](https://arxiv.org/abs/2405.20309) - Ajay Patel, Markus Hofmarcher, Claudiu Leoveanu-Condrei, Marius-Constantin Dinu, Chris Callison-Burch, Sepp Hochreiter - πŸ›οΈ Institutions: University of Pennsylvania, ExtensityAI, Johannes Kepler University Linz, NXAI diff --git a/update_template_or_data/update_paper_list.md b/update_template_or_data/update_paper_list.md index 9941215..af6983c 100644 --- a/update_template_or_data/update_paper_list.md +++ b/update_template_or_data/update_paper_list.md @@ -97,15 +97,6 @@ - πŸ”‘ Key: [framework], [reinforcement learning], [RL], [self-evolving curriculum], [WebRL], [outcome-supervised reward model] - πŸ“– TLDR: This paper introduces *WebRL*, a self-evolving online curriculum reinforcement learning framework designed to train high-performance web agents using open large language models (LLMs). WebRL addresses challenges such as the scarcity of training tasks, sparse feedback signals, and policy distribution drift in online learning. It incorporates a self-evolving curriculum that generates new tasks from unsuccessful attempts, a robust outcome-supervised reward model (ORM), and adaptive reinforcement learning strategies to ensure consistent improvements. Applied to Llama-3.1 and GLM-4 models, WebRL significantly enhances their performance on web-based tasks, surpassing existing state-of-the-art web agents. -- [AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents](https://arxiv.org/abs/2410.24024) - - Yifan Xu, Xiao Liu, Xueqiao Sun, Siyi Cheng, Hao Yu, Hanyu Lai, Shudan Zhang, Dan Zhang, Jie Tang, Yuxiao Dong - - πŸ›οΈ Institutions: Tsinghua University, Peking University - - πŸ“… Date: October 31, 2024 - - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Mobile] - - πŸ”‘ Key: [framework], [dataset], [benchmark], [AndroidLab] - - πŸ“– TLDR: This paper introduces **AndroidLab**, a comprehensive framework for training and systematically benchmarking Android autonomous agents. It provides an operational environment with diverse modalities and action spaces, supporting both large language models (LLMs) and multimodal models (LMMs). The benchmark includes 138 tasks across nine apps on predefined Android virtual devices. Utilizing AndroidLab, the authors developed an Android Instruction dataset and trained six open-source LLMs and LMMs, significantly improving their average success rates. 
- - [From Context to Action: Analysis of the Impact of State Representation and Context on the Generalization of Multi-Turn Web Navigation Agents](https://arxiv.org/abs/2409.13701) - Nalin Tiwary, Vardhan Dongre, Sanil Arun Chawla, Ashwin Lamani, Dilek Hakkani-Tür - 🏛️ Institutions: UIUC @@ -115,6 +106,15 @@ - 🔑 Key: [framework], [context management], [generalization], [multi-turn navigation], [CWA] - 📖 TLDR: This study examines how different contextual elements affect the performance and generalization of Conversational Web Agents (CWAs) in multi-turn web navigation tasks. By optimizing context management—specifically interaction history and web page representation—the research demonstrates enhanced agent performance across various out-of-distribution scenarios, including unseen websites, categories, and geographic locations. +- [AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents](https://arxiv.org/abs/2410.24024) + - Yifan Xu, Xiao Liu, Xueqiao Sun, Siyi Cheng, Hao Yu, Hanyu Lai, Shudan Zhang, Dan Zhang, Jie Tang, Yuxiao Dong + - 🏛️ Institutions: Tsinghua University, Peking University + - 📅 Date: October 31, 2024 + - 📑 Publisher: arXiv + - 💻 Env: [Mobile] + - 🔑 Key: [framework], [dataset], [benchmark], [AndroidLab] + - 📖 TLDR: This paper introduces **AndroidLab**, a comprehensive framework for training and systematically benchmarking Android autonomous agents. It provides an operational environment with diverse modalities and action spaces, supporting both large language models (LLMs) and multimodal models (LMMs). The benchmark includes 138 tasks across nine apps on predefined Android virtual devices. Utilizing AndroidLab, the authors developed an Android Instruction dataset and trained six open-source LLMs and LMMs, significantly improving their average success rates. + - [Evaluating Cultural and Social Awareness of LLM Web Agents](https://arxiv.org/abs/2410.23252) - Haoyi Qiu, Alexander R. Fabbri, Divyansh Agarwal, Kung-Hsiang Huang, Sarah Tan, Nanyun Peng, Chien-Sheng Wu - 🏛️ Institutions: UCLA, Salesforce AI Research @@ -169,15 +169,6 @@ - 🔑 Key: [dataset], [framework], [synthetic data] - 📖 TLDR: The *EDGE* framework proposes an innovative approach to improve GUI understanding and interaction capabilities in vision-language models through large-scale, multi-granularity synthetic data generation. By leveraging webpage data, EDGE minimizes the need for manual annotations and enhances the adaptability of models across desktop and mobile GUI environments. Evaluations show its effectiveness in diverse GUI-related tasks, contributing significantly to autonomous agent development in GUI navigation and interaction. -- [VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks](https://doi.org/10.48550/arXiv.2410.19100) - - Lawrence Jang, Yinheng Li, Charles Ding, Justin Lin, Paul Pu Liang, Dan Zhao, Rogerio Bonatti, Kazuhito Koishida - - 🏛️ Institutions: CMU, MIT, NYU, Microsoft - - 📅 Date: October 24, 2024 - - 📑 Publisher: arXiv - - 💻 Env: [Web] - - 🔑 Key: [benchmark], [dataset], [video understanding], [long-context], [VideoWA] - - 📖 TLDR: This paper introduces **VideoWebArena (VideoWA)**, a benchmark assessing multimodal agents in video-based tasks. It features over 2,000 tasks focused on skill and factual retention, using video tutorials to simulate long-context environments.
Results highlight current challenges in agentic abilities, providing a critical testbed for long-context video understanding improvements. - - [AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant](https://arxiv.org/abs/2410.18603) - Chengyou Jia, Minnan Luo, Zhuohang Dang, Qiushi Sun, Fangzhi Xu, Junlin Hu, Tianbao Xie, Zhiyong Wu - πŸ›οΈ Institutions: XJTU, Shanghai AI Lab, HKU @@ -196,14 +187,14 @@ - πŸ”‘ Key: [API-based agent], [hybrid agent], [benchmark], [WebArena], [SOTA performance] - πŸ“– TLDR: This paper introduces API-based and hybrid agents designed to execute online tasks by accessing both APIs and traditional web browsing interfaces. In evaluations using WebArena, a benchmark for web navigation, the API-based agent achieves higher performance than browser-based agents, and the hybrid model achieves a success rate of 35.8%, setting a new state-of-the-art (SOTA) in task-agnostic web navigation. The findings highlight the efficiency and reliability gains of API interactions for web agents. -- [MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control](https://arxiv.org/abs/2410.17520) - - Juyong Lee, Dongyoon Hahm, June Suk Choi, W. Bradley Knox, Kimin Lee - - πŸ›οΈ Institutions: KAIST, UT at Austin - - πŸ“… Date: October 23, 2024 +- [VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks](https://doi.org/10.48550/arXiv.2410.19100) + - Lawrence Jang, Yinheng Li, Charles Ding, Justin Lin, Paul Pu Liang, Dan Zhao, Rogerio Bonatti, Kazuhito Koishida + - πŸ›οΈ Institutions: CMU, MIT, NYU, Microsoft + - πŸ“… Date: October 24, 2024 - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Mobile] - - πŸ”‘ Key: [benchmark], [safety], [evaluation], [Android emulator] - - πŸ“– TLDR: *MobileSafetyBench* introduces a benchmark for evaluating the safety of large language model (LLM)-based autonomous agents in mobile device control. Using Android emulators, the benchmark simulates real-world tasks in apps such as messaging and banking to assess agents' safety and helpfulness. The safety-focused tasks test for privacy risk management and robustness against adversarial prompt injections. Experiments show agents perform well in helpful tasks but struggle with safety-related challenges, underscoring the need for continued advancements in mobile safety mechanisms for autonomous agents. + - πŸ’» Env: [Web] + - πŸ”‘ Key: [benchmark], [dataset], [video understanding], [long-context], [VideoWA] + - πŸ“– TLDR: This paper introduces **VideoWebArena (VideoWA)**, a benchmark assessing multimodal agents in video-based tasks. It features over 2,000 tasks focused on skill and factual retention, using video tutorials to simulate long-context environments. Results highlight current challenges in agentic abilities, providing a critical testbed for long-context video understanding improvements. - [Lightweight Neural App Control](https://arxiv.org/abs/2410.17883) - Filippos Christianos, Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao @@ -214,6 +205,15 @@ - πŸ”‘ Key: [framework], [vision-language model], [Action Transformer], [app agent], [Android control], [multi-modal] - πŸ“– TLDR: This paper introduces LiMAC, a mobile control framework for Android that integrates an Action Transformer and fine-tuned vision-language models to execute precise actions in mobile apps. 
Tested on open-source datasets, LiMAC improves action accuracy by up to 42% over traditional prompt engineering baselines, demonstrating enhanced efficiency and accuracy in mobile app control tasks. +- [MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control](https://arxiv.org/abs/2410.17520) + - Juyong Lee, Dongyoon Hahm, June Suk Choi, W. Bradley Knox, Kimin Lee + - πŸ›οΈ Institutions: KAIST, UT at Austin + - πŸ“… Date: October 23, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [Mobile] + - πŸ”‘ Key: [benchmark], [safety], [evaluation], [Android emulator] + - πŸ“– TLDR: *MobileSafetyBench* introduces a benchmark for evaluating the safety of large language model (LLM)-based autonomous agents in mobile device control. Using Android emulators, the benchmark simulates real-world tasks in apps such as messaging and banking to assess agents' safety and helpfulness. The safety-focused tasks test for privacy risk management and robustness against adversarial prompt injections. Experiments show agents perform well in helpful tasks but struggle with safety-related challenges, underscoring the need for continued advancements in mobile safety mechanisms for autonomous agents. + - [Large Language Models Empowered Personalized Web Agents](https://ar5iv.org/abs/2410.17236) - Hongru Cai, Yongqi Li, Wenjie Wang, Fengbin Zhu, Xiaoyu Shen, Wenjie Li, Tat-Seng Chua - πŸ›οΈ Institutions: HK PolyU, NTU Singapore @@ -268,6 +268,15 @@ - πŸ”‘ Key: [framework], [autonomous GUI interaction], [experience-augmented hierarchical planning] - πŸ“– TLDR: This paper introduces Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI). The system addresses key challenges in automating computer tasks through experience-augmented hierarchical planning and an Agent-Computer Interface (ACI). Agent S demonstrates significant improvements over baselines on the OSWorld benchmark, achieving a 20.58% success rate (83.6% relative improvement). The framework shows generalizability across different operating systems and provides insights for developing more effective GUI agents. +- [ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents](https://sites.google.com/view/st-webagentbench/home) + - Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov + - πŸ›οΈ Institutions: IBM Research + - πŸ“… Date: October 9, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [Web] + - πŸ”‘ Key: [benchmark], [safety], [trustworthiness], [ST-WebAgentBench] + - πŸ“– TLDR: This paper introduces **ST-WebAgentBench**, a benchmark designed to evaluate the safety and trustworthiness of web agents in enterprise contexts. It defines safe and trustworthy agent behavior, outlines the structure of safety policies, and introduces the "Completion under Policies" metric to assess agent performance. The study reveals that current state-of-the-art agents struggle with policy adherence, highlighting the need for improved policy awareness and compliance in web agents. 
+ - [TinyClick: Single-Turn Agent for Empowering GUI Automation](https://arxiv.org/abs/2410.11871) - Pawel Pawlowski, Krystian Zawistowski, Wojciech Lapacz, Marcin Skorupa, Adam Wiacek, Sebastien Postansque, Jakub Hoscilowicz - πŸ›οΈ Institutions: Samsung R&D Poland, Warsaw University of Technology @@ -286,15 +295,6 @@ - πŸ”‘ Key: [framework], [model], [SeeClick], [AITW benchmark] - πŸ“– TLDR: The paper introduces *ClickAgent*, a framework that enhances autonomous agents' interaction with mobile UIs by improving their ability to locate interface elements accurately. This is achieved through a dual-component system where an MLLM performs reasoning and action planning, while a dedicated UI location model (e.g., SeeClick) handles element identification. ClickAgent, evaluated on the AITW benchmark and tested on both emulators and real Android devices, surpasses other agents like CogAgent and AppAgent in task success rate, advancing automation reliability on mobile platforms. -- [ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents](https://sites.google.com/view/st-webagentbench/home) - - Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov - - πŸ›οΈ Institutions: IBM Research - - πŸ“… Date: October 9, 2024 - - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Web] - - πŸ”‘ Key: [benchmark], [safety], [trustworthiness], [ST-WebAgentBench] - - πŸ“– TLDR: This paper introduces **ST-WebAgentBench**, a benchmark designed to evaluate the safety and trustworthiness of web agents in enterprise contexts. It defines safe and trustworthy agent behavior, outlines the structure of safety policies, and introduces the "Completion under Policies" metric to assess agent performance. The study reveals that current state-of-the-art agents struggle with policy adherence, highlighting the need for improved policy awareness and compliance in web agents. - - [Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents](https://osu-nlp-group.github.io/UGround/) - Boyu Gou, Ruochen Wang, Boyuan Zheng, Yucheng Xie, Cheng Chang, Yiheng Shu, Haotian Sun, Yu Su - πŸ›οΈ Institutions: OSU, Orby AI @@ -448,6 +448,24 @@ - πŸ”‘ Key: [framework], [AppAgent v2] - πŸ“– TLDR: This work presents *AppAgent v2*, a novel LLM-based multimodal agent framework for mobile devices capable of navigating applications by emulating human-like interactions such as tapping and swiping. The agent constructs a flexible action space that enhances adaptability across various applications, including parsing text and vision descriptions. It operates through two main phases: exploration and deployment, utilizing retrieval-augmented generation (RAG) technology to efficiently retrieve and update information from a knowledge base, thereby empowering the agent to perform tasks effectively and accurately. +- [OmniParser for Pure Vision Based GUI Agent](https://microsoft.github.io/OmniParser/) + - Yadong Lu, Jianwei Yang, Yelong Shen, Ahmed Awadallah + - πŸ›οΈ Institutions: Microsoft Research, Microsoft Gen AI + - πŸ“… Date: August 1, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [GUI] + - πŸ”‘ Key: [framework], [dataset], [OmniParser] + - πŸ“– TLDR: This paper introduces **OmniParser**, a method for parsing user interface screenshots into structured elements, enhancing the ability of models like GPT-4V to generate actions accurately grounded in corresponding UI regions. 
The authors curated datasets for interactable icon detection and icon description, fine-tuning models to parse interactable regions and extract functional semantics of UI elements. + +- [CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation](https://aclanthology.org/2024.findings-acl.539) + - Xinbei Ma, Zhuosheng Zhang, Hai Zhao + - 🏛️ Institutions: SJTU + - 📅 Date: August 2024 + - 📑 Publisher: ACL 2024 + - 💻 Env: [Mobile] + - 🔑 Key: [model], [framework], [benchmark] + - 📖 TLDR: This paper presents CoCo-Agent, a multimodal large language model (MLLM) designed for smartphone GUI automation. It introduces two novel approaches: Comprehensive Environment Perception (CEP) for enhanced GUI understanding, and Conditional Action Prediction (CAP) to improve action response accuracy. The proposed agent achieves state-of-the-art performance on GUI automation benchmarks such as AITW and META-GUI, showcasing its capabilities in realistic scenarios. + - [Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents](https://arxiv.org/abs/2408.07199) - [Author information not available] - 🏛️ Institutions: MultiOn, Stanford @@ -466,24 +484,6 @@ - 🔑 Key: [multimodal agents], [environmental distractions], [robustness] - 📖 TLDR: This paper highlights the vulnerability of multimodal agents to environmental distractions. The researchers demonstrate that these agents, which process multiple types of input (e.g., text, images, audio), can be significantly impacted by irrelevant or misleading environmental cues. The study provides insights into the limitations of current multimodal systems and emphasizes the need for more robust architectures that can filter out distractions and maintain focus on relevant information in complex, real-world environments. -- [CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation](https://aclanthology.org/2024.findings-acl.539) - - Xinbei Ma, Zhuosheng Zhang, Hai Zhao - - 🏛️ Institutions: SJTU - - 📅 Date: August 2024 - - 📑 Publisher: ACL 2024 - - 💻 Env: [Mobile] - - 🔑 Key: [model], [framework], [benchmark] - - 📖 TLDR: This paper presents CoCo-Agent, a multimodal large language model (MLLM) designed for smartphone GUI automation. It introduces two novel approaches: Comprehensive Environment Perception (CEP) for enhanced GUI understanding, and Conditional Action Prediction (CAP) to improve action response accuracy. The proposed agent achieves state-of-the-art performance on GUI automation benchmarks such as AITW and META-GUI, showcasing its capabilities in realistic scenarios:contentReference[oaicite:0]{index=0}:contentReference[oaicite:1]{index=1}:contentReference[oaicite:2]{index=2}:contentReference[oaicite:3]{index=3}. - -- [OmniParser for Pure Vision Based GUI Agent](https://microsoft.github.io/OmniParser/) - - Yadong Lu, Jianwei Yang, Yelong Shen, Ahmed Awadallah - - 🏛️ Institutions: Microsoft Research, Microsoft Gen AI - - 📅 Date: August 1, 2024 - - 📑 Publisher: arXiv - - 💻 Env: [GUI] - - 🔑 Key: [framework], [dataset], [OmniParser] - - 📖 TLDR: This paper introduces **OmniParser**, a method for parsing user interface screenshots into structured elements, enhancing the ability of models like GPT-4V to generate actions accurately grounded in corresponding UI regions.
The authors curated datasets for interactable icon detection and icon description, fine-tuning models to parse interactable regions and extract functional semantics of UI elements. - - [OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation](https://arxiv.org/abs/2407.19056) - Zilong Wang, Yuedong Cui, Li Zhong, Zimin Zhang, Da Yin, Bill Yuchen Lin, Jingbo Shang - πŸ›οΈ Institutions: UCSD, UCLA, AI2 @@ -547,6 +547,15 @@ - πŸ”‘ Key: [framework], [tool formulation], [multi-agent collaboration], [MobileExperts] - πŸ“– TLDR: This paper introduces *MobileExperts*, a framework that enhances autonomous operations on mobile devices by dynamically assembling agent teams based on user requirements. Each agent independently explores and formulates tools to evolve into an expert, improving efficiency and reducing reasoning costs. +- [AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents](https://yuxiangchai.github.io/AMEX/) + - Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Dingyu Zhang, Peng Gao, Shuai Ren, Hongsheng Li + - πŸ›οΈ Institutions: CUHK, SJTU, Shanghai AI Lab, vivo AI Lab + - πŸ“… Date: July 3, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [Mobile] + - πŸ”‘ Key: [dataset], [benchmark], [AMEX] + - πŸ“– TLDR: This paper introduces the **Android Multi-annotation EXpo (AMEX)**, a comprehensive dataset designed for training and evaluating mobile GUI-control agents. AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, annotated at multiple levels, including GUI interactive element grounding, functionality descriptions, and complex natural language instructions. The dataset aims to advance research on AI agents capable of completing complex tasks by interacting directly with mobile device GUIs. + - [CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents](https://arxiv.org/abs/2407.01511) - Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Philip Torr, Bernard Ghanem, Guohao Li - πŸ›οΈ Institutions: KAUST, UTokyo, CMU, Stanford, Harvard, Tsinghua, SUSTech, Oxford @@ -565,15 +574,6 @@ - πŸ”‘ Key: [framework], [VisionDroid] - πŸ“– TLDR: The paper presents **VisionDroid**, a vision-driven automated GUI testing approach utilizing Multimodal Large Language Models (MLLM) to detect non-crash functional bugs in mobile applications. By extracting GUI text information and aligning it with screenshots, VisionDroid enables MLLM to understand GUI context, facilitating deeper and function-oriented exploration. The approach segments exploration history into logically cohesive parts, prompting MLLM for bug detection, demonstrating superior performance over existing methods. -- [AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents](https://yuxiangchai.github.io/AMEX/) - - Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Dingyu Zhang, Peng Gao, Shuai Ren, Hongsheng Li - - πŸ›οΈ Institutions: CUHK, SJTU, Shanghai AI Lab, vivo AI Lab - - πŸ“… Date: July 3, 2024 - - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Mobile] - - πŸ”‘ Key: [dataset], [benchmark], [AMEX] - - πŸ“– TLDR: This paper introduces the **Android Multi-annotation EXpo (AMEX)**, a comprehensive dataset designed for training and evaluating mobile GUI-control agents. 
AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, annotated at multiple levels, including GUI interactive element grounding, functionality descriptions, and complex natural language instructions. The dataset aims to advance research on AI agents capable of completing complex tasks by interacting directly with mobile device GUIs. - - [Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding](https://screen-point-and-read.github.io/) - Yue Fan, Lei Ding, Ching-Chen Kuo, Shan Jiang, Yang Zhao, Xinze Guan, Jie Yang, Yi Zhang, Xin Eric Wang - πŸ›οΈ Institutions: UCSC, Microsoft Research @@ -718,15 +718,6 @@ - πŸ”‘ Key: [framework], [multi-agent], [planning], [decision-making], [reflection] - πŸ“– TLDR: The paper presents **Mobile-Agent-v2**, a multi-agent architecture designed to assist with mobile device operations. It comprises three agents: a planning agent that generates task progress, a decision agent that navigates tasks using a memory unit, and a reflection agent that corrects erroneous operations. This collaborative approach addresses challenges in navigation and long-context input scenarios, achieving over a 30% improvement in task completion compared to single-agent architectures. -- [Visual Grounding for User Interfaces](https://aclanthology.org/2024.naacl-industry.9/) - - Yijun Qian, Yujie Lu, Alexander Hauptmann, Oriana Riva - - πŸ›οΈ Institutions: CMU, UCSB - - πŸ“… Date: June 2024 - - πŸ“‘ Publisher: NAACL 2024 - - πŸ’» Env: [GUI] - - πŸ”‘ Key: [framework], [visual grounding], [UI element localization], [LVG] - - πŸ“– TLDR: This work introduces the task of visual UI grounding, which unifies detection and grounding by enabling models to identify UI elements referenced by natural language commands solely from visual input. The authors propose **LVG**, a model that outperforms baselines pre-trained on larger datasets by over 4.9 points in top-1 accuracy, demonstrating its effectiveness in localizing referenced UI elements without relying on UI metadata. - - [WebSuite: Systematically Evaluating Why Web Agents Fail](https://arxiv.org/abs/2406.01623) - Eric Li, Jim Waldo - πŸ›οΈ Institutions: Harvard @@ -745,6 +736,15 @@ - πŸ”‘ Key: [benchmark], [instructional videos], [visual planning], [hierarchical task decomposition], [complex software interaction] - πŸ“– TLDR: VideoGUI presents a benchmark for evaluating GUI automation on tasks derived from instructional videos, focusing on visually intensive applications like Adobe Photoshop and video editing software. The benchmark includes 178 tasks, with a hierarchical evaluation method distinguishing high-level planning, mid-level procedural steps, and precise action execution. VideoGUI reveals current model limitations in complex visual tasks, marking a significant step toward improved visual planning in GUI automation. +- [Visual Grounding for User Interfaces](https://aclanthology.org/2024.naacl-industry.9/) + - Yijun Qian, Yujie Lu, Alexander Hauptmann, Oriana Riva + - πŸ›οΈ Institutions: CMU, UCSB + - πŸ“… Date: June 2024 + - πŸ“‘ Publisher: NAACL 2024 + - πŸ’» Env: [GUI] + - πŸ”‘ Key: [framework], [visual grounding], [UI element localization], [LVG] + - πŸ“– TLDR: This work introduces the task of visual UI grounding, which unifies detection and grounding by enabling models to identify UI elements referenced by natural language commands solely from visual input. 
The authors propose **LVG**, a model that outperforms baselines pre-trained on larger datasets by over 4.9 points in top-1 accuracy, demonstrating its effectiveness in localizing referenced UI elements without relying on UI metadata. + - [Large Language Models Can Self-Improve At Web Agent Tasks](https://arxiv.org/abs/2405.20309) - Ajay Patel, Markus Hofmarcher, Claudiu Leoveanu-Condrei, Marius-Constantin Dinu, Chris Callison-Burch, Sepp Hochreiter - πŸ›οΈ Institutions: University of Pennsylvania, ExtensityAI, Johannes Kepler University Linz, NXAI