diff --git a/README.md b/README.md index 202cf4f..5d94895 100644 --- a/README.md +++ b/README.md @@ -25,7 +25,7 @@ This repo covers a variety of papers related to GUI Agents, such as: | [Web](paper_by_env/paper_web.md) | [Mobile](paper_by_env/paper_mobile.md) | [Desktop](paper_by_env/paper_desktop.md) | [GUI](paper_by_env/paper_gui.md) | [Misc](paper_by_env/paper_misc.md) | |--------------------------------|---------------------------------------|------------------------------------------|----------------------------------|------------------------------------| -(Misc: Papers for general topics that have important applications in GUI agents) +(Misc: Papers that do not specifically focus on GUIs but are relevant to GUI agents.) ## Papers Grouped by Keywords [Model](paper_by_key/paper_model.md) | [Framework](paper_by_key/paper_framework.md) | [Dataset](paper_by_key/paper_dataset.md) | [Benchmark](paper_by_key/paper_benchmark.md) | [Safety](paper_by_key/paper_safety.md) | [Survey](paper_by_key/paper_survey.md) | @@ -38,6 +38,15 @@ This repo covers a variety of papers related to GUI Agents, such as:
Papers +- [GUI Agents: A Survey](https://arxiv.org/pdf/2412.13501) + - Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, Xintong Li, Jing Shi, Hongjie Chen, Viet Dac Lai, Zhouhang Xie, Sungchul Kim, Ruiyi Zhang, Tong Yu, Mehrab Tanjim, Nesreen K. Ahmed, Puneet Mathur, Seunghyun Yoon, Lina Yao, Branislav Kveton, Thien Huu Nguyen, Trung Bui, Tianyi Zhou, Ryan A. Rossi, Franck Dernoncourt + - πŸ›οΈ Institutions: University of Maryland, SUNY Buffalo, Univ. of Oregon, Adobe Research, Meta AI, Univ. of Rochester, UC San Diego, Carnegie Mellon Univ., Dolby Labs, Intel AI Research, UNSW + - πŸ“… Date: December 18, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [GUI] + - πŸ”‘ Key: [survey] + - πŸ“– TLDR: This survey provides a comprehensive overview of GUI agents powered by Large Foundation Models, detailing their benchmarks, evaluation metrics, architectures, and training methods. It introduces a unified framework outlining their perception, reasoning, planning, and acting capabilities, identifies open challenges, and discusses future research directions, serving as a resource for both practitioners and researchers in the field. + - [Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents](https://arxiv.org/abs/2412.13194) - Yifei Zhou, Qianlan Yang, Kaixiang Lin, Min Bai, Xiong Zhou, Yu-Xiong Wang, Sergey Levine, Erran Li - πŸ›οΈ Institutions: UCB, UIUC, Amazon @@ -58,7 +67,7 @@ This repo covers a variety of papers related to GUI Agents, such as: - [The BrowserGym Ecosystem for Web Agent Research](https://arxiv.org/abs/2412.05467) - Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, LΓ©o Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han LΓΉ, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, Alexandre Lacoste - - πŸ›οΈ Institutions: ServiceNow Research, Mila, Polytechnique MontrΓ©al, CMU, McGill University, Tel Aviv University, UniversitΓ© de MontrΓ©al, iMean AI + - πŸ›οΈ Institutions: ServiceNow Research, Mila, Polytechnique MontrΓ©al, CMU, McGill University, Tel Aviv University, UniversitΓ© de MontrΓ©al, iMean AI - πŸ“… Date: December 6, 2024 - πŸ“‘ Publisher: arXiv - πŸ’» Env: [Web] @@ -164,15 +173,6 @@ This repo covers a variety of papers related to GUI Agents, such as: - πŸ”‘ Key: [attack], [adversarial pop-ups], [VLM agents], [safety] - πŸ“– TLDR: This paper demonstrates that vision-language model (VLM) agents can be easily deceived by carefully designed adversarial pop-ups, leading them to perform unintended actions such as clicking on these pop-ups instead of completing their assigned tasks. Integrating these pop-ups into environments like OSWorld and VisualWebArena resulted in an average attack success rate of 86% and a 47% decrease in task success rate. Basic defense strategies, such as instructing the agent to ignore pop-ups or adding advertisement notices, were found to be ineffective against these attacks.
-- [AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents](https://arxiv.org/abs/2410.24024) - - Yifan Xu, Xiao Liu, Xueqiao Sun, Siyi Cheng, Hao Yu, Hanyu Lai, Shudan Zhang, Dan Zhang, Jie Tang, Yuxiao Dong - - πŸ›οΈ Institutions: Tsinghua University, Peking University - - πŸ“… Date: October 31, 2024 - - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Mobile] - - πŸ”‘ Key: [framework], [dataset], [benchmark], [AndroidLab] - - πŸ“– TLDR: This paper introduces **AndroidLab**, a comprehensive framework for training and systematically benchmarking Android autonomous agents. It provides an operational environment with diverse modalities and action spaces, supporting both large language models (LLMs) and multimodal models (LMMs). The benchmark includes 138 tasks across nine apps on predefined Android virtual devices. Utilizing AndroidLab, the authors developed an Android Instruction dataset and trained six open-source LLMs and LMMs, significantly improving their average success rates. - - [From Context to Action: Analysis of the Impact of State Representation and Context on the Generalization of Multi-Turn Web Navigation Agents](https://arxiv.org/abs/2409.13701) - Nalin Tiwary, Vardhan Dongre, Sanil Arun Chawla, Ashwin Lamani, Dilek Hakkani-TΓΌr - πŸ›οΈ Institutions: UIUC @@ -182,14 +182,14 @@ This repo covers a variety of papers related to GUI Agents, such as: - πŸ”‘ Key: [framework], [context management], [generalization], [multi-turn navigation], [CWA] - πŸ“– TLDR: This study examines how different contextual elements affect the performance and generalization of Conversational Web Agents (CWAs) in multi-turn web navigation tasks. By optimizing context managementβ€”specifically interaction history and web page representationβ€”the research demonstrates enhanced agent performance across various out-of-distribution scenarios, including unseen websites, categories, and geographic locations. -- [Evaluating Cultural and Social Awareness of LLM Web Agents](https://arxiv.org/abs/2410.23252) - - Haoyi Qiu, Alexander R. Fabbri, Divyansh Agarwal, Kung-Hsiang Huang, Sarah Tan, Nanyun Peng, Chien-Sheng Wu - - πŸ›οΈ Institutions: UCLA, Salesforce AI Research - - πŸ“… Date: October 30, 2024 +- [AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents](https://arxiv.org/abs/2410.24024) + - Yifan Xu, Xiao Liu, Xueqiao Sun, Siyi Cheng, Hao Yu, Hanyu Lai, Shudan Zhang, Dan Zhang, Jie Tang, Yuxiao Dong + - πŸ›οΈ Institutions: Tsinghua University, Peking University + - πŸ“… Date: October 31, 2024 - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Web] - - πŸ”‘ Key: [benchmark], [CASA], [cultural awareness], [social awareness], [fine-tuning], [prompting] - - πŸ“– TLDR: This paper introduces CASA, a benchmark designed to assess the cultural and social awareness of LLM web agents in tasks like online shopping and social discussion forums. It evaluates agents' abilities to detect and appropriately respond to norm-violating user queries and observations. The study finds that current LLM agents have limited cultural and social awareness, with less than 10% awareness coverage and over 40% violation rates. To enhance performance, the authors explore prompting and fine-tuning methods, demonstrating that combining both can offer complementary advantages. + - πŸ’» Env: [Mobile] + - πŸ”‘ Key: [framework], [dataset], [benchmark], [AndroidLab] + - πŸ“– TLDR: This paper introduces **AndroidLab**, a comprehensive framework for training and systematically benchmarking Android autonomous agents. 
It provides an operational environment with diverse modalities and action spaces, supporting both large language models (LLMs) and multimodal models (LMMs). The benchmark includes 138 tasks across nine apps on predefined Android virtual devices. Utilizing AndroidLab, the authors developed an Android Instruction dataset and trained six open-source LLMs and LMMs, significantly improving their average success rates. - [OS-ATLAS: A Foundation Action Model For Generalist GUI Agents](https://osatlas.github.io/) - Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao @@ -200,6 +200,15 @@ This repo covers a variety of papers related to GUI Agents, such as: - πŸ”‘ Key: [model], [dataset], [benchmark], [OS-Atlas] - πŸ“– TLDR: This paper introduces OS-Atlas, a foundational GUI action model designed to enhance GUI grounding and out-of-distribution tasks. The authors developed a toolkit to synthesize multi-platform GUI grounding data, resulting in a cross-platform corpus of over 13 million GUI elements. OS-Atlas demonstrates significant performance improvements across six benchmarks spanning mobile, desktop, and web platforms. +- [Evaluating Cultural and Social Awareness of LLM Web Agents](https://arxiv.org/abs/2410.23252) + - Haoyi Qiu, Alexander R. Fabbri, Divyansh Agarwal, Kung-Hsiang Huang, Sarah Tan, Nanyun Peng, Chien-Sheng Wu + - πŸ›οΈ Institutions: UCLA, Salesforce AI Research + - πŸ“… Date: October 30, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [Web] + - πŸ”‘ Key: [benchmark], [CASA], [cultural awareness], [social awareness], [fine-tuning], [prompting] + - πŸ“– TLDR: This paper introduces CASA, a benchmark designed to assess the cultural and social awareness of LLM web agents in tasks like online shopping and social discussion forums. It evaluates agents' abilities to detect and appropriately respond to norm-violating user queries and observations. The study finds that current LLM agents have limited cultural and social awareness, with less than 10% awareness coverage and over 40% violation rates. To enhance performance, the authors explore prompting and fine-tuning methods, demonstrating that combining both can offer complementary advantages. + - [Auto-Intent: Automated Intent Discovery and Self-Exploration for Large Language Model Web Agents](https://arxiv.org/abs/2410.22552) - Jaekyeom Kim, Dong-Ki Kim, Lajanugen Logeswaran, Sungryull Sohn, Honglak Lee - πŸ›οΈ Institutions: LG AI Research, Field AI, University of Michigan @@ -236,14 +245,14 @@ This repo covers a variety of papers related to GUI Agents, such as: - πŸ”‘ Key: [framework], [model], [learning], [AutoGLM] - πŸ“– TLDR: This paper introduces AutoGLM, a new series in the ChatGLM family, designed as foundation agents for autonomous control of digital devices through GUIs. It addresses the challenges foundation models face in decision-making within dynamic environments by developing agents capable of learning through autonomous interactions. Focusing on web browsers and Android devices, AutoGLM integrates various techniques to create deployable agent systems. Key insights include the importance of designing an appropriate "intermediate interface" for GUI control and a novel progressive training framework for self-evolving online curriculum reinforcement learning. Evaluations demonstrate AutoGLM's effectiveness across multiple domains, achieving notable success rates in web browsing and Android device control tasks. 
-- [AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant](https://arxiv.org/abs/2410.18603) - - Chengyou Jia, Minnan Luo, Zhuohang Dang, Qiushi Sun, Fangzhi Xu, Junlin Hu, Tianbao Xie, Zhiyong Wu - - πŸ›οΈ Institutions: XJTU, Shanghai AI Lab, HKU +- [VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks](https://doi.org/10.48550/arXiv.2410.19100) + - Lawrence Jang, Yinheng Li, Charles Ding, Justin Lin, Paul Pu Liang, Dan Zhao, Rogerio Bonatti, Kazuhito Koishida + - πŸ›οΈ Institutions: CMU, MIT, NYU, Microsoft - πŸ“… Date: October 24, 2024 - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [GUI] - - πŸ”‘ Key: [framework], [multi-agent systems], [specialized generalist agent], [OSWorld benchmark] - - πŸ“– TLDR: AgentStore introduces a scalable platform to integrate and manage heterogeneous agents, designed to enhance generalist assistant capabilities for diverse computer tasks. Using a MetaAgent and AgentToken strategy, AgentStore shows improved generalization on the OSWorld benchmark. + - πŸ’» Env: [Web] + - πŸ”‘ Key: [benchmark], [dataset], [video understanding], [long-context], [VideoWA] + - πŸ“– TLDR: This paper introduces **VideoWebArena (VideoWA)**, a benchmark assessing multimodal agents in video-based tasks. It features over 2,000 tasks focused on skill and factual retention, using video tutorials to simulate long-context environments. Results highlight current challenges in agentic abilities, providing a critical testbed for long-context video understanding improvements. - [Beyond Browsing: API-Based Web Agents](https://arxiv.org/pdf/2410.16464) - Yueqi Song, Frank Xu, Shuyan Zhou, Graham Neubig @@ -254,23 +263,14 @@ This repo covers a variety of papers related to GUI Agents, such as: - πŸ”‘ Key: [API-based agent], [hybrid agent], [benchmark], [WebArena], [SOTA performance] - πŸ“– TLDR: This paper introduces API-based and hybrid agents designed to execute online tasks by accessing both APIs and traditional web browsing interfaces. In evaluations using WebArena, a benchmark for web navigation, the API-based agent achieves higher performance than browser-based agents, and the hybrid model achieves a success rate of 35.8%, setting a new state-of-the-art (SOTA) in task-agnostic web navigation. The findings highlight the efficiency and reliability gains of API interactions for web agents. -- [VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks](https://doi.org/10.48550/arXiv.2410.19100) - - Lawrence Jang, Yinheng Li, Charles Ding, Justin Lin, Paul Pu Liang, Dan Zhao, Rogerio Bonatti, Kazuhito Koishida - - πŸ›οΈ Institutions: CMU, MIT, NYU, Microsoft +- [AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant](https://arxiv.org/abs/2410.18603) + - Chengyou Jia, Minnan Luo, Zhuohang Dang, Qiushi Sun, Fangzhi Xu, Junlin Hu, Tianbao Xie, Zhiyong Wu + - πŸ›οΈ Institutions: XJTU, Shanghai AI Lab, HKU - πŸ“… Date: October 24, 2024 - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Web] - - πŸ”‘ Key: [benchmark], [dataset], [video understanding], [long-context], [VideoWA] - - πŸ“– TLDR: This paper introduces **VideoWebArena (VideoWA)**, a benchmark assessing multimodal agents in video-based tasks. It features over 2,000 tasks focused on skill and factual retention, using video tutorials to simulate long-context environments. 
Results highlight current challenges in agentic abilities, providing a critical testbed for long-context video understanding improvements. - -- [Lightweight Neural App Control](https://arxiv.org/abs/2410.17883) - - Filippos Christianos, Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao - - πŸ›οΈ Institutions: Huawei Noah's Ark Lab, UCL - - πŸ“… Date: October 23, 2024 - - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Mobile] - - πŸ”‘ Key: [framework], [vision language model], [Action Transformer], [app agent], [Android control], [multi-modal] - - πŸ“– TLDR: This paper introduces LiMAC, a mobile control framework for Android that integrates an Action Transformer and fine-tuned vision-language models to execute precise actions in mobile apps. Tested on open-source datasets, LiMAC improves action accuracy by up to 42% over traditional prompt engineering baselines, demonstrating enhanced efficiency and accuracy in mobile app control tasks. + - πŸ’» Env: [GUI] + - πŸ”‘ Key: [framework], [multi-agent systems], [specialized generalist agent], [OSWorld benchmark] + - πŸ“– TLDR: AgentStore introduces a scalable platform to integrate and manage heterogeneous agents, designed to enhance generalist assistant capabilities for diverse computer tasks. Using a MetaAgent and AgentToken strategy, AgentStore shows improved generalization on the OSWorld benchmark. - [MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control](https://arxiv.org/abs/2410.17520) - Juyong Lee, Dongyoon Hahm, June Suk Choi, W. Bradley Knox, Kimin Lee @@ -281,6 +281,15 @@ This repo covers a variety of papers related to GUI Agents, such as: - πŸ”‘ Key: [benchmark], [safety], [evaluation], [Android emulator] - πŸ“– TLDR: *MobileSafetyBench* introduces a benchmark for evaluating the safety of large language model (LLM)-based autonomous agents in mobile device control. Using Android emulators, the benchmark simulates real-world tasks in apps such as messaging and banking to assess agents' safety and helpfulness. The safety-focused tasks test for privacy risk management and robustness against adversarial prompt injections. Experiments show agents perform well in helpful tasks but struggle with safety-related challenges, underscoring the need for continued advancements in mobile safety mechanisms for autonomous agents. +- [Lightweight Neural App Control](https://arxiv.org/abs/2410.17883) + - Filippos Christianos, Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao + - πŸ›οΈ Institutions: Huawei Noah's Ark Lab, UCL + - πŸ“… Date: October 23, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [Mobile] + - πŸ”‘ Key: [framework], [vision language model], [Action Transformer], [app agent], [Android control], [multi-modal] + - πŸ“– TLDR: This paper introduces LiMAC, a mobile control framework for Android that integrates an Action Transformer and fine-tuned vision-language models to execute precise actions in mobile apps. Tested on open-source datasets, LiMAC improves action accuracy by up to 42% over traditional prompt engineering baselines, demonstrating enhanced efficiency and accuracy in mobile app control tasks. 
+ - [Large Language Models Empowered Personalized Web Agents](https://ar5iv.org/abs/2410.17236) - Hongru Cai, Yongqi Li, Wenjie Wang, Fengbin Zhu, Xiaoyu Shen, Wenjie Li, Tat-Seng Chua - πŸ›οΈ Institutions: HK PolyU, NTU Singapore @@ -290,15 +299,6 @@ This repo covers a variety of papers related to GUI Agents, such as: - πŸ”‘ Key: [framework], [benchmark], [personalized web agent], [user behavior alignment], [memory-enhanced alignment] - πŸ“– TLDR: This paper proposes a novel framework, *Personalized User Memory-enhanced Alignment (PUMA)*, enabling large language models to serve as personalized web agents by incorporating user-specific data and historical web interactions. The authors also introduce a benchmark, *PersonalWAB*, to evaluate these agents on various personalized web tasks. Results show that PUMA improves web agent performance by optimizing action execution based on user-specific preferences. -- [Dissecting Adversarial Robustness of Multimodal LM Agents](https://openreview.net/forum?id=LjVIGva5Ct) - - Chen Henry Wu, Rishi Rajesh Shah, Jing Yu Koh, Russ Salakhutdinov, Daniel Fried, Aditi Raghunathan - - πŸ›οΈ Institutions: CMU, Stanford - - πŸ“… Date: October 21, 2024 - - πŸ“‘ Publisher: NeurIPS 2024 Workshop - - πŸ’» Env: [Web] - - πŸ”‘ Key: [dataset], [attack], [ARE], [safety] - - πŸ“– TLDR: This paper introduces the Agent Robustness Evaluation (ARE) framework to assess the adversarial robustness of multimodal language model agents in web environments. By creating 200 targeted adversarial tasks within VisualWebArena, the study reveals that minimal perturbations can significantly compromise agent performance, even in advanced systems utilizing reflection and tree-search mechanisms. The findings highlight the need for enhanced safety measures in deploying such agents. - - [AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?](https://arxiv.org/abs/2407.15711) - Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, Jonathan Berant - πŸ›οΈ Institutions: Tel Aviv University @@ -308,6 +308,15 @@ This repo covers a variety of papers related to GUI Agents, such as: - πŸ”‘ Key: [benchmark], [dataset], [planning and reasoning] - πŸ“– TLDR: AssistantBench is a benchmark designed to test the abilities of web agents in completing time-intensive, realistic web-based tasks. Covering 214 tasks across various domains, the benchmark introduces the SPA (See-Plan-Act) framework to handle multi-step planning and memory retention. AssistantBench emphasizes realistic task completion, showing that current agents achieve only modest success, with significant improvements needed for complex information synthesis and execution across multiple web domains. +- [Dissecting Adversarial Robustness of Multimodal LM Agents](https://openreview.net/forum?id=LjVIGva5Ct) + - Chen Henry Wu, Rishi Rajesh Shah, Jing Yu Koh, Russ Salakhutdinov, Daniel Fried, Aditi Raghunathan + - πŸ›οΈ Institutions: CMU, Stanford + - πŸ“… Date: October 21, 2024 + - πŸ“‘ Publisher: NeurIPS 2024 Workshop + - πŸ’» Env: [Web] + - πŸ”‘ Key: [dataset], [attack], [ARE], [safety] + - πŸ“– TLDR: This paper introduces the Agent Robustness Evaluation (ARE) framework to assess the adversarial robustness of multimodal language model agents in web environments. By creating 200 targeted adversarial tasks within VisualWebArena, the study reveals that minimal perturbations can significantly compromise agent performance, even in advanced systems utilizing reflection and tree-search mechanisms. 
The findings highlight the need for enhanced safety measures in deploying such agents. + - [SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation](https://ar5iv.org/abs/2410.15164) - Jingxuan Chen, Derek Yuen, Bin Xie, Yuhao Yang, Gongwei Chen, Zhihao Wu, Li Yixing, Xurui Zhou, Weiwen Liu, Shuai Wang, Rui Shao, Liqiang Nie, Yasheng Wang, Jianye Hao, Jun Wang, Kun Shao - πŸ›οΈ Institutions: Huawei Noah’s Ark Lab, Harbin Institute of Technology, Shenzhen, UCL @@ -353,15 +362,6 @@ This repo covers a variety of papers related to GUI Agents, such as: - πŸ”‘ Key: [framework], [autonomous GUI interaction], [experience-augmented hierarchical planning] - πŸ“– TLDR: This paper introduces Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI). The system addresses key challenges in automating computer tasks through experience-augmented hierarchical planning and an Agent-Computer Interface (ACI). Agent S demonstrates significant improvements over baselines on the OSWorld benchmark, achieving a 20.58% success rate (83.6% relative improvement). The framework shows generalizability across different operating systems and provides insights for developing more effective GUI agents. -- [ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents](https://sites.google.com/view/st-webagentbench/home) - - Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov - - πŸ›οΈ Institutions: IBM Research - - πŸ“… Date: October 9, 2024 - - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Web] - - πŸ”‘ Key: [benchmark], [safety], [trustworthiness], [ST-WebAgentBench] - - πŸ“– TLDR: This paper introduces **ST-WebAgentBench**, a benchmark designed to evaluate the safety and trustworthiness of web agents in enterprise contexts. It defines safe and trustworthy agent behavior, outlines the structure of safety policies, and introduces the "Completion under Policies" metric to assess agent performance. The study reveals that current state-of-the-art agents struggle with policy adherence, highlighting the need for improved policy awareness and compliance in web agents. - - [ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents](https://arxiv.org/abs/2410.11872) - Jakub Hoscilowicz, Bartosz Maj, Bartosz Kozakiewicz, Oleksii Tymoschuk, Artur Janicki - πŸ›οΈ Institutions: Samsung R&D Poland, Warsaw University of Technology @@ -371,6 +371,15 @@ This repo covers a variety of papers related to GUI Agents, such as: - πŸ”‘ Key: [framework], [model], [SeeClick], [AITW benchmark] - πŸ“– TLDR: The paper introduces *ClickAgent*, a framework that enhances autonomous agents' interaction with mobile UIs by improving their ability to locate interface elements accurately. This is achieved through a dual-component system where an MLLM performs reasoning and action planning, while a dedicated UI location model (e.g., SeeClick) handles element identification. ClickAgent, evaluated on the AITW benchmark and tested on both emulators and real Android devices, surpasses other agents like CogAgent and AppAgent in task success rate, advancing automation reliability on mobile platforms. 
+- [ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents](https://sites.google.com/view/st-webagentbench/home) + - Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov + - πŸ›οΈ Institutions: IBM Research + - πŸ“… Date: October 9, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [Web] + - πŸ”‘ Key: [benchmark], [safety], [trustworthiness], [ST-WebAgentBench] + - πŸ“– TLDR: This paper introduces **ST-WebAgentBench**, a benchmark designed to evaluate the safety and trustworthiness of web agents in enterprise contexts. It defines safe and trustworthy agent behavior, outlines the structure of safety policies, and introduces the "Completion under Policies" metric to assess agent performance. The study reveals that current state-of-the-art agents struggle with policy adherence, highlighting the need for improved policy awareness and compliance in web agents. + - [TinyClick: Single-Turn Agent for Empowering GUI Automation](https://arxiv.org/abs/2410.11871) - Pawel Pawlowski, Krystian Zawistowski, Wojciech Lapacz, Marcin Skorupa, Adam Wiacek, Sebastien Postansque, Jakub Hoscilowicz - πŸ›οΈ Institutions: Samsung R&D Poland, Warsaw University of Technology @@ -542,14 +551,14 @@ This repo covers a variety of papers related to GUI Agents, such as: - πŸ”‘ Key: [framework], [AppAgent v2] - πŸ“– TLDR: This work presents *AppAgent v2*, a novel LLM-based multimodal agent framework for mobile devices capable of navigating applications by emulating human-like interactions such as tapping and swiping. The agent constructs a flexible action space that enhances adaptability across various applications, including parsing text and vision descriptions. It operates through two main phases: exploration and deployment, utilizing retrieval-augmented generation (RAG) technology to efficiently retrieve and update information from a knowledge base, thereby empowering the agent to perform tasks effectively and accurately. -- [CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation](https://aclanthology.org/2024.findings-acl.539) - - Xinbei Ma, Zhuosheng Zhang, Hai Zhao - - πŸ›οΈ Institutions: SJTU - - πŸ“… Date: August 2024 - - πŸ“‘ Publisher: ACL 2024 - - πŸ’» Env: [Mobile] - - πŸ”‘ Key: [model], [framework], [benchmark] - - πŸ“– TLDR: This paper presents CoCo-Agent, a multimodal large language model (MLLM) designed for smartphone GUI automation. It introduces two novel approaches: Comprehensive Environment Perception (CEP) for enhanced GUI understanding, and Conditional Action Prediction (CAP) to improve action response accuracy. The proposed agent achieves state-of-the-art performance on GUI automation benchmarks such as AITW and META-GUI, showcasing its capabilities in realistic scenarios​:contentReference[oaicite:0]{index=0}​:contentReference[oaicite:1]{index=1}​:contentReference[oaicite:2]{index=2}​:contentReference[oaicite:3]{index=3}. +- [OmniParser for Pure Vision Based GUI Agent](https://microsoft.github.io/OmniParser/) + - Yadong Lu, Jianwei Yang, Yelong Shen, Ahmed Awadallah + - πŸ›οΈ Institutions: MSR, Microsoft Gen AI + - πŸ“… Date: August 1, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [GUI] + - πŸ”‘ Key: [framework], [dataset], [OmniParser] + - πŸ“– TLDR: This paper introduces **OmniParser**, a method for parsing user interface screenshots into structured elements, enhancing the ability of models like GPT-4V to generate actions accurately grounded in corresponding UI regions. 
The authors curated datasets for interactable icon detection and icon description, fine-tuning models to parse interactable regions and extract functional semantics of UI elements. - [Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions](https://arxiv.org/abs/2408.02544) - Xinbei Ma, Yiting Wang, Yao Yao, Tongxin Yuan, Aston Zhang, Zhuosheng Zhang, Hai Zhao @@ -560,14 +569,14 @@ This repo covers a variety of papers related to GUI Agents, such as: - πŸ”‘ Key: [multimodal agents], [environmental distractions], [robustness] - πŸ“– TLDR: This paper highlights the vulnerability of multimodal agents to environmental distractions. The researchers demonstrate that these agents, which process multiple types of input (e.g., text, images, audio), can be significantly impacted by irrelevant or misleading environmental cues. The study provides insights into the limitations of current multimodal systems and emphasizes the need for more robust architectures that can filter out distractions and maintain focus on relevant information in complex, real-world environments. -- [OmniParser for Pure Vision Based GUI Agent](https://microsoft.github.io/OmniParser/) - - Yadong Lu, Jianwei Yang, Yelong Shen, Ahmed Awadallah - - πŸ›οΈ Institutions: MSR, Microsoft Gen AI - - πŸ“… Date: August 1, 2024 - - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [GUI] - - πŸ”‘ Key: [framework], [dataset], [OmniParser] - - πŸ“– TLDR: This paper introduces **OmniParser**, a method for parsing user interface screenshots into structured elements, enhancing the ability of models like GPT-4V to generate actions accurately grounded in corresponding UI regions. The authors curated datasets for interactable icon detection and icon description, fine-tuning models to parse interactable regions and extract functional semantics of UI elements. +- [CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation](https://aclanthology.org/2024.findings-acl.539) + - Xinbei Ma, Zhuosheng Zhang, Hai Zhao + - πŸ›οΈ Institutions: SJTU + - πŸ“… Date: August 2024 + - πŸ“‘ Publisher: ACL 2024 + - πŸ’» Env: [Mobile] + - πŸ”‘ Key: [model], [framework], [benchmark] + - πŸ“– TLDR: This paper presents CoCo-Agent, a multimodal large language model (MLLM) designed for smartphone GUI automation. It introduces two novel approaches: Comprehensive Environment Perception (CEP) for enhanced GUI understanding, and Conditional Action Prediction (CAP) to improve action response accuracy. The proposed agent achieves state-of-the-art performance on GUI automation benchmarks such as AITW and META-GUI, showcasing its capabilities in realistic scenarios​:contentReference[oaicite:0]{index=0}​:contentReference[oaicite:1]{index=1}​:contentReference[oaicite:2]{index=2}​:contentReference[oaicite:3]{index=3}. - [OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation](https://arxiv.org/abs/2407.19056) - Zilong Wang, Yuedong Cui, Li Zhong, Zimin Zhang, Da Yin, Bill Yuchen Lin, Jingbo Shang @@ -650,15 +659,6 @@ This repo covers a variety of papers related to GUI Agents, such as: - πŸ”‘ Key: [framework], [VisionDroid] - πŸ“– TLDR: The paper presents **VisionDroid**, a vision-driven automated GUI testing approach utilizing Multimodal Large Language Models (MLLM) to detect non-crash functional bugs in mobile applications. By extracting GUI text information and aligning it with screenshots, VisionDroid enables MLLM to understand GUI context, facilitating deeper and function-oriented exploration. 
The approach segments exploration history into logically cohesive parts, prompting MLLM for bug detection, demonstrating superior performance over existing methods. -- [AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents](https://yuxiangchai.github.io/AMEX/) - - Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Dingyu Zhang, Peng Gao, Shuai Ren, Hongsheng Li - - πŸ›οΈ Institutions: CUHK, SJTU, Shanghai AI Lab, vivo AI Lab - - πŸ“… Date: July 3, 2024 - - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Mobile] - - πŸ”‘ Key: [dataset], [benchmark], [AMEX] - - πŸ“– TLDR: This paper introduces the **Android Multi-annotation EXpo (AMEX)**, a comprehensive dataset designed for training and evaluating mobile GUI-control agents. AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, annotated at multiple levels, including GUI interactive element grounding, functionality descriptions, and complex natural language instructions. The dataset aims to advance research on AI agents capable of completing complex tasks by interacting directly with mobile device GUIs. - - [CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents](https://arxiv.org/abs/2407.01511) - Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Philip Torr, Bernard Ghanem, Guohao Li - πŸ›οΈ Institutions: KAUST, UTokyo, CMU, Stanford, Harvard, Tsinghua University, SUSTech, Oxford @@ -668,6 +668,15 @@ This repo covers a variety of papers related to GUI Agents, such as: - πŸ”‘ Key: [benchmark], [framework], [evaluation], [CRAB] - πŸ“– TLDR: The authors present *CRAB*, a benchmark framework designed to evaluate Multimodal Language Model agents across multiple environments. It features a graph-based fine-grained evaluation method and supports automatic task generation, addressing limitations in existing benchmarks. +- [AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents](https://yuxiangchai.github.io/AMEX/) + - Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Dingyu Zhang, Peng Gao, Shuai Ren, Hongsheng Li + - πŸ›οΈ Institutions: CUHK, SJTU, Shanghai AI Lab, vivo AI Lab + - πŸ“… Date: July 3, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [Mobile] + - πŸ”‘ Key: [dataset], [benchmark], [AMEX] + - πŸ“– TLDR: This paper introduces the **Android Multi-annotation EXpo (AMEX)**, a comprehensive dataset designed for training and evaluating mobile GUI-control agents. AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, annotated at multiple levels, including GUI interactive element grounding, functionality descriptions, and complex natural language instructions. The dataset aims to advance research on AI agents capable of completing complex tasks by interacting directly with mobile device GUIs. + - [Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding](https://screen-point-and-read.github.io/) - Yue Fan, Lei Ding, Ching-Chen Kuo, Shan Jiang, Yang Zhao, Xinze Guan, Jie Yang, Yi Zhang, Xin Eric Wang - πŸ›οΈ Institutions: UCSC, MSR @@ -830,15 +839,6 @@ This repo covers a variety of papers related to GUI Agents, such as: - πŸ”‘ Key: [benchmark], [framework], [failure analysis], [analysis], [task disaggregation] - πŸ“– TLDR: This paper introduces *WebSuite*, a diagnostic benchmark to investigate the causes of web agent failures. 
By categorizing agent tasks using a taxonomy of operational, informational, and navigational actions, WebSuite offers granular insights into the specific actions where agents struggle, like filtering or form completion. It enables detailed comparison across agents, identifying areas for architectural and UX adaptation to improve agent reliability and task success on the web. -- [Visual Grounding for User Interfaces](https://aclanthology.org/2024.naacl-industry.9/) - - Yijun Qian, Yujie Lu, Alexander Hauptmann, Oriana Riva - - πŸ›οΈ Institutions: CMU, UCSB - - πŸ“… Date: June 2024 - - πŸ“‘ Publisher: NAACL 2024 - - πŸ’» Env: [GUI] - - πŸ”‘ Key: [framework], [visual grounding], [UI element localization], [LVG] - - πŸ“– TLDR: This work introduces the task of visual UI grounding, which unifies detection and grounding by enabling models to identify UI elements referenced by natural language commands solely from visual input. The authors propose **LVG**, a model that outperforms baselines pre-trained on larger datasets by over 4.9 points in top-1 accuracy, demonstrating its effectiveness in localizing referenced UI elements without relying on UI metadata. - - [VideoGUI: A Benchmark for GUI Automation from Instructional Videos](https://arxiv.org/abs/2406.10227) - Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen WU, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou - πŸ›οΈ Institutions: NUS, Microsoft Gen AI @@ -848,6 +848,15 @@ This repo covers a variety of papers related to GUI Agents, such as: - πŸ”‘ Key: [benchmark], [instructional videos], [visual planning], [hierarchical task decomposition], [complex software interaction] - πŸ“– TLDR: VideoGUI presents a benchmark for evaluating GUI automation on tasks derived from instructional videos, focusing on visually intensive applications like Adobe Photoshop and video editing software. The benchmark includes 178 tasks, with a hierarchical evaluation method distinguishing high-level planning, mid-level procedural steps, and precise action execution. VideoGUI reveals current model limitations in complex visual tasks, marking a significant step toward improved visual planning in GUI automation. +- [Visual Grounding for User Interfaces](https://aclanthology.org/2024.naacl-industry.9/) + - Yijun Qian, Yujie Lu, Alexander Hauptmann, Oriana Riva + - πŸ›οΈ Institutions: CMU, UCSB + - πŸ“… Date: June 2024 + - πŸ“‘ Publisher: NAACL 2024 + - πŸ’» Env: [GUI] + - πŸ”‘ Key: [framework], [visual grounding], [UI element localization], [LVG] + - πŸ“– TLDR: This work introduces the task of visual UI grounding, which unifies detection and grounding by enabling models to identify UI elements referenced by natural language commands solely from visual input. The authors propose **LVG**, a model that outperforms baselines pre-trained on larger datasets by over 4.9 points in top-1 accuracy, demonstrating its effectiveness in localizing referenced UI elements without relying on UI metadata. 
+ - [Large Language Models Can Self-Improve At Web Agent Tasks](https://arxiv.org/abs/2405.20309) - Ajay Patel, Markus Hofmarcher, Claudiu Leoveanu-Condrei, Marius-Constantin Dinu, Chris Callison-Burch, Sepp Hochreiter - πŸ›οΈ Institutions: University of Pennsylvania, ExtensityAI, Johannes Kepler University Linz, NXAI diff --git a/paper_by_env/paper_gui.md b/paper_by_env/paper_gui.md index d84af67..bfa36ef 100644 --- a/paper_by_env/paper_gui.md +++ b/paper_by_env/paper_gui.md @@ -1,3 +1,12 @@ +- [GUI Agents: A Survey](https://arxiv.org/pdf/2412.13501) + - Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, Xintong Li, Jing Shi, Hongjie Chen, Viet Dac Lai, Zhouhang Xie, Sungchul Kim, Ruiyi Zhang, Tong Yu, Mehrab Tanjim, Nesreen K. Ahmed, Puneet Mathur, Seunghyun Yoon, Lina Yao, Branislav Kveton, Thien Huu Nguyen, Trung Bui, Tianyi Zhou, Ryan A. Rossi, Franck Dernoncourt + - πŸ›οΈ Institutions: University of Maryland, SUNY Buffalo, Univ. of Oregon, Adobe Research, Meta AI, Univ. of Rochester, UC San Diego, Carnegie Mellon Univ., Dolby Labs, Intel AI Research, UNSW + - πŸ“… Date: December 18, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [GUI] + - πŸ”‘ Key: [survey] + - πŸ“– TLDR: This survey provides a comprehensive overview of GUI agents powered by Large Foundation Models, detailing their benchmarks, evaluation metrics, architectures, and training methods. It introduces a unified framework outlining their perception, reasoning, planning, and acting capabilities, identifies open challenges, and discusses future research directions, serving as a resource for both practitioners and researchers in the field. + - [Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining](https://arxiv.org/abs/2412.10342) - Zhiqi Ge, Juncheng Li, Xinglei Pang, Minghe Gao, Kaihang Pan, Wang Lin, Hao Fei, Wenqiao Zhang, Siliang Tang, Yueting Zhuang - πŸ›οΈ Institutions: Zhejiang University, National University of Singapore diff --git a/paper_by_env/paper_mobile.md b/paper_by_env/paper_mobile.md index 7e50fb9..fb89f8a 100644 --- a/paper_by_env/paper_mobile.md +++ b/paper_by_env/paper_mobile.md @@ -7,15 +7,6 @@ - πŸ”‘ Key: [framework], [dataset], [benchmark], [AndroidLab] - πŸ“– TLDR: This paper introduces **AndroidLab**, a comprehensive framework for training and systematically benchmarking Android autonomous agents. It provides an operational environment with diverse modalities and action spaces, supporting both large language models (LLMs) and multimodal models (LMMs). The benchmark includes 138 tasks across nine apps on predefined Android virtual devices. Utilizing AndroidLab, the authors developed an Android Instruction dataset and trained six open-source LLMs and LMMs, significantly improving their average success rates. -- [Lightweight Neural App Control](https://arxiv.org/abs/2410.17883) - - Filippos Christianos, Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao - - πŸ›οΈ Institutions: Huawei Noah's Ark Lab, UCL - - πŸ“… Date: October 23, 2024 - - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Mobile] - - πŸ”‘ Key: [framework], [vision language model], [Action Transformer], [app agent], [Android control], [multi-modal] - - πŸ“– TLDR: This paper introduces LiMAC, a mobile control framework for Android that integrates an Action Transformer and fine-tuned vision-language models to execute precise actions in mobile apps. 
Tested on open-source datasets, LiMAC improves action accuracy by up to 42% over traditional prompt engineering baselines, demonstrating enhanced efficiency and accuracy in mobile app control tasks. - - [MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control](https://arxiv.org/abs/2410.17520) - Juyong Lee, Dongyoon Hahm, June Suk Choi, W. Bradley Knox, Kimin Lee - πŸ›οΈ Institutions: KAIST, UT at Austin @@ -25,6 +16,15 @@ - πŸ”‘ Key: [benchmark], [safety], [evaluation], [Android emulator] - πŸ“– TLDR: *MobileSafetyBench* introduces a benchmark for evaluating the safety of large language model (LLM)-based autonomous agents in mobile device control. Using Android emulators, the benchmark simulates real-world tasks in apps such as messaging and banking to assess agents' safety and helpfulness. The safety-focused tasks test for privacy risk management and robustness against adversarial prompt injections. Experiments show agents perform well in helpful tasks but struggle with safety-related challenges, underscoring the need for continued advancements in mobile safety mechanisms for autonomous agents. +- [Lightweight Neural App Control](https://arxiv.org/abs/2410.17883) + - Filippos Christianos, Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao + - πŸ›οΈ Institutions: Huawei Noah's Ark Lab, UCL + - πŸ“… Date: October 23, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [Mobile] + - πŸ”‘ Key: [framework], [vision language model], [Action Transformer], [app agent], [Android control], [multi-modal] + - πŸ“– TLDR: This paper introduces LiMAC, a mobile control framework for Android that integrates an Action Transformer and fine-tuned vision-language models to execute precise actions in mobile apps. Tested on open-source datasets, LiMAC improves action accuracy by up to 42% over traditional prompt engineering baselines, demonstrating enhanced efficiency and accuracy in mobile app control tasks. + - [SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation](https://ar5iv.org/abs/2410.15164) - Jingxuan Chen, Derek Yuen, Bin Xie, Yuhao Yang, Gongwei Chen, Zhihao Wu, Li Yixing, Xurui Zhou, Weiwen Liu, Shuai Wang, Rui Shao, Liqiang Nie, Yasheng Wang, Jianye Hao, Jun Wang, Kun Shao - πŸ›οΈ Institutions: Huawei Noah’s Ark Lab, Harbin Institute of Technology, Shenzhen, UCL diff --git a/paper_by_env/paper_web.md b/paper_by_env/paper_web.md index 7b0b24d..a0511e7 100644 --- a/paper_by_env/paper_web.md +++ b/paper_by_env/paper_web.md @@ -88,15 +88,6 @@ - πŸ”‘ Key: [framework], [learning], [imitation learning], [exploration], [AI feedback] - πŸ“– TLDR: The paper presents **OpenWebVoyager**, an open-source framework for training web agents that explore real-world online environments autonomously. The framework employs a cycle of exploration, feedback, and optimization, enhancing agent capabilities through multimodal perception and iterative learning. Initial skills are acquired through imitation learning, followed by real-world exploration, where the agent’s performance is evaluated and refined through feedback loops. 
-- [Beyond Browsing: API-Based Web Agents](https://arxiv.org/pdf/2410.16464) - - Yueqi Song, Frank Xu, Shuyan Zhou, Graham Neubig - - πŸ›οΈ Institutions: CMU - - πŸ“… Date: October 24, 2024 - - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Web] - - πŸ”‘ Key: [API-based agent], [hybrid agent], [benchmark], [WebArena], [SOTA performance] - - πŸ“– TLDR: This paper introduces API-based and hybrid agents designed to execute online tasks by accessing both APIs and traditional web browsing interfaces. In evaluations using WebArena, a benchmark for web navigation, the API-based agent achieves higher performance than browser-based agents, and the hybrid model achieves a success rate of 35.8%, setting a new state-of-the-art (SOTA) in task-agnostic web navigation. The findings highlight the efficiency and reliability gains of API interactions for web agents. - - [VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks](https://doi.org/10.48550/arXiv.2410.19100) - Lawrence Jang, Yinheng Li, Charles Ding, Justin Lin, Paul Pu Liang, Dan Zhao, Rogerio Bonatti, Kazuhito Koishida - πŸ›οΈ Institutions: CMU, MIT, NYU, Microsoft @@ -106,6 +97,15 @@ - πŸ”‘ Key: [benchmark], [dataset], [video understanding], [long-context], [VideoWA] - πŸ“– TLDR: This paper introduces **VideoWebArena (VideoWA)**, a benchmark assessing multimodal agents in video-based tasks. It features over 2,000 tasks focused on skill and factual retention, using video tutorials to simulate long-context environments. Results highlight current challenges in agentic abilities, providing a critical testbed for long-context video understanding improvements. +- [Beyond Browsing: API-Based Web Agents](https://arxiv.org/pdf/2410.16464) + - Yueqi Song, Frank Xu, Shuyan Zhou, Graham Neubig + - πŸ›οΈ Institutions: CMU + - πŸ“… Date: October 24, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [Web] + - πŸ”‘ Key: [API-based agent], [hybrid agent], [benchmark], [WebArena], [SOTA performance] + - πŸ“– TLDR: This paper introduces API-based and hybrid agents designed to execute online tasks by accessing both APIs and traditional web browsing interfaces. In evaluations using WebArena, a benchmark for web navigation, the API-based agent achieves higher performance than browser-based agents, and the hybrid model achieves a success rate of 35.8%, setting a new state-of-the-art (SOTA) in task-agnostic web navigation. The findings highlight the efficiency and reliability gains of API interactions for web agents. + - [Large Language Models Empowered Personalized Web Agents](https://ar5iv.org/abs/2410.17236) - Hongru Cai, Yongqi Li, Wenjie Wang, Fengbin Zhu, Xiaoyu Shen, Wenjie Li, Tat-Seng Chua - πŸ›οΈ Institutions: HK PolyU, NTU Singapore @@ -115,15 +115,6 @@ - πŸ”‘ Key: [framework], [benchmark], [personalized web agent], [user behavior alignment], [memory-enhanced alignment] - πŸ“– TLDR: This paper proposes a novel framework, *Personalized User Memory-enhanced Alignment (PUMA)*, enabling large language models to serve as personalized web agents by incorporating user-specific data and historical web interactions. The authors also introduce a benchmark, *PersonalWAB*, to evaluate these agents on various personalized web tasks. Results show that PUMA improves web agent performance by optimizing action execution based on user-specific preferences. 
-- [Dissecting Adversarial Robustness of Multimodal LM Agents](https://openreview.net/forum?id=LjVIGva5Ct) - - Chen Henry Wu, Rishi Rajesh Shah, Jing Yu Koh, Russ Salakhutdinov, Daniel Fried, Aditi Raghunathan - - πŸ›οΈ Institutions: CMU, Stanford - - πŸ“… Date: October 21, 2024 - - πŸ“‘ Publisher: NeurIPS 2024 Workshop - - πŸ’» Env: [Web] - - πŸ”‘ Key: [dataset], [attack], [ARE], [safety] - - πŸ“– TLDR: This paper introduces the Agent Robustness Evaluation (ARE) framework to assess the adversarial robustness of multimodal language model agents in web environments. By creating 200 targeted adversarial tasks within VisualWebArena, the study reveals that minimal perturbations can significantly compromise agent performance, even in advanced systems utilizing reflection and tree-search mechanisms. The findings highlight the need for enhanced safety measures in deploying such agents. - - [AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?](https://arxiv.org/abs/2407.15711) - Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, Jonathan Berant - πŸ›οΈ Institutions: Tel Aviv University @@ -133,6 +124,15 @@ - πŸ”‘ Key: [benchmark], [dataset], [planning and reasoning] - πŸ“– TLDR: AssistantBench is a benchmark designed to test the abilities of web agents in completing time-intensive, realistic web-based tasks. Covering 214 tasks across various domains, the benchmark introduces the SPA (See-Plan-Act) framework to handle multi-step planning and memory retention. AssistantBench emphasizes realistic task completion, showing that current agents achieve only modest success, with significant improvements needed for complex information synthesis and execution across multiple web domains. +- [Dissecting Adversarial Robustness of Multimodal LM Agents](https://openreview.net/forum?id=LjVIGva5Ct) + - Chen Henry Wu, Rishi Rajesh Shah, Jing Yu Koh, Russ Salakhutdinov, Daniel Fried, Aditi Raghunathan + - πŸ›οΈ Institutions: CMU, Stanford + - πŸ“… Date: October 21, 2024 + - πŸ“‘ Publisher: NeurIPS 2024 Workshop + - πŸ’» Env: [Web] + - πŸ”‘ Key: [dataset], [attack], [ARE], [safety] + - πŸ“– TLDR: This paper introduces the Agent Robustness Evaluation (ARE) framework to assess the adversarial robustness of multimodal language model agents in web environments. By creating 200 targeted adversarial tasks within VisualWebArena, the study reveals that minimal perturbations can significantly compromise agent performance, even in advanced systems utilizing reflection and tree-search mechanisms. The findings highlight the need for enhanced safety measures in deploying such agents. + - [Harnessing Webpage UIs for Text-Rich Visual Understanding](https://arxiv.org/abs/2410.13824) - Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue - πŸ›οΈ Institutions: CMU diff --git a/paper_by_key/paper_benchmark.md b/paper_by_key/paper_benchmark.md index 17ad38b..c96e765 100644 --- a/paper_by_key/paper_benchmark.md +++ b/paper_by_key/paper_benchmark.md @@ -18,15 +18,6 @@ - πŸ”‘ Key: [framework], [dataset], [benchmark], [AndroidLab] - πŸ“– TLDR: This paper introduces **AndroidLab**, a comprehensive framework for training and systematically benchmarking Android autonomous agents. It provides an operational environment with diverse modalities and action spaces, supporting both large language models (LLMs) and multimodal models (LMMs). The benchmark includes 138 tasks across nine apps on predefined Android virtual devices. 
Utilizing AndroidLab, the authors developed an Android Instruction dataset and trained six open-source LLMs and LMMs, significantly improving their average success rates. -- [Evaluating Cultural and Social Awareness of LLM Web Agents](https://arxiv.org/abs/2410.23252) - - Haoyi Qiu, Alexander R. Fabbri, Divyansh Agarwal, Kung-Hsiang Huang, Sarah Tan, Nanyun Peng, Chien-Sheng Wu - - πŸ›οΈ Institutions: UCLA, Salesforce AI Research - - πŸ“… Date: October 30, 2024 - - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Web] - - πŸ”‘ Key: [benchmark], [CASA], [cultural awareness], [social awareness], [fine-tuning], [prompting] - - πŸ“– TLDR: This paper introduces CASA, a benchmark designed to assess the cultural and social awareness of LLM web agents in tasks like online shopping and social discussion forums. It evaluates agents' abilities to detect and appropriately respond to norm-violating user queries and observations. The study finds that current LLM agents have limited cultural and social awareness, with less than 10% awareness coverage and over 40% violation rates. To enhance performance, the authors explore prompting and fine-tuning methods, demonstrating that combining both can offer complementary advantages. - - [OS-ATLAS: A Foundation Action Model For Generalist GUI Agents](https://osatlas.github.io/) - Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao - πŸ›οΈ Institutions: Shanghai AI Lab, Shanghai Jiaotong University, HKU, MIT @@ -36,14 +27,23 @@ - πŸ”‘ Key: [model], [dataset], [benchmark], [OS-Atlas] - πŸ“– TLDR: This paper introduces OS-Atlas, a foundational GUI action model designed to enhance GUI grounding and out-of-distribution tasks. The authors developed a toolkit to synthesize multi-platform GUI grounding data, resulting in a cross-platform corpus of over 13 million GUI elements. OS-Atlas demonstrates significant performance improvements across six benchmarks spanning mobile, desktop, and web platforms. -- [AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant](https://arxiv.org/abs/2410.18603) - - Chengyou Jia, Minnan Luo, Zhuohang Dang, Qiushi Sun, Fangzhi Xu, Junlin Hu, Tianbao Xie, Zhiyong Wu - - πŸ›οΈ Institutions: XJTU, Shanghai AI Lab, HKU +- [Evaluating Cultural and Social Awareness of LLM Web Agents](https://arxiv.org/abs/2410.23252) + - Haoyi Qiu, Alexander R. Fabbri, Divyansh Agarwal, Kung-Hsiang Huang, Sarah Tan, Nanyun Peng, Chien-Sheng Wu + - πŸ›οΈ Institutions: UCLA, Salesforce AI Research + - πŸ“… Date: October 30, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [Web] + - πŸ”‘ Key: [benchmark], [CASA], [cultural awareness], [social awareness], [fine-tuning], [prompting] + - πŸ“– TLDR: This paper introduces CASA, a benchmark designed to assess the cultural and social awareness of LLM web agents in tasks like online shopping and social discussion forums. It evaluates agents' abilities to detect and appropriately respond to norm-violating user queries and observations. The study finds that current LLM agents have limited cultural and social awareness, with less than 10% awareness coverage and over 40% violation rates. To enhance performance, the authors explore prompting and fine-tuning methods, demonstrating that combining both can offer complementary advantages. 
+ +- [VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks](https://doi.org/10.48550/arXiv.2410.19100) + - Lawrence Jang, Yinheng Li, Charles Ding, Justin Lin, Paul Pu Liang, Dan Zhao, Rogerio Bonatti, Kazuhito Koishida + - πŸ›οΈ Institutions: CMU, MIT, NYU, Microsoft - πŸ“… Date: October 24, 2024 - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [GUI] - - πŸ”‘ Key: [framework], [multi-agent systems], [specialized generalist agent], [OSWorld benchmark] - - πŸ“– TLDR: AgentStore introduces a scalable platform to integrate and manage heterogeneous agents, designed to enhance generalist assistant capabilities for diverse computer tasks. Using a MetaAgent and AgentToken strategy, AgentStore shows improved generalization on the OSWorld benchmark. + - πŸ’» Env: [Web] + - πŸ”‘ Key: [benchmark], [dataset], [video understanding], [long-context], [VideoWA] + - πŸ“– TLDR: This paper introduces **VideoWebArena (VideoWA)**, a benchmark assessing multimodal agents in video-based tasks. It features over 2,000 tasks focused on skill and factual retention, using video tutorials to simulate long-context environments. Results highlight current challenges in agentic abilities, providing a critical testbed for long-context video understanding improvements. - [Beyond Browsing: API-Based Web Agents](https://arxiv.org/pdf/2410.16464) - Yueqi Song, Frank Xu, Shuyan Zhou, Graham Neubig @@ -54,14 +54,14 @@ - πŸ”‘ Key: [API-based agent], [hybrid agent], [benchmark], [WebArena], [SOTA performance] - πŸ“– TLDR: This paper introduces API-based and hybrid agents designed to execute online tasks by accessing both APIs and traditional web browsing interfaces. In evaluations using WebArena, a benchmark for web navigation, the API-based agent achieves higher performance than browser-based agents, and the hybrid model achieves a success rate of 35.8%, setting a new state-of-the-art (SOTA) in task-agnostic web navigation. The findings highlight the efficiency and reliability gains of API interactions for web agents. -- [VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks](https://doi.org/10.48550/arXiv.2410.19100) - - Lawrence Jang, Yinheng Li, Charles Ding, Justin Lin, Paul Pu Liang, Dan Zhao, Rogerio Bonatti, Kazuhito Koishida - - πŸ›οΈ Institutions: CMU, MIT, NYU, Microsoft +- [AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant](https://arxiv.org/abs/2410.18603) + - Chengyou Jia, Minnan Luo, Zhuohang Dang, Qiushi Sun, Fangzhi Xu, Junlin Hu, Tianbao Xie, Zhiyong Wu + - πŸ›οΈ Institutions: XJTU, Shanghai AI Lab, HKU - πŸ“… Date: October 24, 2024 - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Web] - - πŸ”‘ Key: [benchmark], [dataset], [video understanding], [long-context], [VideoWA] - - πŸ“– TLDR: This paper introduces **VideoWebArena (VideoWA)**, a benchmark assessing multimodal agents in video-based tasks. It features over 2,000 tasks focused on skill and factual retention, using video tutorials to simulate long-context environments. Results highlight current challenges in agentic abilities, providing a critical testbed for long-context video understanding improvements. + - πŸ’» Env: [GUI] + - πŸ”‘ Key: [framework], [multi-agent systems], [specialized generalist agent], [OSWorld benchmark] + - πŸ“– TLDR: AgentStore introduces a scalable platform to integrate and manage heterogeneous agents, designed to enhance generalist assistant capabilities for diverse computer tasks. 
Using a MetaAgent and AgentToken strategy, AgentStore shows improved generalization on the OSWorld benchmark. - [MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control](https://arxiv.org/abs/2410.17520) - Juyong Lee, Dongyoon Hahm, June Suk Choi, W. Bradley Knox, Kimin Lee @@ -108,15 +108,6 @@ - πŸ”‘ Key: [framework], [dataset], [benchmark], [reinforcement learning] - πŸ“– TLDR: AutoWebGLM introduces a web navigation agent based on ChatGLM3-6B, designed to autonomously navigate and interact with webpages for complex tasks. The paper highlights a two-phase data construction approach using a hybrid human-AI methodology for diverse, curriculum-based web task training. It also presents AutoWebBench, a benchmark for evaluating agent performance in web tasks, and uses reinforcement learning to fine-tune operations, addressing complex webpage interaction and grounding. -- [ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents](https://sites.google.com/view/st-webagentbench/home) - - Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov - - πŸ›οΈ Institutions: IBM Research - - πŸ“… Date: October 9, 2024 - - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Web] - - πŸ”‘ Key: [benchmark], [safety], [trustworthiness], [ST-WebAgentBench] - - πŸ“– TLDR: This paper introduces **ST-WebAgentBench**, a benchmark designed to evaluate the safety and trustworthiness of web agents in enterprise contexts. It defines safe and trustworthy agent behavior, outlines the structure of safety policies, and introduces the "Completion under Policies" metric to assess agent performance. The study reveals that current state-of-the-art agents struggle with policy adherence, highlighting the need for improved policy awareness and compliance in web agents. - - [ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents](https://arxiv.org/abs/2410.11872) - Jakub Hoscilowicz, Bartosz Maj, Bartosz Kozakiewicz, Oleksii Tymoschuk, Artur Janicki - πŸ›οΈ Institutions: Samsung R&D Poland, Warsaw University of Technology @@ -126,6 +117,15 @@ - πŸ”‘ Key: [framework], [model], [SeeClick], [AITW benchmark] - πŸ“– TLDR: The paper introduces *ClickAgent*, a framework that enhances autonomous agents' interaction with mobile UIs by improving their ability to locate interface elements accurately. This is achieved through a dual-component system where an MLLM performs reasoning and action planning, while a dedicated UI location model (e.g., SeeClick) handles element identification. ClickAgent, evaluated on the AITW benchmark and tested on both emulators and real Android devices, surpasses other agents like CogAgent and AppAgent in task success rate, advancing automation reliability on mobile platforms. +- [ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents](https://sites.google.com/view/st-webagentbench/home) + - Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov + - πŸ›οΈ Institutions: IBM Research + - πŸ“… Date: October 9, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [Web] + - πŸ”‘ Key: [benchmark], [safety], [trustworthiness], [ST-WebAgentBench] + - πŸ“– TLDR: This paper introduces **ST-WebAgentBench**, a benchmark designed to evaluate the safety and trustworthiness of web agents in enterprise contexts. It defines safe and trustworthy agent behavior, outlines the structure of safety policies, and introduces the "Completion under Policies" metric to assess agent performance. 
The study reveals that current state-of-the-art agents struggle with policy adherence, highlighting the need for improved policy awareness and compliance in web agents.
+
- [Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale](https://microsoft.github.io/WindowsAgentArena/)
- Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, Zack Hui
- πŸ›οΈ Institutions: Microsoft
@@ -189,15 +189,6 @@
- πŸ”‘ Key: [benchmark], [planning], [reasoning], [WorkArena++]
- πŸ“– TLDR: This paper introduces **WorkArena++**, a benchmark comprising 682 tasks that simulate realistic workflows performed by knowledge workers. It evaluates web agents' capabilities in planning, problem-solving, logical/arithmetic reasoning, retrieval, and contextual understanding. The study reveals challenges faced by current large language models and vision-language models in serving as effective workplace assistants, providing a resource to advance autonomous agent development.

-- [AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents](https://yuxiangchai.github.io/AMEX/)
- - Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Dingyu Zhang, Peng Gao, Shuai Ren, Hongsheng Li
- - πŸ›οΈ Institutions: CUHK, SJTU, Shanghai AI Lab, vivo AI Lab
- - πŸ“… Date: July 3, 2024
- - πŸ“‘ Publisher: arXiv
- - πŸ’» Env: [Mobile]
- - πŸ”‘ Key: [dataset], [benchmark], [AMEX]
- - πŸ“– TLDR: This paper introduces the **Android Multi-annotation EXpo (AMEX)**, a comprehensive dataset designed for training and evaluating mobile GUI-control agents. AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, annotated at multiple levels, including GUI interactive element grounding, functionality descriptions, and complex natural language instructions. The dataset aims to advance research on AI agents capable of completing complex tasks by interacting directly with mobile device GUIs.
-
- [CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents](https://arxiv.org/abs/2407.01511)
- Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Philip Torr, Bernard Ghanem, Guohao Li
- πŸ›οΈ Institutions: KAUST, UTokyo, CMU, Stanford, Harvard, Tsinghua University, SUSTech, Oxford
@@ -207,6 +198,15 @@
- πŸ”‘ Key: [benchmark], [framework], [evaluation], [CRAB]
- πŸ“– TLDR: The authors present *CRAB*, a benchmark framework designed to evaluate Multimodal Language Model agents across multiple environments. It features a graph-based fine-grained evaluation method and supports automatic task generation, addressing limitations in existing benchmarks.

+- [AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents](https://yuxiangchai.github.io/AMEX/)
+ - Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Dingyu Zhang, Peng Gao, Shuai Ren, Hongsheng Li
+ - πŸ›οΈ Institutions: CUHK, SJTU, Shanghai AI Lab, vivo AI Lab
+ - πŸ“… Date: July 3, 2024
+ - πŸ“‘ Publisher: arXiv
+ - πŸ’» Env: [Mobile]
+ - πŸ”‘ Key: [dataset], [benchmark], [AMEX]
+ - πŸ“– TLDR: This paper introduces the **Android Multi-annotation EXpo (AMEX)**, a comprehensive dataset designed for training and evaluating mobile GUI-control agents.
AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, annotated at multiple levels, including GUI interactive element grounding, functionality descriptions, and complex natural language instructions. The dataset aims to advance research on AI agents capable of completing complex tasks by interacting directly with mobile device GUIs. + - [E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion](https://arxiv.org/abs/2406.14250) - Ke Wang, Tianyu Xia, Zhangxuan Gu, Yi Zhao, Shuheng Shen, Changhua Meng, Weiqiang Wang, Ke Xu - πŸ›οΈ Institutions: Ant Group, Tsinghua University diff --git a/paper_by_key/paper_dataset.md b/paper_by_key/paper_dataset.md index 03a05a9..67a52c7 100644 --- a/paper_by_key/paper_dataset.md +++ b/paper_by_key/paper_dataset.md @@ -54,15 +54,6 @@ - πŸ”‘ Key: [benchmark], [dataset], [video understanding], [long-context], [VideoWA] - πŸ“– TLDR: This paper introduces **VideoWebArena (VideoWA)**, a benchmark assessing multimodal agents in video-based tasks. It features over 2,000 tasks focused on skill and factual retention, using video tutorials to simulate long-context environments. Results highlight current challenges in agentic abilities, providing a critical testbed for long-context video understanding improvements. -- [Dissecting Adversarial Robustness of Multimodal LM Agents](https://openreview.net/forum?id=LjVIGva5Ct) - - Chen Henry Wu, Rishi Rajesh Shah, Jing Yu Koh, Russ Salakhutdinov, Daniel Fried, Aditi Raghunathan - - πŸ›οΈ Institutions: CMU, Stanford - - πŸ“… Date: October 21, 2024 - - πŸ“‘ Publisher: NeurIPS 2024 Workshop - - πŸ’» Env: [Web] - - πŸ”‘ Key: [dataset], [attack], [ARE], [safety] - - πŸ“– TLDR: This paper introduces the Agent Robustness Evaluation (ARE) framework to assess the adversarial robustness of multimodal language model agents in web environments. By creating 200 targeted adversarial tasks within VisualWebArena, the study reveals that minimal perturbations can significantly compromise agent performance, even in advanced systems utilizing reflection and tree-search mechanisms. The findings highlight the need for enhanced safety measures in deploying such agents. - - [AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?](https://arxiv.org/abs/2407.15711) - Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, Jonathan Berant - πŸ›οΈ Institutions: Tel Aviv University @@ -72,6 +63,15 @@ - πŸ”‘ Key: [benchmark], [dataset], [planning and reasoning] - πŸ“– TLDR: AssistantBench is a benchmark designed to test the abilities of web agents in completing time-intensive, realistic web-based tasks. Covering 214 tasks across various domains, the benchmark introduces the SPA (See-Plan-Act) framework to handle multi-step planning and memory retention. AssistantBench emphasizes realistic task completion, showing that current agents achieve only modest success, with significant improvements needed for complex information synthesis and execution across multiple web domains. 
+- [Dissecting Adversarial Robustness of Multimodal LM Agents](https://openreview.net/forum?id=LjVIGva5Ct) + - Chen Henry Wu, Rishi Rajesh Shah, Jing Yu Koh, Russ Salakhutdinov, Daniel Fried, Aditi Raghunathan + - πŸ›οΈ Institutions: CMU, Stanford + - πŸ“… Date: October 21, 2024 + - πŸ“‘ Publisher: NeurIPS 2024 Workshop + - πŸ’» Env: [Web] + - πŸ”‘ Key: [dataset], [attack], [ARE], [safety] + - πŸ“– TLDR: This paper introduces the Agent Robustness Evaluation (ARE) framework to assess the adversarial robustness of multimodal language model agents in web environments. By creating 200 targeted adversarial tasks within VisualWebArena, the study reveals that minimal perturbations can significantly compromise agent performance, even in advanced systems utilizing reflection and tree-search mechanisms. The findings highlight the need for enhanced safety measures in deploying such agents. + - [Harnessing Webpage UIs for Text-Rich Visual Understanding](https://arxiv.org/abs/2410.13824) - Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue - πŸ›οΈ Institutions: CMU diff --git a/paper_by_key/paper_framework.md b/paper_by_key/paper_framework.md index 77d98ff..17df7fa 100644 --- a/paper_by_key/paper_framework.md +++ b/paper_by_key/paper_framework.md @@ -81,15 +81,6 @@ - πŸ”‘ Key: [framework], [reinforcement learning], [self-evolving curriculum], [WebRL], [outcome-supervised reward model] - πŸ“– TLDR: This paper introduces *WebRL*, a self-evolving online curriculum reinforcement learning framework designed to train high-performance web agents using open large language models (LLMs). WebRL addresses challenges such as the scarcity of training tasks, sparse feedback signals, and policy distribution drift in online learning. It incorporates a self-evolving curriculum that generates new tasks from unsuccessful attempts, a robust outcome-supervised reward model (ORM), and adaptive reinforcement learning strategies to ensure consistent improvements. Applied to Llama-3.1 and GLM-4 models, WebRL significantly enhances their performance on web-based tasks, surpassing existing state-of-the-art web agents. -- [AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents](https://arxiv.org/abs/2410.24024) - - Yifan Xu, Xiao Liu, Xueqiao Sun, Siyi Cheng, Hao Yu, Hanyu Lai, Shudan Zhang, Dan Zhang, Jie Tang, Yuxiao Dong - - πŸ›οΈ Institutions: Tsinghua University, Peking University - - πŸ“… Date: October 31, 2024 - - πŸ“‘ Publisher: arXiv - - πŸ’» Env: [Mobile] - - πŸ”‘ Key: [framework], [dataset], [benchmark], [AndroidLab] - - πŸ“– TLDR: This paper introduces **AndroidLab**, a comprehensive framework for training and systematically benchmarking Android autonomous agents. It provides an operational environment with diverse modalities and action spaces, supporting both large language models (LLMs) and multimodal models (LMMs). The benchmark includes 138 tasks across nine apps on predefined Android virtual devices. Utilizing AndroidLab, the authors developed an Android Instruction dataset and trained six open-source LLMs and LMMs, significantly improving their average success rates. 
- - [From Context to Action: Analysis of the Impact of State Representation and Context on the Generalization of Multi-Turn Web Navigation Agents](https://arxiv.org/abs/2409.13701) - Nalin Tiwary, Vardhan Dongre, Sanil Arun Chawla, Ashwin Lamani, Dilek Hakkani-TΓΌr - πŸ›οΈ Institutions: UIUC @@ -99,6 +90,15 @@ - πŸ”‘ Key: [framework], [context management], [generalization], [multi-turn navigation], [CWA] - πŸ“– TLDR: This study examines how different contextual elements affect the performance and generalization of Conversational Web Agents (CWAs) in multi-turn web navigation tasks. By optimizing context managementβ€”specifically interaction history and web page representationβ€”the research demonstrates enhanced agent performance across various out-of-distribution scenarios, including unseen websites, categories, and geographic locations. +- [AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents](https://arxiv.org/abs/2410.24024) + - Yifan Xu, Xiao Liu, Xueqiao Sun, Siyi Cheng, Hao Yu, Hanyu Lai, Shudan Zhang, Dan Zhang, Jie Tang, Yuxiao Dong + - πŸ›οΈ Institutions: Tsinghua University, Peking University + - πŸ“… Date: October 31, 2024 + - πŸ“‘ Publisher: arXiv + - πŸ’» Env: [Mobile] + - πŸ”‘ Key: [framework], [dataset], [benchmark], [AndroidLab] + - πŸ“– TLDR: This paper introduces **AndroidLab**, a comprehensive framework for training and systematically benchmarking Android autonomous agents. It provides an operational environment with diverse modalities and action spaces, supporting both large language models (LLMs) and multimodal models (LMMs). The benchmark includes 138 tasks across nine apps on predefined Android virtual devices. Utilizing AndroidLab, the authors developed an Android Instruction dataset and trained six open-source LLMs and LMMs, significantly improving their average success rates. + - [Auto-Intent: Automated Intent Discovery and Self-Exploration for Large Language Model Web Agents](https://arxiv.org/abs/2410.22552) - Jaekyeom Kim, Dong-Ki Kim, Lajanugen Logeswaran, Sungryull Sohn, Honglak Lee - πŸ›οΈ Institutions: LG AI Research, Field AI, University of Michigan @@ -288,15 +288,6 @@ - πŸ”‘ Key: [framework], [AppAgent v2] - πŸ“– TLDR: This work presents *AppAgent v2*, a novel LLM-based multimodal agent framework for mobile devices capable of navigating applications by emulating human-like interactions such as tapping and swiping. The agent constructs a flexible action space that enhances adaptability across various applications, including parsing text and vision descriptions. It operates through two main phases: exploration and deployment, utilizing retrieval-augmented generation (RAG) technology to efficiently retrieve and update information from a knowledge base, thereby empowering the agent to perform tasks effectively and accurately. -- [CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation](https://aclanthology.org/2024.findings-acl.539) - - Xinbei Ma, Zhuosheng Zhang, Hai Zhao - - πŸ›οΈ Institutions: SJTU - - πŸ“… Date: August 2024 - - πŸ“‘ Publisher: ACL 2024 - - πŸ’» Env: [Mobile] - - πŸ”‘ Key: [model], [framework], [benchmark] - - πŸ“– TLDR: This paper presents CoCo-Agent, a multimodal large language model (MLLM) designed for smartphone GUI automation. It introduces two novel approaches: Comprehensive Environment Perception (CEP) for enhanced GUI understanding, and Conditional Action Prediction (CAP) to improve action response accuracy. 
The proposed agent achieves state-of-the-art performance on GUI automation benchmarks such as AITW and META-GUI, showcasing its capabilities in realistic scenarios.
-
- [OmniParser for Pure Vision Based GUI Agent](https://microsoft.github.io/OmniParser/)
- Yadong Lu, Jianwei Yang, Yelong Shen, Ahmed Awadallah
- πŸ›οΈ Institutions: MSR, Microsoft Gen AI
@@ -306,6 +297,15 @@
- πŸ”‘ Key: [framework], [dataset], [OmniParser]
- πŸ“– TLDR: This paper introduces **OmniParser**, a method for parsing user interface screenshots into structured elements, enhancing the ability of models like GPT-4V to generate actions accurately grounded in corresponding UI regions. The authors curated datasets for interactable icon detection and icon description, fine-tuning models to parse interactable regions and extract functional semantics of UI elements.

+- [CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation](https://aclanthology.org/2024.findings-acl.539)
+ - Xinbei Ma, Zhuosheng Zhang, Hai Zhao
+ - πŸ›οΈ Institutions: SJTU
+ - πŸ“… Date: August 2024
+ - πŸ“‘ Publisher: ACL 2024
+ - πŸ’» Env: [Mobile]
+ - πŸ”‘ Key: [model], [framework], [benchmark]
+ - πŸ“– TLDR: This paper presents CoCo-Agent, a multimodal large language model (MLLM) designed for smartphone GUI automation. It introduces two novel approaches: Comprehensive Environment Perception (CEP) for enhanced GUI understanding, and Conditional Action Prediction (CAP) to improve action response accuracy. The proposed agent achieves state-of-the-art performance on GUI automation benchmarks such as AITW and META-GUI, showcasing its capabilities in realistic scenarios.
+
- [Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems](https://arxiv.org/abs/2407.13032)
- Aditya Vempaty, [Other authors not provided in the search results]
- πŸ›οΈ Institutions: Emergence AI

diff --git a/paper_by_key/paper_survey.md b/paper_by_key/paper_survey.md
index 96406d7..fb0951a 100644
--- a/paper_by_key/paper_survey.md
+++ b/paper_by_key/paper_survey.md
@@ -1,5 +1,14 @@
# Papers with Keyword: survey
+- [GUI Agents: A Survey](https://arxiv.org/pdf/2412.13501)
+ - Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, Xintong Li, Jing Shi, Hongjie Chen, Viet Dac Lai, Zhouhang Xie, Sungchul Kim, Ruiyi Zhang, Tong Yu, Mehrab Tanjim, Nesreen K. Ahmed, Puneet Mathur, Seunghyun Yoon, Lina Yao, Branislav Kveton, Thien Huu Nguyen, Trung Bui, Tianyi Zhou, Ryan A. Rossi, Franck Dernoncourt
+ - πŸ›οΈ Institutions: University of Maryland, SUNY Buffalo, Univ. of Oregon, Adobe Research, Meta AI, Univ. of Rochester, UC San Diego, Carnegie Mellon Univ., Dolby Labs, Intel AI Research, UNSW
+ - πŸ“… Date: December 18, 2024
+ - πŸ“‘ Publisher: arXiv
+ - πŸ’» Env: [GUI]
+ - πŸ”‘ Key: [survey]
+ - πŸ“– TLDR: This survey provides a comprehensive overview of GUI agents powered by Large Foundation Models, detailing their benchmarks, evaluation metrics, architectures, and training methods.
It introduces a unified framework outlining their perception, reasoning, planning, and acting capabilities, identifies open challenges, and discusses future research directions, serving as a resource for both practitioners and researchers in the field. + - [Generalist Virtual Agents: A Survey on Autonomous Agents Across Digital Platforms](https://arxiv.org/abs/2411.10943) - Minghe Gao, Wendong Bu, Bingchen Miao, Yang Wu, Yunfei Li, Juncheng Li, Siliang Tang, Qi Wu, Yueting Zhuang, Meng Wang - πŸ›οΈ Institutions: Zhejiang University, University of Adelaide, Hefei University of Technology diff --git a/update_template_or_data/statistics/keyword_wordcloud.png b/update_template_or_data/statistics/keyword_wordcloud.png index fb71857..dc36bf7 100644 Binary files a/update_template_or_data/statistics/keyword_wordcloud.png and b/update_template_or_data/statistics/keyword_wordcloud.png differ diff --git a/update_template_or_data/statistics/keyword_wordcloud_long.png b/update_template_or_data/statistics/keyword_wordcloud_long.png index 366f810..6c7ca4c 100644 Binary files a/update_template_or_data/statistics/keyword_wordcloud_long.png and b/update_template_or_data/statistics/keyword_wordcloud_long.png differ