This repo covers a variety of papers related to GUI Agents, such as:
- Datasets
- Benchmarks
- Models
- Agent frameworks
- Vision, language, multimodal foundation models (with explicit support for GUI)
- Works in general domains extensively used by GUI Agents (e.g., SoM prompting)
| Web | Mobile | Desktop | GUI | Misc |
|---|---|---|---|---|
(Misc: Papers for general topics that have important applications in GUI agents.)
framework (107) | benchmark (68) | dataset (63) | model (31) | reinforcement learning (13) | safety (11) | visual grounding (8) | planning (7) | reasoning (6) | grounding (5) | vision language model (5) | survey (4) | attack (4) | learning (4) | evaluation (4) | synthetic data (3) | foundation model (3) | UI understanding (3) | self-improvement (3) | programming-by-demonstration (3)
Yu Su (9) | Graham Neubig (8) | Huan Sun (8) | Tianbao Xie (7) | Tao Yu (7) | Boyuan Zheng (7) | Shuyan Zhou (7) | Xiao Liu (6) | Hanyu Lai (6) | Jie Tang (6) | Yuxiao Dong (6) | Difei Gao (5) | Mike Zheng Shou (5) | Zhiyong Wu (5) | Daniel Fried (5) | Toby Jia-Jun Li (5) | Ruslan Salakhutdinov (4) | Caiming Xiong (4) | Boyu Gou (4) | Yu Gu (4)
Papers
-
OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use
- Xueyu Hu, Tao Xiong, Biao Yi, Zishu Wei, Ruixuan Xiao, Yurun Chen, Jiasheng Ye, Meiling Tao, Xiangxin Zhou, Ziyu Zhao, Yuhuai Li, Shengze Xu, Shawn Wang, Xinchen Xu, Shuofei Qiao, Kun Kuang, Tieyong Zeng, Liang Wang, Jiwei Li, Yuchen Eleanor Jiang, Wangchunshu Zhou, Guoyin Wang, Keting Yin, Zhou Zhao, Hongxia Yang, Fan Wu, Shengyu Zhang, Fei Wu
- Institutions: Zhejiang University, Fudan University, OPPO AI Center, University of Chinese Academy of Sciences, Chinese Academy of Sciences, The Chinese University of Hong Kong, Tsinghua University, 01.AI, The Hong Kong Polytechnic University, SJTU
- Date: December 20, 2024
- Publisher: GitHub Repo
- Env: [GUI]
- Key: [survey]
- TLDR: This paper conducts a comprehensive survey on OS Agents, which are (M)LLM-based agents that use computing devices (e.g., computers and mobile phones) by operating within the environments and interfaces (e.g., Graphical User Interface (GUI)) provided by operating systems (OS) to automate tasks. The survey begins by elucidating the fundamentals of OS Agents, exploring their key components including the environment, observation space, and action space, and outlining essential capabilities such as understanding, planning, and grounding. Methodologies for constructing OS Agents are examined, with a focus on domain-specific foundation models and agent frameworks. A detailed review of evaluation protocols and benchmarks highlights how OS Agents are assessed across diverse tasks. Finally, current challenges and promising future research directions, including safety and privacy, personalization and self-evolution, are discussed.
-
- Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, Xintong Li, Jing Shi, Hongjie Chen, Viet Dac Lai, Zhouhang Xie, Sungchul Kim, Ruiyi Zhang, Tong Yu, Mehrab Tanjim, Nesreen K. Ahmed, Puneet Mathur, Seunghyun Yoon, Lina Yao, Branislav Kveton, Thien Huu Nguyen, Trung Bui, Tianyi Zhou, Ryan A. Rossi, Franck Dernoncourt
- Institutions: UMD, SUNY Buffalo, University of Oregon, Adobe Research, Meta AI, University of Rochester, UCSD, CMU, Dolby Labs, Intel AI Research, UNSW
- Date: December 18, 2024
- Publisher: arXiv
- Env: [GUI]
- Key: [survey]
- TLDR: This survey provides a comprehensive overview of GUI agents powered by Large Foundation Models, detailing their benchmarks, evaluation metrics, architectures, and training methods. It introduces a unified framework outlining their perception, reasoning, planning, and acting capabilities, identifies open challenges, and discusses future research directions, serving as a resource for both practitioners and researchers in the field.
-
Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents
- Yifei Zhou, Qianlan Yang, Kaixiang Lin, Min Bai, Xiong Zhou, Yu-Xiong Wang, Sergey Levine, Erran Li
- Institutions: UCB, UIUC, Amazon
- Date: December 17, 2024
- Publisher: arXiv
- Env: [Web]
- Key: [framework], [reinforcement learning], [skill discovery], [PAE]
- TLDR: This paper introduces the Proposer-Agent-Evaluator (PAE) system, enabling foundation model agents to autonomously discover and practice skills in real-world web environments. PAE comprises a context-aware task proposer, an agent policy for task execution, and a vision-language model-based success evaluator. Validated on vision-based web navigation tasks, PAE significantly enhances zero-shot generalization capabilities of vision-language model Internet agents, achieving over 30% relative improvement on unseen tasks and websites, and surpassing state-of-the-art open-source agents by more than 10%.
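The three-part loop named in the PAE summary above (proposer, agent policy, evaluator) can be sketched as a simple data-collection routine. This is a minimal illustration under stated assumptions, not the authors' implementation: `pae_rollouts`, `Rollout`, and the `toy_*` functions are hypothetical names, the stand-ins are plain Python functions rather than foundation models, and the reinforcement learning update that would consume the rewards is omitted.

```python
from typing import Callable, List, NamedTuple


class Rollout(NamedTuple):
    task: str
    trajectory: List[str]
    reward: float


def pae_rollouts(
    context: str,
    propose: Callable[[str], List[str]],
    act: Callable[[str], List[str]],
    evaluate: Callable[[str, List[str]], bool],
) -> List[Rollout]:
    """Collect one rollout per proposed task and score it with the evaluator."""
    rollouts = []
    for task in propose(context):          # proposer: context -> candidate tasks
        trajectory = act(task)             # agent policy: task -> action sequence
        reward = 1.0 if evaluate(task, trajectory) else 0.0  # evaluator: success?
        rollouts.append(Rollout(task, trajectory, reward))
    return rollouts


# Hypothetical toy stand-ins (assumptions for illustration only).
def toy_propose(context: str) -> List[str]:
    return [f"open the {page} page of {context}" for page in ("pricing", "contact")]


def toy_act(task: str) -> List[str]:
    # This toy policy only knows how to reach the pricing page.
    return ["click:pricing"]


def toy_evaluate(task: str, trajectory: List[str]) -> bool:
    # Success if any clicked target is mentioned in the task description.
    return any(step.split(":")[1] in task for step in trajectory)


if __name__ == "__main__":
    for rollout in pae_rollouts("example.com", toy_propose, toy_act, toy_evaluate):
        print(rollout)
```

In a real system the binary rewards would feed a policy update, so the agent practices the tasks it currently fails.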
-
Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining
- Zhiqi Ge, Juncheng Li, Xinglei Pang, Minghe Gao, Kaihang Pan, Wang Lin, Hao Fei, Wenqiao Zhang, Siliang Tang, Yueting Zhuang
- Institutions: Zhejiang University, NUS
- Date: December 13, 2024
- Publisher: arXiv
- Env: [GUI]
- Key: [framework], [Information-Sensitive Cropping], [Self-Refining Dual Learning], [visual grounding], [model]
- TLDR: This paper introduces Iris, a visual agent designed to enhance GUI automation by addressing challenges in high-resolution, complex digital environments. It employs two key innovations: Information-Sensitive Cropping (ISC), which dynamically identifies and prioritizes visually dense regions using an edge detection algorithm for efficient processing, and Self-Refining Dual Learning (SRDL), which enhances the agent's ability to handle complex tasks through a dual-learning loop that iteratively refines its performance without requiring additional annotated data. Empirical evaluations demonstrate that Iris achieves state-of-the-art performance across multiple benchmarks with only 850K GUI annotations, outperforming methods using ten times more training data.
-
Falcon-UI: Understanding GUI Before Following User Instructions
- Huawen Shen, Chang Liu, Gengluo Li, Xinlong Wang, Yu Zhou, Can Ma, Xiangyang Ji
- Institutions: Chinese Academy of Sciences, Tsinghua University, Nankai University, BAAI
- Date: December 12, 2024
- Publisher: arXiv
- Env: [GUI]
- Key: [model], [dataset], [Falcon-UI], [GUI understanding]
- TLDR: This paper introduces Falcon-UI, a GUI agent model that emphasizes understanding GUI contexts before following user instructions. The authors present the Insight-UI Dataset, an instruction-free GUI navigation dataset generated from the Common Crawl corpus, simulating various platforms like iOS, Android, Windows, and Linux across multiple resolutions on 312K domains. Falcon-UI is pretrained on this dataset and fine-tuned on Android and Web GUI datasets, achieving performance comparable to larger models, highlighting the importance of decoupling GUI understanding from instruction following in agent performance.
-
The BrowserGym Ecosystem for Web Agent Research
- Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, Alexandre Lacoste
- Institutions: ServiceNow Research, Mila, Polytechnique Montréal, CMU, McGill University, Tel Aviv University, Université de Montréal, iMean AI
- Date: December 6, 2024
- Publisher: arXiv
- Env: [Web]
- Key: [benchmark], [framework], [LLM], [automation], [BrowserGym], [AgentLab]
- TLDR: This paper presents BrowserGym, an ecosystem designed to standardize the evaluation and benchmarking of web agents, particularly those leveraging Large Language Models (LLMs). It addresses the challenges posed by fragmented benchmarks and inconsistent methodologies in web agent research. BrowserGym provides a unified, gym-like environment with clearly defined observation and action spaces, enabling reproducible comparisons across various benchmarks. Additionally, AgentLab, a complementary framework, supports agent creation, testing, and analysis. The paper also features a large-scale experiment comparing the performance of 6 leading LLMs, highlighting the strengths and weaknesses of different models in real-world web tasks, while emphasizing the ongoing challenges in building efficient and robust web agents.
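To make "gym-like environment with clearly defined observation and action spaces" concrete, here is a minimal sketch of the reset/step interaction pattern on a toy form. It does not use or reproduce BrowserGym's actual API: `ToyFormEnv`, `scripted_agent`, and `run_episode` are hypothetical names, and the observation and action formats are assumptions chosen for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Action = Tuple[str, str]       # (kind, argument), e.g. ("type", "hello")
Observation = Dict[str, str]   # fixed keys: "goal_text" and "textbox"


@dataclass
class ToyFormEnv:
    """A toy 'page' with one text box and one submit button."""

    goal_text: str = "hello"
    typed: str = ""
    submitted: bool = False

    def reset(self) -> Observation:
        self.typed, self.submitted = "", False
        return self._observe()

    def step(self, action: Action) -> Tuple[Observation, float, bool]:
        kind, arg = action
        if kind == "type":
            self.typed = arg
        elif kind == "click" and arg == "submit":
            self.submitted = True
        else:
            raise ValueError(f"unsupported action: {action!r}")
        # Reward is given only when the form is submitted with the goal text.
        reward = 1.0 if self.submitted and self.typed == self.goal_text else 0.0
        return self._observe(), reward, self.submitted

    def _observe(self) -> Observation:
        return {"goal_text": self.goal_text, "textbox": self.typed}


def scripted_agent(obs: Observation) -> Action:
    """Type the goal text first, then submit."""
    if obs["textbox"] != obs["goal_text"]:
        return ("type", obs["goal_text"])
    return ("click", "submit")


def run_episode(env: ToyFormEnv,
                agent: Callable[[Observation], Action],
                max_steps: int = 5) -> float:
    """Run one episode and return the total reward."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        obs, reward, done = env.step(agent(obs))
        total += reward
        if done:
            break
    return total


if __name__ == "__main__":
    print(run_episode(ToyFormEnv(), scripted_agent))
```

Because every agent sees the same observation keys and emits the same action tuples, two agents can be compared by swapping only the `agent` argument.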
-
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
- Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong
- Institutions: HKU, NTU, Salesforce
- Date: December 5, 2024
- Publisher: arXiv
- Env: [GUI]
- Key: [model], [dataset], [planning], [reasoning], [Aguvis], [visual grounding]
- TLDR: This paper introduces Aguvis, a unified pure vision-based framework for autonomous GUI agents that operates across various platforms. It leverages image-based observations and grounds natural language instructions to visual elements, employing a consistent action space to ensure cross-platform generalization. The approach integrates explicit planning and reasoning within the model, enhancing its ability to autonomously navigate and interact with complex digital environments. A large-scale dataset of GUI agent trajectories is constructed, incorporating multimodal reasoning and grounding. Comprehensive experiments demonstrate that Aguvis surpasses previous state-of-the-art methods in both offline and real-world online scenarios, achieving the first fully autonomous pure vision GUI agent capable of performing tasks independently without collaboration with external closed-source models. All datasets, models, and training recipes are open-sourced to facilitate future research.
-
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
- Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou
- Institutions: NUS, Microsoft
- Date: November 26, 2024
- Publisher: arXiv
- Env: [GUI]
- Key: [model], [framework], [dataset], [UI-Guided Visual Token Selection], [Interleaved Vision-Language-Action Streaming], [ShowUI]
- TLDR: This paper introduces ShowUI, a vision-language-action model designed to enhance GUI automation by addressing challenges in UI visual perception and action modeling. It features innovations like UI-Guided Visual Token Selection to reduce computational costs and Interleaved Vision-Language-Action Streaming for effective management of visual-action history. Trained on a curated dataset, ShowUI achieves 75.1% accuracy in zero-shot screenshot grounding and demonstrates competitive performance across web, mobile, and online environments.
-
AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations
- Gaurav Verma, Rachneet Kaur, Nishan Srishankar, Zhen Zeng, Tucker Balch, Manuela Veloso
- Institutions: J.P. Morgan AI Research
- Date: November 24, 2024
- Publisher: arXiv
- Env: [Web]
- Key: [framework], [few-shot learning], [meta-learning], [AdaptAgent]
- TLDR: This paper introduces AdaptAgent, a framework that enables multimodal web agents to adapt to new websites and domains using a few human demonstrations (up to 2). The approach enhances agents' adaptability beyond large-scale pre-training and fine-tuning by leveraging in-context learning and meta-learning techniques. Experiments on benchmarks like Mind2Web and VisualWebArena show that incorporating minimal human demonstrations boosts task success rates significantly, highlighting the effectiveness of multimodal demonstrations over text-only ones and the impact of data selection strategies during meta-learning on agent generalization.
-
Improved GUI Grounding via Iterative Narrowing
- Anthony Nguyen
- Institutions: Algoma University
- Date: November 18, 2024
- Publisher: arXiv
- Env: [GUI]
- Key: [framework], [grounding], [visual grounding], [iterative narrowing]
- TLDR: This paper introduces a visual framework to enhance GUI grounding. By iteratively refining model predictions through progressively focused image crops, the proposed method improves the performance of both general and fine-tuned Vision-Language Models (VLMs) in GUI grounding tasks.
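The idea of refining a prediction through progressively focused crops, as summarized above, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: `predict` is a hypothetical stand-in for a VLM call that returns a point in normalized crop coordinates, the halving `shrink` factor and iteration count are arbitrary choices, and `make_coarse_predictor` simulates a model with limited spatial resolution.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels
Point = Tuple[float, float]


def iterative_narrowing(
    predict: Callable[[Box], Point],
    width: float,
    height: float,
    iterations: int = 3,
    shrink: float = 0.5,
) -> Point:
    """Refine a click-point prediction by repeatedly cropping around it."""
    box: Box = (0.0, 0.0, float(width), float(height))
    x, y = width / 2, height / 2
    for _ in range(iterations):
        left, top, right, bottom = box
        nx, ny = predict(box)  # normalized [0, 1] coordinates within the crop
        # Map the crop-relative prediction back to full-image pixels.
        x = left + nx * (right - left)
        y = top + ny * (bottom - top)
        # Next crop: smaller, centred on the prediction, clamped to the image.
        new_w = (right - left) * shrink
        new_h = (bottom - top) * shrink
        new_left = min(max(x - new_w / 2, 0.0), width - new_w)
        new_top = min(max(y - new_h / 2, 0.0), height - new_h)
        box = (new_left, new_top, new_left + new_w, new_top + new_h)
    return x, y


def make_coarse_predictor(target: Point, grid: int = 10) -> Callable[[Box], Point]:
    """Hypothetical model: locates the target only to the nearest grid cell."""
    tx, ty = target

    def snap(value: float) -> float:
        cell = min(max(int(value * grid), 0), grid - 1)
        return (cell + 0.5) / grid

    def predict(box: Box) -> Point:
        left, top, right, bottom = box
        return snap((tx - left) / (right - left)), snap((ty - top) / (bottom - top))

    return predict


if __name__ == "__main__":
    predictor = make_coarse_predictor((703.0, 212.0))
    print(iterative_narrowing(predictor, 1000, 800, iterations=1))
    print(iterative_narrowing(predictor, 1000, 800, iterations=3))
```

With this toy predictor each grid cell covers fewer pixels as the crop shrinks, so the same coarse model lands closer to the target after more iterations.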
-
Generalist Virtual Agents: A Survey on Autonomous Agents Across Digital Platforms
- Minghe Gao, Wendong Bu, Bingchen Miao, Yang Wu, Yunfei Li, Juncheng Li, Siliang Tang, Qi Wu, Yueting Zhuang, Meng Wang
- Institutions: Zhejiang University, University of Adelaide, Hefei University of Technology
- Date: November 17, 2024
- Publisher: arXiv
- Env: [GUI]
- Key: [survey], [Generalist Virtual Agent], [GVA], [autonomous agents], [digital platforms]
- TLDR: This survey introduces the concept of Generalist Virtual Agents (GVAs), autonomous entities designed to operate across various digital platforms and environments to assist users in performing diverse tasks. It traces the evolution of GVAs from early intelligent assistants to modern implementations incorporating large-scale models, discussing their philosophical foundations, development challenges, and current methodologies. The paper provides a detailed taxonomy of GVA environments, tasks, and capabilities, aiming to bridge theoretical and practical aspects and suggesting that agents operating in environments closely mirroring the real world are more likely to exhibit human-like intelligence.
-
The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use
- Siyuan Hu, Mingyu Ouyang, Difei Gao, Mike Zheng Shou
- Institutions: NUS
- Date: November 15, 2024
- Publisher: arXiv
- Env: [Desktop]
- Key: [framework], [Claude 3.5 Computer Use], [GUI automation], [planning], [action], [critic]
- TLDR: This study evaluates Claude 3.5 Computer Use, an AI model enabling end-to-end language-to-desktop actions, through curated tasks across various domains. It introduces an out-of-the-box framework for deploying API-based GUI automation models, analyzing the model's planning, action execution, and adaptability to dynamic environments.
-
WebOlympus: An Open Platform for Web Agents on Live Websites
- Boyuan Zheng, Boyu Gou, Scott Salisbury, Zheng Du, Huan Sun, Yu Su
- Institutions: OSU
- Date: November 12, 2024
- Publisher: EMNLP 2024
- Env: [Web]
- Key: [safety], [Chrome extension], [WebOlympus], [SeeAct], [Annotation Tool]
- TLDR: This paper introduces WebOlympus, an open platform designed to facilitate the research and deployment of web agents on live websites. It features a user-friendly Chrome extension interface, allowing users without programming expertise to operate web agents with minimal effort. The platform incorporates a safety monitor module to prevent harmful actions through human supervision or model-based control, supporting applications such as annotation interfaces for web agent trajectories and data crawling.
-
Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents
- Yu Gu, Boyuan Zheng, Boyu Gou, Kai Zhang, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, Huan Sun, Yu Su
- Institutions: OSU, Orby AI
- Date: November 10, 2024
- Publisher: arXiv
- Env: [Web]
- Key: [framework], [WebDreamer], [model-based planning], [world model]
- TLDR: This paper investigates whether Large Language Models (LLMs) can function as world models within web environments, enabling model-based planning for web agents. It introduces WebDreamer, a framework that leverages LLMs to simulate potential action sequences in web environments, and demonstrates significant performance improvements over reactive baselines on benchmarks like VisualWebArena and Mind2Web-live. The findings suggest that LLMs possess the capability to model the dynamic nature of the internet, paving the way for advancements in automated web interaction and opening new research avenues in optimizing LLMs for complex, evolving environments.
-
GUI Agents with Foundation Models: A Comprehensive Survey
- Shuai Wang, Weiwen Liu, Jingxuan Chen, Weinan Gan, Xingshan Zeng, Shuai Yu, Xinlong Hao, Kun Shao, Yasheng Wang, Ruiming Tang
- Institutions: Huawei Noah's Ark Lab
- Date: November 7, 2024
- Publisher: arXiv
- Env: [GUI]
- Key: [survey]
- TLDR: This survey consolidates recent research on GUI agents powered by foundation models, particularly Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). It discusses representative datasets and benchmarks, summarizes a unified framework capturing essential components from prior studies, and explores commercial applications. The paper identifies key challenges and proposes future research directions to inspire further developments in (M)LLM-based GUI agents.
-
Attacking Vision-Language Computer Agents via Pop-ups
- Yanzhe Zhang, Tao Yu, Diyi Yang
- Institutions: Georgia Tech, HKU, Stanford
- Date: November 4, 2024
- Publisher: arXiv
- Env: [GUI]
- Key: [attack], [adversarial pop-ups], [VLM agents], [safety]
- TLDR: This paper demonstrates that vision-language model (VLM) agents can be easily deceived by carefully designed adversarial pop-ups, leading them to perform unintended actions such as clicking on these pop-ups instead of completing their assigned tasks. Integrating these pop-ups into environments like OSWorld and VisualWebArena resulted in an average attack success rate of 86% and a 47% decrease in task success rate. Basic defense strategies, such as instructing the agent to ignore pop-ups or adding advertisement notices, were found to be ineffective against these attacks.
-
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning
- Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Xinyue Yang, Jiadai Sun, Yu Yang, Shuntian Yao, Tianjie Zhang, Wei Xu, Jie Tang, Yuxiao Dong
- Institutions: Tsinghua University, BAAI
- Date: November 4, 2024
- Publisher: arXiv
- Env: [Web]
- Key: [framework], [reinforcement learning], [self-evolving curriculum], [WebRL], [outcome-supervised reward model]
- TLDR: This paper introduces WebRL, a self-evolving online curriculum reinforcement learning framework designed to train high-performance web agents using open large language models (LLMs). WebRL addresses challenges such as the scarcity of training tasks, sparse feedback signals, and policy distribution drift in online learning. It incorporates a self-evolving curriculum that generates new tasks from unsuccessful attempts, a robust outcome-supervised reward model (ORM), and adaptive reinforcement learning strategies to ensure consistent improvements. Applied to Llama-3.1 and GLM-4 models, WebRL significantly enhances their performance on web-based tasks, surpassing existing state-of-the-art web agents.
-
- Nalin Tiwary, Vardhan Dongre, Sanil Arun Chawla, Ashwin Lamani, Dilek Hakkani-Tür
- Institutions: UIUC
- Date: October 31, 2024
- Publisher: arXiv
- Env: [Web]
- Key: [framework], [context management], [generalization], [multi-turn navigation], [CWA]
- TLDR: This study examines how different contextual elements affect the performance and generalization of Conversational Web Agents (CWAs) in multi-turn web navigation tasks. By optimizing context management, specifically interaction history and web page representation, the research demonstrates enhanced agent performance across various out-of-distribution scenarios, including unseen websites, categories, and geographic locations.
-
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents
- Yifan Xu, Xiao Liu, Xueqiao Sun, Siyi Cheng, Hao Yu, Hanyu Lai, Shudan Zhang, Dan Zhang, Jie Tang, Yuxiao Dong
- Institutions: Tsinghua University, Peking University
- Date: October 31, 2024
- Publisher: arXiv
- Env: [Mobile]
- Key: [framework], [dataset], [benchmark], [AndroidLab]
- TLDR: This paper introduces AndroidLab, a comprehensive framework for training and systematically benchmarking Android autonomous agents. It provides an operational environment with diverse modalities and action spaces, supporting both large language models (LLMs) and large multimodal models (LMMs). The benchmark includes 138 tasks across nine apps on predefined Android virtual devices. Utilizing AndroidLab, the authors developed an Android Instruction dataset and trained six open-source LLMs and LMMs, significantly improving their average success rates.
-
OS-ATLAS: A Foundation Action Model For Generalist GUI Agents
- Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao
- Institutions: Shanghai AI Lab, Shanghai Jiao Tong University, HKU, MIT
- Date: October 30, 2024
- Publisher: arXiv
- Env: [GUI]
- Key: [model], [dataset], [benchmark], [OS-Atlas]
- TLDR: This paper introduces OS-Atlas, a foundational GUI action model designed to improve GUI grounding and performance on out-of-distribution tasks. The authors developed a toolkit to synthesize multi-platform GUI grounding data, resulting in a cross-platform corpus of over 13 million GUI elements. OS-Atlas demonstrates significant performance improvements across six benchmarks spanning mobile, desktop, and web platforms.
-
Evaluating Cultural and Social Awareness of LLM Web Agents
- Haoyi Qiu, Alexander R. Fabbri, Divyansh Agarwal, Kung-Hsiang Huang, Sarah Tan, Nanyun Peng, Chien-Sheng Wu
- Institutions: UCLA, Salesforce AI Research
- Date: October 30, 2024
- Publisher: arXiv
- Env: [Web]
- Key: [benchmark], [CASA], [cultural awareness], [social awareness], [fine-tuning], [prompting]
- TLDR: This paper introduces CASA, a benchmark designed to assess the cultural and social awareness of LLM web agents in tasks like online shopping and social discussion forums. It evaluates agents' abilities to detect and appropriately respond to norm-violating user queries and observations. The study finds that current LLM agents have limited cultural and social awareness, with less than 10% awareness coverage and over 40% violation rates. To enhance performance, the authors explore prompting and fine-tuning methods, demonstrating that combining both can offer complementary advantages.
-
Auto-Intent: Automated Intent Discovery and Self-Exploration for Large Language Model Web Agents
- Jaekyeom Kim, Dong-Ki Kim, Lajanugen Logeswaran, Sungryull Sohn, Honglak Lee
- Institutions: LG AI Research, Field AI, University of Michigan
- Date: October 29, 2024
- Publisher: EMNLP 2024 (Findings)
- Env: [Web]
- Key: [framework], [Auto-Intent]
- TLDR: The paper presents Auto-Intent, a method to adapt pre-trained large language models for web navigation tasks without direct fine-tuning. It discovers underlying intents from domain demonstrations and trains an intent predictor to enhance decision-making. Auto-Intent improves the performance of GPT-3.5, GPT-4, and Llama-3.1 agents on benchmarks like Mind2Web and WebArena.
-
- Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Hongming Zhang, Tianqing Fang, Zhenzhong Lan, Dong Yu
- Institutions: Zhejiang University, Tencent AI Lab, Westlake University
- Date: October 25, 2024
- Publisher: arXiv
- Env: [Web]
- Key: [framework], [learning], [imitation learning], [exploration], [AI feedback]
- TLDR: The paper presents OpenWebVoyager, an open-source framework for training web agents that explore real-world online environments autonomously. The framework employs a cycle of exploration, feedback, and optimization, enhancing agent capabilities through multimodal perception and iterative learning. Initial skills are acquired through imitation learning, followed by real-world exploration, where the agent's performance is evaluated and refined through feedback loops.
-
AutoGLM: Autonomous Foundation Agents for GUIs
- Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, Junjie Gao, Junjun Shan, Kangning Liu, Shudan Zhang, Shuntian Yao, Siyi Cheng, Wentao Yao, Wenyi Zhao, Xinghan Liu, Xinyi Liu, Xinying Chen, Xinyue Yang, Yang Yang, Yifan Xu, Yu Yang, Yujia Wang, Yulin Xu, Zehan Qi, Yuxiao Dong, Jie Tang
- Institutions: Zhipu AI, Tsinghua University
- Date: October 25, 2024
- Publisher: arXiv
- Env: [GUI]
- Key: [framework], [model], [learning], [AutoGLM]
- TLDR: This paper introduces AutoGLM, a new series in the ChatGLM family, designed as foundation agents for autonomous control of digital devices through GUIs. It addresses the challenges foundation models face in decision-making within dynamic environments by developing agents capable of learning through autonomous interactions. Focusing on web browsers and Android devices, AutoGLM integrates various techniques to create deployable agent systems. Key insights include the importance of designing an appropriate "intermediate interface" for GUI control and a novel progressive training framework for self-evolving online curriculum reinforcement learning. Evaluations demonstrate AutoGLM's effectiveness across multiple domains, achieving notable success rates in web browsing and Android device control tasks.
-
EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data
- Xuetian Chen, Hangcheng Li, Jiaqing Liang, Sihang Jiang, Deqing Yang
- Institutions: Fudan University
- Date: October 25, 2024
- Publisher: arXiv
- Env: [GUI]
- Key: [dataset], [framework], [synthetic data]
- TLDR: The EDGE framework proposes an innovative approach to improve GUI understanding and interaction capabilities in vision-language models through large-scale, multi-granularity synthetic data generation. By leveraging webpage data, EDGE minimizes the need for manual annotations and enhances the adaptability of models across desktop and mobile GUI environments. Evaluations show its effectiveness in diverse GUI-related tasks, contributing significantly to autonomous agent development in GUI navigation and interaction.
-
Beyond Browsing: API-Based Web Agents
- Yueqi Song, Frank Xu, Shuyan Zhou, Graham Neubig
- Institutions: CMU
- Date: October 24, 2024
- Publisher: arXiv
- Env: [Web]
- Key: [API-based agent], [hybrid agent], [benchmark], [WebArena], [SOTA performance]
- TLDR: This paper introduces API-based and hybrid agents designed to execute online tasks by accessing both APIs and traditional web browsing interfaces. In evaluations using WebArena, a benchmark for web navigation, the API-based agent achieves higher performance than browser-based agents, and the hybrid model achieves a success rate of 35.8%, setting a new state-of-the-art (SOTA) in task-agnostic web navigation. The findings highlight the efficiency and reliability gains of API interactions for web agents.
-
- Chengyou Jia, Minnan Luo, Zhuohang Dang, Qiushi Sun, Fangzhi Xu, Junlin Hu, Tianbao Xie, Zhiyong Wu
- Institutions: XJTU, Shanghai AI Lab, HKU
- Date: October 24, 2024
- Publisher: arXiv
- Env: [GUI]
- Key: [framework], [multi-agent systems], [specialized generalist agent], [OSWorld benchmark]
- TLDR: AgentStore introduces a scalable platform to integrate and manage heterogeneous agents, designed to enhance generalist assistant capabilities for diverse computer tasks. Using a MetaAgent and AgentToken strategy, AgentStore shows improved generalization on the OSWorld benchmark.
-
VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks
- Lawrence Jang, Yinheng Li, Charles Ding, Justin Lin, Paul Pu Liang, Dan Zhao, Rogerio Bonatti, Kazuhito Koishida
- Institutions: CMU, MIT, NYU, Microsoft
- Date: October 24, 2024
- Publisher: arXiv
- Env: [Web]
- Key: [benchmark], [dataset], [video understanding], [long-context], [VideoWA]
- TLDR: This paper introduces VideoWebArena (VideoWA), a benchmark assessing multimodal agents in video-based tasks. It features over 2,000 tasks focused on skill and factual retention, using video tutorials to simulate long-context environments. Results highlight current challenges in agentic abilities, providing a critical testbed for long-context video understanding improvements.
-
Lightweight Neural App Control
- Filippos Christianos, Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao
- Institutions: Huawei Noah's Ark Lab, UCL
- Date: October 23, 2024
- Publisher: arXiv
- Env: [Mobile]
- Key: [framework], [vision language model], [Action Transformer], [app agent], [Android control], [multi-modal]
- TLDR: This paper introduces LiMAC, a mobile control framework for Android that integrates an Action Transformer and fine-tuned vision-language models to execute precise actions in mobile apps. Tested on open-source datasets, LiMAC improves action accuracy by up to 42% over traditional prompt engineering baselines, demonstrating enhanced efficiency and accuracy in mobile app control tasks.
-
MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control
- Juyong Lee, Dongyoon Hahm, June Suk Choi, W. Bradley Knox, Kimin Lee
- Institutions: KAIST, UT Austin
- Date: October 23, 2024
- Publisher: arXiv
- Env: [Mobile]
- Key: [benchmark], [safety], [evaluation], [Android emulator]
- TLDR: MobileSafetyBench introduces a benchmark for evaluating the safety of large language model (LLM)-based autonomous agents in mobile device control. Using Android emulators, the benchmark simulates real-world tasks in apps such as messaging and banking to assess agents' safety and helpfulness. The safety-focused tasks test for privacy risk management and robustness against adversarial prompt injections. Experiments show agents perform well on helpfulness tasks but struggle with safety-related challenges, underscoring the need for continued advancements in mobile safety mechanisms for autonomous agents.
-
Large Language Models Empowered Personalized Web Agents
- Hongru Cai, Yongqi Li, Wenjie Wang, Fengbin Zhu, Xiaoyu Shen, Wenjie Li, Tat-Seng Chua
- Institutions: HK PolyU, NTU Singapore
- Date: October 22, 2024
- Publisher: arXiv
- Env: [Web]
- Key: [framework], [benchmark], [personalized web agent], [user behavior alignment], [memory-enhanced alignment]
- TLDR: This paper proposes a novel framework, Personalized User Memory-enhanced Alignment (PUMA), enabling large language models to serve as personalized web agents by incorporating user-specific data and historical web interactions. The authors also introduce a benchmark, PersonalWAB, to evaluate these agents on various personalized web tasks. Results show that PUMA improves web agent performance by optimizing action execution based on user-specific preferences.
-
AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?
- Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, Jonathan Berant
- Institutions: Tel Aviv University
- Date: October 21, 2024
- Publisher: arXiv
- Env: [Web]
- Key: [benchmark], [dataset], [planning and reasoning]
- TLDR: AssistantBench is a benchmark designed to test the abilities of web agents in completing time-intensive, realistic web-based tasks. Covering 214 tasks across various domains, the benchmark introduces the SPA (See-Plan-Act) framework to handle multi-step planning and memory retention. AssistantBench emphasizes realistic task completion, showing that current agents achieve only modest success, with significant improvements needed for complex information synthesis and execution across multiple web domains.
-
Dissecting Adversarial Robustness of Multimodal LM Agents
- Chen Henry Wu, Rishi Rajesh Shah, Jing Yu Koh, Russ Salakhutdinov, Daniel Fried, Aditi Raghunathan
- Institutions: CMU, Stanford
- Date: October 21, 2024
- Publisher: NeurIPS 2024 Workshop
- Env: [Web]
- Key: [dataset], [attack], [ARE], [safety]
- TLDR: This paper introduces the Agent Robustness Evaluation (ARE) framework to assess the adversarial robustness of multimodal language model agents in web environments. By creating 200 targeted adversarial tasks within VisualWebArena, the study reveals that minimal perturbations can significantly compromise agent performance, even in advanced systems utilizing reflection and tree-search mechanisms. The findings highlight the need for enhanced safety measures in deploying such agents.
-
SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation
- Jingxuan Chen, Derek Yuen, Bin Xie, Yuhao Yang, Gongwei Chen, Zhihao Wu, Li Yixing, Xurui Zhou, Weiwen Liu, Shuai Wang, Rui Shao, Liqiang Nie, Yasheng Wang, Jianye Hao, Jun Wang, Kun Shao
- 🏛️ Institutions: Huawei Noah's Ark Lab, Harbin Institute of Technology, Shenzhen, UCL
- 📅 Date: October 19, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [benchmark], [AI agent], [smartphone control], [framework]
- 📖 TLDR: SPA-Bench is introduced as a benchmark designed to evaluate multimodal large language model (MLLM)-based smartphone agents, offering a task set that spans common smartphone functionalities across system and third-party applications. It includes a plug-and-play framework for real-time agent interactions on Android, integrating over ten agents with an adaptable evaluation pipeline measuring success across diverse metrics. Through this, the benchmark exposes challenges such as UI interpretation, action grounding, and memory retention in mobile environments, advancing research in smartphone-based agent applications.
-
Harnessing Webpage UIs for Text-Rich Visual Understanding
- Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue
- 🏛️ Institutions: CMU
- 📅 Date: October 17, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web], [Doc]
- 🔑 Key: [dataset], [model], [text-rich visual understanding], [web UI comprehension]
- 📖 TLDR: This paper introduces MultiUI, a large-scale dataset containing 7.3 million annotated samples from 1 million websites, specifically designed to enhance multimodal large language models' (MLLMs) capabilities in text-rich visual understanding. Utilizing webpage UI structures as a training resource, MultiUI provides robust accessibility tree data paired with UI screenshots, significantly improving MLLMs' grounding, OCR, and interaction performance. Models trained with MultiUI achieve up to a 48% performance boost on VisualWebBench and demonstrate enhanced generalization across non-web tasks, setting a new standard for structured, visually integrated web data modeling.
-
AutoWebGLM: A Large Language Model-based Web Navigating Agent
- Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, Jie Tang
- 🏛️ Institutions: THU, OSU
- 📅 Date: October 12, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [framework], [dataset], [benchmark], [reinforcement learning]
- 📖 TLDR: AutoWebGLM introduces a web navigation agent based on ChatGLM3-6B, designed to autonomously navigate and interact with webpages for complex tasks. The paper highlights a two-phase data construction approach using a hybrid human-AI methodology for diverse, curriculum-based web task training. It also presents AutoWebBench, a benchmark for evaluating agent performance in web tasks, and uses reinforcement learning to fine-tune operations, addressing complex webpage interaction and grounding.
-
Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents
- Priyanshu Kumar, Elaine Lau, Saranya Vijayakumar, Tu Trinh, Scale Red Team, Elaine Chang, Vaughn Robinson, Sean Hendryx, Shuyan Zhou, Matt Fredrikson, Summer Yue, Zifan Wang
- 🏛️ Institutions: CMU, GraySwan AI, Scale AI
- 📅 Date: October 11, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [attack], [BrowserART], [jailbreaking], [safety]
- 📖 TLDR: This paper introduces Browser Agent Red teaming Toolkit (BrowserART), a comprehensive test suite for evaluating the safety of LLM-based browser agents. The study reveals that while refusal-trained LLMs decline harmful instructions in chat settings, their corresponding browser agents often comply with such instructions, indicating a significant safety gap. The authors call for collaboration among developers and policymakers to enhance agent safety.
-
Agent S: An Open Agentic Framework that Uses Computers Like a Human
- Saaket Agashe, Jiuzhou Han, Shuyu Gan, Jiachen Yang, Ang Li, Xin Eric Wang
- 🏛️ Institutions: Simular Research
- 📅 Date: October 10, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [framework], [autonomous GUI interaction], [experience-augmented hierarchical planning]
- 📖 TLDR: This paper introduces Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI). The system addresses key challenges in automating computer tasks through experience-augmented hierarchical planning and an Agent-Computer Interface (ACI). Agent S demonstrates significant improvements over baselines on the OSWorld benchmark, achieving a 20.58% success rate (83.6% relative improvement). The framework shows generalizability across different operating systems and provides insights for developing more effective GUI agents.
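The two figures quoted for Agent S are tied together by simple arithmetic: a success rate that is 83.6% higher in relative terms implies a baseline of roughly 20.58 / 1.836. The sketch below only illustrates that calculation. The function name is hypothetical, and the implied baseline is derived from the two numbers in the TLDR rather than quoted from the paper.

```python
def implied_baseline(success_rate: float, relative_improvement: float) -> float:
    """Return the baseline rate implied by a rate and its relative improvement.

    Both arguments are percentages, e.g. 20.58 and 83.6.
    """
    return success_rate / (1.0 + relative_improvement / 100.0)


# Numbers taken from the TLDR above; the result is a derived estimate.
baseline = implied_baseline(20.58, 83.6)
print(f"implied baseline success rate: {baseline:.2f}%")
```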
-
TinyClick: Single-Turn Agent for Empowering GUI Automation
- Pawel Pawlowski, Krystian Zawistowski, Wojciech Lapacz, Marcin Skorupa, Adam Wiacek, Sebastien Postansque, Jakub Hoscilowicz
- 🏛️ Institutions: Samsung R&D Poland, Warsaw University of Technology
- 📅 Date: October 9, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [framework], [vision language model], [Screenspot], [OmniAct]
- 📖 TLDR: TinyClick is a compact, single-turn agent designed to automate GUI tasks by precisely locating screen elements via the Vision-Language Model Florence-2-Base. Trained with multi-task strategies and MLLM-based data augmentation, TinyClick achieves high accuracy on Screenspot and OmniAct, outperforming specialized GUI interaction models and general MLLMs like GPT-4V. The model's lightweight design (0.27B parameters) ensures fast processing and minimal latency, making it efficient for real-world applications on multiple platforms.
-
ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents
- Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov
- 🏛️ Institutions: IBM Research
- 📅 Date: October 9, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [safety], [trustworthiness], [ST-WebAgentBench]
- 📖 TLDR: This paper introduces ST-WebAgentBench, a benchmark designed to evaluate the safety and trustworthiness of web agents in enterprise contexts. It defines safe and trustworthy agent behavior, outlines the structure of safety policies, and introduces the "Completion under Policies" metric to assess agent performance. The study reveals that current state-of-the-art agents struggle with policy adherence, highlighting the need for improved policy awareness and compliance in web agents.
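One plausible reading of a "Completion under Policies" style metric is that an episode counts only when the task was completed and no policy was violated along the way. The sketch below illustrates that reading next to a plain completion rate. The `EpisodeResult` record and both function names are assumptions made for illustration, not the benchmark's actual API or exact definition.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    """Hypothetical per-task record: did the agent finish, and how many policies did it break?"""
    completed: bool
    policy_violations: int


def completion_rate(results: list[EpisodeResult]) -> float:
    """Fraction of episodes where the task was completed, ignoring policies."""
    return sum(r.completed for r in results) / len(results)


def completion_under_policies(results: list[EpisodeResult]) -> float:
    """Fraction of episodes completed with zero policy violations (assumed definition)."""
    return sum(r.completed and r.policy_violations == 0 for r in results) / len(results)


# Made-up example data, not results from the paper.
results = [
    EpisodeResult(completed=True, policy_violations=0),
    EpisodeResult(completed=True, policy_violations=2),
    EpisodeResult(completed=False, policy_violations=0),
    EpisodeResult(completed=True, policy_violations=0),
]
print(completion_rate(results), completion_under_policies(results))
```

The gap between the two numbers is what a policy-aware metric is meant to expose: an agent can finish a task while still breaking rules on the way.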
-
ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents
- Jakub Hoscilowicz, Bartosz Maj, Bartosz Kozakiewicz, Oleksii Tymoschuk, Artur Janicki
- 🏛️ Institutions: Samsung R&D Poland, Warsaw University of Technology
- 📅 Date: October 9, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [framework], [model], [SeeClick], [AITW benchmark]
- 📖 TLDR: The paper introduces ClickAgent, a framework that enhances autonomous agents' interaction with mobile UIs by improving their ability to locate interface elements accurately. This is achieved through a dual-component system where an MLLM performs reasoning and action planning, while a dedicated UI location model (e.g., SeeClick) handles element identification. ClickAgent, evaluated on the AITW benchmark and tested on both emulators and real Android devices, surpasses other agents like CogAgent and AppAgent in task success rate, advancing automation reliability on mobile platforms.
-
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
- Boyu Gou, Ruochen Wang, Boyuan Zheng, Yucheng Xie, Cheng Chang, Yiheng Shu, Haotian Sun, Yu Su
- 🏛️ Institutions: OSU, Orby AI
- 📅 Date: October 7, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [framework], [visual grounding], [GUI agents], [cross-platform generalization], [UGround], [SeeAct-V], [synthetic data]
- 📖 TLDR: This paper introduces UGround, a universal visual grounding model for GUI agents that enables human-like navigation of digital interfaces. The authors advocate for GUI agents with human-like embodiment that perceive the environment entirely visually and take pixel-level actions. UGround is trained on a large-scale synthetic dataset of 10M GUI elements across 1.3M screenshots. Evaluated on six benchmarks spanning grounding, offline, and online agent tasks, UGround significantly outperforms existing visual grounding models by up to 20% absolute. Agents using UGround achieve comparable or better performance than state-of-the-art agents that rely on additional textual input, demonstrating the feasibility of vision-only GUI agents.
-
ExACT: Teaching AI Agents to Explore with Reflective-MCTS and Exploratory Learning
- Xiao Yu, Baolin Peng, Vineeth Vajipey, Hao Cheng, Michel Galley, Jianfeng Gao, Zhou Yu
- 🏛️ Institutions: Columbia Univ., MSR
- 📅 Date: October 2, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [framework], [learning], [R-MCTS], [Exploratory Learning], [VisualWebArena]
- 📖 TLDR: This paper introduces ExACT, an approach that combines Reflective Monte Carlo Tree Search (R-MCTS) and Exploratory Learning to enhance AI agents' exploration and decision-making capabilities in complex web environments. R-MCTS incorporates contrastive reflection and multi-agent debate for improved search efficiency and reliable state evaluation. Evaluated on the VisualWebArena benchmark, the GPT-4o-based R-MCTS agent demonstrates significant performance improvements over previous state-of-the-art methods. Additionally, knowledge gained from test-time search is effectively transferred back to GPT-4o through fine-tuning, enabling the model to explore, evaluate, and backtrack without external search algorithms, achieving 87% of R-MCTS's performance with reduced computational resources.
-
Dynamic Planning for LLM-based Graphical User Interface Automation
- Shaoqing Zhang, Zhuosheng Zhang, Kehai Chen, Xinbei Ma, Muyun Yang, Tiejun Zhao, Min Zhang
- 🏛️ Institutions: SJTU
- 📅 Date: October 1, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [framework], [dynamic planning]
- 📖 TLDR: This paper introduces a novel method called Dynamic Planning of Thoughts (D-PoT) aimed at enhancing LLM-based agents for GUI tasks. It addresses the challenges of task execution by dynamically adjusting planning based on environmental feedback and action history, outperforming existing methods such as ReAct by improving accuracy significantly in navigating GUI environments. The study emphasizes the importance of integrating execution history and contextual cues to optimize decision-making processes for autonomous agents.
-
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
- Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, Zirui Wang, Afshin Dehghan, Peter Grasch, Yinfei Yang
- 🏛️ Institutions: Apple
- 📅 Date: September 30, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Misc]
- 🔑 Key: [model], [MM1.5], [vision language model], [visual grounding], [reasoning], [data-centric], [analysis]
- 📖 TLDR: This paper introduces MM1.5, a family of multimodal large language models (MLLMs) ranging from 1B to 30B parameters, including dense and mixture-of-experts variants. MM1.5 enhances capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. The authors employ a data-centric training approach, utilizing high-quality OCR data and synthetic captions for continual pre-training, alongside an optimized visual instruction-tuning data mixture for supervised fine-tuning. Specialized variants, MM1.5-Video and MM1.5-UI, are designed for video understanding and mobile UI comprehension, respectively. Extensive empirical studies provide insights into the training processes, offering guidance for future MLLM development.
-
AdvWeb: Controllable Black-box Attacks on VLM-powered Web Agents
- Chejian Xu, Mintong Kang, Jiawei Zhang, Zeyi Liao, Lingbo Mo, Mengqi Yuan, Huan Sun, Bo Li
- 🏛️ Institutions: UIUC, OSU
- 📅 Date: September 27, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [safety], [black-box attack], [adversarial prompter model], [Direct Policy Optimization]
- 📖 TLDR: This paper presents AdvWeb, a black-box attack framework that exploits vulnerabilities in vision-language model (VLM)-powered web agents by injecting adversarial prompts directly into web pages. Using Direct Policy Optimization (DPO), AdvWeb trains an adversarial prompter model that can mislead agents into executing harmful actions, such as unauthorized financial transactions, while maintaining high stealth and control. Extensive evaluations reveal that AdvWeb achieves high success rates across multiple real-world tasks, emphasizing the need for stronger security measures in web agent deployments.
-
Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale
- Tianyue Ou, Frank F. Xu, Aman Madaan, Jiarui Liu, Robert Lo, Abishek Sridhar, Sudipta Sengupta, Dan Roth, Graham Neubig, Shuyan Zhou
- 🏛️ Institutions: CMU, Amazon AWS AI
- 📅 Date: September 27, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Misc]
- 🔑 Key: [synthetic data]
- 📖 TLDR: Synatra introduces a scalable framework for digital agents, enabling them to convert indirect knowledge sources into actionable demonstrations. This approach enhances the ability of agents to learn tasks without extensive labeled data, leveraging insights from indirect observations to scale practical implementations in digital environments.
-
Turn Every Application into an Agent: Towards Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents
- Junting Lu, Zhiyang Zhang, Fangkai Yang, Jue Zhang, Lu Wang, Chao Du, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
- 🏛️ Institutions: Peking University, Microsoft
- 📅 Date: September 26, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [framework], [API interaction], [HACI], [Agent OS]
- 📖 TLDR: This paper proposes an API-centered framework called AXIS, enhancing the efficiency and reliability of LLM-based agents by prioritizing API interactions over UI-based actions. This approach aims to reduce the high latency and error rates of traditional UI-interaction models. AXIS not only supports the rapid creation and extension of APIs through automated application exploration but also contributes to a new Human-Agent-Computer Interaction (HACI) framework. The paper outlines the development of an agent-centric operating system (Agent OS), which improves task completion times by up to 70% and reduces cognitive load on users while maintaining high accuracy across complex multi-application tasks.
-
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
- Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda, Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt, Dirk Groeneveld, Crystal Nam, Sophie Lebrecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna, Luca Weihs, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, Aniruddha Kembhavi
- 🏛️ Institutions: AI2, UW
- 📅 Date: September 25, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Misc]
- 🔑 Key: [model], [dataset], [PixMo], [Molmo], [vision language model], [foundation model]
- 📖 TLDR: This paper introduces Molmo, a family of state-of-the-art open vision-language models (VLMs), and PixMo, a collection of new datasets including detailed image captions, free-form image Q&A, and innovative 2D pointing data, all collected without reliance on proprietary VLMs. The authors demonstrate that careful model design, a well-tuned training pipeline, and high-quality open datasets can produce VLMs that outperform existing open models and rival proprietary systems. The model weights, datasets, and source code are made publicly available to advance research in this field.
-
MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding
- Qinzhuo Wu, Weikai Xu, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, Shuo Shang
- 🏛️ Institutions: XiaoMi AI Lab, University of Electronic Science and Technology of China, Renmin University of China
- 📅 Date: September 23, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [model], [dataset], [MobileVLM], [Mobile3M], [UI understanding]
- 📖 TLDR: This paper introduces MobileVLM, a vision-language model designed to enhance both intra- and inter-UI understanding for mobile applications. The authors propose two additional pre-training stages with four specific UI-based tasks to improve the model's perception of fine-grained elements and capture page transition actions. To support this, they constructed Mobile3M, a large-scale Chinese mobile dataset comprising 3 million UI pages and real-world transition actions, organized into directed graphs. Experimental results demonstrate that MobileVLM outperforms existing vision-language models on both in-house test sets and public mobile benchmarks.
-
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
- Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin
- 🏛️ Institutions: Alibaba Cloud
- 📅 Date: September 18, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Misc]
- 🔑 Key: [foundation model], [MLLM], [Qwen2-VL]
- 📖 TLDR: Qwen2-VL introduces an advanced vision-language framework that enables dynamic resolution handling for images and videos through its Naive Dynamic Resolution mechanism and Multimodal Rotary Position Embedding (M-RoPE). This structure allows the model to convert images of varying resolutions into diverse token counts for improved visual comprehension. With model sizes up to 72B parameters, Qwen2-VL demonstrates competitive performance across multiple benchmarks, achieving results on par with or better than prominent multimodal models like GPT-4o and Claude3.5-Sonnet. This work represents a significant step forward in scalable vision-language learning for multimodal tasks.
-
EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage
- Zeyi Liao, Lingbo Mo, Chejian Xu, Mintong Kang, Jiawei Zhang, Chaowei Xiao, Yuan Tian, Bo Li, Huan Sun
- 🏛️ Institutions: OSU, UCLA, UChicago, UIUC, UW-Madison
- 📅 Date: September 17, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [safety], [privacy attack], [environmental injection], [stealth attack]
- 📖 TLDR: This paper introduces the Environmental Injection Attack (EIA), a privacy attack targeting generalist web agents by embedding malicious yet concealed web elements to trick agents into leaking users' PII. Utilizing 177 action steps within realistic web scenarios, EIA demonstrates a high success rate in extracting specific PII and whole user requests. Through its detailed threat model and defense suggestions, the work underscores the challenge of detecting and mitigating privacy risks in autonomous web agents.
-
Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale
- Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, Zack Hui
- 🏛️ Institutions: Microsoft
- 📅 Date: September 13, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Desktop]
- 🔑 Key: [framework], [benchmark], [Navi]
- 📖 TLDR: This paper introduces the Windows Agent Arena (WAA), a scalable platform for testing and benchmarking multi-modal AI agents within a realistic Windows OS environment. WAA enables researchers to evaluate agentic workflows across diverse tasks and supports large-scale deployment using Azure ML. The study also presents Navi, a multi-modal agent achieving a 19.5% success rate on Windows tasks, highlighting the platform's potential for advancing AI agent development.
-
Agent Workflow Memory
- Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, Graham Neubig
- 🏛️ Institutions: CMU, MIT
- 📅 Date: September 11, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [framework], [memory], [AWM]
- 📖 TLDR: The paper proposes Agent Workflow Memory (AWM), a method enabling language model-based agents to induce and utilize reusable workflows from past experiences to guide future actions in web navigation tasks. AWM operates in both offline and online settings, significantly improving performance on benchmarks like Mind2Web and WebArena, and demonstrating robust generalization across tasks, websites, and domains.
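The idea of inducing reusable workflows from past experience and retrieving them for new tasks can be pictured with a small data structure. The following is a toy sketch under stated assumptions, not the authors' implementation: the class names, the "copy the successful steps" induction rule, and the word-overlap retrieval are all stand-ins chosen for brevity (the paper induces workflows with a language model).

```python
from dataclasses import dataclass, field


@dataclass
class Workflow:
    """A named, reusable sequence of steps (hypothetical structure)."""
    name: str
    steps: list[str]


@dataclass
class WorkflowMemory:
    workflows: list[Workflow] = field(default_factory=list)

    def induce(self, task: str, actions: list[str], success: bool) -> None:
        """Keep only successful trajectories; here the steps are copied verbatim."""
        if success:
            self.workflows.append(Workflow(name=task, steps=list(actions)))

    def retrieve(self, task: str, k: int = 1) -> list[Workflow]:
        """Rank stored workflows by word overlap with the new task description."""
        query = set(task.lower().split())
        ranked = sorted(
            self.workflows,
            key=lambda w: len(query & set(w.name.lower().split())),
            reverse=True,
        )
        return ranked[:k]


memory = WorkflowMemory()
memory.induce("search for a product on a shop",
              ["click search box", "type query", "press enter"], success=True)
memory.induce("book a flight", ["open flights tab", "fill dates"], success=True)
memory.induce("cancel an order", ["open orders"], success=False)  # dropped: failed

top = memory.retrieve("search for a laptop on a shop")[0]
print(top.name, top.steps)
```

In an online setting the `induce` call would run after each finished task, so later tasks can draw on workflows learned from earlier ones.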
-
From Grounding to Planning: Benchmarking Bottlenecks in Web Agents
- Segev Shlomov, Ben Wiesel, Aviad Sela, Ido Levy, Liane Galanti, Roy Abitbol
- 🏛️ Institutions: IBM
- 📅 Date: September 3, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [planning], [grounding], [Mind2Web dataset], [web navigation]
- 📖 TLDR: This paper analyzes performance bottlenecks in web agents by separately evaluating grounding and planning tasks, isolating their individual impacts on navigation efficacy. Using an enhanced version of the Mind2Web dataset, the study reveals planning as a significant bottleneck, with advancements in grounding and task-specific benchmarking for elements like UI component recognition. Through experimental adjustments, the authors propose a refined evaluation framework, aiming to enhance web agents' contextual adaptability and accuracy in complex web environments.
-
TinyAgent: Function Calling at the Edge
- Lutfi Eren Erdogan, Nicholas Lee, Siddharth Jha, Sehoon Kim, Ryan Tabrizi, Suhong Moon, Coleman Hooper, Gopala Anumanchipalli, Kurt Keutzer, Amir Gholami
- 🏛️ Institutions: UC Berkeley, ICSI
- 📅 Date: September 1, 2024
- 📑 Publisher: EMNLP 2024
- 💻 Env: [Desktop]
- 🔑 Key: [framework], [dataset], [quantization], [LLMCompiler], [TinyAgent-1.1B], [TinyAgent-7B]
- 📖 TLDR: This paper introduces TinyAgent, an end-to-end framework for training and deploying task-specific small language model agents capable of function calling at the edge. By fine-tuning small models with curated datasets and employing techniques like quantization and a novel tool retrieval method, TinyAgent enables efficient, real-time execution of user commands on local devices without relying on cloud infrastructure. The framework demonstrates that these small models can match or even surpass the function-calling capabilities of larger models like GPT-4-Turbo while operating entirely on edge devices.
-
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
- Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, Rafael Rafailov
- 🏛️ Institutions: MultiOn, Stanford
- 📅 Date: August 13, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [framework], [MCTS], [Tree Search], [DPO], [Reinforcement Learning]
- 📖 TLDR: TBD
-
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
- Xiao Liu, Tianjie Zhang, Yu Gu, Iat Long Iong, Yifan Xu, Xixuan Song, Shudan Zhang, Hanyu Lai, Xinyi Liu, Hanlin Zhao, Jiadai Sun, Xinyue Yang, Yu Yang, Zehan Qi, Shuntian Yao, Xueqiao Sun, Siyi Cheng, Qinkai Zheng, Hao Yu, Hanchen Zhang, Wenyi Hong, Ming Ding, Lihang Pan, Xiaotao Gu, Aohan Zeng, Zhengxiao Du, Chan Hee Song, Yu Su, Yuxiao Dong, Jie Tang
- 🏛️ Institutions: Tsinghua University, MSRA, The Ohio State University
- 📅 Date: August 12, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [benchmark], [dataset], [VisualAgentBench], [VAB]
- 📖 TLDR: The authors introduce VisualAgentBench (VAB), a comprehensive benchmark designed to train and evaluate large multimodal models (LMMs) as visual foundation agents across diverse scenarios, including embodied tasks, graphical user interfaces, and visual design. VAB comprises five distinct environments that systematically challenge LMMs' understanding and interaction capabilities. Additionally, the benchmark offers supervised fine-tuning trajectory data for behavior cloning training, demonstrating the potential to improve open LMMs for serving as visual foundation agents.
-
AppAgent v2: Advanced Agent for Flexible Mobile Interactions
- Yanda Li, Chi Zhang, Wanqi Yang, Bin Fu, Pei Cheng, Xin Chen, Ling Chen, Yunchao Wei
- 🏛️ Institutions: University of Technology Sydney, Tencent, Beijing Jiaotong University, Westlake University
- 📅 Date: August 5, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [framework], [AppAgent v2]
- 📖 TLDR: This work presents AppAgent v2, a novel LLM-based multimodal agent framework for mobile devices capable of navigating applications by emulating human-like interactions such as tapping and swiping. The agent constructs a flexible action space that enhances adaptability across various applications, including parsing text and vision descriptions. It operates through two main phases: exploration and deployment, utilizing retrieval-augmented generation (RAG) technology to efficiently retrieve and update information from a knowledge base, thereby empowering the agent to perform tasks effectively and accurately.
-
CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation
- Xinbei Ma, Zhuosheng Zhang, Hai Zhao
- 🏛️ Institutions: SJTU
- 📅 Date: August 2024
- 📑 Publisher: ACL 2024
- 💻 Env: [Mobile]
- 🔑 Key: [model], [framework], [benchmark]
- 📖 TLDR: This paper presents CoCo-Agent, a multimodal large language model (MLLM) designed for smartphone GUI automation. It introduces two novel approaches: Comprehensive Environment Perception (CEP) for enhanced GUI understanding, and Conditional Action Prediction (CAP) to improve action response accuracy. The proposed agent achieves state-of-the-art performance on GUI automation benchmarks such as AITW and META-GUI, showcasing its capabilities in realistic scenarios.
-
Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions
- Xinbei Ma, Yiting Wang, Yao Yao, Tongxin Yuan, Aston Zhang, Zhuosheng Zhang, Hai Zhao
- 🏛️ Institutions: SJTU, Meta
- 📅 Date: August 2024
- 📑 Publisher: arXiv
- 💻 Env: [Misc]
- 🔑 Key: [multimodal agents], [environmental distractions], [robustness]
- 📖 TLDR: This paper highlights the vulnerability of multimodal agents to environmental distractions. The researchers demonstrate that these agents, which process multiple types of input (e.g., text, images, audio), can be significantly impacted by irrelevant or misleading environmental cues. The study provides insights into the limitations of current multimodal systems and emphasizes the need for more robust architectures that can filter out distractions and maintain focus on relevant information in complex, real-world environments.
-
OmniParser for Pure Vision Based GUI Agent
- Yadong Lu, Jianwei Yang, Yelong Shen, Ahmed Awadallah
- 🏛️ Institutions: MSR, Microsoft Gen AI
- 📅 Date: August 1, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [framework], [dataset], [OmniParser]
- 📖 TLDR: This paper introduces OmniParser, a method for parsing user interface screenshots into structured elements, enhancing the ability of models like GPT-4V to generate actions accurately grounded in corresponding UI regions. The authors curated datasets for interactable icon detection and icon description, fine-tuning models to parse interactable regions and extract functional semantics of UI elements.
-
OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation
- Zilong Wang, Yuedong Cui, Li Zhong, Zimin Zhang, Da Yin, Bill Yuchen Lin, Jingbo Shang
- 🏛️ Institutions: UCSD, UCLA, AI2
- 📅 Date: July 26, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Desktop]
- 🔑 Key: [benchmark], [multi-application], [office automation]
- 📖 TLDR: OfficeBench introduces a benchmark that evaluates language models' ability to automate office tasks across a range of applications like Word, Excel, and email. The benchmark tests agents' skills in task-switching, planning, and decision-making by simulating realistic office workflows. Current models, including GPT-4, demonstrate significant gaps in task accuracy and efficiency, revealing areas for improvement in managing complex, multi-application tasks in office environments.
-
Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems
- Aditya Vempaty, et al.
- 🏛️ Institutions: Emergence AI
- 📅 Date: July 17, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [framework], [autonomous web navigation], [hierarchical architecture], [DOM distillation]
- 📖 TLDR: This paper presents Agent-E, a novel web agent that introduces several architectural improvements over previous state-of-the-art systems. Key features include a hierarchical architecture, flexible DOM distillation and denoising methods, and a "change observation" concept for improved performance. Agent-E outperforms existing text and multi-modal web agents by 10-30% on the WebVoyager benchmark. The authors synthesize their findings into general design principles for developing agentic systems, including the use of domain-specific primitive skills, hierarchical architectures, and agentic self-improvement.
-
Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?
- Ruisheng Cao, Fangyu Lei, Haoyuan Wu, Jixuan Chen, Yeqiao Fu, Hongcheng Gao, Xinzhuang Xiong, Hanchong Zhang, Yuchen Mao, Wenjing Hu, Tianbao Xie, Hongsheng Xu, Danyang Zhang, Sida Wang, Ruoxi Sun, Pengcheng Yin, Caiming Xiong, Ansong Ni, Qian Liu, Victor Zhong, Lu Chen, Kai Yu, Tao Yu
- 🏛️ Institutions: HKU, SJTU, Google Cloud AI Research, Google DeepMind, Salesforce Research, Yale University, Sea AI Lab, University of Waterloo
- 📅 Date: July 15, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Desktop]
- 🔑 Key: [benchmark], [dataset], [data science], [engineering workflows], [Spider2-V]
- 📖 TLDR: This paper introduces Spider2-V, a multimodal agent benchmark designed to evaluate the capability of agents in automating professional data science and engineering workflows. It comprises 494 real-world tasks across 20 enterprise-level applications, assessing agents' proficiency in code generation and GUI operations within authentic computer environments.
-
AUITestAgent: Automatic Requirements Oriented GUI Function Testing
- Yongxiang Hu, Xuan Wang, Yingchuan Wang, Yu Zhang, Shiyu Guo, Chaoyi Chen, Xin Wang, Yangfan Zhou
- 🏛️ Institutions: Fudan University, Meituan
- 📅 Date: July 12, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [framework], [GUI testing], [AUITestAgent]
- 📖 TLDR: This paper presents AUITestAgent, the first automatic, natural language-driven GUI testing tool for mobile apps. It automates the entire process of GUI interaction and function verification by extracting GUI interactions from test requirements via dynamically organized agents and employing a multi-dimensional data extraction strategy for verification.
-
Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence
- Weize Chen, Ziming You, Ran Li, Yitong Guan, Chen Qian, Chenyang Zhao, Cheng Yang, Ruobing Xie, Zhiyuan Liu, Maosong Sun
- ποΈ Institutions: Tsinghua University, Peking University, BUPT, Tencent
- π Date: July 7, 2024
- π Publisher: arXiv
- π» Env: [Misc]
- π Key: [framework], [IoA]
- π TLDR: The paper proposes the Internet of Agents (IoA), a framework inspired by the Internet to facilitate collaboration among diverse autonomous agents. IoA introduces an agent integration protocol, dynamic teaming mechanisms, and conversation flow control, enabling flexible and scalable multi-agent collaboration. Experiments demonstrate IoA's superior performance across various tasks, highlighting its effectiveness in integrating heterogeneous agents.
-
WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks
- Léo Boisvert, Megh Thakkar, Maxime Gasse, Massimo Caccia, Thibault Le Sellier De Chezelles, Quentin Cappart, Nicolas Chapados, Alexandre Lacoste, Alexandre Drouin
- ποΈ Institutions: ServiceNow Research, Mila, Polytechnique Montréal, Université de Montréal
- π Date: July 7, 2024
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [benchmark], [planning], [reasoning], [WorkArena++]
- π TLDR: This paper introduces WorkArena++, a benchmark comprising 682 tasks that simulate realistic workflows performed by knowledge workers. It evaluates web agents' capabilities in planning, problem-solving, logical/arithmetic reasoning, retrieval, and contextual understanding. The study reveals challenges faced by current large language models and vision-language models in serving as effective workplace assistants, providing a resource to advance autonomous agent development.
-
MobileFlow: A Multimodal LLM For Mobile GUI Agent
- Songqin Nong, Jiali Zhu, Rui Wu, Jiongchao Jin, Shuo Shan, Xiutian Huang, Wenhao Xu
- ποΈ Institutions: Ant Group
- π Date: July 5, 2024
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [model], [framework], [MobileFlow]
- π TLDR: This paper introduces MobileFlow, a multimodal large language model tailored for mobile GUI agents. With approximately 21 billion parameters and hybrid visual encoders, it supports variable image resolutions and multilingual GUIs, enhancing the model's ability to interpret image data and comprehend user instructions for GUI interaction tasks.
-
MobileExperts: A Dynamic Tool-Enabled Agent Team in Mobile Devices
- Jiayi Zhang, Chuang Zhao, Yihan Zhao, Zhaoyang Yu, Ming He, Jianping Fan
- ποΈ Institutions: HKUST, Ant Group
- π Date: July 4, 2024
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [framework], [tool formulation], [multi-agent collaboration], [MobileExperts]
- π TLDR: This paper introduces MobileExperts, a framework that enhances autonomous operations on mobile devices by dynamically assembling agent teams based on user requirements. Each agent independently explores and formulates tools to evolve into an expert, improving efficiency and reducing reasoning costs.
-
CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents
- Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Philip Torr, Bernard Ghanem, Guohao Li
- ποΈ Institutions: KAUST, UTokyo, CMU, Stanford, Harvard, Tsinghua University, SUSTech, Oxford
- π Date: July 3, 2024
- π Publisher: arXiv
- π» Env: [GUI]
- π Key: [benchmark], [framework], [evaluation], [CRAB]
- π TLDR: The authors present CRAB, a benchmark framework designed to evaluate Multimodal Language Model agents across multiple environments. It features a graph-based fine-grained evaluation method and supports automatic task generation, addressing limitations in existing benchmarks.
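
As a rough illustration of what a graph-based, fine-grained evaluation can look like, the sketch below models a task as a small DAG of sub-goal checks and reports the fraction of completed nodes. The node names, the `state` dictionary, and the `completion_ratio` helper are hypothetical and are not taken from the CRAB codebase.

```python
from graphlib import TopologicalSorter

def completion_ratio(checks, deps, state):
    """Return the fraction of sub-goals satisfied, honoring dependencies.

    checks: node name -> predicate over the environment state
    deps:   node name -> set of prerequisite node names
    """
    done = set()
    # static_order() yields every node after all of its prerequisites.
    for node in TopologicalSorter(deps).static_order():
        if deps.get(node, set()) <= done and checks[node](state):
            done.add(node)
    return len(done) / len(checks)

# Hypothetical cross-device task: copy a note from a phone into a desktop file.
checks = {
    "open_notes": lambda s: s["phone_app"] == "notes",
    "copy_text": lambda s: s["clipboard"] == "hello",
    "save_file": lambda s: s["desktop_file"] == "hello",
}
deps = {
    "open_notes": set(),
    "copy_text": {"open_notes"},
    "save_file": {"copy_text"},
}

state = {"phone_app": "notes", "clipboard": "hello", "desktop_file": ""}
print(completion_ratio(checks, deps, state))
```

A partial score like this distinguishes an agent that finished two of three sub-goals from one that made no progress, which a single pass/fail flag cannot do.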
-
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents
- Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Dingyu Zhang, Peng Gao, Shuai Ren, Hongsheng Li
- ποΈ Institutions: CUHK, SJTU, Shanghai AI Lab, vivo AI Lab
- π Date: July 3, 2024
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [dataset], [benchmark], [AMEX]
- π TLDR: This paper introduces the Android Multi-annotation EXpo (AMEX), a comprehensive dataset designed for training and evaluating mobile GUI-control agents. AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, annotated at multiple levels, including GUI interactive element grounding, functionality descriptions, and complex natural language instructions. The dataset aims to advance research on AI agents capable of completing complex tasks by interacting directly with mobile device GUIs.
-
Vision-driven Automated Mobile GUI Testing via Multimodal Large Language Model
- Zhe Liu, Cheng Li, Chunyang Chen, Junjie Wang, Boyu Wu, Yawen Wang, Jun Hu, Qing Wang
- ποΈ Institutions: Institute of Software, Chinese Academy of Sciences, Monash University, Beijing Institute of Technology, University of Chinese Academy of Sciences
- π Date: July 3, 2024
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [framework], [VisionDroid]
- π TLDR: The paper presents VisionDroid, a vision-driven automated GUI testing approach utilizing Multimodal Large Language Models (MLLM) to detect non-crash functional bugs in mobile applications. By extracting GUI text information and aligning it with screenshots, VisionDroid enables MLLM to understand GUI context, facilitating deeper and function-oriented exploration. The approach segments exploration history into logically cohesive parts, prompting MLLM for bug detection, demonstrating superior performance over existing methods.
-
Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding
- Yue Fan, Lei Ding, Ching-Chen Kuo, Shan Jiang, Yang Zhao, Xinze Guan, Jie Yang, Yi Zhang, Xin Eric Wang
- ποΈ Institutions: UCSC, MSR
- π Date: June 27, 2024
- π Publisher: arXiv
- π» Env: [GUI]
- π Key: [framework], [dataset], [ToL], [screen reading], [accessibility]
- π TLDR: The authors propose the Tree-of-Lens (ToL) agent to address the Screen Point-and-Read (ScreenPR) task, which involves generating natural language descriptions of screen regions based on user-indicated points. The ToL agent constructs a Hierarchical Layout Tree to comprehend the content and articulate the layout and spatial relationships between elements. The authors also introduce the ScreenPR benchmark, consisting of 650 screenshots from web, mobile, and operating system GUIs, manually annotated with 1,500 target points and regions.
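
The idea of a hierarchical layout tree can be sketched as nested screen regions, where a user-indicated point is resolved to a path from the full screen down to the smallest enclosing region. This is a minimal sketch under assumed names (`Region`, `lens_path`) and hand-written bounding boxes; it does not reproduce how the ToL agent actually builds its tree.

```python
from dataclasses import dataclass, field

@dataclass
class Region:
    name: str
    box: tuple  # (left, top, right, bottom) in pixels
    children: list = field(default_factory=list)

    def contains(self, x, y):
        left, top, right, bottom = self.box
        return left <= x < right and top <= y < bottom

def lens_path(root, x, y):
    """Return region names from the outermost to the innermost one containing (x, y)."""
    if not root.contains(x, y):
        return []
    path = [root.name]
    for child in root.children:
        sub = lens_path(child, x, y)
        if sub:
            return path + sub
    return path

# Hypothetical phone screen with a toolbar and a content area.
screen = Region("screen", (0, 0, 1080, 1920), [
    Region("toolbar", (0, 0, 1080, 200), [Region("back_button", (0, 0, 200, 200))]),
    Region("content", (0, 200, 1080, 1920)),
])

print(lens_path(screen, 50, 50))
```

Each prefix of the returned path corresponds to a progressively tighter view of the screen, which is the kind of context a description of the pointed region can draw on.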
-
Octo-planner: On-device Language Model for Planner-Action Agents
- Nexa AI Team
- ποΈ Institutions: Nexa AI
- π Date: June 26, 2024
- π Publisher: arXiv
- π» Env: [Misc]
- π Key: [model], [framework], [Octo-planner], [on-device], [planning]
- π TLDR: This paper presents Octo-planner, an on-device planning model designed for the Planner-Action Agents Framework. Octo-planner utilizes a fine-tuned model based on Phi-3 Mini (3.8 billion parameters) for high efficiency and low power consumption. It separates planning and action execution into two distinct components: a planner agent optimized for edge devices and an action agent using the Octopus model for function execution. The model achieves a planning success rate of 98.1% on benchmark datasets, providing reliable and effective performance.
-
VGA: Vision GUI Assistant -- Minimizing Hallucinations through Image-Centric Fine-Tuning
- Ziyang Meng, Yu Dai, Zezheng Gong, Shaoxiong Guo, Minglong Tang, Tongquan Wei
- ποΈ Institutions: SJTU
- π Date: June 20, 2024
- π Publisher: arXiv
- π» Env: [GUI]
- π Key: [model], [dataset], [framework], [VGA], [hallucination]
- π TLDR: This paper introduces VGA, a fine-tuned model designed to enhance GUI comprehension by reducing hallucinations. The authors constructed a Vision Question Answering (VQA) dataset of 63.8k high-quality examples using a Referent Method, ensuring model responses are highly dependent on visual content. They also propose a two-stage fine-tuning method called Foundation and Advanced Comprehension (FAC) to improve the model's ability to extract information from images and align with human intent.
-
E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion
- Ke Wang, Tianyu Xia, Zhangxuan Gu, Yi Zhao, Shuheng Shen, Changhua Meng, Weiqiang Wang, Ke Xu
- ποΈ Institutions: Ant Group, Tsinghua University
- π Date: June 20, 2024
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [dataset], [benchmark], [E-ANT]
- π TLDR: This paper introduces E-ANT, the first large-scale Chinese GUI navigation dataset comprising over 40,000 real human interaction traces across more than 5,000 tiny apps. The dataset includes high-quality screenshots with annotations, facilitating the evaluation and development of GUI navigation and decision-making capabilities in multimodal large language models (MLLMs). The authors also assess various MLLMs on E-ANT, providing insights into their performance and potential improvements.
-
Identifying User Goals from UI Trajectories
- Omri Berkovitch, Sapir Caduri, Noam Kahlon, Anatoly Efros, Avi Caciularu, Ido Dagan
- ποΈ Institutions: Google Research, Bar-Ilan University
- π Date: June 20, 2024
- π Publisher: arXiv
- π» Env: [GUI]
- π Key: [evaluation metric], [intent identification]
- π TLDR: This paper introduces the task of goal identification from observed UI trajectories, aiming to infer the user's intended task based on their GUI interactions. It proposes a novel evaluation metric to assess whether two task descriptions are paraphrases within a specific UI environment. Experiments utilizing the Android-In-The-Wild and Mind2Web datasets reveal that state-of-the-art models, such as GPT-4 and Gemini-1.5 Pro, underperform compared to humans, indicating significant room for improvement.
-
VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought
- Gabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, Katerina Fragkiadaki
- ποΈ Institutions: CMU, Google DeepMind
- π Date: June 20, 2024
- π Publisher: NeurIPS 2024
- π» Env: [GUI]
- π Key: [framework], [memory], [in-context learning], [ICAL]
- π TLDR: This paper introduces In-Context Abstraction Learning (ICAL), a method enabling Vision-Language Models (VLMs) to generate their own examples from sub-optimal demonstrations and human feedback. By abstracting trajectories into generalized programs of thought, ICAL enhances decision-making in retrieval-augmented LLM and VLM agents, reducing reliance on manual prompt engineering and improving performance across various tasks.
-
GUI Action Narrator: Where and When Did That Action Take Place?
- Qinchen Wu, Difei Gao, Kevin Qinghong Lin, Zhuoyu Wu, Xiangwu Guo, Peiran Li, Weichen Zhang, Hengxu Wang, Mike Zheng Shou
- ποΈ Institutions: NUS, Chinese Academy of Sciences
- π Date: June 19, 2024
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [dataset], [framework], [Act2Cap], [GUI Narrator]
- π TLDR: The authors present Act2Cap, a GUI action dataset containing 4,189 video-caption pairs depicting various GUI actions such as clicks, drags, and typing across multiple software environments. They also propose GUI Narrator, a framework that leverages cursor detection as a visual prompt to enhance the interpretation of high-resolution screenshots for GUI video captioning. Evaluations reveal that even advanced multimodal models face challenges in this domain, highlighting the need for specialized approaches to improve performance.
-
WebCanvas: Benchmarking Web Agents in Online Environments
- Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, Zhengyang Wu
- ποΈ Institutions: iMean AI, CMU
- π Date: June 18, 2024
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [framework], [dataset], [benchmark], [Mind2Web-Live], [key-node evaluation]
- π TLDR: This paper presents WebCanvas, an online evaluation framework for web agents designed to address the dynamic nature of web interactions. It introduces a key-node-based evaluation metric to capture critical actions or states necessary for task completion while disregarding noise from insignificant events or changed web elements. The framework includes the Mind2Web-Live dataset, a refined version of the original Mind2Web static dataset, containing 542 tasks with 2,439 intermediate evaluation states. Despite advancements, the best-performing model achieves a task success rate of 23.1%, highlighting substantial room for improvement.
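
To make the key-node idea concrete, the sketch below scores a trajectory by how many required intermediate states it reaches in order, ignoring any other pages visited along the way. The predicates, URLs, and the `key_node_score` helper are illustrative assumptions, not the WebCanvas implementation.

```python
def key_node_score(key_nodes, trajectory):
    """Count key nodes reached in order along a trajectory of page states.

    Returns (number of key nodes reached, whether all were reached).
    """
    reached = 0
    for state in trajectory:
        # One state may satisfy several consecutive key nodes.
        while reached < len(key_nodes) and key_nodes[reached](state):
            reached += 1
    return reached, reached == len(key_nodes)

# Hypothetical shopping task with three key nodes.
key_nodes = [
    lambda s: "/search" in s["url"],
    lambda s: "/product/" in s["url"],
    lambda s: s["clicked"] == "add-to-cart",
]

trajectory = [
    {"url": "https://shop.example/", "clicked": None},
    {"url": "https://shop.example/search?q=lamp", "clicked": None},
    {"url": "https://shop.example/promo", "clicked": None},  # noise, ignored
    {"url": "https://shop.example/product/42", "clicked": None},
]

print(key_node_score(key_nodes, trajectory))
```

Because only the key nodes are checked, a detour through an unrelated page or a renamed element does not change the score.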
-
Adversarial Attacks on Multimodal Agents
- Chen Henry Wu, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, Aditi Raghunathan
- ποΈ Institutions: CMU
- π Date: June 18, 2024
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [benchmark], [safety], [VisualWebArena-Adv]
- π TLDR: This paper investigates the safety risks posed by multimodal agents built on vision-enabled language models (VLMs). The authors introduce two adversarial attack methods: a captioner attack targeting white-box captioners and a CLIP attack that transfers to proprietary VLMs. To evaluate these attacks, they curated VisualWebArena-Adv, a set of adversarial tasks based on VisualWebArena. The study demonstrates that within a limited perturbation norm, the captioner attack can achieve a 75% success rate in making a captioner-augmented GPT-4V agent execute adversarial goals. The paper also discusses the robustness of agents based on other VLMs and provides insights into factors contributing to attack success and potential defenses.
-
GUICourse: From General Vision Language Models to Versatile GUI Agents
- Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, Yuan Yao, Yankai Lin, Zhiyuan Liu, Maosong Sun
- ποΈ Institutions: Tsinghua University, Rhapsody AI, University of Electronic Science and Technology of China
- π Date: June 17, 2024
- π Publisher: arXiv
- π» Env: [GUI]
- π Key: [dataset], [framework], [GUICourse]
- π TLDR: This paper introduces GUICourse, a suite of datasets aimed at training visual-based GUI agents from general vision-language models. It addresses challenges in OCR, grounding, and GUI knowledge, enhancing the models' capabilities in GUI navigation tasks.
-
GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents
- Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, Tianshuo Zhou, Yue Yu, Chujie Gao, Qihui Zhang, Yi Gui, Zhen Li, Yao Wan, Pan Zhou, Jianfeng Gao, Lichao Sun
- ποΈ Institutions: Huazhong University of Science and Technology (HUST), MSR, University of Illinois at Chicago (UIC)
- π Date: June 16, 2024
- π Publisher: arXiv
- π» Env: [GUI]
- π Key: [dataset], [benchmark], [GUI-World], [GUI-Vid]
- π TLDR: This paper introduces GUI-World, a comprehensive dataset designed to evaluate Multimodal Large Language Models (MLLMs) in dynamic and complex GUI environments. It includes over 12,000 annotated GUI interaction videos covering diverse applications and scenarios. The study highlights the limitations of current MLLMs in handling dynamic and multi-step tasks and presents GUI-Vid, a fine-tuned VideoLLM, demonstrating improved understanding of various GUI tasks.
-
DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning
- Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, Aviral Kumar
- ποΈ Institutions: UC Berkeley, UIUC, Google DeepMind
- π Date: June 14, 2024
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [framework], [reinforcement learning], [DigiRL]
- π TLDR: The authors present DigiRL, an autonomous reinforcement learning approach for training device-control agents. By fine-tuning a pre-trained vision-language model in two stages, offline RL followed by offline-to-online RL, DigiRL achieves a significant improvement in success rates on the Android-in-the-Wild dataset, establishing a new state-of-the-art for digital agents in device control.
-
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
- Quanfeng Lu, Wenqi Shao, Zitao Liu, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, Yu Qiao, Ping Luo
- ποΈ Institutions: OpenGVLab, Shanghai AI Laboratory, HKU, Nanjing University, Harbin Institute of Technology, Shenzhen, SJTU
- π Date: June 13, 2024
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [dataset], [model], [OdysseyAgent], [cross-app navigation]
- π TLDR: This paper presents GUI Odyssey, a dataset comprising 7,735 episodes from six mobile devices, designed to train and evaluate cross-app navigation agents. It spans six types of cross-app tasks across 201 apps and 1,399 app combinations. Leveraging this dataset, the authors developed OdysseyAgent, a multimodal cross-app navigation agent fine-tuned from the Qwen-VL model, demonstrating superior accuracy over existing models in both in-domain and out-of-domain scenarios.
-
Practical, Automated Scenario-based Mobile App Testing
- Shengcheng Yu, Chunrong Fang, Mingzhe Du, Zimin Ding, Zhenyu Chen, Zhendong Su
- ποΈ Institutions: Nanjing University, ETH Zurich
- π Date: June 12, 2024
- π Publisher: IEEE Transactions on Software Engineering
- π» Env: [Mobile]
- π Key: [framework], [ScenTest], [event knowledge graph], [GUI image understanding]
- π TLDR: This paper introduces ScenTest, a novel approach for scenario-based mobile app testing that integrates event knowledge graphs (EKGs) with GUI image understanding. By extracting entities and relationships from crowdsourced test reports, ScenTest constructs EKGs for specific scenarios, guiding automated testing processes. This method bridges the gap between testing execution and app business logic, achieving fully automated testing on target scenarios for the first time.
-
MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents
- Luyuan Wang, Yongyu Deng, Yiwei Zha, Guodong Mao, Qinmin Wang, Tianchen Min, Wei Chen, Shoufa Chen
- ποΈ Institutions: CMU, University of Michigan, Northeastern University, HKU
- π Date: June 12, 2024
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [benchmark], [MobileAgentBench]
- π TLDR: This paper introduces MobileAgentBench, a benchmark designed to evaluate the performance of large language model-based mobile agents. It defines 100 tasks across 10 open-source apps, categorized by difficulty levels, and assesses existing agents like AppAgent and MobileAgent to facilitate systematic comparisons.
-
On the Effects of Data Scale on UI Control Agents
- Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, Oriana Riva
- ποΈ Institutions: Google DeepMind, Google
- π Date: June 6, 2024
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [dataset], [AndroidControl], [fine-tuning], [scalability]
- π TLDR: This study investigates how the performance of computer control agents scales with the amount of fine-tuning data. The authors introduce AndroidControl, a dataset comprising 15,283 demonstrations across 833 Android applications. Findings indicate that while in-domain performance improves with more data, out-of-domain performance, especially on high-level tasks, scales more slowly, suggesting that fine-tuning alone may be insufficient for robust out-of-domain performance.
-
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration
- Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang
- ποΈ Institutions: Alibaba Group, Beijing University of Posts and Telecommunications
- π Date: June 3, 2024
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [framework], [multi-agent], [planning], [decision-making], [reflection]
- π TLDR: The paper presents Mobile-Agent-v2, a multi-agent architecture designed to assist with mobile device operations. It comprises three agents: a planning agent that generates task progress, a decision agent that navigates tasks using a memory unit, and a reflection agent that corrects erroneous operations. This collaborative approach addresses challenges in navigation and long-context input scenarios, achieving over a 30% improvement in task completion compared to single-agent architectures.
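
The division of labor among the three roles can be pictured as a simple control loop. The toy sketch below replaces every model call with plain Python, so the function names, the `execute` callback, and the retry-on-failure policy are all assumptions made for illustration rather than the Mobile-Agent-v2 implementation.

```python
def run_episode(actions, execute, max_steps=10):
    """Toy plan/decide/reflect loop; each role stands in for a model call."""
    progress = []   # planning role: compact summary of completed steps
    memory = {}     # memory unit: task-relevant content from earlier screens
    pending = list(actions)
    for _ in range(max_steps):
        if not pending:
            break
        action = pending[0]             # decision role: choose the next operation
        ok, observed = execute(action)  # act on the (simulated) device
        if ok:                          # reflection role: did the step succeed?
            pending.pop(0)
            progress.append(action)     # planning role: update task progress
            memory.update(observed)
        # On failure the action stays pending and is retried on the next step.
    return progress, memory

# Simulated device on which the first tap on "send" has no effect.
attempts = {}

def flaky_execute(action):
    attempts[action] = attempts.get(action, 0) + 1
    if action == "tap_send" and attempts[action] == 1:
        return False, {}
    return True, {action: "done"}

progress, memory = run_episode(["open_chat", "type_message", "tap_send"], flaky_execute)
print(progress, attempts["tap_send"])
```

Keeping `progress` as a short summary instead of the full interaction history is one way to read the paper's point about long-context inputs: the decision step only needs to see what has been accomplished so far.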
-
WebSuite: Systematically Evaluating Why Web Agents Fail
- Eric Li, Jim Waldo
- ποΈ Institutions: Harvard
- π Date: June 1, 2024
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [benchmark], [framework], [failure analysis], [analysis], [task disaggregation]
- π TLDR: This paper introduces WebSuite, a diagnostic benchmark to investigate the causes of web agent failures. By categorizing agent tasks using a taxonomy of operational, informational, and navigational actions, WebSuite offers granular insights into the specific actions where agents struggle, like filtering or form completion. It enables detailed comparison across agents, identifying areas for architectural and UX adaptation to improve agent reliability and task success on the web.
-
VideoGUI: A Benchmark for GUI Automation from Instructional Videos
- Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen Wu, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou
- ποΈ Institutions: NUS, Microsoft Gen AI
- π Date: June 2024
- π Publisher: NeurIPS 2024
- π» Env: [Desktop, Web]
- π Key: [benchmark], [instructional videos], [visual planning], [hierarchical task decomposition], [complex software interaction]
- π TLDR: VideoGUI presents a benchmark for evaluating GUI automation on tasks derived from instructional videos, focusing on visually intensive applications like Adobe Photoshop and video editing software. The benchmark includes 178 tasks, with a hierarchical evaluation method distinguishing high-level planning, mid-level procedural steps, and precise action execution. VideoGUI reveals current model limitations in complex visual tasks, marking a significant step toward improved visual planning in GUI automation.
-
Visual Grounding for User Interfaces
- Yijun Qian, Yujie Lu, Alexander Hauptmann, Oriana Riva
- ποΈ Institutions: CMU, UCSB
- π Date: June 2024
- π Publisher: NAACL 2024
- π» Env: [GUI]
- π Key: [framework], [visual grounding], [UI element localization], [LVG]
- π TLDR: This work introduces the task of visual UI grounding, which unifies detection and grounding by enabling models to identify UI elements referenced by natural language commands solely from visual input. The authors propose LVG, a model that outperforms baselines pre-trained on larger datasets by over 4.9 points in top-1 accuracy, demonstrating its effectiveness in localizing referenced UI elements without relying on UI metadata.
-
Large Language Models Can Self-Improve At Web Agent Tasks
- Ajay Patel, Markus Hofmarcher, Claudiu Leoveanu-Condrei, Marius-Constantin Dinu, Chris Callison-Burch, Sepp Hochreiter
- ποΈ Institutions: University of Pennsylvania, ExtensityAI, Johannes Kepler University Linz, NXAI
- π Date: May 30, 2024
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [self-improvement], [self-improve]
- π TLDR: This paper investigates the ability of large language models (LLMs) to enhance their performance as web agents through self-improvement. Utilizing the WebArena benchmark, the authors fine-tune LLMs on synthetic training data, achieving a 31% improvement in task completion rates. They also introduce novel evaluation metrics to assess the performance, robustness, and quality of the fine-tuned agents' trajectories.
-
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
- Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, Oriana Riva
- ποΈ Institutions: Google DeepMind, Google
- π Date: May 23, 2024
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [benchmark], [Android-based agents], [task diversity], [reinforcement learning], [dynamic environment]
- π TLDR: AndroidWorld introduces a dynamic Android environment for benchmarking autonomous agents across 116 tasks spanning 20 Android apps. These tasks vary through parameterized and natural language prompts, fostering a realistic testing ground for agents designed to operate in complex mobile environments. The benchmark supports millions of task variations, allowing agents to respond to the Android system's changing states and improving real-world applicability.
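
Parameterized tasks can be sketched as a prompt template plus randomly sampled arguments, with success judged from the resulting device state. Everything below (the template text, the `sample_task` and `is_success` helpers, the contact dictionary) is a hypothetical example and not part of the AndroidWorld API.

```python
import random

TEMPLATE = "Create a contact named {name} with phone number {phone}."
NAMES = ["Ana", "Bo", "Chen"]

def sample_task(seed):
    """Instantiate the template with parameters drawn from a seeded RNG."""
    rng = random.Random(seed)
    params = {
        "name": rng.choice(NAMES),
        "phone": f"555-{rng.randrange(10000):04d}",
    }
    return TEMPLATE.format(**params), params

def is_success(device_contacts, params):
    """Check the final device state rather than the action sequence."""
    return device_contacts.get(params["name"]) == params["phone"]

prompt, params = sample_task(seed=0)
print(prompt)
print(is_success({params["name"]: params["phone"]}, params))
```

Seeding the generator keeps each variation reproducible, while changing the seed yields a new instance of the same task.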
-
Unveiling Disparities in Web Task Handling Between Human and Web Agent
- Kihoon Son, Jinhyeon Kwon, DaEun Choi, Tae Soo Kim, Young-Ho Kim, Sangdoo Yun, Juho Kim
- ποΈ Institutions: KAIST, Seoul National University
- π Date: May 7, 2024
- π Publisher: CHI 2024 Workshop
- π» Env: [Web]
- π Key: [framework], [cognitive comparison], [task analysis]
- π TLDR: This paper examines how humans and web agents differ in handling web-based tasks, focusing on key aspects such as planning, action-taking, and reflection. Using a think-aloud protocol, the study highlights the cognitive processes humans employ, like exploration and adjustment, versus the more rigid task execution patterns observed in web agents. The authors identify several limitations in current web agents, proposing the need for improved frameworks to enhance adaptability and knowledge update mechanisms in agent-based systems.
-
Navigating WebAI: Training Agents to Complete Web Tasks with Large Language Models and Reinforcement Learning
- Lucas-Andreï Thil, Mirela Popa, Gerasimos Spanakis
- ποΈ Institutions: Maastricht University, the Netherlands
- π Date: May 1, 2024
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [framework], [large language models], [reinforcement learning]
- π TLDR: This paper proposes a novel approach combining supervised learning (SL) and reinforcement learning (RL) techniques to train web navigation agents using large language models. The authors address limitations in previous models' understanding of HTML content and introduce methods to enhance true comprehension. Their approach, evaluated on the MiniWoB benchmark, outperforms previous SL methods on certain tasks using less data and narrows the performance gap with RL models. The study achieves 43.58% average accuracy in SL and 36.69% when combined with a multimodal RL approach, setting a new direction for future web navigation research.
-
UIClip: A Data-driven Model for Assessing User Interface Design
- Jason Wu, Yi-Hao Peng, Amanda Li, Amanda Swearngin, Jeffrey P. Bigham, Jeffrey Nichols
- ποΈ Institutions: CMU, Apple
- π Date: April 18, 2024
- π Publisher: UIST 2024
- π» Env: [GUI]
- π Key: [model], [UIClip], [vision foundation model], [foundation model]
- π TLDR: This paper introduces UIClip, a machine-learned model that evaluates the design quality and visual relevance of user interfaces by analyzing screenshots and corresponding natural language descriptions. Trained on a large-scale dataset combining automated crawling, synthetic augmentation, and human ratings, UIClip assigns numerical scores representing a UI's relevance and quality, and offers design suggestions. Evaluations show that UIClip's assessments align closely with human designer rankings. The paper also demonstrates UIClip's utility in applications like UI code generation, design tips generation, and quality-aware UI example search.
-
Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent
- Wei Chen, Zhiyuan Li
- ποΈ Institutions: Stanford University
- π Date: April 17, 2024
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [model], [functional token], [on-device AI], [Octopus v3]
- π TLDR: This paper introduces Octopus v3, a compact multimodal AI agent with less than 1 billion parameters, designed for efficient on-device operation. It processes both English and Chinese inputs, integrating visual and textual data to perform tasks such as sending emails, messaging, and online shopping. The model employs a functional token approach to translate image-based data into actionable outcomes, demonstrating high accuracy and efficiency on edge devices, including Raspberry Pi.
-
Search Beyond Queries: Training Smaller Language Models for Web Interactions via Reinforcement Learning
- Moghis Fereidouni, A.B. Siddique
- ποΈ Institutions: University of Kentucky
- π Date: April 16, 2024
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [framework], [reinforcement learning], [grounded language agent], [Flan-T5], [unsupervised domain adaptation]
- π TLDR: This paper introduces GLAINTEL, a grounded language agent framework designed to enhance web interaction using instruction-finetuned language models, particularly Flan-T5, with reinforcement learning (PPO) to tackle interactive web navigation challenges. The study explores unsupervised and supervised training methods, evaluating the effects of human demonstration on agent performance. Results indicate that combining human feedback with reinforcement learning yields effective outcomes, rivaling larger models like GPT-4 on web navigation tasks.
-
MMInA: Benchmarking Multihop Multimodal Internet Agents
- Ziniu Zhang, Shulin Tian, Liangyu Chen, Ziwei Liu
- ποΈ Institutions: NTU
- π Date: April 15, 2024
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [benchmark], [framework], [multihop web browsing], [multimodal tasks], [long-range reasoning]
- π TLDR: The MMInA benchmark is designed to evaluate agents' capacity to complete complex, multihop web tasks by navigating and extracting information across evolving real-world websites. Composed of 1,050 tasks across diverse domains, MMInA challenges agents with realistic, multimodal information retrieval and reasoning tasks, such as comparative shopping and travel inquiries. Despite recent advances, agents show difficulties in handling tasks requiring sequential steps across multiple sites, underscoring the need for enhanced multimodal and memory-augmented models.
-
LlamaTouch: A Faithful and Scalable Testbed for Mobile UI Automation Task Evaluation
- Li Zhang, Shihe Wang, Xianqing Jia, Zhihan Zheng, Yunhe Yan, Longxi Gao, Yuanchun Li, Mengwei Xu
- ποΈ Institutions: BUPT, Tsinghua University
- π Date: April 12, 2024
- π Publisher: UIST 2024
- π» Env: [Mobile]
- π Key: [framework], [dataset], [benchmark], [UI automation], [mobile agent evaluation]
- π TLDR: LlamaTouch is an evaluation testbed designed for mobile UI automation, enabling reliable task assessment across 495 annotated tasks. It provides a scalable solution to evaluate agents in real-world mobile settings, comparing agent actions to essential UI states for accurate task completion. LlamaTouch supports dynamic environments, advancing mobile agent reliability and scalability in task automation.
-
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
- Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, Tao Yu
- ποΈ Institutions: HKU, CMU, Salesforce, University of Waterloo
- π Date: April 11, 2024
- π Publisher: NeurIPS 2024
- π» Env: [GUI]
- π Key: [benchmark], [real computer tasks], [online environment], [online benchmark]
- π TLDR: OSWorld introduces a benchmark for multimodal agents to perform open-ended tasks within real computer environments across platforms like Ubuntu, Windows, and macOS. It includes 369 real-world tasks involving web and desktop apps, file management, and multi-app workflows, with custom evaluation scripts for reproducibility. The results reveal current agents' limitations in GUI interaction and operational knowledge, as they achieve just 12.24% task success compared to humans' 72.36%, highlighting critical gaps for future model improvement.
-
Autonomous Evaluation and Refinement of Digital Agents
- Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, Alane Suhr
- ποΈ Institutions: UCB, UMich
- π Date: April 9, 2024
- π Publisher: COLM 2024
- π» Env: [GUI]
- π Key: [framework], [benchmark], [evaluation model], [domain transfer]
- π TLDR: This paper presents an autonomous evaluation framework for digital agents to enhance performance on web navigation and device control. The study introduces modular, cost-effective evaluators achieving up to 92.9% accuracy in benchmarks like WebArena and outlines their use in fine-tuning agents, improving state-of-the-art by 29% without additional supervision.
-
VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?
- Junpeng Liu, Yifan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, Xiang Yue
- ποΈ Institutions: CMU
- π Date: April 9, 2024
- π Publisher: COLM 2024
- π» Env: [Web]
- π Key: [benchmark], [dataset], [web page understanding], [grounding]
- π TLDR: VisualWebBench introduces a comprehensive benchmark for evaluating multimodal large language models (MLLMs) on web-based tasks. It includes 1.5K human-curated instances across 139 websites in 87 sub-domains. The benchmark spans seven tasks, such as OCR, grounding, and web-based QA, aiming to test MLLMs' capabilities in fine-grained web page understanding. Results reveal significant performance gaps, particularly in grounding tasks, highlighting the need for advancement in MLLM web understanding.
-
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
- Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan
- ποΈ Institutions: Apple
- π Date: April 8, 2024
- π Publisher: ECCV 2024
- π» Env: [Mobile]
- π Key: [model], [framework], [dataset], [benchmark], [mobile UI understanding]
- π TLDR: This paper presents Ferret-UI, a multimodal large language model (MLLM) designed to understand and interact with mobile user interfaces. The model incorporates advanced capabilities for referring, grounding, and reasoning about UI elements. By training on a variety of UI tasks, Ferret-UI achieves high performance in tasks such as icon recognition and text extraction. The authors introduce a unique architecture that allows for improved visual feature extraction from mobile screens, paving the way for applications in accessibility and user interaction.
-
Enhancing Mobile "How-to" Queries with Automated Search Results Verification and Reranking
- Lei Ding, Jeshwanth Bheemanpally, Yi Zhang
- ποΈ Institutions: UCSC
- π Date: April 2024
- π Publisher: SIGIR 2024
- π» Env: [Mobile]
- π Key: [framework], [benchmark], [reranking], [verification], [mobile task automation]
- π TLDR: This paper presents a system that enhances mobile "how-to" queries by verifying and reranking search results through automated instruction extraction, on-device action execution, and reranking based on relevance. The method improves on traditional ranking by analyzing device-specific execution success. The approach comprises a three-stage pipeline: 1) extracting step-by-step instructions from top search results, 2) validating these instructions on mobile devices, and 3) reranking based on performance. The system leverages a pre-trained GPT model for initial processing, ensuring adaptability across diverse apps and systems.
-
Benchmarking Mobile Device Control Agents across Diverse Configurations
- Juyong Lee, Taywon Min, Minyong An, Dongyoon Hahm, Haeone Lee, Changyeon Kim, Kimin Lee
- ποΈ Institutions: KAIST, Seoul National University, Yonsei University
- π Date: April 2024
- π Publisher: ICLR 2024
- π» Env: [Mobile]
- π Key: [benchmark], [dataset], [mobile device control], [agent performance]
- π TLDR: This paper presents B-MoCA, a comprehensive benchmark for evaluating mobile device control agents using an Android-based testbed with 131 tasks and various device configurations. The benchmark assesses agents' abilities across tasks that include device-specific variations, navigation, and human-like dual-gesture interactions. B-MoCA highlights that current agents perform well on basic tasks but struggle with complex configurations, pointing to opportunities for future improvements in mobile automation capabilities.
-
AgentStudio: A Toolkit for Building General Virtual Agents
- Longtao Zheng, Zhiyuan Huang, Zhenghai Xue, Xinrun Wang, Bo An, Shuicheng Yan
- ποΈ Institutions: NTU, Skywork AI, ETH Zurich
- π Date: March 26, 2024
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [framework], [dataset], [general virtual agents], [open-ended learning], [tool creation], [GroundUI], [benchmark]
- π TLDR: AgentStudio is a robust toolkit for developing virtual agents with versatile actions, such as GUI automation and code execution. It unifies real-world human-computer interactions across OS platforms and includes diverse observation and action spaces, facilitating comprehensive training and benchmarking in complex settings. The toolkit's flexibility promotes agent generalization across varied tasks, supporting tool creation and a multimodal interaction interface to advance agent adaptability and learning.
-
WebVLN: Vision-and-Language Navigation on Websites
- Qi Chen, Dileepa Pitawela, Chongyang Zhao, Gengze Zhou, Hsiang-Ting Chen, Qi Wu
- ποΈ Institutions: The University of Adelaide
- π Date: March 24, 2024
- π Publisher: AAAI 2024
- π» Env: [Web]
- π Key: [framework], [dataset], [web-based VLN], [HTML content integration], [multimodal navigation]
- π TLDR: This paper introduces the WebVLN task, where agents navigate websites by following natural language instructions that include questions and descriptions. Aimed at emulating real-world browsing behavior, the task allows the agent to interact with elements not directly visible in the rendered content by integrating HTML-specific information. A new WebVLN-Net model, based on the VLN BERT framework, is introduced alongside the WebVLN-v1 dataset, supporting question-answer navigation across web pages. This framework demonstrated significant improvement over existing web-based navigation methods, marking a new direction in vision-and-language navigation research.
-
Tur[k]ingBench: A Challenge Benchmark for Web Agents
- Kevin Xu, Yeganeh Kordi, Kate Sanders, Yizhong Wang, Adam Byerly, Jingyu Zhang, Benjamin Van Durme, Daniel Khashabi
- ποΈ Institutions: JHU, Brown, UW
- π Date: March 18, 2024
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [benchmark], [dataset], [multi-modal reasoning], [TurkingBench], [Turking]
- π TLDR: This paper introduces Tur[k]ingBench, a benchmark comprising 158 web-grounded tasks designed to evaluate AI agents' capabilities in complex web-based environments. Unlike prior benchmarks that utilize synthetic web pages, Tur[k]ingBench leverages natural HTML pages from crowdsourcing platforms, presenting tasks with rich multi-modal contexts. The benchmark includes 32.2K instances, each with diverse inputs, challenging models to interpret and interact with web pages effectively. Evaluations of state-of-the-art models reveal significant room for improvement, highlighting the need for advanced web-based agents capable of handling real-world web interactions.
-
WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
- Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, Alexandre Lacoste
- ποΈ Institutions: ServiceNow Research, Mila, Polytechnique Montréal, McGill University, Université de Montréal
- π Date: March 11, 2024
- π Publisher: ICML 2024
- π» Env: [Web]
- π Key: [benchmark], [enterprise task automation], [ServiceNow], [knowledge work automation]
- π TLDR: WorkArena introduces a robust benchmark hosted on the ServiceNow platform to assess the effectiveness of large language model-based agents in performing 33 knowledge tasks common to enterprise environments. Leveraging BrowserGym, an environment that simulates complex browser interactions, WorkArena provides web agents with realistic challenges like data entry, form completion, and information retrieval in knowledge bases. Despite promising initial results, open-source models show a 42.7% success rate compared to closed-source counterparts, underlining the current gap in task automation for enterprise applications and highlighting key areas for improvement.
-
Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study
- Weihao Tan, Ziluo Ding, Wentao Zhang, Boyu Li, Bohan Zhou, Junpeng Yue, Haochong Xia, Jiechuan Jiang, Longtao Zheng, Xinrun Xu, Yifei Bi, Pengjie Gu, Xinrun Wang, Börje F. Karlsson, Bo An, Zongqing Lu
- ποΈ Institutions: NTU, BAAI, PKU
- π Date: March 5, 2024
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [framework], [Cradle], [General Computer Control], [multimodal], [keyboard and mouse control], [long-term memory], [reasoning], [self-improvement]
- π TLDR: This paper introduces Cradle, a framework designed to achieve General Computer Control (GCC) by enabling agents to perform any computer task using only screen images (and possibly audio) as input and producing keyboard and mouse operations as output. The authors deploy Cradle in the complex AAA game Red Dead Redemption II, demonstrating its capability to follow the main storyline and complete real missions with minimal reliance on prior knowledge or resources.
-
Cradle: Empowering Foundation Agents Towards General Computer Control
- Weihao Tan, Wentao Zhang, Xinrun Xu, Haochong Xia, Ziluo Ding, Boyu Li, Bohan Zhou, Junpeng Yue, Jiechuan Jiang, Yewen Li, Ruyi An, Molei Qin, Chuqiao Zong, Longtao Zheng, Yujie Wu, Xiaoqiang Chai, Yifei Bi, Tianbao Xie, Pengjie Gu, Xiyun Li, Ceyao Zhang, Long Tian, Chaojie Wang, Xinrun Wang, Börje F. Karlsson, Bo An, Shuicheng Yan, Zongqing Lu
- ποΈ Institutions: Skywork AI, BAAI, NTU, PKU, Institute of Software - Chinese Academy of Sciences, HKU, CUHK
- π Date: March 5, 2024
- π Publisher: TBD
- π» Env: [Desktop]
- π Key: [framework], [model], [general computer control], [skill curation], [self-improvement]
- π TLDR: This paper introduces the Cradle framework, designed to enable general computer control (GCC) through multimodal input (e.g., screen images and optional audio) and outputs (keyboard and mouse). Cradle's six core modules, including self-reflection, skill curation, and memory, allow for generalized task handling in complex environments like AAA games. Demonstrated in Red Dead Redemption II, the framework exhibits adaptability by performing real missions and following the storyline with minimal prior knowledge, showcasing its potential as a generalist agent for diverse computer tasks.
-
Android in the Zoo: Chain-of-Action-Thought for GUI Agents
- Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, Duyu Tang
- ποΈ Institutions: Fudan University, Huawei
- π Date: March 5, 2024
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [framework], [dataset], [Android GUI], [Chain-of-Action-Thought], [autonomous GUI agents]
- π TLDR: This paper introduces Chain-of-Action-Thought (CoAT), a novel paradigm to improve GUI agent task completion by enabling agents to interpret previous actions, current screen content, and action rationale for next steps. The authors present the Android-In-The-Zoo (AitZ) dataset, which includes 18,643 screen-action pairs with detailed annotations, supporting CoAT's development and evaluation. The study demonstrates that fine-tuning with the AitZ dataset improves performance of a baseline large language model in predicting correct action sequences in Android tasks.
-
On the Multi-turn Instruction Following for Conversational Web Agents
- Yang Deng, Xuan Zhang, Wenxuan Zhang, Yifei Yuan, See-Kiong Ng, Tat-Seng Chua
- ποΈ Institutions: NUS, DAMO Academy, University of Copenhagen
- π Date: February 23, 2024
- π Publisher: ACL 2024
- π» Env: [Web]
- π Key: [benchmark], [dataset], [multi-turn dialogue], [memory utilization], [self-reflective planning]
- π TLDR: This paper explores multi-turn conversational web navigation, introducing the MT-Mind2Web dataset to support instruction-following tasks for web agents. The proposed Self-MAP (Self-Reflective Memory-Augmented Planning) framework enhances agent performance by integrating memory with self-reflection for sequential decision-making in complex interactions. Extensive evaluations using MT-Mind2Web demonstrate Self-MAP's efficacy in addressing the limitations of current models in multi-turn interactions, providing a novel dataset and framework for evaluating and training agents on detailed, multi-step web-based tasks.
-
Improving Language Understanding from Screenshots
- Tianyu Gao, Zirui Wang, Adithya Bhaskar, Danqi Chen
- ποΈ Institutions: Princeton University
- π Date: February 22, 2024
- π Publisher: arXiv
- π» Env: [GUI]
- π Key: [model], [framework], [screenshot language models], [patch-and-text prediction]
- π TLDR: This paper introduces a novel approach to improve the language understanding capabilities of screenshot language models (LMs). The authors propose a Patch-and-Text Prediction (PTP) objective, which masks and recovers both image patches and text within screenshots. The method significantly narrows the performance gap between screenshot LMs and text-only models on language understanding tasks, achieving comparable results to BERT on most GLUE tasks. The research also extends PTP to train autoregressive screenshot LMs, demonstrating improved perplexity by utilizing screenshot context.
-
Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents
- Wenkai Yang, Xiaohan Bi, Yankai Lin, Sishuo Chen, Jie Zhou, Xu Sun
- ποΈ Institutions: Renmin University of China, PKU, Tencent
- π Date: February 17, 2024
- π Publisher: arXiv
- π» Env: [GUI], [Misc]
- π Key: [attack], [backdoor], [safety]
- π TLDR: This paper investigates backdoor attacks on LLM-based agents, introducing a framework that categorizes attacks based on outcomes and trigger locations. The study demonstrates the vulnerability of such agents to backdoor attacks and emphasizes the need for targeted defenses.
-
A Trembling House of Cards? Mapping Adversarial Attacks against Language Agents
- Lingbo Mo, Zeyi Liao, Boyuan Zheng, Yu Su, Chaowei Xiao, Huan Sun
- ποΈ Institutions: OSU, UWM
- π Date: February 15, 2024
- π Publisher: arXiv
- π» Env: [Misc]
- π Key: [safety], [adversarial attacks], [security risks], [language agents], [Perception-Brain-Action]
- π TLDR: This paper introduces a conceptual framework to assess and understand adversarial vulnerabilities in language agents, dividing the agent structure into three components: Perception, Brain, and Action. It discusses 12 specific adversarial attack types that exploit these components, ranging from input manipulation to complex backdoor and jailbreak attacks. The framework provides a basis for identifying and mitigating risks before the widespread deployment of these agents in real-world applications.
-
UFO: A UI-Focused Agent for Windows OS Interaction
- Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
- ποΈ Institutions: Microsoft
- π Date: February 14, 2024
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [framework], [UI automation], [Windows], [UFO]
- π TLDR: This paper presents UFO, a pioneering multimodal LLM-based agent designed to fulfill user requests on Windows OS. UFO employs a dual-agent architecture, comprising AppAgent and ActAgent, that can interpret and execute complex tasks across multiple Windows applications by observing UI elements and utilizing control interactions. The framework allows UFO to handle intricate, cross-application workflows and execute commands seamlessly based on natural language prompts. It integrates GPT-Vision to recognize and interact with graphical elements, enabling flexible, autonomous task completion within and across diverse Windows applications.
-
ScreenAgent: A Computer Control Agent Driven by Visual Language Large Model
- Runliang Niu, Jindong Li, Shiqi Wang, Yali Fu, Xiyu Hu, Xueyuan Leng, He Kong, Yi Chang, Qi Wang
- ποΈ Institutions: Jilin University
- π Date: February 13, 2024
- π Publisher: arXiv
- π» Env: [GUI]
- π Key: [framework], [visual language model], [computer control agent]
- π TLDR: This paper introduces ScreenAgent, a computer control agent powered by a visual language large model. The system can interpret natural language instructions and execute them on various computer applications by analyzing screen content. ScreenAgent employs a novel action grounding mechanism to map high-level instructions to specific UI interactions. Evaluated on a diverse set of tasks across different applications, ScreenAgent demonstrates superior performance in task completion and generalization compared to existing methods.
-
OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
- Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, Lingpeng Kong
- ποΈ Institutions: Shanghai AI Lab, East China Normal University, Princeton University, University of Hong Kong
- π Date: February 12, 2024
- π Publisher: ICLR 2024 Workshop LLMAgents
- π» Env: [Desktop]
- π Key: [framework], [self-directed learning], [GAIA], [FRIDAY], [OS-Copilot]
- π TLDR: The OS-Copilot framework supports building generalist agents capable of performing diverse tasks across an operating system (OS). This work introduces FRIDAY, an embodied agent using OS-Copilot to self-improve by learning from task outcomes. It operates with a memory-based architecture to tackle OS-level tasks across applications like terminals, web browsers, and third-party tools. Tested on the GAIA benchmark, FRIDAY achieved 35% higher performance than prior methods, proving effective in adapting to unfamiliar applications and refining its capabilities with minimal guidance.
-
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
- Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma
- ποΈ Institutions: Google DeepMind
- π Date: February 7, 2024
- π Publisher: IJCAI 2024
- π» Env: [GUI]
- π Key: [model], [dataset], [UI understanding], [infographics understanding], [vision language model]
- π TLDR: This paper introduces ScreenAI, a vision-language model specializing in UI and infographics understanding. The model combines the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. ScreenAI achieves state-of-the-art results on several UI and infographics-based tasks, outperforming larger models. The authors also release three new datasets for screen annotation and question answering tasks.
-
Dual-View Visual Contextualization for Web Navigation
- Jihyung Kil, Chan Hee Song, Boyuan Zheng, Xiang Deng, Yu Su, Wei-Lun Chao
- ποΈ Institutions: OSU
- π Date: February 6, 2024
- π Publisher: CVPR 2024
- π» Env: [Web]
- π Key: [framework], [visual contextualization]
- π TLDR: This paper proposes a novel approach to web navigation by contextualizing HTML elements through their "dual views" in webpage screenshots. The method leverages both the textual content of HTML elements and their visual representation in the screenshot to create more informative representations for web agents. Evaluated on the Mind2Web dataset, the approach demonstrates consistent improvements over baseline methods across various scenarios, including cross-task, cross-website, and cross-domain navigation tasks.
-
WebLINX: Real-World Website Navigation with Multi-Turn Dialogue
- Xing Han Lu, Zdeněk Kasner, Siva Reddy
- ποΈ Institutions: Mila, McGill University
- π Date: February 2024
- π Publisher: ICML 2024
- π» Env: [Web]
- π Key: [framework], [dataset], [benchmark], [multi-turn dialogue], [real-world navigation], [WebLINX]
- π TLDR: WebLINX addresses the complexity of real-world website navigation for conversational agents, with a benchmark featuring over 2,300 demonstrations across 150+ websites. The benchmark allows agents to handle multi-turn instructions and interact dynamically across diverse domains, including geographic and thematic categories. The study proposes a retrieval-inspired model that selectively extracts key HTML elements and browser actions, achieving efficient task-specific representations. Experiments reveal that smaller finetuned decoders outperform larger zero-shot multimodal models, though generalization to new environments remains challenging.
-
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
- Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, Ruslan Salakhutdinov
- ποΈ Institutions: CMU
- π Date: February 2024
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [dataset], [benchmark]
- π TLDR: OmniACT introduces a dataset and benchmark to train and evaluate multimodal agents capable of autonomously performing diverse tasks across desktop and web environments. Using annotated UI elements across applications, it combines visual grounding with natural language instructions, providing 9,802 data points for developing agents that integrate high-level reasoning with UI interactions. The study highlights the limited proficiency of current models, with baselines like GPT-4 only achieving 15% of human performance on executable scripts, emphasizing OmniACT's potential as a testbed for advancing multimodal AI.
-
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
- Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang
- ποΈ Institutions: Beijing Jiaotong University, Alibaba
- π Date: January 29, 2024
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [framework], [benchmark]
- π TLDR: This paper presents Mobile-Agent, an autonomous multi-modal agent designed for mobile device interaction. The system integrates visual perception, natural language processing, and action prediction to navigate and operate mobile applications. The authors introduce a new dataset and benchmark for evaluating mobile agents, demonstrating Mobile-Agent's superior performance in task completion and generalization across various apps compared to existing methods.
-
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
- Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried
- ποΈ Institutions: CMU
- π Date: January 24, 2024
- π Publisher: ACL 2024
- π» Env: [Web]
- π Key: [framework], [benchmark], [dataset], [multimodal agent evaluation], [visually grounded tasks]
- π TLDR: VisualWebArena is a benchmark designed for testing multimodal web agents on complex, visually grounded web tasks. It provides a reproducible framework with 910 task scenarios across real-world web applications, emphasizing open-ended, visually guided interactions. The tasks are modeled within a partially observable Markov decision process to assess agents' capacity to interpret multimodal inputs, execute navigation, and accomplish user-defined objectives across complex visual and textual information on websites.
-
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
- Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, Dong Yu
- ποΈ Institutions: Zhejiang University, Tencent AI Lab, Westlake University
- π Date: January 24, 2024
- π Publisher: ACL 2024
- π» Env: [Web]
- π Key: [benchmark], [evaluation]
- π TLDR: This paper introduces WebVoyager, an innovative web agent powered by Large Multimodal Models (LMMs) that can complete user instructions end-to-end by interacting with real-world websites. The authors establish a new benchmark with tasks from 15 popular websites and propose an automatic evaluation protocol using GPT-4V. WebVoyager achieves a 59.1% task success rate, significantly outperforming GPT-4 (All Tools) and text-only setups. The study demonstrates the effectiveness of multimodal approaches in web automation and provides insights into developing more intelligent web interaction solutions.
-
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
- Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, Zhiyong Wu
- ποΈ Institutions: Nanjing University, Shanghai AI Lab
- π Date: January 19, 2024
- π Publisher: ACL 2024
- π» Env: [GUI]
- π Key: [model], [benchmark], [GUI grounding], [visual grounding]
- π TLDR: TBD.
-
AgentBench: Evaluating LLMs as Agents
- Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, Jie Tang
- ποΈ Institutions: THU, OSU, ByteDance
- π Date: January 1, 2024
- π Publisher: ICLR 2024
- π» Env: [GUI], [General]
- π Key: [benchmark], [evaluation]
- π TLDR: AgentBench provides a comprehensive benchmark for evaluating LLMs as autonomous agents in various environments. It includes eight distinct scenarios, testing the LLMs' reasoning and decision-making capabilities in tasks such as OS interaction, database querying, knowledge graph traversal, and more. This benchmark compares the effectiveness of multiple commercial and open-source LLMs, revealing areas of improvement in instruction-following and long-term reasoning, essential for practical agent development.
-
GPT-4V(ision) is a Generalist Web Agent, if Grounded
- Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su
- ποΈ Institutions: OSU
- π Date: January 1, 2024
- π Publisher: ICML 2024
- π» Env: [Web]
- π Key: [framework], [dataset], [benchmark], [grounding], [SeeAct], [Multimodal-Mind2web]
- π TLDR: This paper explores the capability of GPT-4V(ision), a multimodal model, as a web agent that can perform tasks across various websites by following natural language instructions. It introduces the SEEACT framework, enabling GPT-4V to navigate, interpret, and interact with elements on websites. Evaluated using the Mind2Web benchmark and an online test environment, the framework demonstrates high performance on complex web tasks by integrating grounding strategies like element attributes and image annotations to improve HTML element targeting. However, grounding remains challenging, presenting opportunities for further improvement.
-
Multimodal Web Navigation with Instruction-Finetuned Foundation Models
- Hiroki Furuta, Kuang-Huei Lee, Ofir Nachum, Yutaka Matsuo, Aleksandra Faust, Shixiang Shane Gu, Izzeddin Gur
- ποΈ Institutions: Univ. of Tokyo, Google DeepMind
- π Date: January 1, 2024
- π Publisher: ICLR 2024
- π» Env: [Web]
- π Key: [benchmark], [model], [dataset], [web navigation], [instruction-following], [WebShop]
- π TLDR: This paper introduces WebGUM, an instruction-following multimodal agent for autonomous web navigation that leverages both visual (webpage screenshots) and textual (HTML) inputs to perform actions such as click and type. The model is trained on a vast corpus of demonstrations and shows improved capabilities in visual perception, HTML comprehension, and multi-step decision-making, achieving state-of-the-art performance on benchmarks like MiniWoB and WebShop. WebGUM provides a scalable approach to web-based tasks without task-specific architectures, enabling high-performance web navigation with generalizable, multimodal foundation models.
-
AppAgent: Multimodal Agents as Smartphone Users
- Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, Gang Yu
- ποΈ Institutions: Tencent
- π Date: December 21, 2023
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [framework], [smartphone interaction], [autonomous exploration], [self-improve]
- π TLDR: This paper introduces AppAgent, a novel multimodal agent framework designed to operate smartphone applications. The agent uses a simplified action space to mimic human-like interactions such as tapping and swiping. AppAgent learns to navigate and use new apps through autonomous exploration or by observing human demonstrations, creating a knowledge base for executing complex tasks across different applications. The framework's effectiveness is demonstrated through extensive testing on 50 tasks across 10 diverse applications.
-
AssistGUI: Task-Oriented Desktop Graphical User Interface Automation
- Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, Hengxu Wang, Luowei Zhou, Mike Zheng Shou
- ποΈ Institutions: NUS
- π Date: December 20, 2023
- π Publisher: CVPR 2024
- π» Env: [Desktop]
- π Key: [framework], [dataset], [benchmark], [desktop productivity tasks]
- π TLDR: This study presents AssistGUI, a benchmark and framework for desktop GUI automation, featuring an LLM-based agent capable of completing complex user requests by analyzing instructional videos and performing actions on the desktop. Utilizing a novel Actor-Critic framework and GUI parser, AssistGUI was tested on 100 tasks across nine applications, such as MS Word and After Effects. Despite advances, the top-performing model achieved only a 46% success rate, illustrating the challenge of comprehensive desktop automation and underscoring areas for future research in agent-driven GUI tasks.
-
CogAgent: A Visual Language Model for GUI Agents
- Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhao Chen, Yuxuan Wang, Yining Ye, Jiayi Zhang, Hao Dong, Wenhu Chen, Yizhou Wang, Kai-Wei Chang
- ποΈ Institutions: Tsinghua University, Zhipu AI
- π Date: December 15, 2023
- π Publisher: CVPR 2024
- π» Env: [GUI]
- π Key: [model], [dataset], [benchmark], [visual language model], [GUI agent]
- π TLDR: This paper presents CogAgent, a visual language model designed for GUI agents. The authors introduce a new dataset, CogBench, featuring 1,430 GUI tasks across various applications. CogAgent employs a novel training approach combining supervised fine-tuning and decision-making fine-tuning. The model demonstrates superior performance on CogBench and generalizes well to unseen applications, outperforming existing models like GPT-4V in GUI task completion.
-
GAIA: a benchmark for General AI Assistants
- Grégoire Mialon, Yassine Nakkach, Aslan Tchamkerten, Albert Thomas, Laurent Dinh, and a research team from Meta AI and Hugging Face.
- ποΈ Institutions: Meta AI, Hugging Face
- π Date: November 21, 2023
- π Publisher: arXiv
- π» Env: [Misc]
- π Key: [benchmark], [multi-modality], [tool use], [reasoning]
- π TLDR: GAIA is a benchmark developed for evaluating general-purpose AI assistants. It aims to test assistant models across multiple modalities and complex reasoning tasks in real-world settings, including scenarios that require tool usage and open-ended question answering. With a dataset comprising 466 questions across various domains, GAIA highlights gaps between current AI performance and human capability, presenting a significant challenge for large language models such as GPT-4.
-
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation
- An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, Zicheng Liu, Lijuan Wang
- ποΈ Institutions: UCSD, Microsoft, UCSB, UWM
- π Date: November 13, 2023
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [framework], [benchmark], [zero-shot GUI navigation], [multimodal LLMs]
- π TLDR: This paper explores the capabilities of GPT-4V in navigating smartphone GUIs without prior training. The authors introduce a novel framework for GUI navigation and a new benchmark, MobileNav, featuring 1,000 navigation tasks across 100 mobile apps. The study demonstrates GPT-4V's impressive zero-shot performance in understanding and interacting with mobile interfaces, outperforming previous methods and even approaching human-level performance on some tasks.
-
Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models
- Fangzhi Xu, Qiushi Sun, Kanzhi Cheng, Jun Liu, Yu Qiao, Zhiyong Wu
- ποΈ Institutions: Xi'an Jiaotong University, Shanghai AI Lab, HKU, Nanjing University
- π Date: November 2023
- π Publisher: arXiv
- π» Env: [GUI (evaluated on web, math reasoning, and logic reasoning environments)]
- π Key: [framework], [dataset], [neural-symbolic self-training], [online exploration], [self-refinement]
- π TLDR: This paper introduces ENVISIONS, a neural-symbolic self-training framework designed to improve large language models (LLMs) by enabling self-training through interaction with a symbolic environment. The framework addresses symbolic data scarcity and enhances LLMs' symbolic reasoning proficiency by iteratively exploring, refining, and learning from symbolic tasks without reinforcement learning. Extensive evaluations across web navigation, math, and logical reasoning tasks highlight ENVISIONS as a promising approach for enhancing LLM symbolic processing.
-
UI Layout Generation with LLMs Guided by UI Grammar
- Yuwen Lu, Ziang Tong, Qinyi Zhao, Chengzhi Zhang, Toby Jia-Jun Li
- ποΈ Institutions: University of Notre Dame
- π Date: October 24, 2023
- π Publisher: ICML 2023 Workshop on AI and HCI
- π» Env: [Mobile]
- π Key: [UI grammar], [UI Layout Generation]
- π TLDR: This position paper explores the use of Large Language Models (LLMs) for generating mobile user interface (UI) layouts. It introduces UI grammar, a novel approach to represent the hierarchical structure of UI screens, aiming to guide LLMs' generative capabilities more effectively and enhance the explainability and controllability of the process. Initial experiments with GPT-4 demonstrate the potential of LLMs to produce high-quality UIs through in-context learning, with the grammar-based approach improving certain aspects of generation quality.
-
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
- Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, Jianfeng Gao
- ποΈ Institutions: MSR
- π Date: October 17, 2023
- π Publisher: arXiv
- π» Env: [Misc]
- π Key: [visual prompting], [framework], [benchmark], [visual grounding], [zero-shot]
- π TLDR: This paper introduces Set-of-Mark (SoM), a novel visual prompting approach designed to enhance the visual grounding capabilities of multimodal models like GPT-4V. By overlaying images with spatially and semantically distinct marks, SoM enables fine-grained object recognition and interaction within visual data, surpassing conventional zero-shot segmentation methods in accuracy. The framework is validated on tasks requiring detailed spatial reasoning, demonstrating a significant improvement over existing visual-language models without fine-tuning.
-
OpenAgents: An Open Platform for Language Agents in the Wild
- Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, Leo Z. Liu, Yiheng Xu, Hongjin Su, Dongchan Shin, Caiming Xiong, Tao Yu
- ποΈ Institutions: HKU, XLang Lab, Sea AI Lab, Salesforce Research
- π Date: October 16, 2023
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [framework], [Data Agent], [Plugins Agent], [Web Agent]
- π TLDR: This paper introduces OpenAgents, an open-source platform designed to facilitate the use and hosting of language agents in real-world scenarios. It features three agents: Data Agent for data analysis using Python and SQL, Plugins Agent with access to over 200 daily API tools, and Web Agent for autonomous web browsing. OpenAgents aims to provide a user-friendly web interface for general users and a seamless deployment experience for developers and researchers, promoting the development and evaluation of innovative language agents in practical applications.
-
Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API
- Zhizheng Zhang, Wenxuan Xie, Xiaoyi Zhang, Yan Lu
- ποΈ Institutions: MSRA
- π Date: October 7, 2023
- π Publisher: arXiv
- π» Env: [GUI]
- π Key: [model], [framework], [reinforcement learning], [UI task automation], [instruction grounding]
- π TLDR: This paper introduces a multimodal model, termed RUIG (Reinforced UI Instruction Grounding), for automating UI tasks through natural language instructions. By leveraging a pixel-to-sequence approach, the model directly decodes UI element locations from screenshots based on user commands, removing the need for metadata like element coordinates. The framework uses a transformer-based encoder-decoder setup optimized through reinforcement learning to improve spatial accuracy. This novel approach outperforms prior methods, offering a generalized solution for UI task automation.
-
SteP: Stacked LLM Policies for Web Actions
- Paloma Sodhi, S.R.K. Branavan, Yoav Artzi, Ryan McDonald
- ποΈ Institutions: ASAPP Research, Cornell University
- π Date: October 5, 2023
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [framework], [policy composition], [dynamic control], [SteP]
- π TLDR: This paper introduces SteP (Stacked LLM Policies), a framework that dynamically composes policies to tackle diverse web tasks. By defining a Markov Decision Process where the state is a stack of policies, SteP enables adaptive control that adjusts to task complexity. Evaluations on WebArena, MiniWoB++, and a CRM simulator demonstrate that SteP significantly outperforms existing methods, achieving a success rate improvement from 14.9% to 35.8% over state-of-the-art GPT-4 policies.
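The stack-of-policies idea summarized above can be sketched in a few lines. This is a toy illustration, not the SteP implementation: the `Decision`, `Policy`, `ScriptedPolicy`, and `run_stack` names are hypothetical, and the scripted policies stand in for LLM-backed ones. The policy on top of the stack acts; it may emit a primitive action, push a sub-policy, or finish and hand control back.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Decision:
    """What a policy returns on one step (all fields are illustrative)."""
    action: Optional[str] = None      # primitive action to emit, if any
    call: Optional["Policy"] = None   # sub-policy to push onto the stack
    done: bool = False                # True when this policy has finished


class Policy:
    def step(self) -> Decision:
        raise NotImplementedError


class ScriptedPolicy(Policy):
    """Toy policy that replays a fixed list of decisions, then finishes."""

    def __init__(self, script: List[Decision]):
        self._script = list(script)

    def step(self) -> Decision:
        if self._script:
            return self._script.pop(0)
        return Decision(done=True)


def run_stack(root: Policy, max_steps: int = 50) -> List[str]:
    """Let the top-of-stack policy act until the stack is empty."""
    stack: List[Policy] = [root]
    actions: List[str] = []
    for _ in range(max_steps):
        if not stack:
            break
        decision = stack[-1].step()
        if decision.done:
            stack.pop()                  # return control to the caller policy
        elif decision.call is not None:
            stack.append(decision.call)  # delegate to a sub-policy
        elif decision.action is not None:
            actions.append(decision.action)
    return actions


# A root policy that delegates form filling to a sub-policy.
fill_form = ScriptedPolicy([Decision(action="type name"),
                            Decision(action="click submit")])
root = ScriptedPolicy([Decision(action="open page"),
                       Decision(call=fill_form),
                       Decision(action="close page")])
trace = run_stack(root)
print(trace)
```

The stack depth grows only when a policy delegates, so simple tasks stay with a single policy while harder ones compose several.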
-
You Only Look at Screens: Multimodal Chain-of-Action Agents
- Zhuosheng Zhang, Aston Zhang
- ποΈ Institutions: SJTU
- π Date: September 20, 2023
- π Publisher: ICLR 2024
- π» Env: [GUI]
- π Key: [framework], [dataset], [benchmark], [multimodal agent], [chain-of-action technique]
- π TLDR: This paper presents Auto-GUI, a multimodal agent capable of directly interacting with graphical user interfaces without relying on environment parsing or application-specific APIs. The authors introduce a chain-of-action technique that leverages previous action histories and future action plans to improve decision-making. Auto-GUI is evaluated on the AITW device-control benchmark, where it reports state-of-the-art performance in action prediction and task completion across various applications and web-based tasks.
-
LASER: LLM Agent with State-Space Exploration for Web Navigation
- Kaixin Ma, Hongming Zhang, Hongwei Wang, Xiaoman Pan, Dong Yu, Jianshu Chen
- ποΈ Institutions: Tencent AI Lab
- π Date: September 15, 2023
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [framework], [state-space exploration], [backtracking]
- π TLDR: This paper introduces LASER, an LLM agent that models interactive web navigation tasks as state-space exploration. The approach defines a set of high-level states and associated actions, allowing the agent to transition between states and backtrack from errors. LASER significantly outperforms previous methods on the WebShop task without using in-context examples, demonstrating improved handling of novel situations and mistakes during task execution.
-
AutoDroid: LLM-powered Task Automation in Android
- Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, Yunxin Liu
- ποΈ Institutions: Tsinghua University, Shanghai AI Lab, University of Notre Dame, MSR
- π Date: August 29, 2023
- π Publisher: MobiCom 2024
- π» Env: [Mobile]
- π Key: [framework], [dataset], [benchmark], [Android task automation], [LLM-powered agent]
- π TLDR: This paper introduces AutoDroid, a novel mobile task automation system capable of handling arbitrary tasks on any Android application without manual efforts. The framework combines the commonsense knowledge of LLMs with domain-specific knowledge of apps through automated dynamic analysis. AutoDroid features a functionality-aware UI representation method, exploration-based memory injection techniques, and a multi-granularity query optimization module. Evaluated on a new benchmark with 158 common tasks, AutoDroid achieves a 90.9% action generation accuracy and a 71.3% task completion rate, significantly outperforming GPT-4-powered baselines.
-
MindSearch: Mimicking Human Minds Elicits Deep AI Searcher
- Zehui Chen, Kuikun Liu, Qiuchen Wang, Jiangning Liu, Wenwei Zhang, Kai Chen, Feng Zhao
- ποΈ Institutions: USTC, Shanghai AI Lab
- π Date: July 29, 2024
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [framework], [information seeking], [planning], [AI search], [MindSearch]
- π TLDR: This paper presents MindSearch, a novel approach to web information seeking and integration that mimics human cognitive processes. The system uses a multi-agent framework consisting of a WebPlanner and WebSearcher. The WebPlanner models multi-step information seeking as a dynamic graph construction process, decomposing complex queries into sub-questions. The WebSearcher performs hierarchical information retrieval for each sub-question. MindSearch demonstrates significant improvements in response quality and depth compared to existing AI search solutions, processing information from over 300 web pages in just 3 minutes.
-
WebArena: A Realistic Web Environment for Building Autonomous Agents
- Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, Graham Neubig
- ποΈ Institutions: CMU
- π Date: July 26, 2023
- π Publisher: NeurIPS 2023
- π» Env: [Web]
- π Key: [framework], [benchmark], [multi-tab navigation], [web-based interaction], [agent simulation]
- π TLDR: WebArena provides a standalone, realistic web simulation environment where autonomous agents can perform complex web-based tasks. The platform offers functionalities such as multi-tab browsing, element interaction, and customized user profiles. Its benchmark suite contains 812 tasks grounded in high-level natural language commands. WebArena uses multi-modal observations, including HTML and accessibility tree views, supporting advanced tasks that require contextual understanding across diverse web pages, making it suitable for evaluating generalist agents in real-world web environments.
-
Android in the Wild: A Large-Scale Dataset for Android Device Control
- Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, Timothy Lillicrap
- ποΈ Institutions: Google Research, Google DeepMind
- π Date: July 19, 2023
- π Publisher: NeurIPS 2023
- π» Env: [Mobile]
- π Key: [dataset], [benchmark], [device control], [natural language interaction], [gesture-based actions]
- π TLDR: The Android in the Wild (AitW) dataset introduces a significant benchmark for Android device control, encompassing over 715,000 human-labeled episodes with natural language commands and corresponding UI actions. Collected from Android devices across versions 10-13, it captures complex multi-step tasks requiring both visual and contextual understanding. The dataset is structured to test the robustness of device-control systems under varying conditions, such as new tasks or applications, and includes data to evaluate gesture-based interactions, providing a unique foundation for mobile interface automation and task execution research.
-
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
- Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, Aleksandra Faust
- ποΈ Institutions: Google DeepMind, The University of Tokyo
- π Date: July 2023
- π Publisher: ICLR 2024
- π» Env: [Web]
- π Key: [framework], [program synthesis], [HTML comprehension], [web automation], [self-supervised learning]
- π TLDR: WebAgent leverages two LLMs, HTML-T5 for HTML comprehension and Flan-U-PaLM for program synthesis, to complete web automation tasks. It combines planning, HTML summarization, and code generation to navigate and interact with real-world web environments, improving success rates on HTML-based tasks and achieving state-of-the-art performance in benchmarks like MiniWoB and Mind2Web. The modular architecture adapts well to open-domain tasks, using local-global attention mechanisms to manage long HTML contexts.
-
Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control
- Longtao Zheng, Rundong Wang, Xinrun Wang, Bo An
- ποΈ Institutions: NTU
- π Date: June 13, 2023
- π Publisher: ICLR 2024
- π» Env: [Desktop]
- π Key: [framework], [benchmark], [trajectory prompting], [state abstraction], [memory retrieval]
- π TLDR: Synapse introduces a novel framework for computer control tasks, leveraging trajectory-as-exemplar prompting and memory to enhance LLM performance in complex, multi-step computer tasks. The system combines state abstraction, trajectory-based prompts, and memory retrieval, overcoming LLM limitations by filtering task-irrelevant data, storing exemplar trajectories, and retrieving relevant instances for improved decision-making. Synapse achieves significant performance gains on benchmarks such as MiniWoB++ and Mind2Web, demonstrating enhanced task success rates and generalization across diverse web-based tasks.
-
Mind2Web: Towards a Generalist Agent for the Web
- Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, Yu Su
- ποΈ Institutions: OSU
- π Date: June 9, 2023
- π Publisher: NeurIPS 2023
- π» Env: [Web]
- π Key: [dataset], [benchmark], [model], [Mind2Web], [MindAct]
- π TLDR: Mind2Web presents a dataset and benchmark specifically crafted for generalist web agents capable of performing language-guided tasks across varied websites. Featuring over 2,000 tasks from 137 sites, it spans 31 domains and emphasizes open-ended, realistic tasks in authentic, unsimplified web settings. The study proposes the MindAct framework, which optimizes LLMs for handling complex HTML elements by using small LMs to rank elements before full processing, thereby enhancing the efficiency and versatility of web agents in diverse contexts.
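The rank-then-select pattern described for MindAct can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `overlap_score` is a word-overlap stand-in for the small ranking LM, `first_choice` is a placeholder for the LLM that picks among the shortlisted elements, and the element strings are invented.

```python
from typing import Callable, List


def overlap_score(task: str, element: str) -> int:
    """Cheap stand-in ranker: count words shared by the task and the element."""
    return len(set(task.lower().split()) & set(element.lower().split()))


def rank_then_select(task: str,
                     elements: List[str],
                     selector: Callable[[str, List[str]], str],
                     k: int = 3) -> str:
    """Stage 1: rank all candidates cheaply. Stage 2: run the costly selector on the top k."""
    ranked = sorted(elements, key=lambda e: overlap_score(task, e), reverse=True)
    shortlist = ranked[:k]
    return selector(task, shortlist)


def first_choice(task: str, shortlist: List[str]) -> str:
    """Placeholder for the expensive model: it simply takes the top candidate."""
    return shortlist[0]


elements = [
    "button Sign in",
    "link Careers",
    "button Search flights",
    "input Departure city",
    "link Privacy policy",
]
task = "search flights from Boston"
chosen = rank_then_select(task, elements, first_choice, k=2)
print(chosen)
```

The point of the split is cost: the expensive model only ever sees `k` candidates, no matter how many elements the page contains.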
-
SheetCopilot: Bringing Software Productivity to the Next Level through Large Language Models
- Hongxin Li, Jingran Su, Yuntao Chen, Qing Li, Zhaoxiang Zhang
- ποΈ Institutions: UCAS, HKISI-CAS, PolyU, Shanghai AI Lab
- π Date: May 30, 2023
- π Publisher: NeurIPS 2023
- π» Env: [GUI]
- π Key: [framework], [spreadsheet automation], [natural language interface]
- π TLDR: This paper introduces SheetCopilot, an innovative system that leverages large language models to automate spreadsheet tasks through natural language interactions. The framework includes a novel prompt design for task decomposition and execution, and a feedback loop for error correction. SheetCopilot demonstrates significant improvements in task completion rates and efficiency across various spreadsheet operations, outperforming existing methods and showing potential for enhancing productivity in spreadsheet software.
-
Augmenting Autotelic Agents with Large Language Models
- Cédric Colas, Laetitia Teodorescu, Pierre-Yves Oudeyer, Xingdi Yuan, Marc-Alexandre Côté
- ποΈ Institutions: MIT, Inria, Microsoft
- π Date: May 22, 2023
- π Publisher: CoLLAs 2023
- π» Env: [GUI]
- π Key: [framework], [reinforcement learning], [goal generation], [large language models], [autotelic learning]
- π TLDR: This study introduces the Language Model-Augmented Autotelic Agent (LMA3), a framework leveraging large language models to help agents autonomously generate, represent, and learn diverse goals in a task-agnostic, text-based environment. LMA3 integrates pretrained language models to emulate human cultural knowledge, aiming to dynamically relabel goals, generate new goals, and create goal-driven reward functions without manual inputs. This approach supports skill development by autonomously expanding goal repertoires in ways that resemble human open-ended learning, showcasing potential for achieving complex, self-directed learning in AI.
-
Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction
- Danyang Zhang, Zhennan Shen, Rui Xie, Situo Zhang, Tianbao Xie, Zihan Zhao, Siyuan Chen, Lu Chen, Hongshen Xu, Ruisheng Cao, Kai Yu
- ποΈ Institutions: SJTU, HKU
- π Date: May 14, 2023
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [benchmark], [dataset], [interaction platform], [multistep interaction], [InfoUI]
- π TLDR: This paper introduces Mobile-Env, a novel interaction platform and benchmark aimed at assessing large language models' (LLMs) capabilities in interactive environments. It builds on the InfoUI task set, derived from WikiHow, to create structured text-based challenges that simulate real-world mobile interactions. The platform is designed to support task expansions from the community, aiming to drive advancements in LLM-based interactive agents.
-
Language Models can Solve Computer Tasks
- Geunwoo Kim, Pierre Baldi, Stephen McAleer
- ποΈ Institutions: UCI
- π Date: March 30, 2023
- π Publisher: NeurIPS 2023
- π» Env: [Desktop]
- π Key: [framework], [benchmark], [Recursive Critique and Improve], [RCI], [MiniWoB++], [general computer tasks]
- π TLDR: This study demonstrates that large language models (LLMs) can effectively automate computer tasks using a Recursive Critique and Improve (RCI) prompting method, enabling agents to handle complex desktop tasks like email and file management. By combining RCI with existing Chain of Thought (CoT) prompting, the method outperforms prior LLM approaches and traditional supervised and reinforcement learning models on the MiniWoB++ benchmark, showing potential for broad computer task automation.
-
Reflexion: Language Agents with Verbal Reinforcement Learning
- Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao
- ποΈ Institutions: Northeastern University, MIT, Princeton University
- π Date: March 20, 2023
- π Publisher: NeurIPS 2023
- π» Env: [Misc]
- π Key: [framework], [learning], [verbal reinforcement learning], [Reflexion]
- π TLDR: This paper introduces Reflexion, a framework that enhances language agents by enabling them to reflect on task feedback linguistically, storing these reflections in an episodic memory to improve decision-making in future trials. Reflexion allows agents to learn from various feedback types without traditional weight updates, achieving significant performance improvements across tasks like decision-making, coding, and reasoning. For instance, Reflexion attains a 91% pass@1 accuracy on the HumanEval coding benchmark, surpassing the previous state-of-the-art GPT-4's 80%.
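The trial, feedback, and reflection cycle summarized above can be sketched as a loop in which the only thing that changes between trials is a list of text reflections. This is a hedged toy, not the Reflexion codebase: `reflexion_loop` and the three callables are hypothetical names, and the "agent" is a hard-coded function rather than an LLM.

```python
from typing import Callable, List, Tuple


def reflexion_loop(attempt: Callable[[List[str]], str],
                   evaluate: Callable[[str], Tuple[bool, str]],
                   reflect: Callable[[str, str], str],
                   max_trials: int = 3) -> Tuple[str, List[str]]:
    """Retry a task, carrying verbal reflections forward instead of updating weights."""
    memory: List[str] = []  # episodic memory of past reflections
    output = ""
    for _ in range(max_trials):
        output = attempt(memory)
        ok, feedback = evaluate(output)
        if ok:
            break
        memory.append(reflect(output, feedback))
    return output, memory


# Toy task: the agent must answer "add"; it starts out answering "sub".
def attempt(memory: List[str]) -> str:
    return "add" if any("use add" in note for note in memory) else "sub"


def evaluate(output: str) -> Tuple[bool, str]:
    return output == "add", "expected a sum, got a difference"


def reflect(output: str, feedback: str) -> str:
    return f"Tried {output}; {feedback}; use add next time."


result, notes = reflexion_loop(attempt, evaluate, reflect)
print(result, notes)
```

In this toy run the first trial fails, one reflection is stored, and the second trial succeeds by reading it.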
-
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
- Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova
- ποΈ Institutions: Google
- π Date: February 1, 2023
- π Publisher: ICML 2023
- π» Env: [Web], [Doc]
- π Key: [model], [framework], [vision encoder], [visual language understanding], [screenshot parsing], [image-to-text]
- π TLDR: This paper introduces Pix2Struct, a model pre-trained to parse masked screenshots into simplified HTML for tasks requiring visual language understanding. By leveraging the structure of HTML and diverse web page elements, Pix2Struct captures pretraining signals like OCR and image captioning, achieving state-of-the-art performance across tasks in domains including documents, user interfaces, and illustrations.
-
WebUI: A Dataset for Enhancing Visual UI Understanding with Web Semantics
- Jason Wu, Siyan Wang, Siman Shen, Yi-Hao Peng, Jeffrey Nichols, Jeffrey P. Bigham
- ποΈ Institutions: CMU, Wellesley College, Grinnell College, Snooty Bird LLC
- π Date: January 30, 2023
- π Publisher: CHI 2023
- π» Env: [Web]
- π Key: [dataset], [element detection], [screen classification], [screen similarity], [UI modeling]
- π TLDR: The WebUI dataset includes 400,000 web UIs captured to enhance UI modeling by integrating visual UI metadata. This dataset supports tasks such as element detection, screen classification, and screen similarity, especially for accessibility, app automation, and testing applications. Through transfer learning and semi-supervised methods, WebUI addresses the challenge of training robust models with limited labeled mobile data, proving effective in tasks beyond web contexts, such as mobile UIs.
-
ReAct: Synergizing Reasoning and Acting in Language Models
- Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao
- ποΈ Institutions: Princeton University, Google Research
- π Date: October 6, 2022
- π Publisher: ICLR 2023
- π» Env: [Misc]
- π Key: [framework], [reasoning], [ReAct]
- π TLDR: This paper introduces ReAct, a framework that enables large language models to generate reasoning traces and task-specific actions in an interleaved manner. By combining reasoning and acting, ReAct enhances the model's ability to perform complex tasks in language understanding and interactive decision making. The approach is validated across various benchmarks, demonstrating improved performance and interpretability over existing methods.
-
Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus
- Gang Li, Yang Li
- ποΈ Institutions: Google Research
- π Date: September 29, 2022
- π Publisher: ICLR 2023
- π» Env: [Mobile]
- π Key: [framework], [model], [dataset], [mobile UI tasks], [region-based focus]
- π TLDR: This paper introduces "Spotlight," a vision-language model for mobile UI understanding that operates solely on visual inputs (screenshots) and a specified focus region on the screen. By leveraging a large-scale dataset and training strategies tailored to mobile interfaces, Spotlight performs multiple UI-related tasks, including widget captioning, screen summarization, command grounding, and tappability prediction. It utilizes a vision-only approach, avoiding reliance on view hierarchies to achieve greater robustness and scalability across different mobile UI environments.
-
WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents
- Shunyu Yao, Howard Chen, John Yang, Karthik Narasimhan
- ποΈ Institutions: Princeton University
- π Date: July 2022
- π Publisher: NeurIPS 2022
- π» Env: [Web]
- π Key: [framework], [dataset], [benchmark], [e-commerce web interaction], [language grounding]
- π TLDR: This paper introduces WebShop, a simulated web-based shopping environment with over 1 million real-world products and 12,087 annotated instructions. It allows language agents to navigate, search, and make purchases based on natural language commands. The study explores how agents handle compositional instructions and noisy web data, providing a robust environment for reinforcement learning and imitation learning. The best models show effective sim-to-real transfer on websites like Amazon, illustrating WebShop's potential for training grounded agents.
-
META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI
- Liangtai Sun, Xingyu Chen, Lu Chen, Tianle Dai, Zichen Zhu, Kai Yu
- ποΈ Institutions: SJTU
- π Date: May 23, 2022
- π Publisher: EMNLP 2022
- π» Env: [Mobile]
- π Key: [benchmark], [dataset], [task-oriented dialogue], [GUI-based interaction], [multi-modal agent]
- π TLDR: This paper presents META-GUI, a dataset and framework for training multi-modal conversational agents capable of interacting directly with mobile app interfaces without the need for backend APIs. META-GUI includes over 1,100 dialogues with annotated action sequences on various tasks such as booking and scheduling. The authors propose a GUI-based task-oriented dialogue system that allows agents to navigate mobile interfaces via direct GUI actions, with performance shown to improve in multi-modal task-oriented dialogue contexts.
-
A Data-Driven Approach for Learning to Control Computers
- Peter C. Humphreys, David Raposo, Tobias Pohlen, Gregory Thornton, Rachita Chhaparia, Alistair Muldal, Josh Abramson, Petko Georgiev, Alex Goldin, Adam Santoro, Timothy Lillicrap
- ποΈ Institutions: DeepMind
- π Date: February 16, 2022
- π Publisher: ICML 2022
- π» Env: [Desktop]
- π Key: [dataset], [framework], [computer control], [reinforcement learning], [multimodal transformer]
- π TLDR: This study presents a reinforcement learning-based approach to train agents for computer control tasks, using keyboard and mouse interactions guided by natural language. By leveraging human demonstration data, agents trained in this environment achieved strong cross-task generalization across the MiniWob++ benchmark. This framework demonstrates how agents can control computers as humans would, enabling enhanced performance in complex computer tasks with high transferability.
-
A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility
- Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, Bryan A. Plummer
- ποΈ Institutions: Boston University, UIUC
- π Date: February 4, 2022
- π Publisher: ECCV 2022
- π» Env: [Mobile]
- π Key: [dataset], [feasibility prediction], [vision-language navigation], [mobile interaction]
- π TLDR: This paper introduces the Mobile App Tasks with Iterative Feedback (MoTIF) dataset, which addresses vision-language navigation (VLN) with a focus on task feasibility uncertainty in mobile applications. MoTIF provides commands paired with mobile actions and feasibility annotations, allowing researchers to examine the impact of command feasibility on task completion. The dataset includes 125 apps and emphasizes diverse app environments, action sequences, and follow-up questions to improve task ambiguity resolution, making it a valuable resource for feasibility prediction research.
-
Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning
- Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, Yang Li
- ποΈ Institutions: University of Toronto
- π Date: August 6, 2021
- π Publisher: UIST 2021
- π» Env: [Mobile]
- π Key: [framework], [dataset], [mobile UI summarization], [multimodal learning], [Transformer model]
- π TLDR: The paper introduces Screen2Words, an approach that utilizes multimodal learning to generate descriptive language summaries for mobile UI screens, combining textual, visual, and structural data from screens. The study created a large-scale dataset with 112,085 annotated screen summaries for 22,417 unique UIs, aiming to support model training for mobile UI understanding. The dataset facilitates a Transformer-based model trained to summarize screens by highlighting main functionalities, and the approach is validated with benchmarks in the mobile environment.
-
UIBert: Learning Generic Multimodal Representations for UI Understanding
- Chongyang Bai, Xiaoxue Zang, Ying Xu, Srinivas Sunkara, Abhinav Rastogi, Jindong Chen, Blaise Agüera y Arcas
- ποΈ Institutions: Google Research
- π Date: July 29, 2021
- π Publisher: IJCAI 2021
- π» Env: [Mobile]
- π Key: [framework], [model], [dataset], [multimodal representation learning], [UI understanding]
- π TLDR: This paper presents UIBert, a multimodal model aimed at understanding user interfaces (UIs) by combining visual, textual, and structural metadata. UIBert is designed for tasks such as component retrieval and expression resolution, using a transformer-based joint image-text model. The authors introduce five novel pre-training tasks to leverage UI-specific features, enhancing accessibility and task completion in mobile applications. UIBert demonstrates superior performance on nine downstream UI tasks, highlighting the potential of multimodal pre-training in UI understanding.
-
AndroidEnv: A Reinforcement Learning Platform for Android
- Daniel Toyama, Philippe Hamel, Anita Gergely, Gheorghe Comanici, Amelia Glaese, Zafarali Ahmed, Tyler Jackson, Shibl Mourad, Doina Precup
- ποΈ Institutions: DeepMind
- π Date: May 27, 2021
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [reinforcement learning], [Android interface], [RL environment], [task flexibility], [touchscreen action space]
- π TLDR: AndroidEnv provides a reinforcement learning (RL) platform for Android that lets RL agents interact with a realistic Android simulation via touchscreen events. The platform supports diverse applications, enabling agents to interact with over 100 predefined tasks across a variety of apps. With hybrid continuous and discrete action spaces, AndroidEnv is well-suited for training agents in complex, real-world Android scenarios where actions must be contextually sequenced, such as in UI navigation, gaming, and productivity apps. This environment encourages further RL research by offering task flexibility and realistic Android emulation.
-
Grounding Open-Domain Instructions to Automate Web Support Tasks
- Nancy Xu, Sam Masling, Michael Du, Giovanni Campagna, Larry Heck, James Landay, Monica Lam
- ποΈ Institutions: Stanford
- π Date: March 30, 2021
- π Publisher: NAACL 2021
- π» Env: [Web]
- π Key: [benchmark], [framework], [grounding], [task automation], [open-domain instructions], [RUSS]
- π TLDR: This paper introduces RUSS (Rapid Universal Support Service), a framework designed to interpret and execute open-domain, step-by-step web instructions automatically. RUSS uses a BERT-LSTM model for semantic parsing into a custom language, ThingTalk, which allows the system to map language to actions across various web elements. The framework, including a dataset of instructions, facilitates agent-based web support task automation by grounding natural language to interactive commands.
-
WebSRC: A Dataset for Web-Based Structural Reading Comprehension
- Lu Chen, Zihan Zhao, Xingyu Chen, Danyang Zhang, Jiabao Ji, Ao Luo, Yuxuan Xiong, Kai Yu
- ποΈ Institutions: SJTU
- π Date: January 23, 2021
- π Publisher: EMNLP 2021
- π» Env: [Web]
- π Key: [dataset], [structural reading comprehension], [web page QA], [structural information], [HTML element alignment]
- π TLDR: This paper introduces WebSRC, a dataset specifically designed for web-based structural reading comprehension, which requires understanding not only textual content but also the structural layout of web pages. WebSRC consists of 0.44 million question-answer pairs derived from 6,500 complex web pages. Each question challenges models to identify answers from HTML structures or to respond with yes/no, requiring a nuanced grasp of HTML and layout features. The authors benchmark several models on this dataset, highlighting its difficulty and the critical role of structural comprehension in improving machine understanding of web content.
-
Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements
- Yang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, Zhiwei Guan
- ποΈ Institutions: Google Research
- π Date: November 2020
- π Publisher: EMNLP 2020
- π» Env: [Mobile]
- π Key: [dataset], [benchmark], [model], [accessibility], [natural language generation], [WidgetCaption]
- π TLDR: This paper introduces the task of widget captioning, which aims to automatically generate natural language descriptions for UI elements in mobile apps to enhance accessibility. Using both visual and structural data from UI components, the study presents a novel dataset of 162,859 captions across 61,285 UI elements. Multiple deep learning models were tested on this dataset, with findings suggesting the potential for improving screen reader usability for visually impaired users by generating descriptive captions of UI elements.
-
Interactive Task Learning from GUI-Grounded Natural Language Instructions and Demonstrations
- Toby Jia-Jun Li, Tom Mitchell, Brad Myers
- ποΈ Institutions: CMU
- π Date: July 2020
- π Publisher: ACL 2020
- π» Env: [Mobile]
- π Key: [framework], [Sugilite], [programming-by-demonstration]
- π TLDR: This paper introduces Sugilite, an intelligent task automation agent that learns new tasks and associated concepts interactively from users' natural language instructions and demonstrations on third-party mobile app GUIs. The system allows users to teach procedures and concepts through verbal instructions combined with GUI demonstrations, supports intent clarification for demonstrated actions, infers task parameters using hierarchical app GUI structures, and generalizes taught concepts across different contexts and domains. A prototype is presented as a conversational assistant on Android.
-
Mapping Natural Language Instructions to Mobile UI Action Sequences
- Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, Jason Baldridge
- ποΈ Institutions: Google Research
- π Date: July 2020
- π Publisher: ACL 2020
- π» Env: [Mobile]
- π Key: [framework], [dataset], [mobile UI automation], [natural language instructions], [action grounding], [RicoSCA]
- π TLDR: This paper introduces a method for grounding natural language instructions to mobile UI actions, aiming to automate mobile task execution through user interface manipulation. It introduces three key datasets: PixelHelp for task instruction-performance mappings on a Pixel emulator, AndroidHowTo for detailed phrase extraction, and RicoSCA for synthetic UI command training. The system utilizes a Transformer model to extract action phrase tuples, aligning them to UI elements with contextual screen positioning. Achieving over 70% accuracy in task completion, this approach is foundational for natural language-driven mobile UI automation.
-
PUMICE: A Multi-Modal Agent that Learns Concepts and Conditionals from Natural Language and Demonstrations
- Toby Jia-Jun Li, Marissa Radensky, Justin Jia, Kirielle Singarajah, Tom M. Mitchell, Brad A. Myers
- ποΈ Institutions: CMU, Amherst College
- π Date: August 30, 2019
- π Publisher: UIST 2019
- π» Env: [Mobile]
- π Key: [programming-by-demonstration], [PUMICE]
- π TLDR: This paper introduces PUMICE, a multi-modal agent that combines natural language programming and programming-by-demonstration to enable end users to instruct intelligent agents in performing new tasks. By allowing users to describe tasks and conditions naturally and then collaboratively resolving ambiguities through conversation and demonstration, PUMICE facilitates the teaching of new concepts and procedures within existing mobile app GUIs. A lab study with 10 users demonstrated its usability and effectiveness.
-
Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration
- Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, Percy Liang
- ποΈ Institutions: Stanford
- π Date: February 24, 2018
- π Publisher: ICLR 2018
- π» Env: [Web]
- π Key: [framework], [benchmark], [reinforcement learning], [web tasks], [workflow-guided exploration]
- π TLDR: This paper presents a novel RL approach using workflow-guided exploration to efficiently train agents on web-based tasks, where actions are restricted based on demonstrated workflows to streamline learning. Evaluated on MiniWoB and MiniWoB++ benchmarks, the method significantly outperforms traditional RL techniques in sparse reward settings by structuring exploration according to high-level action constraints.
-
Rico: A Mobile App Dataset for Building Data-Driven Design Applications
- Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, Ranjitha Kumar
- ποΈ Institutions: UIUC, Northwestern University, Google
- π Date: October 20, 2017
- π Publisher: UIST 2017
- π» Env: [Mobile]
- π Key: [dataset], [mobile UI], [UI design analysis], [interaction mining], [RICO]
- π TLDR: This paper introduces Rico, a large-scale dataset comprising UI screens and view hierarchies from over 9,000 Android apps, designed to aid in understanding mobile app design. Rico supports a variety of tasks, including UI design analysis and interaction mining, by providing labeled UI components, screenshots, and interaction traces.
-
World of Bits: An Open-Domain Platform for Web-Based Agents
- Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, Percy Liang
- ποΈ Institutions: Stanford, OpenAI
- π Date: August 2017
- π Publisher: ICML 2017
- π» Env: [Web]
- π Key: [framework], [dataset], [reinforcement learning], [open-domain]
- π TLDR: This paper introduces World of Bits (WoB), a platform enabling agents to perform complex web-based tasks using low-level keyboard and mouse actions, addressing the lack of open-domain realism in existing reinforcement learning environments. WoB leverages a novel framework where crowdworkers create tasks with structured rewards and reproducibility by caching web interactions, forming a stable training environment. The authors validate WoB by training agents via behavioral cloning and reinforcement learning to accomplish various real-world tasks, showcasing its potential as an effective platform for reinforcement learning on web tasks.
-
SUGILITE: Creating Multimodal Smartphone Automation by Demonstration
- Toby Jia-Jun Li, Amos Azaria, Brad A. Myers
- ποΈ Institutions: CMU, Ariel University
- π Date: May 6, 2017
- π Publisher: CHI 2017
- π» Env: [Mobile]
- π Key: [framework], [PBD], [multimodal interaction], [SUGILITE], [programming-by-demonstration], [demonstration]
- π TLDR: This paper introduces SUGILITE, a programming-by-demonstration (PBD) system that enables users to automate tasks on smartphones through multimodal interactions. By leveraging Android's accessibility API, SUGILITE allows users to create generalized automation scripts for arbitrary third-party apps by demonstrating tasks using the regular app UI. The system combines verbal instructions, user demonstrations, and app UI hierarchies to generalize scripts from single demonstrations, facilitating task variations and parameterization. Extensive error handling and context checking enhance robustness against app UI changes. A lab study indicates that users with minimal programming knowledge can successfully automate smartphone tasks using SUGILITE.
Please fork and update:
You can use this GPT to search for a paper by name and get a formatted entry automatically. Alternatively, simply leave a comment in an issue.
Format example and explanation
- [title](paper link)
- List authors directly without a "key" identifier (e.g., author1, author2)
- ποΈ Institutions: List the institutions concisely, using abbreviations (e.g., university names, like OSU).
- π Date: e.g., Oct 30, 2024
- π Publisher: ICLR 2025
- π» Env: Indicate the research environment within brackets, such as [Web], [Mobile], or [Desktop]. Use [GUI] if the research spans multiple environments. Use [Misc] if the research targets general domains.
- π Key: Label each keyword within brackets, e.g., [model], [framework], [dataset], [benchmark].
- π TLDR: Brief summary of the paper.
Regarding the π Key:
Key | Definition |
---|---|
model | Indicates a newly trained model. |
framework | If the paper proposes a new agent framework. |
dataset | If a new (training) dataset is created and published. |
benchmark | If a new benchmark is established (also add "dataset" if there's a new training set). |
primary studies | List the main focus or innovation in the study. |
Abbreviations | Include commonly used abbreviations associated with the paper (model names, framework names, etc.). |
For missing information, use "Unknown."
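
The entry format above can also be generated programmatically. The following is a minimal sketch, not a tool shipped with this repo: `PaperEntry`, `format_entry`, and `ALLOWED_ENVS` are hypothetical names introduced only for illustration, the emoji markers are omitted, and the flat list layout mirrors the example as shown here (adjust indentation to match the actual file). It fills every missing field with "Unknown" and rejects environment tags outside the five listed above.

```python
from dataclasses import dataclass, field

# Environment tags named in the format explanation above.
ALLOWED_ENVS = {"Web", "Mobile", "Desktop", "GUI", "Misc"}


@dataclass
class PaperEntry:
    """Hypothetical container for one paper entry (illustrative only)."""
    title: str
    link: str
    authors: list[str] = field(default_factory=list)
    institutions: list[str] = field(default_factory=list)
    date: str = ""
    publisher: str = ""
    env: str = ""
    keys: list[str] = field(default_factory=list)
    tldr: str = ""


def or_unknown(value: str) -> str:
    """Return the value, or "Unknown" when the information is missing."""
    return value if value else "Unknown"


def format_entry(entry: PaperEntry) -> str:
    """Render one entry as markdown list lines in the documented field order."""
    if entry.env and entry.env not in ALLOWED_ENVS:
        raise ValueError(f"Unsupported Env tag: {entry.env!r}")

    env = f"[{entry.env}]" if entry.env else ""
    keys = ", ".join(f"[{key}]" for key in entry.keys)

    lines = [
        f"- [{entry.title}]({entry.link})",
        f"- {or_unknown(', '.join(entry.authors))}",
        f"- Institutions: {or_unknown(', '.join(entry.institutions))}",
        f"- Date: {or_unknown(entry.date)}",
        f"- Publisher: {or_unknown(entry.publisher)}",
        f"- Env: {or_unknown(env)}",
        f"- Key: {or_unknown(keys)}",
        f"- TLDR: {or_unknown(entry.tldr)}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values, not a real paper.
    example = PaperEntry(
        title="Example Paper",
        link="https://example.com/paper",
        authors=["A. Author", "B. Author"],
        env="Web",
        keys=["dataset", "benchmark"],
    )
    print(format_entry(example))
```

Running the script prints an entry whose institutions, date, publisher, and TLDR fields read "Unknown", which makes gaps easy to spot before opening a pull request.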