-
Falcon-UI: Understanding GUI Before Following User Instructions
- Huawen Shen, Chang Liu, Gengluo Li, Xinlong Wang, Yu Zhou, Can Ma, Xiangyang Ji
- 🏛️ Institutions: Chinese Academy of Sciences, Tsinghua University, Nankai University, BAAI
- 📅 Date: December 12, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [model], [dataset], [Falcon-UI], [GUI understanding]
- 📖 TLDR: This paper introduces Falcon-UI, a GUI agent model that emphasizes understanding GUI contexts before following user instructions. The authors present the Insight-UI Dataset, an instruction-free GUI navigation dataset generated from the Common Crawl corpus, simulating various platforms like iOS, Android, Windows, and Linux across multiple resolutions on 312K domains. Falcon-UI is pretrained on this dataset and fine-tuned on Android and Web GUI datasets, achieving performance comparable to larger models, highlighting the importance of decoupling GUI understanding from instruction following in agent performance.
-
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
- Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong
- 🏛️ Institutions: HKU, NTU, Salesforce
- 📅 Date: December 5, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [model], [dataset], [planning], [reasoning], [Aguvis], [visual grounding]
- 📖 TLDR: This paper introduces Aguvis, a unified pure vision-based framework for autonomous GUI agents that operates across various platforms. It leverages image-based observations and grounds natural language instructions to visual elements, employing a consistent action space to ensure cross-platform generalization. The approach integrates explicit planning and reasoning within the model, enhancing its ability to autonomously navigate and interact with complex digital environments. A large-scale dataset of GUI agent trajectories is constructed, incorporating multimodal reasoning and grounding. Comprehensive experiments demonstrate that Aguvis surpasses previous state-of-the-art methods in both offline and real-world online scenarios, achieving the first fully autonomous pure vision GUI agent capable of performing tasks independently without collaboration with external closed-source models. All datasets, models, and training recipes are open-sourced to facilitate future research.
-
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
- Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou
- 🏛️ Institutions: NUS, Microsoft
- 📅 Date: November 26, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [model], [framework], [dataset], [UI-Guided Visual Token Selection], [Interleaved Vision-Language-Action Streaming], [ShowUI]
- 📖 TLDR: This paper introduces ShowUI, a vision-language-action model designed to enhance GUI automation by addressing challenges in UI visual perception and action modeling. It features innovations like UI-Guided Visual Token Selection to reduce computational costs and Interleaved Vision-Language-Action Streaming for effective management of visual-action history. Trained on a curated dataset, ShowUI achieves 75.1% accuracy in zero-shot screenshot grounding and demonstrates competitive performance across web, mobile, and online environments.
-
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents
- Yifan Xu, Xiao Liu, Xueqiao Sun, Siyi Cheng, Hao Yu, Hanyu Lai, Shudan Zhang, Dan Zhang, Jie Tang, Yuxiao Dong
- 🏛️ Institutions: Tsinghua University, Peking University
- 📅 Date: October 31, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [framework], [dataset], [benchmark], [AndroidLab]
- 📖 TLDR: This paper introduces AndroidLab, a comprehensive framework for training and systematically benchmarking Android autonomous agents. It provides an operational environment with diverse modalities and action spaces, supporting both large language models (LLMs) and large multimodal models (LMMs). The benchmark includes 138 tasks across nine apps on predefined Android virtual devices. Utilizing AndroidLab, the authors developed an Android Instruction dataset and trained six open-source LLMs and LMMs, significantly improving their average success rates.
-
OS-ATLAS: A Foundation Action Model For Generalist GUI Agents
- Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao
- 🏛️ Institutions: Shanghai AI Lab, Shanghai Jiao Tong University, HKU, MIT
- 📅 Date: October 30, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [model], [dataset], [benchmark], [OS-Atlas]
- 📖 TLDR: This paper introduces OS-Atlas, a foundational GUI action model designed to improve GUI grounding and performance on out-of-distribution tasks. The authors developed a toolkit to synthesize multi-platform GUI grounding data, resulting in a cross-platform corpus of over 13 million GUI elements. OS-Atlas demonstrates significant performance improvements across six benchmarks spanning mobile, desktop, and web platforms.
-
EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data
- Xuetian Chen, Hangcheng Li, Jiaqing Liang, Sihang Jiang, Deqing Yang
- 🏛️ Institutions: Fudan University
- 📅 Date: October 25, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [dataset], [framework], [synthetic data]
- 📖 TLDR: The EDGE framework proposes an innovative approach to improve GUI understanding and interaction capabilities in vision-language models through large-scale, multi-granularity synthetic data generation. By leveraging webpage data, EDGE minimizes the need for manual annotations and enhances the adaptability of models across desktop and mobile GUI environments. Evaluations show its effectiveness in diverse GUI-related tasks, contributing significantly to autonomous agent development in GUI navigation and interaction.
-
VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks
- Lawrence Jang, Yinheng Li, Charles Ding, Justin Lin, Paul Pu Liang, Dan Zhao, Rogerio Bonatti, Kazuhito Koishida
- 🏛️ Institutions: CMU, MIT, NYU, Microsoft
- 📅 Date: October 24, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [dataset], [video understanding], [long-context], [VideoWA]
- 📖 TLDR: This paper introduces VideoWebArena (VideoWA), a benchmark assessing multimodal agents in video-based tasks. It features over 2,000 tasks focused on skill and factual retention, using video tutorials to simulate long-context environments. Results highlight current challenges in agentic abilities, providing a critical testbed for long-context video understanding improvements.
-
AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?
- Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, Jonathan Berant
- 🏛️ Institutions: Tel Aviv University
- 📅 Date: October 21, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [dataset], [planning and reasoning]
- 📖 TLDR: AssistantBench is a benchmark designed to test the abilities of web agents in completing time-intensive, realistic web-based tasks. It comprises 214 tasks across various domains, and the authors introduce the SPA (See-Plan-Act) framework to handle multi-step planning and memory retention. AssistantBench emphasizes realistic task completion, showing that current agents achieve only modest success, with significant improvements needed for complex information synthesis and execution across multiple web domains.
-
Dissecting Adversarial Robustness of Multimodal LM Agents
- Chen Henry Wu, Rishi Rajesh Shah, Jing Yu Koh, Russ Salakhutdinov, Daniel Fried, Aditi Raghunathan
- 🏛️ Institutions: CMU, Stanford
- 📅 Date: October 21, 2024
- 📑 Publisher: NeurIPS 2024 Workshop
- 💻 Env: [Web]
- 🔑 Key: [dataset], [attack], [ARE], [safety]
- 📖 TLDR: This paper introduces the Agent Robustness Evaluation (ARE) framework to assess the adversarial robustness of multimodal language model agents in web environments. By creating 200 targeted adversarial tasks within VisualWebArena, the study reveals that minimal perturbations can significantly compromise agent performance, even in advanced systems utilizing reflection and tree-search mechanisms. The findings highlight the need for enhanced safety measures in deploying such agents.
-
Harnessing Webpage UIs for Text-Rich Visual Understanding
- Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue
- 🏛️ Institutions: CMU
- 📅 Date: October 17, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web], [Doc]
- 🔑 Key: [dataset], [model], [text-rich visual understanding], [web UI comprehension]
- 📖 TLDR: This paper introduces MultiUI, a large-scale dataset containing 7.3 million annotated samples from 1 million websites, specifically designed to enhance multimodal large language models’ (MLLMs) capabilities in text-rich visual understanding. Utilizing webpage UI structures as a training resource, MultiUI provides robust accessibility tree data paired with UI screenshots, significantly improving MLLMs’ grounding, OCR, and interaction performance. Models trained with MultiUI achieve up to a 48% performance boost on VisualWebBench and demonstrate enhanced generalization across non-web tasks.
-
AutoWebGLM: A Large Language Model-based Web Navigating Agent
- Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, Jie Tang
- 🏛️ Institutions: Tsinghua University, OSU
- 📅 Date: October 12, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [framework], [dataset], [benchmark], [reinforcement learning]
- 📖 TLDR: AutoWebGLM introduces a web navigation agent based on ChatGLM3-6B, designed to autonomously navigate and interact with webpages for complex tasks. The paper highlights a two-phase data construction approach using a hybrid human-AI methodology for diverse, curriculum-based web task training. It also presents AutoWebBench, a benchmark for evaluating agent performance in web tasks, and uses reinforcement learning to fine-tune operations, addressing complex webpage interaction and grounding.
-
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
- Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda, Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt, Dirk Groeneveld, Crystal Nam, Sophie Lebrecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna, Luca Weihs, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, Aniruddha Kembhavi
- 🏛️ Institutions: AI2, UW
- 📅 Date: September 25, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Misc]
- 🔑 Key: [model], [dataset], [PixMo], [Molmo], [vision language model], [foundation model]
- 📖 TLDR: This paper introduces Molmo, a family of state-of-the-art open vision-language models (VLMs), and PixMo, a collection of new datasets including detailed image captions, free-form image Q&A, and innovative 2D pointing data, all collected without reliance on proprietary VLMs. The authors demonstrate that careful model design, a well-tuned training pipeline, and high-quality open datasets can produce VLMs that outperform existing open models and rival proprietary systems. The model weights, datasets, and source code are made publicly available to advance research in this field.
-
MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding
- Qinzhuo Wu, Weikai Xu, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, Shuo Shang
- 🏛️ Institutions: XiaoMi AI Lab, University of Electronic Science and Technology of China, Renmin University of China
- 📅 Date: September 23, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [model], [dataset], [MobileVLM], [Mobile3M], [UI understanding]
- 📖 TLDR: This paper introduces MobileVLM, a vision-language model designed to enhance both intra- and inter-UI understanding for mobile applications. The authors propose two additional pre-training stages with four specific UI-based tasks to improve the model's perception of fine-grained elements and capture page transition actions. To support this, they constructed Mobile3M, a large-scale Chinese mobile dataset comprising 3 million UI pages and real-world transition actions, organized into directed graphs. Experimental results demonstrate that MobileVLM outperforms existing vision-language models on both in-house test sets and public mobile benchmarks.
-
From Grounding to Planning: Benchmarking Bottlenecks in Web Agents
- Segev Shlomov, Ben Wiesel, Aviad Sela, Ido Levy, Liane Galanti, Roy Abitbol
- 🏛️ Institutions: IBM
- 📅 Date: September 3, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [planning], [grounding], [Mind2Web dataset], [web navigation]
- 📖 TLDR: This paper analyzes performance bottlenecks in web agents by evaluating grounding and planning separately, isolating the impact of each on navigation efficacy. Using an enhanced version of the Mind2Web dataset, the study identifies planning as a significant bottleneck, alongside reported advancements in grounding and in task-specific benchmarking for elements such as UI component recognition. Based on these experiments, the authors propose a refined evaluation framework aimed at improving web agents' contextual adaptability and accuracy in complex web environments.
-
TinyAgent: Function Calling at the Edge
- Lutfi Eren Erdogan, Nicholas Lee, Siddharth Jha, Sehoon Kim, Ryan Tabrizi, Suhong Moon, Coleman Hooper, Gopala Anumanchipalli, Kurt Keutzer, Amir Gholami
- 🏛️ Institutions: UC Berkeley, ICSI
- 📅 Date: September 1, 2024
- 📑 Publisher: EMNLP 2024
- 💻 Env: [Desktop]
- 🔑 Key: [framework], [dataset], [quantization], [LLMCompiler], [TinyAgent-1.1B], [TinyAgent-7B]
- 📖 TLDR: This paper introduces TinyAgent, an end-to-end framework for training and deploying task-specific small language model agents capable of function calling at the edge. By fine-tuning small models with curated datasets and employing techniques like quantization and a novel tool retrieval method, TinyAgent enables efficient, real-time execution of user commands on local devices without relying on cloud infrastructure. The framework demonstrates that these small models can match or even surpass the function-calling capabilities of larger models like GPT-4-Turbo while operating entirely on edge devices.
-
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
- Xiao Liu, Tianjie Zhang, Yu Gu, Iat Long Iong, Yifan Xu, Xixuan Song, Shudan Zhang, Hanyu Lai, Xinyi Liu, Hanlin Zhao, Jiadai Sun, Xinyue Yang, Yu Yang, Zehan Qi, Shuntian Yao, Xueqiao Sun, Siyi Cheng, Qinkai Zheng, Hao Yu, Hanchen Zhang, Wenyi Hong, Ming Ding, Lihang Pan, Xiaotao Gu, Aohan Zeng, Zhengxiao Du, Chan Hee Song, Yu Su, Yuxiao Dong, Jie Tang
- 🏛️ Institutions: Tsinghua University, MSRA, The Ohio State University
- 📅 Date: August 12, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [benchmark], [dataset], [VisualAgentBench], [VAB]
- 📖 TLDR: The authors introduce VisualAgentBench (VAB), a comprehensive benchmark designed to train and evaluate large multimodal models (LMMs) as visual foundation agents across diverse scenarios, including embodied tasks, graphical user interfaces, and visual design. VAB comprises five distinct environments that systematically challenge LMMs' understanding and interaction capabilities. Additionally, the benchmark offers supervised fine-tuning trajectory data for behavior cloning training, demonstrating the potential to improve open LMMs for serving as visual foundation agents.
-
OmniParser for Pure Vision Based GUI Agent
- Yadong Lu, Jianwei Yang, Yelong Shen, Ahmed Awadallah
- 🏛️ Institutions: MSR, Microsoft Gen AI
- 📅 Date: August 1, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [framework], [dataset], [OmniParser]
- 📖 TLDR: This paper introduces OmniParser, a method for parsing user interface screenshots into structured elements, enhancing the ability of models like GPT-4V to generate actions accurately grounded in corresponding UI regions. The authors curated datasets for interactable icon detection and icon description, fine-tuning models to parse interactable regions and extract functional semantics of UI elements.
-
Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?
- Ruisheng Cao, Fangyu Lei, Haoyuan Wu, Jixuan Chen, Yeqiao Fu, Hongcheng Gao, Xinzhuang Xiong, Hanchong Zhang, Yuchen Mao, Wenjing Hu, Tianbao Xie, Hongsheng Xu, Danyang Zhang, Sida Wang, Ruoxi Sun, Pengcheng Yin, Caiming Xiong, Ansong Ni, Qian Liu, Victor Zhong, Lu Chen, Kai Yu, Tao Yu
- 🏛️ Institutions: HKU, SJTU, Google Cloud AI Research, Google DeepMind, Salesforce Research, Yale University, Sea AI Lab, University of Waterloo
- 📅 Date: July 15, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Desktop]
- 🔑 Key: [benchmark], [dataset], [data science], [engineering workflows], [Spider2-V]
- 📖 TLDR: This paper introduces Spider2-V, a multimodal agent benchmark designed to evaluate the capability of agents in automating professional data science and engineering workflows. It comprises 494 real-world tasks across 20 enterprise-level applications, assessing agents' proficiency in code generation and GUI operations within authentic computer environments.
-
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents
- Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Dingyu Zhang, Peng Gao, Shuai Ren, Hongsheng Li
- 🏛️ Institutions: CUHK, SJTU, Shanghai AI Lab, vivo AI Lab
- 📅 Date: July 3, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [dataset], [benchmark], [AMEX]
- 📖 TLDR: This paper introduces the Android Multi-annotation EXpo (AMEX), a comprehensive dataset designed for training and evaluating mobile GUI-control agents. AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, annotated at multiple levels, including GUI interactive element grounding, functionality descriptions, and complex natural language instructions. The dataset aims to advance research on AI agents capable of completing complex tasks by interacting directly with mobile device GUIs.
-
Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding
- Yue Fan, Lei Ding, Ching-Chen Kuo, Shan Jiang, Yang Zhao, Xinze Guan, Jie Yang, Yi Zhang, Xin Eric Wang
- 🏛️ Institutions: UCSC, MSR
- 📅 Date: June 27, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [framework], [dataset], [ToL], [screen reading], [accessibility]
- 📖 TLDR: The authors propose the Tree-of-Lens (ToL) agent to address the Screen Point-and-Read (ScreenPR) task, which involves generating natural language descriptions of screen regions based on user-indicated points. The ToL agent constructs a Hierarchical Layout Tree to comprehend the content and articulate the layout and spatial relationships between elements. The authors also introduce the ScreenPR benchmark, consisting of 650 screenshots from web, mobile, and operating system GUIs, manually annotated with 1,500 target points and regions.
-
VGA: Vision GUI Assistant -- Minimizing Hallucinations through Image-Centric Fine-Tuning
- Ziyang Meng, Yu Dai, Zezheng Gong, Shaoxiong Guo, Minglong Tang, Tongquan Wei
- 🏛️ Institutions: SJTU
- 📅 Date: June 20, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [model], [dataset], [framework], [VGA], [hallucination]
- 📖 TLDR: This paper introduces VGA, a fine-tuned model designed to enhance GUI comprehension by reducing hallucinations. The authors constructed a Vision Question Answering (VQA) dataset of 63.8k high-quality examples using a Referent Method, ensuring model responses are highly dependent on visual content. They also propose a two-stage fine-tuning method called Foundation and Advanced Comprehension (FAC) to improve the model's ability to extract information from images and align with human intent.
-
E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion
- Ke Wang, Tianyu Xia, Zhangxuan Gu, Yi Zhao, Shuheng Shen, Changhua Meng, Weiqiang Wang, Ke Xu
- 🏛️ Institutions: Ant Group, Tsinghua University
- 📅 Date: June 20, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [dataset], [benchmark], [E-ANT]
- 📖 TLDR: This paper introduces E-ANT, the first large-scale Chinese GUI navigation dataset comprising over 40,000 real human interaction traces across more than 5,000 tiny apps. The dataset includes high-quality screenshots with annotations, facilitating the evaluation and development of GUI navigation and decision-making capabilities in multimodal large language models (MLLMs). The authors also assess various MLLMs on E-ANT, providing insights into their performance and potential improvements.
-
GUI Action Narrator: Where and When Did That Action Take Place?
- Qinchen Wu, Difei Gao, Kevin Qinghong Lin, Zhuoyu Wu, Xiangwu Guo, Peiran Li, Weichen Zhang, Hengxu Wang, Mike Zheng Shou
- 🏛️ Institutions: NUS, Chinese Academy of Sciences
- 📅 Date: June 19, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Desktop]
- 🔑 Key: [dataset], [framework], [Act2Cap], [GUI Narrator]
- 📖 TLDR: The authors present Act2Cap, a GUI action dataset containing 4,189 video-caption pairs depicting various GUI actions such as clicks, drags, and typing across multiple software environments. They also propose GUI Narrator, a framework that leverages cursor detection as a visual prompt to enhance the interpretation of high-resolution screenshots for GUI video captioning. Evaluations reveal that even advanced multimodal models face challenges in this domain, highlighting the need for specialized approaches to improve performance.
-
WebCanvas: Benchmarking Web Agents in Online Environments
- Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, Zhengyang Wu
- 🏛️ Institutions: iMean AI, CMU
- 📅 Date: June 18, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [framework], [dataset], [benchmark], [Mind2Web-Live], [key-node evaluation]
- 📖 TLDR: This paper presents WebCanvas, an online evaluation framework for web agents designed to address the dynamic nature of web interactions. It introduces a key-node-based evaluation metric to capture critical actions or states necessary for task completion while disregarding noise from insignificant events or changed web elements. The framework includes the Mind2Web-Live dataset, a refined version of the original Mind2Web static dataset, containing 542 tasks with 2,439 intermediate evaluation states. Despite advancements, the best-performing model achieves a task success rate of 23.1%, highlighting substantial room for improvement.
-
GUICourse: From General Vision Language Models to Versatile GUI Agents
- Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, Yuan Yao, Yankai Lin, Zhiyuan Liu, Maosong Sun
- 🏛️ Institutions: Tsinghua University, Rhapsody AI, University of Electronic Science and Technology of China
- 📅 Date: June 17, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [dataset], [framework], [GUICourse]
- 📖 TLDR: This paper introduces GUICourse, a suite of datasets aimed at training visual-based GUI agents from general vision-language models. It addresses challenges in OCR, grounding, and GUI knowledge, enhancing the models' capabilities in GUI navigation tasks.
-
GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents
- Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, Tianshuo Zhou, Yue Yu, Chujie Gao, Qihui Zhang, Yi Gui, Zhen Li, Yao Wan, Pan Zhou, Jianfeng Gao, Lichao Sun
- 🏛️ Institutions: Huazhong University of Science and Technology (HUST), MSR, University of Illinois at Chicago (UIC)
- 📅 Date: June 16, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [dataset], [benchmark], [GUI-World], [GUI-Vid]
- 📖 TLDR: This paper introduces GUI-World, a comprehensive dataset designed to evaluate Multimodal Large Language Models (MLLMs) in dynamic and complex GUI environments. It includes over 12,000 annotated GUI interaction videos covering diverse applications and scenarios. The study highlights the limitations of current MLLMs in handling dynamic and multi-step tasks and presents GUI-Vid, a fine-tuned VideoLLM, demonstrating improved understanding of various GUI tasks.
-
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
- Quanfeng Lu, Wenqi Shao, Zitao Liu, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, Yu Qiao, Ping Luo
- 🏛️ Institutions: OpenGVLab, Shanghai AI Laboratory, HKU, Nanjing University, Harbin Institute of Technology (Shenzhen), SJTU
- 📅 Date: June 13, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [dataset], [model], [OdysseyAgent], [cross-app navigation]
- 📖 TLDR: This paper presents GUI Odyssey, a dataset comprising 7,735 episodes from six mobile devices, designed to train and evaluate cross-app navigation agents. It spans six types of cross-app tasks across 201 apps and 1,399 app combinations. Leveraging this dataset, the authors developed OdysseyAgent, a multimodal cross-app navigation agent fine-tuned from the Qwen-VL model, demonstrating superior accuracy over existing models in both in-domain and out-of-domain scenarios.
-
On the Effects of Data Scale on UI Control Agents
- Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, Oriana Riva
- 🏛️ Institutions: Google DeepMind, Google
- 📅 Date: June 6, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [dataset], [AndroidControl], [fine-tuning], [scalability]
- 📖 TLDR: This study investigates how the performance of computer control agents scales with the amount of fine-tuning data. The authors introduce AndroidControl, a dataset comprising 15,283 demonstrations across 833 Android applications. Findings indicate that while in-domain performance improves with more data, out-of-domain performance, especially on high-level tasks, scales more slowly, suggesting that fine-tuning alone may be insufficient for robust out-of-domain performance.
-
LlamaTouch: A Faithful and Scalable Testbed for Mobile UI Automation Task Evaluation
- Li Zhang, Shihe Wang, Xianqing Jia, Zhihan Zheng, Yunhe Yan, Longxi Gao, Yuanchun Li, Mengwei Xu
- 🏛️ Institutions: BUPT, Tsinghua University
- 📅 Date: April 12, 2024
- 📑 Publisher: UIST 2024
- 💻 Env: [Mobile]
- 🔑 Key: [framework], [dataset], [benchmark], [UI automation], [mobile agent evaluation]
- 📖 TLDR: LlamaTouch is an evaluation testbed designed for mobile UI automation, enabling reliable task assessment across 495 annotated tasks. It provides a scalable solution to evaluate agents in real-world mobile settings, comparing agent actions to essential UI states for accurate task completion. LlamaTouch supports dynamic environments, advancing mobile agent reliability and scalability in task automation.
-
VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?
- Junpeng Liu, Yifan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, Xiang Yue
- 🏛️ Institutions: CMU
- 📅 Date: April 9, 2024
- 📑 Publisher: COLM 2024
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [dataset], [web page understanding], [grounding]
- 📖 TLDR: VisualWebBench introduces a comprehensive benchmark for evaluating multimodal large language models (MLLMs) on web-based tasks. It includes 1.5K human-curated instances across 139 websites in 87 sub-domains. The benchmark spans seven tasks—such as OCR, grounding, and web-based QA—aiming to test MLLMs' capabilities in fine-grained web page understanding. Results reveal significant performance gaps, particularly in grounding tasks, highlighting the need for advancement in MLLM web understanding.
-
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
- Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan
- 🏛️ Institutions: Apple
- 📅 Date: April 8, 2024
- 📑 Publisher: ECCV 2024
- 💻 Env: [Mobile]
- 🔑 Key: [model], [framework], [dataset], [benchmark], [mobile UI understanding]
- 📖 TLDR: This paper presents Ferret-UI, a multimodal large language model (MLLM) designed to understand and interact with mobile user interfaces. The model incorporates advanced capabilities for referring, grounding, and reasoning about UI elements. By training on a variety of UI tasks, Ferret-UI achieves high performance in tasks such as icon recognition and text extraction. The authors introduce a unique architecture that allows for improved visual feature extraction from mobile screens, paving the way for applications in accessibility and user interaction.
-
Benchmarking Mobile Device Control Agents across Diverse Configurations
- Juyong Lee, Taywon Min, Minyong An, Dongyoon Hahm, Haeone Lee, Changyeon Kim, Kimin Lee
- 🏛️ Institutions: KAIST, Seoul National University, Yonsei University
- 📅 Date: April 2024
- 📑 Publisher: ICLR 2024
- 💻 Env: [Mobile]
- 🔑 Key: [benchmark], [dataset], [mobile device control], [agent performance]
- 📖 TLDR: This paper presents B-MoCA, a comprehensive benchmark for evaluating mobile device control agents using an Android-based testbed with 131 tasks and various device configurations. The benchmark assesses agents' abilities across tasks that include device-specific variations, navigation, and human-like dual-gesture interactions. B-MoCA highlights that current agents perform well on basic tasks but struggle with complex configurations, pointing to opportunities for future improvements in mobile automation capabilities.
-
AgentStudio: A Toolkit for Building General Virtual Agents
- Longtao Zheng, Zhiyuan Huang, Zhenghai Xue, Xinrun Wang, Bo An, Shuicheng Yan
- 🏛️ Institutions: NTU, Skywork AI, ETH Zurich
- 📅 Date: March 26, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Desktop]
- 🔑 Key: [framework], [dataset], [general virtual agents], [open-ended learning], [tool creation], [GroundUI], [benchmark]
- 📖 TLDR: AgentStudio is a robust toolkit for developing virtual agents with versatile actions, such as GUI automation and code execution. It unifies real-world human-computer interactions across OS platforms and includes diverse observation and action spaces, facilitating comprehensive training and benchmarking in complex settings. The toolkit's flexibility promotes agent generalization across varied tasks, supporting tool creation and a multimodal interaction interface to advance agent adaptability and learning.
-
WebVLN: Vision-and-Language Navigation on Websites
- Qi Chen, Dileepa Pitawela, Chongyang Zhao, Gengze Zhou, Hsiang-Ting Chen, Qi Wu
- 🏛️ Institutions: The University of Adelaide
- 📅 Date: March 24, 2024
- 📑 Publisher: AAAI 2024
- 💻 Env: [Web]
- 🔑 Key: [framework], [dataset], [web-based VLN], [HTML content integration], [multimodal navigation]
- 📖 TLDR: This paper introduces the WebVLN task, where agents navigate websites by following natural language instructions that include questions and descriptions. Aimed at emulating real-world browsing behavior, the task allows the agent to interact with elements not directly visible in the rendered content by integrating HTML-specific information. A new WebVLN-Net model, based on the VLN BERT framework, is introduced alongside the WebVLN-v1 dataset, supporting question-answer navigation across web pages. This framework demonstrated significant improvement over existing web-based navigation methods, marking a new direction in vision-and-language navigation research.
-
Tur[k]ingBench: A Challenge Benchmark for Web Agents
- Kevin Xu, Yeganeh Kordi, Kate Sanders, Yizhong Wang, Adam Byerly, Jingyu Zhang, Benjamin Van Durme, Daniel Khashabi
- 🏛️ Institutions: JHU, Brown, UW
- 📅 Date: March 18, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [dataset], [multi-modal reasoning], [TurkingBench], [Turking]
- 📖 TLDR: This paper introduces Tur[k]ingBench, a benchmark comprising 158 web-grounded tasks designed to evaluate AI agents' capabilities in complex web-based environments. Unlike prior benchmarks that utilize synthetic web pages, Tur[k]ingBench leverages natural HTML pages from crowdsourcing platforms, presenting tasks with rich multi-modal contexts. The benchmark includes 32.2K instances, each with diverse inputs, challenging models to interpret and interact with web pages effectively. Evaluations of state-of-the-art models reveal significant room for improvement, highlighting the need for advanced web-based agents capable of handling real-world web interactions.
-
Android in the Zoo: Chain-of-Action-Thought for GUI Agents
- Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, Duyu Tang
- 🏛️ Institutions: Fudan University, Huawei
- 📅 Date: March 5, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [framework], [dataset], [Android GUI], [Chain-of-Action-Thought], [autonomous GUI agents]
- 📖 TLDR: This paper introduces Chain-of-Action-Thought (CoAT), a novel paradigm to improve GUI agent task completion by enabling agents to interpret previous actions, current screen content, and action rationale for next steps. The authors present the Android-In-The-Zoo (AitZ) dataset, which includes 18,643 screen-action pairs with detailed annotations, supporting CoAT's development and evaluation. The study demonstrates that fine-tuning with the AitZ dataset improves performance of a baseline large language model in predicting correct action sequences in Android tasks.
-
On the Multi-turn Instruction Following for Conversational Web Agents
- Yang Deng, Xuan Zhang, Wenxuan Zhang, Yifei Yuan, See-Kiong Ng, Tat-Seng Chua
- 🏛️ Institutions: NUS, DAMO Academy, University of Copenhagen
- 📅 Date: February 23, 2024
- 📑 Publisher: ACL 2024
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [dataset], [multi-turn dialogue], [memory utilization], [self-reflective planning]
- 📖 TLDR: This paper explores multi-turn conversational web navigation, introducing the MT-Mind2Web dataset to support instruction-following tasks for web agents. The proposed Self-MAP (Self-Reflective Memory-Augmented Planning) framework enhances agent performance by integrating memory with self-reflection for sequential decision-making in complex interactions. Extensive evaluations using MT-Mind2Web demonstrate Self-MAP's efficacy in addressing the limitations of current models in multi-turn interactions, providing a novel dataset and framework for evaluating and training agents on detailed, multi-step web-based tasks.
-
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
- Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma
- 🏛️ Institutions: Google DeepMind
- 📅 Date: February 7, 2024
- 📑 Publisher: IJCAI 2024
- 💻 Env: [GUI]
- 🔑 Key: [model], [dataset], [UI understanding], [infographics understanding], [vision language model]
- 📖 TLDR: This paper introduces ScreenAI, a vision-language model specializing in UI and infographics understanding. The model combines the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. ScreenAI achieves state-of-the-art results on several UI and infographics-based tasks, outperforming larger models. The authors also release three new datasets for screen annotation and question answering tasks.
-
WebLINX: Real-World Website Navigation with Multi-Turn Dialogue
- Xing Han Lu, Zdeněk Kasner, Siva Reddy
- 🏛️ Institutions: Mila, McGill University
- 📅 Date: February 2024
- 📑 Publisher: ICML 2024
- 💻 Env: [Web]
- 🔑 Key: [framework], [dataset], [benchmark], [multi-turn dialogue], [real-world navigation], [WebLINX]
- 📖 TLDR: WebLINX addresses the complexity of real-world website navigation for conversational agents, with a benchmark featuring over 2,300 demonstrations across 150+ websites. The benchmark allows agents to handle multi-turn instructions and interact dynamically across diverse domains, including geographic and thematic categories. The study proposes a retrieval-inspired model that selectively extracts key HTML elements and browser actions, achieving efficient task-specific representations. Experiments reveal that smaller finetuned decoders outperform larger zero-shot multimodal models, though generalization to new environments remains challenging.
-
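OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
- Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, Ruslan Salakhutdinov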
- 🏛️ Institutions: CMU
- 📅 Date: February 2024
- 📑 Publisher: arXiv
- 💻 Env: [Desktop]
- 🔑 Key: [dataset], [benchmark]
- 📖 TLDR: OmniACT introduces a dataset and benchmark to train and evaluate multimodal agents capable of autonomously performing diverse tasks across desktop and web environments. Using annotated UI elements across applications, it combines visual grounding with natural language instructions, providing 9,802 data points for developing agents that integrate high-level reasoning with UI interactions. The study highlights the limited proficiency of current models, with baselines like GPT-4 only achieving 15% of human performance on executable scripts, emphasizing OmniACT's potential as a testbed for advancing multimodal AI.
-
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
- Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried
- 🏛️ Institutions: CMU
- 📅 Date: January 24, 2024
- 📑 Publisher: ACL 2024
- 💻 Env: [Web]
- 🔑 Key: [framework], [benchmark], [dataset], [multimodal agent evaluation], [visually grounded tasks]
- 📖 TLDR: VisualWebArena is a benchmark designed for testing multimodal web agents on complex, visually grounded web tasks. It provides a reproducible framework with 910 task scenarios across real-world web applications, emphasizing open-ended, visually guided interactions. The tasks are modeled within a partially observable Markov decision process to assess agents’ capacity to interpret multimodal inputs, execute navigation, and accomplish user-defined objectives across complex visual and textual information on websites.
-
GPT-4V(ision) is a Generalist Web Agent, if Grounded
- Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su
- 🏛️ Institutions: OSU
- 📅 Date: January 1, 2024
- 📑 Publisher: ICML 2024
- 💻 Env: [Web]
- 🔑 Key: [framework], [dataset], [benchmark], [grounding], [SeeAct], [Multimodal-Mind2web]
- 📖 TLDR: This paper explores the capability of GPT-4V(ision), a multimodal model, as a web agent that can perform tasks across various websites by following natural language instructions. It introduces the SeeAct framework, enabling GPT-4V to navigate, interpret, and interact with elements on websites. Evaluated using the Mind2Web benchmark and an online test environment, the framework demonstrates high performance on complex web tasks by integrating grounding strategies like element attributes and image annotations to improve HTML element targeting. However, grounding remains challenging, presenting opportunities for further improvement.
-
Multimodal Web Navigation with Instruction-Finetuned Foundation Models
- Hiroki Furuta, Kuang-Huei Lee, Ofir Nachum, Yutaka Matsuo, Aleksandra Faust, Shixiang Shane Gu, Izzeddin Gur
- 🏛️ Institutions: Univ. of Tokyo, Google DeepMind
- 📅 Date: Jan 1, 2024
- 📑 Publisher: ICLR 2024
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [model], [dataset], [web navigation], [instruction-following], [WebShop]
- 📖 TLDR: This paper introduces WebGUM, an instruction-following multimodal agent for autonomous web navigation that leverages both visual (webpage screenshots) and textual (HTML) inputs to perform actions such as click and type. The model is trained on a vast corpus of demonstrations and shows improved capabilities in visual perception, HTML comprehension, and multi-step decision-making, achieving state-of-the-art performance on benchmarks like MiniWoB and WebShop. WebGUM provides a scalable approach to web-based tasks without task-specific architectures, enabling high-performance web navigation with generalizable, multimodal foundation models.
-
AssistGUI: Task-Oriented Desktop Graphical User Interface Automation
- Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, Hengxu Wang, Luowei Zhou, Mike Zheng Shou
- 🏛️ Institutions: NUS
- 📅 Date: December 20, 2023
- 📑 Publisher: CVPR 2024
- 💻 Env: [Desktop]
- 🔑 Key: [framework], [dataset], [benchmark], [desktop productivity tasks]
- 📖 TLDR: This study presents AssistGUI, a benchmark and framework for desktop GUI automation, featuring an LLM-based agent capable of completing complex user requests by analyzing instructional videos and performing actions on the desktop. Utilizing a novel Actor-Critic framework and GUI parser, AssistGUI was tested on 100 tasks across nine applications, such as MS Word and After Effects. Despite advances, the top-performing model achieved only a 46% success rate, illustrating the challenge of comprehensive desktop automation and underscoring areas for future research in agent-driven GUI tasks.
-
CogAgent: A Visual Language Model for GUI Agents
- Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxuan Zhang, Juanzi Li, Bin Xu, Yuxiao Dong, Ming Ding, Jie Tang
- 🏛️ Institutions: Tsinghua University, Zhipu AI
- 📅 Date: December 15, 2023
- 📑 Publisher: CVPR 2024
- 💻 Env: [GUI]
- 🔑 Key: [model], [dataset], [benchmark], [visual language model], [GUI agent]
- 📖 TLDR: This paper presents CogAgent, an 18-billion-parameter visual language model specializing in GUI understanding and navigation. By combining low-resolution and high-resolution image encoders, CogAgent supports inputs at 1120×1120 resolution, which lets it recognize small page elements and text. Using only screenshots as input, it outperforms LLM-based methods that consume extracted HTML text on both PC and Android GUI navigation benchmarks (Mind2Web and AITW), and it also achieves state-of-the-art results on several text-rich and general VQA benchmarks.
-
Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models
- Fangzhi Xu, Qiushi Sun, Kanzhi Cheng, Jun Liu, Yu Qiao, Zhiyong Wu
- 🏛️ Institutions: Xi'an Jiaotong University, Shanghai AI Lab, HKU, Nanjing University
- 📅 Date: November 2023
- 📑 Publisher: arXiv
- 💻 Env: [GUI (evaluated on web, math reasoning, and logic reasoning environments)]
- 🔑 Key: [framework], [dataset], [neural-symbolic self-training], [online exploration], [self-refinement]
- 📖 TLDR: This paper introduces ENVISIONS, a neural-symbolic self-training framework designed to improve large language models (LLMs) by enabling self-training through interaction with a symbolic environment. The framework addresses symbolic data scarcity and enhances LLMs' symbolic reasoning proficiency by iteratively exploring, refining, and learning from symbolic tasks without reinforcement learning. Extensive evaluations across web navigation, math, and logical reasoning tasks highlight ENVISIONS as a promising approach for enhancing LLM symbolic processing.
-
You Only Look at Screens: Multimodal Chain-of-Action Agents
- Zhuosheng Zhang, Aston Zhang
- 🏛️ Institutions: SJTU
- 📅 Date: September 20, 2023
- 📑 Publisher: ICLR 2024
- 💻 Env: [GUI]
- 🔑 Key: [framework], [dataset], [benchmark], [multimodal agent], [chain-of-action technique]
- 📖 TLDR: This paper presents Auto-GUI, a multimodal agent capable of directly interacting with graphical user interfaces without relying on environment parsing or application-specific APIs. The authors introduce a novel chain-of-action technique that leverages previous action histories and future action plans to improve decision-making. Auto-GUI is evaluated on a new device-control benchmark, AITW, demonstrating state-of-the-art performance in action prediction and task completion across various applications and web-based tasks.
-
AutoDroid: LLM-powered Task Automation in Android
- Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, Yunxin Liu
- 🏛️ Institutions: Tsinghua University, Shanghai AI Lab, University of Notre Dame, MSR
- 📅 Date: August 29, 2023
- 📑 Publisher: MobiCom 2024
- 💻 Env: [Mobile]
- 🔑 Key: [framework], [dataset], [benchmark], [Android task automation], [LLM-powered agent]
- 📖 TLDR: This paper introduces AutoDroid, a novel mobile task automation system capable of handling arbitrary tasks on any Android application without manual efforts. The framework combines the commonsense knowledge of LLMs with domain-specific knowledge of apps through automated dynamic analysis. AutoDroid features a functionality-aware UI representation method, exploration-based memory injection techniques, and a multi-granularity query optimization module. Evaluated on a new benchmark with 158 common tasks, AutoDroid achieves a 90.9% action generation accuracy and a 71.3% task completion rate, significantly outperforming GPT-4-powered baselines.
-
Android in the Wild: A Large-Scale Dataset for Android Device Control
- Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, Timothy Lillicrap
- 🏛️ Institutions: Google Research, Google DeepMind
- 📅 Date: July 19, 2023
- 📑 Publisher: NeurIPS 2023
- 💻 Env: [Mobile]
- 🔑 Key: [dataset], [benchmark], [device control], [natural language interaction], [gesture-based actions]
- 📖 TLDR: The Android in the Wild (AitW) dataset introduces a significant benchmark for Android device control, encompassing over 715,000 human-labeled episodes with natural language commands and corresponding UI actions. Collected from Android devices across versions 10-13, it captures complex multi-step tasks requiring both visual and contextual understanding. The dataset is structured to test the robustness of device-control systems under varying conditions, such as new tasks or applications, and includes data to evaluate gesture-based interactions, providing a unique foundation for mobile interface automation and task execution research.
-
Mind2Web: Towards a Generalist Agent for the Web
- Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, Yu Su
- 🏛️ Institutions: OSU
- 📅 Date: June 9, 2023
- 📑 Publisher: NeurIPS 2023
- 💻 Env: [Web]
- 🔑 Key: [dataset], [benchmark], [model], [Mind2Web], [MindAct]
- 📖 TLDR: Mind2Web presents a dataset and benchmark specifically crafted for generalist web agents capable of performing language-guided tasks across varied websites. Featuring over 2,000 tasks from 137 sites, it spans 31 domains and emphasizes open-ended, realistic tasks in authentic, unsimplified web settings. The study proposes the MindAct framework, which optimizes LLMs for handling complex HTML elements by using small LMs to rank elements before full processing, thereby enhancing the efficiency and versatility of web agents in diverse contexts.
-
Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction
- Danyang Zhang, Zhennan Shen, Rui Xie, Situo Zhang, Tianbao Xie, Zihan Zhao, Siyuan Chen, Lu Chen, Hongshen Xu, Ruisheng Cao, Kai Yu
- 🏛️ Institutions: SJTU, HKU
- 📅 Date: May 14, 2023
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [benchmark], [dataset], [interaction platform], [multistep interaction], [InfoUI]
- 📖 TLDR: This paper introduces Mobile-Env, a novel interaction platform and benchmark aimed at assessing large language models' (LLMs) capabilities in interactive environments. It builds on the InfoUI task set, derived from WikiHow, to create structured text-based challenges that simulate real-world mobile interactions. The platform is designed to support task expansions from the community, aiming to drive advancements in LLM-based interactive agents.
-
WebUI: A Dataset for Enhancing Visual UI Understanding with Web Semantics
- Jason Wu, Siyan Wang, Siman Shen, Yi-Hao Peng, Jeffrey Nichols, Jeffrey P. Bigham
- 🏛️ Institutions: CMU, Wellesley College, Grinnell College, Snooty Bird LLC
- 📅 Date: January 30, 2023
- 📑 Publisher: CHI 2023
- 💻 Env: [Web]
- 🔑 Key: [dataset], [element detection], [screen classification], [screen similarity], [UI modeling]
- 📖 TLDR: The WebUI dataset includes 400,000 web UIs captured to enhance UI modeling by integrating visual UI metadata. This dataset supports tasks such as element detection, screen classification, and screen similarity, especially for accessibility, app automation, and testing applications. Through transfer learning and semi-supervised methods, WebUI addresses the challenge of training robust models with limited labeled mobile data, proving effective in tasks beyond web contexts, such as mobile UIs.
-
Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus
- Gang Li, Yang Li
- 🏛️ Institutions: Google Research
- 📅 Date: September 29, 2022
- 📑 Publisher: ICLR 2023
- 💻 Env: [Mobile]
- 🔑 Key: [framework], [model], [dataset], [mobile UI tasks], [region-based focus]
- 📖 TLDR: This paper introduces "Spotlight," a vision-language model for mobile UI understanding that operates solely on visual inputs (screenshots) and a specified focus region on the screen. By leveraging a large-scale dataset and training strategies tailored to mobile interfaces, Spotlight performs multiple UI-related tasks, including widget captioning, screen summarization, command grounding, and tappability prediction. It utilizes a vision-only approach, avoiding reliance on view hierarchies to achieve greater robustness and scalability across different mobile UI environments.
-
WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents
- Shunyu Yao, Howard Chen, John Yang, Karthik Narasimhan
- 🏛️ Institutions: Princeton University
- 📅 Date: July 2022
- 📑 Publisher: NeurIPS 2022
- 💻 Env: [Web]
- 🔑 Key: [framework], [dataset], [benchmark], [e-commerce web interaction], [language grounding]
- 📖 TLDR: This paper introduces WebShop, a simulated web-based shopping environment with over 1 million real-world products and 12,087 annotated instructions. It allows language agents to navigate, search, and make purchases based on natural language commands. The study explores how agents handle compositional instructions and noisy web data, providing a robust environment for reinforcement learning and imitation learning. The best models show effective sim-to-real transfer on websites like Amazon, illustrating WebShop’s potential for training grounded agents.
-
META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI
- Liangtai Sun, Xingyu Chen, Lu Chen, Tianle Dai, Zichen Zhu, Kai Yu
- 🏛️ Institutions: SJTU
- 📅 Date: May 23, 2022
- 📑 Publisher: EMNLP 2022
- 💻 Env: [Mobile]
- 🔑 Key: [benchmark], [dataset], [task-oriented dialogue], [GUI-based interaction], [multi-modal agent]
- 📖 TLDR: This paper presents META-GUI, a dataset and framework for training multi-modal conversational agents capable of interacting directly with mobile app interfaces without the need for backend APIs. META-GUI includes over 1,100 dialogues with annotated action sequences on various tasks such as booking and scheduling. The authors propose a GUI-based task-oriented dialogue system that allows agents to navigate mobile interfaces via direct GUI actions, with performance shown to improve in multi-modal task-oriented dialogue contexts.
-
A Data-Driven Approach for Learning to Control Computers
- Peter C. Humphreys, David Raposo, Tobias Pohlen, Gregory Thornton, Rachita Chhaparia, Alistair Muldal, Josh Abramson, Petko Georgiev, Alex Goldin, Adam Santoro, Timothy Lillicrap
- 🏛️ Institutions: DeepMind
- 📅 Date: February 16, 2022
- 📑 Publisher: ICML 2022
- 💻 Env: [Desktop]
- 🔑 Key: [dataset], [framework], [computer control], [reinforcement learning], [multimodal transformer]
- 📖 TLDR: This study presents a reinforcement learning-based approach to train agents for computer control tasks, using keyboard and mouse interactions guided by natural language. By leveraging human demonstration data, agents trained in this environment achieved strong cross-task generalization across the MiniWoB++ benchmark. This framework demonstrates how agents can control computers as humans would, enabling enhanced performance in complex computer tasks with high transferability.
-
A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility
- Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, Bryan A. Plummer
- 🏛️ Institutions: Boston University, UIUC
- 📅 Date: February 4, 2022
- 📑 Publisher: ECCV 2022
- 💻 Env: [Mobile]
- 🔑 Key: [dataset], [feasibility prediction], [vision-language navigation], [mobile interaction]
- 📖 TLDR: This paper introduces the Mobile App Tasks with Iterative Feedback (MoTIF) dataset, which addresses vision-language navigation (VLN) with a focus on task feasibility uncertainty in mobile applications. MoTIF provides commands paired with mobile actions and feasibility annotations, allowing researchers to examine the impact of command feasibility on task completion. The dataset includes 125 apps and emphasizes diverse app environments, action sequences, and follow-up questions to improve task ambiguity resolution, making it a valuable resource for feasibility prediction research.
-
Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning
- Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, Yang Li
- 🏛️ Institutions: University of Toronto, Google Research
- 📅 Date: August 6, 2021
- 📑 Publisher: UIST 2021
- 💻 Env: [Mobile]
- 🔑 Key: [framework], [dataset], [mobile UI summarization], [multimodal learning], [Transformer model]
- 📖 TLDR: The paper introduces Screen2Words, an approach that utilizes multimodal learning to generate descriptive language summaries for mobile UI screens, combining textual, visual, and structural data from screens. The study created a large-scale dataset with 112,085 annotated screen summaries for 22,417 unique UIs, aiming to support model training for mobile UI understanding. The dataset facilitates a Transformer-based model trained to summarize screens by highlighting main functionalities, and the approach is validated with benchmarks in the mobile environment.
-
UIBert: Learning Generic Multimodal Representations for UI Understanding
- Chongyang Bai, Xiaoxue Zang, Ying Xu, Srinivas Sunkara, Abhinav Rastogi, Jindong Chen, Blaise Agüera y Arcas
- 🏛️ Institutions: Google Research
- 📅 Date: July 29, 2021
- 📑 Publisher: IJCAI 2021
- 💻 Env: [Mobile]
- 🔑 Key: [framework], [model], [dataset], [multimodal representation learning], [UI understanding]
- 📖 TLDR: This paper presents UIBert, a multimodal model aimed at understanding user interfaces (UIs) by combining visual, textual, and structural metadata. UIBert is designed for tasks such as component retrieval and expression resolution, using a transformer-based joint image-text model. The authors introduce five novel pre-training tasks to leverage UI-specific features, enhancing accessibility and task completion in mobile applications. UIBert demonstrates superior performance on nine downstream UI tasks, highlighting the potential of multimodal pre-training in UI understanding.
-
WebSRC: A Dataset for Web-Based Structural Reading Comprehension
- Lu Chen, Zihan Zhao, Xingyu Chen, Danyang Zhang, Jiabao Ji, Ao Luo, Yuxuan Xiong, Kai Yu
- 🏛️ Institutions: SJTU
- 📅 Date: January 23, 2021
- 📑 Publisher: EMNLP 2021
- 💻 Env: [Web]
- 🔑 Key: [dataset], [structural reading comprehension], [web page QA], [structural information], [HTML element alignment]
- 📖 TLDR: This paper introduces WebSRC, a dataset specifically designed for web-based structural reading comprehension, which requires understanding not only textual content but also the structural layout of web pages. WebSRC consists of 0.44 million question-answer pairs derived from 6,500 complex web pages. Each question challenges models to identify answers from HTML structures or to respond with yes/no, requiring a nuanced grasp of HTML and layout features. The authors benchmark several models on this dataset, highlighting its difficulty and the critical role of structural comprehension in improving machine understanding of web content.
-
Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements
- Yang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, Zhiwei Guan
- 🏛️ Institutions: Google Research
- 📅 Date: November 2020
- 📑 Publisher: EMNLP 2020
- 💻 Env: [Mobile]
- 🔑 Key: [dataset], [benchmark], [model], [accessibility], [natural language generation], [WidgetCaption]
- 📖 TLDR: This paper introduces the task of widget captioning, which aims to automatically generate natural language descriptions for UI elements in mobile apps to enhance accessibility. Using both visual and structural data from UI components, the study presents a novel dataset of 162,859 captions across 61,285 UI elements. Multiple deep learning models were tested on this dataset, with findings suggesting the potential for improving screen reader usability for visually impaired users by generating descriptive captions of UI elements.
-
Mapping Natural Language Instructions to Mobile UI Action Sequences
- Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, Jason Baldridge
- 🏛️ Institutions: Google Research
- 📅 Date: July 2020
- 📑 Publisher: ACL 2020
- 💻 Env: [Mobile]
- 🔑 Key: [framework], [dataset], [mobile UI automation], [natural language instructions], [action grounding], [RicoSCA]
- 📖 TLDR: This paper introduces a method for grounding natural language instructions to mobile UI actions, aiming to automate mobile task execution through user interface manipulation. It introduces three key datasets: PixelHelp for task instruction-performance mappings on a Pixel emulator, AndroidHowTo for detailed phrase extraction, and RicoSCA for synthetic UI command training. The system utilizes a Transformer model to extract action phrase tuples, aligning them to UI elements with contextual screen positioning. Achieving over 70% accuracy in task completion, this approach is foundational for natural language-driven mobile UI automation.
-
Rico: A Mobile App Dataset for Building Data-Driven Design Applications
- Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, Ranjitha Kumar
- 🏛️ Institutions: UIUC, Northwestern University, Google
- 📅 Date: October 20, 2017
- 📑 Publisher: UIST 2017
- 💻 Env: [Mobile]
- 🔑 Key: [dataset], [mobile UI], [UI design analysis], [interaction mining], [RICO]
- 📖 TLDR: This paper introduces Rico, a large-scale dataset comprising UI screens and view hierarchies from over 9,000 Android apps, designed to aid in understanding mobile app design. Rico supports a variety of tasks, including UI design analysis and interaction mining, by providing labeled UI components, screenshots, and interaction traces.
-
World of Bits: An Open-Domain Platform for Web-Based Agents
- Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, Percy Liang
- 🏛️ Institutions: Stanford, OpenAI
- 📅 Date: August 2017
- 📑 Publisher: ICML 2017
- 💻 Env: [Web]
- 🔑 Key: [framework], [dataset], [reinforcement learning], [open-domain]
- 📖 TLDR: This paper introduces World of Bits (WoB), a platform enabling agents to perform complex web-based tasks using low-level keyboard and mouse actions, addressing the lack of open-domain realism in existing reinforcement learning environments. WoB leverages a novel framework where crowdworkers create tasks with structured rewards and reproducibility by caching web interactions, forming a stable training environment. The authors validate WoB by training agents via behavioral cloning and reinforcement learning to accomplish various real-world tasks, showcasing its potential as an effective platform for reinforcement learning on web tasks.