  • AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents

    • Yifan Xu, Xiao Liu, Xueqiao Sun, Siyi Cheng, Hao Yu, Hanyu Lai, Shudan Zhang, Dan Zhang, Jie Tang, Yuxiao Dong
    • 🏛️ Institutions: Tsinghua University, Peking University
    • 📅 Date: October 31, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [framework], [dataset], [benchmark], [AndroidLab]
    • 📖 TLDR: This paper introduces AndroidLab, a comprehensive framework for training and systematically benchmarking Android autonomous agents. It provides an operational environment with diverse modalities and action spaces, supporting both large language models (LLMs) and multimodal models (LMMs). The benchmark includes 138 tasks across nine apps on predefined Android virtual devices. Utilizing AndroidLab, the authors developed an Android Instruction dataset and trained six open-source LLMs and LMMs, significantly improving their average success rates.
  • Lightweight Neural App Control

    • Filippos Christianos, Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao
    • 🏛️ Institutions: Huawei Noah's Ark Lab, UCL
    • 📅 Date: October 23, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [framework], [vision language model], [Action Transformer], [app agent], [Android control], [multi-modal]
    • 📖 TLDR: This paper introduces LiMAC, a mobile control framework for Android that integrates an Action Transformer and fine-tuned vision-language models to execute precise actions in mobile apps. Tested on open-source datasets, LiMAC improves action accuracy by up to 42% over traditional prompt engineering baselines, demonstrating enhanced efficiency and accuracy in mobile app control tasks.
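
A minimal routing sketch of the division of labor described above, assuming a setup where a small action transformer picks the action type and target while the heavier fine-tuned VLM is invoked only when free-form text must be generated; every interface below is hypothetical, not the released LiMAC code.

```python
# Hypothetical interfaces illustrating a LiMAC-style split between a light
# action transformer and a heavier VLM; not the authors' implementation.
from typing import Any

def act_transformer(screenshot: bytes, ui_tree: Any, goal: str) -> dict:
    """Placeholder: returns e.g. {"type": "click", "element_id": 7}
    or {"type": "input-text", "element_id": 3}."""
    raise NotImplementedError

def vlm_generate_text(screenshot: bytes, goal: str, element_id: int) -> str:
    """Placeholder: the fine-tuned VLM writes the content for a text field."""
    raise NotImplementedError

def decide(screenshot: bytes, ui_tree: Any, goal: str) -> dict:
    action = act_transformer(screenshot, ui_tree, goal)
    if action["type"] == "input-text":      # only then pay for the VLM call
        action["text"] = vlm_generate_text(screenshot, goal, action["element_id"])
    return action
```
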
  • MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control

    • Juyong Lee, Dongyoon Hahm, June Suk Choi, W. Bradley Knox, Kimin Lee
    • 🏛️ Institutions: KAIST, UT Austin
    • 📅 Date: October 23, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [benchmark], [safety], [evaluation], [Android emulator]
    • 📖 TLDR: MobileSafetyBench introduces a benchmark for evaluating the safety of large language model (LLM)-based autonomous agents in mobile device control. Using Android emulators, the benchmark simulates real-world tasks in apps such as messaging and banking to assess agents' safety and helpfulness. The safety-focused tasks test for privacy risk management and robustness against adversarial prompt injections. Experiments show agents perform well in helpful tasks but struggle with safety-related challenges, underscoring the need for continued advancements in mobile safety mechanisms for autonomous agents.
  • SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation

    • Jingxuan Chen, Derek Yuen, Bin Xie, Yuhao Yang, Gongwei Chen, Zhihao Wu, Li Yixing, Xurui Zhou, Weiwen Liu, Shuai Wang, Rui Shao, Liqiang Nie, Yasheng Wang, Jianye Hao, Jun Wang, Kun Shao
    • 🏛️ Institutions: Huawei Noah’s Ark Lab, Harbin Institute of Technology, Shenzhen, UCL
    • 📅 Date: October 19, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [benchmark], [AI agent], [smartphone control], [framework]
    • 📖 TLDR: SPA-Bench is introduced as a benchmark designed to evaluate multimodal large language model (MLLM)-based smartphone agents, offering a task set that spans common smartphone functionalities across system and third-party applications. It includes a plug-and-play framework for real-time agent interactions on Android, integrating over ten agents with an adaptable evaluation pipeline measuring success across diverse metrics. Through this, the benchmark exposes challenges such as UI interpretation, action grounding, and memory retention in mobile environments, advancing research in smartphone-based agent applications.
  • ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents

    • Jakub Hoscilowicz, Bartosz Maj, Bartosz Kozakiewicz, Oleksii Tymoschuk, Artur Janicki
    • 🏛️ Institutions: Samsung R&D Poland, Warsaw University of Technology
    • 📅 Date: October 9, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [framework], [model], [SeeClick], [AITW benchmark]
    • 📖 TLDR: The paper introduces ClickAgent, a framework that enhances autonomous agents' interaction with mobile UIs by improving their ability to locate interface elements accurately. This is achieved through a dual-component system where an MLLM performs reasoning and action planning, while a dedicated UI location model (e.g., SeeClick) handles element identification. ClickAgent, evaluated on the AITW benchmark and tested on both emulators and real Android devices, surpasses other agents like CogAgent and AppAgent in task success rate, advancing automation reliability on mobile platforms.
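
A rough sketch of the dual-component loop described above; `plan_next_action`, `locate_element`, and the `device` object are hypothetical stand-ins for the paper's MLLM reasoner, SeeClick-style grounding model, and device driver.

```python
# Sketch of a ClickAgent-style loop: the MLLM decides *what* to do, a separate
# grounding model decides *where* on the screen. All helpers are placeholders.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click", "type", or "finish"
    target: str = ""   # natural-language description of the UI element
    text: str = ""     # text to type, if any

def plan_next_action(task: str, screenshot: bytes, history: list) -> Action:
    """Placeholder for the MLLM reasoning/planning call."""
    raise NotImplementedError

def locate_element(description: str, screenshot: bytes) -> tuple:
    """Placeholder for a UI location model returning (x, y) coordinates."""
    raise NotImplementedError

def run_episode(task: str, device, max_steps: int = 20) -> bool:
    history = []
    for _ in range(max_steps):
        shot = device.screenshot()
        action = plan_next_action(task, shot, history)
        if action.kind == "finish":
            return True
        if action.kind == "click":
            x, y = locate_element(action.target, shot)   # grounding is delegated
            device.tap(x, y)
        elif action.kind == "type":
            device.type_text(action.text)
        history.append(f"{action.kind}: {action.target or action.text}")
    return False
```
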
  • Dynamic Planning for LLM-based Graphical User Interface Automation

    • Shaoqing Zhang, Zhuosheng Zhang, Kehai Chen, Xinbei Ma, Muyun Yang, Tiejun Zhao, Min Zhang
    • 🏛️ Institutions: SJTU
    • 📅 Date: October 1, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [framework], [dynamic planning]
    • 📖 TLDR: This paper introduces a novel method called Dynamic Planning of Thoughts (D-PoT) aimed at enhancing LLM-based agents for GUI tasks. It addresses the challenges of task execution by dynamically adjusting planning based on environmental feedback and action history, outperforming existing methods such as ReAct by improving accuracy significantly in navigating GUI environments. The study emphasizes the importance of integrating execution history and contextual cues to optimize decision-making processes for autonomous agents.
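
A minimal sketch of one dynamic-planning step in this spirit: the plan is regenerated every step from the latest observation and the execution history. The `llm` call and the prompt/output format are invented for the example.

```python
# One step of a dynamic-planning loop: re-plan from feedback, then act.
def llm(prompt: str) -> str:
    """Placeholder for any LLM call."""
    raise NotImplementedError

def dynamic_plan_step(goal: str, observation: str, history: list) -> tuple:
    prompt = (
        f"Goal: {goal}\n"
        f"Current screen: {observation}\n"
        f"Executed so far: {history}\n"
        "Revise the remaining plan, then answer as:\nPLAN: ...\nACTION: ..."
    )
    response = llm(prompt)
    plan, _, action = response.partition("ACTION:")
    return plan.replace("PLAN:", "").strip(), action.strip()
```
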
  • MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding

    • Qinzhuo Wu, Weikai Xu, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, Shuo Shang
    • 🏛️ Institutions: XiaoMi AI Lab, University of Electronic Science and Technology of China, Renmin University of China
    • 📅 Date: September 23, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [model], [dataset], [MobileVLM], [Mobile3M], [UI understanding]
    • 📖 TLDR: This paper introduces MobileVLM, a vision-language model designed to enhance both intra- and inter-UI understanding for mobile applications. The authors propose two additional pre-training stages with four specific UI-based tasks to improve the model's perception of fine-grained elements and capture page transition actions. To support this, they constructed Mobile3M, a large-scale Chinese mobile dataset comprising 3 million UI pages and real-world transition actions, organized into directed graphs. Experimental results demonstrate that MobileVLM outperforms existing vision-language models on both in-house test sets and public mobile benchmarks.
  • AppAgent v2: Advanced Agent for Flexible Mobile Interactions

    • Yanda Li, Chi Zhang, Wanqi Yang, Bin Fu, Pei Cheng, Xin Chen, Ling Chen, Yunchao Wei
    • 🏛️ Institutions: University of Technology Sydney, Tencent, Beijing Jiaotong University, Westlake University
    • 📅 Date: August 5, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [framework], [AppAgent v2]
    • 📖 TLDR: This work presents AppAgent v2, a novel LLM-based multimodal agent framework for mobile devices capable of navigating applications by emulating human-like interactions such as tapping and swiping. The agent constructs a flexible action space that enhances adaptability across various applications, including parsing text and vision descriptions. It operates through two main phases: exploration and deployment, utilizing retrieval-augmented generation (RAG) technology to efficiently retrieve and update information from a knowledge base, thereby empowering the agent to perform tasks effectively and accurately.
  • CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation

    • Xinbei Ma, Zhuosheng Zhang, Hai Zhao
    • 🏛️ Institutions: SJTU
    • 📅 Date: August 2024
    • 📑 Publisher: ACL 2024
    • 💻 Env: [Mobile]
    • 🔑 Key: [model], [framework], [benchmark]
    • 📖 TLDR: This paper presents CoCo-Agent, a multimodal large language model (MLLM) designed for smartphone GUI automation. It introduces two novel approaches: Comprehensive Environment Perception (CEP) for enhanced GUI understanding, and Conditional Action Prediction (CAP) to improve action response accuracy. The proposed agent achieves state-of-the-art performance on GUI automation benchmarks such as AITW and META-GUI, showcasing its capabilities in realistic scenarios.
  • AUITestAgent: Automatic Requirements Oriented GUI Function Testing

    • Yongxiang Hu, Xuan Wang, Yingchuan Wang, Yu Zhang, Shiyu Guo, Chaoyi Chen, Xin Wang, Yangfan Zhou
    • 🏛️ Institutions: Fudan University, Meituan
    • 📅 Date: July 12, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [framework], [GUI testing], [AUITestAgent]
    • 📖 TLDR: This paper presents AUITestAgent, the first automatic, natural language-driven GUI testing tool for mobile apps. It automates the entire process of GUI interaction and function verification by extracting GUI interactions from test requirements via dynamically organized agents and employing a multi-dimensional data extraction strategy for verification.
  • MobileFlow: A Multimodal LLM For Mobile GUI Agent

    • Songqin Nong, Jiali Zhu, Rui Wu, Jiongchao Jin, Shuo Shan, Xiutian Huang, Wenhao Xu
    • 🏛️ Institutions: Ant Group
    • 📅 Date: July 5, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [model], [framework], [MobileFlow]
    • 📖 TLDR: This paper introduces MobileFlow, a multimodal large language model tailored for mobile GUI agents. With approximately 21 billion parameters and hybrid visual encoders, it supports variable image resolutions and multilingual GUIs, enhancing the model's ability to interpret image data and comprehend user instructions for GUI interaction tasks.
  • MobileExperts: A Dynamic Tool-Enabled Agent Team in Mobile Devices

    • Jiayi Zhang, Chuang Zhao, Yihan Zhao, Zhaoyang Yu, Ming He, Jianping Fan
    • 🏛️ Institutions: HKUST, Ant Group
    • 📅 Date: July 4, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [framework], [tool formulation], [multi-agent collaboration], [MobileExperts]
    • 📖 TLDR: This paper introduces MobileExperts, a framework that enhances autonomous operations on mobile devices by dynamically assembling agent teams based on user requirements. Each agent independently explores and formulates tools to evolve into an expert, improving efficiency and reducing reasoning costs.
  • AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents

    • Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Dingyu Zhang, Peng Gao, Shuai Ren, Hongsheng Li
    • 🏛️ Institutions: CUHK, SJTU, Shanghai AI Lab, vivo AI Lab
    • 📅 Date: July 3, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [dataset], [benchmark], [AMEX]
    • 📖 TLDR: This paper introduces the Android Multi-annotation EXpo (AMEX), a comprehensive dataset designed for training and evaluating mobile GUI-control agents. AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, annotated at multiple levels, including GUI interactive element grounding, functionality descriptions, and complex natural language instructions. The dataset aims to advance research on AI agents capable of completing complex tasks by interacting directly with mobile device GUIs.
  • Vision-driven Automated Mobile GUI Testing via Multimodal Large Language Model

    • Zhe Liu, Cheng Li, Chunyang Chen, Junjie Wang, Boyu Wu, Yawen Wang, Jun Hu, Qing Wang
    • 🏛️ Institutions: Institute of Software, Chinese Academy of Sciences, Monash University, Beijing Institute of Technology, University of Chinese Academy of Sciences
    • 📅 Date: July 3, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [framework], [VisionDroid]
    • 📖 TLDR: The paper presents VisionDroid, a vision-driven automated GUI testing approach utilizing Multimodal Large Language Models (MLLM) to detect non-crash functional bugs in mobile applications. By extracting GUI text information and aligning it with screenshots, VisionDroid enables MLLM to understand GUI context, facilitating deeper and function-oriented exploration. The approach segments exploration history into logically cohesive parts, prompting MLLM for bug detection, demonstrating superior performance over existing methods.
  • E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion

    • Ke Wang, Tianyu Xia, Zhangxuan Gu, Yi Zhao, Shuheng Shen, Changhua Meng, Weiqiang Wang, Ke Xu
    • 🏛️ Institutions: Ant Group, Tsinghua University
    • 📅 Date: June 20, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [dataset], [benchmark], [E-ANT]
    • 📖 TLDR: This paper introduces E-ANT, the first large-scale Chinese GUI navigation dataset comprising over 40,000 real human interaction traces across more than 5,000 tiny apps. The dataset includes high-quality screenshots with annotations, facilitating the evaluation and development of GUI navigation and decision-making capabilities in multimodal large language models (MLLMs). The authors also assess various MLLMs on E-ANT, providing insights into their performance and potential improvements.
  • DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning

    • Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, Aviral Kumar
    • 🏛️ Institutions: UC Berkeley, UIUC, Google DeepMind
    • 📅 Date: June 14, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [framework], [reinforcement learning], [DigiRL]
    • 📖 TLDR: The authors present DigiRL, an autonomous reinforcement learning approach for training device-control agents. By fine-tuning a pre-trained vision-language model in two stages—offline and offline-to-online RL—DigiRL achieves a significant improvement in success rates on the Android-in-the-Wild dataset, establishing a new state-of-the-art for digital agents in device control.
  • GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

    • Quanfeng Lu, Wenqi Shao, Zitao Liu, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, Yu Qiao, Ping Luo
    • 🏛️ Institutions: OpenGVLab, Shanghai AI Laboratory, HKU, Nanjing University, Harbin Institute of Technology, Shenzhen, SJTU
    • 📅 Date: June 13, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [dataset], [model], [OdysseyAgent], [cross-app navigation]
    • 📖 TLDR: This paper presents GUI Odyssey, a dataset comprising 7,735 episodes from six mobile devices, designed to train and evaluate cross-app navigation agents. It spans six types of cross-app tasks across 201 apps and 1,399 app combinations. Leveraging this dataset, the authors developed OdysseyAgent, a multimodal cross-app navigation agent fine-tuned from the Qwen-VL model, demonstrating superior accuracy over existing models in both in-domain and out-of-domain scenarios.
  • Practical, Automated Scenario-based Mobile App Testing

    • Shengcheng Yu, Chunrong Fang, Mingzhe Du, Zimin Ding, Zhenyu Chen, Zhendong Su
    • 🏛️ Institutions: Nanjing University, ETH Zurich
    • 📅 Date: June 12, 2024
    • 📑 Publisher: IEEE Transactions on Software Engineering
    • 💻 Env: [Mobile]
    • 🔑 Key: [framework], [ScenTest], [event knowledge graph], [GUI image understanding]
    • 📖 TLDR: This paper introduces ScenTest, a novel approach for scenario-based mobile app testing that integrates event knowledge graphs (EKGs) with GUI image understanding. By extracting entities and relationships from crowdsourced test reports, ScenTest constructs EKGs for specific scenarios, guiding automated testing processes. This method bridges the gap between testing execution and app business logic, achieving fully automated testing on target scenarios for the first time.
  • MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents

    • Luyuan Wang, Yongyu Deng, Yiwei Zha, Guodong Mao, Qinmin Wang, Tianchen Min, Wei Chen, Shoufa Chen
    • 🏛️ Institutions: CMU, University of Michigan, Northeastern University, HKU
    • 📅 Date: June 12, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [benchmark], [MobileAgentBench]
    • 📖 TLDR: This paper introduces MobileAgentBench, a benchmark designed to evaluate the performance of large language model-based mobile agents. It defines 100 tasks across 10 open-source apps, categorized by difficulty levels, and assesses existing agents like AppAgent and MobileAgent to facilitate systematic comparisons.
  • On the Effects of Data Scale on UI Control Agents

    • Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, Oriana Riva
    • 🏛️ Institutions: Google DeepMind, Google
    • 📅 Date: June 6, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [dataset], [AndroidControl], [fine-tuning], [scalability]
    • 📖 TLDR: This study investigates how the performance of computer control agents scales with the amount of fine-tuning data. The authors introduce AndroidControl, a dataset comprising 15,283 demonstrations across 833 Android applications. Findings indicate that while in-domain performance improves with more data, out-of-domain performance, especially on high-level tasks, scales more slowly, suggesting that fine-tuning alone may be insufficient for robust out-of-domain performance.
  • Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration

    • Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang
    • 🏛️ Institutions: Alibaba Group, Beijing University of Posts and Telecommunications
    • 📅 Date: June 3, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [framework], [multi-agent], [planning], [decision-making], [reflection]
    • 📖 TLDR: The paper presents Mobile-Agent-v2, a multi-agent architecture designed to assist with mobile device operations. It comprises three agents: a planning agent that generates task progress, a decision agent that navigates tasks using a memory unit, and a reflection agent that corrects erroneous operations. This collaborative approach addresses challenges in navigation and long-context input scenarios, achieving over a 30% improvement in task completion compared to single-agent architectures.
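
A compact sketch of the planning / decision / reflection split described above; the role functions, memory format, and `device` API are illustrative assumptions rather than the authors' implementation.

```python
# Three cooperating roles: summarize progress, pick the next operation,
# then check whether the operation actually worked. All roles are stubs.
def planning_agent(task: str, history: list) -> str:
    """Condense the action history into a short task-progress summary."""
    raise NotImplementedError

def decision_agent(task: str, progress: str, memory: dict, screenshot: bytes) -> dict:
    """Choose the next UI operation, using focus-content memory."""
    raise NotImplementedError

def reflection_agent(before: bytes, after: bytes, action: dict) -> str:
    """Compare screens and return 'ok', 'ineffective', or 'wrong'."""
    raise NotImplementedError

def run(task: str, device, max_steps: int = 30):
    history, memory = [], {}
    for _ in range(max_steps):
        progress = planning_agent(task, history)
        before = device.screenshot()
        action = decision_agent(task, progress, memory, before)
        if action.get("type") == "stop":
            break
        device.execute(action)
        verdict = reflection_agent(before, device.screenshot(), action)
        if verdict == "wrong":
            device.back()                      # undo an erroneous operation
        history.append(f"{action} -> {verdict}")
```
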
  • AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

    • Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, Oriana Riva
    • 🏛️ Institutions: Google DeepMind, Google
    • 📅 Date: May 23, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [benchmark], [Android-based agents], [task diversity], [reinforcement learning], [dynamic environment]
    • 📖 TLDR: AndroidWorld introduces a dynamic Android environment for benchmarking autonomous agents across 116 tasks spanning 20 Android apps. These tasks vary through parameterized and natural language prompts, fostering a realistic testing ground for agents designed to operate in complex mobile environments. The benchmark supports millions of task variations, allowing agents to respond to the Android system's changing states and improving real-world applicability.
  • Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent

    • Wei Chen, Zhiyuan Li
    • 🏛️ Institutions: Stanford University
    • 📅 Date: April 17, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [model], [functional token], [on-device AI], [Octopus v3]
    • 📖 TLDR: This paper introduces Octopus v3, a compact multimodal AI agent with less than 1 billion parameters, designed for efficient on-device operation. It processes both English and Chinese inputs, integrating visual and textual data to perform tasks such as sending emails, messaging, and online shopping. The model employs a functional token approach to translate image-based data into actionable outcomes, demonstrating high accuracy and efficiency on edge devices, including Raspberry Pi.
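
A toy illustration of the functional-token idea: special tokens in the model output map directly to callable device functions. The token names, registry, and output format below are invented for the example and differ from Octopus's actual vocabulary.

```python
# Map invented functional tokens to callables and execute a decoded output.
import re

REGISTRY = {
    "<FN_SEND_EMAIL>": lambda to, subject: print(f"email to {to}: {subject}"),
    "<FN_OPEN_APP>": lambda name: print(f"opening {name}"),
}

def execute(model_output: str) -> None:
    # Expected shape (for this toy example): "<FN_OPEN_APP>(name='Gmail')"
    match = re.match(r"(<FN_[A-Z_]+>)\((.*)\)", model_output.strip())
    if not match:
        raise ValueError(f"no functional token in {model_output!r}")
    token, arg_str = match.groups()
    kwargs = dict(re.findall(r"(\w+)='([^']*)'", arg_str))
    REGISTRY[token](**kwargs)

execute("<FN_OPEN_APP>(name='Gmail')")   # -> opening Gmail
```
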
  • LlamaTouch: A Faithful and Scalable Testbed for Mobile UI Automation Task Evaluation

    • Li Zhang, Shihe Wang, Xianqing Jia, Zhihan Zheng, Yunhe Yan, Longxi Gao, Yuanchun Li, Mengwei Xu
    • 🏛️ Institutions: BUPT, Tsinghua University
    • 📅 Date: April 12, 2024
    • 📑 Publisher: UIST 2024
    • 💻 Env: [Mobile]
    • 🔑 Key: [framework], [dataset], [benchmark], [UI automation], [mobile agent evaluation]
    • 📖 TLDR: LlamaTouch is an evaluation testbed designed for mobile UI automation, enabling reliable task assessment across 495 annotated tasks. It provides a scalable solution to evaluate agents in real-world mobile settings, comparing agent actions to essential UI states for accurate task completion. LlamaTouch supports dynamic environments, advancing mobile agent reliability and scalability in task automation.
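
A small sketch of essential-state matching: a task counts as complete if the annotated checkpoint states appear, in order, as subsets of the states the agent actually reached. The real testbed matches much richer UI representations; here a state is reduced to key/value assertions.

```python
# Judge task completion by checking annotated essential states against the
# agent's observed state trace (toy key/value representation).
def trace_satisfies(trace: list, essential_states: list) -> bool:
    i = 0
    for observed in trace:
        while i < len(essential_states) and all(
            observed.get(k) == v for k, v in essential_states[i].items()
        ):
            i += 1
    return i == len(essential_states)

trace = [
    {"app": "Settings", "screen": "home"},
    {"app": "Settings", "screen": "wifi", "wifi_enabled": True},
]
checkpoints = [{"screen": "wifi"}, {"wifi_enabled": True}]
print(trace_satisfies(trace, checkpoints))   # True
```
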
  • Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

    • Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan
    • 🏛️ Institutions: Apple
    • 📅 Date: April 8, 2024
    • 📑 Publisher: ECCV 2024
    • 💻 Env: [Mobile]
    • 🔑 Key: [model], [framework], [dataset], [benchmark], [mobile UI understanding]
    • 📖 TLDR: This paper presents Ferret-UI, a multimodal large language model (MLLM) designed to understand and interact with mobile user interfaces. The model incorporates advanced capabilities for referring, grounding, and reasoning about UI elements. By training on a variety of UI tasks, Ferret-UI achieves high performance in tasks such as icon recognition and text extraction. The authors introduce a unique architecture that allows for improved visual feature extraction from mobile screens, paving the way for applications in accessibility and user interaction.
  • Enhancing Mobile "How-to" Queries with Automated Search Results Verification and Reranking

    • Lei Ding, Jeshwanth Bheemanpally, Yi Zhang
    • 🏛️ Institutions: UCSC
    • 📅 Date: April 2024
    • 📑 Publisher: SIGIR 2024
    • 💻 Env: [Mobile]
    • 🔑 Key: [framework], [benchmark], [reranking], [verification], [mobile task automation]
    • 📖 TLDR: This paper presents a system that enhances mobile "how-to" queries by verifying and reranking search results through automated instruction extraction, on-device action execution, and reranking based on relevance. The method improves on traditional ranking by analyzing device-specific execution success. The approach comprises a three-stage pipeline: 1) extracting step-by-step instructions from top search results, 2) validating these instructions on mobile devices, and 3) reranking based on performance. The system leverages a pre-trained GPT model for initial processing, ensuring adaptability across diverse apps and systems.
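
A skeleton of the three-stage pipeline described above, with placeholder functions standing in for the instruction extractor and the on-device validator.

```python
# Stage 1: extract steps; Stage 2: validate on a device; Stage 3: rerank.
def extract_instructions(search_result: str) -> list:
    """Placeholder: turn a result page into step-by-step instructions."""
    raise NotImplementedError

def execute_on_device(steps: list, device) -> float:
    """Placeholder: try the steps on a device, return a success score in [0, 1]."""
    raise NotImplementedError

def rerank(results: list, device) -> list:
    scored = [(execute_on_device(extract_instructions(r), device), r) for r in results]
    return [r for _, r in sorted(scored, key=lambda pair: pair[0], reverse=True)]
```
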
  • Benchmarking Mobile Device Control Agents across Diverse Configurations

    • Juyong Lee, Taywon Min, Minyong An, Dongyoon Hahm, Haeone Lee, Changyeon Kim, Kimin Lee
    • 🏛️ Institutions: KAIST, Seoul National University, Yonsei University
    • 📅 Date: April 2024
    • 📑 Publisher: ICLR 2024
    • 💻 Env: [Mobile]
    • 🔑 Key: [benchmark], [dataset], [mobile device control], [agent performance]
    • 📖 TLDR: This paper presents B-MoCA, a comprehensive benchmark for evaluating mobile device control agents using an Android-based testbed with 131 tasks and various device configurations. The benchmark assesses agents' abilities across tasks that include device-specific variations, navigation, and human-like dual-gesture interactions. B-MoCA highlights that current agents perform well on basic tasks but struggle with complex configurations, pointing to opportunities for future improvements in mobile automation capabilities.
  • Android in the Zoo: Chain-of-Action-Thought for GUI Agents

    • Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, Duyu Tang
    • 🏛️ Institutions: Fudan University, Huawei
    • 📅 Date: March 5, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [framework], [dataset], [Android GUI], [Chain-of-Action-Thought], [autonomous GUI agents]
    • 📖 TLDR: This paper introduces Chain-of-Action-Thought (CoAT), a novel paradigm to improve GUI agent task completion by enabling agents to interpret previous actions, current screen content, and action rationale for next steps. The authors present the Android-In-The-Zoo (AitZ) dataset, which includes 18,643 screen-action pairs with detailed annotations, supporting CoAT's development and evaluation. The study demonstrates that fine-tuning with the AitZ dataset improves performance of a baseline large language model in predicting correct action sequences in Android tasks.
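
An illustrative Chain-of-Action-Thought prompt assembly: the action decision is conditioned on the previous action, its observed result, and a description of the current screen. The field names are guesses, not the exact AitZ annotation schema.

```python
# Assemble a CoAT-style prompt from screen context and action history.
def coat_prompt(goal: str, screen_desc: str, prev_action: str, prev_result: str) -> str:
    return "\n".join([
        f"Goal: {goal}",
        f"Previous action: {prev_action}",
        f"Result of previous action: {prev_result}",   # interpret what just happened
        f"Current screen: {screen_desc}",
        "First describe the screen, then explain why the next action moves",
        "toward the goal, then output the next action.",
    ])
```
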
  • Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

    • Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang
    • 🏛️ Institutions: Beijing Jiaotong University, Alibaba
    • 📅 Date: January 29, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [framework], [benchmark]
    • 📖 TLDR: This paper presents Mobile-Agent, an autonomous multi-modal agent designed for mobile device interaction. The system integrates visual perception, natural language processing, and action prediction to navigate and operate mobile applications. The authors introduce a new dataset and benchmark for evaluating mobile agents, demonstrating Mobile-Agent's superior performance in task completion and generalization across various apps compared to existing methods.
  • AppAgent: Multimodal Agents as Smartphone Users

    • Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, Gang Yu
    • 🏛️ Institutions: Tencent
    • 📅 Date: December 21, 2023
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [framework], [smartphone interaction], [autonomous exploration], [self-improve]
    • 📖 TLDR: This paper introduces AppAgent, a novel multimodal agent framework designed to operate smartphone applications. The agent uses a simplified action space to mimic human-like interactions such as tapping and swiping. AppAgent learns to navigate and use new apps through autonomous exploration or by observing human demonstrations, creating a knowledge base for executing complex tasks across different applications. The framework's effectiveness is demonstrated through extensive testing on 50 tasks across 10 diverse applications.
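
A sketch of the explore-then-deploy idea using an invented JSON knowledge-base format: exploration records a usage note per UI element, and deployment retrieves the notes for elements currently on screen to augment the acting prompt.

```python
# Toy knowledge base: exploration writes per-element notes, deployment reads
# back only the notes relevant to the current screen.
import json, pathlib

KB = pathlib.Path("appagent_kb.json")

def record_observation(app: str, element_id: str, note: str) -> None:
    kb = json.loads(KB.read_text()) if KB.exists() else {}
    kb.setdefault(app, {})[element_id] = note
    KB.write_text(json.dumps(kb, indent=2))

def retrieve_notes(app: str, visible_element_ids: list) -> str:
    kb = json.loads(KB.read_text()) if KB.exists() else {}
    notes = kb.get(app, {})
    return "\n".join(f"{e}: {notes[e]}" for e in visible_element_ids if e in notes)

record_observation("clock", "btn_add_alarm", "Opens the new-alarm dialog.")
print(retrieve_notes("clock", ["btn_add_alarm", "btn_settings"]))
```
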
  • GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation

    • An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, Zicheng Liu, Lijuan Wang
    • 🏛️ Institutions: UCSD, Microsoft, UCSB, UW–Madison
    • 📅 Date: November 13, 2023
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [framework], [benchmark], [zero-shot GUI navigation], [multimodal LLMs]
    • 📖 TLDR: This paper explores the capabilities of GPT-4V in navigating smartphone GUIs without prior training. The authors introduce a novel framework for GUI navigation and a new benchmark, MobileNav, featuring 1,000 navigation tasks across 100 mobile apps. The study demonstrates GPT-4V's impressive zero-shot performance in understanding and interacting with mobile interfaces, outperforming previous methods and even approaching human-level performance on some tasks.
  • UI Layout Generation with LLMs Guided by UI Grammar

    • Yuwen Lu, Ziang Tong, Qinyi Zhao, Chengzhi Zhang, Toby Jia-Jun Li
    • 🏛️ Institutions: University of Notre Dame
    • 📅 Date: October 24, 2023
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [UI grammar], [UI Layout Generation]
    • 📖 TLDR: This position paper explores the use of Large Language Models (LLMs) for generating mobile user interface (UI) layouts. It introduces UI grammar, a novel approach to represent the hierarchical structure of UI screens, aiming to guide LLMs' generative capabilities more effectively and enhance the explainability and controllability of the process. Initial experiments with GPT-4 demonstrate the potential of LLMs to produce high-quality UIs through in-context learning, with the grammar-based approach improving certain aspects of generation quality.
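
A toy grammar check showing the kind of structural constraint the paper proposes: a production rule lists which child components each container may contain, and a generated layout tree is validated against it. The rules here are invented for illustration.

```python
# Validate a generated layout tree against a tiny, invented UI grammar.
GRAMMAR = {
    "Screen":   ["Toolbar", "List", "BottomNav"],
    "Toolbar":  ["Icon", "Text"],
    "List":     ["ListItem"],
    "ListItem": ["Icon", "Text", "Button"],
}

def is_valid(tree: dict) -> bool:
    allowed = GRAMMAR.get(tree["type"], [])
    return all(
        child["type"] in allowed and is_valid(child)
        for child in tree.get("children", [])
    )

layout = {"type": "Screen", "children": [
    {"type": "Toolbar", "children": [{"type": "Text"}]},
    {"type": "List", "children": [{"type": "ListItem", "children": [{"type": "Text"}]}]},
]}
print(is_valid(layout))   # True
```
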
  • AutoDroid: LLM-powered Task Automation in Android

    • Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, Yunxin Liu
    • 🏛️ Institutions: Tsinghua University, Shanghai AI Lab, University of Notre Dame, MSR
    • 📅 Date: August 29, 2023
    • 📑 Publisher: MobiCom 2024
    • 💻 Env: [Mobile]
    • 🔑 Key: [framework], [dataset], [benchmark], [Android task automation], [LLM-powered agent]
    • 📖 TLDR: This paper introduces AutoDroid, a novel mobile task automation system capable of handling arbitrary tasks on any Android application without manual efforts. The framework combines the commonsense knowledge of LLMs with domain-specific knowledge of apps through automated dynamic analysis. AutoDroid features a functionality-aware UI representation method, exploration-based memory injection techniques, and a multi-granularity query optimization module. Evaluated on a new benchmark with 158 common tasks, AutoDroid achieves a 90.9% action generation accuracy and a 71.3% task completion rate, significantly outperforming GPT-4-powered baselines.
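
A sketch of a functionality-aware, HTML-like UI representation in the spirit of AutoDroid's prompting format: the view hierarchy is flattened into one compact line per interactive or text-bearing element. Tag names and attributes are simplified guesses, not the paper's exact scheme.

```python
# Flatten a view hierarchy into a compact pseudo-HTML string for an LLM prompt.
def simplify(node: dict, depth: int = 0) -> str:
    lines = []
    if node.get("clickable") or node.get("text"):
        attrs = f' id={node["id"]}' if "id" in node else ""
        text = node.get("text", "")
        lines.append("  " * depth + f'<{node["class"]}{attrs}>{text}</{node["class"]}>')
    for child in node.get("children", []):
        lines.append(simplify(child, depth + 1))
    return "\n".join(line for line in lines if line)

ui = {"class": "FrameLayout", "children": [
    {"class": "Button", "id": 3, "clickable": True, "text": "Compose"},
    {"class": "TextView", "text": "Inbox"},
]}
print(simplify(ui))   # prints the two children as indented pseudo-HTML tags
```
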
  • Android in the Wild: A Large-Scale Dataset for Android Device Control

    • Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, Timothy Lillicrap
    • 🏛️ Institutions: Google Research, Google DeepMind
    • 📅 Date: July 19, 2023
    • 📑 Publisher: NeurIPS 2023
    • 💻 Env: [Mobile]
    • 🔑 Key: [dataset], [benchmark], [device control], [natural language interaction], [gesture-based actions]
    • 📖 TLDR: The Android in the Wild (AitW) dataset introduces a significant benchmark for Android device control, encompassing over 715,000 human-labeled episodes with natural language commands and corresponding UI actions. Collected from Android devices across versions 10-13, it captures complex multi-step tasks requiring both visual and contextual understanding. The dataset is structured to test the robustness of device-control systems under varying conditions, such as new tasks or applications, and includes data to evaluate gesture-based interactions, providing a unique foundation for mobile interface automation and task execution research.
  • Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction

    • Danyang Zhang, Zhennan Shen, Rui Xie, Situo Zhang, Tianbao Xie, Zihan Zhao, Siyuan Chen, Lu Chen, Hongshen Xu, Ruisheng Cao, Kai Yu
    • 🏛️ Institutions: SJTU, HKU
    • 📅 Date: May 14, 2023
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [benchmark], [dataset], [interaction platform], [multistep interaction], [InfoUI]
    • 📖 TLDR: This paper introduces Mobile-Env, a novel interaction platform and benchmark aimed at assessing large language models' (LLMs) capabilities in interactive environments. It builds on the InfoUI task set, derived from WikiHow, to create structured text-based challenges that simulate real-world mobile interactions. The platform is designed to support task expansions from the community, aiming to drive advancements in LLM-based interactive agents.
  • Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus

    • Gang Li, Yang Li
    • 🏛️ Institutions: Google Research
    • 📅 Date: September 29, 2022
    • 📑 Publisher: ICLR 2023
    • 💻 Env: [Mobile]
    • 🔑 Key: [framework], [model], [dataset], [mobile UI tasks], [region-based focus]
    • 📖 TLDR: This paper introduces "Spotlight," a vision-language model for mobile UI understanding that operates solely on visual inputs (screenshots) and a specified focus region on the screen. By leveraging a large-scale dataset and training strategies tailored to mobile interfaces, Spotlight performs multiple UI-related tasks, including widget captioning, screen summarization, command grounding, and tappability prediction. It utilizes a vision-only approach, avoiding reliance on view hierarchies to achieve greater robustness and scalability across different mobile UI environments.
  • META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI

    • Liangtai Sun, Xingyu Chen, Lu Chen, Tianle Dai, Zichen Zhu, Kai Yu
    • 🏛️ Institutions: SJTU
    • 📅 Date: May 23, 2022
    • 📑 Publisher: EMNLP 2022
    • 💻 Env: [Mobile]
    • 🔑 Key: [benchmark], [dataset], [task-oriented dialogue], [GUI-based interaction], [multi-modal agent]
    • 📖 TLDR: This paper presents META-GUI, a dataset and framework for training multi-modal conversational agents capable of interacting directly with mobile app interfaces without the need for backend APIs. META-GUI includes over 1,100 dialogues with annotated action sequences on various tasks such as booking and scheduling. The authors propose a GUI-based task-oriented dialogue system that allows agents to navigate mobile interfaces via direct GUI actions, with performance shown to improve in multi-modal task-oriented dialogue contexts.
  • A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility

    • Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, Bryan A. Plummer
    • 🏛️ Institutions: Boston University, UIUC
    • 📅 Date: February 4, 2022
    • 📑 Publisher: ECCV 2022
    • 💻 Env: [Mobile]
    • 🔑 Key: [dataset], [feasibility prediction], [vision-language navigation], [mobile interaction]
    • 📖 TLDR: This paper introduces the Mobile App Tasks with Iterative Feedback (MoTIF) dataset, which addresses vision-language navigation (VLN) with a focus on task feasibility uncertainty in mobile applications. MoTIF provides commands paired with mobile actions and feasibility annotations, allowing researchers to examine the impact of command feasibility on task completion. The dataset includes 125 apps and emphasizes diverse app environments, action sequences, and follow-up questions to improve task ambiguity resolution, making it a valuable resource for feasibility prediction research.
  • Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning

    • Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, Yang Li
    • 🏛️ Institutions: University of Toronto, Google Research
    • 📅 Date: August 6, 2021
    • 📑 Publisher: UIST 2021
    • 💻 Env: [Mobile]
    • 🔑 Key: [framework], [dataset], [mobile UI summarization], [multimodal learning], [Transformer model]
    • 📖 TLDR: The paper introduces Screen2Words, an approach that utilizes multimodal learning to generate descriptive language summaries for mobile UI screens, combining textual, visual, and structural data from screens. The study created a large-scale dataset with 112,085 annotated screen summaries for 22,417 unique UIs, aiming to support model training for mobile UI understanding. The dataset facilitates a Transformer-based model trained to summarize screens by highlighting main functionalities, and the approach is validated with benchmarks in the mobile environment.
  • UIBert: Learning Generic Multimodal Representations for UI Understanding

    • Chongyang Bai, Xiaoxue Zang, Ying Xu, Srinivas Sunkara, Abhinav Rastogi, Jindong Chen, Blaise Agüera y Arcas
    • 🏛️ Institutions: Google Research
    • 📅 Date: July 29, 2021
    • 📑 Publisher: IJCAI 2021
    • 💻 Env: [Mobile]
    • 🔑 Key: [framework], [model], [dataset], [multimodal representation learning], [UI understanding]
    • 📖 TLDR: This paper presents UIBert, a multimodal model aimed at understanding user interfaces (UIs) by combining visual, textual, and structural metadata. UIBert is designed for tasks such as component retrieval and expression resolution, using a transformer-based joint image-text model. The authors introduce five novel pre-training tasks to leverage UI-specific features, enhancing accessibility and task completion in mobile applications. UIBert demonstrates superior performance on nine downstream UI tasks, highlighting the potential of multimodal pre-training in UI understanding.
  • AndroidEnv: A Reinforcement Learning Platform for Android

    • Daniel Toyama, Philippe Hamel, Anita Gergely, Gheorghe Comanici, Amelia Glaese, Zafarali Ahmed, Tyler Jackson, Shibl Mourad, Doina Precup
    • 🏛️ Institutions: DeepMind
    • 📅 Date: May 27, 2021
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [reinforcement learning], [Android interface], [RL environment], [task flexibility], [touchscreen action space]
    • 📖 TLDR: AndroidEnv provides a reinforcement learning (RL) platform for Android that lets RL agents interact with a realistic Android simulation via touchscreen events. The platform supports diverse applications, enabling agents to interact with over 100 predefined tasks across a variety of apps. With hybrid continuous and discrete action spaces, AndroidEnv is well-suited for training agents in complex, real-world Android scenarios where actions must be contextually sequenced, such as in UI navigation, gaming, and productivity apps. This environment encourages further RL research by offering task flexibility and realistic Android emulation.
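
A minimal random-agent loop against a dm_env-style environment such as AndroidEnv. The spec handling assumes the dict-shaped action spec the project exposes (e.g. a discrete action type plus a bounded touch position); environment construction is left to the project's own loader, so treat the details as assumptions to check against the repository.

```python
# Drive a dm_env-style environment with random actions sampled from its spec.
import numpy as np

def random_action(action_spec: dict) -> dict:
    action = {}
    for name, spec in action_spec.items():
        if hasattr(spec, "num_values"):      # discrete, e.g. the action type
            action[name] = np.asarray(np.random.randint(spec.num_values), dtype=spec.dtype)
        else:                                # bounded, e.g. touch position in [0, 1]^2
            action[name] = np.random.uniform(
                spec.minimum, spec.maximum, size=spec.shape
            ).astype(spec.dtype)
    return action

def run(env, episodes: int = 1) -> None:
    for _ in range(episodes):
        timestep = env.reset()
        while not timestep.last():
            timestep = env.step(random_action(env.action_spec()))
            print("reward:", timestep.reward)
```
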
  • Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements

    • Yang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, Zhiwei Guan
    • 🏛️ Institutions: Google Research
    • 📅 Date: November 2020
    • 📑 Publisher: EMNLP 2020
    • 💻 Env: [Mobile]
    • 🔑 Key: [dataset], [benchmark], [model], [accessibility], [natural language generation], [WidgetCaption]
    • 📖 TLDR: This paper introduces the task of widget captioning, which aims to automatically generate natural language descriptions for UI elements in mobile apps to enhance accessibility. Using both visual and structural data from UI components, the study presents a novel dataset of 162,859 captions across 61,285 UI elements. Multiple deep learning models were tested on this dataset, with findings suggesting the potential for improving screen reader usability for visually impaired users by generating descriptive captions of UI elements.
  • Interactive Task Learning from GUI-Grounded Natural Language Instructions and Demonstrations

    • Toby Jia-Jun Li, Tom Mitchell, Brad Myers
    • 🏛️ Institutions: CMU
    • 📅 Date: July 2020
    • 📑 Publisher: ACL 2020
    • 💻 Env: [Mobile]
    • 🔑 Key: [framework], [Sugilite], [programming-by-demonstration]
    • 📖 TLDR: This paper introduces Sugilite, an intelligent task automation agent that learns new tasks and associated concepts interactively from users' natural language instructions and demonstrations on third-party mobile app GUIs. The system allows users to teach procedures and concepts through verbal instructions combined with GUI demonstrations, supports intent clarification for demonstrated actions, infers task parameters using hierarchical app GUI structures, and generalizes taught concepts across different contexts and domains. A prototype is presented as a conversational assistant on Android.
  • Mapping Natural Language Instructions to Mobile UI Action Sequences

    • Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, Jason Baldridge
    • 🏛️ Institutions: Google Research
    • 📅 Date: July 2020
    • 📑 Publisher: ACL 2020
    • 💻 Env: [Mobile]
    • 🔑 Key: [framework], [dataset], [mobile UI automation], [natural language instructions], [action grounding], [RicoSCA]
    • 📖 TLDR: This paper introduces a method for grounding natural language instructions to mobile UI actions, aiming to automate mobile task execution through user interface manipulation. It introduces three key datasets: PixelHelp for task instruction-performance mappings on a Pixel emulator, AndroidHowTo for detailed phrase extraction, and RicoSCA for synthetic UI command training. The system utilizes a Transformer model to extract action phrase tuples, aligning them to UI elements with contextual screen positioning. Achieving over 70% accuracy in task completion, this approach is foundational for natural language-driven mobile UI automation.
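
A toy grounding step for the second stage described above: an extracted action-phrase tuple is matched to the on-screen element whose text overlaps it most. The actual model uses a Transformer with screen positional context rather than token overlap, so this is only a schematic stand-in.

```python
# Ground an extracted action phrase to a screen element by token overlap.
def ground(phrase_tuple: dict, screen_elements: list) -> dict:
    target_tokens = set(phrase_tuple["object"].lower().split())
    def overlap(element: dict) -> int:
        return len(target_tokens & set(element["text"].lower().split()))
    return max(screen_elements, key=overlap)

screen = [{"id": 1, "text": "Settings"}, {"id": 2, "text": "Search"}]
print(ground({"operation": "tap", "object": "the settings icon"}, screen))
# -> {'id': 1, 'text': 'Settings'}
```
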
  • PUMICE: A Multi-Modal Agent that Learns Concepts and Conditionals from Natural Language and Demonstrations

    • Toby Jia-Jun Li, Marissa Radensky, Justin Jia, Kirielle Singarajah, Tom M. Mitchell, Brad A. Myers
    • 🏛️ Institutions: CMU, Amherst College
    • 📅 Date: August 30, 2019
    • 📑 Publisher: UIST 2019
    • 💻 Env: [Mobile]
    • 🔑 Key: [programming-by-demonstration], [PUMICE]
    • 📖 TLDR: This paper introduces PUMICE, a multi-modal agent that combines natural language programming and programming-by-demonstration to enable end users to instruct intelligent agents in performing new tasks. By allowing users to describe tasks and conditions naturally and then collaboratively resolving ambiguities through conversation and demonstration, PUMICE facilitates the teaching of new concepts and procedures within existing mobile app GUIs. A lab study with 10 users demonstrated its usability and effectiveness.
  • Rico: A Mobile App Dataset for Building Data-Driven Design Applications

    • Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, Ranjitha Kumar
    • 🏛️ Institutions: UIUC, Northwestern University, Google
    • 📅 Date: October 20, 2017
    • 📑 Publisher: UIST 2017
    • 💻 Env: [Mobile]
    • 🔑 Key: [dataset], [mobile UI], [UI design analysis], [interaction mining], [RICO]
    • 📖 TLDR: This paper introduces Rico, a large-scale dataset comprising UI screens and view hierarchies from over 9,000 Android apps, designed to aid in understanding mobile app design. Rico supports a variety of tasks, including UI design analysis and interaction mining, by providing labeled UI components, screenshots, and interaction traces.
  • SUGILITE: Creating Multimodal Smartphone Automation by Demonstration

    • Toby Jia-Jun Li, Amos Azaria, Brad A. Myers
    • 🏛️ Institutions: CMU, Ariel University
    • 📅 Date: May 6, 2017
    • 📑 Publisher: CHI 2017
    • 💻 Env: [Mobile]
    • 🔑 Key: [framework], [PBD], [multimodal interaction], [SUGILITE], [programming-by-demonstration], [demonstration]
    • 📖 TLDR: This paper introduces SUGILITE, a programming-by-demonstration (PBD) system that enables users to automate tasks on smartphones through multimodal interactions. By leveraging Android's accessibility API, SUGILITE allows users to create generalized automation scripts for arbitrary third-party apps by demonstrating tasks using the regular app UI. The system combines verbal instructions, user demonstrations, and app UI hierarchies to generalize scripts from single demonstrations, facilitating task variations and parameterization. Extensive error handling and context checking enhance robustness against app UI changes. A lab study indicates that users with minimal programming knowledge can successfully automate smartphone tasks using SUGILITE.