- ShowUI: One Vision-Language-Action Model for GUI Visual Agent
  - Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou
  - 🏛️ Institutions: NUS, Microsoft
  - 📅 Date: November 26, 2024
  - 📑 Publisher: arXiv
  - 💻 Env: [GUI]
  - 🔑 Key: [model], [framework], [dataset], [UI-Guided Visual Token Selection], [Interleaved Vision-Language-Action Streaming], [ShowUI]
  - 📖 TLDR: This paper introduces ShowUI, a vision-language-action model designed to enhance GUI automation by addressing challenges in UI visual perception and action modeling. It features innovations like UI-Guided Visual Token Selection to reduce computational costs and Interleaved Vision-Language-Action Streaming for effective management of visual-action history. Trained on a curated dataset, ShowUI achieves 75.1% accuracy in zero-shot screenshot grounding and demonstrates competitive performance across web, mobile, and online environments. A toy sketch of the token-selection idea follows this entry.
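UI-Guided Visual Token Selection, as described above, treats visually redundant screenshot patches as one region and keeps only a small number of tokens per region. The snippet below is a minimal sketch of that idea under assumptions of our own (mean-RGB similarity, a union-find grouping, one kept token per group); the function name `select_ui_tokens`, the patch size, and the threshold `tau` are illustrative choices, not the authors' implementation.

```python
import numpy as np

def select_ui_tokens(image: np.ndarray, patch: int = 28, tau: float = 8.0):
    """Group visually redundant patches and keep one token per group.

    image: HxWx3 uint8 screenshot. Returns the indices of kept patches.
    """
    h, w, _ = image.shape
    gh, gw = h // patch, w // patch
    # Mean RGB value of every patch on the grid.
    means = image[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, 3).mean(axis=(1, 3))

    # Union-find over the patch grid: adjacent patches with similar color are
    # treated as one redundant region (a connected component of the UI).
    parent = list(range(gh * gw))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for i in range(gh):
        for j in range(gw):
            idx = i * gw + j
            if j + 1 < gw and np.abs(means[i, j] - means[i, j + 1]).max() < tau:
                union(idx, idx + 1)
            if i + 1 < gh and np.abs(means[i, j] - means[i + 1, j]).max() < tau:
                union(idx, idx + gw)

    # Keep a single representative patch per component.
    kept, seen = [], set()
    for idx in range(gh * gw):
        root = find(idx)
        if root not in seen:
            seen.add(root)
            kept.append(idx)
    return kept

# Toy usage: a mostly blank 448x448 "screenshot" collapses to very few tokens.
screenshot = np.full((448, 448, 3), 255, dtype=np.uint8)
screenshot[200:230, 100:300] = 30  # a dark, button-like region
print(f"{len(select_ui_tokens(screenshot))} / {(448 // 28) ** 2} patches kept")
```

In the paper the selection also interacts with training and decoding; the point here is only how redundant screen regions can be collapsed before they reach the vision encoder.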
- The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use
  - Siyuan Hu, Mingyu Ouyang, Difei Gao, Mike Zheng Shou
  - 🏛️ Institutions: NUS
  - 📅 Date: November 15, 2024
  - 📑 Publisher: arXiv
  - 💻 Env: [Desktop]
  - 🔑 Key: [framework], [Claude 3.5 Computer Use], [GUI automation], [planning], [action], [critic]
  - 📖 TLDR: This study evaluates Claude 3.5 Computer Use, an AI model enabling end-to-end language-to-desktop actions, through curated tasks across various domains. It introduces an out-of-the-box framework for deploying API-based GUI automation models, analyzing the model's planning, action execution, and adaptability to dynamic environments. A rough sketch of such an observe-plan-act loop follows this entry.
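API-based GUI automation of the kind studied above reduces to a loop that screenshots the desktop, asks the model for the next action, and executes it locally. The following is a minimal sketch of that loop, not the paper's released framework: `query_model` and its return schema are placeholders for whatever model API is used, while the screenshot and input calls use the real `pyautogui` library.

```python
import base64
import io

import pyautogui  # desktop screenshots and mouse/keyboard control


def query_model(instruction: str, screenshot_b64: str, history: list) -> dict:
    """Placeholder for the call to a computer-use model API.

    The return schema assumed here ({"action": "click"/"type"/"done", ...})
    is an illustration, not any real API. The stub just ends the episode.
    """
    return {"action": "done"}


def run_episode(instruction: str, max_steps: int = 20) -> None:
    history: list[dict] = []
    for _ in range(max_steps):
        # Observe: capture the current desktop state as a PNG.
        image = pyautogui.screenshot()
        buf = io.BytesIO()
        image.save(buf, format="PNG")
        screenshot_b64 = base64.b64encode(buf.getvalue()).decode()

        # Plan: ask the model for the next GUI action.
        action = query_model(instruction, screenshot_b64, history)
        history.append(action)

        # Act: translate the model's plan into a desktop event.
        if action["action"] == "click":
            pyautogui.click(action["x"], action["y"])
        elif action["action"] == "type":
            pyautogui.write(action["text"], interval=0.05)
        elif action["action"] == "done":
            break


if __name__ == "__main__":
    run_episode("Open the browser and search for arXiv")
```

Planning quality, action grounding, and a critic-style assessment of each step, the dimensions the study analyzes, would sit on top of a loop like this.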
- GUI Action Narrator: Where and When Did That Action Take Place?
  - Qinchen Wu, Difei Gao, Kevin Qinghong Lin, Zhuoyu Wu, Xiangwu Guo, Peiran Li, Weichen Zhang, Hengxu Wang, Mike Zheng Shou
  - 🏛️ Institutions: NUS, Chinese Academy of Sciences
  - 📅 Date: June 19, 2024
  - 📑 Publisher: arXiv
  - 💻 Env: [Desktop]
  - 🔑 Key: [dataset], [framework], [Act2Cap], [GUI Narrator]
  - 📖 TLDR: The authors present Act2Cap, a GUI action dataset containing 4,189 video-caption pairs depicting various GUI actions such as clicks, drags, and typing across multiple software environments. They also propose GUI Narrator, a framework that leverages cursor detection as a visual prompt to enhance the interpretation of high-resolution screenshots for GUI video captioning. Evaluations reveal that even advanced multimodal models face challenges in this domain, highlighting the need for specialized approaches to improve performance. A small illustration of the cursor-as-visual-prompt idea follows this entry.
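Using the cursor as a visual prompt, as in the entry above, amounts to pairing the full (downscaled) frame with a high-resolution crop centred on the detected cursor. Below is a minimal sketch, assuming the cursor location comes from some detector and with crop/context sizes chosen arbitrarily; `cursor_visual_prompts` is a hypothetical helper, not the paper's code.

```python
from PIL import Image


def cursor_visual_prompts(frame: Image.Image, cursor_xy: tuple[int, int],
                          crop_size: int = 224, full_size: int = 448):
    """Build two views for a captioning model: a downscaled full frame for
    context and a high-resolution crop centred on the detected cursor."""
    w, h = frame.size
    cx, cy = cursor_xy

    # Clamp the crop window so it stays inside the frame.
    half = crop_size // 2
    left = min(max(cx - half, 0), w - crop_size)
    top = min(max(cy - half, 0), h - crop_size)
    crop = frame.crop((left, top, left + crop_size, top + crop_size))

    context = frame.resize((full_size, full_size))
    return context, crop


# Toy usage with a blank 1920x1080 "frame" and a fake cursor location.
frame = Image.new("RGB", (1920, 1080), "white")
context_view, cursor_view = cursor_visual_prompts(frame, (1500, 900))
print(context_view.size, cursor_view.size)
```

Both views would then be passed to the captioning model, so it sees fine cursor-level detail without paying for the full high-resolution frame.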
- VideoGUI: A Benchmark for GUI Automation from Instructional Videos
  - Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen Wu, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou
  - 🏛️ Institutions: NUS, Microsoft Gen AI
  - 📅 Date: June 2024
  - 📑 Publisher: NeurIPS 2024
  - 💻 Env: [Desktop, Web]
  - 🔑 Key: [benchmark], [instructional videos], [visual planning], [hierarchical task decomposition], [complex software interaction]
  - 📖 TLDR: VideoGUI presents a benchmark for evaluating GUI automation on tasks derived from instructional videos, focusing on visually intensive applications like Adobe Photoshop and video editing software. The benchmark includes 178 tasks, with a hierarchical evaluation method distinguishing high-level planning, mid-level procedural steps, and precise action execution. VideoGUI reveals current model limitations in complex visual tasks, marking a significant step toward improved visual planning in GUI automation. A toy version of such hierarchical scoring follows this entry.
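A hierarchical evaluation of this sort reports planning, procedure, and execution separately instead of a single pass/fail. The snippet below is a toy version under our own assumptions (a binary plan judgement, step and action match counts, a plain average as the aggregate); it is not VideoGUI's actual metric.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Judgements at each level of the hierarchy, e.g. from matching a run
    # against reference annotations.
    plan_correct: bool    # high-level: the right sub-goals, in order
    steps_matched: int    # mid-level: procedural steps reproduced
    steps_total: int
    actions_correct: int  # low-level: precise clicks/drags/typing
    actions_total: int


def hierarchical_score(r: TaskResult) -> dict:
    """Score planning, procedure, and execution separately; the equal-weight
    average used as the aggregate is an arbitrary choice for the sketch."""
    planning = 1.0 if r.plan_correct else 0.0
    procedure = r.steps_matched / r.steps_total if r.steps_total else 0.0
    execution = r.actions_correct / r.actions_total if r.actions_total else 0.0
    return {
        "planning": planning,
        "procedure": procedure,
        "execution": execution,
        "overall": (planning + procedure + execution) / 3,
    }


print(hierarchical_score(TaskResult(True, 3, 5, 7, 10)))
```

Keeping the three numbers separate is the point: a model that plans well but clicks imprecisely looks very different from one that cannot decompose the task at all.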
- AssistGUI: Task-Oriented Desktop Graphical User Interface Automation
  - Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, Hengxu Wang, Luowei Zhou, Mike Zheng Shou
  - 🏛️ Institutions: NUS
  - 📅 Date: December 20, 2023
  - 📑 Publisher: CVPR 2024
  - 💻 Env: [Desktop]
  - 🔑 Key: [framework], [dataset], [benchmark], [desktop productivity tasks]
  - 📖 TLDR: This study presents AssistGUI, a benchmark and framework for desktop GUI automation, featuring an LLM-based agent capable of completing complex user requests by analyzing instructional videos and performing actions on the desktop. The agent, built on a novel Actor-Critic framework with a GUI parser, was evaluated on 100 tasks across nine applications such as MS Word and After Effects. Even the top-performing model achieved only a 46% success rate, illustrating the difficulty of comprehensive desktop automation and underscoring areas for future research in agent-driven GUI tasks. A schematic sketch of an actor-critic GUI loop follows this entry.
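The actor-critic pattern in the entry above can be pictured as three roles around a screenshot loop: a parser that turns the screen into a structured description, an actor that proposes the next action, and a critic that checks whether the action actually advanced the task. The sketch below is schematic and assumes all three roles are supplied as callables (in AssistGUI they are backed by a GUI parser and LLM prompting); none of the names come from the paper's code.

```python
from typing import Callable

# Placeholder signatures for the three roles of the loop.
ParseFn = Callable[[bytes], str]              # screenshot -> textual GUI state
ActFn = Callable[[str, str, list], str]       # (task, state, history) -> action
CritiqueFn = Callable[[str, str, str], bool]  # (task, new state, action) -> ok?


def actor_critic_loop(task: str, screenshot_fn: Callable[[], bytes],
                      execute_fn: Callable[[str], None],
                      parse: ParseFn, actor: ActFn, critic: CritiqueFn,
                      max_steps: int = 30) -> None:
    history: list[str] = []
    for _ in range(max_steps):
        state = parse(screenshot_fn())
        action = actor(task, state, history)
        if action == "DONE":
            return
        execute_fn(action)

        # The critic inspects the *new* state; on failure, the actor retries
        # with the failed attempt recorded in the history.
        new_state = parse(screenshot_fn())
        if critic(task, new_state, action):
            history.append(f"ok: {action}")
        else:
            history.append(f"failed: {action}")


# Toy usage with trivial stand-ins: the actor immediately reports completion.
actor_critic_loop(
    task="rename the open document",
    screenshot_fn=lambda: b"",
    execute_fn=lambda a: None,
    parse=lambda img: "empty screen",
    actor=lambda task, state, hist: "DONE",
    critic=lambda task, state, action: True,
)
```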