-
Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining
- Zhiqi Ge, Juncheng Li, Xinglei Pang, Minghe Gao, Kaihang Pan, Wang Lin, Hao Fei, Wenqiao Zhang, Siliang Tang, Yueting Zhuang
- 🏛️ Institutions: Zhejiang University, NUS
- 📅 Date: December 13, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [framework], [Information-Sensitive Cropping], [Self-Refining Dual Learning], [visual grounding], [model]
- 📖 TLDR: This paper introduces Iris, a visual agent designed to enhance GUI automation by addressing challenges in high-resolution, complex digital environments. It employs two key innovations: Information-Sensitive Cropping (ISC), which dynamically identifies and prioritizes visually dense regions using an edge detection algorithm for efficient processing, and Self-Refining Dual Learning (SRDL), which enhances the agent's ability to handle complex tasks through a dual-learning loop that iteratively refines its performance without requiring additional annotated data. Empirical evaluations demonstrate that Iris achieves state-of-the-art performance across multiple benchmarks with only 850K GUI annotations, outperforming methods using ten times more training data.
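The core of Information-Sensitive Cropping can be sketched as grid-based edge-density scoring: divide the screenshot into cells, score each cell by its gradient magnitude, and keep the densest regions for detailed processing. The sketch below is illustrative only (the function names, grid scheme, and scoring are assumptions, not the paper's actual code).

```python
# Hypothetical sketch of Information-Sensitive Cropping (ISC):
# score each grid cell of a screenshot by edge density, then
# return the highest-scoring regions for detailed processing.

def edge_density(img, y0, y1, x0, x1):
    """Sum of absolute horizontal + vertical gradients in a region."""
    total = 0
    for y in range(y0, y1):
        for x in range(x0, x1):
            if x + 1 < x1:
                total += abs(img[y][x + 1] - img[y][x])
            if y + 1 < y1:
                total += abs(img[y + 1][x] - img[y][x])
    return total

def crop_dense_regions(img, grid=4, top_k=2):
    """Split img into a grid x grid layout; keep the top_k densest cells."""
    h, w = len(img), len(img[0])
    ch, cw = h // grid, w // grid
    cells = []
    for gy in range(grid):
        for gx in range(grid):
            box = (gy * ch, (gy + 1) * ch, gx * cw, (gx + 1) * cw)
            cells.append((edge_density(img, *box), box))
    cells.sort(reverse=True)
    return [box for _, box in cells[:top_k]]
```

On a mostly blank screenshot, only the visually busy quadrant (e.g. a toolbar or icon grid) would be selected for high-resolution processing.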
-
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
- Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong
- 🏛️ Institutions: HKU, NTU, Salesforce
- 📅 Date: December 5, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [model], [dataset], [planning], [reasoning], [Aguvis], [visual grounding]
- 📖 TLDR: This paper introduces Aguvis, a unified pure vision-based framework for autonomous GUI agents that operates across various platforms. It leverages image-based observations and grounds natural language instructions to visual elements, employing a consistent action space to ensure cross-platform generalization. The approach integrates explicit planning and reasoning within the model, enhancing its ability to autonomously navigate and interact with complex digital environments. A large-scale dataset of GUI agent trajectories is constructed, incorporating multimodal reasoning and grounding. Comprehensive experiments demonstrate that Aguvis surpasses previous state-of-the-art methods in both offline and real-world online scenarios, achieving the first fully autonomous pure vision GUI agent capable of performing tasks independently without collaboration with external closed-source models. All datasets, models, and training recipes are open-sourced to facilitate future research.
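The "consistent action space" idea can be sketched as a platform-agnostic action record with normalized coordinates, grounded to pixels only at execution time. This is a minimal illustration of the concept; the `Action` schema and field names are assumptions, not Aguvis's actual format.

```python
# Illustrative sketch (not Aguvis's actual schema) of a consistent,
# platform-agnostic action space: every action uses coordinates
# normalized to [0, 1] so the same plan transfers across resolutions.

from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click", "type", "scroll", ...
    x: float = 0.0     # normalized [0, 1] coordinates
    y: float = 0.0
    text: str = ""

def to_pixels(action, width, height):
    """Ground a normalized action onto a concrete screen size."""
    return (round(action.x * (width - 1)), round(action.y * (height - 1)))

click = Action("click", x=0.5, y=0.25)
```

The same `click` grounds correctly on a phone screen or a desktop monitor, which is what makes cross-platform generalization tractable.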
-
Improved GUI Grounding via Iterative Narrowing
- Anthony Nguyen
- 🏛️ Institutions: Algoma University
- 📅 Date: November 18, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [framework], [grounding], [visual grounding], [iterative narrowing]
- 📖 TLDR: This paper introduces a visual framework to enhance GUI grounding. By iteratively refining model predictions through progressively focused image crops, the proposed method improves the performance of both general and fine-tuned Vision-Language Models (VLMs) in GUI grounding tasks.
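The iterative narrowing loop can be sketched as re-querying the model on progressively smaller crops centred on its previous prediction. The sketch below is a simplified interpretation; the crop schedule and clamping logic are assumptions, and `predict` stands in for a VLM grounding call.

```python
# A minimal sketch of iterative narrowing: repeatedly crop around the
# model's current prediction so each pass sees the target at higher
# effective resolution. `predict` stands in for a VLM grounding call
# that returns (x, y) relative to the crop it was given.

def iterative_narrowing(predict, width, height, steps=3, shrink=0.5):
    """Refine an (x, y) prediction via progressively focused crops."""
    x0, y0, w, h = 0, 0, width, height
    gx, gy = 0, 0
    for _ in range(steps):
        px, py = predict(x0, y0, w, h)        # prediction inside crop
        gx, gy = x0 + px, y0 + py             # back to global coords
        w, h = max(1, int(w * shrink)), max(1, int(h * shrink))
        x0 = min(max(0, gx - w // 2), width - w)   # clamp crop to image
        y0 = min(max(0, gy - h // 2), height - h)
    return gx, gy
```

With a perfect predictor the loop is a fixed point; with a noisy one, each crop discards distractors outside the narrowed window.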
-
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
- Boyu Gou, Ruochen Wang, Boyuan Zheng, Yucheng Xie, Cheng Chang, Yiheng Shu, Haotian Sun, Yu Su
- 🏛️ Institutions: OSU, Orby AI
- 📅 Date: October 7, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [framework], [visual grounding], [GUI agents], [cross-platform generalization], [UGround], [SeeAct-V], [synthetic data]
- 📖 TLDR: This paper introduces UGround, a universal visual grounding model for GUI agents that enables human-like navigation of digital interfaces. The authors advocate for GUI agents with human-like embodiment that perceive the environment entirely visually and take pixel-level actions. UGround is trained on a large-scale synthetic dataset of 10M GUI elements across 1.3M screenshots. Evaluated on six benchmarks spanning grounding, offline, and online agent tasks, UGround significantly outperforms existing visual grounding models by up to 20% absolute. Agents using UGround achieve comparable or better performance than state-of-the-art agents that rely on additional textual input, demonstrating the feasibility of vision-only GUI agents.
-
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
- Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, Zirui Wang, Afshin Dehghan, Peter Grasch, Yinfei Yang
- 🏛️ Institutions: Apple
- 📅 Date: September 30, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Misc]
- 🔑 Key: [model], [MM1.5], [vision language model], [visual grounding], [reasoning], [data-centric], [analysis]
- 📖 TLDR: This paper introduces MM1.5, a family of multimodal large language models (MLLMs) ranging from 1B to 30B parameters, including dense and mixture-of-experts variants. MM1.5 enhances capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. The authors employ a data-centric training approach, utilizing high-quality OCR data and synthetic captions for continual pre-training, alongside an optimized visual instruction-tuning data mixture for supervised fine-tuning. Specialized variants, MM1.5-Video and MM1.5-UI, are designed for video understanding and mobile UI comprehension, respectively. Extensive empirical studies provide insights into the training processes, offering guidance for future MLLM development.
-
From Grounding to Planning: Benchmarking Bottlenecks in Web Agents
- Segev Shlomov, Ben Wiesel, Aviad Sela, Ido Levy, Liane Galanti, Roy Abitbol
- 🏛️ Institutions: IBM
- 📅 Date: September 3, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [planning], [grounding], [Mind2Web dataset], [web navigation]
- 📖 TLDR: This paper analyzes performance bottlenecks in web agents by evaluating grounding and planning separately, isolating each stage's impact on navigation success. Using an enhanced version of the Mind2Web dataset, the study identifies planning as the dominant bottleneck, while grounding sub-tasks such as UI component recognition are comparatively well handled by current models. Based on these experiments, the authors propose a refined evaluation framework aimed at improving web agents' contextual adaptability and accuracy in complex web environments.
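The decoupling idea amounts to scoring the two stages independently: grounding accuracy (did the agent select the right element?) and planning accuracy (did it choose the right operation, assuming grounding were solved?). The field names in this sketch are illustrative, not the paper's.

```python
# Hedged sketch of decoupled evaluation: score grounding separately
# from planning by providing the gold element when judging the plan.
# Step/field names are assumptions for illustration.

def decoupled_scores(steps):
    """Return (grounding accuracy, planning accuracy) over agent steps."""
    ground = sum(s["pred_elem"] == s["gold_elem"] for s in steps)
    plan = sum(s["pred_op_given_gold_elem"] == s["gold_op"] for s in steps)
    n = len(steps)
    return ground / n, plan / n
```

An agent can then score high on grounding but low on planning (or vice versa), which is exactly the diagnostic signal the benchmark is after.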
-
Visual Grounding for User Interfaces
- Yijun Qian, Yujie Lu, Alexander Hauptmann, Oriana Riva
- 🏛️ Institutions: CMU, UCSB
- 📅 Date: June 2024
- 📑 Publisher: NAACL 2024
- 💻 Env: [GUI]
- 🔑 Key: [framework], [visual grounding], [UI element localization], [LVG]
- 📖 TLDR: This work introduces the task of visual UI grounding, which unifies detection and grounding by enabling models to identify UI elements referenced by natural language commands solely from visual input. The authors propose LVG, a model that outperforms baselines pre-trained on larger datasets by over 4.9 points in top-1 accuracy, demonstrating its effectiveness in localizing referenced UI elements without relying on UI metadata.
-
VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?
- Junpeng Liu, Yifan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, Xiang Yue
- 🏛️ Institutions: CMU
- 📅 Date: April 9, 2024
- 📑 Publisher: COLM 2024
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [dataset], [web page understanding], [grounding]
- 📖 TLDR: VisualWebBench introduces a comprehensive benchmark for evaluating multimodal large language models (MLLMs) on web-based tasks. It includes 1.5K human-curated instances across 139 websites in 87 sub-domains. The benchmark spans seven tasks—such as OCR, grounding, and web-based QA—aiming to test MLLMs' capabilities in fine-grained web page understanding. Results reveal significant performance gaps, particularly in grounding tasks, highlighting the need for advancement in MLLM web understanding.
-
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
- Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, Zhiyong Wu
- 🏛️ Institutions: Nanjing University, Shanghai AI Lab
- 📅 Date: January 19, 2024
- 📑 Publisher: ACL 2024
- 💻 Env: [GUI]
- 🔑 Key: [model], [benchmark], [GUI grounding], [visual grounding]
- 📖 TLDR: This paper introduces SeeClick, a visual GUI agent that performs tasks relying solely on screenshots, without structured metadata such as HTML or accessibility trees. The authors identify GUI grounding as a key bottleneck for such agents and enhance SeeClick with dedicated GUI grounding pre-training. They also release ScreenSpot, a realistic GUI grounding benchmark spanning mobile, desktop, and web environments. Experiments indicate that improvements in grounding translate directly into better downstream agent performance.
-
GPT-4V(ision) is a Generalist Web Agent, if Grounded
- Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su
- 🏛️ Institutions: OSU
- 📅 Date: January 1, 2024
- 📑 Publisher: ICML 2024
- 💻 Env: [Web]
- 🔑 Key: [framework], [dataset], [benchmark], [grounding], [SeeAct], [Multimodal-Mind2web]
- 📖 TLDR: This paper explores the capability of GPT-4V(ision), a multimodal model, as a web agent that can perform tasks across various websites by following natural language instructions. It introduces the SeeAct framework, enabling GPT-4V to navigate, interpret, and interact with elements on websites. Evaluated using the Mind2Web benchmark and an online test environment, the framework demonstrates high performance on complex web tasks by integrating grounding strategies like element attributes and image annotations to improve HTML element targeting. However, grounding remains challenging, presenting opportunities for further improvement.
-
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
- Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, Jianfeng Gao
- 🏛️ Institutions: MSR
- 📅 Date: October 17, 2023
- 📑 Publisher: arXiv
- 💻 Env: [Misc]
- 🔑 Key: [visual prompting], [framework], [benchmark], [visual grounding], [zero-shot]
- 📖 TLDR: This paper introduces Set-of-Mark (SoM), a novel visual prompting approach designed to enhance the visual grounding capabilities of multimodal models like GPT-4V. By overlaying images with spatially and semantically distinct marks, SoM enables fine-grained object recognition and interaction within visual data, surpassing conventional zero-shot segmentation methods in accuracy. The framework is validated on tasks requiring detailed spatial reasoning, demonstrating a significant improvement over existing visual-language models without fine-tuning.
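The mechanism behind Set-of-Mark can be sketched as simple bookkeeping: number the candidate regions, overlay those numbers on the image, and let the model answer with a mark ID instead of coordinates. The sketch below shows only the ID-to-region mapping; the drawing step and the region proposals are elided, and all names are illustrative.

```python
# Illustrative Set-of-Mark (SoM) sketch: assign each candidate region
# a numeric mark, ask the model to answer with a mark ID, then map
# that ID back to the region. Drawing the numbers on the image is
# elided; only the bookkeeping that makes SoM work is shown.

def assign_marks(regions):
    """Number regions 1..N; these numbers get overlaid on the image so
    the model can ground by ID instead of by raw coordinates."""
    return {i + 1: box for i, box in enumerate(regions)}

def resolve_mark(marks, model_answer):
    """Map the model's chosen mark ID back to its bounding box."""
    return marks.get(int(model_answer.strip()))

marks = assign_marks([(10, 10, 50, 30), (200, 80, 60, 20)])
```

Answering "2" is far easier for a model than emitting pixel coordinates, which is why SoM lifts zero-shot grounding without any fine-tuning.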
-
Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API
- Zhizheng Zhang, Wenxuan Xie, Xiaoyi Zhang, Yan Lu
- 🏛️ Institutions: MSRA
- 📅 Date: October 7, 2023
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [model], [framework], [reinforcement learning], [UI task automation], [instruction grounding]
- 📖 TLDR: This paper introduces a multimodal model, termed RUIG (Reinforced UI Instruction Grounding), for automating UI tasks through natural language instructions. By leveraging a pixel-to-sequence approach, the model directly decodes UI element locations from screenshots based on user commands, removing the need for metadata like element coordinates. The framework uses a transformer-based encoder-decoder setup optimized through reinforcement learning to improve spatial accuracy. This novel approach outperforms prior methods, offering a generalized solution for UI task automation.
-
WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents
- Shunyu Yao, Howard Chen, John Yang, Karthik Narasimhan
- 🏛️ Institutions: Princeton University
- 📅 Date: July 2022
- 📑 Publisher: NeurIPS 2022
- 💻 Env: [Web]
- 🔑 Key: [framework], [dataset], [benchmark], [e-commerce web interaction], [language grounding]
- 📖 TLDR: This paper introduces WebShop, a simulated web-based shopping environment with over 1 million real-world products and 12,087 annotated instructions. It allows language agents to navigate, search, and make purchases based on natural language commands. The study explores how agents handle compositional instructions and noisy web data, providing a robust environment for reinforcement learning and imitation learning. The best models show effective sim-to-real transfer on websites like Amazon, illustrating WebShop’s potential for training grounded agents.
-
Grounding Open-Domain Instructions to Automate Web Support Tasks
- Nancy Xu, Sam Masling, Michael Du, Giovanni Campagna, Larry Heck, James Landay, Monica Lam
- 🏛️ Institutions: Stanford
- 📅 Date: March 30, 2021
- 📑 Publisher: NAACL 2021
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [framework], [grounding], [task automation], [open-domain instructions], [RUSS]
- 📖 TLDR: This paper introduces RUSS (Rapid Universal Support Service), a framework designed to interpret and execute open-domain, step-by-step web instructions automatically. RUSS uses a BERT-LSTM model for semantic parsing into a custom language, ThingTalk, which allows the system to map language to actions across various web elements. The framework, including a dataset of instructions, facilitates agent-based web support task automation by grounding natural language to interactive commands.
-
Mapping Natural Language Instructions to Mobile UI Action Sequences
- Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, Jason Baldridge
- 🏛️ Institutions: Google Research
- 📅 Date: July 2020
- 📑 Publisher: ACL 2020
- 💻 Env: [Mobile]
- 🔑 Key: [framework], [dataset], [mobile UI automation], [natural language instructions], [action grounding], [RicoSCA]
- 📖 TLDR: This paper introduces a method for grounding natural language instructions to mobile UI actions, aiming to automate mobile task execution through user interface manipulation. It introduces three key datasets: PixelHelp for task instruction-performance mappings on a Pixel emulator, AndroidHowTo for detailed phrase extraction, and RicoSCA for synthetic UI command training. The system utilizes a Transformer model to extract action phrase tuples, aligning them to UI elements with contextual screen positioning. Achieving over 70% accuracy in task completion, this approach is foundational for natural language-driven mobile UI automation.
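The grounding step, aligning an extracted action phrase to a UI element, can be sketched with simple token overlap between the phrase and each element's text. This scoring is a deliberately crude stand-in for the paper's learned Transformer alignment, with illustrative field names.

```python
# Illustrative sketch of phrase-to-element grounding: pick the UI
# element whose accessibility text best overlaps the action phrase.
# Token overlap is a stand-in for the paper's learned alignment.

def token_overlap(a, b):
    """Jaccard similarity over lowercase word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def ground_phrase(phrase, elements):
    """Pick the UI element whose text best matches the phrase."""
    return max(elements, key=lambda e: token_overlap(phrase, e["text"]))
```

In the real system the matcher also uses contextual screen positioning, which is what pushes task completion above the 70% reported in the paper's setting.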