- WebOlympus: An Open Platform for Web Agents on Live Websites
- Boyuan Zheng, Boyu Gou, Scott Salisbury, Zheng Du, Huan Sun, Yu Su
- 🏛️ Institutions: OSU
- 📅 Date: November 12, 2024
- 📑 Publisher: EMNLP 2024
- 💻 Env: [Web]
- 🔑 Key: [safety], [Chrome extension], [WebOlympus], [SeeAct], [Annotation Tool]
- 📖 TLDR: This paper introduces WebOlympus, an open platform for research on and deployment of web agents on live websites. A user-friendly Chrome extension interface lets users without programming expertise operate web agents with minimal effort. A safety monitor module guards against harmful actions through human supervision or model-based control (see the sketch below), and the platform supports applications such as annotation interfaces for web agent trajectories and data crawling.
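
A minimal sketch of the check-before-execute pattern behind such a safety monitor, in Python. The `Action` type, the risk rules, and `run_with_monitor` are illustrative assumptions, not WebOlympus's actual implementation:

```python
# Sketch of a human-in-the-loop safety monitor for web-agent actions.
# The Action type and the is-risky rules are illustrative assumptions,
# not the WebOlympus implementation.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str      # e.g. "CLICK", "TYPE", "SUBMIT"
    target: str    # description of the target element
    value: str = ""

# Action kinds that may have side effects beyond the current page.
RISKY_KINDS = {"SUBMIT", "TYPE"}

def needs_approval(action: Action) -> bool:
    """Flag actions the monitor considers potentially harmful."""
    return action.kind in RISKY_KINDS

def run_with_monitor(action: Action, execute) -> bool:
    """Execute the action only if it is safe or the human approves it."""
    if needs_approval(action):
        answer = input(f"Agent wants to {action.kind} on '{action.target}'. Allow? [y/N] ")
        if answer.strip().lower() != "y":
            print("Action blocked by safety monitor.")
            return False
    execute(action)
    return True

if __name__ == "__main__":
    run_with_monitor(Action("SUBMIT", "checkout form"),
                     lambda a: print(f"Executed {a.kind}"))
```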
- Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents
- Yu Gu, Boyuan Zheng, Boyu Gou, Kai Zhang, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, Huan Sun, Yu Su
- 🏛️ Institutions: OSU, Orby AI
- 📅 Date: November 10, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [framework], [WebDreamer], [model-based planning], [world model]
- 📖 TLDR: This paper investigates whether large language models (LLMs) can serve as world models of web environments, enabling model-based planning for web agents. It introduces WebDreamer, a framework that uses an LLM to simulate the outcomes of candidate action sequences and commits only to the most promising one (a loop sketched below), and demonstrates significant improvements over reactive baselines on benchmarks such as VisualWebArena and Mind2Web-live. The findings suggest that LLMs can model the dynamics of the internet, opening new avenues for optimizing LLMs for complex, evolving environments.
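
The planning loop described here (simulate each candidate action with the LLM, score the imagined outcome, act on the best candidate) can be sketched as follows. The `llm(prompt) -> str` callable and both prompts are assumptions for illustration, not WebDreamer's actual prompts:

```python
# Sketch of model-based planning with an LLM as a world model: for each
# candidate action, ask the LLM to imagine the resulting page state,
# score that imagined state against the goal, and pick the best action.
# `llm(prompt) -> str` is an assumed stand-in for any text-completion API.

def simulate(llm, page_state: str, action: str) -> str:
    """Ask the LLM to predict the page state after taking `action`."""
    return llm(
        f"Current webpage:\n{page_state}\n"
        f"Describe the webpage after the action: {action}"
    )

def score(llm, goal: str, imagined_state: str) -> float:
    """Ask the LLM how much the imagined state advances the goal (0-1)."""
    reply = llm(
        f"Goal: {goal}\nPredicted webpage:\n{imagined_state}\n"
        "On a scale from 0 to 1, how close is this to completing the goal? "
        "Answer with a single number."
    )
    try:
        return float(reply.strip())
    except ValueError:
        return 0.0

def plan_next_action(llm, goal: str, page_state: str, candidates: list[str]) -> str:
    """Pick the candidate action whose simulated outcome scores best."""
    return max(candidates, key=lambda a: score(llm, goal, simulate(llm, page_state, a)))
```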
- Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
- Boyu Gou, Ruochen Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, Yu Su
- 🏛️ Institutions: OSU, Orby AI
- 📅 Date: October 7, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [framework], [visual grounding], [GUI agents], [cross-platform generalization], [UGround], [SeeAct-V], [synthetic data]
- 📖 TLDR: This paper introduces UGround, a universal visual grounding model that lets GUI agents navigate digital interfaces the way humans do: perceiving the environment entirely from screenshots and acting at the pixel level (see the SeeAct-V sketch below). UGround is trained on a large-scale synthetic dataset of 10M GUI elements across 1.3M screenshots. Across six benchmarks spanning grounding, offline-agent, and online-agent tasks, UGround outperforms existing visual grounding models by up to 20% absolute, and agents built on it match or surpass state-of-the-art agents that rely on additional textual input, demonstrating the feasibility of vision-only GUI agents.
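
The vision-only SeeAct-V pipeline reduces each step to: describe the target element in text, ground the description to pixel coordinates, act at those coordinates. A minimal sketch, assuming a hypothetical `ground(screenshot, description)` wrapper around a grounding model such as UGround (the function name and signature are illustrative):

```python
# Sketch of one vision-only GUI action step in the SeeAct-V style:
# the planner names the target element in free text, a grounding model
# maps that description to pixel coordinates on the screenshot, and the
# agent clicks there. `ground` is an assumed wrapper around a visual
# grounding model such as UGround, not a released API.
import pyautogui  # pip install pyautogui

def ground(screenshot_path: str, description: str) -> tuple[int, int]:
    """Placeholder: call a visual grounding model and return (x, y)."""
    raise NotImplementedError("plug in a grounding model here")

def act(description: str) -> None:
    """One perceive-ground-act step using only pixels."""
    shot = "screen.png"
    pyautogui.screenshot(shot)          # perceive: capture the screen
    x, y = ground(shot, description)    # ground: text -> coordinates
    pyautogui.click(x, y)               # act: pixel-level click

# Example: act("the blue 'Sign in' button in the top-right corner")
```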
- GPT-4V(ision) is a Generalist Web Agent, if Grounded
- Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su
- 🏛️ Institutions: OSU
- 📅 Date: January 1, 2024
- 📑 Publisher: ICML 2024
- 💻 Env: [Web]
- 🔑 Key: [framework], [dataset], [benchmark], [grounding], [SeeAct], [Multimodal-Mind2Web]
- 📖 TLDR: This paper explores whether GPT-4V(ision), a large multimodal model, can act as a generalist web agent that completes tasks on diverse websites from natural language instructions. It introduces the SeeAct framework, which lets GPT-4V interpret webpages and decide on actions, and pairs it with grounding strategies, such as element attributes and image annotations, that map those decisions onto concrete HTML elements (one such strategy is sketched below). Evaluated on the Multimodal-Mind2Web benchmark and in an online test environment, SeeAct performs strongly on complex web tasks, but grounding remains the main bottleneck and an opportunity for further improvement.
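
One grounding strategy of this kind, presenting the model with a lettered list of candidate HTML elements and mapping its answer back to an element, can be sketched as follows. The `llm` callable and the prompt are illustrative assumptions, not the paper's exact setup:

```python
# Sketch of grounding via textual choices: show the model lettered
# candidate HTML elements and map its answer back to a concrete element.
# `llm(prompt) -> str` stands in for a multimodal model call that also
# sees the screenshot; the prompt wording is illustrative.
import string

def ground_by_choices(llm, action_description: str, candidates: list[str]) -> int:
    """Return the index of the element the model selects, or -1 if none."""
    letters = string.ascii_uppercase[:len(candidates)]
    listing = "\n".join(f"{l}. {html}" for l, html in zip(letters, candidates))
    reply = llm(
        f"Action to perform: {action_description}\n"
        f"Candidate elements:\n{listing}\n"
        "Answer with the letter of the matching element."
    )
    choice = reply.strip()[:1].upper()
    return letters.index(choice) if choice in letters else -1

# Example:
# idx = ground_by_choices(llm, "Click the search button",
#                         ['<button id="search">Go</button>',
#                          '<a href="/help">Help</a>'])
```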