diff --git a/add_paper_here.md b/add_paper_here.md
index b5ad356..b39c64c 100644
--- a/add_paper_here.md
+++ b/add_paper_here.md
@@ -304,13 +304,13 @@
     - 🔑 Key: [framework], [multimodal agent], [smartphone interaction], [autonomous exploration]
     - 📖 TLDR: This paper introduces AppAgent, a novel multimodal agent framework designed to operate smartphone applications. The agent uses a simplified action space to mimic human-like interactions such as tapping and swiping. AppAgent learns to navigate and use new apps through autonomous exploration or by observing human demonstrations, creating a knowledge base for executing complex tasks across different applications. The framework's effectiveness is demonstrated through extensive testing on 50 tasks across 10 diverse applications.
 
-- [GPT-4V(ision) is a Generalist Web Agent, if Grounded](https://arxiv.org/abs/2401.01614)
+- [GPT-4V(ision) is a Generalist Web Agent, if Grounded](https://osu-nlp-group.github.io/SeeAct/)
     - Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su
     - 🏛️ Institutions: OSU
     - 📅 Date: January 1, 2024
     - 📑 Publisher: ICML 2024
     - 💻 Env: [Web]
-    - 🔑 Key: [framework], [dataset], [benchmark], [generalist web agent], [grounding]
+    - 🔑 Key: [framework], [dataset], [benchmark], [generalist web agent], [grounding],[seeact]
     - 📖 TLDR: This paper explores the capability of GPT-4V(ision), a multimodal model, as a web agent that can perform tasks across various websites by following natural language instructions. It introduces the **SEEACT** framework, enabling GPT-4V to navigate, interpret, and interact with elements on websites. Evaluated using the **Mind2Web** benchmark and an online test environment, the framework demonstrates high performance on complex web tasks by integrating grounding strategies like element attributes and image annotations to improve HTML element targeting. However, grounding remains challenging, presenting opportunities for further improvement.
 
 - [SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents](https://arxiv.org/abs/2401.10935)
@@ -540,11 +540,11 @@
 
 - [VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?](https://arxiv.org/abs/2404.05955)
     - Junpeng Liu, Yifan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, Xiang Yue
-    - 🏛️ Institutions: Unknown
+    - 🏛️ Institutions: CMU
     - 📅 Date: April 9, 2024
     - 📑 Publisher: COLM 2024
     - 💻 Env: [Web]
-    - 🔑 Key: [benchmark], [dataset], [web page understanding], [grounding], [multimodal LLMs]
+    - 🔑 Key: [benchmark], [dataset], [web page understanding], [grounding]
     - 📖 TLDR: VisualWebBench introduces a comprehensive benchmark for evaluating multimodal large language models (MLLMs) on web-based tasks. It includes 1.5K human-curated instances across 139 websites in 87 sub-domains. The benchmark spans seven tasks—such as OCR, grounding, and web-based QA—aiming to test MLLMs' capabilities in fine-grained web page understanding. Results reveal significant performance gaps, particularly in grounding tasks, highlighting the need for advancement in MLLM web understanding.
 
 - [OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments](https://arxiv.org/abs/2404.07972)
@@ -961,13 +961,13 @@
     - 🔑 Key: [framework], [dynamic planning]
     - 📖 TLDR: This paper introduces a novel method called Dynamic Planning of Thoughts (D-PoT) aimed at enhancing LLM-based agents for GUI tasks. It addresses the challenges of task execution by dynamically adjusting planning based on environmental feedback and action history, outperforming existing methods such as ReAct by improving accuracy significantly in navigating GUI environments. The study emphasizes the importance of integrating execution history and contextual cues to optimize decision-making processes for autonomous agents.
 
-- [Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents](https://arxiv.org/abs/2410.05243)
+- [Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents](https://osu-nlp-group.github.io/UGround/)
     - Boyu Gou, Ruochen Wang, Boyuan Zheng, Yucheng Xie, Cheng Chang, Yiheng Shu, Haotian Sun, Yu Su
     - 🏛️ Institutions: OSU, Orby AI
     - 📅 Date: October 7, 2024
     - 📑 Publisher: arXiv
     - 💻 Env: [GUI]
-    - 🔑 Key: [framework], [visual grounding], [GUI agents], [cross-platform generalization], [UGround], [SeeAct-V]
+    - 🔑 Key: [framework], [visual grounding], [GUI agents], [cross-platform generalization], [UGround], [SeeAct-V], [synthetic data]
     - 📖 TLDR: This paper introduces UGround, a universal visual grounding model for GUI agents that enables human-like navigation of digital interfaces. The authors advocate for GUI agents with human-like embodiment that perceive the environment entirely visually and take pixel-level actions. UGround is trained on a large-scale synthetic dataset of 10M GUI elements across 1.3M screenshots. Evaluated on six benchmarks spanning grounding, offline, and online agent tasks, UGround significantly outperforms existing visual grounding models by up to 20% absolute. Agents using UGround achieve comparable or better performance than state-of-the-art agents that rely on additional textual input, demonstrating the feasibility of vision-only GUI agents.
 
 - [Agent S: An Open Agentic Framework that Uses Computers Like a Human](https://arxiv.org/abs/2410.08164)
@@ -1005,3 +1005,23 @@
     - 💻 Env: [GUI]
     - 🔑 Key: [framework], [multi-agent systems], [specialized generalist agent], [OSWorld benchmark]
     - 📖 TLDR: AgentStore introduces a scalable platform to integrate and manage heterogeneous agents, designed to enhance generalist assistant capabilities for diverse computer tasks. Using a MetaAgent and AgentToken strategy, AgentStore shows improved generalization on the OSWorld benchmark.
+
+
+- [OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization](https://doi.org/10.48550/arXiv.2410.19609)
+    - Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Hongming Zhang, Tianqing Fang, Zhenzhong Lan, Dong Yu
+    - 🏛️ Institutions: Zhejiang University, Tencent AI Lab, Westlake University
+    - 📅 Date: October 25, 2024
+    - 📑 Publisher: arXiv
+    - 💻 Env: [Web]
+    - 🔑 Key: [framework], [learning], [imitation learning], [exploration], [AI feedback]
+    - 📖 TLDR: The paper presents **OpenWebVoyager**, an open-source framework for training web agents that explore real-world online environments autonomously. The framework employs a cycle of exploration, feedback, and optimization, enhancing agent capabilities through multimodal perception and iterative learning. Initial skills are acquired through imitation learning, followed by real-world exploration, where the agent’s performance is evaluated and refined through feedback loops.
+
+- [EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data](https://doi.org/10.48550/arXiv.2410.19461)
+    - Xuetian Chen, Hangcheng Li, Jiaqing Liang, Sihang Jiang, Deqing Yang
+    - 🏛️ Institutions: Fudan University
+    - 📅 Date: October 25, 2024
+    - 📑 Publisher: arXiv
+    - 💻 Env: [GUI]
+    - 🔑 Key: [dataset], [framework], [synthetic data]
+    - 📖 TLDR: The *EDGE* framework proposes an innovative approach to improve GUI understanding and interaction capabilities in vision-language models through large-scale, multi-granularity synthetic data generation. By leveraging webpage data, EDGE minimizes the need for manual annotations and enhances the adaptability of models across desktop and mobile GUI environments. Evaluations show its effectiveness in diverse GUI-related tasks, contributing significantly to autonomous agent development in GUI navigation and interaction.
+