Yu Su's Papers

  • WebOlympus: An Open Platform for Web Agents on Live Websites

    • Boyuan Zheng, Boyu Gou, Scott Salisbury, Zheng Du, Huan Sun, Yu Su
    • 🏛️ Institutions: OSU
    • 📅 Date: November 12, 2024
    • 📑 Publisher: EMNLP 2024
    • 💻 Env: [Web]
    • 🔑 Key: [safety], [Chrome extension], [WebOlympus], [SeeAct], [Annotation Tool]
    • 📖 TLDR: This paper introduces WebOlympus, an open platform designed to facilitate the research and deployment of web agents on live websites. It features a user-friendly Chrome extension interface, allowing users without programming expertise to operate web agents with minimal effort. The platform incorporates a safety monitor module to prevent harmful actions through human supervision or model-based control, supporting applications such as annotation interfaces for web agent trajectories and data crawling.
  • Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents

    • Yu Gu, Boyuan Zheng, Boyu Gou, Kai Zhang, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, Huan Sun, Yu Su
    • 🏛️ Institutions: OSU, Orby AI
    • 📅 Date: November 10, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Web]
    • 🔑 Key: [framework], [WebDreamer], [model-based planning], [world model]
    • 📖 TLDR: This paper investigates whether Large Language Models (LLMs) can serve as world models of web environments, enabling model-based planning for web agents. It introduces WebDreamer, a framework that uses an LLM to simulate the outcomes of candidate action sequences before committing to one, and demonstrates significant performance improvements over reactive baselines on benchmarks such as VisualWebArena and Mind2Web-live (a minimal sketch of the planning loop follows this entry). The findings suggest that LLMs can model the dynamic nature of the internet, paving the way for more capable automated web interaction and opening new research avenues in optimizing LLMs for complex, evolving environments.
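
    A minimal sketch of the model-based planning loop described in the TLDR above, assuming a generic `llm(prompt) -> str` chat-completion callable; the helper names and prompts are illustrative, not WebDreamer's actual interface:

    ```python
    # Hypothetical sketch of LLM-as-world-model planning for a web agent.
    # `llm(prompt) -> str` stands in for any chat-completion call; prompts
    # and helper names are illustrative, not WebDreamer's released API.

    def simulate(llm, page_state: str, action: str) -> str:
        """Ask the LLM to imagine the page that would result from an action."""
        return llm(
            f"Current page:\n{page_state}\n\n"
            f"Describe the page after executing: {action}"
        )

    def score(llm, goal: str, imagined_state: str) -> float:
        """Ask the LLM how much an imagined state advances the goal (0 to 1)."""
        reply = llm(
            f"Goal: {goal}\nPredicted page:\n{imagined_state}\n"
            "On a scale of 0 to 1, how close is this to completing the goal? "
            "Answer with a single number."
        )
        return float(reply.strip())

    def plan_next_action(llm, goal: str, page_state: str, candidates: list[str]) -> str:
        """Simulate each candidate action, then commit to the best-scoring one."""
        return max(
            candidates,
            key=lambda action: score(llm, goal, simulate(llm, page_state, action)),
        )
    ```
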
  • Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

    • Boyu Gou, Ruochen Wang, Boyuan Zheng, Yucheng Xie, Cheng Chang, Yiheng Shu, Haotian Sun, Yu Su
    • 🏛️ Institutions: OSU, Orby AI
    • 📅 Date: October 7, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [GUI]
    • 🔑 Key: [framework], [visual grounding], [GUI agents], [cross-platform generalization], [UGround], [SeeAct-V], [synthetic data]
    • 📖 TLDR: This paper introduces UGround, a universal visual grounding model for GUI agents that enables human-like navigation of digital interfaces. The authors advocate for GUI agents with human-like embodiment that perceive the environment entirely visually and take pixel-level actions. UGround is trained on a large-scale synthetic dataset of 10M GUI elements across 1.3M screenshots. Evaluated on six benchmarks spanning grounding, offline, and online agent tasks, UGround significantly outperforms existing visual grounding models by up to 20% absolute. Agents using UGround achieve comparable or better performance than state-of-the-art agents that rely on additional textual input, demonstrating the feasibility of vision-only GUI agents.
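
    A rough sketch of the vision-only agent step this paper argues for: a planner names the target element from the screenshot alone, a grounding model such as UGround maps that description to pixel coordinates, and the agent acts there. The `planner`, `grounder`, and `click` callables are placeholders, not the released interfaces:

    ```python
    # Illustrative pixel-level GUI-agent step under a vision-only setup.
    # `planner`, `grounder`, and `click` are placeholder callables, not
    # UGround's or SeeAct-V's released interfaces.

    from dataclasses import dataclass

    @dataclass
    class GroundedAction:
        description: str  # e.g. "the blue 'Submit' button below the form"
        x: int            # pixel coordinates on the screenshot
        y: int

    def agent_step(screenshot: bytes, goal: str, planner, grounder, click) -> GroundedAction:
        # 1. The planner inspects the screenshot and names the element to act on.
        element_description = planner(screenshot, goal)
        # 2. The grounding model maps that description to screen coordinates.
        x, y = grounder(screenshot, element_description)
        # 3. The agent takes a pixel-level action; no HTML or accessibility tree needed.
        click(x, y)
        return GroundedAction(element_description, x, y)
    ```
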
  • VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

    • Xiao Liu, Tianjie Zhang, Yu Gu, Iat Long Iong, Yifan Xu, Xixuan Song, Shudan Zhang, Hanyu Lai, Xinyi Liu, Hanlin Zhao, Jiadai Sun, Xinyue Yang, Yu Yang, Zehan Qi, Shuntian Yao, Xueqiao Sun, Siyi Cheng, Qinkai Zheng, Hao Yu, Hanchen Zhang, Wenyi Hong, Ming Ding, Lihang Pan, Xiaotao Gu, Aohan Zeng, Zhengxiao Du, Chan Hee Song, Yu Su, Yuxiao Dong, Jie Tang
    • 🏛️ Institutions: THU, MSRA, OSU
    • 📅 Date: August 12, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [GUI]
    • 🔑 Key: [benchmark], [dataset], [VisualAgentBench], [VAB]
    • 📖 TLDR: The authors introduce VisualAgentBench (VAB), a comprehensive benchmark for training and evaluating large multimodal models (LMMs) as visual foundation agents across diverse scenarios, including embodied tasks, graphical user interfaces, and visual design. VAB comprises five distinct environments that systematically challenge LMMs' understanding and interaction capabilities. The benchmark also provides supervised fine-tuning trajectory data for behavior cloning, demonstrating its potential to improve open LMMs as visual foundation agents.
  • A Trembling House of Cards? Mapping Adversarial Attacks against Language Agents

    • Lingbo Mo, Zeyi Liao, Boyuan Zheng, Yu Su, Chaowei Xiao, Huan Sun
    • 🏛️ Institutions: OSU, UWM
    • 📅 Date: February 15, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Misc]
    • 🔑 Key: [safety], [adversarial attacks], [security risks], [language agents], [Perception-Brain-Action]
    • 📖 TLDR: This paper introduces a conceptual framework to assess and understand adversarial vulnerabilities in language agents, dividing the agent structure into three components—Perception, Brain, and Action. It discusses 12 specific adversarial attack types that exploit these components, ranging from input manipulation to complex backdoor and jailbreak attacks. The framework provides a basis for identifying and mitigating risks before the widespread deployment of these agents in real-world applications.
  • Dual-View Visual Contextualization for Web Navigation

    • Jihyung Kil, Chan Hee Song, Boyuan Zheng, Xiang Deng, Yu Su, Wei-Lun Chao
    • 🏛️ Institutions: OSU
    • 📅 Date: February 6, 2024
    • 📑 Publisher: CVPR 2024
    • 💻 Env: [Web]
    • 🔑 Key: [framework], [visual contextualization]
    • 📖 TLDR: This paper proposes a novel approach to web navigation by contextualizing HTML elements through their "dual views" in webpage screenshots. The method leverages both the textual content of HTML elements and their visual representation in the screenshot to create more informative representations for web agents. Evaluated on the Mind2Web dataset, the approach demonstrates consistent improvements over baseline methods across various scenarios, including cross-task, cross-website, and cross-domain navigation tasks.
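
    A schematic of the dual-view idea under stated assumptions: off-the-shelf text and image encoders, a PIL-style screenshot with per-element bounding boxes, and simple concatenation as the fusion step, which only approximates the paper's actual contextualization method:

    ```python
    # Schematic of a dual-view element representation: pair each HTML
    # element's markup with its rendered crop from the page screenshot.
    # The encoders and concatenation fusion are stand-ins, not the
    # paper's exact architecture.

    import torch

    def dual_view_embedding(element_html: str, screenshot, bbox,
                            text_encoder, image_encoder) -> torch.Tensor:
        # Textual view: the element's HTML content.
        text_vec = text_encoder(element_html)  # shape: (d_text,)
        # Visual view: the element's rendered appearance, cropped via its bbox.
        left, top, right, bottom = bbox
        image_vec = image_encoder(screenshot.crop((left, top, right, bottom)))  # (d_img,)
        # Fuse both views into a single representation for the web agent.
        return torch.cat([text_vec, image_vec], dim=-1)
    ```
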
  • AgentBench: Evaluating LLMs as Agents

    • Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, Jie Tang
    • 🏛️ Institutions: THU, OSU, ByteDance
    • 📅 Date: January 1, 2024
    • 📑 Publisher: ICLR 2024
    • 💻 Env: [GUI], [General]
    • 🔑 Key: [benchmark], [evaluation]
    • 📖 TLDR: AgentBench provides a comprehensive benchmark for evaluating LLMs as autonomous agents across various environments. It includes eight distinct scenarios that test reasoning and decision-making in tasks such as OS interaction, database querying, and knowledge graph traversal. The benchmark compares multiple commercial and open-source LLMs, revealing room for improvement in instruction following and long-term reasoning, both essential for practical agent development.
  • GPT-4V(ision) is a Generalist Web Agent, if Grounded

    • Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su
    • 🏛️ Institutions: OSU
    • 📅 Date: January 1, 2024
    • 📑 Publisher: ICML 2024
    • 💻 Env: [Web]
    • 🔑 Key: [framework], [dataset], [benchmark], [grounding], [SeeAct], [Multimodal-Mind2Web]
    • 📖 TLDR: This paper explores the capability of GPT-4V(ision), a multimodal model, as a web agent that can perform tasks across various websites by following natural language instructions. It introduces the SeeAct framework, enabling GPT-4V to navigate, interpret, and interact with elements on websites (a toy grounding example follows this entry). Evaluated on the Mind2Web benchmark and an online test environment, the framework demonstrates strong performance on complex web tasks by integrating grounding strategies such as element attributes and image annotations to improve HTML element targeting. However, grounding remains challenging, presenting opportunities for further improvement.
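
    A toy illustration of the image-annotation grounding strategy mentioned above: candidate elements are marked with numbered boxes on the screenshot, the multimodal model picks a number, and the answer is mapped back to an element. The drawing code and the `mllm` callable are placeholders, not SeeAct's actual implementation:

    ```python
    # Toy "set-of-marks"-style grounding: overlay numeric labels on candidate
    # elements, ask the multimodal model which label matches the intended
    # action, and map the answer back to an HTML element. Placeholder code,
    # not SeeAct's implementation.

    from PIL import Image, ImageDraw

    def ground_by_annotation(screenshot: Image.Image, candidates, action_description, mllm):
        """candidates: list of (element, bbox) pairs; mllm(image, prompt) -> str."""
        annotated = screenshot.copy()
        draw = ImageDraw.Draw(annotated)
        for idx, (_, (left, top, right, bottom)) in enumerate(candidates):
            draw.rectangle((left, top, right, bottom), outline="red")
            draw.text((left, top), str(idx), fill="red")
        reply = mllm(
            annotated,
            f"Action to perform: {action_description}\n"
            "Which numbered element should be used? Answer with the number only.",
        )
        element, _ = candidates[int(reply.strip())]
        return element
    ```
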
  • Mind2Web: Towards a Generalist Agent for the Web

    • Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, Yu Su
    • 🏛️ Institutions: OSU
    • 📅 Date: June 9, 2023
    • 📑 Publisher: NeurIPS 2023
    • 💻 Env: [Web]
    • 🔑 Key: [dataset], [benchmark], [model], [Mind2Web], [MindAct]
    • 📖 TLDR: Mind2Web presents a dataset and benchmark specifically crafted for generalist web agents capable of performing language-guided tasks across varied websites. Featuring over 2,000 tasks from 137 sites, it spans 31 domains and emphasizes open-ended, realistic tasks in authentic, unsimplified web settings. The study proposes the MindAct framework, which optimizes LLMs for handling complex HTML elements by using small LMs to rank elements before full processing, thereby enhancing the efficiency and versatility of web agents in diverse contexts.
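
    A minimal sketch of the two-stage element selection described in the TLDR above: a small ranker scores every element cheaply, then the full LLM reasons over only the top-k survivors. The `ranker` and `llm` callables are placeholders, not MindAct's released code:

    ```python
    # Sketch of two-stage element selection in the MindAct style: a small LM
    # scores every candidate element, then a large model chooses among the
    # top-k, keeping the expensive context window small. `ranker` and `llm`
    # are placeholder callables.

    def select_element(task: str, elements: list[str], ranker, llm, k: int = 5) -> str:
        # Stage 1: cheap relevance score for every HTML element on the page.
        top_k = sorted(elements, key=lambda el: ranker(task, el), reverse=True)[:k]
        # Stage 2: the full LLM picks among the few remaining candidates.
        choices = "\n".join(f"{i}. {el}" for i, el in enumerate(top_k))
        reply = llm(
            f"Task: {task}\nCandidate elements:\n{choices}\n"
            "Reply with the number of the element to interact with."
        )
        return top_k[int(reply.strip())]
    ```
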