  • The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use

    • Siyuan Hu, Mingyu Ouyang, Difei Gao, Mike Zheng Shou
    • 🏛️ Institutions: NUS
    • 📅 Date: November 15, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Desktop]
    • 🔑 Key: [framework], [Claude 3.5 Computer Use], [GUI automation], [planning], [action], [critic]
    • 📖 TLDR: This study evaluates Claude 3.5 Computer Use, an AI model enabling end-to-end language-to-desktop actions, through curated tasks across various domains. It introduces an out-of-the-box framework for deploying API-based GUI automation models, analyzing the model's planning, action execution, and adaptability to dynamic environments.
  • Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

    • Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, Zack Hui
    • 🏛️ Institutions: Microsoft
    • 📅 Date: September 13, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Desktop]
    • 🔑 Key: [framework], [benchmark], [Navi]
    • 📖 TLDR: This paper introduces the Windows Agent Arena (WAA), a scalable platform for testing and benchmarking multi-modal AI agents within a realistic Windows OS environment. WAA enables researchers to evaluate agentic workflows across diverse tasks and supports large-scale deployment using Azure ML. The study also presents Navi, a multi-modal agent achieving a 19.5% success rate on Windows tasks, highlighting the platform's potential for advancing AI agent development.
  • TinyAgent: Function Calling at the Edge

    • Lutfi Eren Erdogan, Nicholas Lee, Siddharth Jha, Sehoon Kim, Ryan Tabrizi, Suhong Moon, Coleman Hooper, Gopala Anumanchipalli, Kurt Keutzer, Amir Gholami
    • 🏛️ Institutions: UC Berkeley, ICSI
    • 📅 Date: September 1, 2024
    • 📑 Publisher: EMNLP 2024
    • 💻 Env: [Desktop]
    • 🔑 Key: [framework], [dataset], [quantization], [LLMCompiler], [TinyAgent-1.1B], [TinyAgent-7B]
    • 📖 TLDR: This paper introduces TinyAgent, an end-to-end framework for training and deploying task-specific small language model agents capable of function calling at the edge. By fine-tuning small models with curated datasets and employing techniques like quantization and a novel tool retrieval method, TinyAgent enables efficient, real-time execution of user commands on local devices without relying on cloud infrastructure. The framework demonstrates that these small models can match or even surpass the function-calling capabilities of larger models like GPT-4-Turbo while operating entirely on edge devices.
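
To make the entry above concrete, here is a minimal sketch of on-device function calling with a small local model, in the spirit of TinyAgent. Both `retrieve_tools` and `local_llm` are hypothetical placeholders: the paper trains a dedicated tool-retrieval model and serves quantized TinyAgent models, neither of which is reproduced here.

```python
# Minimal sketch of edge function calling with a small local model (assumed
# interfaces, not TinyAgent's actual implementation).
import json

TOOLS = {
    "create_event": lambda title, time: f"Created event '{title}' at {time}",
    "send_email": lambda to, body: f"Sent email to {to}",
}


def retrieve_tools(query: str, k: int = 2) -> dict:
    """Placeholder: pick the k tools most relevant to the query."""
    return dict(list(TOOLS.items())[:k])


def local_llm(prompt: str) -> str:
    """Placeholder for a quantized small model running on-device. Expected to
    return a JSON function call, e.g.
    {"name": "create_event", "arguments": {"title": "standup", "time": "9am"}}."""
    raise NotImplementedError


def run(query: str) -> str:
    tools = retrieve_tools(query)
    prompt = (f"Available tools: {list(tools)}\n"
              f"User: {query}\n"
              "Respond with a single JSON function call.")
    call = json.loads(local_llm(prompt))
    return tools[call["name"]](**call["arguments"])
```
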
  • OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation

    • Zilong Wang, Yuedong Cui, Li Zhong, Zimin Zhang, Da Yin, Bill Yuchen Lin, Jingbo Shang
    • 🏛️ Institutions: UCSD, UCLA, AI2
    • 📅 Date: July 26, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Desktop]
    • 🔑 Key: [benchmark], [multi-application], [office automation]
    • 📖 TLDR: OfficeBench introduces a benchmark that evaluates language models' ability to automate office tasks across a range of applications like Word, Excel, and email. The benchmark tests agents’ skills in task-switching, planning, and decision-making by simulating realistic office workflows. Current models, including GPT-4, demonstrate significant gaps in task accuracy and efficiency, revealing areas for improvement in managing complex, multi-application tasks in office environments.
  • Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?

    • Ruisheng Cao, Fangyu Lei, Haoyuan Wu, Jixuan Chen, Yeqiao Fu, Hongcheng Gao, Xinzhuang Xiong, Hanchong Zhang, Yuchen Mao, Wenjing Hu, Tianbao Xie, Hongsheng Xu, Danyang Zhang, Sida Wang, Ruoxi Sun, Pengcheng Yin, Caiming Xiong, Ansong Ni, Qian Liu, Victor Zhong, Lu Chen, Kai Yu, Tao Yu
    • 🏛️ Institutions: HKU, SJTU, Google Cloud AI Research, Google DeepMind, Salesforce Research, Yale University, Sea AI Lab, University of Waterloo
    • 📅 Date: July 15, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Desktop]
    • 🔑 Key: [benchmark], [dataset], [data science], [engineering workflows], [Spider2-V]
    • 📖 TLDR: This paper introduces Spider2-V, a multimodal agent benchmark designed to evaluate the capability of agents in automating professional data science and engineering workflows. It comprises 494 real-world tasks across 20 enterprise-level applications, assessing agents' proficiency in code generation and GUI operations within authentic computer environments.
  • GUI Action Narrator: Where and When Did That Action Take Place?

    • Qinchen Wu, Difei Gao, Kevin Qinghong Lin, Zhuoyu Wu, Xiangwu Guo, Peiran Li, Weichen Zhang, Hengxu Wang, Mike Zheng Shou
    • 🏛️ Institutions: NUS, Chinese Academy of Sciences
    • 📅 Date: June 19, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Desktop]
    • 🔑 Key: [dataset], [framework], [Act2Cap], [GUI Narrator]
    • 📖 TLDR: The authors present Act2Cap, a GUI action dataset containing 4,189 video-caption pairs depicting various GUI actions such as clicks, drags, and typing across multiple software environments. They also propose GUI Narrator, a framework that leverages cursor detection as a visual prompt to enhance the interpretation of high-resolution screenshots for GUI video captioning. Evaluations reveal that even advanced multimodal models face challenges in this domain, highlighting the need for specialized approaches to improve performance.
  • VideoGUI: A Benchmark for GUI Automation from Instructional Videos

    • Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen Wu, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou
    • 🏛️ Institutions: NUS, Microsoft Gen AI
    • 📅 Date: June 2024
    • 📑 Publisher: NeurIPS 2024
    • 💻 Env: [Desktop, Web]
    • 🔑 Key: [benchmark], [instructional videos], [visual planning], [hierarchical task decomposition], [complex software interaction]
    • 📖 TLDR: VideoGUI presents a benchmark for evaluating GUI automation on tasks derived from instructional videos, focusing on visually intensive applications like Adobe Photoshop and video editing software. The benchmark includes 178 tasks, with a hierarchical evaluation method distinguishing high-level planning, mid-level procedural steps, and precise action execution. VideoGUI reveals current model limitations in complex visual tasks, marking a significant step toward improved visual planning in GUI automation.
  • AgentStudio: A Toolkit for Building General Virtual Agents

    • Longtao Zheng, Zhiyuan Huang, Zhenghai Xue, Xinrun Wang, Bo An, Shuicheng Yan
    • 🏛️ Institutions: NTU, Skywork AI, ETH Zurich
    • 📅 Date: March 26, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Desktop]
    • 🔑 Key: [framework], [dataset], [general virtual agents], [open-ended learning], [tool creation], [GroundUI], [benchmark]
    • 📖 TLDR: AgentStudio is a robust toolkit for developing virtual agents with versatile actions, such as GUI automation and code execution. It unifies real-world human-computer interactions across OS platforms and includes diverse observation and action spaces, facilitating comprehensive training and benchmarking in complex settings. The toolkit's flexibility promotes agent generalization across varied tasks, supporting tool creation and a multimodal interaction interface to advance agent adaptability and learning.
  • Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study

    • Weihao Tan, Ziluo Ding, Wentao Zhang, Boyu Li, Bohan Zhou, Junpeng Yue, Haochong Xia, Jiechuan Jiang, Longtao Zheng, Xinrun Xu, Yifei Bi, Pengjie Gu, Xinrun Wang, Börje F. Karlsson, Bo An, Zongqing Lu
    • 🏛️ Institutions: NTU, BAAI, PKU
    • 📅 Date: March 5, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Desktop]
    • 🔑 Key: [framework], [Cradle], [General Computer Control], [multimodal], [keyboard and mouse control], [long-term memory], [reasoning], [self-improvement]
    • 📖 TLDR: This paper introduces Cradle, a framework designed to achieve General Computer Control (GCC) by enabling agents to perform any computer task using only screen images (and possibly audio) as input and producing keyboard and mouse operations as output. The authors deploy Cradle in the complex AAA game Red Dead Redemption II, demonstrating its capability to follow the main storyline and complete real missions with minimal reliance on prior knowledge or resources.
  • Cradle: Empowering Foundation Agents Towards General Computer Control

    • Weihao Tan, Wentao Zhang, Xinrun Xu, Haochong Xia, Ziluo Ding, Boyu Li, Bohan Zhou, Junpeng Yue, Jiechuan Jiang, Yewen Li, Ruyi An, Molei Qin, Chuqiao Zong, Longtao Zheng, Yujie Wu, Xiaoqiang Chai, Yifei Bi, Tianbao Xie, Pengjie Gu, Xiyun Li, Ceyao Zhang, Long Tian, Chaojie Wang, Xinrun Wang, Börje F. Karlsson, Bo An, Shuicheng Yan, Zongqing Lu
    • 🏛️ Institutions: Skywork AI, BAAI, NTU, PKU, Institute of Software - Chinese Academy of Sciences, HKU, CUHK
    • 📅 Date: March 5, 2024
    • 📑 Publisher: TBD
    • 💻 Env: [Desktop]
    • 🔑 Key: [framework], [model], [general computer control], [skill curation], [self-improvement]
    • 📖 TLDR: This paper introduces the Cradle framework, designed to enable general computer control (GCC) through multimodal input (e.g., screen images and optional audio) and outputs (keyboard and mouse). Cradle’s six core modules, including self-reflection, skill curation, and memory, allow for generalized task handling in complex environments like AAA games. Demonstrated in Red Dead Redemption II, the framework exhibits adaptability by performing real missions and following the storyline with minimal prior knowledge, showcasing its potential as a generalist agent for diverse computer tasks.
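
As an illustration of the general-computer-control loop described in the two Cradle entries above (screenshots in, keyboard and mouse operations out), here is a minimal sketch using the pyautogui library. The planner call is a hypothetical placeholder, and Cradle's memory, reflection, and skill-curation modules are omitted entirely.

```python
# Minimal GCC-style loop: observe a screenshot, emit one keyboard/mouse action.
# `plan_next_action` is an assumed placeholder, not Cradle's planner.
import time

import pyautogui


def plan_next_action(task: str, screenshot) -> dict:
    """Placeholder for a multimodal model that maps the task and a screenshot to
    one low-level action, e.g. {"type": "click", "x": 100, "y": 200},
    {"type": "type", "text": "hello"}, or {"type": "done"}."""
    raise NotImplementedError


def run(task: str, max_steps: int = 50) -> None:
    for _ in range(max_steps):
        shot = pyautogui.screenshot()          # observe the current screen
        action = plan_next_action(task, shot)  # decide one keyboard/mouse op
        if action["type"] == "done":
            break
        if action["type"] == "click":
            pyautogui.click(action["x"], action["y"])
        elif action["type"] == "type":
            pyautogui.typewrite(action["text"])
        time.sleep(0.5)                        # let the UI settle before the next step
```
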
  • UFO: A UI-Focused Agent for Windows OS Interaction

    • Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
    • 🏛️ Institutions: Microsoft
    • 📅 Date: February 14, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Desktop]
    • 🔑 Key: [framework], [UI automation], [Windows], [UFO]
    • 📖 TLDR: This paper presents UFO, a pioneering multimodal LLM-based agent designed to fulfill user requests on Windows OS. UFO employs a dual-agent architecture—comprising AppAgent and ActAgent—that can interpret and execute complex tasks across multiple Windows applications by observing UI elements and utilizing control interactions. The framework allows UFO to handle intricate, cross-application workflows and execute commands seamlessly based on natural language prompts. It integrates GPT-Vision to recognize and interact with graphical elements, enabling flexible, autonomous task completion within and across diverse Windows applications.
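
The dual-agent split described in the UFO entry above can be sketched as follows: one agent chooses the application, another issues control-level actions inside it. The two agent functions and the observation/action callbacks are hypothetical placeholders, not UFO's actual interfaces.

```python
# Minimal sketch of an application-selector agent plus an in-app action agent
# (assumed interfaces for illustration only).

def select_app(request: str, desktop_screenshot) -> str:
    """'AppAgent' role: decide which Windows application should handle the request."""
    raise NotImplementedError


def next_control_action(request: str, app_screenshot, ui_controls: list) -> dict:
    """'ActAgent' role: pick one UI control and an operation on it, e.g.
    {"control_id": 3, "operation": "click"} or {"operation": "finish"}."""
    raise NotImplementedError


def fulfill(request: str, get_desktop_state, get_app_state, act) -> None:
    """get_desktop_state() -> screenshot; get_app_state(app) -> (screenshot, controls);
    act(app, action) executes the chosen operation on the UI."""
    app = select_app(request, get_desktop_state())
    while True:
        screenshot, controls = get_app_state(app)
        action = next_control_action(request, screenshot, controls)
        if action.get("operation") == "finish":
            break
        act(app, action)
```
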
  • OS-Copilot: Towards Generalist Computer Agents with Self-Improvement

    • Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, Lingpeng Kong
    • 🏛️ Institutions: Shanghai AI Lab, East China Normal University, Princeton University, University of Hong Kong
    • 📅 Date: February 12, 2024
    • 📑 Publisher: ICLR 2024 Workshop LLMAgents
    • 💻 Env: [Desktop]
    • 🔑 Key: [framework], [self-directed learning], [GAIA], [FRIDAY], [OS-Copilot]
    • 📖 TLDR: The OS-Copilot framework supports building generalist agents capable of performing diverse tasks across an operating system (OS). This work introduces FRIDAY, an embodied agent using OS-Copilot to self-improve by learning from task outcomes. It operates with a memory-based architecture to tackle OS-level tasks across applications like terminals, web browsers, and third-party tools. Tested on the GAIA benchmark, FRIDAY achieved 35% higher performance than prior methods, proving effective in adapting to unfamiliar applications and refining its capabilities with minimal guidance.
  • OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web

    • Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, Ruslan Salakhutdinov
    • 🏛️ Institutions: CMU
    • 📅 Date: February 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Desktop]
    • 🔑 Key: [dataset], [benchmark]
    • 📖 TLDR: OmniACT introduces a dataset and benchmark to train and evaluate multimodal agents capable of autonomously performing diverse tasks across desktop and web environments. Using annotated UI elements across applications, it combines visual grounding with natural language instructions, providing 9,802 data points for developing agents that integrate high-level reasoning with UI interactions. The study highlights the limited proficiency of current models, with baselines like GPT-4 only achieving 15% of human performance on executable scripts, emphasizing OmniACT's potential as a testbed for advancing multimodal AI.
  • AssistGUI: Task-Oriented Desktop Graphical User Interface Automation

    • Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, Hengxu Wang, Luowei Zhou, Mike Zheng Shou
    • 🏛️ Institutions: NUS
    • 📅 Date: December 20, 2023
    • 📑 Publisher: CVPR 2024
    • 💻 Env: [Desktop]
    • 🔑 Key: [framework], [dataset], [benchmark], [desktop productivity tasks]
    • 📖 TLDR: This study presents AssistGUI, a benchmark and framework for desktop GUI automation, featuring an LLM-based agent capable of completing complex user requests by analyzing instructional videos and performing actions on the desktop. Utilizing a novel Actor-Critic framework and GUI parser, AssistGUI was tested on 100 tasks across nine applications, such as MS Word and After Effects. Despite advances, the top-performing model achieved only a 46% success rate, illustrating the challenge of comprehensive desktop automation and underscoring areas for future research in agent-driven GUI tasks.
  • Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control

    • Longtao Zheng, Rundong Wang, Xinrun Wang, Bo An
    • 🏛️ Institutions: NTU
    • 📅 Date: June 13, 2023
    • 📑 Publisher: ICLR 2024
    • 💻 Env: [Desktop]
    • 🔑 Key: [framework], [benchmark], [trajectory prompting], [state abstraction], [memory retrieval]
    • 📖 TLDR: Synapse introduces a novel framework for computer control tasks, leveraging trajectory-as-exemplar prompting and memory to enhance LLM performance in complex, multi-step computer tasks. The system combines state abstraction, trajectory-based prompts, and memory retrieval, overcoming LLM limitations by filtering task-irrelevant data, storing exemplar trajectories, and retrieving relevant instances for improved decision-making. Synapse achieves significant performance gains on benchmarks such as MiniWoB++ and Mind2Web, demonstrating enhanced task success rates and generalization across diverse web-based tasks.
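
A minimal sketch of trajectory-as-exemplar prompting with memory retrieval, in the spirit of the Synapse entry above: `embed` is a hypothetical placeholder for any text-embedding model, retrieval is plain cosine similarity, and Synapse's state-abstraction step is reduced to a pre-rendered text field.

```python
# Retrieve stored trajectories similar to the current task and prepend them
# as exemplars (assumed interfaces, not Synapse's implementation).
from dataclasses import dataclass

import numpy as np


@dataclass
class Trajectory:
    task: str
    steps: str            # abstracted (state, action) sequence, rendered as text
    embedding: np.ndarray


def embed(text: str) -> np.ndarray:
    """Placeholder for a text-embedding model."""
    raise NotImplementedError


def retrieve(memory: list[Trajectory], task: str, k: int = 3) -> list[Trajectory]:
    """Return the k stored trajectories whose tasks are most similar to the query."""
    q = embed(task)

    def sim(t: Trajectory) -> float:
        return float(np.dot(q, t.embedding) /
                     (np.linalg.norm(q) * np.linalg.norm(t.embedding)))

    return sorted(memory, key=sim, reverse=True)[:k]


def build_prompt(memory: list[Trajectory], task: str) -> str:
    """Prepend retrieved trajectories as exemplars before the new task."""
    exemplars = "\n\n".join(f"Task: {t.task}\n{t.steps}" for t in retrieve(memory, task))
    return f"{exemplars}\n\nTask: {task}\nNext action:"
```
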
  • Language Models can Solve Computer Tasks

    • Geunwoo Kim, Pierre Baldi, Stephen McAleer
    • 🏛️ Institutions: UCI
    • 📅 Date: March 30, 2023
    • 📑 Publisher: NeurIPS 2023
    • 💻 Env: [Desktop]
    • 🔑 Key: [framework], [benchmark], [Recursive Critique and Improve], [RCI], [MiniWoB++], [general computer tasks]
    • 📖 TLDR: This study demonstrates that large language models (LLMs) can effectively automate computer tasks using a Recursive Critique and Improve (RCI) prompting method, enabling agents to handle complex desktop tasks like email and file management. By combining RCI with existing Chain of Thought (CoT) prompting, the method outperforms prior LLM approaches and traditional supervised and reinforcement learning models on the MiniWoB++ benchmark, showing potential for broad computer task automation.
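
The Recursive Critique and Improve (RCI) idea in the entry above reduces to a simple generate-critique-revise loop; a minimal sketch follows, where `call_llm` is a hypothetical placeholder for any chat-completion API and the grounding of actions in the actual page state is omitted.

```python
# Minimal sketch of an RCI-style prompting loop (assumed prompts and helper,
# not the paper's exact implementation).

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call (API client or local model)."""
    raise NotImplementedError


def rci(task: str, rounds: int = 2) -> str:
    """Generate an initial answer, then repeatedly critique and revise it."""
    answer = call_llm(f"Task: {task}\nPropose the next computer action(s).")
    for _ in range(rounds):
        critique = call_llm(
            f"Task: {task}\nProposed actions:\n{answer}\n"
            "Review these actions and point out any mistakes."
        )
        answer = call_llm(
            f"Task: {task}\nProposed actions:\n{answer}\n"
            f"Critique:\n{critique}\n"
            "Rewrite the actions, fixing the issues raised in the critique."
        )
    return answer
```
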
  • A Data-Driven Approach for Learning to Control Computers

    • Peter C. Humphreys, David Raposo, Tobias Pohlen, Gregory Thornton, Rachita Chhaparia, Alistair Muldal, Josh Abramson, Petko Georgiev, Alex Goldin, Adam Santoro, Timothy Lillicrap
    • 🏛️ Institutions: DeepMind
    • 📅 Date: February 16, 2022
    • 📑 Publisher: ICML 2022
    • 💻 Env: [Desktop]
    • 🔑 Key: [dataset], [framework], [computer control], [reinforcement learning], [multimodal transformer]
    • 📖 TLDR: This study presents a reinforcement learning-based approach to train agents for computer control tasks, using keyboard and mouse interactions guided by natural language. By leveraging human demonstration data, agents trained with this approach achieved strong cross-task generalization on the MiniWoB++ benchmark. The framework demonstrates how agents can control computers as humans would, enabling strong performance on complex computer tasks with high transferability.