
Commit

Merge remote-tracking branch 'origin/main'
boyugou committed Oct 30, 2024
2 parents 394b7f5 + 3f5da7b commit ededceb
Showing 2 changed files with 47 additions and 47 deletions.
README.md: 48 changes (24 additions & 24 deletions)
@@ -41,7 +41,7 @@ For missing information, use "Unknown."

You can contribute by providing either the paper title or a fully formatted entry in [Paper Collection](https://github.com/boyugou/GUI-Agents-Paper-List/issues/1). You’re also welcome to open a new PR with your submission.

-For ease of use, feel free to use "auto_prompt.txt" alongside ChatGPT to help format your entry automatically.
+For ease of use, feel free to use "auto_prompt.txt" alongside ChatGPT to help search for your paper and format the entry automatically.
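
For reference, a fully formatted entry follows the same field layout used throughout this list. The skeleton below is only an illustrative placeholder (title, link, and field values are not a real paper):

```markdown
- [Paper Title](https://arxiv.org/abs/XXXX.XXXXX)
    - Author One, Author Two
    - 🏛️ Institutions: Unknown
    - 📅 Date: Month Day, Year
    - 📑 Publisher: arXiv
    - 💻 Env: [GUI]
    - 🔑 Key: [framework], [benchmark]
    - 📖 TLDR: One- or two-sentence summary of the paper's contribution.
```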



@@ -543,23 +543,23 @@ For ease of use, feel free to use "auto_prompt.txt" alongside ChatGPT to help fo
- 🔑 Key: [framework], [benchmark], [reranking], [verification], [mobile task automation]
- 📖 TLDR: This paper presents a system that enhances mobile "how-to" queries by verifying and reranking search results through automated instruction extraction, on-device action execution, and reranking based on relevance. The method improves on traditional ranking by analyzing device-specific execution success. The approach comprises a three-stage pipeline: 1) extracting step-by-step instructions from top search results, 2) validating these instructions on mobile devices, and 3) reranking based on performance. The system leverages a pre-trained GPT model for initial processing, ensuring adaptability across diverse apps and systems.

-- [Octopus: On-device language model for function calling of software APIs](https://arxiv.org/abs/2404.01549)
-- Wei Chen, Zhiyuan Li, Mingyuan Ma
+- [Octopus v2: On-device language model for super agent](https://arxiv.org/abs/2404.01744)
+- Wei Chen, Zhiyuan Li
- 🏛️ Institutions: Unknown
- 📅 Date: April 2, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
-- 🔑 Key: [model], [dataset], [benchmark], [API function calling], [conditional masking], [on-device LLMs]
-- 📖 TLDR: This paper introduces *Octopus*, an on-device language model fine-tuned to perform software API function calls with improved accuracy over cloud-based models like GPT-4. By compiling a dataset from 20,000 API documents and utilizing conditional masking techniques, the model enhances API interactions while maintaining quick inference speeds. Octopus also introduces a new benchmark for evaluating API call accuracy, addressing challenges in automated software development and API integration, particularly for edge devices.
+- 🔑 Key: [model], [framework], [on-device language model], [function calling], [super agent]
+- 📖 TLDR: This paper introduces Octopus v2, an innovative on-device language model designed for efficient function calling in AI agents. The 2-billion parameter model outperforms GPT-4 in both accuracy and latency, while reducing context length by 95%. Octopus v2 uses a novel method of encoding functions into specialized tokens, significantly improving performance and enabling deployment across various edge devices. The model demonstrates a 35-fold latency improvement over Llama-7B with RAG-based function calling, making it suitable for real-world applications on resource-constrained devices.

-- [Octopus v2: On-device language model for super agent](https://arxiv.org/abs/2404.01744)
-- Wei Chen, Zhiyuan Li
+- [Octopus: On-device language model for function calling of software APIs](https://arxiv.org/abs/2404.01549)
+- Wei Chen, Zhiyuan Li, Mingyuan Ma
- 🏛️ Institutions: Unknown
- 📅 Date: April 2, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
-- 🔑 Key: [model], [framework], [on-device language model], [function calling], [super agent]
-- 📖 TLDR: This paper introduces Octopus v2, an innovative on-device language model designed for efficient function calling in AI agents. The 2-billion parameter model outperforms GPT-4 in both accuracy and latency, while reducing context length by 95%. Octopus v2 uses a novel method of encoding functions into specialized tokens, significantly improving performance and enabling deployment across various edge devices. The model demonstrates a 35-fold latency improvement over Llama-7B with RAG-based function calling, making it suitable for real-world applications on resource-constrained devices.
+- 🔑 Key: [model], [dataset], [benchmark], [API function calling], [conditional masking], [on-device LLMs]
+- 📖 TLDR: This paper introduces *Octopus*, an on-device language model fine-tuned to perform software API function calls with improved accuracy over cloud-based models like GPT-4. By compiling a dataset from 20,000 API documents and utilizing conditional masking techniques, the model enhances API interactions while maintaining quick inference speeds. Octopus also introduces a new benchmark for evaluating API call accuracy, addressing challenges in automated software development and API integration, particularly for edge devices.

- [AutoWebGLM: Bootstrap and Reinforce a Large Language Model-based Web Navigating Agent](https://arxiv.org/abs/2404.03648)
- Haotian Luo, Yongqi Li, Xiao Liu, Yansong Feng, Dongyan Zhao
@@ -921,14 +921,23 @@ For ease of use, feel free to use "auto_prompt.txt" alongside ChatGPT to help fo
- 🔑 Key: [framework], [autonomous web navigation], [hierarchical architecture], [DOM distillation]
- 📖 TLDR: This paper presents Agent-E, a novel web agent that introduces several architectural improvements over previous state-of-the-art systems. Key features include a hierarchical architecture, flexible DOM distillation and denoising methods, and a "change observation" concept for improved performance. Agent-E outperforms existing text and multi-modal web agents by 10-30% on the WebVoyager benchmark. The authors synthesize their findings into general design principles for developing agentic systems, including the use of domain-specific primitive skills, hierarchical architectures, and agentic self-improvement.

-- [Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions](https://arxiv.org/abs/2408.xxxxx)
+- [OmniParser for Pure Vision Based GUI Agent](https://arxiv.org/abs/2408.00203)
+- Yadong Lu, Jianwei Yang, Yelong Shen, Ahmed Awadallah
+- 🏛️ Institutions: Unknown
+- 📅 Date: August 1, 2024
+- 📑 Publisher: ICLR 2025
+- 💻 Env: [GUI]
+- 🔑 Key: [framework], [benchmark], [multimodal agent], [screen parsing]
+- 📖 TLDR: This paper presents OmniParser, a method for parsing user interface screenshots into structured elements to enhance the performance of vision-language models like GPT-4V. The approach includes the development of an interactable icon detection dataset and a model that accurately identifies actionable regions in UI screenshots. OmniParser significantly improves the capability of agents to generate contextually grounded actions in user interfaces, boosting performance on benchmarks such as Mind2Web and AITW.
+
+- [Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents](https://arxiv.org/abs/2408.xxxxx)
- [Author information not available]
- 🏛️ Institutions: Unknown
- 📅 Date: August 2024
- 📑 Publisher: arXiv
- 💻 Env: [General]
-- 🔑 Key: [multimodal agents], [environmental distractions], [robustness]
-- 📖 TLDR: This paper highlights the vulnerability of multimodal agents to environmental distractions. The researchers demonstrate that these agents, which process multiple types of input (e.g., text, images, audio), can be significantly impacted by irrelevant or misleading environmental cues. The study provides insights into the limitations of current multimodal systems and emphasizes the need for more robust architectures that can filter out distractions and maintain focus on relevant information in complex, real-world environments.
+- 🔑 Key: [framework], [autonomous agents], [advanced reasoning], [continual learning]
+- 📖 TLDR: This paper introduces Agent Q, a novel framework for developing autonomous AI agents with advanced reasoning and learning capabilities. The system combines reinforcement learning, meta-learning, and causal reasoning to enable agents to adapt to new tasks and environments more effectively. Agent Q demonstrates improved performance in complex decision-making scenarios compared to traditional agent architectures, showing potential for more versatile and intelligent AI systems.

- [CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation](https://aclanthology.org/2024.findings-acl.539)
- Xinbei Ma, Zhuosheng Zhang, Hai Zhao
@@ -939,23 +939,14 @@ For ease of use, feel free to use "auto_prompt.txt" alongside ChatGPT to help fo
- 🔑 Key: [model], [framework], [benchmark]
- 📖 TLDR: This paper presents CoCo-Agent, a multimodal large language model (MLLM) designed for smartphone GUI automation. It introduces two novel approaches: Comprehensive Environment Perception (CEP) for enhanced GUI understanding, and Conditional Action Prediction (CAP) to improve action response accuracy. The proposed agent achieves state-of-the-art performance on GUI automation benchmarks such as AITW and META-GUI, showcasing its capabilities in realistic scenarios.

-- [Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents](https://arxiv.org/abs/2408.xxxxx)
+- [Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions](https://arxiv.org/abs/2408.xxxxx)
- [Author information not available]
- 🏛️ Institutions: Unknown
- 📅 Date: August 2024
- 📑 Publisher: arXiv
- 💻 Env: [General]
-- 🔑 Key: [framework], [autonomous agents], [advanced reasoning], [continual learning]
-- 📖 TLDR: This paper introduces Agent Q, a novel framework for developing autonomous AI agents with advanced reasoning and learning capabilities. The system combines reinforcement learning, meta-learning, and causal reasoning to enable agents to adapt to new tasks and environments more effectively. Agent Q demonstrates improved performance in complex decision-making scenarios compared to traditional agent architectures, showing potential for more versatile and intelligent AI systems.
-
-- [OmniParser for Pure Vision Based GUI Agent](https://arxiv.org/abs/2408.00203)
-- Yadong Lu, Jianwei Yang, Yelong Shen, Ahmed Awadallah
-- 🏛️ Institutions: Unknown
-- 📅 Date: August 1, 2024
-- 📑 Publisher: ICLR 2025
-- 💻 Env: [GUI]
-- 🔑 Key: [framework], [benchmark], [multimodal agent], [screen parsing]
-- 📖 TLDR: This paper presents OmniParser, a method for parsing user interface screenshots into structured elements to enhance the performance of vision-language models like GPT-4V. The approach includes the development of an interactable icon detection dataset and a model that accurately identifies actionable regions in UI screenshots. OmniParser significantly improves the capability of agents to generate contextually grounded actions in user interfaces, boosting performance on benchmarks such as Mind2Web and AITW.
+- 🔑 Key: [multimodal agents], [environmental distractions], [robustness]
+- 📖 TLDR: This paper highlights the vulnerability of multimodal agents to environmental distractions. The researchers demonstrate that these agents, which process multiple types of input (e.g., text, images, audio), can be significantly impacted by irrelevant or misleading environmental cues. The study provides insights into the limitations of current multimodal systems and emphasizes the need for more robust architectures that can filter out distractions and maintain focus on relevant information in complex, real-world environments.

- [VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents](https://arxiv.org/abs/2408.06327)
- Xiao Liu, Tianjie Zhang, Yu Gu, Iat Long Iong, Yifan Xu, Xixuan Song, Shudan Zhang, Hanyu Lai, Xinyi Liu, Hanlin Zhao, et al.