- 🔑 Key: [framework], [benchmark], [multimodal agent], [screen parsing]
- 📖 TLDR: This paper presents OmniParser, a method for parsing user interface screenshots into structured elements to enhance the performance of vision-language models like GPT-4V. The approach includes the development of an interactable icon detection dataset and a model that accurately identifies actionable regions in UI screenshots. OmniParser significantly improves the capability of agents to generate contextually grounded actions in user interfaces, improving over GPT-4V baselines on benchmarks such as Mind2Web and AITW.
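
The screen-parsing idea above can be sketched as a small pipeline. This is a hedged illustration, not OmniParser's released code: the detector and captioner are injected as generic callables, and `UIElement`, `parse_screen`, and `to_prompt` are hypothetical names used only for this sketch.

```python
# Hedged sketch of a screen-parsing pipeline in the spirit of OmniParser.
# The detector and captioner are passed in as callables because the official
# OmniParser model APIs are not assumed here.
from dataclasses import dataclass
from typing import Callable
from PIL import Image

Box = tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

@dataclass
class UIElement:
    box: Box
    caption: str  # short functional description of the region

def parse_screen(
    screenshot: Image.Image,
    detect_regions: Callable[[Image.Image], list[Box]],  # interactable-icon detector
    caption_region: Callable[[Image.Image], str],        # icon/region captioner
) -> list[UIElement]:
    """Convert a raw screenshot into a structured list of actionable elements."""
    elements = []
    for box in detect_regions(screenshot):
        crop = screenshot.crop(box)
        elements.append(UIElement(box=box, caption=caption_region(crop)))
    return elements

def to_prompt(elements: list[UIElement]) -> str:
    """Serialize parsed elements so a vision-language model can ground actions on them."""
    lines = [f"[{i}] {e.caption} at {e.box}" for i, e in enumerate(elements)]
    return "Interactable elements:\n" + "\n".join(lines)
```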

- [Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents](https://arxiv.org/abs/2408.07199)
- [Author information not available]
- 🏛️ Institutions: MultiOn, Stanford
- 📅 Date: August 2024
- 📑 Publisher: arXiv
- 💻 Env: [General]
- 🔑 Key: [framework], [MCTS], [DPO], [self-critique]
- 📖 TLDR: Agent Q combines guided Monte Carlo Tree Search over web interactions with AI self-critique and iterative fine-tuning via Direct Preference Optimization, letting the agent learn from both successful and failed trajectories. Evaluated on WebShop and a real-world restaurant-reservation task, it substantially outperforms behavior-cloning and RL-fine-tuned baselines and improves further when search is allowed at inference time.
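
The preference-learning step described above can be sketched roughly as follows. This is an illustration under assumptions, not Agent Q's exact pipeline: the trajectory record format and the pair-the-best-with-the-worst ranking rule are simplifications, and `build_preference_pairs` / `dpo_loss` are names invented for the sketch (the loss shown is the standard DPO objective).

```python
# Illustrative sketch: turning ranked agent trajectories into DPO preference
# pairs. The data layout and ranking rule are assumptions for illustration.
import torch.nn.functional as F

def build_preference_pairs(trajectories):
    """trajectories: list of dicts {"prompt": str, "actions": str, "value": float},
    where `value` could come from tree-search value estimates or self-critique.
    Pairs the highest- and lowest-valued rollout for each prompt."""
    by_prompt = {}
    for t in trajectories:
        by_prompt.setdefault(t["prompt"], []).append(t)
    pairs = []
    for prompt, rollouts in by_prompt.items():
        if len(rollouts) < 2:
            continue
        rollouts.sort(key=lambda t: t["value"])
        pairs.append({"prompt": prompt,
                      "chosen": rollouts[-1]["actions"],
                      "rejected": rollouts[0]["actions"]})
    return pairs

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective on summed log-probs of chosen/rejected responses."""
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(beta * margin).mean()
```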

- [GAIA: a benchmark for General AI Assistants](https://huggingface.co/gaia-benchmark)
- Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, Thomas Scialom
- 🏛️ Institutions: Meta AI, Hugging Face
- 📅 Date: November 21, 2023
- 📑 Publisher: arXiv
- 💻 Env: [General]
- 🔑 Key: [benchmark], [multi-modality], [tool use], [reasoning]
- 📖 TLDR: GAIA is a benchmark developed for evaluating general-purpose AI assistants. It aims to test assistant models across multiple modalities and complex reasoning tasks in real-world settings, including scenarios that require tool usage and open-ended question answering. With a dataset comprising 466 questions across various domains, GAIA highlights gaps between current AI performance and human capability, presenting a significant challenge for large language models such as GPT-4.
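
A small usage sketch for running an assistant against GAIA. It assumes the gated `gaia-benchmark/GAIA` dataset on the Hugging Face Hub and a simple normalized exact match; the config, split, and field names below are assumptions that may need adjusting to the released schema, and the official quasi-exact-match scorer differs in its details.

```python
# Hedged sketch: loading GAIA and scoring predictions with a naive normalized
# exact match. The dataset is gated on the Hugging Face Hub; the config/split/
# field names ("2023_all", "validation", "Question", "Final answer") are
# assumptions that may not match the released schema exactly.
import re
from datasets import load_dataset

def normalize(ans: str) -> str:
    """Lowercase, strip punctuation and extra whitespace before comparison."""
    ans = ans.strip().lower()
    ans = re.sub(r"[^\w\s.-]", "", ans)
    return re.sub(r"\s+", " ", ans)

def evaluate(predict, split: str = "validation") -> float:
    """`predict` is your assistant: a callable mapping a question string to an answer string."""
    data = load_dataset("gaia-benchmark/GAIA", "2023_all", split=split)
    correct = 0
    for ex in data:
        pred = predict(ex["Question"])
        correct += normalize(pred) == normalize(ex["Final answer"])
    return correct / len(data)
```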


- [CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation](https://aclanthology.org/2024.findings-acl.539)
- Xinbei Ma, Zhuosheng Zhang, Hai Zhao
- 🔑 Key: [model], [framework], [benchmark]
- 📖 TLDR: This paper presents CoCo-Agent, a multimodal large language model (MLLM) designed for smartphone GUI automation. It introduces two novel approaches: Comprehensive Environment Perception (CEP) for enhanced GUI understanding, and Conditional Action Prediction (CAP) to improve action response accuracy. The proposed agent achieves state-of-the-art performance on GUI automation benchmarks such as AITW and META-GUI, showcasing its capabilities in realistic scenarios.
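
Conditional Action Prediction can be illustrated as a two-stage decoding loop: choose the action type first, then fill in arguments conditioned on that choice. This is a hedged sketch, not CoCo-Agent's actual prompts or action vocabulary; `llm` stands for any text-generation callable and the templates are invented for illustration.

```python
# Illustrative two-stage action decoding in the spirit of conditional action
# prediction: action type first, then type-conditioned arguments.
# Prompt templates and the action vocabulary are assumptions, not CoCo-Agent's.
ACTION_TYPES = ["click", "scroll", "type", "navigate_back", "press_home"]

def predict_action(llm, screen_description: str, goal: str, history: list[str]) -> dict:
    context = (
        f"Goal: {goal}\n"
        f"Screen: {screen_description}\n"
        f"History: {'; '.join(history) or 'none'}\n"
    )
    # Stage 1: choose the action type from a closed vocabulary.
    action_type = llm(context + f"Choose one action type from {ACTION_TYPES}:").strip()
    # Stage 2: predict arguments conditioned on the chosen type.
    if action_type == "click":
        args = {"target": llm(context + "Action type: click. Which element should be clicked?")}
    elif action_type == "type":
        args = {"text": llm(context + "Action type: type. What text should be entered?")}
    elif action_type == "scroll":
        args = {"direction": llm(context + "Action type: scroll. Which direction (up/down/left/right)?")}
    else:
        args = {}
    return {"type": action_type, "args": args}
```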

- [Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions](https://arxiv.org/abs/2408.02544)
- Xinbei Ma, Yiting Wang, Yao Yao, Tongxin Yuan, Aston Zhang, Zhuosheng Zhang, Hai Zhao
- 🏛️ Institutions: SJTU, Meta
- 📅 Date: August 2024
- 📑 Publisher: arXiv
- 💻 Env: [General]
- 🔑 Key: [framework], [dataset], [general virtual agents], [open-ended learning], [tool creation]
- 📖 TLDR: AgentStudio is a robust toolkit for developing virtual agents with versatile actions, such as GUI automation and code execution. It unifies real-world human-computer interactions across OS platforms and includes diverse observation and action spaces, facilitating comprehensive training and benchmarking in complex settings. The toolkit's flexibility promotes agent generalization across varied tasks, supporting tool creation and a multimodal interaction interface to advance agent adaptability and learning.

- [Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study](https://arxiv.org/abs/2403.03186)
- Weihao Tan, Ziluo Ding, Wentao Zhang, Boyu Li, Bohan Zhou, Junpeng Yue, Haochong Xia, Jiechuan Jiang, Longtao Zheng, Xinrun Xu, Yifei Bi, Pengjie Gu, Xinrun Wang, Börje F. Karlsson, Bo An, Zongqing Lu
- 🏛️ Institutions: Unknown
- 📅 Date: March 5, 2024
- 💻 Env: [Web]
- 🔑 Key: [framework], [dataset], [structured web extraction], [minimal human labeling], [cross-vertical extraction]
- 📖 TLDR: This paper presents a scalable solution to structured web data extraction across diverse website domains (e.g., books, restaurants) by leveraging limited labeled data per vertical. The approach uses generalized features to characterize each vertical and adapts these to new sites by unsupervised constraints. The solution's robustness is validated on 80 sites across 8 categories, demonstrating that minimal site-specific training is needed to generalize extraction capabilities.

- [Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution](https://qwenlm.github.io/blog/qwen2-vl/)
- Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin
- 🏛️ Institutions: Alibaba Cloud
- 📅 Date: September 18, 2024
- 📑 Publisher: arXiv
- 💻 Env: [General]
- 🔑 Key: [framework], [model], [dynamic resolution], [M-RoPE], [vision-language]
- 📖 TLDR: Qwen2-VL introduces an advanced vision-language framework that enables dynamic resolution handling for images and videos through its Naive Dynamic Resolution mechanism and Multimodal Rotary Position Embedding (M-RoPE). This structure allows the model to convert images of varying resolutions into diverse token counts for improved visual comprehension. With model sizes up to 72B parameters, Qwen2-VL demonstrates competitive performance across multiple benchmarks, achieving results on par with or better than prominent multimodal models like GPT-4o and Claude3.5-Sonnet. This work represents a significant step forward in scalable vision-language learning for multimodal tasks.
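
The dynamic-resolution idea can be made concrete with a token-accounting sketch. The constants below (14-pixel patches merged 2×2 into one visual token per 28×28 region, and a 1280-token budget) are assumptions for illustration; the released preprocessing may use different bounds and rounding.

```python
# Hedged sketch of dynamic-resolution token accounting in the spirit of
# Qwen2-VL's Naive Dynamic Resolution. Constants are assumptions only.
import math

PATCH = 14          # ViT patch size in pixels (assumed)
MERGE = 2           # 2x2 patch merge -> one visual token per 28x28 region (assumed)
UNIT = PATCH * MERGE

def resize_for_budget(height: int, width: int, max_tokens: int = 1280) -> tuple[int, int, int]:
    """Round dimensions to the token grid and downscale if over the token budget.
    Returns (new_height, new_width, num_visual_tokens)."""
    h = max(UNIT, round(height / UNIT) * UNIT)
    w = max(UNIT, round(width / UNIT) * UNIT)
    tokens = (h // UNIT) * (w // UNIT)
    if tokens > max_tokens:
        scale = math.sqrt(max_tokens / tokens)  # shrink both sides, keep aspect ratio
        h = max(UNIT, math.floor(height * scale / UNIT) * UNIT)
        w = max(UNIT, math.floor(width * scale / UNIT) * UNIT)
        tokens = (h // UNIT) * (w // UNIT)
    return h, w, tokens

# e.g. a 1080x1920 screenshot maps to roughly 39*69 = 2691 candidate tokens,
# so it is scaled down until it fits the assumed 1280-token budget.
```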

- [Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V](https://arxiv.org/abs/2310.11441)
- Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, Jianfeng Gao
- 🏛️ Institutions: Microsoft Research
- 📅 Date: October 17, 2023
- 📑 Publisher: arXiv
- 💻 Env: [General]
- 🔑 Key: [visual prompting], [framework], [benchmark], [visual grounding], [zero-shot]
- 📖 TLDR: This paper introduces Set-of-Mark (SoM), a novel visual prompting approach designed to enhance the visual grounding capabilities of multimodal models like GPT-4V. By overlaying images with spatially and semantically distinct marks, SoM enables fine-grained object recognition and interaction within visual data, surpassing conventional zero-shot segmentation methods in accuracy. The framework is validated on tasks requiring detailed spatial reasoning, demonstrating a significant improvement over existing visual-language models without fine-tuning.
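
The overlay step is easy to sketch with PIL, assuming region proposals already come from some detector or segmenter; the mark styling and the helper name `overlay_marks` are illustrative only.

```python
# Minimal Set-of-Mark-style overlay: draw numbered marks on candidate regions
# so a multimodal model can refer to objects by index. Region proposals are
# assumed to come from any detector/segmenter; styling is illustrative only.
from PIL import Image, ImageDraw, ImageFont

def overlay_marks(image: Image.Image, boxes: list[tuple[int, int, int, int]]) -> Image.Image:
    """Draw a numbered label at the top-left corner of each proposed region."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    font = ImageFont.load_default()
    for idx, (left, top, right, bottom) in enumerate(boxes, start=1):
        draw.rectangle((left, top, right, bottom), outline="red", width=3)
        draw.rectangle((left, top, left + 24, top + 16), fill="red")
        draw.text((left + 4, top + 2), str(idx), fill="white", font=font)
    return marked

# The marked image is then sent to the multimodal model with a prompt such as
# "The image contains numbered marks; answer using the mark numbers."
```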


- [EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage](https://arxiv.org/abs/2409.11295)
- Zeyi Liao, Lingbo Mo, Chejian Xu, Mintong Kang, Jiawei Zhang, Chaowei Xiao, Yuan Tian, Bo Li, Huan Sun
- 🏛️ Institutions: OSU, UCLA, UChicago, UIUC, UW-Madison
- 📅 Date: September 17, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [safety], [privacy attack], [environmental injection], [stealth attack]
- 📖 TLDR: This paper introduces the Environmental Injection Attack (EIA), a privacy attack targeting generalist web agents by embedding malicious yet concealed web elements to trick agents into leaking users' PII. Utilizing 177 action steps within realistic web scenarios, EIA demonstrates a high success rate in extracting specific PII and whole user requests. Through its detailed threat model and defense suggestions, the work underscores the challenge of detecting and mitigating privacy risks in autonomous web agents.
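
On the defense side, a toy heuristic can flag hidden elements whose text reads like instructions, which is the kind of content the EIA threat model conceals in a page. This is a simplistic sketch, not the paper's method: the visibility checks and regex are assumptions, and real defenses would need rendering-aware analysis.

```python
# Toy heuristic (not the paper's method): flag invisible elements whose text or
# attributes read like instructions to the agent, a pattern the EIA threat
# model exploits. Real defenses need rendering-aware checks.
import re
from bs4 import BeautifulSoup

INSTRUCTION_HINTS = re.compile(
    r"(enter|type|paste|provide).{0,40}(email|phone|address|password|ssn)",
    re.IGNORECASE,
)

def find_suspicious_elements(html: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    flagged = []
    for el in soup.find_all(True):
        style = (el.get("style") or "").replace(" ", "").lower()
        hidden = ("display:none" in style or "opacity:0" in style
                  or el.get("aria-hidden") == "true")
        text = " ".join(filter(None, [el.get_text(" ", strip=True),
                                      el.get("aria-label") or "",
                                      el.get("placeholder") or ""]))
        if hidden and INSTRUCTION_HINTS.search(text):
            flagged.append(el.name + ": " + text[:80])
    return flagged
```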
