
SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents

With the integration of large language models (LLMs), embodied agents have strong capabilities to execute complicated instructions in natural language, paving the way for the potential deployment of embodied robots. However, a foreseeable issue is that such embodied agents can also flawlessly execute hazardous tasks, potentially causing damage in the real world. To study this issue, we present SafeAgentBench, a new benchmark for safety-aware task planning of embodied LLM agents. SafeAgentBench includes: (1) a new dataset with 750 tasks, covering 10 potential hazards and 3 task types; (2) SafeAgentEnv, a universal embodied environment with a low-level controller, supporting multi-agent execution with 17 high-level actions for 8 state-of-the-art baselines; and (3) reliable evaluation methods from both execution and semantic perspectives. Experimental results show that the best-performing baseline achieves a 69% success rate on safe tasks but only a 5% rejection rate on hazardous tasks, indicating significant safety risks.

For the latest updates, see our website.

Quickstart

Clone repo:

$ git clone https://github.com/shengyin/safeagentbench.git 
$ cd safeagentbench

Install requirements:

$ pip install -r requirements.txt

More Info

  • Dataset: safe detailed tasks (300 samples), unsafe detailed tasks (300 samples), abstract tasks (100 samples), and long-horizon tasks (50 samples).
  • Evaluators: evaluation metrics for each type of task, including success rate, rejection rate, and other metrics.
  • Low-level controller: a low-level controller for SafeAgentEnv, which takes in high-level actions and maps them to the low-level actions supported by AI2-THOR for the agent to execute. You can choose either the multi-agent or the single-agent version; a minimal sketch of the mapping is shown after this list.
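For illustration, the sketch below shows how such a controller might translate a high-level action into low-level AI2-THOR steps. The action names, the ACTION_MAP table, and the execute_high_level helper are hypothetical placeholders rather than the repository's actual interface; only the ai2thor Controller calls follow the public AI2-THOR API.

from ai2thor.controller import Controller

# Hypothetical mapping from high-level actions to low-level AI2-THOR calls
# (illustrative only; the repository defines its own 17 high-level actions).
ACTION_MAP = {
    "pick":      lambda c, obj_id: c.step(action="PickupObject", objectId=obj_id),
    "put":       lambda c, obj_id: c.step(action="PutObject", objectId=obj_id),
    "open":      lambda c, obj_id: c.step(action="OpenObject", objectId=obj_id),
    "toggle_on": lambda c, obj_id: c.step(action="ToggleObjectOn", objectId=obj_id),
}

def execute_high_level(controller, action, obj_id):
    """Map one high-level action to its low-level AI2-THOR call and execute it."""
    step = ACTION_MAP[action]
    event = step(controller, obj_id)
    return event.metadata["lastActionSuccess"]

if __name__ == "__main__":
    controller = Controller(scene="FloorPlan1")  # single-agent AI2-THOR instance
    # In practice, object IDs come from the scene metadata; this one is a placeholder.
    ok = execute_high_level(controller, "open", "Fridge|-02.10|+00.00|+01.07")
    print("action succeeded:", ok)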

SOTA Embodied LLM Agents

Because each agent has a different code structure, we cannot provide all of the implementation code here. Please refer to the papers and code of these works to implement your own agent.

LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents
Jae-Woo Choi, Youngwoo Yoon, Hyobin Ong, Jaehong Kim, Minsu Jang
Paper, Code

Building Cooperative Embodied Agents Modularly with Large Language Models
Hongxin Zhang*, Weihua Du*, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua B. Tenenbaum, Tianmin Shu, Chuang Gan
Paper, Code

ProgPrompt: Generating Situated Robot Task Plans using Large Language Models
Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, Animesh Garg
Paper, Code

MLDT: Multi-Level Decomposition for Complex Long-Horizon Robotic Task Planning with Open-Source Large Language Model
Yike Wu, Jiatao Zhang, Nan Hu, LanLing Tang, Guilin Qi, Jun Shao, Jie Ren, Wei Song
Paper, Code

PCA-Bench: Evaluating Multimodal Large Language Models in Perception-Cognition-Action Chain
Liang Chen, Yichi Zhang, Shuhuai Ren, Haozhe Zhao, Zefan Cai, Yuchi Wang, Peiyi Wang, Xiangdi Meng, Tianyu Liu, Baobao Chang
Paper, Code

ReAct: Synergizing Reasoning and Acting in Language Models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao
Paper, Code

LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models
Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M. Sadler, Wei-Lun Chao, Yu Su
Paper, Code

Multi-agent Planning using Visual Language Models
Michele Brienza, Francesco Argenziano, Vincenzo Suriani, Domenico D. Bloisi, Daniele Nardi
Paper, Code

Hardware

The hardware requirements are the same as for AI2-THOR.

Citation

If you find the dataset or code useful, please cite:

TBD

License

MIT License

Contact

Questions or issues? Contact [email protected]