This is the code for evaluating ALFChat on ALFWorld.
Download the ALFWorld data and install the environments following the ALFWorld installation instructions.
We provide a Docker image with all environments already set up. The code is in its /autogen-eval directory.
docker pull leoljl/alfchat:v2
docker run -it leoljl/alfchat:v2 zsh
# You are now inside the Docker container
cd /autogen-eval/application/A3-decision-making-ALFWorld
Fill in your API key in twoagent.py, then run the following command to evaluate ALFChat (2 agents) on ALFWorld. The conversation history will be saved in logs_twoagent/.
python twoagent.py
Fill in your API key in multiagent.py, then run the following command to evaluate ALFChat (3 agents) on ALFWorld. The conversation history will be saved in logs_multiagent/.
python multiagent.py
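Both scripts need an API key. As an alternative to pasting the key into the source, it can be read from an environment variable. The sketch below is a minimal illustration under stated assumptions. It assumes the scripts build an AutoGen-style config list of dicts with "model" and "api_key" entries. The function name build_config_list and the variable OPENAI_API_KEY are hypothetical and do not come from twoagent.py or multiagent.py, so check those files for the actual structure.

```python
import os


def build_config_list(model: str) -> list[dict]:
    """Hypothetical helper: build a config list from an environment variable.

    Assumes the evaluation scripts accept a list of {"model", "api_key"} dicts;
    adapt this to whatever twoagent.py / multiagent.py actually expect.
    """
    api_key = os.environ.get("OPENAI_API_KEY")  # assumed variable name
    if not api_key:
        # Fail early instead of sending unauthenticated requests mid-run.
        raise RuntimeError("Set OPENAI_API_KEY before running the evaluation.")
    return [{"model": model, "api_key": api_key}]
```

With the key exported in the shell (export OPENAI_API_KEY=...), the key never has to be written into the repository.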
To calculate the success rate from the saved conversation history, run:
python count.py --dir logs_multiagent/
python count.py --dir logs_twoagent/
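The success rate is the number of solved tasks divided by the number of evaluated tasks. The sketch below shows that arithmetic and is not the actual logic of count.py. It assumes that each file in the log directory holds one task's conversation and that a solved task contains a fixed marker string. Both the one-file-per-task layout and the SUCCESS marker are hypothetical.

```python
from pathlib import Path

# Hypothetical marker; count.py may detect success differently.
SUCCESS_MARKER = "SUCCESS"


def success_rate(log_dir: str, marker: str = SUCCESS_MARKER) -> tuple[int, int, float]:
    """Return (successes, total, rate) over the task logs in log_dir.

    Assumes one log file per task; a task counts as solved if its log
    contains the marker string.
    """
    files = sorted(p for p in Path(log_dir).iterdir() if p.is_file())
    total = len(files)
    successes = sum(1 for p in files if marker in p.read_text(errors="ignore"))
    # Guard against an empty directory to avoid division by zero.
    rate = successes / total if total else 0.0
    return successes, total, rate
```

For example, success_rate("logs_twoagent/") returns the solved count, the total, and their ratio, which can be formatted as a percentage for comparison across runs.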
We compare the task success rates of ReAct, ALFChat (2 agents), and ALFChat (3 agents).