Fix: Update test_webarena.raw.json for better evaluation. #67

Open: wants to merge 3 commits into base: main

@yeonjooooni commented Oct 4, 2024

Reason for Change: Some reference answers in test_webarena.raw.json are incorrect or inconsistent. I believe minor fixes are needed for more accurate evaluation.

Changes Made: I mainly fixed three types of configuration issues:

  1. The answers for the intent “Show me the command to clone {{repo}} with SSH.” were inconsistent. Some configurations have the answer
    "exact_match": "git clone ssh://[email protected]:2222/{repo_name}.git",
    while others use "exact_match": "ssh://[email protected]:2222/{repo_name}.git".
    I unified them to the first form, since the intent asks for the full command.
  2. The answers for the intents “Open my latest updated issue that has the keyword ‘{{keyword}}’ in its title to check if it is closed” and “Open my latest created issue that has {{keyword}} in its title to check if it is closed” were inconsistent.
    The first intent’s answer uses "fuzzy_match": ["Yes, it is closed"],
    while the second uses "exact_match": "Yes".
    I unified them to the first form.
  3. There are multiple ways to fulfill the intent “I want to browse the products in the {{category}} category.” For example, to reach the men’s shoes category, we can either use the dashboard or first navigate to “Men” and then select “Shoes.” Both approaches lead to the same result, though the URLs may differ. Of the two images below, the left one shows the original answer and the right one shows the latter approach.
    Since the latter approach is equally valid, I added a reference URL for it: “ |OR| SHOPPING/clothing-shoes-jewelry/men.html?cat=145”.

[Images: left, Gold Answer; right, Latter Approach]

  4. Lastly, I fixed the wrong URL for task_id 102, which was also mentioned in a WebArena PR.
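
The inconsistencies in items 1 and 2 could be looked for with a small script like the one below. This is only a sketch, not part of the PR: it assumes each task is a dict with `task_id`, `intent_template`, and `eval.reference_answers` fields, and the sample tasks (including the `example.com` host and the shortened templates) are hypothetical. Differently worded templates that should share an answer format, as in item 2, would need a custom `key` function that maps them to one group.

```python
from collections import defaultdict


def reference_answers(task):
    # Assumed layout: task["eval"]["reference_answers"] maps a match type
    # ("exact_match", "fuzzy_match", ...) to the expected answer, or is null.
    return (task.get("eval") or {}).get("reference_answers") or {}


def find_mixed_match_types(tasks, key=lambda t: t.get("intent_template")):
    """Return groups whose tasks do not all use the same set of match types."""
    kinds = defaultdict(set)
    for task in tasks:
        answers = reference_answers(task)
        if answers:
            kinds[key(task)].add(tuple(sorted(answers)))
    return {group: sorted(found) for group, found in kinds.items() if len(found) > 1}


def missing_prefix(tasks, template, prefix):
    """Return task_ids for `template` whose exact_match answer lacks `prefix`."""
    ids = []
    for task in tasks:
        if task.get("intent_template") != template:
            continue
        exact = reference_answers(task).get("exact_match")
        if isinstance(exact, str) and not exact.startswith(prefix):
            ids.append(task.get("task_id"))
    return ids


# Hypothetical sample data; the real file would be read with json.load().
sample = [
    {"task_id": 1, "intent_template": "clone {{repo}}",
     "eval": {"reference_answers": {"exact_match": "git clone ssh://git@example.com:2222/a/b.git"}}},
    {"task_id": 2, "intent_template": "clone {{repo}}",
     "eval": {"reference_answers": {"exact_match": "ssh://git@example.com:2222/c/d.git"}}},
    {"task_id": 3, "intent_template": "issue {{keyword}}",
     "eval": {"reference_answers": {"fuzzy_match": ["Yes, it is closed"]}}},
    {"task_id": 4, "intent_template": "issue {{keyword}}",
     "eval": {"reference_answers": {"exact_match": "Yes"}}},
]

print(find_mixed_match_types(sample))
print(missing_prefix(sample, "clone {{repo}}", "git clone "))
```

The first check flags groups that mix match types; the second flags clone answers that omit the `git clone` prefix.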

Testing: I tested these changes locally in a Docker environment and confirmed that they introduce no errors.
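
As a lightweight complement to the Docker run, the effect of the `|OR|` alternative from item 3 can be illustrated in isolation. The sketch below is not the evaluator's actual code: it assumes alternatives are separated by the literal string ` |OR| ` and compared by path and query parameters, and the first alternative (`SHOPPING/hypothetical-original-path.html`) is a made-up stand-in for the original gold URL, which appears only in the image.

```python
from urllib.parse import parse_qs, urlsplit


def normalize(url):
    # Compare URLs by path (ignoring a trailing slash) and parsed query.
    parts = urlsplit(url)
    return parts.path.rstrip("/"), parse_qs(parts.query)


def url_matches(reference_url, visited_url, separator=" |OR| "):
    """True if visited_url equals any alternative listed in reference_url."""
    visited = normalize(visited_url)
    return any(
        normalize(alt.strip()) == visited
        for alt in reference_url.split(separator)
    )


# The first alternative is hypothetical; the second is the one added in this PR.
reference = (
    "SHOPPING/hypothetical-original-path.html"
    " |OR| SHOPPING/clothing-shoes-jewelry/men.html?cat=145"
)

print(url_matches(reference, "SHOPPING/clothing-shoes-jewelry/men.html?cat=145"))
print(url_matches(reference, "SHOPPING/clothing-shoes-jewelry/men.html?cat=146"))
```

With the added alternative, an agent that navigates through “Men” and then “Shoes” is accepted, while a URL with a different `cat` value is still rejected.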

Request for Feedback: If there are any concerns or additional improvements you’d like me to make, please let me know.
