The MMAU dataset evaluates LLM agent capabilities across multiple domains, including tool use, math, and coding challenges.
- Clone the Repository
git clone https://github.com/apple/axlearn.git
cd axlearn
- Install Dependencies
pip install ".[mmau]"
- Dataset Preparation
- Google Cloud (Coming Soon):
mkdir -p ./data/
gsutil -m cp -r "gs://axlearn-public/datasets/mmau/20240712/*" ./data/
- Hugging Face (Coming Soon):
huggingface-cli download apple/mmau --local-dir ./data --repo-type dataset
-
Mind2Web provides a structured dataset to evaluate cross-domain generalization.
- Training Set: Available on Hugging Face
git clone git@hf.co:datasets/osunlp/Mind2Web
- Test Set: Available here
- Train: 1,009 instances
- Test:
  - Cross Task: 252 instances (tasks from the same website seen during training)
  - Cross Website: 177 instances (websites not seen during training)
  - Cross Domain: 912 instances (entire domains not seen during training)
- annotation_id (str): Unique ID for each task
- website (str): Website name
- domain (str): Website domain
- subdomain (str): Website subdomain
- confirmed_task (str): Task description
- action_reprs (list[str]): Human-readable representation of the action sequence
- actions (list[dict]): List of actions (steps) to complete the task
  - action_uid (str): Unique ID for each action (step)
  - raw_html (str): Raw HTML of the page before the action is performed
  - cleaned_html (str): Cleaned HTML of the page before the action is performed
  - operation (dict): Operation to perform
    - op (str): Operation type (CLICK, TYPE, SELECT)
    - original_op (str): Original operation type, containing HOVER and ENTER mapped to CLICK
    - value (str): Optional value for the operation (e.g., text to type)
  - pos_candidates (list[dict]): Ground truth elements after preprocessing
    - tag (str): Tag of the element
    - is_original_target (bool): Whether the element is the original target labeled by the annotator
    - is_top_level_target (bool): Whether the element is a top-level target found by the algorithm
    - backend_node_id (str): Unique ID for the element
    - attributes (str): Serialized attributes of the element (use json.loads to convert back to dict)
  - neg_candidates (list[dict]): Other candidate elements in the page after preprocessing
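The nested fields above can be walked as in the following minimal sketch. The sample record is hand-written to match the field list and is not taken from the dataset; the values and the helper name summarize_task are assumptions made for illustration only.

```python
import json

# Hypothetical record shaped like the field list above (not real dataset content).
sample = {
    "annotation_id": "example-0001",
    "website": "example_site",
    "domain": "Travel",
    "subdomain": "Airlines",
    "confirmed_task": "Search for a one-way flight",
    "action_reprs": ["[textbox] From -> TYPE: Boston"],
    "actions": [
        {
            "action_uid": "step-0001",
            "raw_html": "<html>...</html>",
            "cleaned_html": "<html>...</html>",
            "operation": {"op": "TYPE", "original_op": "TYPE", "value": "Boston"},
            "pos_candidates": [
                {
                    "tag": "input",
                    "is_original_target": True,
                    "is_top_level_target": True,
                    "backend_node_id": "136",
                    # Attributes are stored as a JSON string, not as a dict.
                    "attributes": json.dumps({"id": "from", "type": "text"}),
                }
            ],
            "neg_candidates": [],
        }
    ],
}


def summarize_task(record):
    """Return one (op, value, target_tag, target_attributes) tuple per positive candidate."""
    steps = []
    for action in record["actions"]:
        operation = action["operation"]
        for candidate in action["pos_candidates"]:
            # Decode the serialized attributes back into a dict.
            attributes = json.loads(candidate["attributes"])
            steps.append(
                (operation["op"], operation.get("value"), candidate["tag"], attributes)
            )
    return steps


print(summarize_task(sample))
```

Decoding attributes lazily, only for the candidates that are actually inspected, avoids parsing the much larger neg_candidates list when it is not needed.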
Natural Questions (NQ) is designed for training and evaluating question-answering systems, using real user queries and Wikipedia answers.
- Visit the Dataset Page: Access the official dataset and leaderboard:
- Download the Dataset: You can download the dataset directly from the official page linked above, which provides:
  - Training Set: 307,372 examples
  - Development Set: 7,830 examples
  - Test Set: 7,842 examples (hidden)
- Use Preprocessing Tools: The repository provides preprocessing utilities and functions to simplify the dataset format. Use the simplify_nq_example function found in data_utils.py to transform the dataset into a more accessible format.
TriviaQA is a reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions.
- Visit the TriviaQA Website: Download the dataset directly from the official website:
- Requirements: Ensure you have the following installed:
  - Python 3
  - Python packages: tensorflow (if using BiDAF), nltk, tqdm
- Evaluate the Dataset: To evaluate a model using the TriviaQA dataset, use the following command:
python3 -m evaluation.triviaqa_evaluation --dataset_file samples/triviaqa_sample.json --prediction_file samples/sample_predictions.json
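For intuition about what such an evaluation measures, here is a minimal sketch of normalized exact match and token-level F1, the metrics commonly reported for reading-comprehension benchmarks. It is not the official script: the normalization rules and the function names normalize, exact_match, and f1 are assumptions, and the official evaluation may differ in detail.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, references):
    """Return 1.0 if the normalized prediction equals any normalized reference."""
    target = normalize(prediction)
    return float(any(target == normalize(ref) for ref in references))


def f1(prediction, references):
    """Return the best token-level F1 between the prediction and any reference."""
    pred_tokens = normalize(prediction).split()
    best = 0.0
    for ref in references:
        ref_tokens = normalize(ref).split()
        # Multiset intersection counts each shared token at most as often as it appears in both.
        overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(ref_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


print(exact_match("The Beatles!", ["Beatles"]))
print(f1("John Lennon", ["Lennon"]))
```

Scoring against a list of references, and keeping the best score, reflects the fact that a trivia answer can have several acceptable spellings or aliases.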
GSM8K (Grade School Math 8K) is a dataset of 8.5K linguistically diverse grade school math word problems, designed to support question answering tasks that require multi-step reasoning.
- Visit the Dataset Page: You can access the GSM8K dataset directly on Hugging Face:
- Install the Hugging Face Datasets Library: To use the dataset, first install the datasets library:
pip install datasets
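- Load the Dataset: After installing the library, load the dataset with the following code. The dataset ships in two configurations, main and socratic, so the configuration name is passed explicitly:
from datasets import load_dataset
dataset = load_dataset("openai/gsm8k", "main")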
- Dataset Structure:
  - Training Set: 7,473 examples
  - Test Set: 1,319 examples
  - Each example contains a math problem and a multi-step reasoning solution.
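When scoring model outputs, the final numeric answer usually has to be separated from the reasoning steps. The sketch below assumes the convention described on the dataset card: the solution text ends with a line of the form "#### <answer>", and intermediate calculations are annotated as <<...>>. The sample string is hand-written for illustration and the helper names are hypothetical.

```python
import re

# Illustrative solution text in the GSM8K style (hand-written, not copied from the dataset).
answer = (
    "Each box holds 12 pencils, so 3 boxes hold 3 * 12 = <<3*12=36>>36 pencils.\n"
    "After giving away 5, 36 - 5 = <<36-5=31>>31 pencils remain.\n"
    "#### 31"
)


def extract_final_answer(answer_text):
    """Return the text after the last '####' marker without thousands separators, or None."""
    if "####" not in answer_text:
        return None
    return answer_text.rsplit("####", 1)[1].strip().replace(",", "")


def strip_calculator_annotations(answer_text):
    """Remove the <<...>> calculation annotations, leaving plain prose."""
    return re.sub(r"<<.*?>>", "", answer_text)


print(extract_final_answer(answer))
print(strip_calculator_annotations(answer))
```

Returning the answer as a string, not a number, leaves the choice of numeric parsing (int, float, or Decimal) to the caller.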