This user guide explains how to use CloudAI, covering topics such as adding new tests and downloading datasets for running the NeMo launcher.
1. Set Up the GitLab Repository: Start by setting up a repository on GitLab to host your Docker image. For this example, use gitlab-url.com/cloudai/nccl-test.
2. Write the Dockerfile: The Dockerfile needs to specify the base image and detail the build steps:
   FROM nvcr.io/nvidia/pytorch:24.02-py3
3. Build and Push the Docker Image: Build the Docker image with the Dockerfile and push it to the designated repository:
   docker build -t gitlab-url.com/cloudai/nccl-test .
   docker push gitlab-url.com/cloudai/nccl-test
4. Verify the Docker Image: Run the image with srun to verify that it works correctly:
   srun \
     --mpi=pmix \
     --container-image=gitlab-url.com/cloudai/nccl-test \
     /usr/local/bin/all_reduce_perf_mpi \
     --nthreads 1 \
     --ngpus 1 \
     --minbytes 128 \
     --maxbytes 16G \
     --stepbytes 1M \
     --op sum \
     --datatype float \
     --root 0 \
     --iters 100 \
     --warmup_iters 50 \
     --agg_iters 1 \
     --average 1 \
     --parallel_init 0 \
     --check 1 \
     --blocking 0 \
     --cudagraph 0 \
     --stepfactor 2
CloudAI is fully configurable via a set of TOML configuration files. You can find examples of these files under conf/common. In this guide, we will use the following configuration files (a layout sketch follows the list):
- myconfig/system.toml - Describes the system configuration.
- myconfig/tests/nccl_test.toml - Describes the test to run.
- myconfig/scenario.toml - Describes the test scenario configuration.
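Put together, the working directory used throughout this guide looks like the sketch below. The myconfig name is just the convention used here; any layout works as long as the paths passed on the command line match:
myconfig/
├── system.toml
├── scenario.toml
└── tests/
    └── nccl_test.toml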
A test definition is a Pydantic model that describes the arguments of a test. Such models should inherit from the TestDefinition class:
class MyTestCmdArgs(CmdArgs):
    an_arg: str
    docker_image_url: str = "nvcr.io/nvidia/pytorch:24.02-py3"


class MyTestDefinition(TestDefinition):
    cmd_args: MyTestCmdArgs
Notice that cmd_args.docker_image_url uses nvcr.io/nvidia/pytorch:24.02-py3, but you can use the Docker image from Step 1 instead.
A custom test definition should be registered so that it handles the relevant Test Configs. For this, the Registry() object is used:
Registry().add_test_definition("MyTest", MyTestDefinition)
Registry().add_test_template("MyTest", MyTest)
Relevant Test Configs should specify test_template_name = "MyTest" to use the custom test definition.
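For illustration, a minimal Test Config using this definition might look like the following sketch. The field names mirror the NCCL test example later in this guide; an_arg and its value are the hypothetical argument defined above, and the Docker image is the one from Step 1:
name = "my_test_example"
description = "Example run of MyTest"
test_template_name = "MyTest"

[cmd_args]
an_arg = "some-value"
docker_image_url = "gitlab-url.com/cloudai/nccl-test"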
The system config describes the system configuration. You can find more examples of system configs under conf/common/system/. Our example is kept small for demonstration purposes. Below is the myconfig/system.toml file:
name = "my-cluster"
scheduler = "slurm"
install_path = "./install"
output_path = "./results"
cache_docker_images_locally = true
default_partition = "<YOUR PARTITION NAME>"
mpi = "pmix"
gpus_per_node = 8
ntasks_per_node = 8
[partitions]
[partitions.<YOUR PARTITION NAME>]
name = "<YOUR PARTITION NAME>"
nodes = ["<nodes-[01-10]>"]
Please replace <YOUR PARTITION NAME> with the name of the partition you want to use. You can find the partition name by running sinfo on the cluster. Replace <nodes-[01-10]> with the node names you want to use.
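For example, sinfo summarizes the partitions and their nodes. The partition and node names below are placeholders, and the exact output format varies by cluster:
$ sinfo -s
PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T)  NODELIST
main*        up   infinite        0/10/0/10  nodes-[01-10]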
Once all configs are ready, it is time to install the test requirements. This is done once, so you can run multiple experiments without reinstalling the requirements. This step requires the system config file from the previous step.
cloudai install \
--system-config myconfig/system.toml \
--tests-dir myconfig/tests/
A Test Config describes a particular test configuration to be run. It is based on a test definition and is used in a Test Scenario. Below is the myconfig/tests/nccl_test.toml file; its definition is based on the built-in NcclTest definition:
name = "nccl_test_all_reduce_single_node"
description = "all_reduce"
test_template_name = "NcclTest"
extra_cmd_args = "--stepfactor 2"
[cmd_args]
"subtest_name" = "all_reduce_perf_mpi"
"ngpus" = "1"
"minbytes" = "8M"
"maxbytes" = "16G"
"iters" = "5"
"warmup_iters" = "3"
You can find more examples under conf/common/test. In a test schema file, you can adjust arguments as shown above. In the cmd_args section, you can provide values other than the defaults for each argument. In extra_cmd_args, you can provide additional arguments that will be appended after the NCCL test command. You can specify additional environment variables in the extra_env_vars section, as sketched below.
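For example, a minimal sketch that adds an environment variable to the NCCL test config above (the variable chosen here is only an illustration):
[extra_env_vars]
"NCCL_DEBUG" = "INFO"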
The Test Scenario uses the Test Config from the previous step. Below is the myconfig/scenario.toml file:
name = "nccl-test"
[[Tests]]
id = "Tests.1"
test_name = "nccl_test_all_reduce_single_node"
time_limit = "00:20:00"
[[Tests]]
id = "Tests.2"
test_name = "nccl_test_all_reduce_single_node"
time_limit = "00:20:00"
[[Tests.dependencies]]
type = "start_post_comp"
id = "Tests.1"
Notes on the test scenario:
- id is a mandatory field and must be unique for each test.
- The test_name specifies a test definition from one of the Test TOML files. Node lists and time limits are optional.
- If needed, nodes should be described as a list of node names as shown in a Slurm system. Alternatively, if groups are defined in the system schema, you can ask CloudAI to allocate a specific number of nodes from a specified partition and group. For example, nodes = ['PARTITION:GROUP:16'] allocates 16 nodes from group GROUP in partition PARTITION.
- There are three types of dependencies: start_post_comp, start_post_init, and end_post_comp. start_post_comp means that the current test should start after a specified delay following the completion of the test it depends on. start_post_init means that the current test should start after the start of the test it depends on. end_post_comp means that the current test should complete after the completion of the test it depends on.
Each dependency is described as a pair: the id of the test it depends on and a delay. The id should match the id set in the test scenario, and the delay is specified in seconds.
To generate NCCL test commands without actually executing them, use the dry-run mode. You can review debug.log (or another file specified with --log-file) to see the commands CloudAI generated. Please note that group node allocations are not currently supported in dry-run mode.
cloudai dry-run \
--test-scenario myconfig/scenario.toml \
--system-config myconfig/system.toml \
--tests-dir myconfig/tests/
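After the dry run, one quick way to pull the generated commands out of the log is a plain grep. This is a generic shell sketch; the log file name follows your --log-file setting, with debug.log as the default:
grep -n -E "srun|sbatch" debug.log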
You can run NCCL test experiments with the following command. Whenever you run CloudAI in the run mode, a new directory named with the timestamp is created under the results directory. In that directory, you can find the results from the test scenario, including stdout and stderr. Once the run completes successfully, you can also find the generated reports under these directories.
cloudai run \
--test-scenario myconfig/scenario.toml \
--system-config myconfig/system.toml \
--tests-dir myconfig/tests/
Once the test scenario is completed, you can generate reports using the following command:
cloudai generate-report \
--test-scenario myconfig/scenario.toml \
--system-config myconfig/system.toml \
--tests-dir myconfig/tests/ \
--result-dir results/2024-06-18_17-40-13/
--result-dir accepts a single scenario run result directory.
In this section, we introduce the concept of the system schema, explain the meaning of each field, and describe how the fields should be used. The system schema is a TOML file that allows users to define a system's configuration.
name = "example-cluster"
scheduler = "slurm"
install_path = "./install"
output_path = "./results"
default_partition = "partition_1"
mpi = "pmix"
gpus_per_node = 8
ntasks_per_node = 8
cache_docker_images_locally = true
[partitions]
[partitions.partition_1]
name = "partition_1"
nodes = ["node-[001-100]"]
[partitions.partition_2]
name = "partition_2"
nodes = ["node-[101-200]"]
[partitions.partition_1.groups]
[partitions.partition_1.groups.group_1]
name = "group_1"
nodes = ["node-[001-025]"]
[partitions.partition_1.groups.group_2]
name = "group_2"
nodes = ["node-[026-050]"]
[partitions.partition_1.groups.group_3]
name = "group_3"
nodes = ["node-[051-075]"]
[partitions.partition_1.groups.group_4]
name = "group_4"
nodes = ["node-[076-100]"]
[global_env_vars]
# NCCL Specific Configurations
NCCL_IB_GID_INDEX = "3"
NCCL_IB_TIMEOUT = "20"
NCCL_IB_QPS_PER_CONNECTION = "4"
# Device Visibility Configuration
MELLANOX_VISIBLE_DEVICES = "0,3,4,5,6,9,10,11"
CUDA_VISIBLE_DEVICES = "0,1,2,3,4,5,6,7"
- name: Specifies the name of the system. Users can choose any name that is convenient for them.
- scheduler: Indicates the type of system. It should be one of the supported types, currently slurm or standalone. slurm refers to a system with the Slurm scheduler, while standalone refers to a single-node system without any slave nodes.
- install_path: Specifies the path where test prerequisites are installed. Docker images are downloaded to this path if the user chooses to cache Docker images.
- output_path: Defines the default path where outputs are stored. Whenever a user runs a test scenario, a new subdirectory will be created under this path.
- default_partition: Specifies the default partition where jobs are scheduled.
- partitions: Describes the available partitions and nodes within those partitions.
- groups: Within the same partition, users can define groups of nodes. This is a logical grouping that does not overlap between groups. The group concept can be used to allocate nodes from specific groups in a test scenario schema.
- mpi: Indicates the Process Management Interface (PMI) implementation to be used for inter-process communication.
- gpus_per_node and ntasks_per_node: These are Slurm arguments passed to the sbatch script and srun.
- cache_docker_images_locally: Specifies whether CloudAI should cache remote Docker images locally during installation. If set to true, CloudAI will cache the Docker images, enabling local access without needing to download them each time a test is run. This approach saves network bandwidth but requires more disk capacity. If set to false, CloudAI lets Slurm download the Docker images as needed when they are not cached locally.
- global_env_vars: Lists the environment variables that will be applied globally whenever tests are run.
A test scenario is a set of tests with specific dependencies between them. A test scenario is described in a TOML schema file. This is an example of a test scenario file:
name = "nccl-test"
[[Tests]]
id = "Tests.1"
test_name = "nccl_test_all_reduce"
num_nodes = "2"
time_limit = "00:20:00"
[[Tests]]
id = "Tests.2"
test_name = "nccl_test_all_gather"
num_nodes = "2"
time_limit = "00:20:00"
[[Tests.dependencies]]
type = "start_post_comp"
id = "Tests.1"
[[Tests]]
id = "Tests.3"
test_name = "nccl_test_reduce_scatter"
num_nodes = "2"
time_limit = "00:20:00"
[[Tests.dependencies]]
type = "start_post_comp"
id = "Tests.2"
The name field is the test scenario name, which can be any unique identifier for the scenario. Each test has a section name, following the convention Tests.1, Tests.2, etc., with an increasing index. The test_name of a test should be specified in its section and must correspond to an entry in the test schema. If a test in a test scenario is not present in the test schema, CloudAI will not be able to identify it.
There are two ways to specify nodes. The first is using the num_nodes field as shown in the example. The second is specifying nodes explicitly, like nodes = ["node-001", "node-002"]. Alternatively, you can utilize the groups feature in the system schema to specify nodes like nodes = ['PARTITION_NAME:GROUP_NAME:NUM_NODES'], which allocates NUM_NODES nodes from the named group in the specified partition. You can also use nodes = ['PARTITION_NAME:GROUP_NAME:max_avail'], which allocates all the available nodes from the named group in the specified partition. The sketch below shows these forms side by side.
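A minimal sketch of these alternatives for a single [[Tests]] entry; the values are placeholders, and only one of the lines should be used in practice:
num_nodes = "2"                               # let the scheduler pick two nodes
# nodes = ["node-001", "node-002"]            # or name the nodes explicitly
# nodes = ['partition_1:group_1:16']          # or take 16 nodes from group_1 in partition_1
# nodes = ['partition_1:group_1:max_avail']   # or take all available nodes from that group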
You can optionally specify a time limit in the Slurm format. Tests can have dependencies. If no dependencies are specified, all tests will run in parallel. CloudAI supports three types of dependencies: start_post_init, start_post_comp, and end_post_comp.
Dependencies of a test are described as a subsection of the test. A dependency requires the other test's id and the dependency type.
- start_post_init means the test starts after the prior test begins, with a specified delay (see the snippet after this list).
- start_post_comp means the test starts after the prior test completes.
- end_post_comp means the test ends when the prior test completes.
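For example, a start_post_init dependency uses the same block structure as the start_post_comp examples above, with only the type changed:
[[Tests.dependencies]]
type = "start_post_init"
id = "Tests.1"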
This section describes how you can download the NeMo datasets on your server. The install mode of CloudAI handles the installation of all test prerequisites, but downloading and installing datasets is not the responsibility of the install mode. This is because large datasets should be installed globally by the administrator and shared among multiple users, even users who do not use CloudAI. For CloudAI users, this section provides a detailed guide to downloading and installing the NeMo datasets.
By default, the NeMo launcher uses mock datasets for testing purposes. If you want to run tests using real datasets, you must download the datasets and update the test .toml files accordingly to locate the datasets and provide the appropriate prefixes. To understand the datasets available in the NeMo framework, refer to the Data Preparation section of the NeMo documentation. According to that document, you can download and use the Pile dataset, and it provides detailed instructions on how to download datasets for various platforms. Let's assume that we have a Slurm cluster.
You can download the datasets with the following command:
$ git clone https://github.com/NVIDIA/NeMo-Framework-Launcher.git
$ cd NeMo-Framework-Launcher
$ python3 launcher_scripts/main.py \
    container=nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.11 \
    stages=["data_preparation"] \
    launcher_scripts_path=$PWD/launcher_scripts \
    base_results_dir=$PWD/result \
    env_vars.TRANSFORMERS_OFFLINE=0 \
    data_dir=directory_path_to_download_dataset \
    data_preparation.run.time_limit="96:00:00"
Once you submit a NeMo job with the data preparation stage, you should be able to see the data-download jobs with the squeue command. If the jobs do not appear, please review the log files under $PWD/result. If you want to download the full Pile dataset, you should have at least 1 TB of free space in the download directory, because the Pile dataset is about 800 GB. By default, NeMo reads the configuration file under conf/config.yaml:
defaults:
- data_preparation: baichuan2/download_baichuan2_pile
stages:
- data_preparation
Because the data_preparation entry points to baichuan2/download_baichuan2_pile, NeMo will read that YAML file:
run:
  name: download_baichuan2_pile
  results_dir: ${base_results_dir}/${.name}
  time_limit: "4:00:00"
  dependency: "singleton"
  node_array_size: 30
  array: ${..file_numbers}
  bcp_preproc_npernode: 2 # 2 should be safe to use and x2 times faster.

dataset: pile
download_the_pile: True # Whether to download the pile dataset from the internet.
the_pile_url: "https://huggingface.co/datasets/monology/pile-uncopyrighted/resolve/main/train/" # Source URL to download The Pile dataset from.
file_numbers: "0-29" # The pile dataset consists of 30 files (0-29), choose which ones to download.
preprocess_data: True # True to preprocess the data from a jsonl file, False otherwise.
download_tokenizer_url: "https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/blob/main/tokenizer.model"
tokenizer_library: "sentencepiece"
tokenizer_save_dir: ${data_dir}/baichuan2
tokenizer_model: ${.tokenizer_save_dir}/baichuan2_tokenizer.model
rm_downloaded: False # Extract script will remove downloaded zst after extraction
rm_extracted: False # Preprocess script will remove extracted files after preprocessing.
You can update the fields to adjust the behavior. For example, you can update the file_numbers field to adjust the number of dataset files to download. This will allow you to save disk space.
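For instance, to download only the first five of the 30 files, adjust that field in the YAML above:
file_numbers: "0-4" # download only files 0 through 4 instead of all 30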
- Go to https://huggingface.co/docs/transformers/en/model_doc/llama#usage-tips.
- Follow the instructions under 'Usage Tips' on how to download the tokenizer.
- Replace "training.model.tokenizer.model=TOKENIZER_MODEL" with "training.model.tokenizer.model=YOUR_TOKENIZER_PATH" (the tokenizer should be a .model file) in conf/common/test/llama.toml.
In this section, we will guide you through identifying the root cause of issues and determining whether they stem from system infrastructure or a bug in CloudAI. Users should closely follow the USER_GUIDE.md and README.md for installation, tests, and test scenarios.
If you encounter issues running a command, start by reading the error message to understand the root cause. We strive to make our error messages and exception messages as readable and interpretable as possible.
To determine whether an issue is due to system infrastructure or a CloudAI bug, follow these steps:
- Check stdout Messages: If CloudAI fails to run a test successfully, the stdout messages will indicate that a test has failed.
- Review Log Files: Navigate to the output directory and review debug.log, stdout, and stderr files. debug.log contains detailed steps executed by CloudAI, including generated commands, executed commands, and error messages.
- Analyze Error Messages: By examining the error messages in the log files, you can understand the type of errors CloudAI encountered.
- Examine the Output Directory: If a test fails without explicit error messages, review the output directory of the failed test. Look for stdout.txt, stderr.txt, or any generated files to understand the failure reason.
- Manually Rerun the Test: Consult debug.log for the command CloudAI executed, look for an sbatch command with a generated sbatch script, and execute the command manually to debug further (see the sketch after this list).
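For example, a generic shell sketch of that manual rerun; the script path here is hypothetical, so use the sbatch command and script path that actually appear in your debug.log:
grep sbatch debug.log
sbatch results/<run_dir>/cloudai_sbatch_script.sh   # hypothetical path taken from debug.log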
If the problem persists, please report the issue at https://github.com/NVIDIA/cloudai/issues/new/choose. When you report an issue, ensure it is reproducible. Follow the issue template and provide any necessary details, such as the commit hash used, system settings, any changes in the schema files, and the command you ran.
In addition to the general troubleshooting steps, this section provides specific troubleshooting guides for each test used in CloudAI. These guides help you identify and resolve issues unique to each template.
- If your run is not successful, please review the stderr and stdout files generated under the results directory. Within the output directory, locate the run directory, and under the run directory, you will find stderr files like log-nemo-megatron-run_[job_id].err. Please review these files for any meaningful error messages.
- Trying the CloudAI-generated NeMo launcher command can be helpful as well. You can find the executed command in your stdout and in your log file (debug.log) in your current working directory. Review and run the command, and you can modify the arguments to troubleshoot the issue.
If an error occurs, follow these steps sequentially:
- Read the Error Messages: Begin by reading the error messages printed by CloudAI. We strive to make our error messages clear and informative, so they are a good starting point for troubleshooting.
- Review profile_stderr.txt: JaxToolbox operates in two stages, the profiling phase and the actual run phase. We follow the PGLE workflow as described in the PGLE workflow documentation. All stderr and stdout messages from the profiling phase are stored in profile_stderr.txt. If the profiling stage fails, you should find relevant error messages in this file. Attempt to understand the cause of the error from these messages.
- Check the Actual Run Phase: If the profiling stage completes successfully, CloudAI moves on to the actual run phase. The actual run generates stdout and stderr messages in separate files for each rank. Review these files to diagnose any issues during this phase.
- DEADLINE_EXCEEDED:
  - When running JaxToolbox on multiple nodes, the nodes must be able to communicate to execute a training job collaboratively. The DEADLINE_EXCEEDED error indicates a failure in the connection during the initialization stage (a quick connectivity check is sketched after this list). Potential causes include:
    - Hostname resolution failure by the slave nodes.
    - The port opened by the master node is not accessible by other nodes.
    - Network interface malfunctions.
    - A significant time gap in the initialization phase among nodes. If one node starts early while others are still loading the Docker image, this error can occur. This can happen when a Docker image is not locally cached and all nodes try to download it from a remote registry without sufficient network bandwidth. The resulting difference in initialization times can lead to a timeout on some nodes.
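To help rule out the first three causes, a quick manual check can be run outside CloudAI. This is a generic Slurm sketch; the partition and node names are placeholders:
# Confirm two nodes can be allocated and report their hostnames
srun --partition=<YOUR PARTITION NAME> --nodes=2 --ntasks-per-node=1 hostname
# From one node, check that another node's hostname resolves and responds
srun --partition=<YOUR PARTITION NAME> --nodelist=node-001 --nodes=1 ping -c 3 node-002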