NeMo 2.0 support #293

Merged
merged 14 commits into NVIDIA:main on Nov 20, 2024

Conversation

@TaekyungHeo (Member) commented Oct 29, 2024

Summary

This PR introduces support for NeMo 2.0 in CloudAI. Initially, we planned to dump Fiddle configurations to a file and load them in NeMo-Run. However, I changed the approach to use NeMo-Run directly to execute a model. Marc Romeyn informed me that a task can be run with a recipe without generating an sbatch script, using what NeMo-Run calls a "direct executor". To run NeMo 2.0, you can use the following command:

$ srun -t "60:00" --account=hw_nsw_misc --ntasks-per-node=8 --container-image=nvcr.io/nvidia/nemo:dev --pty nemo llm pretrain -y --factory llama3_8b trainer.max_steps=5 log.ckpt.save_on_train_epoch_end=False log.ckpt.save_last=False
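
For reference, the same run expressed through NeMo-Run's Python API would look roughly like the sketch below. This is a minimal sketch based on NeMo-Run's documented recipe/executor API, not code from this PR; exact names and signatures may differ across versions:

import nemo_run as run
from nemo.collections import llm

# Build the recipe that the CLI factory "llama3_8b" resolves to.
recipe = llm.llama3_8b.pretrain_recipe(name="llama3_8b", num_nodes=1, num_gpus_per_node=8)

# Mirror the CLI overrides from the srun command above.
recipe.trainer.max_steps = 5
recipe.log.ckpt.save_on_train_epoch_end = False
recipe.log.ckpt.save_last = False

# Execute directly (the "direct executor" path), with no sbatch script generated.
run.run(recipe, executor=run.LocalExecutor(ntasks_per_node=8, launcher="torchrun"))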

Test Plan

  1. CI passes
  2. Ran on a server
$ cloudai run --system-config ~/cloudaix/conf/common/system/eos.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/nemo_run_llama3_8b.toml                                 

/home/theo/scratch/miniconda3/envs/test4/lib/python3.9/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.20) or chardet (5.2.0)/charset_normalizer (2.0.12) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
[INFO] System Name: EOS
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: nemo_run_llama3_8b
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: nemo_run_llama3_8b

Section Name: nemo_run_llama3_8b
  Test Name: nemo_run_llama3_8b
  Description: nemo_run_llama3_8b
  No dependencies
[INFO] Initializing Runner [RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test scenario execution.
[INFO] Starting test: nemo_run_llama3_8b
[INFO] Running test: nemo_run_llama3_8b

$ cd results/nemo_run_llama3_8b_2024-11-15_10-16-03/nemo_run_llama3_8b/0
$ tail stdout.txt 
        module.decoder.layers.0.self_attention.linear_proj.weight
        module.decoder.layers.0.self_attention.linear_qkv.layer_norm_weight
    Params for bucket 98 (206045184 elements):
        module.embedding.word_embeddings.weight
[NeMo I 2024-11-15 10:22:04 utils:259] Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=0.0003, min_lr=None, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, params_dtype=torch.bfloat16, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-05, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_grad_reduce=False, overlap_param_gather=False, overlap_param_gather_with_optimizer_step=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=False, timers=None, config_logger_dir='')
Training epoch 0, iteration 0/4 | lr: 1.499e-07 | consumed_samples: 512 | global_batch_size: 512 | global_step: 0 | reduced_train_loss: 11.03 | train_step_timing in s: 61.94
Training epoch 0, iteration 1/4 | lr: 2.999e-07 | consumed_samples: 1024 | global_batch_size: 512 | global_step: 1 | reduced_train_loss: 11.03 | train_step_timing in s: 53.67
Training epoch 0, iteration 2/4 | lr: 4.498e-07 | consumed_samples: 1536 | global_batch_size: 512 | global_step: 2 | reduced_train_loss: 11.03 | train_step_timing in s: 52.45
Training epoch 0, iteration 3/4 | lr: 5.997e-07 | consumed_samples: 2048 | global_batch_size: 512 | global_step: 3 | reduced_train_loss: 11.03 | train_step_timing in s: 52.54
Training epoch 0, iteration 4/4 | lr: 7.496e-07 | consumed_samples: 2560 | global_batch_size: 512 | global_step: 4 | reduced_train_loss: 11.03 | train_step_timing in s: 53.16
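
The steady-state step time of roughly 52-53 s, together with the logged global batch size, gives a quick throughput sanity check (a trivial computation on the log values above, not part of the PR):

# Throughput implied by the training log above.
global_batch_size = 512   # samples per step, from the log
step_time_s = 52.5        # typical train_step_timing after the first (warm-up) step
print(f"{global_batch_size / step_time_s:.1f} samples/s")  # ~9.8 samples/s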

@TaekyungHeo added the feature and Jan25 (Jan'25 release feature) labels Oct 29, 2024
@TaekyungHeo changed the title from NeMo 2.0 to NeMo 2.0 Support Oct 29, 2024
@TaekyungHeo changed the title from NeMo 2.0 Support to NeMo 2.0 support Oct 29, 2024
@TaekyungHeo force-pushed the nemo2.0 branch 21 times, most recently from e3ca13b to 68025cc on November 15, 2024
@TaekyungHeo marked this pull request as ready for review on November 15, 2024
@amaslenn (Contributor) left a comment

Please add a new case to test_acceptance.

@TaekyungHeo (Member, Author) commented Nov 18, 2024

Design Discussion (Nov 18th, 2024)

  • Srivatsan - Let's see if you can support more models with this PR; additional complexities may arise.

@srivatsankrishnan (Contributor) left a comment

Based on the conversation in the last call, the direct-executor method (calling srun directly, without generating an sbatch script) is what we are calling NeMo 2.0 support.

#293 (comment)

  1. Can we ensure this works with test hooks? (@TaekyungHeo, if I recall, you mentioned this should be simpler with the NeMo 2.0 integration in CloudAI.) If yes, before calling NeMo 2.0 support complete, can we have example configurations that are also tested with test hooks? This could be a different PR, but I feel it should be there.

  2. If the direct executor is going to be a generic feature in NeMo 2.0, can we test it with other models to ensure this simpler interface holds across different models? Zsolt seems to be running more complex models via NeMo 2.0. Can we keep those models on the radar and ensure this approach works for them as well?

(If I recall, both @TaekyungHeo and @amaslenn mentioned they are okay with this PR as is, and that any future PR should address these points.)

So I will approve this PR, but the points above should be addressed before we can call NeMo 2.0 support complete, IMO.

cc: @srinivas212

@TaekyungHeo (Member, Author) commented

Thanks, @srivatsankrishnan.

  1. PR #305 (Update test_acceptance to handle pre-test and non-pre-test cases for nemo-run) shows how the pre-test works with NeMo-Run. Please check it.
  2. This PR shows that the direct executor works for NeMo 2.0. However, it is hard to claim that we support every NeMo 2.0 model with this PR; some models may need additional arguments or mount points. Still, this is a valid starting point, and we can claim that the NeMo 2.0 POC is ready. When we added NeMo 1.0 support, we did not support all models in the first PR either: that PR introduced the idea with a single model, and we gradually improved and refactored the code as needed. We can take the same approach here, understanding that refactoring or additional changes may be required; a rough sketch of what a larger model could look like follows below.
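
As a rough illustration of point 2, supporting a larger model would ideally only swap the recipe, with any extra resources expressed on the executor. This is a hypothetical sketch: the recipe and executor field names follow NeMo-Run's documented API, but the account, image, and mount values are illustrative, not taken from this PR, and some required fields (e.g. a tunnel for remote clusters) are omitted:

import nemo_run as run
from nemo.collections import llm

# Hypothetical: same flow, bigger recipe.
recipe = llm.llama3_70b.pretrain_recipe(name="llama3_70b", num_nodes=8, num_gpus_per_node=8)

executor = run.SlurmExecutor(
    account="hw_nsw_misc",                    # reused from the srun command above
    nodes=8,
    ntasks_per_node=8,
    time="60:00",
    container_image="nvcr.io/nvidia/nemo:dev",
    container_mounts=["/lustre/data:/data"],  # hypothetical extra mount a model might need
)
run.run(recipe, executor=executor)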

@TaekyungHeo TaekyungHeo merged commit 5c3fd22 into NVIDIA:main Nov 20, 2024
2 checks passed