Updated Megatron version #85

Draft
wants to merge 428 commits into
base: nvidia_main
428 commits
c7d0fb1
Include module parameters in default sharded_state_dict
mikolajblaz Jan 12, 2024
7bcb2e1
Integrate one-logger api for E2E app metrics tracking
PytLab Jan 15, 2024
97d9a50
Set --enable-onelogger action to 'store_true'
PytLab Jan 15, 2024
6e7ded3
Merge branch 'sliding_window_attention/akoumparouli' into 'main'
jaredcasper Jan 16, 2024
46ca3db
Refactor DistributedOptimizer for MoE model support
shjwudp Jan 17, 2024
d657a3e
Merge branch 'distopt_with_moe' into 'main'
deepakn94 Jan 17, 2024
6083743
Run black on megatron/optimizer
deepakn94 Jan 17, 2024
17545b3
Remove hardcoded data cache path
PytLab Jan 18, 2024
6c0e7a9
Change --enable-onelogger to --enable-one-logger for consistent naming
PytLab Jan 18, 2024
bf9c0a1
Add ImportError catch for one_logger
PytLab Jan 18, 2024
85c4034
Add message on how to install one_logger
PytLab Jan 18, 2024
54de98d
Better code formatting
PytLab Jan 18, 2024
909bda3
Fixed merge conflicts
Jan 18, 2024
3c44fb9
add is_first_microbatch for TE
jiemingz Jan 10, 2024
27879a7
add arg name
jiemingz Jan 10, 2024
7dc2ee8
add docstring and move set_is_first_microbatch
jiemingz Jan 12, 2024
3e19c76
Fixed formatting
Jan 18, 2024
bed60a8
Merge branch 'jiemingz/is_first_microbatch' into 'main'
jaredcasper Jan 19, 2024
cf1a1c6
fix a bug in branch and format
Jan 19, 2024
036605d
Merge branch 'main' into fuse_rope_swiglu_main
Jan 19, 2024
568da5a
fix tests
Jan 19, 2024
140642c
Merge branch megatron-lm:main into atomic_gemm_switch
sanandaraj5597 Jan 19, 2024
de9428a
enable swiglu and rope fusion by default and disable them in tests
Jan 19, 2024
599f558
Merge branch 'atomic_gemm_switch' into 'main'
jaredcasper Jan 19, 2024
ca8a00a
Merge branch 'mblaz/dist-ckpt-layernorms' into 'main'
jaredcasper Jan 19, 2024
79269fa
Docstring removed for context config
Jan 19, 2024
4b05862
Decoupled cpu offloading and SplitAlongDim imports
Jan 19, 2024
a5165ac
Merge branch 'cpu_offload' into 'main'
jaredcasper Jan 19, 2024
640af6b
Merge branch 'fuse_rope_swiglu_main' into 'main'
jaredcasper Jan 19, 2024
473225f
Add jit_fuser to switch between torch.jit.script and torch.compile
Jan 19, 2024
de4028a
Merge branch 'jaeminc/mcore-jit' into 'main'
jaredcasper Jan 19, 2024
716204e
misc
jlamypoirier Jan 19, 2024
8c2cd99
Merge branch 'black_on_optimizer' into 'main'
jaredcasper Jan 20, 2024
c795038
Router and communication refactoring.
yanring Dec 14, 2023
2016969
Add Z-loss and aux loss. Code cleanup.
yanring Dec 15, 2023
9b5cd88
Code clean.
yanring Dec 18, 2023
dc436f2
Add top-k router and documentation.
yanring Dec 18, 2023
a98c5ba
Add UT. Fix top-k >1 when EP is off.
yanring Dec 26, 2023
0f80408
Noramlize the token scores.
yanring Dec 26, 2023
de37485
Code clean.
yanring Dec 26, 2023
8efc8de
Fix moe aux loss.
yanring Dec 26, 2023
15e75b0
Fix UTs; Fix MoE Loss.
yanring Dec 28, 2023
dd0411b
Add Z loss UT.
yanring Dec 28, 2023
bfb7bbd
Add documentation.
yanring Jan 2, 2024
b506152
Add typing check.
yanring Jan 2, 2024
411bc27
Update CI.
yanring Jan 3, 2024
1ab146c
Fix grouped gemm UT.
yanring Jan 4, 2024
6d702cb
Compatible with previous MoE checkpoints.
yanring Jan 5, 2024
c656553
Fix Z Loss.
yanring Jan 7, 2024
8b41c9f
Merge the Sinkhorn and top-k routing.
yanring Jan 7, 2024
196b911
Update CI golden values.
yanring Jan 7, 2024
3ff8c7f
Swap topk and softmax.
yanring Jan 10, 2024
1ce5712
Update CI after rebasing.
yanring Jan 11, 2024
09accc8
Fix loss scale documentation and remove unused code
yanring Jan 15, 2024
5d0dbd3
Rename base_moe_layer.py to router.py
yanring Jan 15, 2024
a003610
Fix review comments.
yanring Jan 17, 2024
e2d3e4f
Renaming.
yanring Jan 19, 2024
b616497
Renaming.
yanring Jan 19, 2024
2038324
Move dispatcher and experts.
yanring Jan 20, 2024
eb47d69
Update CI golden value.
yanring Jan 20, 2024
3da7d1d
Rename to token_permutation and SequentialMLP.
yanring Jan 20, 2024
2afee76
Code clean.
yanring Jan 21, 2024
aed469f
Fix CI, Code clean and add readme.
yanring Jan 22, 2024
f1b6c96
Add input jitter.
yanring Jan 22, 2024
f24abd1
Moved offloading configs to Model parallel config from TF config
Jan 22, 2024
288134e
Fixed formatting and imports
Jan 22, 2024
1872385
Update retro doc
boxin-wbx Jan 22, 2024
8fb44df
Log progress (iterations, floating-point operations, tokens) to progr…
deepakn94 Dec 1, 2023
781d86a
Hide progress logging behind a command-line argument
deepakn94 Jan 22, 2024
be8011a
Merge branch 'progress' into 'main'
deepakn94 Jan 22, 2024
b03eae3
Updated CI value after removing kaiming_init.
yanring Jan 23, 2024
d2e5f78
Add one_logger commandline arguments
PytLab Jan 23, 2024
62a5a3e
Remove one_logger config file
PytLab Jan 23, 2024
49727de
Hardcode train_iterations_warmup to 5
PytLab Jan 23, 2024
0cb693a
Add clarification for internal one_logger
PytLab Jan 23, 2024
ae1cd89
Fix SwiGLU for input dimension 2 after rebased main.
yanring Jan 23, 2024
ebb1484
Update retro doc following the suggestion of Wei and Lawrence
boxin-wbx Jan 23, 2024
7298d15
Add distributed optimizer tests with --overlap-param-gather (and corr…
deepakn94 Jan 20, 2024
33111c9
Fix bug causing issues with fp16 and --overlap-param-gather by disabl…
deepakn94 Jan 20, 2024
f634cca
Add softmax for sinkhorn when k > 1.
yanring Jan 24, 2024
75120db
Merge branch 'fp16_overlap_param_gather' into 'main'
deepakn94 Jan 24, 2024
9e773fa
Change default value of --one-logger-run-name to None
PytLab Jan 24, 2024
95b2146
Packed Sequence
cuichenx Jan 24, 2024
773ad0f
Merge branch 'chcui/packed_seq_from_fuse_rope_swiglu_main' into 'main'
jaredcasper Jan 24, 2024
2c3468a
Merge branch 'documentation' into 'main'
deepakn94 Jan 24, 2024
51e936c
Merge branch megatron-lm:main into offload_patch
sanandaraj5597 Jan 24, 2024
83c0423
Add replica_id field to factories
mikolajblaz Jan 5, 2024
00358e5
Implement sharded_state_dict for SwitchMLP
mikolajblaz Jan 4, 2024
431ce99
Handle MoE with GeLU
mikolajblaz Jan 5, 2024
e2fd6ca
Add __init__ to resolve test name clash
mikolajblaz Jan 18, 2024
bd6f4ea
Merge branch 'boxin/retro-doc-fix' into 'main'
jaredcasper Jan 24, 2024
1e0e58e
Merge branch 'main' into compare_tensors_updated
jlamypoirier Jan 24, 2024
472d54e
Only print warning about fused rotary position embedding once.
jaredcasper Jan 24, 2024
98fbb42
Fix
jlamypoirier Jan 24, 2024
37e7dac
Merge branch 'fused-warning-fix' into 'main'
jaredcasper Jan 24, 2024
c4678ff
Update s_app_tag with {job_name}_{batch_size}_{gpu_req}
PytLab Jan 25, 2024
817b431
Merge branch 'offload_patch' into 'main'
jaredcasper Jan 25, 2024
de859b3
Log metrics in consistent order
PytLab Jan 25, 2024
7027a1d
Add app_tag_count tracking
PytLab Jan 25, 2024
a72388d
Merge branch 'feature/add-e2e-metrics-logging' of ssh://gitlab-master…
Jan 25, 2024
8344203
Resolve merging conflict
Jan 25, 2024
7af41ab
Use app tag logging wrapper api
PytLab Jan 25, 2024
e713cd7
Remove app_tag global var
PytLab Jan 25, 2024
9603e1f
Merge branch 'main' into feature/add-e2e-metrics-logging
PytLab Jan 25, 2024
fdafcc5
Add doc
mikolajblaz Jan 25, 2024
c40c047
Add no support info
mikolajblaz Jan 25, 2024
e25970f
Adding bert local spec test
Jan 25, 2024
2b0decc
Adding bert local spec test
Jan 25, 2024
559e82c
Merge branch 'zijiey/moe_api_clean' into 'main'
jaredcasper Jan 25, 2024
e6ef9ea
Adding bert local spec test
Jan 25, 2024
c2d44ff
Adding bert local spec test
Jan 26, 2024
fc316ff
Adding bert local spec test
Jan 26, 2024
8578800
update `apply_rope_fusion` in config after checking availability
cuichenx Jan 26, 2024
6e599dc
Adding bert local spec test
Jan 26, 2024
1e95136
add unit tests
cuichenx Jan 26, 2024
5c10cb4
Use new memory_efficient argument to fused layernorm functions when a…
jaredcasper Jan 24, 2024
4a08560
Add `num_floating_point_operations_so_far` arg to save_checkpoint cal…
mathemakitten Jan 26, 2024
3709708
Merge branch 'hn-save-checkpoint' into 'main'
jaredcasper Jan 26, 2024
88ddc36
Fixing the nightly ci for #1018.
yanring Jan 26, 2024
f5c5388
Merge branch 'zijie/fix_1018_nightly_tests' into 'main'
jaredcasper Jan 26, 2024
5cce2b5
Move e2e metrics tracking before training_log call
PytLab Jan 26, 2024
04d7b19
Merge branch 'main' into mblaz/moe-0.5-dist-ckpt
mikolajblaz Jan 26, 2024
1fc103f
formatting
cuichenx Jan 26, 2024
16e6e9b
typo
cuichenx Jan 26, 2024
3df96f1
Add _CPU_EXPERT_MODEL_PARALLEL_WORLD_SIZE flag in parallel-state to a…
akoumpa Jan 26, 2024
5cfe7b8
Merge branch 'akoumparouli/expert_model_parallel_world_size_setter' i…
ericharper Jan 26, 2024
567fab7
Fix formatting
shanmugamr1992 Jan 26, 2024
f2a49ba
Merge branch 'layernorm-apex-update' into 'main'
jaredcasper Jan 26, 2024
195171f
Merge branch 'chcui/fix_rope_fusion_config' into 'main'
jaredcasper Jan 26, 2024
8d8241a
Support for raw and mock datasets
Jan 26, 2024
803a018
Merge branch 'raw-dataset' into 'main'
jaredcasper Jan 26, 2024
4223649
Merge branch 'main' into mblaz/moe-0.5-dist-ckpt
mikolajblaz Jan 29, 2024
eaaf92f
Adding bert local spec test
Jan 29, 2024
a4b5a9e
Fix `qkv_format` in TEDotProductAttention
cuichenx Jan 30, 2024
83bb191
Merge branch 'chcui/fix_rope_fusion_config' into 'main'
ericharper Jan 30, 2024
25a9946
Add support for masked WordPiece datasets BERT and T5
Jan 30, 2024
8312a3e
Merge branch 'masked-datasets' into 'main'
ericharper Jan 30, 2024
e2ff3e6
Remove config file and hardcoded cache path
PytLab Jan 30, 2024
05342e7
Merge branch 'mblaz/moe-0.5-dist-ckpt' into 'main'
jaredcasper Jan 30, 2024
329baac
Merge branch 'main' into 'local_spec_bert'
shanmugamr1992 Jan 30, 2024
eef48ef
Fix the case when none token is allocated for local expert(s) with EP>1.
fanshiqing Jan 30, 2024
9f92da0
Merge branch 'moe_gmm_corner_case_fixw' into 'main'
ericharper Jan 30, 2024
0bfeeae
rename output layer
maxmatical Jan 30, 2024
a45805a
Generate causal mask for local layer spec
janekl Jan 30, 2024
d972605
Merge branch 'jlasek/generate_causal_mask_in_mcore' into 'main'
ericharper Jan 30, 2024
918d415
Update minor version
ericharper Jan 30, 2024
34c874e
Merge branch 'update_minor_version' into 'main'
jaredcasper Jan 30, 2024
bb53cf9
Merge pull request #3 from ServiceNow/max/rename-output-layer
maxmatical Jan 30, 2024
eeb1b21
use TE checkpointing when FP8
jiemingz Jan 30, 2024
530239b
Merge branch megatron-lm:main into fp8_recompute
jiemingz Jan 31, 2024
4bd4e74
Merge branch 'local_spec_bert' into 'main'
jaredcasper Jan 31, 2024
f8b277a
Remove unused hashlib
PytLab Jan 31, 2024
0fcbff0
Move grad-scale to loss.device
akoumpa Jan 30, 2024
ea52266
Merge branch 'feature/add-e2e-metrics-logging' into 'main'
jaredcasper Jan 31, 2024
c3d057f
code clean for moe.
fanshiqing Feb 1, 2024
a1ba50f
update readme.
fanshiqing Feb 1, 2024
2ee86c5
divide the selection_mean by top_k for normalization.
fanshiqing Feb 1, 2024
2e1f869
add license.
fanshiqing Feb 1, 2024
e5102e7
update readme.
fanshiqing Feb 1, 2024
6aad211
JET Migration Updates
maanug-nv Feb 1, 2024
3d201d7
Merge branch 'maanug/jet-recipes' into 'main'
jaredcasper Feb 1, 2024
50f8384
Fixing bugs in inference and adding mcore support
shanmugamr1992 Feb 1, 2024
7329f73
Fixing bugs in inference and adding mcore support
shanmugamr1992 Feb 1, 2024
376337d
Fixing bugs in inference and adding mcore support
shanmugamr1992 Feb 1, 2024
cb995d5
Merge branch 'fp8_recompute' into 'main'
jaredcasper Feb 1, 2024
d91c5a6
Fixing bugs in inference and adding mcore support
shanmugamr1992 Feb 1, 2024
7628c3a
Merge branch 'akoumparouli/loss_scale_fix' into 'main'
jaredcasper Feb 1, 2024
075d5b0
rename test_switch_mlp to test_sequential_mlp
fanshiqing Feb 2, 2024
680b67c
Move Megatron timer to core
Feb 2, 2024
8b691b9
Merge branch 'abhandare_timer' into 'main'
ericharper Feb 2, 2024
b87f069
Merge branch 'inference_fix' into 'main'
jaredcasper Feb 2, 2024
259f06e
Merge branch 'code_clean' into 'main'
jaredcasper Feb 2, 2024
aa96ab7
JET fix: Migrate tests and run functional results always not on success
maanug-nv Feb 3, 2024
3e1a635
Merge branch 'maanug/jet-hotfix' into 'main'
maanug-nv Feb 3, 2024
f89f388
MoE argument sanity checks
akoumpa Feb 6, 2024
487ba73
Merge branch 'akoumparouli/arg_sanity_check' into 'main'
ericharper Feb 6, 2024
f6995e5
add add_qkv_bias config
Feb 6, 2024
02d284d
Merge branch 'xueh/add_qkv_bias' into 'main'
jaredcasper Feb 6, 2024
c8f50b4
Minor fixes for JET CI
maanug-nv Feb 6, 2024
7c1dd65
Merge branch 'maanug/jet-minor-fixes' into 'main'
jaredcasper Feb 6, 2024
9760e11
Tokenizer fix
jlamypoirier Feb 6, 2024
94ce57b
Merge remote-tracking branch 'nvidia/main' into compare_tensors_updated
jlamypoirier Feb 6, 2024
bb235cc
Check if config has num_moe_experts
akoumpa Feb 6, 2024
b02e62e
Merge branch 'akoumparouli/moe_config_check' into 'main'
jaredcasper Feb 6, 2024
548e57a
Add dist ckpt package docs for Sphinx documentation
mikolajblaz Feb 6, 2024
240a8ef
Merge branch 'mblaz/dist-ckpt-docs' into 'main'
jaredcasper Feb 6, 2024
960c06b
Fix oob perf
wdykas Feb 6, 2024
1390944
Merge branch 'fix-oob-perf' into 'main'
jaredcasper Feb 6, 2024
260c4f2
Add interleaved rotary embedding in MCore
Feb 6, 2024
6d6f9af
Merge branch 'xueh/rotary_interleaved' into 'main'
jaredcasper Feb 6, 2024
6fdbfa7
fix activation checkpointing mutation
gshennvm Feb 6, 2024
169bfa4
Merge branch 'geshen/fix_activation_mutation' into 'main'
ericharper Feb 6, 2024
b22634d
fix
jlamypoirier Feb 6, 2024
2165919
Better wandb
jlamypoirier Feb 7, 2024
c478f48
misc
jlamypoirier Feb 7, 2024
b6ce193
[MoE] fix the convergence issue when EP>1 and K>1
yanring Feb 7, 2024
98da379
Merge branch 'zijiey/fix_top2_dispatcher' into 'main'
ericharper Feb 7, 2024
84c7af2
Use view() to set param_buffer from grad_buffer
wangxicoding Dec 26, 2023
2fb398c
Add missing num_floating_point_operations_so_far argument to save_che…
deepakn94 Feb 7, 2024
0f0279a
Merge branch 'save_checkpoint_fix' into 'main'
jaredcasper Feb 7, 2024
0052bf0
Merge branch 'fix_param_buffer_peak_memory' into 'main'
jaredcasper Feb 7, 2024
6e25554
Adding back the changes needed in timers.py for E2E work
Feb 9, 2024
a8182ee
Fixed atomic gemm defaults/fixed the offloading check
Feb 10, 2024
daf0006
Put embedding layers in separate buckets to make sure embedding tying…
deepakn94 Jan 28, 2024
a73b113
Ran black(19.10b0) on megatron/core
Feb 12, 2024
2482a4a
Use MCore for distributed optimizer tests
deepakn94 Feb 9, 2024
5566742
Merge branch 'main' into compare_tensors_updated
jlamypoirier Feb 13, 2024
9e17a15
Condition TE init_method on config.perform_initialization.
lmcafee-nvidia Feb 13, 2024
55f3502
Merge branch 'lmcafee/te-noinit-fix' into 'main'
jaredcasper Feb 13, 2024
32f9155
Move optimizers to MCore
deepakn94 Feb 13, 2024
eedfe53
Merge branch 'dist_optimizer_to_mcore' into 'main'
ericharper Feb 13, 2024
db2040f
Merge branch 'tied_embeddings' into 'main'
deepakn94 Feb 13, 2024
6f3d5a4
Merge branch 'Add_back_Timer_Code_changes_for_E2E' into 'main'
jaredcasper Feb 14, 2024
5b4bbd5
add support wrapper for TE TransformerLayer in mcore
sudhakarsingh27 Feb 14, 2024
5f9c870
Merge branch 'te_transformer_layer_wrapper_in_mcore' into 'main'
jaredcasper Feb 14, 2024
1b6ae27
Fixing examples
Feb 15, 2024
4ec7835
Merge branch 'bugfixexample' into 'main'
jaredcasper Feb 15, 2024
72a255a
[MoE] Expert data parallel w/ ZeRO-1 support
shjwudp Feb 21, 2024
90568ae
Merge branch 'edp_with_zero1' into 'main'
ericharper Feb 21, 2024
528d7cf
Merge branch 'config_default' into 'main'
jaredcasper Feb 22, 2024
a67ffda
Make sure data_end_index is padded when creating new buckets
deepakn94 Feb 16, 2024
5afa5da
Mcore CLIP ViT model
trintamaki Feb 24, 2024
6d14c7e
Merge branch 'trintamaki/clip-vit-model' into 'main'
ericharper Feb 24, 2024
ad53b1e
Merge branch 'dist_optimizer_bugfix' into 'main'
deepakn94 Feb 24, 2024
9530e19
Print number of transformer and embedding parameters separately
deepakn94 Feb 26, 2024
5f1f813
Unify resume and correctness functional tests
mikolajblaz Feb 27, 2024
70e469d
Merge branch 'mblaz/unify-resume-and-correctness-func-tests' into 'main'
maanug-nv Feb 27, 2024
1fcdc95
Mcore mock multimodal dataset
trintamaki Feb 27, 2024
1dada7e
Merge branch 'trintamaki/dummy-multimodal-dataset' into 'main'
jaredcasper Feb 27, 2024
d668077
Fix NaN checking in grads: should be performed before data-parallel c…
deepakn94 Dec 5, 2023
53a350e
Merge branch 'check_nan_in_grad' into 'main'
deepakn94 Feb 28, 2024
9677b3b
Make throughput and memory footprint formulae compatible with arbitra…
deepakn94 Feb 29, 2024
3dafc0e
Move to Draco OCI
maanug-nv Feb 29, 2024
17c487a
Merge branch 'maanug/jet-oci' into 'main'
maanug-nv Feb 29, 2024
3b0fcd1
Merge branch 'theoretical_memory_fix' into 'main'
jaredcasper Mar 1, 2024
7bc3c74
Mcore LLaVA model
trintamaki Mar 1, 2024
d1acce3
Merge branch 'trintamaki/llava-model-mr' into 'main'
jaredcasper Mar 1, 2024
80e180d
[OMNIML-614] AMMO ptq + TensorRT-LLM export examples for megatron-lm
ChenhanYu Mar 1, 2024
36e9b6b
Merge branch 'chenhany/ammo_ptq_example' into 'main'
jaredcasper Mar 1, 2024
0c1e53d
Merge branch 'variable_ffn_size' into 'main'
deepakn94 Mar 3, 2024
47cb630
Experimental Yaml configs
wdykas Mar 5, 2024
8957468
Merge branch 'yaml' into 'main'
jaredcasper Mar 5, 2024
63d9d3e
MOE support
jlamypoirier Mar 8, 2024
40a134a
stuff
jlamypoirier Mar 8, 2024
1a96a99
Merge branch 'main' into compare_tensors_updated
jlamypoirier Mar 8, 2024
fdd668c
Support megatron core models
jlamypoirier Mar 11, 2024
4238a80
Fix arg
jlamypoirier Mar 11, 2024
fe38434
fixes
jlamypoirier Mar 12, 2024
3c6652e
fix
jlamypoirier May 29, 2024
f6b9b4b
fix
jlamypoirier Sep 19, 2024
777 changes: 12 additions & 765 deletions .gitlab-ci.yml

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions MANIFEST.in
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
include megatron/core/requirements.txt
8 changes: 7 additions & 1 deletion README.md
@@ -241,7 +241,7 @@ With full global batch size of 1536 on 1024 A100 GPUs, each iteration takes arou


Retro [(Borgeaud et al., 2022)](https://arxiv.org/abs/2112.04426) is an autoregressive decoder-only language model (LM) pretrained with retrieval-augmentation.
Retro features practical scalibility to support large-scale pretraining from scratch by retrieving from trillions of token.
Retro features practical scalability to support large-scale pretraining from scratch by retrieving from trillions of tokens.
Pretraining with retrieval provides a more efficient storage mechanism of factual knowledge, when compared to storing factual knowledge implicitly within the network's parameters, thus largely reducing model parameters while achieving lower perplexity than standard GPT.
Retro also provides the flexibility to update the
knowledge stored in LMs [(Wang et al., 2023a)](https://arxiv.org/abs/2304.06762)
@@ -519,6 +519,12 @@ The Llama-2 [family of models](https://ai.meta.com/llama/) are an open-source se

The Llama-2 checkpoints can be loaded into Megatron for inference and finetuning. See documentation [here](docs/llama2.md).

# Model Optimization and Deployment
Megatron-Core (MCore) `GPTModel` family supports advanced quantization algorithms and high-performance deployment through TensorRT-LLM.

## Quantization and TensorRT-LLM Deployment
See [Megatron Model Optimization and Deployment](examples/modelopt/README.md) for `llama2` and `nemotron3` examples.

# Datasets
We do not host any datasets for GPT or BERT training, however, we detail their collection so that our results may be reproduced.

38 changes: 24 additions & 14 deletions docs/source/api-guide/dist_checkpointing.rst
@@ -1,6 +1,15 @@
dist\_checkpointing package
===========================

A library for saving and loading the distributed checkpoints.
A "distributed checkpoint" can have various underlying formats (current default format is based on Zarr)
but has a distinctive property - the checkpoint saved in one parallel configuration (tensor/pipeline/data parallelism)
can be loaded in a different parallel configuration.

Using the library requires defining sharded state_dict dictionaries with functions from *mapping* and *optimizer* modules.
Those state dicts can be saved or loaded with a *serialization* module using strategies from *strategies* module.
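
The resharding property described above can be illustrated with a toy sketch. This is plain Python, not the `dist_checkpointing` API: `shard` and `unshard` are hypothetical helpers, and a flat list stands in for a tensor. It only shows why a checkpoint described in terms of global tensors can be saved under one parallel configuration and loaded under another.

```python
def shard(values, num_ranks):
    """Split a flat list into equal contiguous shards, one per rank (hypothetical helper)."""
    if len(values) % num_ranks != 0:
        raise ValueError("length must be divisible by num_ranks")
    size = len(values) // num_ranks
    return [values[r * size:(r + 1) * size] for r in range(num_ranks)]


def unshard(shards):
    """Reassemble the global tensor from its per-rank shards (hypothetical helper)."""
    return [v for s in shards for v in s]


# "Save" with 2-way parallelism, then "load" with 4-way parallelism.
global_weight = list(range(8))
saved = shard(global_weight, 2)
loaded = shard(unshard(saved), 4)
```

Because each shard is defined relative to the global tensor, the 4-way layout can be rebuilt without knowing how the 2-way layout was produced.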


Subpackages
-----------

@@ -12,18 +21,10 @@ Subpackages
Submodules
----------

dist\_checkpointing.core module
-------------------------------

.. automodule:: core.dist_checkpointing.core
:members:
:undoc-members:
:show-inheritance:

dist\_checkpointing.dict\_utils module
--------------------------------------
dist\_checkpointing.serialization module
----------------------------------------

.. automodule:: core.dist_checkpointing.dict_utils
.. automodule:: core.dist_checkpointing.serialization
:members:
:undoc-members:
:show-inheritance:
@@ -44,14 +45,23 @@ dist\_checkpointing.optimizer module
:undoc-members:
:show-inheritance:

dist\_checkpointing.serialization module
----------------------------------------
dist\_checkpointing.core module
-------------------------------

.. automodule:: core.dist_checkpointing.serialization
.. automodule:: core.dist_checkpointing.core
:members:
:undoc-members:
:show-inheritance:

dist\_checkpointing.dict\_utils module
--------------------------------------

.. automodule:: core.dist_checkpointing.dict_utils
:members:
:undoc-members:
:show-inheritance:


dist\_checkpointing.utils module
--------------------------------

5 changes: 5 additions & 0 deletions docs/source/api-guide/dist_checkpointing.strategies.rst
@@ -1,6 +1,11 @@
dist\_checkpointing.strategies package
======================================

Package defining different checkpoint formats (backends) and saving/loading algorithms (strategies).

Strategies can be used for implementing new checkpoint formats or implementing new (more optimal for a given use case) ways of saving/loading of existing formats.
Strategies are passed to `dist_checkpointing.load` and `dist_checkpointing.save` functions and control the actual saving/loading procedure.

Submodules
----------

16 changes: 12 additions & 4 deletions docs/source/api-guide/distributed.rst
@@ -1,6 +1,14 @@
distributed package
===================

This package contains various utilities to finalize model weight gradients
on each rank before the optimizer step. This includes a distributed data
parallelism wrapper to all-reduce or reduce-scatter the gradients across
data-parallel replicas, and a `finalize\_model\_grads` method to
synchronize gradients across different parallelism modes (e.g., 'tied'
layers on different pipeline stages, or gradients for experts in a MoE on
different ranks due to expert parallelism).
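
As a minimal sketch of the data-parallel gradient synchronization mentioned above, the following toy function averages gradients across replicas. It is not Megatron's implementation: `all_reduce_mean` is a hypothetical name, lists stand in for tensors, and averaging (rather than summing) is an assumption made for the example.

```python
def all_reduce_mean(per_rank_grads):
    """Average gradients element-wise across data-parallel replicas.

    Every rank ends up holding the same averaged gradient, which is the
    result of an all-reduce (sum) followed by division by the replica count.
    """
    num_ranks = len(per_rank_grads)
    summed = [sum(vals) for vals in zip(*per_rank_grads)]
    mean = [s / num_ranks for s in summed]
    # Each rank receives its own copy of the reduced result.
    return [list(mean) for _ in range(num_ranks)]


grads = [[1.0, 2.0], [3.0, 6.0]]  # one gradient list per replica
synced = all_reduce_mean(grads)
```

A reduce-scatter differs only in that each rank keeps one slice of the reduced result instead of the whole of it.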

Submodules
----------

@@ -21,10 +29,10 @@ reduce-scatter on each bucket asynchronously.
distributed.finalize\_model\_grads
----------------------------------

Finalize model grads for optimizer step across all used parallelism modes.
Synchronizes the all-reduce / reduce-scatter of model grads across DP replicas,
and all-reduces the layernorm grads for sequence parallelism, embedding grads
across first and last pipeline stages (if not tied), and expert grads for expert
Finalize model gradients for optimizer step across all used parallelism modes.
Synchronizes the all-reduce / reduce-scatter of model gradients across DP replicas,
all-reduces the layernorm gradients for sequence parallelism, embedding gradients
across first and last pipeline stages (if not tied), and expert gradients for expert
parallelism.

.. automodule:: core.distributed.finalize_model_grads
18 changes: 18 additions & 0 deletions docs/source/api-guide/pipeline_parallel.rst
@@ -1,12 +1,22 @@
pipeline\_parallel package
==========================

This package contains implementations for two different pipeline parallelism
schedules (one without interleaving and one with interleaving, see `Efficient
Large-Scale Language Model Training on GPU Clusters Using Megatron-LM <https://arxiv.org/abs/2104.04473>`_
for details), and a default no-pipelining schedule. It also contains methods
for the point-to-point communication that is needed between pipeline stages.

Submodules
----------

pipeline\_parallel.p2p\_communication module
--------------------------------------------

Contains implementations for the various point-to-point communication needed
(e.g., `recv_forward` and `recv_backward`) in the different pipeline parallelism
schedules.

.. automodule:: core.pipeline_parallel.p2p_communication
:members:
:undoc-members:
@@ -15,6 +25,14 @@ pipeline\_parallel.p2p\_communication module
pipeline\_parallel.schedules module
-----------------------------------

Contains implementations for two pipeline parallelism schedules
(`forward_backward_pipelining_with_interleaving` for pipeline parallelism with
interleaving, `forward_backward_pipelining_without_interleaving` for pipeline
parallelism without interleaving) and a default no-pipelining schedule
(`forward_backward_no_pipelining`). `get_forward_backward_func` returns the right
scheduling function to use based on the configuration being trained
(e.g., if pipeline-parallel size is 1, use `forward_backward_no_pipelining`).
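
The selection logic described above can be sketched roughly as follows. `pick_schedule` is a hypothetical stand-in that returns the schedule's name as a string; it assumes the interleaved schedule is chosen whenever a virtual pipeline size is set, and it takes the sizes as arguments, whereas the real `get_forward_backward_func` reads them from the parallel state.

```python
def pick_schedule(pipeline_parallel_size, virtual_pipeline_parallel_size=None):
    """Return the name of the schedule that would be used (illustrative only)."""
    if pipeline_parallel_size > 1:
        # Assumption: a configured virtual pipeline size means interleaving.
        if virtual_pipeline_parallel_size is not None:
            return "forward_backward_pipelining_with_interleaving"
        return "forward_backward_pipelining_without_interleaving"
    # A single pipeline stage needs no pipelining at all.
    return "forward_backward_no_pipelining"
```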

.. automodule:: core.pipeline_parallel.schedules
:members:
:undoc-members:
6 changes: 6 additions & 0 deletions docs/source/api-guide/tensor_parallel.rst
@@ -1,6 +1,12 @@
tensor\_parallel package
========================

This package contains an implementation for tensor parallelism in transformer
models (see `Megatron-LM: Training Multi-Billion Parameter Language Models
Using Model Parallelism <https://arxiv.org/abs/1909.08053>`_ and `Reducing
Activation Recomputation in Large Transformer Models <https://arxiv.org/abs/2205.05198>`_
for details).
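
To make the idea of tensor parallelism concrete, here is a toy sketch of a column-parallel linear layer in plain Python. It is not the package's implementation: `matmul` and `split_columns` are hypothetical helpers, and nested lists stand in for tensors. Each "rank" holds a slice of the weight's columns and computes its slice of the output; concatenating the slices reproduces the unsplit result.

```python
def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, num_ranks):
    """Give each rank a contiguous block of columns of w (hypothetical helper)."""
    per_rank = len(w[0]) // num_ranks
    return [[row[r * per_rank:(r + 1) * per_rank] for row in w]
            for r in range(num_ranks)]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

# Each rank computes its part independently; an all-gather would concatenate them.
partials = [matmul(x, w_shard) for w_shard in split_columns(w, 2)]
y_parallel = [v for part in partials for v in part]
```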

Submodules
----------

6 changes: 3 additions & 3 deletions examples/bert/train_bert_340m_distributed.sh
@@ -12,9 +12,9 @@ NUM_NODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NUM_NODES))

CHECKPOINT_PATH=$0 #<Specify path>
TENSORBOARD_LOGS_PATH=$1 #<Specify path>
VOCAB_FILE=$2 #<Specify path to file>/bert-vocab.json
CHECKPOINT_PATH=$1 #<Specify path>
TENSORBOARD_LOGS_PATH=$2 #<Specify path>
VOCAB_FILE=$3 #<Specify path to file>/bert-vocab.json
DATA_PATH=$4 #<Specify path and file prefix>_text_document

DISTRIBUTED_ARGS=(
132 changes: 132 additions & 0 deletions examples/deploy/README.md
@@ -0,0 +1,132 @@
# Megatron Model Optimization and Deployment

## Installation
We recommend that users follow TensorRT-LLM's official installation guide to build it from source
and proceed with a containerized environment (`docker.io/tensorrt_llm/release:latest`):

```
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git checkout v0.7.1
make -C docker release_build
```

> **TROUBLE SHOOTING:** rather than copying each folder separately in `docker/Dockerfile.multi`,
> you may need to copy the entire dir as `COPY ./ /src/tensorrt_llm` since a `git submodule` is
> called later which requires `.git` to continue.

Once the container is built, install `nvidia-ammo` and additional dependencies for sharded checkpoint support:
```
pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo
pip install zarr tensorstore==0.1.45
```
TensorRT-LLM quantization functionalities are currently packaged in `nvidia-ammo`.
You can find more documentation about `nvidia-ammo` in [TensorRT-LLM's quantization
examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization).

## Support Matrix

The following matrix shows the current support for the PTQ + TensorRT-LLM export flow.

| model | fp16 | int8_sq | fp8 | int4_awq |
|-----------------------------|------|---------| ----| -------- |
| nextllm-2b | x | x | x | |
| nemotron3-8b | x | | x | |
| nemotron3-15b | x | | x | |
| llama2-text-7b | x | x | x | TP2 |
| llama2-chat-70b | x | x | x | TP4 |

Our PTQ + TensorRT-LLM flow has native support on MCore `GPTModel` with a mixed layer spec (native ParallelLinear
and Transformer-Engine Norm (`TENorm`)). Note that this is not the default mcore gpt spec. You can still load the
following checkpoint formats by passing the remedy arguments listed:

| GPTModel | sharded | remedy arguments |
|-----------------------------------|---------|-----------------------------------------|
| megatron.model | | `--ammo-load-classic-megatron-to-mcore` |
| TE-Fused (default mcore gpt spec) | | `--ammo-convert-te-to-local-spec` |
| TE-Fused (default mcore gpt spec) | x | |

> **TROUBLE SHOOTING:** If you are trying to load an unpacked `.nemo` sharded checkpoint, then typically you will
> need to add `additional_sharded_prefix="model."` to `ammo_load_checkpoint()` since NeMo has an additional
> `model.` wrapper on top of the `GPTModel`.

> **NOTE:** flag `--ammo-load-classic-megatron-to-mcore` may not work on all legacy checkpoint versions.

## Examples

> **NOTE:** we only provide a simple text generation script to test the generated TensorRT-LLM engines. For
> a production-level API server or enterprise support, see [NeMo](https://github.com/NVIDIA/NeMo) and TensorRT-LLM's
> backend for [NVIDIA Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server).

### nemotron3-8B FP8 Quantization and TensorRT-LLM Deployment
First download the nemotron checkpoint from https://huggingface.co/nvidia/nemotron-3-8b-base-4k, extract the
sharded checkpoint from the `.nemo` tarball, and fix the tokenizer file name.

> **NOTE:** The following cloning method uses `ssh` and assumes you have registered your `ssh-key` with Hugging Face.
> To clone with `https` instead, run `git clone https://huggingface.co/nvidia/nemotron-3-8b-base-4k` with an access token.

```sh
git lfs install
git clone git@hf.co:nvidia/nemotron-3-8b-base-4k
cd nemotron-3-8b-base-4k
tar -xvf Nemotron-3-8B-Base-4k.nemo
mv 586f3f51a9cf43bc9369bd53fa08868c_a934dc7c3e1e46a6838bb63379916563_3feba89c944047c19d5a1d0c07a85c32_mt_nlg_plus_multilingual_ja_zh_the_stack_frac_015_256k.model mt_nlg_plus_multilingual_ja_zh_the_stack_frac_015_256k.model
cd ..
```

Now launch the PTQ + TensorRT-LLM export script,
```
bash examples/deploy/ptq_trtllm_nemotron3_8b ./nemotron-3-8b-base-4k None
```
By default, `cnn_dailymail` is used for calibration. The `GPTModel` will have quantizers for simulating the
quantization effect. The checkpoint will be saved optionally (with quantizers as additional states) and can
be restored for further evaluation. TensorRT-LLM engine is exported to `/tmo/ammo` by default.
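
For readers unfamiliar with the term, "simulating the quantization effect" generally means quantizing a value and immediately dequantizing it, so the model keeps running in floating point but sees the rounding error. The sketch below is a generic illustration under that assumption, not `nvidia-ammo`'s implementation: `fake_quantize` is a hypothetical helper using symmetric per-tensor scaling.

```python
def fake_quantize(values, num_bits=8):
    """Quantize to signed integers with one per-tensor scale, then dequantize."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    amax = max(abs(v) for v in values)      # calibration statistic: max magnitude
    if amax == 0:
        return list(values)
    scale = amax / qmax
    simulated = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))  # integer grid point
        simulated.append(q * scale)                  # back to floating point
    return simulated


weights = [0.5, -1.27, 0.003, 1.0]
simulated = fake_quantize(weights)
```

Each output differs from its input by at most half a quantization step, which is the error the calibration data is used to keep small.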

The script expects `${CHECKPOINT_DIR}` (`./nemotron-3-8b-base-4k`) to have the following structure:
```
├── model_weights
│ ├── common.pt
│ ...
├── model_config.yaml
├── mt_nlg_plus_multilingual_ja_zh_the_stack_frac_015_256k.model
```

> **NOTE:** The script is using `TP=8`. Change `$TP` in the script if your checkpoint has a different tensor
> model parallelism.

> **KNOWN ISSUES:** The `mt_nlg_plus_multilingual_ja_zh_the_stack_frac_015_256k.model` in the checkpoint is for
> Megatron-LM's `GPTSentencePiece` tokenizer.
> For TensorRT-LLM, we are trying to load this tokenizer as a Hugging Face `T5Tokenizer` by changing
> some special tokens, `encode`, and `batch_decode`. As a result, the tokenizer behavior in TensorRT-LLM engine may
> not match exactly.

> **TROUBLE SHOOTING:** If you are loading `.nemo` sharded checkpoint here, call
> `ammo_load_checkpoint(..., additional_sharded_prefix="model.")` with additional sharded prefix in
> `text_generation_ptq.py` to align the sharded keys.
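
The key alignment described in the note above amounts to prepending a prefix to every checkpoint key. The following is a minimal sketch of that idea only: `add_sharded_prefix` is a hypothetical helper, the key name is illustrative, and none of this reflects the internals of `ammo_load_checkpoint()`.

```python
def add_sharded_prefix(state_dict, prefix):
    """Return a copy of state_dict with `prefix` prepended to every key."""
    return {prefix + key: value for key, value in state_dict.items()}


# Illustrative key; a real checkpoint has many entries holding tensors.
mcore_keys = {"decoder.layers.0.mlp.linear_fc1.weight": "tensor-0"}
nemo_keys = add_sharded_prefix(mcore_keys, "model.")
```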

### llama2-text-7b INT8 SmoothQuant and TensorRT-LLM Deployment
> **NOTE:** Due to the LICENSE issue, we do not provide an MCore checkpoint to download. Users can follow
> the instructions in `docs/llama2.md` to convert the checkpoint to the megatron classic `GPTModel` format and
> use the `--ammo-load-classic-megatron-to-mcore` flag, which will remap the checkpoint to the MCore `GPTModel` spec
> that we support.

```sh
bash examples/deploy/ptq_trtllm_llama_7b.sh ${CHECKPOINT_DIR}
```

The script expects `${CHECKPOINT_DIR}` to have the following structure:
```
├── hf
│ ├── tokenizer.config
│ ├── tokenizer.model
│ ...
├── iter_0000001
│ ├── mp_rank_00
│ ...
├── latest_checkpointed_iteration.txt
```
In short, other than the converted llama megatron checkpoint, also put the Hugging Face checkpoint inside as
the source of the tokenizer.