From NVIDIA Megatron-LM for visibility #18

Open: wants to merge 3,598 commits into base: multi-query-attention
71d5600
ADLR/megatron-lm!2315 - NVLM task encoders
trintamaki Nov 9, 2024
0343d03
Merge branch 'trintamaki/nvlm-task-encoders' into 'main'
trintamaki Nov 9, 2024
5ebcc5a
ADLR/megatron-lm!2317 - Keep tokenization args in sync between tools/…
sancha Nov 9, 2024
1b8fce7
Merge branch 'tokenizer_args' into 'main'
deepakn94 Nov 9, 2024
66b788a
ADLR/megatron-lm!2326 - ci: Deprecate torchrun
ko3n1g Nov 11, 2024
5e7b14d
Merge branch 'ko3n1g/ci/deprecate-torchrun' into 'main'
ko3n1g Nov 11, 2024
4e7adc2
ADLR/megatron-lm!2330 - ci: Less buckets for unit tests
ko3n1g Nov 11, 2024
acf4855
Merge branch 'ko3n1g/tests/optimize' into 'main'
ko3n1g Nov 11, 2024
d5b4f6a
ADLR/megatron-lm!2313 - build: Fix modelopt dependency
ko3n1g Nov 11, 2024
392bc05
Merge branch 'ko3n1g/build/fix-modelopt' into 'main'
ko3n1g Nov 11, 2024
fe43b46
ADLR/megatron-lm!2331 - ci: Add notifications for unit tests
ko3n1g Nov 11, 2024
e504eca
Merge branch 'ko3n1g/ci/fix-ut-notifications' into 'main'
ko3n1g Nov 11, 2024
a505e28
ADLR/megatron-lm!2332 - ci: Restart on NCCL failures
ko3n1g Nov 11, 2024
c887ae5
Merge branch 'ko3n1g/ci/restart-nccl-failures' into 'main'
ko3n1g Nov 11, 2024
a387779
ADLR/megatron-lm!2202 - all-reduce of conditional embedder grads acro…
Zhuoyao1012 Nov 11, 2024
134a1d5
Merge branch 'diffusion_pp_vpp' into 'main'
ko3n1g Nov 11, 2024
9684d5e
ADLR/megatron-lm!2334 - ci: Restart on infra issues
ko3n1g Nov 11, 2024
47ff44e
Merge branch 'ko3n1g/ci/restart-nccl-failures' into 'main'
ko3n1g Nov 11, 2024
bb30326
ADLR/megatron-lm!2321 - Fixing small stuff for consistancy
shanmugamr1992 Nov 12, 2024
5e4ee10
Merge branch 'small_changes' into 'main'
ericharper Nov 12, 2024
84931f4
ADLR/megatron-lm!2333 - ci: Autoformat files
ko3n1g Nov 12, 2024
9cc52ac
Merge branch 'ko3n1g/ci/auto-format' into 'main'
ko3n1g Nov 12, 2024
3c53037
ADLR/megatron-lm!2335 - ci: Always run formatting
ko3n1g Nov 12, 2024
343bdbc
Merge branch 'ko3n1g/ci/fix-rules' into 'main'
ko3n1g Nov 12, 2024
6b74ef9
ADLR/megatron-lm!2336 - ci: Fix weekly functional tests
ko3n1g Nov 12, 2024
e4e9141
Merge branch 'ko3n1g/ci/fix-weeklies' into 'main'
ko3n1g Nov 12, 2024
8666fdb
ADLR/megatron-lm!2337 - ci: Disable auto-format on forks
ko3n1g Nov 12, 2024
aded519
Merge branch 'ko3n1g/ci/fix-auto-format-forks' into 'main'
ko3n1g Nov 12, 2024
b94bbb4
ADLR/megatron-lm!2311 - NVLM tile tag support
trintamaki Nov 13, 2024
0e29f58
Merge branch 'trintamaki/nvlm-tile-tag' into 'main'
ericharper Nov 13, 2024
2e7030e
ADLR/megatron-lm!2085 - Check common state dict consistancy across ra…
shanmugamr1992 Nov 13, 2024
64cbae5
Merge branch 'dist_common_fix' into 'main'
shanmugamr1992 Nov 13, 2024
ff790ad
ADLR/megatron-lm!2267 - Llava pp > 0 fixes
trintamaki Nov 14, 2024
00e76ee
Merge branch 'trintamaki/llava-pp-fixes' into 'main'
trintamaki Nov 14, 2024
26b8b64
ADLR/megatron-lm!2240 - Rename optimizer's model_parallel_group -> gr…
lmcafee-nvidia Nov 14, 2024
ae9c141
Merge branch 'lmcafee/distopt-doc-oct24' into 'main'
deepakn94 Nov 14, 2024
e1993fa
ADLR/megatron-lm!2150 - Add support for PyTorch FSDP-2
BoxiangW Nov 14, 2024
4c4215f
Merge branch 'boxiangw/fsdp2' into 'main'
deepakn94 Nov 14, 2024
229e225
ADLR/megatron-lm!2345 - Update simple_text_generation_controller.py
shanmugamr1992 Nov 15, 2024
8e22e5b
Merge branch 'shanmugamr-main-patch-24278' into 'main'
jaredcasper Nov 15, 2024
c1728c1
ADLR/megatron-lm!2273 - Updating all T5 attention masks (encoder, dec…
huvunvidia Nov 15, 2024
2163865
Merge branch 'huvu/update_t5_attentionmasktype' into 'main'
ko3n1g Nov 15, 2024
645c329
ADLR/megatron-lm!2279 - Add hierarchical cp comm group
Nov 15, 2024
2bdc60c
Merge branch 'add_hierarchical_cp_comm_group' into 'main'
ericharper Nov 15, 2024
8b72751
ADLR/megatron-lm!2351 - Add missing arg to save_checkpoint call
jon-barker Nov 15, 2024
63b8520
Merge branch 'jbarker-main-patch-72619' into 'main'
jon-barker Nov 15, 2024
4131b07
ADLR/megatron-lm!2306 - NVLM example scripts
trintamaki Nov 16, 2024
ce507ee
Merge branch 'trintamaki/nvlm-example-scripts' into 'main'
trintamaki Nov 16, 2024
9e9d4f5
ADLR/megatron-lm!2348 - ci: Re-enable llava tests
ko3n1g Nov 17, 2024
6c88bfc
Merge branch 'ko3n1g/ci/re-enable-mm-tests' into 'main'
ko3n1g Nov 17, 2024
06c67b4
ADLR/megatron-lm!2357 - ci: Retry download assets
ko3n1g Nov 18, 2024
5438d15
Merge branch 'ko3n1g/ci/retry-download' into 'main'
ko3n1g Nov 18, 2024
57ed924
ADLR/megatron-lm!2260 - Support etp==tp when epp==0 and enforce torch…
jon-barker Nov 18, 2024
0f389f2
Merge branch 'jbarker/etp_equals_tp' into 'main'
jon-barker Nov 18, 2024
62e2e33
ADLR/megatron-lm!2347 - QKNorm to work with TENorm
shanmugamr1992 Nov 18, 2024
68e11fb
Merge branch 'qknorm' into 'main'
shanmugamr1992 Nov 18, 2024
693ae86
ADLR/megatron-lm!2015 - Support RMSNorm when TE and Apex are not inst…
ashors1 Nov 18, 2024
c4c9057
Merge branch 'torch-rms-norm' into 'main'
jaredcasper Nov 18, 2024
2e975f0
ADLR/megatron-lm!2343 - Clarifications for batch x pipeline parallel …
mathemakitten Nov 19, 2024
2138248
Merge branch 'helenn-fix-batch-pipeline-logic' into 'main'
jaredcasper Nov 19, 2024
cd1d30b
ADLR/megatron-lm!2293 - Add attention bias arg in MCore transformer f…
Nov 19, 2024
6033e95
Merge branch 'yuya/add_attn_bias' into 'main'
ko3n1g Nov 19, 2024
4f5aa6d
ADLR/megatron-lm!2360 - chore: Add mypy optionally
ko3n1g Nov 19, 2024
f214627
Merge branch 'ko3n1g/chore/add-mypy' into 'main'
jaredcasper Nov 19, 2024
a231b87
ADLR/megatron-lm!2365 - ci: JET improvements
ko3n1g Nov 19, 2024
b6866ae
Merge branch 'ko3n1g/ci/jet-fleet' into 'main'
ko3n1g Nov 19, 2024
886fd12
ADLR/megatron-lm!2364 - update golden values for nightly test
huvunvidia Nov 20, 2024
7b79d5b
Merge branch 'huvu/update_t5_attentionmask_nightly_goldenvalues' into…
ko3n1g Nov 20, 2024
69d5c71
ADLR/megatron-lm!2367 - ci: Try small runners
ko3n1g Nov 20, 2024
bbaa03a
Merge branch 'ko3n1g/ci/jet-fleet-2' into 'main'
ko3n1g Nov 20, 2024
2a34f2a
ADLR/megatron-lm!2371 - ci: Exempt non-core from legacy tests
ko3n1g Nov 20, 2024
cac3ec3
Merge branch 'ko3n1g/ci/exempt-non-core-from-legacy' into 'main'
ko3n1g Nov 20, 2024
ee929a5
ADLR/megatron-lm!2372 - ci: Increase interval time
ko3n1g Nov 20, 2024
81fee9b
Merge branch 'ko3n1g/ci/increase-interval-time' into 'main'
ko3n1g Nov 20, 2024
2fb82af
ADLR/megatron-lm!2323 - Fix torch native ckpt for TEGroupedLinear
yaox12 Nov 21, 2024
bd2cc55
Merge branch 'xiny/fix_ckpt_te_grouped_linear' into 'main'
ko3n1g Nov 21, 2024
c230e0d
ADLR/megatron-lm!2245 - Update MoE Doc
yanring Nov 21, 2024
e160988
Merge branch 'zijiey/moe_doc_0.9' into 'main'
ko3n1g Nov 21, 2024
cef4a41
ADLR/megatron-lm!2380 - ci: Increase interval time
ko3n1g Nov 21, 2024
779acc0
Merge branch 'ko3n1g/ci/swap-runners' into 'main'
ko3n1g Nov 21, 2024
ba7ea15
ADLR/megatron-lm!2374 - Fix loading args from checkpoint
deepakn94 Nov 21, 2024
8c6b9a4
Merge branch 'dnarayanan/fix_from_checkpoint_args' into 'main'
ko3n1g Nov 21, 2024
4821429
ADLR/megatron-lm!2327 - Small changes to export
shanmugamr1992 Nov 21, 2024
dbc7a18
Merge branch 'nemo_gpu_export' into 'main'
shanmugamr1992 Nov 21, 2024
62a032d
ADLR/megatron-lm!2361 - Multimodal example fixes
trintamaki Nov 21, 2024
69b3e05
Merge branch 'trintamaki/example_fixes' into 'main'
ko3n1g Nov 21, 2024
029025c
ADLR/megatron-lm!2236 - Fix multi tensor copy
yaox12 Nov 21, 2024
ddd920f
Merge branch 'xiny/fix_multi_tensor_copy' into 'main'
ko3n1g Nov 21, 2024
de7794c
ADLR/megatron-lm!2382 - tests: Add `jet-api`
ko3n1g Nov 22, 2024
5a86aa4
Merge branch 'ko3n1g/tests/add-jet-api' into 'main'
ko3n1g Nov 22, 2024
220302e
ADLR/megatron-lm!2383 - tests: Disable broken ckpts test
ko3n1g Nov 22, 2024
54d61d3
Merge branch 'ko3n1g/tests/disable-checkpoint' into 'main'
ko3n1g Nov 22, 2024
1033917
ADLR/megatron-lm!2384 - tests: Fully remove test
ko3n1g Nov 22, 2024
2e355b7
Merge branch 'ko3n1g/tests/disable-checkpoint' into 'main'
ko3n1g Nov 22, 2024
31a69e1
ADLR/megatron-lm!2385 - Make InternViTRMSNorm behave wrt sharded_stat…
jon-barker Nov 22, 2024
a9d040c
Merge branch 'jbarker/internvit_bugfix' into 'main'
jon-barker Nov 22, 2024
7f22e21
ADLR/megatron-lm!1940 - MoE parallel folding: separate MoE parallel s…
Victarry Nov 23, 2024
d392f9c
Merge branch 'denliu/moe_parallel_states' into 'main'
ko3n1g Nov 23, 2024
938e5c8
ADLR/megatron-lm!2289 - pp > 1 online evaluation
Nov 23, 2024
c10721e
Merge branch 'tpoon/pp_llava_evaluation' into 'main'
ko3n1g Nov 23, 2024
c913cd0
ADLR/megatron-lm!2244 - Clean up main MLM training loop
deepakn94 Nov 24, 2024
3a32fbc
Merge branch 'dnarayanan/training_loop_cleanup' into 'main'
deepakn94 Nov 24, 2024
9a3e331
ADLR/megatron-lm!2316 - respect perform_initialization
akoumpa Nov 24, 2024
cbbfa91
Merge branch 'akoumparouli/fix_te_skip_init' into 'main'
ko3n1g Nov 24, 2024
5a3bd5a
ADLR/megatron-lm!2350 - Add unit tests for mamba-hybrid-layer-allocation
Nov 24, 2024
9a75c72
Merge branch 'papakipos/mamba-hybrid-layer-allocation-testing' into '…
ko3n1g Nov 24, 2024
cc54e45
ADLR/megatron-lm!2354 - None: Update assertion for invalid layer_type…
brb-nv Nov 25, 2024
47806ab
Merge branch 'user/brb/minor-fix' into 'main'
ko3n1g Nov 25, 2024
2f2b1f1
ADLR/megatron-lm!2387 - ci: Use `curl-jq` for notify step
ko3n1g Nov 25, 2024
e21ce31
Merge branch 'ko3n1g/ci/fix-notify-image' into 'main'
ko3n1g Nov 25, 2024
a1fbf86
ADLR/megatron-lm!1913 - bugfix for multiple context managers
sudhakarsingh27 Nov 25, 2024
cc207f8
Merge branch 'bugfix_multiple_ctx_managers' into 'main'
ko3n1g Nov 25, 2024
072cac4
ADLR/megatron-lm!2390 - Remove interface test since we will allow mew…
mathemakitten Nov 25, 2024
081ab4d
Merge branch 'helenn-remove-interface-test' into 'main'
ko3n1g Nov 25, 2024
7e9ab5c
ADLR/megatron-lm!2373 - Support big blends by passing in filename of …
deepakn94 Nov 26, 2024
8d24655
Merge branch 'dnarayanan/add_json_data_args' into 'main'
deepakn94 Nov 26, 2024
71d670b
ADLR/megatron-lm!2389 - ci: Small improvements
ko3n1g Nov 26, 2024
3c17f5c
Merge branch 'ko3n1g/ci/small-improvements' into 'main'
ko3n1g Nov 26, 2024
c436712
ADLR/megatron-lm!2275 - Context Parallelism Support for LLaVA Model
parthmannan Nov 26, 2024
f5afc25
Merge branch 'pmannan/llava_cp_reformat' into 'main'
ko3n1g Nov 26, 2024
0be5646
ADLR/megatron-lm!1489 - loader_mcore.py local module support.
lmcafee-nvidia Nov 27, 2024
29535b9
Merge branch 'lmcafee/loader-mcore-local-partial' into 'main'
jaredcasper Nov 27, 2024
2ca57f5
ADLR/megatron-lm!2362 - Fix check_param_hashes_across_dp_replicas
deepakn94 Nov 27, 2024
44a64c0
Merge branch 'dnarayanan/fix_check_param_hashes' into 'main'
deepakn94 Nov 27, 2024
53654f7
ADLR/megatron-lm!2399 - ci: Restart failed pipeline submission
ko3n1g Nov 27, 2024
5b1196b
Merge branch 'ko3n1g/ci/restart-pipeline-submission' into 'main'
ko3n1g Nov 27, 2024
42070d2
ADLR/megatron-lm!2394 - chore: Set QAT approval to optional
ko3n1g Nov 27, 2024
48b1942
Merge branch 'ko3n1g/chore/codeowners' into 'main'
ko3n1g Nov 27, 2024
4e627b5
ADLR/megatron-lm!2284 - chore: pip install Mcore's dependencies
ko3n1g Nov 27, 2024
452d520
Merge branch 'ko3n1g/build/dependencies' into 'main'
ko3n1g Nov 27, 2024
b35cc1c
ADLR/megatron-lm!2400 - Make inference max sequence length configurable
mathemakitten Nov 27, 2024
3c2d6f8
Merge branch 'helenn-inference-max-seqlen-config' into 'main'
jaredcasper Nov 27, 2024
39f3bef
ADLR/megatron-lm!2406 - build: Improve caching
ko3n1g Nov 28, 2024
f3e1afb
Merge branch 'ko3n1g/build/caching' into 'main'
ko3n1g Nov 28, 2024
6bd9255
ADLR/megatron-lm!2393 - Fix compatibility error brought by !1940 for …
Victarry Nov 28, 2024
67a50f2
Merge branch 'denliu/fix_moe_parallel_states' into 'main'
ko3n1g Nov 28, 2024
1113758
ADLR/megatron-lm!2238 - Fix initialization for gates of router and sh…
yaox12 Nov 29, 2024
8e9d4dc
Merge branch 'xiny/fix_router_init' into 'main'
ko3n1g Nov 29, 2024
e842d46
ADLR/megatron-lm!2391 - Add TorchLayerNorm alias for backward compati…
ashors1 Nov 29, 2024
31a29b8
Merge branch 'torch_norm_alias' into 'main'
ko3n1g Nov 29, 2024
0c43280
ADLR/megatron-lm!2221 - Multimodal sequence packing support
trintamaki Nov 30, 2024
38f7a8c
Merge branch 'trintamaki/sequence-packing' into 'main'
trintamaki Nov 30, 2024
bb84eb9
ADLR/megatron-lm!2170 - MCore Partial DistOpt Feature
sanandaraj5597 Nov 30, 2024
64d816a
Merge branch 'partial_dp_distopt' into 'main'
ko3n1g Nov 30, 2024
9157970
ADLR/megatron-lm!2398 - Check if num_layers is divisible by PP size e…
deepakn94 Nov 30, 2024
99f999a
Merge branch 'dnarayanan/pp_assertion' into 'main'
deepakn94 Nov 30, 2024
0d3d317
ADLR/megatron-lm!2405 - Update distributed tests to only use public f…
deepakn94 Nov 30, 2024
090e2ee
Merge branch 'dnarayanan/fix_distributed_test' into 'main'
ko3n1g Nov 30, 2024
382fa6a
ADLR/megatron-lm!2395 - ci: Use cluster-specific runners
ko3n1g Nov 30, 2024
a794662
Merge branch 'ko3n1g/ci/cluster-runners' into 'main'
ko3n1g Nov 30, 2024
d5318c1
ADLR/megatron-lm!2411 - ci: Add coreutils to notify job
ko3n1g Nov 30, 2024
529404e
Merge branch 'ko3n1g/ci/add-coreutils' into 'main'
ko3n1g Nov 30, 2024
cd02b4b
ADLR/megatron-lm!2412 - ci: Fix job runners
ko3n1g Dec 1, 2024
d0dae2a
Merge branch 'ko3n1g/ci/job-runners' into 'main'
ko3n1g Dec 1, 2024
337c34f
ADLR/megatron-lm!2308 - Check if Gloo process group is already destro…
szmigacz Dec 1, 2024
4ad7a97
Merge branch 'destroy_pg_if_valid' into 'main'
deepakn94 Dec 1, 2024
443a193
ADLR/megatron-lm!2325 - Add `separation_hint` to support writing opti…
ashors1 Dec 1, 2024
1115e06
Merge branch 'drop-optim-async' into 'main'
ko3n1g Dec 1, 2024
7b43f73
ADLR/megatron-lm!2407 - Bugfix: allow both blend and blend_per_split …
deepakn94 Dec 2, 2024
7d7213d
Merge branch 'dnarayanan/converter_bugfix' into 'main'
ko3n1g Dec 2, 2024
2ed67b2
ADLR/megatron-lm!2402 - Add dist-ckpt support to InternViT
jon-barker Dec 2, 2024
22f9a79
Merge branch 'jbarker/internvit_dist_ckpt' into 'main'
jon-barker Dec 2, 2024
522e567
ADLR/megatron-lm!2410 - ci: Run unit tests on Slurm
ko3n1g Dec 3, 2024
9f1ef85
Merge branch 'ko3n1g/ci/unit-tests-on-slurm' into 'main'
ko3n1g Dec 3, 2024
9ceaab6
ADLR/megatron-lm!2415 - ci: Unlock all cluster runners
ko3n1g Dec 3, 2024
ae832c7
Merge branch 'ko3n1g/ci/job-runners-2' into 'main'
ko3n1g Dec 3, 2024
21cc9b0
ADLR/megatron-lm!2416 - tests: Add barrier for destroy
ko3n1g Dec 3, 2024
844119f
Merge branch 'ko3n1g/ci/job-runners-2' into 'main'
ko3n1g Dec 3, 2024
1e51980
ADLR/megatron-lm!2423 - ci: Adjust model config path
ko3n1g Dec 4, 2024
daa54ea
Merge branch 'ko3n1g/ci/fix-skip-tests' into 'main'
ko3n1g Dec 4, 2024
d65f7e6
ADLR/megatron-lm!2424 - ci: Fix notifications
ko3n1g Dec 4, 2024
2f67f35
Merge branch 'ko3n1g/ci/unit-tests-extended' into 'main'
ko3n1g Dec 4, 2024
ca1a3df
ADLR/megatron-lm!2179 - TRT-LLM export for TE FP8-trained checkpoints
Dec 5, 2024
e97d486
Merge branch 'pikaminski/fp8-export' into 'main'
shanmugamr1992 Dec 5, 2024
2b6b8ac
ADLR/megatron-lm!2425 - Fix test after new inference default added
mathemakitten Dec 5, 2024
bd677bf
Merge branch 'helenn-fix-inference-test-20241204' into 'main'
ko3n1g Dec 5, 2024
3357c82
ADLR/megatron-lm!2422 - Fix golden values of fp8 weekly tests
kunlunl Dec 7, 2024
dc7fea9
Merge branch 'fix_golden_values_of_weekly' into 'main'
ko3n1g Dec 7, 2024
47ab878
ADLR/megatron-lm!2230 - Enhance MoE Architecture: Support MoE Layer F…
Shunkangz Dec 8, 2024
60d5b38
Merge branch 'moe_freq_and_moe_offset' into 'main'
ko3n1g Dec 8, 2024
fa0dcc4
ADLR/megatron-lm!2168 - Resolve "Attention as a config option in mcore"
shanmugamr1992 Dec 8, 2024
9dc7fef
Merge branch '326-attention-as-a-config-option-in-mcore' into 'main'
shanmugamr1992 Dec 8, 2024
e059614
ADLR/megatron-lm!2381 - sample index helper function, no unnecessary …
Dec 8, 2024
9665f2d
Merge branch 'return-type-sample-idx' into 'main'
jaredcasper Dec 8, 2024
7da20af
ADLR/megatron-lm!2388 - Fix peak memory consumption for NeMo
yaox12 Dec 8, 2024
44fd429
Merge branch 'xiny/fix_peak_mem' into 'main'
jaredcasper Dec 8, 2024
e7503a4
ADLR/megatron-lm!2413 - [dist ckpt] Use gather object instead of all …
ananthsub Dec 8, 2024
d677ca3
Merge branch 'debug-ckpt-oom' into 'main'
ericharper Dec 8, 2024
cf84356
ADLR/megatron-lm!2282 - Add functionality to re-run iterations
Dec 8, 2024
43fa44c
Merge branch 'rerun_step' into 'main'
ko3n1g Dec 8, 2024
f6f8434
ADLR/megatron-lm!2418 - Bugfix in multimodal dataloader_provider
jon-barker Dec 8, 2024
6dfeb25
Merge branch 'jbarker-main-patch-95366' into 'main'
jon-barker Dec 8, 2024
aa2a45d
ADLR/megatron-lm!2101 - Refactor MoE specs: move all submodules of Mo…
hxbai Dec 9, 2024
37cd8f2
Merge branch 'hongxiaob/moe_spec' into 'main'
ko3n1g Dec 9, 2024
44b6480
ADLR/megatron-lm!2414 - Remove all-gather before first iteration to n…
deepakn94 Dec 9, 2024
d4e72c0
Merge branch 'dnarayanan/skip_all_gather_first_iteration' into 'main'
deepakn94 Dec 9, 2024
40fb590
ADLR/megatron-lm!2404 - move get_batch_on_this_cp_rank to mcore utils
xrennvidia Dec 11, 2024
215a2eb
Merge branch 'xren/cp_llava' into 'main'
ko3n1g Dec 11, 2024
2aa3522
ADLR/megatron-lm!2432 - Small VLM example
trintamaki Dec 11, 2024
371feef
Merge branch 'trintamaki/small-model-example' into 'main'
trintamaki Dec 11, 2024
2816445
ADLR/megatron-lm!2443 - Fix assert warning in !2282
Dec 12, 2024
fd69c2f
Merge branch 'fix-assert-warning' into 'main'
ericharper Dec 12, 2024
ebfc79b
ADLR/megatron-lm!2453 - Fix wrapping of external dataloaders
Dec 12, 2024
99f23d2
Merge branch 'fix-external-dataloader' into 'main'
ericharper Dec 12, 2024
17b92eb
ADLR/megatron-lm!2449 - Fix moe dist-ckpt compatibility for !2230
Shunkangz Dec 13, 2024
40db706
Merge branch 'moe_distckpt_compatibility' into 'main'
jaredcasper Dec 13, 2024
de18820
ADLR/megatron-lm!2441 - Llava pp > 1 fix
trintamaki Dec 13, 2024
183f568
Merge branch 'trintamaki/llava-pp-fix' into 'main'
ericharper Dec 13, 2024
acba19c
ADLR/megatron-lm!2421 - Reduce CPU overhead of TEDotProductAttention …
Victarry Dec 13, 2024
3f5d5d4
Merge branch 'denliu/te_attention_cpu_improve' into 'main'
ericharper Dec 13, 2024
be8534a
ADLR/megatron-lm!2444 - Fix checkpointing of rerun state machine
Dec 14, 2024
71c394b
Merge branch 'fix-rerun-checkpoint' into 'main'
deepakn94 Dec 14, 2024
f33d9fe
ADLR/megatron-lm!2440 - MCore generate: read vocab size from model, n…
cuichenx Dec 16, 2024
3d2297e
Merge branch 'main' into 'main'
ericharper Dec 16, 2024
de25d48
ADLR/megatron-lm!2448 - Updating nightly
shanmugamr1992 Dec 16, 2024
14ca285
Merge branch 'fixnightly' into 'main'
ko3n1g Dec 16, 2024
fba26d2
ADLR/megatron-lm!2340 - Cudagraph memory optimizations and mcore opti…
jiemingz Dec 18, 2024
1339cda
Merge branch 'jiemingz/cudagraph_memfix' into 'main'
ko3n1g Dec 18, 2024
e9cc9ac
ADLR/megatron-lm!2472 - ci: Swap image for cherry-pick automation
ko3n1g Dec 18, 2024
d995e9c
Merge branch 'ko3n1g/ci/cherry-pick-fix' into 'main'
ko3n1g Dec 18, 2024
1e49c9d
ADLR/megatron-lm!2478 - Fix accidental inference pipelining when it s…
mathemakitten Dec 18, 2024
584e4f9
Merge branch 'helenn-pipeline-parallel-fix-flash-decode' into 'main'
jaredcasper Dec 18, 2024
66c63df
ADLR/megatron-lm!2461 - Clarify tokenizer use in VLM example
trintamaki Dec 18, 2024
3224cf8
Merge branch 'trintamaki/example-tokenizer' into 'main'
trintamaki Dec 18, 2024
ef84846
ADLR/megatron-lm!2433 - fix: Guard Bert TE layer specs
ko3n1g Dec 18, 2024
319c8aa
Merge branch 'ko3n1g/fix/bert-te-import-check' into 'main'
ko3n1g Dec 18, 2024
474f9c5
ADLR/megatron-lm!2409 - Improved flattened tensors validation
mikolajblaz Dec 18, 2024
1b7553e
Merge branch 'mblaz/fix-flat-validation' into 'main'
ericharper Dec 18, 2024
281cbe6
ADLR/megatron-lm!2439 - MCore Inference misc changes
mathemakitten Dec 18, 2024
8d2bc43
Merge branch 'helenn-refactor-textgen' into 'main'
jaredcasper Dec 18, 2024
64e065c
ADLR/megatron-lm!2470 - Fixed grad scale assertion
sanandaraj5597 Dec 19, 2024
7449d66
Merge branch 'grad_scale_assert_fix' into 'main'
ericharper Dec 19, 2024
7e99c5b
ADLR/megatron-lm!2438 - Multi image dataloader
Dec 19, 2024
aff6e38
Merge branch 'multi_image_dataloader' into 'main'
ericharper Dec 19, 2024
31e8bfa
ADLR/megatron-lm!2301 - Allow empty partial load
mikolajblaz Dec 19, 2024
7efaa73
Merge branch 'mblaz/allow-empty-partial-load' into 'main'
ko3n1g Dec 19, 2024
47a175b
ADLR/megatron-lm!1879 - Add MX-FP16
kunlunl Dec 20, 2024
ca87dcd
Merge branch 'mx_fp16' into 'main'
ericharper Dec 20, 2024
d0df563
ADLR/megatron-lm!1934 - Support Device-Limited Routing and Sequence A…
Shunkangz Dec 21, 2024
b8420a1
Merge branch 'group_topk' into 'main'
ko3n1g Dec 21, 2024
7bb5379
ADLR/megatron-lm!2469 - Correct strides for bshd layout and revert Ro…
mathemakitten Dec 21, 2024
25b1f33
Merge branch 'helenn-rope-fusion-mem-layout' into 'main'
ericharper Dec 21, 2024
1da9dad
ADLR/megatron-lm!2494 - Add model checkpoint links
boxin-wbx Dec 21, 2024
cf25d44
Merge branch 'boxin/nvlm_ckpt_release' into 'main'
jon-barker Dec 21, 2024
1468ab0
ADLR/megatron-lm!2285 - Support --freeze-LM and --freeze-ViT with ran…
jon-barker Dec 21, 2024
d3c585e
Merge branch 'jbarker/pp_unfreeze' into 'main'
jon-barker Dec 21, 2024
e51a3ac
ADLR/megatron-lm!2491 - Move mmodal evaluation code to its own folder
Dec 23, 2024
2da43ef
Merge branch 'mmodal_eval_in_folder' into 'main'
jon-barker Dec 23, 2024
48103f4
ADLR/megatron-lm!2471 - Updating T5 codes to fix bugs
huvunvidia Dec 30, 2024
076972e
Merge branch 'huvu/t5_fixes_updates' into 'main'
ericharper Dec 30, 2024
4 changes: 4 additions & 0 deletions .flake8
@@ -0,0 +1,4 @@
[flake8]
max-line-length = 100
extend-ignore = E203,E501,F401,E402,E714
per-file-ignores = __init__.py:F401
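Reading note: as a minimal sketch of what this configuration selects, the snippet below parses the same four lines with Python's standard `configparser` and lists the globally ignored codes. The helper name `ignored_codes` and the constant `FLAKE8_CFG` are hypothetical, introduced only for illustration; this is not how flake8 itself loads options. Two things it makes visible: `E501` (line too long) is in `extend-ignore`, so the `max-line-length = 100` setting is presumably not enforced by the core check, and `F401` is already ignored globally, which would make the `__init__.py:F401` per-file entry redundant.

```python
import configparser

# Copy of the .flake8 content added in this diff (assumed verbatim).
FLAKE8_CFG = """
[flake8]
max-line-length = 100
extend-ignore = E203,E501,F401,E402,E714
per-file-ignores = __init__.py:F401
"""


def ignored_codes(cfg_text: str) -> set[str]:
    """Return the set of codes listed under extend-ignore (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    raw = parser["flake8"]["extend-ignore"]
    # Codes are comma-separated; strip whitespace and drop empty entries.
    return {code.strip() for code in raw.split(",") if code.strip()}


codes = ignored_codes(FLAKE8_CFG)
line_length_check_ignored = "E501" in codes
```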
32 changes: 32 additions & 0 deletions .github/ISSUE_TEMPLATE/bug.md
@@ -0,0 +1,32 @@
---
name: BUG
about: Report a bug that needs attention
title: "[BUG]"
labels: ''
assignees: ''

---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior. The easier it is to reproduce the faster it will get maintainer attention.

**Expected behavior**
A clear and concise description of what you expected to happen.

**Stack trace/logs**
If applicable, add the stack trace or logs from the time of the error.

**Environment (please complete the following information):**
- Megatron-LM commit ID
- PyTorch version
- CUDA version
- NCCL version

**Proposed fix**
If you have a proposal for how to fix the issue state it here or link to a PR.

**Additional context**
Add any other context about the problem here.
23 changes: 23 additions & 0 deletions .github/ISSUE_TEMPLATE/enhancement.md
@@ -0,0 +1,23 @@
---
name: ENHANCEMENT
about: Suggest an idea to improve this project
title: "[ENHANCEMENT]"
labels: ''
assignees: ''

---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Proposed implementation**
If you have a proposed implementation for the feature state it here or link to a PR.

**Additional context**
Add any other context or screenshots about the feature request here.
12 changes: 12 additions & 0 deletions .github/ISSUE_TEMPLATE/question.md
@@ -0,0 +1,12 @@
---
name: QUESTION
about: Ask a question about Megatron-LM that is not a bug, regression or enhancement
  request
title: "[QUESTION]"
labels: ''
assignees: ''

---

**Your question**
Ask a clear and concise question about Megatron-LM.
39 changes: 39 additions & 0 deletions .github/ISSUE_TEMPLATE/regression.md
@@ -0,0 +1,39 @@
---
name: REGRESSION
about: Report a regression in speed or accuracy due to a Megatron-LM update
title: "[REGRESSION]"
labels: ''
assignees: ''

---

**Describe the regression**
A clear and concise description of what the regression is.

**To Reproduce**
Steps to reproduce the behavior. The easier it is to reproduce the faster it will get maintainer attention.

**Previous performance**
What speed or accuracy did you previously see.

**New performance**
What speed or accuracy do you see after the update.

**Stack trace/logs**
If applicable, add the stack trace or logs related to the regression.

**Environment (please complete the following information):**
- Previous Megatron-LM commit ID
- New Megatron-LM commit ID
- Previous PyTorch version
- New PyTorch version
- Previous CUDA version
- New CUDA version
- Previous NCCL version
- New NCCL version

**Proposed fix**
If you have a proposal for how to fix the issue state it here or link to a PR.

**Additional context**
Add any other context about the problem here.
31 changes: 31 additions & 0 deletions .github/workflows/stale.yml
@@ -0,0 +1,31 @@
# This workflow warns and then closes issues and PRs that have had no activity for a specified amount of time.
#
# You can adjust the behavior by modifying this file.
# For more information, see:
# https://github.com/actions/stale
name: Mark stale issues and pull requests

on:
  schedule:
  - cron: '15 18 * * *'

jobs:
  stale:

    runs-on: ubuntu-latest
    permissions:
      issues: write
      pull-requests: write

    steps:
    - uses: actions/stale@v5
      with:
        repo-token: ${{ secrets.GITHUB_TOKEN }}
        days-before-stale: 60
        stale-issue-message: 'Marking as stale. No activity in 60 days.'
        stale-pr-message: 'Marking as stale. No activity in 60 days.'
        stale-issue-label: 'stale'
        stale-pr-label: 'stale'
        remove-stale-when-updated: true
        operations-per-run: 1000
        days-before-close: -1
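Reading note: the workflow's header comment says it "warns and then closes", but with `days-before-close: -1` the `actions/stale` action never closes anything automatically; items are only labeled `stale` after 60 days of inactivity. The cron expression `'15 18 * * *'` runs the job once a day at 18:15 UTC. The following is a simplified model of that decision under these settings, not the action's actual implementation; the function `classify` and its return strings are hypothetical names chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# Values taken from the workflow above.
DAYS_BEFORE_STALE = 60
DAYS_BEFORE_CLOSE = -1  # negative value: never close automatically


def classify(last_activity: datetime, now: datetime, has_stale_label: bool = False) -> str:
    """Return the action a simplified stale bot would take (hypothetical model)."""
    idle = now - last_activity
    if not has_stale_label:
        # Not yet labeled: label it once it has been idle long enough.
        if idle >= timedelta(days=DAYS_BEFORE_STALE):
            return "mark-stale"
        return "keep"
    # Already labeled stale: close only if closing is enabled.
    if DAYS_BEFORE_CLOSE < 0:
        return "keep-stale"
    if idle >= timedelta(days=DAYS_BEFORE_CLOSE):
        return "close"
    return "keep-stale"


now = datetime(2025, 1, 1, 18, 15, tzinfo=timezone.utc)
recent = classify(now - timedelta(days=10), now)
old = classify(now - timedelta(days=61), now)
very_old_labeled = classify(now - timedelta(days=400), now, has_stale_label=True)
```

In this model an item that is already labeled stale stays open no matter how long it has been idle, which matches the "never close" reading of the negative value.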
3 changes: 3 additions & 0 deletions .gitignore
@@ -6,3 +6,6 @@ build
*~
slurm*
logs
.vscode
local/
.gitmodules