I encountered a problem in pretraining: the lm loss became 0.0 after a few hundred iterations and then stayed there, yet there were no nan/inf/skipped iterations according to the training log.
I'm wondering whether loss_mask may modify self.cached_loss_mask unexpectedly at line 206 of Megatron-LM/megatron/core/datasets/gpt_dataset.py (lines 183 to 207 in 2e2bdf6). Since we want to cache the loss mask, but loss_mask is just a reference to the cached tensor rather than a copy, could in-place masking accumulate zeros in self.cached_loss_mask and eventually produce an all-zero loss_mask? If so, is a .clone() needed when caching?
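To illustrate the aliasing I suspect, here is a minimal, self-contained sketch (ToyDataset and _get_loss_mask are made-up names, not the actual Megatron-LM code): if the cache holds a reference to the mask, per-sample in-place masking leaks back into the cache.

```python
import torch

# Minimal sketch of the suspected aliasing problem (hypothetical ToyDataset,
# not the real GPTDataset): the cache stores a reference to the loss mask,
# so in-place edits on the returned mask also mutate the cache.
class ToyDataset:
    def __init__(self, seq_len: int):
        self.seq_len = seq_len
        self.cached_loss_mask = None

    def _get_loss_mask(self) -> torch.Tensor:
        if self.cached_loss_mask is None:
            self.cached_loss_mask = torch.ones(self.seq_len)
        # Returned tensor aliases the cache; zeros written by the caller
        # persist into every later sample.
        return self.cached_loss_mask
        # Possible fix: return self.cached_loss_mask.clone()

dataset = ToyDataset(seq_len=8)
for step in range(3):
    loss_mask = dataset._get_loss_mask()
    # Per-sample masking (e.g. zeroing positions after an EOD token).
    loss_mask[step : step + 2] = 0.0
    print(f"step {step}: cached_loss_mask.sum() = {dataset.cached_loss_mask.sum().item()}")
# Output shows the cached mask shrinking (6.0, 5.0, 4.0); over many samples it
# would end up all zeros, and the lm loss would collapse to 0.0 as described above.
```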
shmily326 changed the title from "[QUESTION] The cached_loss_mask maybe modified unexpectedly in GPTDataset? Whether a .clone() is needed?" to "[QUESTION] The cached_loss_mask maybe modified unexpectedly in GPTDataset?" on Nov 1, 2024
shmily326 changed the title from "[QUESTION] The cached_loss_mask maybe modified unexpectedly in GPTDataset?" to "[BUG] The cached_loss_mask maybe modified unexpectedly in GPTDataset?" on Nov 4, 2024