I encountered a problem in pretraining: the lm loss became 0.0 after a few hundred iterations and then stayed there, yet there were no nan/inf/skipped iterations according to the training log.
I'm wondering whether loss_mask may modify self.cached_loss_mask unexpectedly at line 206 of Megatron-LM/megatron/core/datasets/gpt_dataset.py (lines 183 to 207 in 2e2bdf6). Since we want to cache the loss mask, but loss_mask is just a reference to the cached tensor rather than a copy, could in-place masking accumulate zeros in self.cached_loss_mask and eventually produce an all-zero loss_mask? If so, is a .clone() needed when caching?
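To illustrate the aliasing I suspect, here is a minimal, self-contained sketch (ToyDataset and _get_loss_mask are made-up names, not the actual Megatron-LM code): if the cache holds a reference to the mask, per-sample in-place masking leaks back into the cache.

```python
import torch

# Minimal sketch of the suspected aliasing problem (hypothetical ToyDataset,
# not the real GPTDataset): the cache stores a reference to the loss mask,
# so in-place edits on the returned mask also mutate the cache.
class ToyDataset:
    def __init__(self, seq_len: int):
        self.seq_len = seq_len
        self.cached_loss_mask = None

    def _get_loss_mask(self) -> torch.Tensor:
        if self.cached_loss_mask is None:
            self.cached_loss_mask = torch.ones(self.seq_len)
        # Returned tensor aliases the cache; zeros written by the caller
        # persist into every later sample.
        return self.cached_loss_mask
        # Possible fix: return self.cached_loss_mask.clone()

dataset = ToyDataset(seq_len=8)
for step in range(3):
    loss_mask = dataset._get_loss_mask()
    # Per-sample masking (e.g. zeroing positions after an EOD token).
    loss_mask[step : step + 2] = 0.0
    print(f"step {step}: cached_loss_mask.sum() = {dataset.cached_loss_mask.sum().item()}")
# Output shows the cached mask shrinking (6.0, 5.0, 4.0); over many samples it
# would end up all zeros, and the lm loss would collapse to 0.0 as described above.
```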
shmily326 changed the title from "[QUESTION] The cached_loss_mask maybe modified unexpectedly in GPTDataset? Whether a .clone() is needed?" to "[QUESTION] The cached_loss_mask maybe modified unexpectedly in GPTDataset?" on Nov 1, 2024
shmily326 changed the title from "[QUESTION] The cached_loss_mask maybe modified unexpectedly in GPTDataset?" to "[BUG] The cached_loss_mask maybe modified unexpectedly in GPTDataset?" on Nov 4, 2024