I have found that using -np.inf as the mask fill value in the inter-attention module (the attend part) often leads to NaN loss values, even with gradient clipping or very low learning rates. Replacing it with a large finite negative value such as -1e18 fixes the issue in my case.
Could it be that there is an error in the masking applied before the attention scores are computed?
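For reference, here is a minimal sketch of the failure mode I suspect (assuming a PyTorch-style masked softmax; the shapes, scores, and mask below are made up for illustration). When every key position in a row is masked, filling with -inf makes the softmax evaluate 0/0 and produce NaN, while a large finite value just yields a uniform distribution:

```python
import torch
import torch.nn.functional as F

# Hypothetical attention scores: batch of 1, 2 query positions, 3 key positions.
scores = torch.randn(1, 2, 3)

# Mask where the first query row has no valid keys (all padding) -- the case
# that blows up when the fill value is -inf.
mask = torch.tensor([[[False, False, False],
                      [True,  True,  False]]])

inf_masked = scores.masked_fill(~mask, float("-inf"))
finite_masked = scores.masked_fill(~mask, -1e18)

print(F.softmax(inf_masked, dim=-1))     # first row is all NaN
print(F.softmax(finite_masked, dim=-1))  # first row is uniform, no NaN
```

If fully masked rows cannot occur in this codebase, then the NaNs likely come from somewhere else and -1e18 is only papering over them.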