Why is Mamba-minimal able to train but not mamba? #2569
Corallus-Caninus asked this question in Q&A (unanswered).
It is not obvious to me why mamba-minimal is able to train, and it isn't explicitly documented. Is it due to the state-selection implementation? I noticed the mamba implementation mentions discarding all but the last state. I am able to train mamba locally (the original mamba example, not mamba-minimal) using some older Candle code of mine rigged up with L-BFGS from candle-optimisers, though the loss landscape is admittedly not very convex. Any help, or a pointer in the right direction (papers?), would be appreciated.
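For concreteness, the recurrence I have in mind (my paraphrase of the selective SSM from the Mamba paper) is

$$h_t = \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t, \qquad y_t = C_t\, h_t,$$

where $\bar{A}_t$, $\bar{B}_t$ and $C_t$ are computed from the input $x_t$ (the "selection"), and my reading of the example is that it only ever retains the final $h_T$ as a cache for generation.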
Or is it because the gradients for prior steps in the sequence are not part of the backpropagation graph? I feel like I'm guessing when the answer might be obvious.
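To make that second guess concrete, here is a toy sketch (my own code, not the candle mamba example; the shapes, the diagonal recurrence, and the exact `detach` signature are assumptions that may vary by candle version) of the difference between carrying the recurrent state through the autograd graph and detaching it at every step:

```rust
use candle_core::{DType, Device, Result, Tensor, Var};

// Toy diagonal recurrence h_t = a * h_{t-1} + x_t; `detach_state` controls
// whether the carried state stays inside the autograd graph.
fn scan(x: &Tensor, a: &Tensor, detach_state: bool) -> Result<Tensor> {
    let (seq_len, _d) = x.dims2()?;
    let mut h = x.get(0)?.zeros_like()?;
    for t in 0..seq_len {
        if detach_state {
            // Cutting the carried state out of the graph means the loss only
            // backpropagates through the current step's update, never earlier ones.
            h = h.detach();
        }
        let x_t = x.get(t)?;
        h = ((a * &h)? + x_t)?;
    }
    Ok(h)
}

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let a = Var::ones(4, DType::F32, &dev)?; // trainable "transition" parameter
    let x = Tensor::randn(0f32, 1f32, (8, 4), &dev)?; // (seq_len, d) toy input
    for detach in [false, true] {
        let y = scan(&x, a.as_tensor(), detach)?;
        let loss = y.sqr()?.sum_all()?;
        let grads = loss.backward()?;
        let g = grads.get(a.as_tensor()).expect("gradient for a");
        println!("detach={detach} grad_a={:?}", g.to_vec1::<f32>()?);
    }
    Ok(())
}
```

With `detach = true` the gradient with respect to `a` only sees the final step's update, which is the kind of behaviour I suspect would prevent training through the sequence; I'd like to know whether the mamba example actually does anything like this, or whether the difference lies elsewhere.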