Flash attention and multipack failing for qwen and mistral #1966
Comments
May I ask what is
@NanoCode012 Something like this, I believe: https://pytorch-lightning.readthedocs.io/en/1.1.1/training_tricks.html#auto-scaling-of-batch-size
Just posting what I got so far here. I tested on a 2-GPU setup, and after it hung for about an hour I got the following error, which looks like the standard timeout message. That was before I even reached the first training step.
`frame #6: clone + 0x44 (0x7f021b924bf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)`
`W1014 22:35:07.647000 138711449323328 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 618 closing signal SIGTERM`
@tiger241 Did you get a similar error message, or does it just stay idle without further output?
I got a different one: an NCCL timeout in my case. I am referring to the changes in the multipack file that were introduced here. According to the commit title, it is related to auto_batch_size. Inside the `reduce_and_broadcast` function, the broadcast operation did not go through. I have also tried the NCCL commands suggested in the README, but the bug did not go away.
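A minimal sketch of the failure mode described above, assuming plain torch.distributed launched with torchrun on 2 GPUs (this is not axolotl's code): if a broadcast is only reached by some ranks, the remaining ranks block on it until the NCCL watchdog kills the job, which matches the idle-then-SIGTERM output posted earlier.

```python
import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK/WORLD_SIZE/MASTER_ADDR for init_process_group
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    payload = torch.tensor([rank], device="cuda")
    if rank != 0:
        # every non-zero rank blocks here waiting for a broadcast that
        # rank 0 never issues; the job idles until the watchdog kills it
        dist.broadcast(payload, src=0)
    # rank 0 skips the collective entirely -> mismatch -> hang on the others

    dist.destroy_process_group()


if __name__ == "__main__":
    main()  # e.g. torchrun --nproc_per_node=2 this_file.py
```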
@tiger241 Just to clarify, you were able to get it to work on multi-GPU prior to the commit 4e5400c#diff-26bb7717d9a9a9a1ae328e55ef90344b64f2d768a9f072cfafe80a0912537515 ?
Yep. Basically I rewrote that multipack file back to the previous version (before the commit) while keeping the rest of the codebase in the same state as the current main branch, and it worked fine. The broadcast operation called in multipack is somehow causing the issue; I see no error other than the idling and the NCCL timeout. Before the commit, the code estimated the batch count by assuming the data is distributed uniformly across the GPUs. I am guessing the change was added so that the DeepSpeed auto batch feature could be used, since that would require a precise value of the batch size per GPU. As you can see in the code, I used the `_len_est(self)` function and not the `self.gather_len_batches(len_batches)` function, and I also reverted the gather operation inside `_len_est` to the way it was before that commit. With that, there were no issues and it worked fine.
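For context, here is a rough sketch of the two length strategies being compared in the comment above. The method names mirror the comment, but the bodies are simplified illustrations and not the actual axolotl implementation: the older estimate assumes tokens are spread uniformly across ranks and needs no communication, while the newer exact count requires every rank to enter a collective at the same time.

```python
import math
import torch
import torch.distributed as dist


class MultipackLenSketch:
    """Illustrative only; names mirror the comment, not axolotl's real code."""

    def __init__(self, lengths, batch_max_len, world_size):
        self.lengths = lengths              # token length of every sample
        self.batch_max_len = batch_max_len  # token budget of one packed batch
        self.world_size = world_size

    def _len_est(self):
        # pre-commit style: assume tokens are spread uniformly over the ranks,
        # so the batch count is a purely local computation -- no collectives,
        # nothing to deadlock on, but the number is only an estimate
        total_tokens = sum(self.lengths)
        return math.ceil(total_tokens / (self.batch_max_len * self.world_size))

    def gather_len_batches(self, local_num_batches):
        # post-commit style (roughly): every rank reports its actual packed
        # batch count and all ranks agree on a common value; this is exact,
        # but it hangs if any rank does not reach this call
        t = torch.tensor([local_num_batches], dtype=torch.long, device="cuda")
        dist.all_reduce(t, op=dist.ReduceOp.MIN)
        return int(t.item())
```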
@tiger241 Thank you for the clarification! We were able to get around the idling by setting
@tiger241 @NanoCode012 Started a PR that implemented a fix for this issue so that
Hey @tiger241, the linked PR has been merged for a while now. May I ask if you still experience this issue? If not, could we close this issue?
I apologize for the late reply. I'll check the PR today; from a quick look it looks OK-ish.
Hi @tiger241, any updates?
Please check that this issue hasn't been reported before.
Expected Behavior
I can run the example config YAMLs on main without significant bugs.
Current behaviour
The first issue I saw was with multipack. I noticed that the feature introduced for auto_batch_size made the training code stay idle after a few steps (usually at the eval step after one training step).
I used the older version of multipack and it worked fine. Overall, the newer version of multipack worked for single GPU but not for multi-GPU.
Steps to reproduce
I tried running the code in a CUDA 12.4 environment with torch 2.4.1 and flash-attention 2.6.3 (Triton 3.0.0).
(Older versions of axolotl from August worked without issue; maybe something changed and I am unable to find the cause.)
examples/mistral/qlora.yml [found the multipack bug here]
I used this dataset:
datasets:
  type: alpaca
Config yaml
No response
Possible solution
The older version of multipack works. I am not sure about the compatibility of the new multipack with DeepSpeed, at least. Maybe the communication added in that step interferes with something in DeepSpeed (I tried the zero2 and zero3_bf16 configs).
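If the suspicion above is right, one way to narrow it down (a debugging sketch assuming a plain torch.distributed setup; axolotl normally initializes this through accelerate/DeepSpeed) is to enable NCCL logging and shorten the process-group timeout, so the stuck collective surfaces in the logs after a few minutes instead of the much longer default idle period.

```python
import os
from datetime import timedelta

import torch.distributed as dist

# print NCCL collective activity so the hanging broadcast shows up in the log;
# must be set before the process group / first collective is created
os.environ.setdefault("NCCL_DEBUG", "INFO")

# surface the hang after 5 minutes instead of waiting for the default timeout
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=5))
```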
Which Operating Systems are you using?
Python Version
python 3.11
axolotl branch-commit
main
Acknowledgements