Flash attention and multipack failing for qwen and mistral #1966
Comments
May I ask what is
@NanoCode012 Something like this, I believe: https://pytorch-lightning.readthedocs.io/en/1.1.1/training_tricks.html#auto-scaling-of-batch-size
Just posting what I got so far here. I tested on a 2-GPU setup, and after it hung for about an hour I got the following error, which looks like the standard timeout message. That was before I even reached the first training step.
`frame #6: clone + 0x44 (0x7f021b924bf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)`
`W1014 22:35:07.647000 138711449323328 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 618 closing signal SIGTERM`
@tiger241 Did you get a similar error message, or does it just stay idle without further output?
I got a different one: an NCCL timeout in my case. I am referring to the changes in the multipack file that were introduced here. According to the commit title, it is related to auto_batch_size. Inside the `reduce_and_broadcast` function, the broadcast operation did not go through. I have also tried the NCCL commands suggested in the README, but the bug did not go away.
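A minimal sketch of the failure mode described above, assuming plain torch.distributed launched with torchrun on 2 GPUs (this is not axolotl's code): if a broadcast is only reached by some ranks, the remaining ranks block on it until the NCCL watchdog kills the job, which matches the idle-then-SIGTERM output posted earlier.

```python
import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK/WORLD_SIZE/MASTER_ADDR for init_process_group
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    payload = torch.tensor([rank], device="cuda")
    if rank != 0:
        # every non-zero rank blocks here waiting for a broadcast that
        # rank 0 never issues; the job idles until the watchdog kills it
        dist.broadcast(payload, src=0)
    # rank 0 skips the collective entirely -> mismatch -> hang on the others

    dist.destroy_process_group()


if __name__ == "__main__":
    main()  # e.g. torchrun --nproc_per_node=2 this_file.py
```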
@tiger241 Just to clarify, you were able to get it to work on multi-GPU prior to the commit 4e5400c#diff-26bb7717d9a9a9a1ae328e55ef90344b64f2d768a9f072cfafe80a0912537515 ?
Yep. Basically I rewrote that multipack file back to the previous version (before the commit) while keeping the rest of the codebase in the same state as the current main branch, and it worked fine. The broadcast operation called in multipack is somehow causing the issue; I see no error other than the idling and the NCCL timeout. Before the commit, the code estimated the batch count by assuming the data is distributed uniformly across the GPUs. I am guessing the change was added so that the DeepSpeed auto batch feature could be used, since that would require a precise value of the batch size per GPU. As you can see in the code, I used the `_len_est(self)` function and not the `self.gather_len_batches(len_batches)` function, and I also reverted the gather operation inside `_len_est` to the way it was before that commit. With that, there were no issues and it worked fine.
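For context, here is a rough sketch of the two length strategies being compared in the comment above. The method names mirror the comment, but the bodies are simplified illustrations and not the actual axolotl implementation: the older estimate assumes tokens are spread uniformly across ranks and needs no communication, while the newer exact count requires every rank to enter a collective at the same time.

```python
import math
import torch
import torch.distributed as dist


class MultipackLenSketch:
    """Illustrative only; names mirror the comment, not axolotl's real code."""

    def __init__(self, lengths, batch_max_len, world_size):
        self.lengths = lengths              # token length of every sample
        self.batch_max_len = batch_max_len  # token budget of one packed batch
        self.world_size = world_size

    def _len_est(self):
        # pre-commit style: assume tokens are spread uniformly over the ranks,
        # so the batch count is a purely local computation -- no collectives,
        # nothing to deadlock on, but the number is only an estimate
        total_tokens = sum(self.lengths)
        return math.ceil(total_tokens / (self.batch_max_len * self.world_size))

    def gather_len_batches(self, local_num_batches):
        # post-commit style (roughly): every rank reports its actual packed
        # batch count and all ranks agree on a common value; this is exact,
        # but it hangs if any rank does not reach this call
        t = torch.tensor([local_num_batches], dtype=torch.long, device="cuda")
        dist.all_reduce(t, op=dist.ReduceOp.MIN)
        return int(t.item())
```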
@tiger241 Thank you for the clarification! We were able to get around the idling by setting
@tiger241 @NanoCode012 Started a PR that implemented a fix for this issue so that
Hey @tiger241, the linked PR has been merged for a while now. May I ask if you still experience this issue? If not, could we close this issue?
I apologize for the late reply. I'll check the PR today; from a quick look it looks OK-ish.
Hi @tiger241, any updates?
Please check that this issue hasn't been reported before.
Expected Behavior
I can run the example config YAMLs on main without significant bugs.
Current behaviour
The first issue I saw was with multipack. I noticed that the feature introduced for auto_batch_size made the training code stay idle after a few steps (usually at the eval step after one training step).
I used the older version of multipack and it worked fine. Overall, the newer version of multipack worked for single GPU but not for multi-GPU.
Steps to reproduce
I tried running the code in a CUDA 12.4 environment with torch 2.4.1 and flash-attention 2.6.3 (Triton 3.0.0).
(Older versions of axolotl from August worked without issue; maybe something changed and I am unable to find the cause.)
examples/mistral/qlora.yml [found the multipack bug here]
I used this dataset:
datasets:
  type: alpaca
Config yaml
No response
Possible solution
The older version of multipack works. I am not sure about the compatibility of the new multipack with DeepSpeed, at least. Maybe the communication added in that step interferes with something in DeepSpeed (I tried the zero2 and zero3_bf16 configs).
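If the suspicion above is right, one way to narrow it down (a debugging sketch assuming a plain torch.distributed setup; axolotl normally initializes this through accelerate/DeepSpeed) is to enable NCCL logging and shorten the process-group timeout, so the stuck collective surfaces in the logs after a few minutes instead of the much longer default idle period.

```python
import os
from datetime import timedelta

import torch.distributed as dist

# print NCCL collective activity so the hanging broadcast shows up in the log;
# must be set before the process group / first collective is created
os.environ.setdefault("NCCL_DEBUG", "INFO")

# surface the hang after 5 minutes instead of waiting for the default timeout
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=5))
```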
Which Operating Systems are you using?
Python Version
python 3.11
axolotl branch-commit
main
Acknowledgements