Issue when training in Colab #42
Comments
Getting this error too.
I'm guessing it's running out of RAM? Are you using a high-RAM environment?
No, it's broken. It works on Hugging Face now, but it can't download LoRAs. xD
I have the same issue. I even tried running it without Gradio's tunnel, using another third-party tunnel instead, but I get the same error.
Should note that for me Colab does in fact work, but only on an A100 instance with more than 64 GB of RAM. Usage seemed to spike to ~36+ GB, which is more than the maximum for the free tier / standard RAM profile, so I think it's just the RAM limitation of the lower Colab tiers. On the standard RAM profile with a V100 (which gives me ~20-24 GB of RAM), I hit the issue listed in the original post.
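To see which RAM profile your runtime actually landed on, a quick check like this works in any Colab cell (a minimal sketch; `psutil` ships with Colab by default, and the tier figures in the comment are approximate):

```python
# Report the runtime's total and available RAM so you can tell whether you
# got the standard (~12 GB) or high-RAM (~50+ GB) Colab profile.
import psutil

mem = psutil.virtual_memory()
total_gb = mem.total / 1024**3
avail_gb = mem.available / 1024**3
print(f"Total RAM: {total_gb:.1f} GB, available: {avail_gb:.1f} GB")
```

If the reported total is near 12 GB, a ~36 GB spike during training would explain the crash.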
What model and dataset are you using to generate and train? This happens for me even with a half-precision 7B LLaMA model and the default "unhelpful" example. I can generate with it on my PC, which has only 8 GB of VRAM; I can't train with it, but I wouldn't expect fine-tuning a half-precision 7B LLaMA to need more than the 15 GB of VRAM that Colab provides for free. As you can see, the crash / "Connection errored out" error occurs well before RAM or VRAM is saturated.
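For what it's worth, a back-of-envelope estimate suggests that *full* fine-tuning of a 7B model is far more demanding than inference, which may explain why generation works on 8 GB but training does not. This sketch assumes fp16 weights and gradients plus Adam's fp32 master weights and two moment buffers; the numbers are illustrative, not exact framework behavior, and ignore activations entirely:

```python
# Rough memory estimate for full fine-tuning of a 7B-parameter model with
# fp16 mixed precision and Adam (assumed setup, for illustration only).
params = 7e9
weights_fp16 = params * 2           # 2 bytes per fp16 parameter
grads_fp16 = params * 2             # fp16 gradients
adam_fp32 = params * 4 * 3          # fp32 master weights + m + v buffers

total_gb = (weights_fp16 + grads_fp16 + adam_fp32) / 1024**3
print(f"~{total_gb:.0f} GB before activations")
```

That comes out to roughly 100 GB before activations, which is why parameter-efficient methods like LoRA (which only keep optimizer state for a small adapter) are usually needed on 15 GB Colab GPUs.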
While running training in Colab, this error is shown:
How can I solve this?