Flux.1.dev.fp8 CKPT training: avr_loss stays 'nan' #76

Open
Lecho303 opened this issue Sep 29, 2024 · 8 comments
@Lecho303

I was trying to train a LoRA using the flux.1.dev.fp8 CKPT, and the log keeps reporting avr_loss as nan. I don't know what I have set wrong.

The system & version info:
[START] Security scan
[DONE] Security scan

ComfyUI-Manager: installing dependencies done.

** ComfyUI startup time: 2024-09-28 16:52:30.478163
** Platform: Windows
** Python version: 3.11.8 (tags/v3.11.8:db85d51, Feb 6 2024, 22:03:32) [MSC v.1937 64 bit (AMD64)]
** Python executable: D:\comfyUI\ComfyUI_windows_portable_nvidia.7z\python_embeded\python.exe
** ComfyUI Path: D:\comfyUI\ComfyUI_windows_portable_nvidia.7z\ComfyUI
** Log path: D:\comfyUI\ComfyUI_windows_portable_nvidia.7z\comfyui.log

Prestartup times for custom nodes:
0.0 seconds: D:\comfyUI\ComfyUI_windows_portable_nvidia.7z\ComfyUI\custom_nodes\rgthree-comfy
0.0 seconds: D:\comfyUI\ComfyUI_windows_portable_nvidia.7z\ComfyUI\custom_nodes\ComfyUI-Easy-Use
4.2 seconds: D:\comfyUI\ComfyUI_windows_portable_nvidia.7z\ComfyUI\custom_nodes\ComfyUI-Manager

Total VRAM 6144 MB, total RAM 32461 MB
pytorch version: 2.3.1+cu121
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3060 Laptop GPU : cudaMallocAsync
Using pytorch cross attention

[Screenshot attached: 屏幕截图 2024-09-29 142304]
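Not part of the original report, but since the issue centers on an fp8 base checkpoint, a quick way to confirm how the file is actually stored is to inspect the tensor dtypes directly (a minimal sketch; the path is a placeholder, not the reporter's actual file):

```python
from safetensors import safe_open

# Hypothetical path: point this at the flux.1-dev fp8 checkpoint you are training from.
ckpt_path = r"D:\models\flux1-dev-fp8.safetensors"

with safe_open(ckpt_path, framework="pt") as f:
    for name in list(f.keys())[:10]:
        # An fp8 checkpoint typically stores most transformer weights as torch.float8_e4m3fn.
        print(name, f.get_tensor(name).dtype)
```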

@kijai
Owner

kijai commented Sep 29, 2024

I haven't seen that happen myself, but I'd recommend updating to torch 2.4.1; it's what kohya recommends, and it has solved lots of memory and speed issues for many who have updated.
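For the portable install shown in the log above, the upgrade would be run with the embedded interpreter, along the lines of `python_embeded\python.exe -m pip install --upgrade torch==2.4.1 --index-url https://download.pytorch.org/whl/cu121` (adjust the index URL to your CUDA build; this is a sketch, not official instructions). A quick sanity check afterwards:

```python
import torch

# Expect something like "2.4.1+cu121" and True after the upgrade.
print(torch.__version__)
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
```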

@Lecho303
Author

I haven't seen that happen myself, but I'd recommend updating to torch 2.4.1; it's what kohya recommends, and it has solved lots of memory and speed issues for many who have updated.

OK, I will try to update. Thank you so much!

@Lecho303
Author

Lecho303 commented Oct 1, 2024

I haven't seen that happen myself, but I'd recommend updating to torch 2.4.1; it's what kohya recommends, and it has solved lots of memory and speed issues for many who have updated.

Hi, I upgraded PyTorch to 2.4.1, but the loss still stays "nan"…

@Lecho303
Author

Lecho303 commented Oct 1, 2024

[Screenshot attached: 屏幕截图 2024-10-01 101021]

@Lecho303
Author

Lecho303 commented Oct 1, 2024

[Screenshot: 屏幕截图 2024-10-01 101004.png (upload did not complete)]

@Orenji-Tangerine

What optimizer are you using? Or maybe attach your workflow here.

@RaySteve312

RaySteve312 commented Oct 8, 2024

It's most likely a learning rate problem.

How did you manage to get it this slow? If the batch size were too large, it should have hit OOM long before reaching this speed.

@RaySteve312

RaySteve312 commented Oct 8, 2024

The speed is probably down to this being a laptop rather than a desktop. The NaN is related to your batch size, alpha, and learning rate.
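Not the trainer's own code, just a minimal generic PyTorch sketch of the kind of guard that helps confirm this diagnosis: check the loss and the gradient norm for non-finite values before stepping, and skip the step (and lower the learning rate or alpha) when they appear.

```python
import torch

def training_step(model, batch, optimizer, max_grad_norm=1.0):
    """One guarded training step; assumes the model call returns the scalar loss."""
    optimizer.zero_grad(set_to_none=True)
    loss = model(**batch)
    if not torch.isfinite(loss):
        # NaN/inf loss usually means lr or alpha is too high, or the inputs blew up.
        print("non-finite loss, skipping step")
        return None
    loss.backward()
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    if not torch.isfinite(grad_norm):
        print("non-finite gradients, skipping step")
        optimizer.zero_grad(set_to_none=True)
        return None
    optimizer.step()
    return loss.item()
```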
