Training either won't start or trains a while then suddenly stops with no errors #3021
OK, tried again. This time the git clone link produced a new error altogether when installing: [notice] A new release of pip is available: 23.0.1 -> 24.3.1 × python setup.py egg_info did not run successfully. note: This error originates from a subprocess, and is likely not a problem with pip. × Encountered error while generating package metadata. note: This is an issue with the package mentioned above, not pip.
Seems like every time I try installing, I get more and more errors.
And this time, no action at all: (venv) D:\AI_pics\kohya_ss>
OK, I got it running, but now I am back to it stopping with errors after some time. This is just a 4 repeats, 1 batch, 5 epoch train. I also see that the errors are completely different now, compared to the last time I almost got it working. Then it was more about "accelerate"; now it seems to be about the CUDA/torch optimizer etc. epoch 3/5 epoch 4/5 epoch 5/5
I am tearing my hair out here. I HAVE NO KNOWLEDGE OF EITHER git(hub) or PYTHON, I have no idea what to do.
Google-fu does not give any hints about this error.
I have an Intel i9, 96 GB of RAM, and a 4080 Super 16 GB, on Win 10.
I HAD a working kohya for about a week (successfully(?) trained a dozen LoRAs) but had problems with most LR schedulers not working and no BLIP captioning.
So I reinstalled. I should never have done that.
After the first handful of reinstalls of EVERYTHING, it trained for a couple of epochs, then just errored out with something about what I think was "accelerate_cli.py" and "training has ended". I have installed the CUDA software and option 2 in the install menu.
I have set up accelerate a couple of times, and also tried not doing it.
I have tried Python 3.10.6, 3.10.9 and 3.10.11, completely uninstalled every time, and also deleted everything in users/appdata.
No difference. I even tried deleting everything about Python in the registry.
I have tried git cloning, using the zip file, and even the "portable kohya" package.
Sometimes it won't install sd-scripts. I then did that manually. No difference.
(One weird thing is that even when I have uninstalled all Python versions, automatic1111 still works, saying it uses Python 3.10.6, even though it is not installed. Does it use a local one?)
In the dump below it says "OSError: image file is truncated (44 bytes not processed)", but I have tried other pictures. I tried the ones I have already trained on, but got the same error.
Now I am at the point where it starts setting everything up, but at the step after "caching latents" it craps out AGAIN with the following:
0%| | 0/34 [00:00<?, ?it/s]
Traceback (most recent call last):
File "D:\AI_pics\kohya_ss\sd-scripts\sdxl_train_network.py", line 185, in <module>
trainer.train(args)
File "D:\AI_pics\kohya_ss\sd-scripts\train_network.py", line 272, in train
train_dataset_group.cache_latents(vae, args.vae_batch_size, args.cache_latents_to_disk, accelerator.is_main_process)
File "D:\AI_pics\kohya_ss\sd-scripts\library\train_util.py", line 2324, in cache_latents
dataset.cache_latents(vae, vae_batch_size, cache_to_disk, is_main_process, file_suffix)
File "D:\AI_pics\kohya_ss\sd-scripts\library\train_util.py", line 1146, in cache_latents
cache_batch_latents(vae, cache_to_disk, batch, subset.flip_aug, subset.alpha_mask, subset.random_crop)
File "D:\AI_pics\kohya_ss\sd-scripts\library\train_util.py", line 2734, in cache_batch_latents
image = load_image(info.absolute_path, use_alpha_mask) if info.image is None else np.array(info.image, np.uint8)
File "D:\AI_pics\kohya_ss\sd-scripts\library\train_util.py", line 2637, in load_image
img = np.array(image, np.uint8)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\PIL\Image.py", line 681, in array_interface
new["data"] = self.tobytes()
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\PIL\Image.py", line 740, in tobytes
self.load()
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\PIL\ImageFile.py", line 288, in load
raise OSError(msg)
OSError: image file is truncated (44 bytes not processed)
Traceback (most recent call last):
File "C:\Users\Anders\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\Anders\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\AI_pics\kohya_ss\venv\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
sys.exit(main())
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
args.func(args)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
simple_launcher(args)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['D:\AI_pics\kohya_ss\venv\Scripts\python.exe', 'D:/AI_pics/kohya_ss/sd-scripts/sdxl_train_network.py', '--config_file', 'D:/AI_pics/Blonde_tensor\model/config_lora-20241225-144614.toml']' returned non-zero exit status 1.
14:46:25-434957 INFO Training has ended.
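The first traceback above ends in Pillow's "image file is truncated" error while latents are being cached, which suggests at least one file in the training folder is incomplete; the second traceback is just accelerate reporting that the training script exited with status 1. A minimal sketch for locating suspect files, assuming the dataset contains only JPEG and PNG images: it checks whether each file ends with the trailer its format normally ends with. The function names `looks_truncated` and `find_truncated` are hypothetical and not part of kohya_ss or sd-scripts, and the check is only a heuristic (a file with extra data after the trailer is flagged too, and damage in the middle of a file is not detected).

```python
from pathlib import Path

# Byte sequences a complete file normally ends with (heuristic only).
JPEG_EOI = b"\xff\xd9"          # JPEG end-of-image marker
PNG_IEND = b"IEND\xaeB`\x82"    # PNG IEND chunk type plus its fixed CRC

def looks_truncated(path):
    """Return True if a JPEG/PNG file does not end with its usual trailer."""
    data = Path(path).read_bytes()
    if data.startswith(b"\xff\xd8"):            # JPEG signature
        return not data.endswith(JPEG_EOI)
    if data.startswith(b"\x89PNG\r\n\x1a\n"):   # PNG signature
        return not data.endswith(PNG_IEND)
    return False  # other formats are not judged by this sketch

def find_truncated(folder):
    """List image files under `folder` (recursively) that look incomplete."""
    exts = {".jpg", ".jpeg", ".png"}
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() in exts and looks_truncated(p)
    )

# Example (the folder path is whatever your training image folder is):
# for p in find_truncated(r"D:\path\to\training\images"):
#     print(p)
```

Any file it lists is worth re-exporting or re-downloading before caching latents again. A more thorough check would be to open each file with Pillow and call `load()` on it inside the kohya venv, since that is the call that fails in the traceback.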