
Training either won't start or trains for a while and then suddenly stops with no errors #3021

Open
Erlandsson opened this issue Dec 25, 2024 · 4 comments

Comments

@Erlandsson

I am tearing my hair out here. I have no knowledge of either Git(Hub) or Python, so I have no idea what to do.
My google-fu gives no hints about this error.

My setup: Intel i9, 96 GB RAM, RTX 4080 Super 16 GB, Windows 10.

I HAD a working kohya for about a week (trained a dozen LoRAs, more or less successfully), but had problems with most LR schedulers not working and no BLIP captioning,
so I reinstalled. I should never have done that.

After the first handful of reinstalls of EVERYTHING: it trained for a couple of epochs, then just errored out with something about, I think, "accelerate_cli.py" and "training has ended". I have installed the CUDA software and picked option 2 in the install menu.
I have run the accelerate config step a couple of times, and also tried skipping it.

I have tried Python 3.10.6, 3.10.9 and 3.10.11, completely uninstalling every time and also deleting the leftovers in AppData.
No difference. I even tried deleting everything Python-related from the registry.

I have tried git cloning, using the zip file, and even the "portable kohya" package.
Sometimes it won't install sd-scripts; I then did that manually. No difference.

(One weird thing is that even though I have uninstalled all Python versions, Automatic1111 still works and says it uses Python 3.10.6, even though it is not installed. Does it use a local copy?)

In the dump below it says "OSError: image file is truncated (44 bytes not processed)", but I have tried other pictures. I tried the ones I have already trained on before, and got the same error.

Now I am at the point where it sets everything up, but at the step after "caching latents" it craps out AGAIN with the following:

               INFO     caching latents...                                                       train_util.py:1144

0%| | 0/34 [00:00<?, ?it/s]
Traceback (most recent call last):
File "D:\AI_pics\kohya_ss\sd-scripts\sdxl_train_network.py", line 185, in
trainer.train(args)
File "D:\AI_pics\kohya_ss\sd-scripts\train_network.py", line 272, in train
train_dataset_group.cache_latents(vae, args.vae_batch_size, args.cache_latents_to_disk, accelerator.is_main_process)
File "D:\AI_pics\kohya_ss\sd-scripts\library\train_util.py", line 2324, in cache_latents
dataset.cache_latents(vae, vae_batch_size, cache_to_disk, is_main_process, file_suffix)
File "D:\AI_pics\kohya_ss\sd-scripts\library\train_util.py", line 1146, in cache_latents
cache_batch_latents(vae, cache_to_disk, batch, subset.flip_aug, subset.alpha_mask, subset.random_crop)
File "D:\AI_pics\kohya_ss\sd-scripts\library\train_util.py", line 2734, in cache_batch_latents
image = load_image(info.absolute_path, use_alpha_mask) if info.image is None else np.array(info.image, np.uint8)
File "D:\AI_pics\kohya_ss\sd-scripts\library\train_util.py", line 2637, in load_image
img = np.array(image, np.uint8)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\PIL\Image.py", line 681, in array_interface
new["data"] = self.tobytes()
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\PIL\Image.py", line 740, in tobytes
self.load()
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\PIL\ImageFile.py", line 288, in load
raise OSError(msg)
OSError: image file is truncated (44 bytes not processed)
Traceback (most recent call last):
File "C:\Users\Anders\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\Anders\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in run_code
exec(code, run_globals)
File "D:\AI_pics\kohya_ss\venv\Scripts\accelerate.EXE_main
.py", line 7, in
sys.exit(main())
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
args.func(args)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
simple_launcher(args)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['D:\AI_pics\kohya_ss\venv\Scripts\python.exe', 'D:/AI_pics/kohya_ss/sd-scripts/sdxl_train_network.py', '--config_file', 'D:/AI_pics/Blonde_tensor\model/config_lora-20241225-144614.toml']' returned non-zero exit status 1.
14:46:25-434957 INFO Training has ended.
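
The "OSError: image file is truncated (44 bytes not processed)" is raised by Pillow while it decodes one of the dataset images, so at least one file in the training folder really is cut short on disk (e.g. a partially copied or interrupted download). A minimal sketch for finding the offending file(s) before starting training, assuming the images sit under one folder (the path below is only a placeholder):

```python
# Scan a dataset folder and report any image Pillow cannot fully decode.
from pathlib import Path
from PIL import Image

dataset_dir = Path(r"D:\AI_pics\dataset")  # placeholder; point at the real image folder

for path in sorted(dataset_dir.rglob("*")):
    if path.suffix.lower() not in {".png", ".jpg", ".jpeg", ".webp", ".bmp"}:
        continue
    try:
        with Image.open(path) as img:
            img.load()  # force a full decode; truncation only surfaces here
    except Exception as exc:
        print(f"BAD: {path} -> {exc}")
```

Re-exporting or removing whatever this flags is safer than setting Pillow's `ImageFile.LOAD_TRUNCATED_IMAGES = True`, which only hides the problem.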

@Erlandsson
Author

OK, tried again. This time installing via the git clone link produces a new error altogether:

[notice] A new release of pip is available: 23.0.1 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip
14:59:09-049338 INFO Requirements from requirements_pytorch_windows.txt installed.
14:59:09-064919 INFO Installing requirements from requirements_windows.txt...
Obtaining file:///D:/AI_pics/kohya_ss/sd-scripts (from -r requirements.txt (line 35))
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [90 lines of output]
Traceback (most recent call last):
File "", line 2, in
File "", line 14, in
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\setuptools_init_.py", line 16, in
import setuptools.version
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\setuptools\version.py", line 1, in
import pkg_resources
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\pkg_resources_init_.py", line 84, in
import('pkg_resources.extern.packaging.requirements')
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\pkg_resources_vendor\packaging\requirements.py", line 84, in
REQUIREMENT.parseString("x[]")
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\pkg_resources_vendor\pyparsing\core.py", line 1131, in parse_string
loc, tokens = self._parse(instring, 0)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\pkg_resources_vendor\pyparsing\core.py", line 817, in _parseNoCache
loc, tokens = self.parseImpl(instring, pre_loc, doActions)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\pkg_resources_vendor\pyparsing\core.py", line 3886, in parseImpl
loc, exprtokens = e._parse(instring, loc, doActions)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\pkg_resources_vendor\pyparsing\core.py", line 817, in _parseNoCache
loc, tokens = self.parseImpl(instring, pre_loc, doActions)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\pkg_resources_vendor\pyparsing\core.py", line 4114, in parseImpl
return e._parse(
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\pkg_resources_vendor\pyparsing\core.py", line 817, in _parseNoCache
loc, tokens = self.parseImpl(instring, pre_loc, doActions)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\pkg_resources_vendor\pyparsing\core.py", line 3864, in parseImpl
loc, resultlist = self.exprs[0]._parse(
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\pkg_resources_vendor\pyparsing\core.py", line 817, in _parseNoCache
loc, tokens = self.parseImpl(instring, pre_loc, doActions)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\pkg_resources_vendor\pyparsing\core.py", line 3886, in parseImpl
loc, exprtokens = e._parse(instring, loc, doActions)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\pkg_resources_vendor\pyparsing\core.py", line 817, in _parseNoCache
loc, tokens = self.parseImpl(instring, pre_loc, doActions)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\pkg_resources_vendor\pyparsing\core.py", line 4959, in parseImpl
loc, tokens = self_expr._parse(instring, loc, doActions, callPreParse=False)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\pkg_resources_vendor\pyparsing\core.py", line 817, in _parseNoCache
loc, tokens = self.parseImpl(instring, pre_loc, doActions)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\pkg_resources_vendor\pyparsing\core.py", line 4114, in parseImpl
return e._parse(
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\pkg_resources_vendor\pyparsing\core.py", line 817, in _parseNoCache
loc, tokens = self.parseImpl(instring, pre_loc, doActions)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\pkg_resources_vendor\pyparsing\core.py", line 4375, in parseImpl
return self.expr._parse(instring, loc, doActions, callPreParse=False)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\pkg_resources_vendor\pyparsing\core.py", line 817, in _parseNoCache
loc, tokens = self.parseImpl(instring, pre_loc, doActions)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\pkg_resources_vendor\pyparsing\core.py", line 3864, in parseImpl
loc, resultlist = self.exprs[0]._parse(
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\pkg_resources_vendor\pyparsing\core.py", line 817, in _parseNoCache
loc, tokens = self.parseImpl(instring, pre_loc, doActions)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\pkg_resources_vendor\pyparsing\core.py", line 3958, in parseImpl
loc2 = e.try_parse(instring, loc, raise_fatal=True)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\pkg_resources_vendor\pyparsing\core.py", line 880, in try_parse
return self._parse(instring, loc, doActions=False)[0]
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\pkg_resources_vendor\pyparsing\core.py", line 817, in _parseNoCache
loc, tokens = self.parseImpl(instring, pre_loc, doActions)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\pkg_resources_vendor\pyparsing\core.py", line 2985, in parseImpl
result = self.re_match(instring, loc)
File "C:\Users\Anders\AppData\Local\Programs\Python\Python310\lib\functools.py", line 981, in get
val = self.func(instance)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\pkg_resources_vendor\pyparsing\core.py", line 2975, in re_match
return self.re.match
File "C:\Users\Anders\AppData\Local\Programs\Python\Python310\lib\functools.py", line 981, in get
val = self.func(instance)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\pkg_resources_vendor\pyparsing\core.py", line 2967, in re
return re.compile(self.pattern, self.flags)
File "C:\Users\Anders\AppData\Local\Programs\Python\Python310\lib\re.py", line 251, in compile
return _compile(pattern, flags)
File "C:\Users\Anders\AppData\Local\Programs\Python\Python310\lib\re.py", line 303, in _compile
p = sre_compile.compile(pattern, flags)
File "C:\Users\Anders\AppData\Local\Programs\Python\Python310\lib\sre_compile.py", line 788, in compile
p = sre_parse.parse(p, flags)
File "C:\Users\Anders\AppData\Local\Programs\Python\Python310\lib\sre_parse.py", line 955, in parse
p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
File "C:\Users\Anders\AppData\Local\Programs\Python\Python310\lib\sre_parse.py", line 444, in _parse_sub
itemsappend(_parse(source, state, verbose, nested + 1,
File "C:\Users\Anders\AppData\Local\Programs\Python\Python310\lib\sre_parse.py", line 841, in _parse
p = _parse_sub(source, state, sub_verbose, nested + 1)
File "C:\Users\Anders\AppData\Local\Programs\Python\Python310\lib\sre_parse.py", line 444, in _parse_sub
itemsappend(_parse(source, state, verbose, nested + 1,
File "C:\Users\Anders\AppData\Local\Programs\Python\Python310\lib\sre_parse.py", line 841, in _parse
p = _parse_sub(source, state, sub_verbose, nested + 1)
File "C:\Users\Anders\AppData\Local\Programs\Python\Python310\lib\sre_parse.py", line 444, in _parse_sub
itemsappend(_parse(source, state, verbose, nested + 1,
File "C:\Users\Anders\AppData\Local\Programs\Python\Python310\lib\sre_parse.py", line 841, in _parse
p = _parse_sub(source, state, sub_verbose, nested + 1)
File "C:\Users\Anders\AppData\Local\Programs\Python\Python310\lib\sre_parse.py", line 444, in _parse_sub
itemsappend(_parse(source, state, verbose, nested + 1,
File "C:\Users\Anders\AppData\Local\Programs\Python\Python310\lib\sre_parse.py", line 668, in _parse
if not item or item[0][0] is AT:
File "C:\Users\Anders\AppData\Local\Programs\Python\Python310\lib\sre_parse.py", line 168, in getitem
return self.data[index]
TypeError: 'type' object is not subscriptable
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
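
The `TypeError: 'type' object is not subscriptable` happens while pip tries to import the venv's own setuptools/pkg_resources to run `setup.py egg_info`, which points at the virtual environment's packaging stack being broken rather than at the sd-scripts package itself. A small sanity-check sketch, run with the venv's interpreter, to see whether those packages import at all; if it fails the same way, deleting the `venv` folder and letting the kohya_ss setup recreate it from scratch is usually faster than repairing it:

```python
# Run this with the venv's interpreter (D:\AI_pics\kohya_ss\venv\Scripts\python.exe)
# to check whether the packaging stack imports at all -- it crashes in the log above.
import importlib

for name in ("pip", "setuptools", "pkg_resources"):
    try:
        module = importlib.import_module(name)
        print(f"{name}: OK ({getattr(module, '__version__', 'no __version__')})")
    except Exception as exc:
        print(f"{name}: FAILED -> {exc!r}")
```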

@Erlandsson
Author

Seems like every time I try installing, I get more and more errors.

@Erlandsson
Author

And this time, no action at all:
15:06:57-490783 INFO Kohya_ss GUI version: v24.1.7
15:06:57-907424 INFO Submodule initialized and updated.
15:06:57-907424 INFO nVidia toolkit detected
15:06:59-007271 INFO Torch 2.1.2+cu118
15:06:59-028934 INFO Torch backend: nVidia CUDA 11.8 cuDNN 8905
15:06:59-028934 INFO Torch detected GPU: NVIDIA GeForce RTX 4080 SUPER VRAM 16376 Arch (8, 9) Cores 80
15:06:59-028934 INFO Python version is 3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit
(AMD64)]
15:06:59-028934 INFO Verifying modules installation status from requirements_pytorch_windows.txt...
15:06:59-028934 INFO Verifying modules installation status from requirements_windows.txt...
15:06:59-028934 INFO Verifying modules installation status from requirements.txt...

(venv) D:\AI_pics\kohya_ss>
(venv) D:\AI_pics\kohya_ss>

@Erlandsson
Author

OK, I got it running, but now I am back to it stopping with errors after some time. This is just a 4-repeat, 1-batch, 5-epoch training run:
about 15 minutes total, but it fails after epoch 4, just 2 minutes from the finish. It gets awful when training is supposed to take 7 hours, I go to sleep, and I wake up to find it stopped after 30 minutes.

I also see that the errors are completely different now compared to the last time I almost got it working. Then it was more about "accelerate"; now it seems to be about the CUDA/torch optimizer etc.

epoch 3/5
2024-12-25 15:17:02 INFO epoch is incremented. current_epoch: 2, epoch: 3 train_util.py:703
steps: 60%|█████████████████████████████████▌ | 732/1220 [07:16<04:50, 1.68it/s, avr_loss=0.108]
saving checkpoint: D:/AI_pics/Katka_tensors\model\test-000003.safetensors

epoch 4/5
2024-12-25 15:19:27 INFO epoch is incremented. current_epoch: 3, epoch: 4 train_util.py:703
steps: 80%|████████████████████████████████████████████▊ | 976/1220 [09:42<02:25, 1.68it/s, avr_loss=0.103]
saving checkpoint: D:/AI_pics/Katka_tensors\model\test-000004.safetensors

epoch 5/5
2024-12-25 15:21:53 INFO epoch is incremented. current_epoch: 4, epoch: 5 train_util.py:703
steps: 83%|█████████████████████████████████████████████▌ | 1010/1220 [10:03<02:05, 1.67it/s, avr_loss=0.101]Traceback (most recent call last):
File "D:\AI_pics\kohya_ss\sd-scripts\sdxl_train_network.py", line 185, in
trainer.train(args)
File "D:\AI_pics\kohya_ss\sd-scripts\train_network.py", line 1012, in train
optimizer.step()
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\accelerate\optimizer.py", line 132, in step
self.scaler.step(self.optimizer, closure)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\torch\cuda\amp\grad_scaler.py", line 416, in step
retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\torch\cuda\amp\grad_scaler.py", line 315, in _maybe_opt_step
retval = optimizer.step(*args, **kwargs)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\accelerate\optimizer.py", line 185, in patched_step
return method(*args, **kwargs)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\torch\optim\lr_scheduler.py", line 68, in wrapper
return wrapped(*args, **kwargs)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\torch\optim\optimizer.py", line 373, in wrapper
out = func(*args, **kwargs)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\torch\optim\optimizer.py", line 76, in _use_grad
ret = func(self, *args, **kwargs)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\torch\optim\adamw.py", line 184, in step
adamw(
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\torch\optim\adamw.py", line 335, in adamw
func(
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\torch\optim\adamw.py", line 530, in _multi_tensor_adamw
device_params = [torch.view_as_real(x) if torch.is_complex(x) else x for x in device_params]
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\torch\optim\adamw.py", line 530, in
device_params = [torch.view_as_real(x) if torch.is_complex(x) else x for x in device_params]
TypeError: 'Parameter' object is not callable
steps: 83%|█████████████████████████████████████████████▌ | 1010/1220 [10:03<02:05, 1.67it/s, avr_loss=0.101]
Traceback (most recent call last):
File "C:\Users\Anders\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\Anders\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in run_code
exec(code, run_globals)
File "D:\AI_pics\kohya_ss\venv\Scripts\accelerate.EXE_main
.py", line 7, in
sys.exit(main())
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
args.func(args)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
simple_launcher(args)
File "D:\AI_pics\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['D:\AI_pics\kohya_ss\venv\Scripts\python.exe', 'D:/AI_pics/kohya_ss/sd-scripts/sdxl_train_network.py', '--config_file', 'D:/AI_pics/Katka_tensors\model/config_lora-20241225-151134.toml']' returned non-zero exit status 3221225477.
15:22:18-633084 INFO Training has ended.
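
One detail that is easy to miss: exit status 3221225477 is not a normal Python failure. It is the Windows status code 0xC0000005 (access violation), meaning the training subprocess died in native code (driver, CUDA, or a compiled extension) rather than cleanly raising an exception; the oddly impossible `TypeError` printed just before it fits that picture. A quick check of the conversion:

```python
# 3221225477 decimal is the Windows NTSTATUS code 0xC0000005 (STATUS_ACCESS_VIOLATION).
print(hex(3221225477))  # -> 0xc0000005
```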
