Training randomly quit working. #3024

Open
Deejay85 opened this issue Dec 30, 2024 · 3 comments

Comments

@Deejay85

It's been a while, and training up and quit working for some unknown reason, so I tried a fresh install, and nothing changed. To make matters worse, I tried deleting the default_config.yaml file and reconfiguring it through setup.bat, and unlike last time, no dice. Sadly I'm a bit stupid when it comes to this programming stuff, but I gave the logs a look, realized I didn't have SD-Scripts in my folder, and adding it didn't fix things either...nice try though. I'm running out of things that worked before, so after getting everything back to its pre-configured state, here is the latest log.

20:40:03-867615 INFO     Kohya_ss GUI version: v24.1.7
fatal: not a git repository (or any of the parent directories): .git
20:40:04-132615 ERROR    Error during Git operation: Command '['git', 'submodule', 'update', '--init', '--recursive',
                         '--quiet']' returned non-zero exit status 128.
20:40:04-135615 INFO     nVidia toolkit detected
20:40:05-446617 INFO     Torch 2.1.2+cu118
20:40:05-460618 INFO     Torch backend: nVidia CUDA 11.8 cuDNN 8905
20:40:05-462615 INFO     Torch detected GPU: NVIDIA GeForce RTX 4090 VRAM 24564 Arch (8, 9) Cores 128
20:40:05-466618 INFO     Python version is 3.10.11 (tags/v3.10.11:7d4cc5a, Apr  5 2023, 00:38:17) [MSC v.1929 64 bit
                         (AMD64)]
20:40:05-467617 INFO     Verifying modules installation status from requirements_pytorch_windows.txt...
20:40:05-469619 INFO     Verifying modules installation status from requirements_windows.txt...
20:40:05-470616 INFO     Verifying modules installation status from requirements.txt...
20:40:12-111332 INFO     headless: False
20:40:12-146331 INFO     Using shell=True when running external commands...
M:\kohya_ss\venv\lib\site-packages\gradio\analytics.py:106: UserWarning: IMPORTANT: You are using gradio version 4.43.0, however version 4.44.1 is available, please upgrade.
--------
  warnings.warn(
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
20:40:23-479989 INFO     Loading config...
20:40:25-151985 INFO     Start training LoRA Standard ...
20:40:25-152985 INFO     Validating lr scheduler arguments...
20:40:25-153984 INFO     Validating optimizer arguments...
20:40:25-154984 INFO     Validating M:\SampleImages\SDXL\Konata\Log existence and writability... SUCCESS
20:40:25-154984 INFO     Validating M:\SampleImages\SDXL\Konata\Model existence and writability... SUCCESS
20:40:25-155985 INFO     Validating M:/Forge/webui/models/Stable-diffusion/SDXL/hentaiMixXLRoadTo_v50.safetensors
                         existence... SUCCESS
20:40:25-156985 INFO     Validating M:\SampleImages\SDXL\Konata\Images existence... SUCCESS
20:40:25-158985 INFO     Folder 2_Konata: 2 repeats found
20:40:25-159985 INFO     Folder 2_Konata: 20 images found
20:40:25-160985 INFO     Folder 2_Konata: 20 * 2 = 40 steps
20:40:25-161986 INFO     Regulatization factor: 1
20:40:25-161986 INFO     Total steps: 40
20:40:25-162984 INFO     Train batch size: 1
20:40:25-163984 INFO     Gradient accumulation steps: 1
20:40:25-163984 INFO     Epoch: 60
20:40:25-164984 INFO     max_train_steps (40 / 1 / 1 * 60 * 1) = 2400
20:40:25-165985 INFO     stop_text_encoder_training = 0
20:40:25-166983 INFO     lr_warmup_steps = 0
20:40:25-167985 INFO     Saving training config to M:\SampleImages\SDXL\Konata\Model\Konata_20241229-204025.json...
20:40:25-169986 INFO     Executing command: M:\kohya_ss\venv\Scripts\accelerate.EXE launch --dynamo_backend no
                         --dynamo_mode default --mixed_precision bf16 --num_processes 1 --num_machines 1
                         --num_cpu_threads_per_process 2 M:/kohya_ss/sd-scripts/sdxl_train_network.py --config_file
                         M:\SampleImages\SDXL\Konata\Model/config_lora-20241229-204025.toml
20:40:25-172985 INFO     Command executed.
2024-12-29 20:40:33 INFO     Loading settings from                                                    train_util.py:4519
                             M:\SampleImages\SDXL\Konata\Model/config_lora-20241229-204025.toml...
                    INFO     M:\SampleImages\SDXL\Konata\Model/config_lora-20241229-204025            train_util.py:4538
2024-12-29 20:40:34 INFO     Using DreamBooth method.                                               train_network.py:325
                    INFO     prepare images.                                                          train_util.py:1971
                    INFO     get image size from name of cache files                                  train_util.py:1886
100%|██████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<?, ?it/s]
                    INFO     set image size from cache files: 0/20                                    train_util.py:1916
                    INFO     found directory M:\SampleImages\SDXL\Konata\Images\2_Konata contains 20  train_util.py:1918
                             image files
read caption: 100%|██████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 3275.52it/s]
                    INFO     40 train images with repeating.                                          train_util.py:2012
                    INFO     0 reg images.                                                            train_util.py:2015
                    WARNING  no regularization images / 正則化画像が見つかりませんでした              train_util.py:2020
                    INFO     [Dataset 0]                                                              config_util.py:567
                               batch_size: 1
                               resolution: (1024, 1024)
                               enable_bucket: True
                               network_multiplier: 1.0
                               min_bucket_reso: 64
                               max_bucket_reso: 2048
                               bucket_reso_steps: 64
                               bucket_no_upscale: True

                               [Subset 0 of Dataset 0]
                                 image_dir: "M:\SampleImages\SDXL\Konata\Images\2_Konata"
                                 image_count: 20
                                 num_repeats: 2
                                 shuffle_caption: True
                                 keep_tokens: 1
                                 keep_tokens_separator:
                                 caption_separator: ,
                                 secondary_separator: None
                                 enable_wildcard: False
                                 caption_dropout_rate: 0.0
                                 caption_dropout_every_n_epochs: 0
                                 caption_tag_dropout_rate: 0.0
                                 caption_prefix: None
                                 caption_suffix: None
                                 color_aug: False
                                 flip_aug: False
                                 face_crop_aug_range: None
                                 random_crop: False
                                 token_warmup_min: 1
                                 token_warmup_step: 0
                                 alpha_mask: False
                                 custom_attributes: {}
                                 is_reg: False
                                 class_tokens: Konata
                                 caption_extension: .txt


                    INFO     [Dataset 0]                                                              config_util.py:573
                    INFO     loading image sizes.                                                      train_util.py:923
100%|████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 4115.90it/s]
                    INFO     make buckets                                                              train_util.py:946
                    WARNING  min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is   train_util.py:963
                             set, because bucket reso is defined by image size automatically /
                             bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計
                             算されるため、min_bucket_resoとmax_bucket_resoは無視されます
                    INFO     number of images (including repeats) /                                    train_util.py:992
                             各bucketの画像枚数(繰り返し回数を含む)
                    INFO     bucket 0: resolution (768, 1216), count: 2                                train_util.py:997
                    INFO     bucket 1: resolution (768, 1280), count: 2                                train_util.py:997
                    INFO     bucket 2: resolution (832, 1152), count: 12                               train_util.py:997
                    INFO     bucket 3: resolution (832, 1216), count: 4                                train_util.py:997
                    INFO     bucket 4: resolution (896, 1088), count: 2                                train_util.py:997
                    INFO     bucket 5: resolution (896, 1152), count: 2                                train_util.py:997
                    INFO     bucket 6: resolution (960, 1024), count: 2                                train_util.py:997
                    INFO     bucket 7: resolution (1024, 960), count: 2                                train_util.py:997
                    INFO     bucket 8: resolution (1024, 1024), count: 2                               train_util.py:997
                    INFO     bucket 9: resolution (1088, 832), count: 2                                train_util.py:997
                    INFO     bucket 10: resolution (1152, 832), count: 2                               train_util.py:997
                    INFO     bucket 11: resolution (1216, 768), count: 4                               train_util.py:997
                    INFO     bucket 12: resolution (1216, 832), count: 2                               train_util.py:997
                    INFO     mean ar error (without repeats): 0.013208133072419337                    train_util.py:1002
                    WARNING  clip_skip will be unexpected / SDXL学習ではclip_skipは動作しません   sdxl_train_util.py:351
                    INFO     preparing accelerator                                                  train_network.py:379
accelerator device: cuda
                    INFO     loading model for process 0/1                                         sdxl_train_util.py:32
                    INFO     load StableDiffusion checkpoint:                                      sdxl_train_util.py:73
                             M:/Forge/webui/models/Stable-diffusion/SDXL/hentaiMixXLRoadTo_v50.saf
                             etensors
2024-12-29 20:40:35 INFO     building U-Net                                                       sdxl_model_util.py:198
                    INFO     loading U-Net from checkpoint                                        sdxl_model_util.py:202
2024-12-29 20:40:52 INFO     U-Net: <All keys matched successfully>                               sdxl_model_util.py:208
                    INFO     building text encoders                                               sdxl_model_util.py:211
                    INFO     loading text encoders from checkpoint                                sdxl_model_util.py:264
2024-12-29 20:40:53 INFO     text encoder 1: <All keys matched successfully>                      sdxl_model_util.py:278
2024-12-29 20:40:59 INFO     text encoder 2: <All keys matched successfully>                      sdxl_model_util.py:282
                    INFO     building VAE                                                         sdxl_model_util.py:285
                    INFO     loading VAE from checkpoint                                          sdxl_model_util.py:290
                    INFO     VAE: <All keys matched successfully>                                 sdxl_model_util.py:293
import network module: networks.lora
2024-12-29 20:41:00 INFO     [Dataset 0]                                                              train_util.py:2495
                    INFO     caching latents with caching strategy.                                   train_util.py:1048
                    INFO     caching latents...                                                       train_util.py:1097
100%|████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 2500.18it/s]
2024-12-29 20:41:02 INFO     create LoRA network. base dim (rank): 32, alpha: 32                             lora.py:935
                    INFO     neuron dropout: p=None, rank dropout: p=None, module dropout: p=None            lora.py:936
                    INFO     create LoRA for Text Encoder 1:                                                lora.py:1027
                    INFO     create LoRA for Text Encoder 2:                                                lora.py:1027
                    INFO     create LoRA for Text Encoder: 264 modules.                                     lora.py:1035
2024-12-29 20:41:03 INFO     create LoRA for U-Net: 722 modules.                                            lora.py:1043
                    INFO     enable LoRA for text encoder: 264 modules                                      lora.py:1084
                    INFO     enable LoRA for U-Net: 722 modules                                             lora.py:1089
prepare optimizer, data loader etc.
                    INFO     use AdamW optimizer | {}                                                 train_util.py:4872
Traceback (most recent call last):
  File "M:\kohya_ss\sd-scripts\sdxl_train_network.py", line 228, in <module>
    trainer.train(args)
  File "M:\kohya_ss\sd-scripts\train_network.py", line 571, in train
    lr_scheduler = train_util.get_scheduler_fix(args, optimizer, accelerator.num_processes)
  File "M:\kohya_ss\sd-scripts\library\train_util.py", line 5128, in get_scheduler_fix
    if name == SchedulerType.COSINE_WITH_MIN_LR:
  File "C:\Users\Ande\AppData\Local\Programs\Python\Python310\lib\enum.py", line 437, in __getattr__
    raise AttributeError(name) from None
AttributeError: COSINE_WITH_MIN_LR
Traceback (most recent call last):
  File "C:\Users\Ande\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Ande\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "M:\kohya_ss\venv\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
    sys.exit(main())
  File "M:\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "M:\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "M:\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['M:\\kohya_ss\\venv\\Scripts\\python.exe', 'M:/kohya_ss/sd-scripts/sdxl_train_network.py', '--config_file', 'M:\\SampleImages\\SDXL\\Konata\\Model/config_lora-20241229-204025.toml']' returned non-zero exit status 1.
20:41:05-828023 INFO     Training has ended.
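
For reference, the `max_train_steps` line in the log can be reproduced by hand. This is only a minimal sketch of the arithmetic the log prints (`40 / 1 / 1 * 60 * 1`); the function and parameter names are hypothetical, and rounding up with `ceil` is an assumption rather than something the log confirms.

```python
import math


def max_train_steps(images, repeats, batch_size, grad_accum, epochs, reg_factor=1):
    """Reproduce the step arithmetic shown in the log (names are hypothetical)."""
    # Steps per epoch before batching: images * repeats (20 * 2 = 40 in the log).
    total_steps = images * repeats
    # Rounding up is an assumption; the log only shows an exact division.
    return math.ceil(total_steps / batch_size / grad_accum * epochs * reg_factor)


# Values taken from the log: 20 images, 2 repeats, batch size 1,
# gradient accumulation 1, 60 epochs, regularization factor 1.
print(max_train_steps(20, 2, 1, 1, 60))  # 2400
```

So the dataset and step settings look sane; the run fails later, at the point where the LR scheduler is set up.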
@Sylsatra

Try using Cosine or Cosine with restarts as the LR scheduler instead of Cosine with min LR.
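
The traceback above shows the script comparing against `SchedulerType.COSINE_WITH_MIN_LR` and the enum having no member by that name. One plausible cause (an assumption, not confirmed in this thread) is that the installed library providing that enum predates the member. Here is a rough sketch of how a missing enum member behaves and how it could be checked for safely; `OldSchedulerType`, `has_member`, and `resolve_scheduler` are hypothetical stand-ins, not code from sd-scripts or any library.

```python
from enum import Enum


class OldSchedulerType(Enum):
    # Hypothetical stand-in for a scheduler enum from an older library release
    # that does not define COSINE_WITH_MIN_LR.
    LINEAR = "linear"
    COSINE = "cosine"
    COSINE_WITH_RESTARTS = "cosine_with_restarts"


def has_member(enum_cls, member_name):
    """Return True if the enum defines a member with this name."""
    return member_name in enum_cls.__members__


def resolve_scheduler(enum_cls, wanted, fallback="COSINE"):
    """Return the wanted member, or the fallback member if it is missing."""
    if has_member(enum_cls, wanted):
        return enum_cls[wanted]
    return enum_cls[fallback]


# Direct attribute access on a missing member raises AttributeError,
# which matches the last line of the first traceback.
try:
    OldSchedulerType.COSINE_WITH_MIN_LR
except AttributeError as exc:
    print("missing member:", exc)

print(resolve_scheduler(OldSchedulerType, "COSINE_WITH_MIN_LR"))
```

Picking a scheduler that the enum does define sidesteps the failing lookup.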

@Erlandsson

It sometimes happens to me too; I just press start training again and it starts. But my training runs also stop with the same error in the middle of training, at completely random times.

@Deejay85
Author

Somehow that worked. I don't know why, since the old setting worked fine the last time I used it, but it works now. I don't think I updated anything, so I'm at a loss as to what is going on here.

Also, when I say random, I mean it worked fine a month ago and then up and died on me today, with nothing changed in between. This program, as useful as it is, is temperamental at best and seems to work only when it wants to. At least it wasn't something I overlooked, like running activate.bat with admin privileges. 🙄
