
nccl timeout on train_controlnet_flux.py when doing multigpu training #9936

Open
neuron-party opened this issue Nov 15, 2024 · 5 comments
Labels: bug (Something isn't working)

@neuron-party (Contributor)

Describe the bug

Running train_controlnet_flux.py with multiple GPUs results in an NCCL timeout error after N iterations of train_dataset.map(). The error can be partially worked around by initializing Accelerator with a larger timeout, like this:

from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs

# N is the timeout in seconds
x = InitProcessGroupKwargs(timeout=timedelta(seconds=N))

accelerator = Accelerator(
    ...,
    kwargs_handlers=[x],
)

However, the NCCL timeout error reoccurs at a later iteration of train_dataset.map().

Reproduction

accelerate launch --config_file configs/distributed train_controlnet_flux.py \
  --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
  --conditioning_image_column=conditioning_image \
  --image_column=image \
  --caption_column=text \
  --output_dir="path" \
  --mixed_precision="bf16" \
  --resolution=1024 \
  --learning_rate=5e-6 \
  --max_train_steps=100000 \
  --validation_steps=1000 \
  --checkpointing_steps=25000 \
  --validation_image "placeholder" \
  --validation_prompt "placeholder" \
  --train_batch_size=4 \
  --gradient_accumulation_steps=1 \
  --report_to="tensorboard" \
  --seed=42 \
  --jsonl_for_train="path" \
  --cache_dir="path"

Contents of configs/distributed:

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
use_cpu: false

Logs

Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

System Info

diffusers from source
accelerate == 1.1.1
datasets == 3.1.0
transformers == 4.46.2

Who can help?

No response

@neuron-party neuron-party added the bug Something isn't working label Nov 15, 2024
@sayakpaul (Member)

Can you try to increase the NCCL timeout value and see if that helps?

@neuron-party (Contributor, Author)

@sayakpaul I did, by passing the timeout arg when initializing the Accelerator object. Increasing it to a reasonable number delays the error to a later iteration; increasing it to too large a number causes a timeout of its own.

@sayakpaul (Member)

Okay. Then maybe precomputing the outputs of the dataset processing step would be more useful in this setup?
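
For reference, a rough sketch of what that precompute step could look like (the compute_embeddings stub, the column names, and the output path below are placeholders, not part of the script): run the map once outside the distributed launch, save the result with datasets' save_to_disk, and load it with load_from_disk at train time so no long map runs inside the NCCL process group.

# precompute_embeddings.py -- run once, on a single process, before accelerate launch.
import numpy as np
from datasets import load_dataset

def compute_embeddings(batch):
    # Placeholder: the real version would run the text encoders on batch["text"]
    # and return the actual prompt_embeds / pooled_prompt_embeds tensors.
    n = len(batch["text"])
    return {
        "prompt_embeds": np.zeros((n, 512, 4096), dtype=np.float16),
        "pooled_prompt_embeds": np.zeros((n, 768), dtype=np.float16),
    }

dataset = load_dataset("json", data_files="path/to/train.jsonl", split="train")
dataset = dataset.map(compute_embeddings, batched=True, batch_size=32)

# At train time, replace the in-script map with
# datasets.load_from_disk("precomputed_embeddings").
dataset.save_to_disk("precomputed_embeddings")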

@xduzhangjiayu (Contributor)

I trained an SD3 ControlNet and ran into the same issue. I also found that during multi-GPU training the text embeddings are computed on only one GPU, and I don't understand why.
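
For context, the diffusers training scripts typically wrap the embedding map in accelerator.main_process_first(), which is why only one GPU does the work while the other ranks wait at a barrier; if the map outlasts the process-group timeout, the waiting ranks raise the NCCL error above. A minimal, self-contained sketch of that pattern (not an exact excerpt from train_controlnet_flux.py; compute_embeddings_fn is a stub):

from accelerate import Accelerator
from datasets import Dataset

accelerator = Accelerator()

def compute_embeddings_fn(batch):
    # Stub for the text-encoder forward pass the real script runs here.
    return {"prompt_embeds": [[0.0] * 8 for _ in batch["text"]]}

train_dataset = Dataset.from_dict({"text": ["a photo", "a drawing"]})

# Rank 0 enters first and runs the map; the other ranks wait at the barrier
# when the context manager exits, then reuse rank 0's cache. A long map on
# rank 0 can therefore exceed the NCCL timeout on the waiting ranks.
with accelerator.main_process_first():
    train_dataset = train_dataset.map(compute_embeddings_fn, batched=True)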

@sayakpaul (Member)

Yeah that is how it's coded. For full-blown distributed support, I welcome you to check out https://github.com/huggingface/diffusers/blob/main/examples/research_projects/controlnet/train_controlnet_webdataset.py as a reference.

The training script is meant to serve as an educational reference.
