
RuntimeError: "Some background workers are no longer alive" during nnUNetv2_train Validation #2607

Open
mtw2156 opened this issue Nov 18, 2024 · 5 comments

@mtw2156

mtw2156 commented Nov 18, 2024

Hi,

I wanted to share an issue regarding validation cases. I am using nnUNet on the HaNSeg dataset. I trained my model on all 5 folds with a custom configuration without any issues, and the training logs do not indicate any problems during training or while predicting the validation cases. However, the validation predictions were not saved (or only some of them were), so I am re-running validation with "nnUNetv2_train --val". It works for some of the cases but usually crashes before reaching the end; it is also very expensive computationally and often falls back to the CPU. I then created a new configuration, ran the dataset through preprocessing for it, transferred the trained model files over, and ran validation again. I still get the crash, although the predictions now stay on the GPU. The same happens for the other folds.

Any help would be appreciated!
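
For reference, since only some of the validation predictions get written before the crash, the remaining cases can be listed by comparing the fold's validation split against the files already in the validation output folder. This is just a rough sketch; the paths and the results folder naming below are assumptions and need to be adjusted to your own setup.

# Hypothetical helper to list validation cases that still lack a prediction.
# The paths, the trainer/plans folder name, and the ".nii.gz" suffix are
# assumptions; adjust them to your dataset and configuration.
import json
from pathlib import Path

fold = 4
splits_file = Path('/path/to/nnUNet_preprocessed/Dataset999_HaNSeg/splits_final.json')
validation_dir = Path('/path/to/nnUNet_results/Dataset999_HaNSeg/'
                      'nnUNetTrainer__nnUNetPlans__3d_fullres_v3/fold_4/validation')

with open(splits_file) as f:
    splits = json.load(f)

expected = set(splits[fold]['val'])
present = {p.name.removesuffix('.nii.gz') for p in validation_dir.glob('*.nii.gz')}
missing = sorted(expected - present)

print(f'{len(missing)} of {len(expected)} validation cases are still missing:')
for case in missing:
    print(' ', case)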

Here is my input and output.

CUDA_VISIBLE_DEVICES=0 nnUNetv2_train 999 3d_fullres_v3 4 --val
Using device: cuda:0

#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################

2024-11-15 14:51:26.293600: Using splits from existing split file: /data/intern/Matt_Wilson_2024/nnUnet_preprocessed/Dataset999_HaNSeg/splits_final.json
2024-11-15 14:51:26.302730: The split file contains 5 splits.
2024-11-15 14:51:26.302880: Desired fold for training: 4
2024-11-15 14:51:26.302988: This split has 34 training and 6 validation cases.
2024-11-15 14:51:26.303285: predicting case_02
2024-11-15 14:51:26.643660: case_02, shape torch.Size([1, 136, 466, 466]), rank 0
2024-11-15 14:53:46.883094: predicting case_37
2024-11-15 14:53:47.123915: case_37, shape torch.Size([1, 124, 385, 385]), rank 0
2024-11-15 14:54:49.823392: predicting case_38
2024-11-15 14:54:50.046503: case_38, shape torch.Size([1, 136, 357, 357]), rank 0
2024-11-15 14:56:14.719611: predicting case_39
2024-11-15 14:56:15.031698: case_39, shape torch.Size([1, 135, 425, 425]), rank 0
2024-11-15 14:58:25.082591: predicting case_40
2024-11-15 14:58:25.407599: case_40, shape torch.Size([1, 126, 400, 400]), rank 0
Traceback (most recent call last):
File "/home/mtw2156/anaconda3/envs/env/lib/python3.10/multiprocessing/resource_sharer.py", line 138, in _serve
with self._listener.accept() as conn:
File "/home/mtw2156/anaconda3/envs/env/lib/python3.10/multiprocessing/connection.py", line 466, in accept
answer_challenge(c, self._authkey)
File "/home/mtw2156/anaconda3/envs/env/lib/python3.10/multiprocessing/connection.py", line 757, in answer_challenge
response = connection.recv_bytes(256) # reject large message
File "/home/mtw2156/anaconda3/envs/env/lib/python3.10/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/mtw2156/anaconda3/envs/env/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
buf = self._recv(4)
File "/home/mtw2156/anaconda3/envs/env/lib/python3.10/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Traceback (most recent call last):
File "/home/mtw2156/anaconda3/envs/env/bin/nnUNetv2_train", line 8, in
sys.exit(run_training_entry())
File "/data/intern/Matt_Wilson_2024/nnUnet_preprocessed/Dataset999_HaNSeg/nnUNet/nnUNet/nnUNet/nnunetv2/run/run_training.py", line 268, in run_training_entry
run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
File "/data/intern/Matt_Wilson_2024/nnUnet_preprocessed/Dataset999_HaNSeg/nnUNet/nnUNet/nnUNet/nnunetv2/run/run_training.py", line 208, in run_training
nnunet_trainer.perform_actual_validation(export_validation_probabilities)
File "/data/intern/Matt_Wilson_2024/nnUnet_preprocessed/Dataset999_HaNSeg/nnUNet/nnUNet/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1183, in perform_actual_validation
proceed = not check_workers_alive_and_busy(segmentation_export_pool, worker_list, results,
File "/data/intern/Matt_Wilson_2024/nnUnet_preprocessed/Dataset999_HaNSeg/nnUNet/nnUNet/nnUNet/nnunetv2/utilities/file_path_utilities.py", line 103, in check_workers_alive_and_busy
raise RuntimeError('Some background workers are no longer alive')
RuntimeError: Some background workers are no longer alive
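
For anyone wondering where this error comes from: the raise happens in check_workers_alive_and_busy, which guards the pool of background processes that export the predicted segmentations to disk. The snippet below is only a minimal illustration of that guard pattern, not the actual nnU-Net source: if a worker process dies (for example because the Linux OOM killer terminates it when system RAM runs out), the main process merely sees that the worker is no longer alive and raises this exact RuntimeError.

# Simplified sketch of a "workers still alive?" guard around a pool of
# background processes. Illustration only, not the nnU-Net implementation.
import multiprocessing as mp


def _noop():
    pass


def check_workers_alive(worker_list):
    # A worker killed externally (e.g. SIGKILL from the OOM killer) simply
    # disappears; the parent only notices that is_alive() is now False.
    if not all(p.is_alive() for p in worker_list):
        raise RuntimeError('Some background workers are no longer alive')


if __name__ == '__main__':
    workers = [mp.Process(target=_noop) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()  # after join() the workers have exited, so is_alive() is False
    try:
        check_workers_alive(workers)
    except RuntimeError as e:
        print(e)  # same message as in the traceback above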

@mtw2156 mtw2156 changed the title Issue regarding validation cases RuntimeError: "Some background workers are no longer alive" during nnUNetv2_train Validation Nov 18, 2024
@sunburstillend

Hello,

I am also experiencing the issue mentioned above. Here is part of my logs for context:

2024-11-21 15:41:56.820905: This split has 157 training and 39 validation cases.
2024-11-21 15:41:56.821133: predicting case_196_0004
2024-11-21 15:41:56.823359: case_196_0004, shape torch.Size([1, 448, 946, 448]), rank 0
2024-11-21 15:43:39.985436: predicting case_196_0007
2024-11-21 15:43:40.018088: case_196_0007, shape torch.Size([1, 512, 755, 512]), rank 0
2024-11-21 15:45:33.126272: predicting case_196_0009
2024-11-21 15:45:33.154442: case_196_0009, shape torch.Size([1, 575, 399, 575]), rank 0
2024-11-21 15:46:56.187559: predicting case_196_0010
2024-11-21 15:46:56.204679: case_196_0010, shape torch.Size([1, 575, 399, 575]), rank 0
2024-11-21 15:48:36.251001: predicting case_196_0019
2024-11-21 15:48:36.273754: case_196_0019, shape torch.Size([1, 512, 647, 512]), rank 0
2024-11-21 15:50:14.743588: predicting case_196_0026
2024-11-21 15:50:14.771919: case_196_0026, shape torch.Size([1, 467, 642, 467]), rank 0
2024-11-21 15:51:34.419982: predicting case_196_0027
2024-11-21 15:51:34.440981: case_196_0027, shape torch.Size([1, 636, 722, 636]), rank 0
2024-11-21 15:54:19.972141: predicting case_196_0031
2024-11-21 15:54:20.014876: case_196_0031, shape torch.Size([1, 447, 2073, 447]), rank 0
Prediction on device was unsuccessful, probably due to a lack of memory. Moving results arrays to CPU
2024-11-21 16:02:21.664565: predicting case_196_0033
2024-11-21 16:02:21.720253: case_196_0033, shape torch.Size([1, 639, 940, 639]), rank 0
W1121 16:02:22.355000 139797994328448 torch/_dynamo/convert_frame.py:357] torch._dynamo hit config.cache_size_limit (8)
W1121 16:02:22.355000 139797994328448 torch/_dynamo/convert_frame.py:357] function: 'forward' (/home/user/anaconda3/envs/nnunet-2.4/lib/python3.10/site-packages/dynamic_network_architectures/architectures/unet.py:116)
W1121 16:02:22.355000 139797994328448 torch/_dynamo/convert_frame.py:357] last reason: tensor 'L['x']' stride mismatch at index 0. expected 189865984, actual 383821740
W1121 16:02:22.355000 139797994328448 torch/_dynamo/convert_frame.py:357] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W1121 16:02:22.355000 139797994328448 torch/_dynamo/convert_frame.py:357] To diagnose recompilation issues, see https://pytorch.org/docs/master/compile/troubleshooting.html.
Prediction on device was unsuccessful, probably due to a lack of memory. Moving results arrays to CPU

It seems the memory issue is forcing the process to switch to the CPU, and I am also seeing warnings related to torch._dynamo. What could be causing this?

Thank you in advance !!
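
A note on the "Prediction on device was unsuccessful ... Moving results arrays to CPU" lines: when the GPU runs out of memory while aggregating the sliding-window predictions, the predictor retries with the result arrays kept in CPU RAM, which is far slower and puts the pressure on system memory instead. The snippet below is only a hedged sketch of that general fallback pattern, with run_sliding_window as a hypothetical stand-in for the real prediction routine; it is not nnU-Net's actual code.

# Hedged sketch of the GPU -> CPU fallback pattern behind the
# "Moving results arrays to CPU" message. run_sliding_window is a
# hypothetical stand-in for the real prediction routine.
import torch


def predict_with_fallback(run_sliding_window, data: torch.Tensor) -> torch.Tensor:
    try:
        # First attempt: keep the (potentially huge) aggregation arrays on the GPU.
        return run_sliding_window(data, results_device=torch.device('cuda', 0))
    except RuntimeError as e:
        if 'out of memory' not in str(e).lower():
            raise
        # Very large cases (e.g. shape [1, 447, 2073, 447] above) can exhaust VRAM.
        torch.cuda.empty_cache()
        print('Prediction on device was unsuccessful, probably due to a lack of '
              'memory. Moving results arrays to CPU')
        # Retrying with the results in system RAM can in turn exhaust RAM, at
        # which point the OOM killer may terminate one of the export workers.
        return run_sliding_window(data, results_device=torch.device('cpu'))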

@mtw2156
Author

mtw2156 commented Nov 21, 2024

Additionally, for reference, the comment above from @sunburstillend documents the same behavior I am seeing: the out-of-memory fallback to the CPU followed by the torch._dynamo warnings.


@mtw2156 mtw2156 closed this as completed Nov 21, 2024
@mtw2156
Author

mtw2156 commented Nov 21, 2024

I have also seen warnings related to torch._dynamo!

Thanks to anyone who provides help!!

@mtw2156 mtw2156 reopened this Nov 21, 2024
@FloTenPadel

Hello all.
I am experiencing similar problems. Did anyone find a solution to this?
Thank you, @seziegler

@seziegler
Member

Hi all,
I'm not sure what the exact cause is, but first of all make sure to monitor your RAM and VRAM during inference.
Another thing you can try is running nnU-Net without compilation, since the torch._dynamo warnings come from torch.compile. You can do that by setting the nnUNet_compile environment variable, e.g.
nnUNet_compile=False nnUNetv2_train DATASET_NAME_OR_ID CONFIGURATION FOLD --val
Hope it helps!
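
To make the monitoring suggestion concrete, here is a small, hypothetical helper that can run in a second terminal while validation is executing. It assumes the extra dependency psutil is installed (pip install psutil), and the 5-second interval is arbitrary.

# Hypothetical monitoring helper (not part of nnU-Net). Requires psutil.
import time

import psutil
import torch


def log_memory(interval_s: float = 5.0) -> None:
    # Print free/total system RAM and GPU VRAM every interval_s seconds.
    while True:
        vm = psutil.virtual_memory()
        ram = f'RAM free: {vm.available / 1e9:.1f} / {vm.total / 1e9:.1f} GB'
        if torch.cuda.is_available():
            free_b, total_b = torch.cuda.mem_get_info(0)
            vram = f'VRAM free: {free_b / 1e9:.1f} / {total_b / 1e9:.1f} GB'
        else:
            vram = 'VRAM: no CUDA device visible'
        print(ram, '|', vram)
        time.sleep(interval_s)


if __name__ == '__main__':
    # If free RAM shrinks steadily right before the crash, the OOM killer
    # terminating an export worker is a likely cause of the
    # "Some background workers are no longer alive" error.
    log_memory()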
