
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/THCReduceAll.cuh:317 #9

Open
jalajthanaki opened this issue Jul 21, 2020 · 1 comment



jalajthanaki commented Jul 21, 2020

❓ Questions and Help

Hi Shamil Chollampatt, Weiqi Wang, and Hwee Tou Ng,

I'm following the CrossSent paper and trying out this approach. When I run the code on a smaller dataset it runs perfectly, but when I increase the dataset roughly tenfold I get the following error.

pytorch/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [37,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCReduceAll.cuh line=317 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "fairseq/train.py", line 352, in <module>
    multiprocessing_main(args)
  File "fairseq/multiprocessing_train.py", line 40, in main
    p.join()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 124, in join
    res = self._popen.wait(timeout)
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 50, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "fairseq/multiprocessing_train.py", line 82, in signal_handler
    raise Exception(msg)
Exception:

-- Tracebacks above this line can probably be ignored --

Traceback (most recent call last):
  File "fairseq/multiprocessing_train.py", line 46, in run
    single_process_main(args)
  File "fairseq/train.py", line 87, in main
    train(args, trainer, task, epoch_itr)
  File "fairseq/train.py", line 125, in train
    log_output = trainer.train_step(sample, update_params=True)
  File "fairseq/fairseq/trainer.py", line 117, in train_step
    loss, sample_size, logging_output, oom_fwd = self._forward(sample)
  File "fairseq/fairseq/trainer.py", line 205, in _forward
    raise e
  File "fairseq/fairseq/trainer.py", line 197, in _forward
    loss, sample_size, logging_output_ = self.task.get_loss(self.model, self.criterion, sample)
  File "fairseq/fairseq/tasks/fairseq_task.py", line 49, in get_loss
    return criterion(model, sample)
  File "python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "fairseq/fairseq/criterions/label_smoothed_cross_entropy.py", line 36, in forward
    net_output = model(**sample['net_input'])
  File "python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "fairseq/fairseq/models/fairseq_model.py", line 146, in forward
    auxencoder_out = self.auxencoder(ctx_tokens, ctx_lengths)
  File "python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "fairseq/fairseq/models/fconv_dualenc_gec_gatedaux.py", line 193, in forward
    if not encoder_padding_mask.any():
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/THCReduceAll.cuh:317
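For reference, my understanding of the assertion above (a minimal sketch in plain PyTorch, my own illustration and not the actual fairseq code) is that it fires when an embedding lookup receives a token id that is greater than or equal to the embedding table size:

    # Minimal illustration (my own sketch, not the fairseq code): an out-of-range
    # token id in an embedding lookup trips the srcIndex < srcSelectDimSize assert
    # on GPU; on CPU the same lookup raises a readable index-out-of-range error.
    import torch
    import torch.nn as nn

    emb = nn.Embedding(num_embeddings=10, embedding_dim=4)
    tokens = torch.tensor([[1, 2, 10]])  # 10 is out of range for a 10-entry table

    emb(tokens)                     # CPU: readable index-out-of-range error
    # emb.cuda()(tokens.cuda())     # GPU: device-side assert (cuda runtime error 59)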

What have you tried?

As suggested in a few GitHub issues and PyTorch forum questions, I have run my code with CUDA_LAUNCH_BLOCKING=1; the following is my error log:

Traceback (most recent call last):
  File "/fairseq/train.py", line 352, in <module>
    multiprocessing_main(args)
  File "/fairseq/multiprocessing_train.py", line 40, in main
    p.join()
  File "/opt/anaconda/lib/python3.7/multiprocessing/process.py", line 140, in join
    res = self._popen.wait(timeout)
  File "/opt/anaconda/lib/python3.7/multiprocessing/popen_fork.py", line 48, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/opt/anaconda/lib/python3.7/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "/fairseq/multiprocessing_train.py", line 82, in signal_handler
    raise Exception(msg)
Exception:
-- Tracebacks above this line can probably be ignored --

Traceback (most recent call last):
  File "/fairseq/multiprocessing_train.py", line 46, in run
    single_process_main(args)
  File "/fairseq/train.py", line 35, in main
    load_dataset_splits(args, task, ['train', 'valid'])
  File "/fairseq/train.py", line 333, in load_dataset_splits
    task.load_dataset(split_k)
  File "/fairseq/fairseq/tasks/translation_ctx.py", line 105, in load_dataset
    ctx_dataset = indexed_dataset(prefix + 'ctx', self.ctx_dict)
  File "/fairseq/fairseq/tasks/translation_ctx.py", line 98, in indexed_dataset
    return IndexedRawTextDataset(path, dictionary)
  File "/fairseq/fairseq/data/indexed_dataset.py", line 130, in __init__
    self.read_data(path, dictionary)
  File "/fairseq/fairseq/data/indexed_dataset.py", line 136, in read_data
    self.lines.append(line.strip('\n'))
MemoryError
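For what it is worth, my reading of IndexedRawTextDataset (a rough sketch of my assumption, not the exact fairseq source) is that read_data keeps every raw line in an in-memory list, so host RAM usage grows with the size of the ctx file, which would explain the MemoryError appearing only on the 10x dataset:

    # Rough sketch (my assumption, not the exact fairseq implementation) of why
    # IndexedRawTextDataset.read_data can exhaust host RAM: every raw line is
    # appended to a Python list, so memory grows linearly with the raw ctx file.
    lines = []
    with open('train.ctx', 'r', encoding='utf-8') as f:  # 'train.ctx' is a placeholder path
        for line in f:
            lines.append(line.strip('\n'))               # the line that fails in the traceback above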

Question/Help?

In my understanding, if this were a memory constraint, CUDA should throw an out-of-memory error rather than this one. From what I have read, cuda runtime error (59): device-side assert is usually triggered by an out-of-bound index or a faulty loss function. That should not be the case here, because the same code runs smoothly on the smaller dataset and only fails on the larger one. Hence, I'm posting this question here for further help.

Is there anything I need to check in order to resolve this issue, or am I missing something?
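One check I am planning to run (a sketch under my own assumptions with placeholder names, not fairseq code) is to scan the prepared ctx data and verify that every token id is below the dictionary size, since an id greater than or equal to the embedding size would match the srcIndex < srcSelectDimSize assert above:

    # Hedged sketch with placeholder names ('dataset' iterates token-id tensors,
    # 'vocab_size' is the dictionary length): report any out-of-range ids.
    def check_token_ids(dataset, vocab_size):
        for i, tokens in enumerate(dataset):
            bad = [t for t in tokens.tolist() if t < 0 or t >= vocab_size]
            if bad:
                print("sample %d has out-of-range ids: %s" % (i, bad[:10]))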

What's your environment?

  • fairseq Version (e.g., 1.0 or master): 0.5
  • PyTorch Version (e.g., 1.0) : 0.4.1
  • OS (e.g., Linux): Ubuntu 18.04
  • How you installed fairseq (pip, source): pip
  • Build command you used (if compiling from source): NA
  • Python version: 3.6
  • CUDA/cuDNN version: 9.2
  • GPU models and configuration: 8 GPUs (V100)

Any help, support, and direction is highly appreciated.

Thanks

jalajthanaki (Author) commented:

Hi @shamilcm and @pidugusundeep,

Here are the links to the corresponding fairseq GitHub issue and PyTorch GitHub issue.
