
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/THCReduceAll.cuh:317 #9

Open
jalajthanaki opened this issue Jul 21, 2020 · 1 comment



jalajthanaki commented Jul 21, 2020

❓ Questions and Help

Hi Shamil Chollampatt, Weiqi Wang, and Hwee Tou Ng,

I'm following the CrossSent paper and trying out this approach. When I run the code on a smaller dataset it runs perfectly, but when I increase the dataset roughly tenfold I get the following error.

pytorch/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [37,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCReduceAll.cuh line=317 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "fairseq/train.py", line 352, in <module>
    multiprocessing_main(args)
  File "fairseq/multiprocessing_train.py", line 40, in main
    p.join()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 124, in join
    res = self._popen.wait(timeout)
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 50, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "fairseq/multiprocessing_train.py", line 82, in signal_handler
    raise Exception(msg)
Exception:

-- Tracebacks above this line can probably be ignored --

Traceback (most recent call last):
  File "fairseq/multiprocessing_train.py", line 46, in run
    single_process_main(args)
  File "fairseq/train.py", line 87, in main
    train(args, trainer, task, epoch_itr)
  File "fairseq/train.py", line 125, in train
    log_output = trainer.train_step(sample, update_params=True)
  File "fairseq/fairseq/trainer.py", line 117, in train_step
    loss, sample_size, logging_output, oom_fwd = self._forward(sample)
  File "fairseq/fairseq/trainer.py", line 205, in _forward
    raise e
  File "fairseq/fairseq/trainer.py", line 197, in _forward
    loss, sample_size, logging_output_ = self.task.get_loss(self.model, self.criterion, sample)
  File "fairseq/fairseq/tasks/fairseq_task.py", line 49, in get_loss
    return criterion(model, sample)
  File "python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "fairseq/fairseq/criterions/label_smoothed_cross_entropy.py", line 36, in forward
    net_output = model(**sample['net_input'])
  File "python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "fairseq/fairseq/models/fairseq_model.py", line 146, in forward
    auxencoder_out = self.auxencoder(ctx_tokens, ctx_lengths)
  File "python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "fairseq/fairseq/models/fconv_dualenc_gec_gatedaux.py", line 193, in forward
    if not encoder_padding_mask.any():
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/THCReduceAll.cuh:317
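For reference, my understanding of the assertion above (a minimal sketch in plain PyTorch, my own illustration and not the actual fairseq code) is that it fires when an embedding lookup receives a token id that is greater than or equal to the embedding table size:

    # Minimal illustration (my own sketch, not the fairseq code): an out-of-range
    # token id in an embedding lookup trips the srcIndex < srcSelectDimSize assert
    # on GPU; on CPU the same lookup raises a readable index-out-of-range error.
    import torch
    import torch.nn as nn

    emb = nn.Embedding(num_embeddings=10, embedding_dim=4)
    tokens = torch.tensor([[1, 2, 10]])  # 10 is out of range for a 10-entry table

    emb(tokens)                     # CPU: readable index-out-of-range error
    # emb.cuda()(tokens.cuda())     # GPU: device-side assert (cuda runtime error 59)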

What have you tried?

As suggested in a few GitHub issues and PyTorch forum questions, I have run my code with CUDA_LAUNCH_BLOCKING=1; the following is my error log:

Traceback (most recent call last):
  File "/fairseq/train.py", line 352, in <module>
    multiprocessing_main(args)
  File "/fairseq/multiprocessing_train.py", line 40, in main
    p.join()
  File "/opt/anaconda/lib/python3.7/multiprocessing/process.py", line 140, in join
    res = self._popen.wait(timeout)
  File "/opt/anaconda/lib/python3.7/multiprocessing/popen_fork.py", line 48, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/opt/anaconda/lib/python3.7/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "/fairseq/multiprocessing_train.py", line 82, in signal_handler
    raise Exception(msg)
Exception:
-- Tracebacks above this line can probably be ignored --

Traceback (most recent call last):
  File "/fairseq/multiprocessing_train.py", line 46, in run
    single_process_main(args)
  File "/fairseq/train.py", line 35, in main
    load_dataset_splits(args, task, ['train', 'valid'])
  File "/fairseq/train.py", line 333, in load_dataset_splits
    task.load_dataset(split_k)
  File "/fairseq/fairseq/tasks/translation_ctx.py", line 105, in load_dataset
    ctx_dataset = indexed_dataset(prefix + 'ctx', self.ctx_dict)
  File "/fairseq/fairseq/tasks/translation_ctx.py", line 98, in indexed_dataset
    return IndexedRawTextDataset(path, dictionary)
  File "/fairseq/fairseq/data/indexed_dataset.py", line 130, in __init__
    self.read_data(path, dictionary)
  File "/fairseq/fairseq/data/indexed_dataset.py", line 136, in read_data
    self.lines.append(line.strip('\n'))
MemoryError
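For what it is worth, my reading of IndexedRawTextDataset (a rough sketch of my assumption, not the exact fairseq source) is that read_data keeps every raw line in an in-memory list, so host RAM usage grows with the size of the ctx file, which would explain the MemoryError appearing only on the 10x dataset:

    # Rough sketch (my assumption, not the exact fairseq implementation) of why
    # IndexedRawTextDataset.read_data can exhaust host RAM: every raw line is
    # appended to a Python list, so memory grows linearly with the raw ctx file.
    lines = []
    with open('train.ctx', 'r', encoding='utf-8') as f:  # 'train.ctx' is a placeholder path
        for line in f:
            lines.append(line.strip('\n'))               # the line that fails in the traceback above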

Question/Help?

In my understanding, if this were a memory constraint, CUDA should throw an out-of-memory error rather than this one. From what I have read, cuda runtime error (59): device-side assert is usually triggered by an out-of-bound index or a faulty loss function. That should not be the case here, because the same code runs smoothly on the smaller dataset and only fails on the larger one. Hence, I'm posting this question here for further help.

Is there anything I need to check in order to resolve this issue, or am I missing something?
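One check I am planning to run (a sketch under my own assumptions with placeholder names, not fairseq code) is to scan the prepared ctx data and verify that every token id is below the dictionary size, since an id greater than or equal to the embedding size would match the srcIndex < srcSelectDimSize assert above:

    # Hedged sketch with placeholder names ('dataset' iterates token-id tensors,
    # 'vocab_size' is the dictionary length): report any out-of-range ids.
    def check_token_ids(dataset, vocab_size):
        for i, tokens in enumerate(dataset):
            bad = [t for t in tokens.tolist() if t < 0 or t >= vocab_size]
            if bad:
                print("sample %d has out-of-range ids: %s" % (i, bad[:10]))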

What's your environment?

  • fairseq Version (e.g., 1.0 or master): 0.5
  • PyTorch Version (e.g., 1.0) : 0.4.1
  • OS (e.g., Linux): Ubuntu 18.04
  • How you installed fairseq (pip, source): pip
  • Build command you used (if compiling from source): NA
  • Python version: 3.6
  • CUDA/cuDNN version: 9.2
  • GPU models and configuration: 8 GPUs (V100)

Any help, support, and direction is highly appreciated.

Thanks

jalajthanaki (Author) commented:

Hi @shamilcm and @pidugusundeep,

Here are the links to the corresponding fairseq GitHub issue and PyTorch GitHub issue.
