❓ Questions and Help
Hi Shamil Chollampatt, Weiqi Wang, and Hwee Tou Ng,
I'm referring to the CrossSent paper and trying out this approach. The code runs perfectly on a smaller dataset, but when I increase the dataset size by 10 times, I get the following error.
pytorch/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [37,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCReduceAll.cuh line=317 error=59 : device-side assert triggered
Traceback (most recent call last):
File "fairseq/train.py", line 352, in <module>
multiprocessing_main(args)
File "fairseq/multiprocessing_train.py", line 40, in main
p.join()
File "/usr/lib/python3.6/multiprocessing/process.py", line 124, in join
res = self._popen.wait(timeout)
File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 50, in wait
return self.poll(os.WNOHANG if timeout == 0.0 else 0)
File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
pid, sts = os.waitpid(self.pid, flag)
File "fairseq/multiprocessing_train.py", line 82, in signal_handler
raise Exception(msg)
Exception:
-- Tracebacks above this line can probably be ignored --
Traceback (most recent call last):
File "fairseq/multiprocessing_train.py", line 46, in run
single_process_main(args)
File "fairseq/train.py", line 87, in main
train(args, trainer, task, epoch_itr)
File "fairseq/train.py", line 125, in train
log_output = trainer.train_step(sample, update_params=True)
File "fairseq/fairseq/trainer.py", line 117, in train_step
loss, sample_size, logging_output, oom_fwd = self._forward(sample)
File "fairseq/fairseq/trainer.py", line 205, in _forward
raise e
File "fairseq/fairseq/trainer.py", line 197, in _forward
loss, sample_size, logging_output_ = self.task.get_loss(self.model, self.criterion, sample)
File "fairseq/fairseq/tasks/fairseq_task.py", line 49, in get_loss
return criterion(model, sample)
File "python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "fairseq/fairseq/criterions/label_smoothed_cross_entropy.py", line 36, in forward
net_output = model(**sample['net_input'])
File "python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "fairseq/fairseq/models/fairseq_model.py", line 146, in forward
auxencoder_out = self.auxencoder(ctx_tokens, ctx_lengths)
File "python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "fairseq/fairseq/models/fconv_dualenc_gec_gatedaux.py", line 193, in forward
if not encoder_padding_mask.any():
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/THCReduceAll.cuh:317
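From what I understand, the `srcIndex < srcSelectDimSize` assertion comes from an embedding / index_select lookup receiving an index that is larger than the table it indexes into. Below is my own minimal sketch (plain PyTorch, not the fairseq code) of the kind of mistake that trips the same assertion; on CPU it fails with a readable IndexError instead of the asynchronous CUDA assert:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: a token id equal to (or larger than) the embedding
# table size triggers the same class of failure that the CUDA kernel
# reports as `srcIndex < srcSelectDimSize`.
emb = nn.Embedding(num_embeddings=100, embedding_dim=8)
tokens = torch.tensor([[1, 5, 100]])  # 100 is out of range for a 100-row table

try:
    emb(tokens)  # IndexError on CPU; device-side assert (error 59) on GPU
except IndexError as err:
    print("out-of-range token id:", err)
```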
What have you tried?
As suggested in a few GitHub issues and PyTorch forum threads, I reran my code with CUDA_LAUNCH_BLOCKING=1, and the following is the error log:
Traceback (most recent call last):
File "/fairseq/train.py", line 352, in <module>
multiprocessing_main(args)
File "/fairseq/multiprocessing_train.py", line 40, in main
p.join()
File "/opt/anaconda/lib/python3.7/multiprocessing/process.py", line 140, in join
res = self._popen.wait(timeout)
File "/opt/anaconda/lib/python3.7/multiprocessing/popen_fork.py", line 48, in wait
return self.poll(os.WNOHANG if timeout == 0.0 else 0)
File "/opt/anaconda/lib/python3.7/multiprocessing/popen_fork.py", line 28, in poll
pid, sts = os.waitpid(self.pid, flag)
File "/fairseq/multiprocessing_train.py", line 82, in signal_handler
raise Exception(msg)
Exception:
-- Tracebacks above this line can probably be ignored --
Traceback (most recent call last):
File "/fairseq/multiprocessing_train.py", line 46, in run
single_process_main(args)
File "/fairseq/train.py", line 35, in main
load_dataset_splits(args, task, ['train', 'valid'])
File "/fairseq/train.py", line 333, in load_dataset_splits
task.load_dataset(split_k)
File "/fairseq/fairseq/tasks/translation_ctx.py", line 105, in load_dataset
ctx_dataset = indexed_dataset(prefix + 'ctx', self.ctx_dict)
File "/fairseq/fairseq/tasks/translation_ctx.py", line 98, in indexed_dataset
return IndexedRawTextDataset(path, dictionary)
File "/fairseq/fairseq/data/indexed_dataset.py", line 130, in __init__
self.read_data(path, dictionary)
File "/fairseq/fairseq/data/indexed_dataset.py", line 136, in read_data
self.lines.append(line.strip('\n'))
MemoryError
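Note that this second trace ends in a plain Python MemoryError raised while IndexedRawTextDataset loads the raw context file into memory, i.e. host RAM rather than GPU memory. A sanity check I plan to run next, to see whether the enlarged corpus still matches the dictionary built for the small run, is sketched below (pure Python, no fairseq imports; the paths are placeholders for my own data layout):

```python
# Hypothetical sanity check; paths are placeholders, not values from the repo.
dict_path = "data-bin/dict.ctx.txt"   # fairseq dictionary: "<token> <count>" per line
corpus_path = "data/train.ctx"        # raw context corpus fed to IndexedRawTextDataset

# Load the dictionary built for the (working) small run.
vocab = set()
with open(dict_path, encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        if parts:
            vocab.add(parts[0])

# Stream the enlarged corpus instead of holding every line in memory,
# and count how many tokens the dictionary does not cover.
oov = 0
with open(corpus_path, encoding="utf-8") as f:
    for line in f:
        oov += sum(1 for tok in line.split() if tok not in vocab)

print(f"dictionary size: {len(vocab)}, uncovered tokens in corpus: {oov}")
```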
Question/Help?
In my understanding, if there were a memory constraint, CUDA should throw an out-of-memory error rather than this one. From what I have read, `cuda runtime error (59): device-side assert triggered` is usually caused by an out-of-bounds index or a faulty loss function. That shouldn't be the case here, because the same code runs smoothly on the smaller dataset and only fails on the larger one. Hence, I'm posting this question to get further help.
Is there anything I need to check in order to resolve this issue, or is there something I'm missing?
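One check I'm considering adding before the forward pass (the function and argument names here are my own assumptions, not taken from the fairseq code) is to assert that every context token id is a valid row of the auxiliary encoder's embedding table, so the failure surfaces with a readable message instead of the device-side assert:

```python
# Hypothetical debugging hook; names are assumptions, not fairseq internals.
def check_ctx_batch(ctx_tokens, embedding):
    """Fail fast with a readable message before the CUDA kernel asserts."""
    vocab_size = embedding.num_embeddings
    max_id = int(ctx_tokens.max())
    assert max_id < vocab_size, (
        f"context token id {max_id} is out of range for an "
        f"embedding table with {vocab_size} rows"
    )

# My guess at where it would be called (names assumed):
# check_ctx_batch(sample['net_input']['ctx_tokens'], auxencoder.embed_tokens)
```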
What's your environment?
fairseq Version (e.g., 1.0 or master): 0.5
PyTorch Version (e.g., 1.0): 0.4.1
OS (e.g., Linux): Ubuntu 18.04
How you installed fairseq (pip, source): pip
Build command you used (if compiling from source): NA
Python version: 3.6
CUDA/cuDNN version: 9.2
GPU models and configuration: 8 GPUs (V100)
Any help, support, or direction is highly appreciated.
Thanks