Hi everyone! I ran into a strange bug that has confused me for several days. Sometimes the model runs into this error after dozens of epochs (e.g., 40, 80, or 100), and sometimes the error never appears at all. When the model is resumed from a checkpoint saved before the error, the error may or may not show up again. Has anyone seen this before? Any reply would be appreciated.
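In case it helps with reproducing or narrowing this down: CUDA kernel launches are asynchronous, so the line reported in a "misaligned address" traceback is often not the op that actually faulted. Below is a minimal sketch (an assumption, not the exact code I run) of how the job could be rerun with synchronous launches plus an explicit synchronization helper; the helper name check_cuda and its tag argument are made up for illustration.

# Sketch (assumption): force synchronous CUDA kernel launches so the error is
# raised at the operation that actually triggered it. The variable must be set
# before the CUDA context is created, i.e. before the first CUDA call.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the environment variable is set

def check_cuda(tag):
    # Synchronize and surface any pending asynchronous CUDA error with a label,
    # e.g. check_cuda("after forward") / check_cuda("after backward") inside train().
    try:
        torch.cuda.synchronize()
    except RuntimeError as e:
        print("CUDA error {}: {}".format(tag, e))
        raise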
When I use py36+torch1.4+cuda10.0, it shows:
Traceback (most recent call last):
File "/xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "/xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
input = module(input)
File "/xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "/xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 87, in forward
return F.linear(input, self.weight, self.bias)
File "/xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/nn/functional.py", line 1370, in linear
ret = torch.addmm(bias, input, weight.t())
RuntimeError: CUDA error: misaligned address
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: misaligned address (insert_events at /opt/conda/conda-bld/pytorch_1579027003190/work/c10/cuda/CUDACachingAllocator.cpp:764)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f172f1a1627 in /xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: + 0x1ab04 (0x7f172f3e1b04 in /xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: + 0x1cbd1 (0x7f172f3e3bd1 in /xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x4d (0x7f172f18eb9d in /xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #4: + 0x6871fa (0x7f17606161fa in /xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #20: __libc_start_main + 0xe7 (0x7f1772122b97 in /lib/x86_64-linux-gnu/libc.so.6)
When I use py35+torch0.4+cuda9.0, it shows:
Traceback (most recent call last):
File "main.py", line 329, in
main()
File "main.py", line 128, in main
train(train_loader, model, criterion, optimizer, epoch, log_training)
File "main.py", line 170, in train
output = model(input_var)
File "/xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 123, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 133, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 77, in parallel_apply
raise output
File "/xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 53, in _worker
output = module(*input, **kwargs)
File "/xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/xxx/xxx/xxx/project/TRN-pytorch/models.py", line 220, in forward
base_out = self.base_model(input.view((-1, sample_len) + input.size()[-2:]))
File "/xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/xxx/xxx/xxx/project/TRN-pytorch/model_zoo/bninception/pytorch_load.py", line 57, in forward
data_dict[op[2]] = torch.cat(tuple(data_dict[x] for x in op[-1]), 1)
RuntimeError: cuda runtime error (74) : misaligned address at /opt/conda/conda-bld/pytorch_1533672544752/work/aten/src/THC/THCCachingHostAllocator.cpp:271
terminate called after throwing an instance of 'at::Error'
what(): CUDA error: invalid device pointer (CudaCachingDeleter at /opt/conda/conda-bld/pytorch_1533672544752/work/aten/src/THC/THCCachingAllocator.cpp:498)
frame #0: THStorage_free + 0x44 (0x7fc7bba51a04 in /xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #1: THTensor_free + 0x2f (0x7fc7bbaff66f in /xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #2: at::CUDAFloatTensor::~CUDAFloatTensor() + 0x9 (0x7fc7a64ac609 in /xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/lib/libcaffe2_gpu.so)
frame #3: torch::autograd::Variable::Impl::~Impl() + 0x1f7 (0x7fc7bd6c62d7 in /xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so)
frame #4: torch::autograd::Variable::Impl::~Impl() + 0x9 (0x7fc7bd6c6429 in /xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so)
frame #5: + 0x6e8a44 (0x7fc7bd6dda44 in /xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so)
frame #6: + 0x6e8b24 (0x7fc7bd6ddb24 in /xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so)
frame #23: __libc_start_main + 0xe7 (0x7fc7cec3bb97 in /lib/x86_64-linux-gnu/libc.so.6)