I tried to run ColBERT model inference via Triton server on a multi-GPU instance.
GPU 0 works fine. However, the other GPU devices (1, 2, 3, etc.) crash when executing this line
D_packed @ Q.to(dtype=D_packed.dtype).T
with no error message.
Has anyone seen the same error before?
The C++ method decompress_residuals_cuda is restricted to GPU device 0 only, so it crashes when run on any other GPU.
After updating it to .device(torch::kCUDA, residuals.device().index()), the crash is resolved.
Should we update it to .device(torch::kCUDA, residuals.device().index())? This would also significantly improve inference efficiency by enabling model inference across multiple GPUs.
I'm wondering if this is a bug or intentional by design.
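For illustration, here is a minimal sketch of the kind of change described above (the function signature and output shape are simplified assumptions, not the actual ColBERT extension code). The idea is that the output tensor options and the current CUDA device should follow the input tensor instead of being pinned to device 0:

#include <torch/extension.h>
#include <c10/cuda/CUDAGuard.h>

torch::Tensor decompress_residuals_cuda(const torch::Tensor& residuals) {
    // Make sure kernel launches and allocations target the GPU that
    // actually holds `residuals`, not the default device 0.
    const c10::cuda::CUDAGuard device_guard(residuals.device());

    // Before the fix: .device(torch::kCUDA, 0) pinned the output to GPU 0,
    // which crashes when the inputs live on GPU 1, 2, 3, ...
    auto options = torch::TensorOptions()
                       .dtype(torch::kFloat32)
                       .device(torch::kCUDA, residuals.device().index());

    // Shape is an assumption for this sketch; the real kernel decides it.
    auto output = torch::zeros({residuals.size(0), residuals.size(1)}, options);
    // ... launch the decompression kernel here ...
    return output;
}

Deriving the device index from the input tensor is the usual pattern in PyTorch C++ extensions, since it lets the same binary run on whichever GPU a given Triton model instance is assigned to.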