I tried to run ColBERT model inference via Triton server on a multi-GPU instance.
GPU 0 works fine. However, the other GPU devices (1, 2, 3, etc.) crash when executing this line
D_packed @ Q.to(dtype=D_packed.dtype).T
with no error message.
Has anyone seen the same error before?
The C++ method decompress_residuals_cuda is restricted to GPU device 0 only, so it crashes when run on any other GPU.
After updating it to .device(torch::kCUDA, residuals.device().index()), the crash is resolved.
Should we update it to .device(torch::kCUDA, residuals.device().index())? This would also significantly improve inference efficiency by enabling model inference across multiple GPUs.
I'm wondering if this is a bug or intentional by design.
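For illustration, here is a minimal sketch of the kind of change described above (the function signature and output shape are simplified assumptions, not the actual ColBERT extension code). The idea is that the output tensor options and the current CUDA device should follow the input tensor instead of being pinned to device 0:

#include <torch/extension.h>
#include <c10/cuda/CUDAGuard.h>

torch::Tensor decompress_residuals_cuda(const torch::Tensor& residuals) {
    // Make sure kernel launches and allocations target the GPU that
    // actually holds `residuals`, not the default device 0.
    const c10::cuda::CUDAGuard device_guard(residuals.device());

    // Before the fix: .device(torch::kCUDA, 0) pinned the output to GPU 0,
    // which crashes when the inputs live on GPU 1, 2, 3, ...
    auto options = torch::TensorOptions()
                       .dtype(torch::kFloat32)
                       .device(torch::kCUDA, residuals.device().index());

    // Shape is an assumption for this sketch; the real kernel decides it.
    auto output = torch::zeros({residuals.size(0), residuals.size(1)}, options);
    // ... launch the decompression kernel here ...
    return output;
}

Deriving the device index from the input tensor is the usual pattern in PyTorch C++ extensions, since it lets the same binary run on whichever GPU a given Triton model instance is assigned to.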