
No way to copy a tensor from gpu to cpu to pre allocated array. #1388

Open
LukePoga opened this issue Oct 19, 2024 · 6 comments

Comments

@LukePoga

There doesn't appear to be any way to transfer a result tensor into an existing CPU float array. The code below requires a new memory allocation:

    var cpuResult = gpuResult.cpu();
    float[] result = cpuResult.data<float>().ToArray();

If this is part of a loop, that is a lot of wasted memory allocation and time! Below is how libraries normally do things, e.g. CUDA:

float[] cpuResult; // pre-allocated further up
gpuResult.CopyToHost(cpuResult);

Maybe I missed this CopyTo, because it's kind of essential for any GPU-type library (?!)
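For reference, the desired pattern can be sketched with TorchSharp's existing accessor API. This is a hedged sketch: the helper name `CopyToHost` is hypothetical (it is not part of TorchSharp), and the intermediate CPU tensor is still allocated; only the per-call `float[]` allocation from `ToArray()` is avoided.

```csharp
using TorchSharp;
using static TorchSharp.torch;

static class TensorCopy
{
    // Hypothetical helper: fill a pre-allocated host buffer from a GPU tensor.
    // Note: gpuTensor.cpu() still allocates an intermediate CPU tensor; only
    // the repeated float[] allocation from ToArray() is avoided.
    public static void CopyToHost(Tensor gpuTensor, float[] destination)
    {
        using var cpuTensor = gpuTensor.cpu();
        cpuTensor.data<float>().CopyTo(destination);
    }
}
```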

Is this project maintained?

@haytham2597

haytham2597 commented Oct 21, 2024

One way to make a "fast" copy without overloading memory is to make the tensor contiguous and add this in TensorAccessor.cs, in the ToArray() function:

if (_tensor.is_contiguous()) {
    // This is very fast and works very well.
    var shps = _tensor.shape;
    long TempCount = 1;
    for (int i = 0; i < shps.Length; i++)
        TempCount *= shps[i]; // Theoretically, numel is just the product of the shape dimensions
    unsafe {
        return new Span<T>(_tensor_data_ptr.ToPointer(), Convert.ToInt32(TempCount)).ToArray();
    }
}

I added this in one commit of my Autocast pull request. I am trying to figure out how to apply the same idea when the tensor is not contiguous, because for the faster copy this way I always need to make the tensor contiguous first:

torch.Tensor te = /* ... */;
float[] data = te.contiguous().data<float>().ToArray();
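The same Span idea can also fill a caller-supplied buffer instead of allocating a fresh array with ToArray(). This is an illustrative sketch, not TorchSharp internals: `dataPtr` stands in for the accessor's internal `_tensor_data_ptr` field, and it assumes the tensor is contiguous.

```csharp
using System;

static class ContiguousCopy
{
    // Illustrative sketch: bulk-copy `count` floats from a contiguous tensor's
    // raw data pointer into a pre-allocated array, with no per-element loop.
    // `dataPtr` stands in for TensorAccessor's internal _tensor_data_ptr.
    public static unsafe void CopyContiguous(IntPtr dataPtr, int count, float[] destination)
    {
        var source = new Span<float>(dataPtr.ToPointer(), count);
        source.CopyTo(destination); // single memmove-style bulk copy
    }
}
```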

I noticed that if the tensor is not contiguous, the accessor always calls the Numel method, so it is recomputed every time.

Edit: Oh sorry, I misunderstood what you meant; I think it will work with CopyTo. You mean like this?

float[] data = new float[h * w * 3]; // pre-allocated at the top of the function, for example

// ... intensive functions and processing ...

tenGPU.data<float>().CopyTo(data); // `tenGPU` is a torch.Tensor allocated on the GPU

I will test this. If that doesn't work, I will soon investigate how to do it.

@haytham2597

I recently tested this and it works well.
[screenshot]

@LukePoga
Author

Great, thanks. I don't know why I didn't see CopyTo before.

tenGPU.data<float>().CopyTo(data); 

But it's not faster. This takes 340 ms for 12,000,000 floats, which is about 150 MB/s; that is extremely slow for PCIe bandwidth. Why is it so slow?

@haytham2597

@LukePoga
Because at TensorAccessor.cs L41 you can see that it calls _tensor.numel(), and inside the loop in GetSubsequentIndices, for example, Numel is called every time. That uses a lot of CPU and is slow: the function is called many times, and the accessor also iterates over the pointer array one element at a time, assigning into the pre-allocated array. So my solution was to modify that TensorAccessor for a fast copy, but it only works if the tensor is contiguous, so before CopyTo or ToArray() you should create a contiguous tensor like this:

torch.Tensor tenGPU;
// ...
tenGPU = tenGPU.contiguous();
// After that you can call tenGPU.data<T>().ToArray() or CopyTo.
tenGPU.data<T>().ToArray(); // or CopyTo

My fast TensorAccessor pre-computes Numel once (the product of all shape dimensions) and then creates a complete copy without looping over the pointer array:

//From my branch of TorchSharp/Utils/TensorAccessor.cs
unsafe {
    return new Span<T>(_tensor_data_ptr.ToPointer(), Convert.ToInt32(TempCount)).ToArray();
}

This does not iterate over the array assigning values index by index; it creates a complete copy in one operation.

Soon I will make a PR for a fast TensorAccessor, but remember that it will only be fast if the tensor is contiguous.
For the non-contiguous case I still need to figure out an approach; maybe it can be made a bit quicker by pre-computing Numel.
The non-contiguous case is more complex because of strides.
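To illustrate why the non-contiguous case is harder, here is a minimal sketch (not TorchSharp internals) of a strided 2-D gather: each element's offset must be computed from the strides, so no single bulk Span copy is possible. Shapes and strides are assumed to be in elements, as in libtorch, with `src` pointing at element (0, 0).

```csharp
static class StridedCopy
{
    // Sketch: gather a non-contiguous 2-D tensor into a flat destination array.
    // Each source offset is i * strides[0] + j * strides[1], so the copy must
    // visit elements one at a time instead of doing one bulk memory copy.
    public static unsafe void CopyStrided2D(
        float* src, long[] shape, long[] strides, float[] destination)
    {
        int k = 0;
        for (long i = 0; i < shape[0]; i++)
            for (long j = 0; j < shape[1]; j++)
                destination[k++] = src[i * strides[0] + j * strides[1]];
    }
}
```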

@LukePoga
Author

LukePoga commented Nov 1, 2024

Thanks for working on the PR. Do you know who can approve it?

#1396

@NiklasGustafsson
Contributor

> Thanks for working on the PR. Do you know who can approve it?
>
> #1396

Working on it. Changes remain to be made before it can be approved.
