I recently implemented the Llama 3.1 model logic with TorchSharp. It works, and I was able to load pretrained bf16 weights using TorchSharp.PyBridge. However, I found that it takes on the order of 60 seconds to instantiate the model on my development machine. By "instantiate" I just mean creating an instance of the model class, which entails all of the object and tensor allocations but does not include the IO time to read the model weights from disk (which is actually very fast with my SSD). Note that these tensor allocations happen on the CPU device; I only transfer to my GPU after I've loaded the model weights. I observed the same behavior on Linux and Windows, and I determined that it is TorchSharp-specific, because the (roughly) equivalent PyTorch code instantiates much faster.
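For context, the flow looks roughly like this (a trimmed-down sketch, not my exact code: `LlamaModel`, its constructor arguments, and the checkpoint path are placeholders, and I'm loading the checkpoint via the `load_py` extension from TorchSharp.PyBridge):

```csharp
using System;
using System.Diagnostics;
using TorchSharp;
using TorchSharp.PyBridge;
using static TorchSharp.torch;

var sw = Stopwatch.StartNew();

// This is the slow part: constructing the module graph allocates every
// parameter tensor on the CPU device, one small native allocation at a time.
// LlamaModel and its arguments are placeholders for my actual model class.
using var model = new LlamaModel(vocabSize: 128256, dim: 4096, nLayers: 32);
Console.WriteLine($"Instantiation: {sw.Elapsed}");

sw.Restart();
// Reading the pretrained bf16 weights with TorchSharp.PyBridge is fast;
// disk IO is not the bottleneck.
model.load_py("llama-3.1.pt");
Console.WriteLine($"Weight load: {sw.Elapsed}");

// Only after the weights are loaded do I move the model to the GPU.
model.to(CUDA);
```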
I profiled my program on Windows to get an idea of what was taking so long, and it appears the process is spending almost all of its time in the kernel.
I haven't dug into this further, but my hunch is that a ton of small allocations are being performed. I remember reading somewhere (I think it was in the PyTorch docs) about how PyTorch caches memory and uses that cache to respond to allocation requests from user code. Model instantiation performance isn't a huge problem for me, because I only need to instantiate the model once in the lifetime of my application, but it would be nice to reach performance parity with Python.
Would it be possible to implement a similar allocation system in TorchSharp, and avoid the problem of context switching into the OS for lots of small allocations? By "possible" I just mean technically possible, I'm not looking for a commitment.
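To make the idea concrete, here is the kind of scheme I have in mind, sketched in plain C# (purely illustrative: a size-bucketed cache of freed native buffers, not a description of PyTorch's actual allocator or a proposal for TorchSharp's internals):

```csharp
using System;
using System.Collections.Concurrent;
using System.Runtime.InteropServices;

// Purely illustrative: reuse freed native buffers instead of calling into the
// OS for every small allocation.
static class CachingNativeAllocator
{
    // Freed buffers, keyed by their rounded-up size in bytes.
    private static readonly ConcurrentDictionary<long, ConcurrentBag<IntPtr>> _pool = new();

    public static IntPtr Allocate(long bytes)
    {
        long bucket = RoundUp(bytes);
        if (_pool.TryGetValue(bucket, out var bag) && bag.TryTake(out var recycled))
            return recycled;                        // served from the cache, no OS call
        return Marshal.AllocHGlobal((nint)bucket);  // cache miss: fall back to the OS
    }

    public static void Free(IntPtr ptr, long bytes)
    {
        // Instead of returning the memory to the OS, keep it for the next request
        // of the same bucket. A real implementation would cap the cache size and
        // eventually release memory back to the OS.
        _pool.GetOrAdd(RoundUp(bytes), _ => new ConcurrentBag<IntPtr>()).Add(ptr);
    }

    // Round sizes up to 512-byte buckets so near-identical requests share a pool.
    private static long RoundUp(long bytes) => (bytes + 511) & ~511L;
}
```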
Finally, if it is possible, what would be the right way to go about it?