I recently implemented the Llama 3.1 model logic with TorchSharp. It works, and I was able to load pretrained bf16 weights using TorchSharp.PyBridge. However, I found that it takes on the order of 60 seconds to instantiate the model on my development machine. By "instantiate" I just mean creating an instance of the model class, which entails all of the object and tensor allocations but does not include the IO time to read the model weights from disk (which is actually very fast with my SSD). Note that these tensor allocations happen on the CPU device; I only transfer to my GPU after I've loaded the model weights. I observed the same behavior on Linux and Windows, and I determined that it is TorchSharp-specific, because the (roughly) equivalent PyTorch code instantiates much faster.
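For context, the flow looks roughly like this (a trimmed-down sketch, not my exact code: `LlamaModel`, its constructor arguments, and the checkpoint path are placeholders, and I'm loading the checkpoint via the `load_py` extension from TorchSharp.PyBridge):

```csharp
using System;
using System.Diagnostics;
using TorchSharp;
using TorchSharp.PyBridge;
using static TorchSharp.torch;

var sw = Stopwatch.StartNew();

// This is the slow part: constructing the module graph allocates every
// parameter tensor on the CPU device, one small native allocation at a time.
// LlamaModel and its arguments are placeholders for my actual model class.
using var model = new LlamaModel(vocabSize: 128256, dim: 4096, nLayers: 32);
Console.WriteLine($"Instantiation: {sw.Elapsed}");

sw.Restart();
// Reading the pretrained bf16 weights with TorchSharp.PyBridge is fast;
// disk IO is not the bottleneck.
model.load_py("llama-3.1.pt");
Console.WriteLine($"Weight load: {sw.Elapsed}");

// Only after the weights are loaded do I move the model to the GPU.
model.to(CUDA);
```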
I profiled my program on Windows to get an idea of what was taking so long, and it appears the process is spending almost all of its time in the kernel.
I haven't dug into this further, but my hunch is that a ton of small allocations are being performed. I remember reading somewhere (I think it was in the PyTorch docs) about how PyTorch caches memory and uses that cache to respond to allocation requests from user code. Model instantiation performance isn't a huge problem for me, because I only need to instantiate the model once in the lifetime of my application, but it would be nice to reach performance parity with Python.
Would it be possible to implement a similar allocation system in TorchSharp, and avoid the problem of context switching into the OS for lots of small allocations? By "possible" I just mean technically possible, I'm not looking for a commitment.
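To make the idea concrete, here is the kind of scheme I have in mind, sketched in plain C# (purely illustrative: a size-bucketed cache of freed native buffers, not a description of PyTorch's actual allocator or a proposal for TorchSharp's internals):

```csharp
using System;
using System.Collections.Concurrent;
using System.Runtime.InteropServices;

// Purely illustrative: reuse freed native buffers instead of calling into the
// OS for every small allocation.
static class CachingNativeAllocator
{
    // Freed buffers, keyed by their rounded-up size in bytes.
    private static readonly ConcurrentDictionary<long, ConcurrentBag<IntPtr>> _pool = new();

    public static IntPtr Allocate(long bytes)
    {
        long bucket = RoundUp(bytes);
        if (_pool.TryGetValue(bucket, out var bag) && bag.TryTake(out var recycled))
            return recycled;                        // served from the cache, no OS call
        return Marshal.AllocHGlobal((nint)bucket);  // cache miss: fall back to the OS
    }

    public static void Free(IntPtr ptr, long bytes)
    {
        // Instead of returning the memory to the OS, keep it for the next request
        // of the same bucket. A real implementation would cap the cache size and
        // eventually release memory back to the OS.
        _pool.GetOrAdd(RoundUp(bytes), _ => new ConcurrentBag<IntPtr>()).Add(ptr);
    }

    // Round sizes up to 512-byte buckets so near-identical requests share a pool.
    private static long RoundUp(long bytes) => (bytes + 511) & ~511L;
}
```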
Finally, if it is possible, what would be the right way to go about it?