Memory Allocation Performance #1416

Open
mdcarr941 opened this issue Nov 26, 2024 · 0 comments

Hello!

I recently implemented the Llama 3.1 model logic with TorchSharp. It works, and I was able to load pretrained bf16 weights using TorchSharp.PyBridge. However, I found that it takes on the order of 60 seconds to instantiate the model on my development machine. By "instantiate" I mean creating an instance of the model class, which entails all of the object and tensor allocations but does not include the IO time to read the model weights from disk (which is actually very fast on my SSD). Note that these tensor allocations happen on the CPU device; I only transfer to the GPU after the model weights have been loaded. I observed the same behavior on Linux and Windows, and I determined that it is TorchSharp specific, because the (roughly) equivalent PyTorch code instantiates much faster.
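
Roughly, the split I'm describing looks like this (a minimal sketch; `LlamaModel`, `config`, and the weight path are placeholders for my actual code, and the load uses TorchSharp.PyBridge's `load_py` extension):

```csharp
using System;
using System.Diagnostics;
using TorchSharp;

var sw = Stopwatch.StartNew();
using var model = new LlamaModel(config);        // pure object/tensor allocation on the CPU -- ~60 s
Console.WriteLine($"instantiate: {sw.Elapsed}");

sw.Restart();
model.load_py("weights.pt");                     // TorchSharp.PyBridge weight load -- fast, SSD-bound
Console.WriteLine($"load:        {sw.Elapsed}");

model.to(torch.CUDA);                            // move to the GPU only after the weights are loaded
```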

I profiled my program on Windows to get an idea of what's taking so long, and it appears the process is spending almost all of its time in the kernel:
[Profiler screenshot: nearly all CPU time attributed to kernel mode]
I haven't dug into this further, but my hunch is that a ton of small allocations are being performed. I remember reading somewhere (I think it was in the PyTorch docs) about how PyTorch caches memory and uses the cache to respond to allocation requests from user code. Model instantiation performance isn't a huge problem for me, because I only need to instantiate the model once in the lifetime of my application, but it would be nice to reach performance parity with Python.
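
A micro-benchmark along these lines (hypothetical, I haven't measured this) would show whether per-allocation overhead really dominates:

```csharp
using System;
using System.Diagnostics;
using TorchSharp;
using static TorchSharp.torch;

var sw = Stopwatch.StartNew();
for (int i = 0; i < 10_000; i++)
{
    // Many small CPU allocations, each going through the native allocator.
    using var t = torch.empty(new long[] { 64, 64 }, dtype: ScalarType.BFloat16);
}
Console.WriteLine($"10,000 small tensors: {sw.Elapsed}");

sw.Restart();
// One allocation of the same total size, for comparison.
using var big = torch.empty(new long[] { 10_000, 64, 64 }, dtype: ScalarType.BFloat16);
Console.WriteLine($"1 large tensor:       {sw.Elapsed}");
```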

Would it be possible to implement a similar allocation system in TorchSharp, and avoid the cost of context switching into the OS for lots of small allocations? By "possible" I just mean technically possible; I'm not looking for a commitment.

Finally, if it is possible, what would be the right way to go about it?
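
To make the question concrete, the kind of scheme I have in mind is a free-list cache keyed by rounded-up allocation size, so that most requests never reach the OS. A rough conceptual sketch only (this is not TorchSharp or libtorch code, and tensor memory actually lives in native code, so a managed cache like this is just an illustration of the idea):

```csharp
using System;
using System.Collections.Concurrent;
using System.Runtime.InteropServices;

sealed class CachingAllocator
{
    // Freed blocks, bucketed by rounded-up size, waiting to be reused.
    private readonly ConcurrentDictionary<long, ConcurrentBag<IntPtr>> _free = new();

    // Round requests up to a power of two so freed blocks can satisfy similar-sized requests.
    private static long RoundUp(long bytes) =>
        bytes <= 512 ? 512 : (long)Math.Pow(2, Math.Ceiling(Math.Log2(bytes)));

    public IntPtr Allocate(long bytes)
    {
        long size = RoundUp(bytes);
        if (_free.TryGetValue(size, out var bag) && bag.TryTake(out var cached))
            return cached;                              // cache hit: no trip into the OS
        return Marshal.AllocHGlobal((IntPtr)size);      // cache miss: fall back to the OS
    }

    public void Free(IntPtr ptr, long bytes)
    {
        long size = RoundUp(bytes);
        _free.GetOrAdd(size, _ => new ConcurrentBag<IntPtr>()).Add(ptr);   // recycle instead of releasing
    }
}
```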
