
Question about parallel computing on GPU #46

Open
Johnxiaoming opened this issue May 15, 2023 · 3 comments

@Johnxiaoming

I have two GPUs on our server, but when I run the model, the second GPU is never used. I know my job is not very big, so the first GPU isn't fully occupied, but I think supporting parallel computing could save a lot of time. Even just adding a function that lets us select which GPU to use would help a lot, because then I could run one job on each GPU. Thanks.

If it's difficult, I will fork it and do it myself.
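
(For reference, one common way to pin a single job to one GPU today is to restrict device visibility before PyTorch initializes CUDA. This is only a minimal sketch, with the actual model loading left as a placeholder:)

```python
# Pin this process to the second GPU only; the variable must be set before
# torch initializes CUDA (ideally before importing torch at all).
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"   # "0" for the first GPU

import torch

# Inside this process the single visible GPU is re-indexed as cuda:0.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(torch.cuda.get_device_name(device) if device.type == "cuda" else "CPU only")

# ...load the Omnipose model and run evaluation on `device` as usual...
```

A second job can then be launched in another shell with `CUDA_VISIBLE_DEVICES=0`, so each GPU runs one job.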

@kevinjohncutler
Owner

@Johnxiaoming I have been using two GPUs very heavily recently and both are utilized. Although the latest GitHub version should work, I have made enormous optimizations in the last two weeks that I will post to GitHub soon. I use DataParallel currently, and this works fine for single servers and AWS instances. DistributedDataParallel might be better. To debug: what versions of Omnipose and PyTorch do you have, and what is your hardware?
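
For context, this is roughly what the DataParallel wrapping looks like in PyTorch. It is a minimal sketch with a stand-in network, not the actual Omnipose class:

```python
import torch
import torch.nn as nn

# Stand-in for the Omnipose network; any nn.Module is wrapped the same way.
net = nn.Sequential(
    nn.Conv2d(2, 32, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 4, 3, padding=1),
)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if torch.cuda.device_count() > 1:
    # Replicates the module on the listed GPUs and splits the batch
    # dimension across them on every forward pass.
    net = nn.DataParallel(net, device_ids=[0, 1])
net = net.to(device)

x = torch.randn(8, 2, 224, 224, device=device)
y = net(x)  # with two GPUs, each one processes 4 of the 8 samples
print(y.shape)
```

Restricting `device_ids` to a single entry (or setting `CUDA_VISIBLE_DEVICES`) is also how a job can be pinned to one GPU.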

@Johnxiaoming
Author

A small Intel server with two NVIDIA RTX GPUs, running CUDA 11.8 and PyTorch 2.0 (cuda112py311h13fee9e_200, Python 3.11).

Thank you!

@kevinjohncutler
Owner

@Johnxiaoming I just realized that you said running the model, not training (that's where my head has been at...). I know for sure that training uses both GPUs, but evaluating is another story. Some of my optimizations for training should apply to evaluation. The model itself is initialized with DataParallel in both training and evaluation, so my guess is that we simply are not saturating GPU0, so GPU1 never gets called. Can you tell me what your typical image set looks like in number and resolution? If you monitored your GPU0 memory, I am curious to know whether its VRAM was completely used up or well under the maximum capacity during evaluation.
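
If it helps, VRAM usage can also be read from PyTorch itself in addition to nvidia-smi. A small sketch (the `report_vram` helper is just for illustration):

```python
import torch

def report_vram(device_index: int) -> None:
    """Print PyTorch's view of allocated/reserved memory vs. total for one GPU."""
    gib = 1024 ** 3
    allocated = torch.cuda.memory_allocated(device_index) / gib
    reserved = torch.cuda.memory_reserved(device_index) / gib
    total = torch.cuda.get_device_properties(device_index).total_memory / gib
    print(f"GPU{device_index}: {allocated:.2f} GiB allocated, "
          f"{reserved:.2f} GiB reserved, {total:.2f} GiB total")

# Call right after (or during) an evaluation pass; if GPU0 is well under
# its capacity, DataParallel has little reason to spill work onto GPU1.
for idx in range(torch.cuda.device_count()):
    report_vram(idx)
```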

Some more explanation and planning: the behavior now is to process images in sequence. The batch_size parameter only applies to tiled mode now, where the image is split into 224x224px patches and run in parallel. That should always be slower than running the whole image, so long as the full image fits on the GPU. I think the reason why Cellpose did not build in the ability to run multiple images at once on the GPU is because each image in the batch must be the same size (this is guaranteed during training via cropping), and they were typically evaluating on very diverse datasets. However, in real applications we usually have the same size images from a given sensor or even a cropped time lapse, so it makes a LOT of sense to run whole images in batches.
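
To make the batching idea concrete, here is a rough sketch of evaluating same-size images in batches; `eval_in_batches` is a hypothetical helper, not the actual Omnipose code path:

```python
import torch

def eval_in_batches(net, images, batch_size=8, device="cuda:0"):
    """Run same-size images through the network in batches.

    `images` is assumed to be a float tensor of shape (N, C, H, W) where
    every image shares the same H and W (e.g. frames of a time lapse).
    """
    net.eval()
    outputs = []
    with torch.no_grad():
        for start in range(0, images.shape[0], batch_size):
            batch = images[start:start + batch_size].to(device, non_blocking=True)
            # With nn.DataParallel, each batch is split across both GPUs.
            outputs.append(net(batch).cpu())
    return torch.cat(outputs, dim=0)
```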

Moreover, it makes sense to run the mask reconstruction in parallel - again, VRAM permitting. Doing the Euler integration in one loop for all images simultaneously instead of multiple loops in sequence is virtually guaranteed to be faster. I already figured out much of the code for this while parallelizing training, and I can see exactly what we need to do for evaluation. I just need to find the time to implement it. I suspect I will do it by the end of the month, so stay tuned!
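
As a rough sketch of what batched Euler integration could look like (simplified to bilinear flow sampling with `grid_sample`; `euler_integrate_batch` is illustrative, not the actual Omnipose implementation):

```python
import torch
import torch.nn.functional as F

def euler_integrate_batch(flows: torch.Tensor, n_steps: int = 200, dt: float = 1.0) -> torch.Tensor:
    """Advect every pixel of every image along its flow field in one loop.

    flows: (B, 2, H, W) tensor of (dy, dx) flow components on the GPU.
    Returns the final normalized (x, y) positions with shape (B, H, W, 2).
    """
    B, _, H, W = flows.shape
    # Every pixel starts at its own coordinate, normalized to [-1, 1]
    # because that is the coordinate convention grid_sample uses.
    ys = torch.linspace(-1.0, 1.0, H, device=flows.device)
    xs = torch.linspace(-1.0, 1.0, W, device=flows.device)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    pos = torch.stack((gx, gy), dim=-1).unsqueeze(0).expand(B, H, W, 2).clone()

    # Convert a pixel-sized step into the normalized coordinate scale.
    scale = torch.tensor([2.0 / max(W - 1, 1), 2.0 / max(H - 1, 1)], device=flows.device)

    for _ in range(n_steps):
        # One grid_sample call advances all B images at once instead of
        # looping over them one by one.
        sampled = F.grid_sample(flows, pos, align_corners=True)      # (B, 2, H, W)
        step = sampled.permute(0, 2, 3, 1).flip(-1) * scale          # (dy, dx) -> (dx, dy)
        pos = (pos + dt * step).clamp(-1.0, 1.0)

    return pos
```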
