How to get af3 to mobilize multiple GPUs to use more memory #213
Comments
The model is not parallelizable across multiple GPUs; it can only run on one GPU at a time. It is, however, possible to run multiple copies of the model in parallel, one per GPU (note that the docker launch controls which GPU is used).
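To illustrate the "multiple copies in parallel" approach, here is a minimal sketch. The image name, paths, and flags are placeholders (adapt them to your actual AlphaFold 3 invocation); the per-GPU pinning via `--gpus "device=$i"` is the point. The `echo` makes this a dry run; remove it to actually launch.

```shell
#!/bin/sh
# Sketch (hypothetical image name and paths): launch one independent
# AlphaFold 3 container per GPU, each pinned to a single device.
# "echo" makes this a dry run; drop it to actually start containers.
for i in 0 1 2; do
  echo docker run --gpus "device=$i" \
    --volume "$HOME/af_input_$i:/root/af_input" \
    --volume "$HOME/af_output_$i:/root/af_output" \
    alphafold3 \
    python run_alphafold.py \
    --json_path=/root/af_input/fold_input.json \
    --output_dir=/root/af_output
done
```

Each container sees only its assigned GPU, so three independent fold jobs can run concurrently, but no single job gets more than one GPU's memory.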
I'm running out of VRAM during the inference stage, so would upgrading to a GPU with larger VRAM be the solution?
More single-GPU memory would make it possible to run larger bucket sizes. Normally we would advise using unified memory (https://github.com/google-deepmind/alphafold3/blob/main/docs/performance.md#gpu-memory), but there seem to be some issues with this on the RTX 4090: #209
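For reference, the unified-memory configuration described in the linked performance.md amounts to a few environment variables set before inference starts (a sketch based on that doc; verify the values against your AlphaFold 3 version):

```shell
# Unified-memory settings per docs/performance.md (set before launching
# inference, e.g. via `docker run --env ...`):
export XLA_PYTHON_CLIENT_PREALLOCATE=false   # don't grab all VRAM up front
export TF_FORCE_UNIFIED_MEMORY=true          # let XLA spill to host RAM
export XLA_CLIENT_MEM_FRACTION=3.2           # allow up to 3.2x one GPU's memory
```

With unified memory, oversized buffers page between VRAM and host RAM, which avoids the hard OOM at the cost of speed.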
What if the unified memory option were turned on?
I haven't tried it yet, but that is how it's supposed to work. Have you tested this solution? Could it resolve the issue of insufficient VRAM?
W external/xla/xla/service/hlo_rematerialization.cc:3005] Can't reduce memory use below 18.88GiB (20266891105 bytes) by rematerialization; only reduced to 41.43GiB (44490724824 bytes), down from 45.92GiB (49309847144 bytes) originally
2024-12-14 19:57:44.462633: W external/xla/xla/tsl/framework/bfc_allocator.cc:497] Allocator (GPU_0_bfc) ran out of memory trying to allocate 47.20GiB (rounded to 50677266944) requested by op
2024-12-14 19:57:44.464119: W external/xla/xla/tsl/framework/bfc_allocator.cc:508] ************________________________________________________________________________________________
E1214 19:57:44.464390 17928 pjrt_stream_executor_client.cc:3084] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 50677266816 bytes.
Traceback (most recent call last):
File "/root/lanyun-tmp/AF3/alphafold3/run_alphafold_exit.py", line 690, in
app.run(main)
File "/root/miniconda/envs/af3-2/lib/python3.11/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/root/miniconda/envs/af3-2/lib/python3.11/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
^^^^^^^^^^
File "/root/lanyun-tmp/AF3/alphafold3/run_alphafold_exit.py", line 674, in main
process_fold_input(
File "/root/lanyun-tmp/AF3/alphafold3/run_alphafold_exit.py", line 542, in process_fold_input
all_inference_results = predict_structure(
^^^^^^^^^^^^^^^^^^
File "/root/lanyun-tmp/AF3/alphafold3/run_alphafold_exit.py", line 375, in predict_structure
result = model_runner.run_inference(example, rng_key)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/lanyun-tmp/AF3/alphafold3/run_alphafold_exit.py", line 313, in run_inference
result = self._model(rng_key, featurised_example)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 50677266816 bytes.
For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.
I have 3 RTX 4090s, but only one can be used:
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:3D:00.0 Off | Off |
| 30% 28C P8 15W / 450W | 23399MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 On | 00000000:63:00.0 Off | Off |
| 30% 29C P8 21W / 450W | 393MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce RTX 4090 On | 00000000:BD:00.0 Off | Off |
| 30% 27C P8 19W / 450W | 393MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
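As the nvidia-smi output above shows, only GPU 0 is loaded (23399MiB used) while GPUs 1 and 2 sit idle. Since the model itself cannot span devices, the practical workaround is to pin each separate run to a different device before the process initializes CUDA. A minimal sketch (the script name is a placeholder; CUDA_VISIBLE_DEVICES must be set before JAX/XLA enumerates devices):

```shell
#!/bin/sh
# Sketch: pin a run to GPU 1 so a second job doesn't contend with the
# job already occupying GPU 0. Must be set before the process starts,
# because XLA enumerates devices at initialization time.
export CUDA_VISIBLE_DEVICES=1
echo "This process would see only GPU $CUDA_VISIBLE_DEVICES"
# python run_alphafold.py ...   # placeholder for the actual launch
```

Inside the process, the single visible device is renumbered as device 0, so no code changes are needed; each copy simply believes it has one GPU.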