
How to get af3 to mobilize multiple GPUs to use more memory #213

Closed
XIANZHE-LI opened this issue Dec 14, 2024 · 5 comments
Labels
question Further information is requested

Comments

@XIANZHE-LI

W external/xla/xla/service/hlo_rematerialization.cc:3005] Can't reduce memory use below 18.88GiB (20266891105 bytes) by rematerialization; only reduced to 41.43GiB (44490724824 bytes), down from 45.92GiB (49309847144 bytes) originally
2024-12-14 19:57:44.462633: W external/xla/xla/tsl/framework/bfc_allocator.cc:497] Allocator (GPU_0_bfc) ran out of memory trying to allocate 47.20GiB (rounded to 50677266944) requested by op
2024-12-14 19:57:44.464119: W external/xla/xla/tsl/framework/bfc_allocator.cc:508] ************________________________________________________________________________________________
E1214 19:57:44.464390 17928 pjrt_stream_executor_client.cc:3084] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 50677266816 bytes.
Traceback (most recent call last):
File "/root/lanyun-tmp/AF3/alphafold3/run_alphafold_exit.py", line 690, in
app.run(main)
File "/root/miniconda/envs/af3-2/lib/python3.11/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/root/miniconda/envs/af3-2/lib/python3.11/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
^^^^^^^^^^
File "/root/lanyun-tmp/AF3/alphafold3/run_alphafold_exit.py", line 674, in main
process_fold_input(
File "/root/lanyun-tmp/AF3/alphafold3/run_alphafold_exit.py", line 542, in process_fold_input
all_inference_results = predict_structure(
^^^^^^^^^^^^^^^^^^
File "/root/lanyun-tmp/AF3/alphafold3/run_alphafold_exit.py", line 375, in predict_structure
result = model_runner.run_inference(example, rng_key)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/lanyun-tmp/AF3/alphafold3/run_alphafold_exit.py", line 313, in run_inference
result = self._model(rng_key, featurised_example)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 50677266816 bytes.

For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.

I have three RTX 4090s, but only one of them is actually being used:

| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4    |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf           Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090         On |   00000000:3D:00.0 Off |                  Off |
| 30%   28C    P8              15W / 450W |    23399MiB / 24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090         On |   00000000:63:00.0 Off |                  Off |
| 30%   29C    P8              21W / 450W |      393MiB / 24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090         On |   00000000:BD:00.0 Off |                  Off |
| 30%   27C    P8              19W / 450W |      393MiB / 24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

@joshabramson added the question (Further information is requested) label on Dec 14, 2024
@joshabramson
Collaborator

The model is not parallelizable across multiple GPUs; it can only be run on one GPU at a time.

It is possible to run multiple copies of the model in parallel on many GPUs (note that the Docker launch controls which GPU is used).
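For example, each copy can be pinned to a different GPU via the --gpus flag of docker run. This is only a sketch, not the exact command from the repository docs: the image name, input JSONs, output directories, and any mounted volumes (shown here as "...") are placeholders you would replace with your usual launch command.

# One independent AlphaFold 3 job per GPU; each container sees only its own device.
docker run --gpus device=0 ... alphafold3 python run_alphafold.py --json_path=/root/af_input/job0.json --output_dir=/root/af_output/job0 &
docker run --gpus device=1 ... alphafold3 python run_alphafold.py --json_path=/root/af_input/job1.json --output_dir=/root/af_output/job1 &
docker run --gpus device=2 ... alphafold3 python run_alphafold.py --json_path=/root/af_input/job2.json --output_dir=/root/af_output/job2 &
wait

This does not give a single job more memory; it only lets three separate inputs run at the same time, one per GPU.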

@XIANZHE-LI
Author

The model is not parallelizable across multiple GPUs; it can only be run on one GPU at a time.

It is possible to run multiple copies of the model in parallel on many GPUs (note that the Docker launch controls which GPU is used).

I'm running out of VRAM during inference, so would upgrading to a GPU with more VRAM be the solution?

@joshabramson
Collaborator

More memory on a single GPU would make it possible to run larger bucket sizes.

Normally we would advise using unified memory (https://github.com/google-deepmind/alphafold3/blob/main/docs/performance.md#gpu-memory), but there seem to be some issues with this on the RTX 4090: #209
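As I read that page, enabling unified memory amounts to passing a few extra environment variables to the container, roughly like this (a sketch: the variable names and the 3.2 fraction are the values I understand performance.md to suggest, and the rest of the command, including the "..." parts, is a placeholder for your normal launch command):

docker run --gpus device=0 \
  --env XLA_PYTHON_CLIENT_PREALLOCATE=false \
  --env TF_FORCE_UNIFIED_MEMORY=true \
  --env XLA_CLIENT_MEM_FRACTION=3.2 \
  ... alphafold3 python run_alphafold.py ...

With those set, JAX is allowed to spill GPU allocations into host RAM instead of failing with RESOURCE_EXHAUSTED, at a significant speed cost; and as noted above, this path may currently misbehave on the RTX 4090 (#209).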

@MaoSihong

I'm running out of VRAM during inference, so would upgrading to a GPU with more VRAM be the solution?

What if the unified memory option is turned on?

@XIANZHE-LI
Author

I'm running out of VRAM during inference, so would upgrading to a GPU with more VRAM be the solution?

What if the unified memory option is turned on?

I haven't tried it yet, but that is how it is supposed to work. Have you tested this solution? Does it resolve the issue of insufficient VRAM?
