
How to get af3 to mobilize multiple GPUs to use more memory #213

Closed
XIANZHE-LI opened this issue Dec 14, 2024 · 5 comments
Labels
question Further information is requested

Comments

@XIANZHE-LI

W external/xla/xla/service/hlo_rematerialization.cc:3005] Can't reduce memory use below 18.88GiB (20266891105 bytes) by rematerialization; only reduced to 41.43GiB (44490724824 bytes), down from 45.92GiB (49309847144 bytes) originally
2024-12-14 19:57:44.462633: W external/xla/xla/tsl/framework/bfc_allocator.cc:497] Allocator (GPU_0_bfc) ran out of memory trying to allocate 47.20GiB (rounded to 50677266944) requested by op
2024-12-14 19:57:44.464119: W external/xla/xla/tsl/framework/bfc_allocator.cc:508] ************________________________________________________________________________________________
E1214 19:57:44.464390 17928 pjrt_stream_executor_client.cc:3084] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 50677266816 bytes.
Traceback (most recent call last):
File "/root/lanyun-tmp/AF3/alphafold3/run_alphafold_exit.py", line 690, in
app.run(main)
File "/root/miniconda/envs/af3-2/lib/python3.11/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/root/miniconda/envs/af3-2/lib/python3.11/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
^^^^^^^^^^
File "/root/lanyun-tmp/AF3/alphafold3/run_alphafold_exit.py", line 674, in main
process_fold_input(
File "/root/lanyun-tmp/AF3/alphafold3/run_alphafold_exit.py", line 542, in process_fold_input
all_inference_results = predict_structure(
^^^^^^^^^^^^^^^^^^
File "/root/lanyun-tmp/AF3/alphafold3/run_alphafold_exit.py", line 375, in predict_structure
result = model_runner.run_inference(example, rng_key)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/lanyun-tmp/AF3/alphafold3/run_alphafold_exit.py", line 313, in run_inference
result = self._model(rng_key, featurised_example)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 50677266816 bytes.

For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.

I have three RTX 4090s, but only one of them is actually being used:

| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4    |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf           Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090         On |   00000000:3D:00.0 Off |                  Off |
| 30%   28C    P8              15W / 450W |    23399MiB / 24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090         On |   00000000:63:00.0 Off |                  Off |
| 30%   29C    P8              21W / 450W |      393MiB / 24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090         On |   00000000:BD:00.0 Off |                  Off |
| 30%   27C    P8              19W / 450W |      393MiB / 24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

@joshabramson added the question (Further information is requested) label on Dec 14, 2024
@joshabramson
Collaborator

The model is not parallelizable across multiple GPUs; it can only be run on one GPU at a time.

It is possible to run multiple copies of the model in parallel on many GPUs (note that the Docker launch controls which GPU is used).
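For example, each copy can be pinned to a different GPU via the --gpus flag of docker run. This is only a sketch, not the exact command from the repository docs: the image name, input JSONs, output directories, and any mounted volumes (shown here as "...") are placeholders you would replace with your usual launch command.

# One independent AlphaFold 3 job per GPU; each container sees only its own device.
docker run --gpus device=0 ... alphafold3 python run_alphafold.py --json_path=/root/af_input/job0.json --output_dir=/root/af_output/job0 &
docker run --gpus device=1 ... alphafold3 python run_alphafold.py --json_path=/root/af_input/job1.json --output_dir=/root/af_output/job1 &
docker run --gpus device=2 ... alphafold3 python run_alphafold.py --json_path=/root/af_input/job2.json --output_dir=/root/af_output/job2 &
wait

This does not give a single job more memory; it only lets three separate inputs run at the same time, one per GPU.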

@XIANZHE-LI
Author

The model is not parallelizable across multiple GPUs; it can only be run on one GPU at a time.

It is possible to run multiple copies of the model in parallel on many GPUs (note that the Docker launch controls which GPU is used).

I'm running out of VRAM during inference, so would upgrading to a GPU with more VRAM be the solution?

@joshabramson
Collaborator

More memory on a single GPU would make it possible to run larger bucket sizes.

Normally we would advise using unified memory (https://github.com/google-deepmind/alphafold3/blob/main/docs/performance.md#gpu-memory), but there seem to be some issues with this on the RTX 4090: #209
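As I read that page, enabling unified memory amounts to passing a few extra environment variables to the container, roughly like this (a sketch: the variable names and the 3.2 fraction are the values I understand performance.md to suggest, and the rest of the command, including the "..." parts, is a placeholder for your normal launch command):

docker run --gpus device=0 \
  --env XLA_PYTHON_CLIENT_PREALLOCATE=false \
  --env TF_FORCE_UNIFIED_MEMORY=true \
  --env XLA_CLIENT_MEM_FRACTION=3.2 \
  ... alphafold3 python run_alphafold.py ...

With those set, JAX is allowed to spill GPU allocations into host RAM instead of failing with RESOURCE_EXHAUSTED, at a significant speed cost; and as noted above, this path may currently misbehave on the RTX 4090 (#209).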

@MaoSihong

I'm running out of VRAM during inference, so would upgrading to a GPU with more VRAM be the solution?

What if the unified memory option is turned on?

@XIANZHE-LI
Author

I'm running out of VRAM during inference, so would upgrading to a GPU with more VRAM be the solution?

What if the unified memory option is turned on?

I haven't tried it yet, but that is how it is supposed to work. Have you tested this solution? Does it resolve the issue of insufficient VRAM?
