
CUDA_ERROR_OUT_OF_MEMORY issue when running test case on 4090 24G GPU machine locally #209

Open
MaoSihong opened this issue Dec 12, 2024 · 13 comments
Labels: question (Further information is requested)

Comments

@MaoSihong

MaoSihong commented Dec 12, 2024

When I launched the 2PV7 test case with the all-default profile, I encountered the following error after the MSA pipeline (it seems so):
[screenshots of the error]
I also tried the suggested troubleshooting settings:

ENV XLA_PYTHON_CLIENT_PREALLOCATE=false
ENV TF_FORCE_UNIFIED_MEMORY=true
ENV XLA_CLIENT_MEM_FRACTION=3.2
and modified model_config.py:
  pair_transition_shard_spec: Sequence[_Shape2DType] = (
      (2048, None),
      (3072, 1024),
      (None, 512),
  )

but this time I still got errors at the very beginning, no matter whether --norun_data_pipeline was used:
[screenshot of the error]
Any advice is appreciated!
Here is the output of docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi:
[screenshot of nvidia-smi output]
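
For reference, a minimal sketch of passing the unified-memory overrides above to the container at run time instead of baking them into the image (the alphafold3 image tag and the mount paths are assumptions; adjust to your setup):

docker run -it \
    -e XLA_PYTHON_CLIENT_PREALLOCATE=false \
    -e TF_FORCE_UNIFIED_MEMORY=true \
    -e XLA_CLIENT_MEM_FRACTION=3.2 \
    --volume $HOME/af_input:/root/af_input \
    --volume $HOME/af_output:/root/af_output \
    --volume $HOME/af_models:/root/models \
    --volume $HOME/af_databases:/root/public_databases \
    --gpus all \
    alphafold3 \
    python run_alphafold.py \
    --json_path=/root/af_input/fold_input.json \
    --model_dir=/root/models \
    --db_dir=/root/public_databases \
    --output_dir=/root/af_output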

@MaoSihong
Author

Additionally, I noticed that the container has low RAM usage while the test case is running its MSA search stage (usually less than 5 GB). I don't know if that is normal for 2PV7.

@MaoSihong
Author

Also, I saw the following notice during the default-profile run:
[screenshot of the notice]

@alchemistcai

I tested 2pv7 on a 4060 8GB, and 768 tokens is the largest default compile bucket I can run.

By using --buckets='900' I can run inference on at most 900 tokens without an OutOfMemory error.

I use the default single-GPU settings, without unified memory:

export XLA_FLAGS="--xla_gpu_enable_triton_gemm=false" 
export XLA_PYTHON_CLIENT_PREALLOCATE=true
export XLA_CLIENT_MEM_FRACTION=0.95

Run nvidia-smi before you run python run_alphafold.py to find out which processes are using the GPU.

For MSA searching, my test uses about 4-5 GB of RAM with the default 8 CPU cores, so that's normal.
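
A minimal sketch of combining these settings with a custom bucket in a bare-metal run (only --buckets comes from the suggestion above; the --json_path, --model_dir and --output_dir values are placeholder paths):

export XLA_FLAGS="--xla_gpu_enable_triton_gemm=false"
export XLA_PYTHON_CLIENT_PREALLOCATE=true
export XLA_CLIENT_MEM_FRACTION=0.95

# Compile a single 900-token bucket instead of the full default bucket list
python run_alphafold.py \
    --json_path=/root/af_input/2pv7_input.json \
    --model_dir=/root/models \
    --output_dir=/root/af_output \
    --buckets=900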

@joshabramson joshabramson added the question Further information is requested label Dec 13, 2024
@joshabramson
Collaborator

The error occurs during the inference stage, after the MSA.

As well as @alchemistcai being able to run on a 4060, other users have reported success with this example on an RTX 4090: #59 (comment)

It should be fine having two GPUs available, but perhaps that is causing an issue for some reason. Can you try --gpus device=0 instead of --gpus all?
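
As a quick sanity check (a sketch reusing the CUDA base image from earlier in the thread), the same nvidia-smi probe can be run with only the first GPU exposed, before applying the flag to the AlphaFold 3 run command:

docker run --rm --gpus device=0 nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi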

@MaoSihong
Author

MaoSihong commented Dec 13, 2024

> I tested 2pv7 on a 4060 8GB, and 768 tokens is the largest default compile bucket I can run. By using --buckets='900' I can run inference on at most 900 tokens without an OutOfMemory error. [...]

I tried the default-profile compilation again, but with --buckets=1280 this time. It seems the vanilla pipeline pre-occupies the GPU resources, so I expected limiting the bucket size to succeed this time. Unfortunately, I still got the same error. Thanks for your advice anyway.

@MaoSihong
Author

> [...] It should be fine having two GPUs available, but perhaps that is causing an issue for some reason. Can you try --gpus device=0 instead of --gpus all?

[screenshot]
Sorry, it still failed.

@alchemistcai

alchemistcai commented Dec 14, 2024

> ENV XLA_PYTHON_CLIENT_PREALLOCATE=false / ENV TF_FORCE_UNIFIED_MEMORY=true / ENV XLA_CLIENT_MEM_FRACTION=3.2, and modified model_config.py (pair_transition_shard_spec) [...]

You may want to try undoing this modification; I tried it before.

For a single GPU, the unified-memory settings make even 256-token inference go OutOfMemory on the 4060.

@XIANZHE-LI

XIANZHE-LI commented Dec 14, 2024

I've the same problem.

W external/xla/xla/service/hlo_rematerialization.cc:3005] Can't reduce memory use below 18.88GiB (20266891105 bytes) by rematerialization; only reduced to 41.43GiB (44490724824 bytes), down from 45.92GiB (49309847144 bytes) originally
2024-12-14 19:57:44.462633: W external/xla/xla/tsl/framework/bfc_allocator.cc:497] Allocator (GPU_0_bfc) ran out of memory trying to allocate 47.20GiB (rounded to 50677266944)requested by op 
2024-12-14 19:57:44.464119: W external/xla/xla/tsl/framework/bfc_allocator.cc:508] ************________________________________________________________________________________________
E1214 19:57:44.464390   17928 pjrt_stream_executor_client.cc:3084] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 50677266816 bytes.
Traceback (most recent call last):
  File "/root/lanyun-tmp/AF3/alphafold3/run_alphafold_exit.py", line 690, in <module>
    app.run(main)
  File "/root/miniconda/envs/af3-2/lib/python3.11/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/root/miniconda/envs/af3-2/lib/python3.11/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
             ^^^^^^^^^^
  File "/root/lanyun-tmp/AF3/alphafold3/run_alphafold_exit.py", line 674, in main
    process_fold_input(
  File "/root/lanyun-tmp/AF3/alphafold3/run_alphafold_exit.py", line 542, in process_fold_input
    all_inference_results = predict_structure(
                            ^^^^^^^^^^^^^^^^^^
  File "/root/lanyun-tmp/AF3/alphafold3/run_alphafold_exit.py", line 375, in predict_structure
    result = model_runner.run_inference(example, rng_key)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/lanyun-tmp/AF3/alphafold3/run_alphafold_exit.py", line 313, in run_inference
    result = self._model(rng_key, featurised_example)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 50677266816 bytes.
--------------------

For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.
[screenshot]
I have three 4090s, but only one can be used:

| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:3D:00.0 Off |                  Off |
| 30%   28C    P8             15W /  450W |   23399MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  |   00000000:63:00.0 Off |                  Off |
| 30%   29C    P8             21W /  450W |     393MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090        On  |   00000000:BD:00.0 Off |                  Off |
| 30%   27C    P8             19W /  450W |     393MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
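
In the nvidia-smi output above, GPU 0 is nearly full while GPUs 1 and 2 are idle. For a bare-metal run like the traceback above, one generic option (a sketch, not specific to this repo) is to expose only a free card to JAX via CUDA_VISIBLE_DEVICES:

# Hide the busy GPU 0 and let JAX see only GPU 1
export CUDA_VISIBLE_DEVICES=1
python run_alphafold.py ...   # or the run_alphafold_exit.py variant from the traceback, same flags as before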

@MaoSihong
Author

> You may want to try undoing this modification; I tried it before. For a single GPU, the unified-memory settings make even 256-token inference go OutOfMemory on the 4060.

Yep, but in my first test with the default parameters, those environment variables were not turned on, and the OOM failure still happened that time.
Thank you very much for your close attention!

@MaoSihong
Author

> I've the same problem. [...] I have three 4090s, but only one can be used.

Sure, I'll try to dig out the internal details with traceback filtering turned off.
As far as I know, only a single GPU was allocated when my container was running, too.
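
A minimal sketch of disabling the traceback filtering mentioned in the error output above (the variable name comes directly from that log):

# Show JAX's internal frames in the traceback on the next run
export JAX_TRACEBACK_FILTERING=off
python run_alphafold.py ...   # rerun with the same flags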

@MaoSihong
Author

> The error is during the inference stage, after the MSA. [...] Can you try --gpus device=0 instead of --gpus all?

Yes, I already tried it with an explicit assignment of device=0. In fact, no matter whether the 'all' or 'device=0' option was used, device 0 (the 4090 24GB GPU) always showed significant resource usage, but the same OOM error occurred.
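
A generic way to confirm which devices JAX actually sees inside the container (standard JAX calls, not specific to this issue):

python -c "import jax; print(jax.default_backend(), jax.devices())"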

@Augustin-Zidek
Collaborator

Augustin-Zidek commented Dec 19, 2024

@MaoSihong from the nvidia-smi screenshots, it looks like you are on the 560 version of NVIDIA drivers, i.e. on the beta channel. Could you try downgrading to the stable 550 version?

Or is this under the Windows Subsystem for Linux? If so, I strongly recommend running AlphaFold 3 under Linux; it is the only supported operating system.
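
A quick way to check which driver version the container actually sees (a standard nvidia-smi query, reusing the CUDA base image from earlier in the thread):

docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 \
    nvidia-smi --query-gpu=driver_version,name --format=csv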

@MaoSihong
Author

> From the nvidia-smi screenshots, it looks like you are on the 560 version of NVIDIA drivers, i.e. the beta channel. Could you try downgrading to the stable 550 version? Or is this under the Windows Subsystem for Linux? [...]
Yep, I ran the case under WSL2 + Docker Desktop; Docker can integrate with the subsystem. I don't know whether it is the CUDA/driver version or the subsystem that causes the OOM. I think I should probably give up on this struggle under WSL2; it seems there are complicated compatibility problems with launching AF3 in WSL.
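
For completeness, a generic check for whether a shell is running inside WSL (the kernel version string contains "microsoft" under WSL1/WSL2; not specific to AlphaFold 3):

grep -i microsoft /proc/version && echo "running under WSL"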
