
Large Number of Grids Causing Insufficient GPU Memory #5366

Open
Lucas-Lucas1 opened this issue Oct 4, 2024 · 3 comments
Labels
component: parallelization, question

Comments

Lucas-Lucas1 commented Oct 4, 2024

When running 3D simulations, I want to use a large grid of 2048 × 64 × 2048 cells, but I encounter the following error:

amrex::Abort::1::Out of gpu memory. Free: 2293760 Asked: 8388608 !!!
SIGABRT
amrex::Abort::0::Out of gpu memory. Free: 2293760 Asked: 8388608 !!!
SIGABRT
amrex::Abort::3::Out of gpu memory. Free: 2293760 Asked: 8388608 !!!
SIGABRT
amrex::Abort::2::Out of gpu memory. Free: 2293760 Asked: 8388608 !!!
SIGABRT
See Backtrace.0 file for details
See Backtrace.1 file for details
See Backtrace.2 file for details
See Backtrace.3 file for details
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 3 DUP FROM 0
with errorcode 6.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[gpu005.cluster.cn:51980] 3 more processes have sent help message help-mpi-api.txt / mpi-abort
[gpu005.cluster.cn:51980] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Backtrace.0.txt

What can I do to resolve this issue? As a new WarpX user, I am still learning many of the details in the official documentation.

Additionally, I want to apply a time-varying external electromagnetic field in a specific region. I've reviewed issue #5046, but I noticed that the if(..., ...) statements used there caused problems. Does the latest version of WarpX support setting this up with if statements, or is there a better method now?
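For context, what I have in mind is roughly the following, based on the parsed external particle-field parameters I found in the documentation (only a sketch: the region bounds, amplitude, and frequency are placeholders, and I am not sure this is the recommended approach):

particles.E_ext_particle_init_style = parse_E_ext_particle_function
# placeholder example: a y-directed field of 1e5 V/m oscillating at 3 GHz,
# applied only inside -10e-6 < x < 10e-6 and 0 < z < 20e-6, zero elsewhere
particles.Ex_external_particle_function(x,y,z,t) = "0."
particles.Ey_external_particle_function(x,y,z,t) = "if((x>-10.e-6)*(x<10.e-6)*(z>0.)*(z<20.e-6), 1.e5*sin(2*pi*3.e9*t), 0.)"
particles.Ez_external_particle_function(x,y,z,t) = "0."

My understanding is that this applies the field to the particle pusher only, without adding it to the grid fields; please correct me if that is wrong.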

ax3l added the question and component: parallelization labels on Oct 8, 2024
ax3l (Member) commented Oct 8, 2024

Hi @Lucas-Lucas1,

Thanks for reaching out.
Did you already read https://warpx.readthedocs.io/en/latest/usage/workflows/domain_decomposition.html ?

To guide you a bit more, can you post the inputs and submission scripts you are using?
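The short version of that page: the domain is split into boxes whose size you control, and those boxes are distributed over the MPI ranks / GPUs. As a minimal sketch of the relevant inputs-file parameters (the values are placeholders, not a recommendation for your case; the same controls are available through the WarpX-specific grid options in PICMI):

amr.n_cell          = 2048 64 2048
amr.max_grid_size   = 128   # upper bound on the side length of each box
amr.blocking_factor = 32    # box side lengths must be divisible by this

Keep in mind that smaller boxes do not by themselves reduce the total memory footprint: if the whole problem does not fit on the GPUs you are running on, it needs to be spread over more GPUs.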

roelof-groenewald (Member) commented

Hi @Lucas-Lucas1. It would also be helpful to know how many GPUs (and of what kind) you are trying to run this simulation on, and how many particles you have in total. Note that WarpX keeps the particle data permanently in GPU memory, since moving it between the GPU and CPU is time-consuming. For this reason you need enough total GPU memory to fit all the particles in your simulation. In my experience a 40 GB A100 GPU can hold about 200 million particles, so if I want to run a large simulation with, say, 800 million particles, I need to use at least 4 A100 GPUs.
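For rough scale with that rule of thumb: a 2048 × 64 × 2048 grid has about 2.7 × 10^8 cells, so even a single particle per cell already gives roughly 270 million particles, i.e. on the order of two 40 GB A100s' worth of particle storage before field arrays and guard cells are counted (an order-of-magnitude estimate only).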

Lucas-Lucas1 (Author) commented

Thanks for your responses. In fact, I haven't yet read the part about domain decomposition; I will study it as soon as possible.

Below are my input script test.py and submission script sbatch.sh.
test.py.txt
sbatch.sh.txt

My cluster consists of 9 NVIDIA DGX A100 servers. Each server is equipped with dual AMD Rome 7742 processors (64 cores / 128 threads each), 1 TB of DDR4 memory, and 8 NVIDIA A100 40 GB SXM4 GPUs.
