
MOM6-CICE6 0.25deg configurations - optimisation and scaling performance #148

Open
minghangli-uni opened this issue Apr 29, 2024 · 22 comments

Comments

@minghangli-uni
Contributor

This issue is to monitor the progress of the profiling and optimization efforts aimed at efficiently running the 0.25deg configuration.

@minghangli-uni
Contributor Author

A one-year model run takes approximately 19 hours with diagnostic output when running sequentially on 288 cores.

When running concurrently with a total of 1296 cores (864 for ocn, 288 for ice, and 144 for cpl), the runtime is reduced to around 5 hours without diagnostic output or 6 hours with current diagnostic output.

The processor layout has not been tuned at this stage; it is set at runtime. For the current config it is (32, 27), i.e. 32 × 27 = 864 ocean PEs.

@aekiss
Contributor

aekiss commented Apr 29, 2024

ACCESS-OM2-025 stats as a point of comparison:

@aekiss
Contributor

aekiss commented Apr 30, 2024

So with diag output

  • ACCESS-OM2 takes 3600 core-hr per model year
  • ACCESS-OM3 takes 5500 core-hr per model year sequentially on 288 cores
  • ACCESS-OM3 takes 7800 core-hr per model year concurrently on 1296 cores

So we're in the ball park.
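For reference, these totals are simply cores × walltime for a one-year run; a minimal sketch of the arithmetic, using the approximate walltimes quoted earlier in this thread:

```python
# Core-hours per model year = cores x walltime (hours) for a one-year run.
# Walltimes are the approximate figures quoted earlier in this thread.
runs = {
    "ACCESS-OM3, sequential, 288 cores": (288, 19),   # ~19 h with diagnostics
    "ACCESS-OM3, concurrent, 1296 cores": (1296, 6),  # ~6 h with diagnostics
}

for name, (cores, hours) in runs.items():
    print(f"{name}: ~{cores * hours} core-hours per model year")
# 288 * 19 = 5472 (~5500) and 1296 * 6 = 7776 (~7800), consistent with the totals above.
```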

What timestep are you using?

@aekiss
Contributor

aekiss commented Apr 30, 2024

with a total of 1296 cores (864 for ocn, 288 for ice, and 144 for cpl)

So the coupler gets dedicated cores? I didn't realise this was how it works.

@minghangli-uni
Contributor Author

Thank you @aekiss for providing the comparison case.

What timestep are you using?

The timestep I am using is the same as that of OM2, where dt = dt_thermo = dt_cpl = dt_ice_thermo = 1350s.

So the coupler gets dedicated cores?

Yes, the coupler component has its own dedicated processor set for computations such as mapping, merging, diagnostics, and flux calculations.
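As a quick cross-check on these settings, dt = 1350 s implies 64 coupling/baroclinic steps per model day, and hence 32 steps per half model day, which is the run length used in the profiling runs below; a minimal sketch:

```python
# Step counts implied by dt = dt_thermo = dt_cpl = dt_ice_thermo = 1350 s.
SECONDS_PER_DAY = 86400
dt = 1350  # seconds

steps_per_day = SECONDS_PER_DAY / dt
print(f"{steps_per_day:.0f} steps per model day")           # 64
print(f"{steps_per_day / 2:.0f} steps per half model day")  # 32, the half-day profiling run length used below
```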

@aekiss
Contributor

aekiss commented May 1, 2024

Are we using AUTO_MASKTABLE? (see ACCESS-NRI/access-om3-configs#38 (comment))

@minghangli-uni
Contributor Author

minghangli-uni commented May 7, 2024

This comment documents the scaling performance of multiple sequential runs, with the core count varied from 1 to 384 on Gadi. I am still thinking about how to present the scaling results of concurrent runs and will share them in a later comment.

Given that the 0.25deg configuration requires more memory than the 1deg configuration, runs with 48 cores or fewer use the Sapphire Rapids partition, while the remaining runs use the Cascade Lake partition. The auto-masking parameter is activated for runs exceeding 48 cores, as it cannot eliminate any land blocks for configurations with 48 cores or fewer. Each run is limited to half a model day (32 timesteps) with DT=DT_THERM=DT_ICE_THERM=DT_CPL=1350s.

Profiling is conducted using om3-utils tools, with a presentation format similar to the profiling for the 1deg configuration.
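For clarity, the speedup and parallel efficiency shown below follow the usual strong-scaling definitions relative to a reference (lowest) core count in each series. A minimal sketch of the assumed formulas (not the om3-utils implementation):

```python
# Conventional strong-scaling metrics, relative to a reference run on n_ref cores.
# This is a sketch of the formulas assumed here, not the om3-utils implementation.

def speedup(t_ref: float, t_n: float) -> float:
    """Speedup of a run with walltime t_n relative to the reference walltime t_ref."""
    return t_ref / t_n

def parallel_efficiency(n_ref: int, t_ref: float, n: int, t_n: float) -> float:
    """Fraction of the ideal speedup (n / n_ref) actually achieved."""
    return speedup(t_ref, t_n) / (n / n_ref)

# Made-up example: doubling the cores but only running 1.7x faster gives 85% efficiency.
print(parallel_efficiency(n_ref=96, t_ref=100.0, n=192, t_n=100.0 / 1.7))  # ~0.85
```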

Summary of the figures below:

  • Beyond 192 cores, there's a notable increase in initialization time, though it's not expected to persist as a long-term issue.
  • Enabling the auto-masking parameter can enhance performance by up to 15% compared to keeping it off.
  • Parallel efficiency during the ESMF time-stepping phase declines rapidly, falling below 60%, mirroring trends observed in the 1deg configuration.
  • MOM6 exhibits better parallel efficiency than the other components (e.g., CICE), yet overall efficiency remains suboptimal.
  • Specifically regarding MOM6 profiling, most regions show parallel efficiency levels between 40-60%, with surface forcing showing a steep decline. However, the time average of surface forcing is around one order of magnitude lower than that of other regions, suggesting it may not warrant significant concern.
  • I'm considering extending the run duration beyond half a model day, perhaps up to a month, for a few cases, and then reassessing.

ESMF overall performance

  1. Average time for the full calculation (ESMF), Initialisation ([ensemble] Init 1), time-stepping ([ensemble] RunPhase1) and finalisation ([ensemble] FinalizePhase1). Lines marked with a solid circle denote runs with auto-masking off, whereas lines marked with an "x" denote runs with auto-masking on.
     [figure: walltime_ocn_cores-seq-w_o-mask]

  2. Speedup for the full calculation (ESMF), Initialisation ([ensemble] Init 1), time-stepping ([ensemble] RunPhase1) and finalisation ([ensemble] FinalizePhase1).
     [figure: speedup_ocn_cores-seq-w_o-mask]

  3. Parallel efficiency for the full calculation (ESMF), Initialisation ([ensemble] Init 1), time-stepping ([ensemble] RunPhase1) and finalisation ([ensemble] FinalizePhase1).
     [figure: ESMF-parallel-efficiency_ocn_cores-seq-w_o-mask]

ESMF RunPhase for each component

  1. Average runtime for each component: [OCN] RunPhase1, [ICE] RunPhase1, [ATM] RunPhase1, [ROF] RunPhase1
     [figure: component-walltime_ocn_cores-seq-w_o-mask]
  2. Speedup for each component: [OCN] RunPhase1, [ICE] RunPhase1, [ATM] RunPhase1, [ROF] RunPhase1
     [figure: component-speedup_ocn_cores-seq-w_o-mask]
  3. Parallel efficiency for each component: [OCN] RunPhase1, [ICE] RunPhase1, [ATM] RunPhase1, [ROF] RunPhase1
     [figure: ESMF-component-parallel-efficiency_ocn_cores-seq-w_o-mask]

MOM6 profiling

  1. Average time for Total runtime, Ocean, Ocean dynamics, Ocean thermodynamics and tracers, Ocean Other, Ocean Initialization and Ocean surface forcing
     [figure: fms-walltime_ocn_cores-seq-w_o-mask]
  2. Speedup for Total runtime, Ocean, Ocean dynamics, Ocean thermodynamics and tracers, Ocean Other, Ocean Initialization and Ocean surface forcing
     [figure: fms-speedup_ocn_cores-seq-w_o-mask]
  3. Parallel efficiency for Total runtime, Ocean, Ocean dynamics, Ocean thermodynamics and tracers, Ocean Other, Ocean Initialization and Ocean surface forcing
     [figure: fms-parallel-efficiency_ocn_cores-seq-w_o-mask]

@aekiss
Contributor

aekiss commented May 9, 2024

Thanks @minghangli-uni, I agree it looks like init is the culprit for poor scaling at high core count, so longer runs would be more representative (although if we only look at the run phase I guess things will still look much the same).

@aekiss
Contributor

aekiss commented May 9, 2024

runs with 48 cores or fewer use the Sapphire Rapids partition, while the remaining runs use the Cascade Lake partition

Could be interesting to also do a few tests with Sapphire Rapids with >48 cores to assess the performance difference from Cascade Lake.

For that matter, at some point it could be useful to also test with Broadwell or Skylake (will be slower but may be cheaper since they have a lower SU charge rate).

@aekiss
Contributor

aekiss commented May 9, 2024

FYI, the scaling plots from our 2020 GMD paper were generated from these files
https://github.com/COSIMA/ACCESS-OM2-1-025-010deg-report/tree/master/figures/scaling_mom_cice
https://github.com/COSIMA/ACCESS-OM2-1-025-010deg-report/tree/master/figures/cice_scaling
https://github.com/COSIMA/ACCESS-OM2-1-025-010deg-report/tree/master/figures/mom_scaling

These are from @marshallward and were done on the previous NCI system (raijin) with very old and inefficient configurations, so aren't very relevant anymore.

Things have improved a lot since then, as you can see from the plot I showed at Tuesday's meeting - see the end of https://github.com/aekiss/notebooks/blob/master/run_summary.ipynb

@aekiss
Contributor

aekiss commented May 9, 2024

FYI Marshall's performance study of NCI's old raijin system https://github.com/marshallward/optiflop/blob/main/doc/microbench.rst

@aekiss
Contributor

aekiss commented May 9, 2024

More relevant is @penguian's 2020 ACCESS-OM2 scaling study on Gadi.

@marshallward

Very nice study @minghangli-uni, thanks for preparing this. This gradual decrease in efficiency is similar to what I saw in an earlier study (below), although your efficiency drops off more quickly. The general trend is different from MOM5, which would show high efficiency and then suddenly drop like a rock.

[figure: mom_nemo_scaling_ecmwf]

(Disregard the NEMO curve, it was based on a rather old codebase.)

We should talk more at a future Monday dev call (Tuesday in AEST).

@minghangli-uni
Contributor Author

Could be interesting to also do a few tests with Sapphire Rapids with >48 cores to assess the performance difference from Cascade Lake.
For that matter, at some point it could be useful to also test with Broadwell or Skylake (will be slower but may be cheaper since they have a lower SU charge rate).

Thank you for the suggestion. I will run some tests on these partitions.

@minghangli-uni
Contributor Author

minghangli-uni commented May 9, 2024

Hi @marshallward, thank you for sharing your scaling results. As you can see from my plots, each runtime/speedup/parallel efficiency plot starts from 1 core, although that is not strictly a serial run. There is a significant drop in parallel efficiency shortly after the initial scaling from 1 core (or 6 cores). However, when the reference core count is set to 96 cores, the scaling appears considerably better. So I am curious: what is the rationale behind not starting from a very low core count? Could it be that starting from a higher core count helps alleviate overheads and enables a more efficient distribution of the workload?

The following 3 figures show the average time (fig 1), speedup (fig 2) and parallel efficiency (fig 3) for 4 major regions Ocean, Ocean dynamics, Ocean Thermodynamics and tracers and Ocean Other (a sub-region of Ocean). Each run, as described in the previous comment, is constrained to half a model day (32 timesteps) in duration with DT=DT_THERM=DT_ICE_THERM=DT_CPL=1350s. It's worth noting that the runtime for the Ocean region roughly matches the combined runtime of the Ocean dynamics, Ocean Thermodynamics and tracers regions.

  1. Average time as a function of ocn cpus for Ocean, Ocean dynamics, Ocean Thermodynamics and tracers and Ocean Other.
     [figure: fms-walltime_ocn_cores]

  2. Speedup as a function of ocn cpus for Ocean, Ocean dynamics, Ocean Thermodynamics and tracers and Ocean Other.
     [figure: fms-speedup_ocn_cores]

  3. Parallel efficiency as a function of ocn cpus for Ocean, Ocean dynamics, Ocean Thermodynamics and tracers and Ocean Other.
     [figure: fms-parallel-efficiency_ocn_cores]

@marshallward

Thanks for the updated figures! As for why we started at such high numbers, I could say something about single-core jobs being heavily RAM-bound, rather than more cache-vs-communication bound at the higher cores... but the real reason was that we just couldn't fit those single-core jobs in memory, since the RAM-per-node was way too small. 🤪


If I were to try and explain the difference in performance from 1 to 96 cores, I would guess it is due to two competing effects.

  • Division of work into smaller arrays per core.

  • Moving the arrays from RAM to cache, producing higher load/store speeds and potential for vectorization (>1 FLOP/cycle).

  • Gradual overhead from communication, both number of messages and total data transfer. You will also see an increase in array size per core, due to halos.

Generally the trend will go something like this:

  • At low N, the first two are prominent, yielding major speedups.

  • As N is increased, the benefits from cache and vectorization are diminished, and at some point become negligible. But you continue to see speedup from division of work.

  • At large N, communication costs become significant. Eventually they become comparable to benefits from division-of-work, and the cost may even exceed any further benefit of scaling.

So if someone were to ask about the "scalability" of a model: do they mean the speedup relative to N=1? Or are they simply asking about the threshold for communication scaling? That is, how long can we throw cores at the problem until it becomes a waste of time/energy/hardware?

I don't think there is a single answer to this question, and we certainly should strive to maximize the benefits of cache-bound (if not cycle-bound peak) performance. As long as one is clear about what they mean by "scalability", then I don't think there is any harm in using one or the other as a reference point.
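To make that concrete, here is a small sketch with hypothetical walltimes (not measured ACCESS-OM3 numbers) showing how the same timings yield very different efficiency figures depending on whether N=1 or N=96 is used as the reference:

```python
# Hypothetical walltimes (seconds) to illustrate the reference-point issue;
# these are NOT measured ACCESS-OM3 numbers.
walltimes = {1: 6000.0, 96: 90.0, 192: 50.0, 384: 30.0}

def efficiency(n: int, n_ref: int, t=walltimes) -> float:
    # Parallel efficiency of n cores relative to a reference run on n_ref cores.
    return (t[n_ref] / t[n]) / (n / n_ref)

for n in (192, 384):
    print(f"{n} cores: {efficiency(n, n_ref=1):.0%} relative to N=1, "
          f"{efficiency(n, n_ref=96):.0%} relative to N=96")
# The underlying measurements are identical; only the baseline changes.
```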


By the way: I think these figures are fantastic, and I hope you will continue to include low-N values in these studies. We could only speculate on how the model behaved in this regime, and it's amazing to see the entire trend of the model.

(And the reference "efficiency" and "ideal speedup" were always imaginary concepts anyway. Don't let it get in the way of the actual measurements!)

@minghangli-uni
Contributor Author

minghangli-uni commented May 13, 2024

@marshallward Thank you for the detailed explanations! I couldn't agree more about the competing effects as N increases.

So if someone were to ask about the "scalability" of a model: do they mean the speedup relative to N=1? Or are they simply asking about the threshold for communication scaling? That is, how long can we throw cores at the problem until it becomes a waste of time/energy/hardware?
I don't think there is a single answer to this question, and we certainly should strive to maximize the benefits of cache-bound (if not cycle-bound peak) performance. As long as one is clear about what they mean by "scalability", then I don't think there is any harm in using one or the other as a reference point.

From my understanding, if a configuration (e.g., 0.25deg) can fit within a low-core-count job yet scales poorly when measured against that low reference, while scaling well when measured against a higher reference core count, it still indicates that the code benefits considerably from parallelisation and uses computational resources efficiently; it just does not exhibit good scalability relative to low core counts. I agree with your point, and it is important for us to aim at maximising the benefits of cache-bound performance.

@minghangli-uni
Contributor Author

This comment documents profiling plots for low-N values, starting from 48 cores (1 node) for the Cascade Lake (CL) partition and from 32 cores (1 node) for the Skylake (SL) partition, which has a lower SU charge rate.

For both partitions, the overall parallel efficiency for MOM6 remains above 80% up to 2304 cores. As expected, the Cascade Lake partition performs better.

| CPU cores | tavg (s), Ocean, CL | tavg (s), Ocean, SL | Percentage diff (%), (SL-CL)/CL |
| --- | --- | --- | --- |
| 576 | 49.503341 | 62.230439 | 25.7 |
| 864 | 33.683130 | 41.850632 | 24.2 |
| 1152 | 25.661238 | 32.463238 | 26.5 |
| 2304 | 12.052829 | 17.893434 | 48.4 |
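
The percentage-difference column is just (SL - CL)/CL × 100; a quick sketch reproducing it from the table values (small differences from the quoted figures are rounding):

```python
# Reproduce the "Percentage diff" column above: (SL - CL) / CL * 100.
timings = {  # cores: (tavg CL, tavg SL) in seconds, taken from the table above
    576: (49.503341, 62.230439),
    864: (33.683130, 41.850632),
    1152: (25.661238, 32.463238),
    2304: (12.052829, 17.893434),
}

for cores, (cl, sl) in timings.items():
    print(f"{cores} cores: Skylake is {100 * (sl - cl) / cl:.1f}% slower than Cascade Lake")
```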

[figure: skylake-fms-walltime_ocn_cores]

[figure: skylake-fms-speedup_ocn_cores]

[figure: skylake-fms-parallel-efficiency_ocn_cores]

@minghangli-uni
Contributor Author

The compiler options currently used to build our code are not compatible with the Broadwell partition, resulting in illegal-instruction errors.

forrtl: severe (168): Program Exception - illegal instruction
Image              PC                Routine            Line        Source
libpthread-2.28.s  000014FA26594CF0  Unknown               Unknown  Unknown
access-om3-MOM6-C  0000000001B44C20  _ZN5ESMCI3VMK4ini         509  ESMCI_VMKernel.C
access-om3-MOM6-C  0000000002209F0B  _ZN5ESMCI2VM10ini        3163  ESMCI_VM.C
access-om3-MOM6-C  0000000000B7C4F3  c_esmc_vminitiali        1151  ESMCI_VM_F.C
access-om3-MOM6-C  0000000000FCF0D2  esmf_vmmod_mp_esm        9275  ESMF_VM.F90
access-om3-MOM6-C  0000000000E718B3  esmf_initmod_mp_e         671  ESMF_Init.F90
access-om3-MOM6-C  0000000000E704C2  esmf_initmod_mp_e         373  ESMF_Init.F90
access-om3-MOM6-C  0000000000431BFD  MAIN__                     68  esmApp.F90
access-om3-MOM6-C  00000000004319CD  Unknown               Unknown  Unknown
libc-2.28.so       000014FA261F7D85  __libc_start_main     Unknown  Unknown
access-om3-MOM6-C  00000000004318EE  Unknown               Unknown  Unknown

@minghangli-uni
Contributor Author

The 0.25deg configuration currently supports a maximum of 2736 CPU cores (57 nodes of the 48-core Cascade Lake partition). Exceeding this limit results in a memory issue. In contrast, the OM2 report indicates that CPU core counts can extend beyond 8000.

[gadi-cpu-clx-0218:374091:0:374091] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x303030302e)
[gadi-cpu-clx-0218:374098:0:374098] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x303030302e)
[gadi-cpu-clx-0218:374097:0:374097] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x303030302e)
[gadi-cpu-clx-0218:374104:0:374104] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x303030302e)
[gadi-cpu-clx-0218:374106:0:374106] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x303030302e)
[gadi-cpu-clx-0218:374110:0:374110] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x303030302e)
==== backtrace (tid: 374069) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x000000000011b4ca std::ostream::sentry::sentry()  ???:0
 2 0x000000000011bbac std::__ostream_insert<char, std::char_traits<char> >()  ???:0
 3 0x000000000011c05b std::operator<< <std::char_traits<char> >()  ???:0
 4 0x000000000221226c ESMCI::VMId::log()  /jobfs/111971718.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.5.0-ki52vmthtnxdyjeghtyphmrk5ju3yxuj/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:500
 5 0x00000000022110d7 ESMCI::VM::logGarbageInfo()  /jobfs/111971718.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.5.0-ki52vmthtnxdyjeghtyphmrk5ju3yxuj/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:2405
 6 0x0000000000b8c06b c_esmc_vmloggarbageinfo_()  /jobfs/111971718.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.5.0-ki52vmthtnxdyjeghtyphmrk5ju3yxuj/spack-src/src/Infrastructure/VM/interface/ESMCI_VM_F.C:1994
 7 0x0000000000fc004e esmf_vmmod_mp_esmf_vmloggarbageinfo_()  /jobfs/111971718.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.5.0-ki52vmthtnxdyjeghtyphmrk5ju3yxuj/spack-src/src/Infrastructure/VM/interface/ESMF_VM.F90:5989
 8 0x00000000015e0642 nuopc_base_mp_nuopc_logintro_()  /jobfs/111971718.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.5.0-ki52vmthtnxdyjeghtyphmrk5ju3yxuj/spack-src/src/addon/NUOPC/src/NUOPC_Base.F90:3222
 9 0x0000000004b6eaff nuopc_modelbase_mp_initializeipdvxp07_()  /jobfs/111971718.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.5.0-ki52vmthtnxdyjeghtyphmrk5ju3yxuj/spack-src/src/addon/NUOPC/src/NUOPC_ModelBase.F90:1369
...

@aekiss
Contributor

aekiss commented Aug 27, 2024

@minghangli-uni have you looked at the performance impact of the coupler timestep?

I feel like it should be set to the ocean baroclinic timestep (equal to the sea ice dynamic timestep) to get the most accurate ocean-ice coupling and also the benefit of temporal interpolation of the surface stress from DATM.

However, I'm not sure if that would impose a significant performance cost (either the direct cost of the coupler, or the cost of more frequent waiting for synchronisation between MOM and CICE if their loads are unbalanced).

@minghangli-uni
Contributor Author

I haven't yet checked the impact of the coupler timestep, though it's something I've been meaning to do. For the time being, the coupler timestep is set to match the baroclinic timestep. I will do a performance test as well as a scientific check by varying the coupler timestep for the 1deg configuration first, and then proceed to the 0.25deg.
