
MOM6-CICE6 0.25deg configurations - optimisation and scaling performance #148

Open
minghangli-uni opened this issue Apr 29, 2024 · 22 comments

Comments

@minghangli-uni
Contributor

This issue is to monitor the progress of the profiling and optimization efforts aimed at efficiently running the 0.25deg configuration.

@minghangli-uni
Contributor Author

A one-year model run takes approximately 19 hours with diagnostic output when running sequentially on 288 cores.

When running concurrently with a total of 1296 cores (864 for ocn, 288 for ice, and 144 for cpl), the runtime is reduced to around 5 hours without diagnostic output or 6 hours with current diagnostic output.

The processor layout has not been tuned at this stage; it is set at runtime. For the current config it is (32, 27), i.e. 32 × 27 = 864 ocean PEs.

@aekiss
Contributor

aekiss commented Apr 29, 2024

ACCESS-OM2-025 stats as a point of comparison:

@aekiss
Contributor

aekiss commented Apr 30, 2024

So with diag output

  • ACCESS-OM2 takes 3600 core-hr per model year
  • ACCESS-OM3 takes 5500 core-hr per model year sequentially on 288 cores
  • ACCESS-OM3 takes 7800 core-hr per model year concurrently on 1296 cores

So we're in the ball park.
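For reference, these totals are simply cores × walltime for a one-year run; a minimal sketch of the arithmetic, using the approximate walltimes quoted earlier in this thread:

```python
# Core-hours per model year = cores x walltime (hours) for a one-year run.
# Walltimes are the approximate figures quoted earlier in this thread.
runs = {
    "ACCESS-OM3, sequential, 288 cores": (288, 19),   # ~19 h with diagnostics
    "ACCESS-OM3, concurrent, 1296 cores": (1296, 6),  # ~6 h with diagnostics
}

for name, (cores, hours) in runs.items():
    print(f"{name}: ~{cores * hours} core-hours per model year")
# 288 * 19 = 5472 (~5500) and 1296 * 6 = 7776 (~7800), consistent with the totals above.
```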

What timestep are you using?

@aekiss
Contributor

aekiss commented Apr 30, 2024

with a total of 1296 cores (864 for ocn, 288 for ice, and 144 for cpl)

So the coupler gets dedicated cores? I didn't realise this was how it works.

@minghangli-uni
Contributor Author

Thank you @aekiss for providing the comparison case.

What timestep are you using?

The timestep I am using is the same as that of OM2, where dt = dt_thermo = dt_cpl = dt_ice_thermo = 1350s.

So the coupler gets dedicated cores?

Yes, the coupler component has its own dedicated processor set for computations such as mapping, merging, diagnostics, and flux calculations.
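As a quick cross-check on these settings, dt = 1350 s implies 64 coupling/baroclinic steps per model day, and hence 32 steps per half model day, which is the run length used in the profiling runs below; a minimal sketch:

```python
# Step counts implied by dt = dt_thermo = dt_cpl = dt_ice_thermo = 1350 s.
SECONDS_PER_DAY = 86400
dt = 1350  # seconds

steps_per_day = SECONDS_PER_DAY / dt
print(f"{steps_per_day:.0f} steps per model day")           # 64
print(f"{steps_per_day / 2:.0f} steps per half model day")  # 32, the half-day profiling run length used below
```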

@aekiss
Contributor

aekiss commented May 1, 2024

Are we using AUTO_MASKTABLE? (see ACCESS-NRI/access-om3-configs#38 (comment))

@minghangli-uni
Contributor Author

minghangli-uni commented May 7, 2024

This comment documents the scaling performance of multiple sequential runs, with the core count varied from 1 to 384 on Gadi. I am still thinking about how to present the scaling results of concurrent runs and will share them in a later comment.

Given that the 0.25deg configuration requires more memory than the 1deg configuration, runs with 48 cores or fewer use the Sapphire Rapids partition, while the remaining runs use the Cascade Lake partition. The auto-masking parameter is activated for runs exceeding 48 cores, as it cannot eliminate any land blocks for configurations with 48 cores or fewer. Each run is limited to half a model day (32 timesteps) with DT=DT_THERM=DT_ICE_THERM=DT_CPL=1350s.

Profiling is conducted using om3-utils tools, with a presentation format similar to the profiling for the 1deg configuration.
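For clarity, the speedup and parallel efficiency shown below follow the usual strong-scaling definitions relative to a reference (lowest) core count in each series. A minimal sketch of the assumed formulas (not the om3-utils implementation):

```python
# Conventional strong-scaling metrics, relative to a reference run on n_ref cores.
# This is a sketch of the formulas assumed here, not the om3-utils implementation.

def speedup(t_ref: float, t_n: float) -> float:
    """Speedup of a run with walltime t_n relative to the reference walltime t_ref."""
    return t_ref / t_n

def parallel_efficiency(n_ref: int, t_ref: float, n: int, t_n: float) -> float:
    """Fraction of the ideal speedup (n / n_ref) actually achieved."""
    return speedup(t_ref, t_n) / (n / n_ref)

# Made-up example: doubling the cores but only running 1.7x faster gives 85% efficiency.
print(parallel_efficiency(n_ref=96, t_ref=100.0, n=192, t_n=100.0 / 1.7))  # ~0.85
```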

Summary of the figures below:

  • Beyond 192 cores, there's a notable increase in initialization time, though it's not expected to persist as a long-term issue.
  • Enabling the auto-masking parameter can enhance performance by up to 15% compared to keeping it off.
  • Parallel efficiency during the ESMF time-stepping phase declines rapidly, falling below 60%, mirroring trends observed in the 1deg configuration.
  • MOM6 exhibits better parallel efficiency than the other components (e.g., CICE), yet overall efficiency remains suboptimal.
  • Specifically regarding MOM6 profiling, most regions show parallel efficiency levels between 40-60%, with surface forcing showing a steep decline. However, the time average of surface forcing is around one order of magnitude lower than that of other regions, suggesting it may not warrant significant concern.
  • I'm considering extending the run duration beyond half a model day, perhaps up to a month, for a few cases, and then reassessing.

ESMF overall performance

  1. Average time for the full calculation (ESMF), Initialisation ([ensemble] Init 1), time-stepping ([ensemble] RunPhase1) and finalisation ([ensemble] FinalizePhase1). Lines marked with a solid circle denote runs with auto-masking off, whereas lines marked with an "x" denote runs with auto-masking on.
     [figure: walltime_ocn_cores-seq-w_o-mask]

  2. Speedup for the full calculation (ESMF), Initialisation ([ensemble] Init 1), time-stepping ([ensemble] RunPhase1) and finalisation ([ensemble] FinalizePhase1).
     [figure: speedup_ocn_cores-seq-w_o-mask]

  3. Parallel efficiency for the full calculation (ESMF), Initialisation ([ensemble] Init 1), time-stepping ([ensemble] RunPhase1) and finalisation ([ensemble] FinalizePhase1).
     [figure: ESMF-parallel-efficiency_ocn_cores-seq-w_o-mask]

ESMF RunPhase for each component

  1. Average runtime for each component: [OCN] RunPhase1, [ICE] RunPhase1, [ATM] RunPhase1, [ROF] RunPhase1
     [figure: component-walltime_ocn_cores-seq-w_o-mask]
  2. Speedup for each component: [OCN] RunPhase1, [ICE] RunPhase1, [ATM] RunPhase1, [ROF] RunPhase1
     [figure: component-speedup_ocn_cores-seq-w_o-mask]
  3. Parallel efficiency for each component: [OCN] RunPhase1, [ICE] RunPhase1, [ATM] RunPhase1, [ROF] RunPhase1
     [figure: ESMF-component-parallel-efficiency_ocn_cores-seq-w_o-mask]

MOM6 profiling

  1. Average time for Total runtime, Ocean, Ocean dynamics, Ocean thermodynamics and tracers, Ocean Other, Ocean Initialization and Ocean surface forcing
     [figure: fms-walltime_ocn_cores-seq-w_o-mask]
  2. Speedup for Total runtime, Ocean, Ocean dynamics, Ocean thermodynamics and tracers, Ocean Other, Ocean Initialization and Ocean surface forcing
     [figure: fms-speedup_ocn_cores-seq-w_o-mask]
  3. Parallel efficiency for Total runtime, Ocean, Ocean dynamics, Ocean thermodynamics and tracers, Ocean Other, Ocean Initialization and Ocean surface forcing
     [figure: fms-parallel-efficiency_ocn_cores-seq-w_o-mask]

@aekiss
Contributor

aekiss commented May 9, 2024

Thanks @minghangli-uni, I agree it looks like init is the culprit for poor scaling at high core count, so longer runs would be more representative (although if we only look at the run phase I guess things will still look much the same).

@aekiss
Contributor

aekiss commented May 9, 2024

runs with 48 cores or fewer use the Sapphire Rapids partition, while the remaining runs use the Cascade Lake partition

Could be interesting to also do a few tests with Sapphire Rapids with >48 cores to assess the performance difference from Cascade Lake.

For that matter, at some point it could be useful to also test with Broadwell or Skylake (will be slower but may be cheaper since they have a lower SU charge rate).

@aekiss
Contributor

aekiss commented May 9, 2024

FYI, the scaling plots from our 2020 GMD paper were generated from these files
https://github.com/COSIMA/ACCESS-OM2-1-025-010deg-report/tree/master/figures/scaling_mom_cice
https://github.com/COSIMA/ACCESS-OM2-1-025-010deg-report/tree/master/figures/cice_scaling
https://github.com/COSIMA/ACCESS-OM2-1-025-010deg-report/tree/master/figures/mom_scaling

These are from @marshallward and were done on the previous NCI system (raijin) with very old and inefficient configurations, so aren't very relevant anymore.

Things have improved a lot since then, as you can see from the plot I showed at Tuesday's meeting - see the end of https://github.com/aekiss/notebooks/blob/master/run_summary.ipynb

@aekiss
Contributor

aekiss commented May 9, 2024

FYI Marshall's performance study of NCI's old raijin system https://github.com/marshallward/optiflop/blob/main/doc/microbench.rst

@aekiss
Contributor

aekiss commented May 9, 2024

More relevant is @penguian's 2020 ACCESS-OM2 scaling study on Gadi.

@marshallward

Very nice study @minghangli-uni, thanks for preparing this. This gradual decrease in efficiency is similar to what I saw in an earlier study (below), although your efficiency drops off more quickly. The general trend is different from MOM5, which would show high efficiency and then suddenly drop like a rock.

[figure: mom_nemo_scaling_ecmwf]

(Disregard the NEMO curve, it was based on a rather old codebase.)

We should talk more at a future Monday dev call (Tuesday in AEST).

@minghangli-uni
Contributor Author

Could be interesting to also do a few tests with Sapphire Rapids with >48 cores to assess the performance difference from Cascade Lake.
For that matter, at some point it could be useful to also test with Broadwell or Skylake (will be slower but may be cheaper since they have a lower SU charge rate).

Thank you for the suggestion. I will run some tests on these partitions.

@minghangli-uni
Contributor Author

minghangli-uni commented May 9, 2024

Hi @marshallward, thank you for sharing your scaling results. As you can see from my plots, each runtime/speedup/parallel efficiency plot starts from 1 core, although that is not strictly a serial run. There is a significant drop in parallel efficiency shortly after the initial scaling from 1 core (or 6 cores). However, when the reference core count is set to 96 cores, the scaling appears considerably better. So I am curious: what is the rationale behind not starting from a very low core count? Could it be that starting from a higher core count helps alleviate overheads and enables a more efficient distribution of the workload?

The following 3 figures show the average time (fig 1), speedup (fig 2) and parallel efficiency (fig 3) for 4 major regions Ocean, Ocean dynamics, Ocean Thermodynamics and tracers and Ocean Other (a sub-region of Ocean). Each run, as described in the previous comment, is constrained to half a model day (32 timesteps) in duration with DT=DT_THERM=DT_ICE_THERM=DT_CPL=1350s. It's worth noting that the runtime for the Ocean region roughly matches the combined runtime of the Ocean dynamics, Ocean Thermodynamics and tracers regions.

  1. Average time as a function of ocn cpus for Ocean, Ocean dynamics, Ocean Thermodynamics and tracers and Ocean Other.
     [figure: fms-walltime_ocn_cores]

  2. Speedup as a function of ocn cpus for Ocean, Ocean dynamics, Ocean Thermodynamics and tracers and Ocean Other.
     [figure: fms-speedup_ocn_cores]

  3. Parallel efficiency as a function of ocn cpus for Ocean, Ocean dynamics, Ocean Thermodynamics and tracers and Ocean Other.
     [figure: fms-parallel-efficiency_ocn_cores]

@marshallward

Thanks for the updated figures! As for why we started at such high numbers, I could say something about single-core jobs being heavily RAM-bound, rather than more cache-vs-communication bound at the higher cores... but the real reason was that we just couldn't fit those single-core jobs in memory, since the RAM-per-node was way too small. 🤪


If I were to try and explain the difference in performance from 1 to 96 cores, I would guess it is due to two competing effects.

  • Division of work into smaller arrays per core.

  • Moving the arrays from RAM to cache, producing higher load/store speeds and potential for vectorization (>1 FLOP/cycle).

  • Gradual overhead from communication, both number of messages and total data transfer. You will also see an increase in array size per core, due to halos.

Generally the trend will go something like this:

  • At low N, the first two are prominent, yielding major speedups.

  • As N is increased, the benefits from cache and vectorization are diminished, and at some point become negligible. But you continue to see speedup from division of work.

  • At large N, communication costs become significant. Eventually they become comparable to benefits from division-of-work, and the cost may even exceed any further benefit of scaling.

So if someone were to ask about the "scalability" of a model: do they mean the speedup relative to N=1? Or are they simply asking about the threshold for communication scaling? That is, how long can we throw cores at the problem until it becomes a waste of time/energy/hardware?

I don't think there is a single answer to this question, and we certainly should strive to maximize the benefits of cache-bound (if not cycle-bound peak) performance. As long as one is clear about what they mean by "scalability", then I don't think there is any harm in using one or the other as a reference point.
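To make that concrete, here is a small sketch with hypothetical walltimes (not measured ACCESS-OM3 numbers) showing how the same timings yield very different efficiency figures depending on whether N=1 or N=96 is used as the reference:

```python
# Hypothetical walltimes (seconds) to illustrate the reference-point issue;
# these are NOT measured ACCESS-OM3 numbers.
walltimes = {1: 6000.0, 96: 90.0, 192: 50.0, 384: 30.0}

def efficiency(n: int, n_ref: int, t=walltimes) -> float:
    # Parallel efficiency of n cores relative to a reference run on n_ref cores.
    return (t[n_ref] / t[n]) / (n / n_ref)

for n in (192, 384):
    print(f"{n} cores: {efficiency(n, n_ref=1):.0%} relative to N=1, "
          f"{efficiency(n, n_ref=96):.0%} relative to N=96")
# The underlying measurements are identical; only the baseline changes.
```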


By the way: I think these figures are fantastic, and I hope you will continue to include low-N values in these studies. We could only speculate on how the model behaved in this regime, and it's amazing to see the entire trend of the model.

(And the reference "efficiency" and "ideal speedup" were always imaginary concepts anyway. Don't let it get in the way of the actual measurements!)

@minghangli-uni
Contributor Author

minghangli-uni commented May 13, 2024

@marshallward Thank you for the detailed explanations! I couldn't agree more about the competing effects as N increases.

So if someone were to ask about the "scalability" of a model: do they mean the speedup relative to N=1? Or are they simply asking about the threshold for communication scaling? That is, how long can we throw cores at the problem until it becomes a waste of time/energy/hardware?
I don't think there is a single answer to this question, and we certainly should strive to maximize the benefits of cache-bound (if not cycle-bound peak) performance. As long as one is clear about what they mean by "scalability", then I don't think there is any harm in using one or the other as a reference point.

From my understanding, if a configuration (e.g., 0.25deg) can fit within a low-core-count job yet scales poorly when measured against that low reference, while scaling well when measured against a higher reference core count, it still indicates that the code benefits considerably from parallelisation and uses computational resources efficiently; it just does not exhibit good scalability relative to low core counts. I agree with your point, and it is important for us to aim at maximising the benefits of cache-bound performance.

@minghangli-uni
Contributor Author

This comment documents profiling plots for low-N values, starting from 48 cores (1 node) for the Cascade Lake (CL) partition and from 32 cores (1 node) for the Skylake (SL) partition, which has a lower SU charge rate.

For both partitions, the overall parallel efficiency for MOM6 remains above 80% up to 2304 cores. As expected, the Cascade Lake partition performs better.

| CPU cores | tavg (s), Ocean, CL | tavg (s), Ocean, SL | Percentage diff (%), (SL-CL)/CL |
| --- | --- | --- | --- |
| 576 | 49.503341 | 62.230439 | 25.7 |
| 864 | 33.683130 | 41.850632 | 24.2 |
| 1152 | 25.661238 | 32.463238 | 26.5 |
| 2304 | 12.052829 | 17.893434 | 48.4 |
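
The percentage-difference column is just (SL - CL)/CL × 100; a quick sketch reproducing it from the table values (small differences from the quoted figures are rounding):

```python
# Reproduce the "Percentage diff" column above: (SL - CL) / CL * 100.
timings = {  # cores: (tavg CL, tavg SL) in seconds, taken from the table above
    576: (49.503341, 62.230439),
    864: (33.683130, 41.850632),
    1152: (25.661238, 32.463238),
    2304: (12.052829, 17.893434),
}

for cores, (cl, sl) in timings.items():
    print(f"{cores} cores: Skylake is {100 * (sl - cl) / cl:.1f}% slower than Cascade Lake")
```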

[figure: skylake-fms-walltime_ocn_cores]

[figure: skylake-fms-speedup_ocn_cores]

[figure: skylake-fms-parallel-efficiency_ocn_cores]

@minghangli-uni
Contributor Author

The compiler options currently used to build our code are not compatible with the Broadwell partition, resulting in illegal-instruction errors.

forrtl: severe (168): Program Exception - illegal instruction
Image              PC                Routine            Line        Source
libpthread-2.28.s  000014FA26594CF0  Unknown               Unknown  Unknown
access-om3-MOM6-C  0000000001B44C20  _ZN5ESMCI3VMK4ini         509  ESMCI_VMKernel.C
access-om3-MOM6-C  0000000002209F0B  _ZN5ESMCI2VM10ini        3163  ESMCI_VM.C
access-om3-MOM6-C  0000000000B7C4F3  c_esmc_vminitiali        1151  ESMCI_VM_F.C
access-om3-MOM6-C  0000000000FCF0D2  esmf_vmmod_mp_esm        9275  ESMF_VM.F90
access-om3-MOM6-C  0000000000E718B3  esmf_initmod_mp_e         671  ESMF_Init.F90
access-om3-MOM6-C  0000000000E704C2  esmf_initmod_mp_e         373  ESMF_Init.F90
access-om3-MOM6-C  0000000000431BFD  MAIN__                     68  esmApp.F90
access-om3-MOM6-C  00000000004319CD  Unknown               Unknown  Unknown
libc-2.28.so       000014FA261F7D85  __libc_start_main     Unknown  Unknown
access-om3-MOM6-C  00000000004318EE  Unknown               Unknown  Unknown

@minghangli-uni
Contributor Author

The 0.25deg configuration currently supports a maximum of 2736 CPU cores (57 nodes of the 48-core Cascade Lake partition). Exceeding this limit results in a memory issue. In contrast, the OM2 report indicates that CPU core counts can extend beyond 8000.

[gadi-cpu-clx-0218:374091:0:374091] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x303030302e)
[gadi-cpu-clx-0218:374098:0:374098] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x303030302e)
[gadi-cpu-clx-0218:374097:0:374097] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x303030302e)
[gadi-cpu-clx-0218:374104:0:374104] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x303030302e)
[gadi-cpu-clx-0218:374106:0:374106] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x303030302e)
[gadi-cpu-clx-0218:374110:0:374110] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x303030302e)
==== backtrace (tid: 374069) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x000000000011b4ca std::ostream::sentry::sentry()  ???:0
 2 0x000000000011bbac std::__ostream_insert<char, std::char_traits<char> >()  ???:0
 3 0x000000000011c05b std::operator<< <std::char_traits<char> >()  ???:0
 4 0x000000000221226c ESMCI::VMId::log()  /jobfs/111971718.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.5.0-ki52vmthtnxdyjeghtyphmrk5ju3yxuj/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:500
 5 0x00000000022110d7 ESMCI::VM::logGarbageInfo()  /jobfs/111971718.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.5.0-ki52vmthtnxdyjeghtyphmrk5ju3yxuj/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:2405
 6 0x0000000000b8c06b c_esmc_vmloggarbageinfo_()  /jobfs/111971718.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.5.0-ki52vmthtnxdyjeghtyphmrk5ju3yxuj/spack-src/src/Infrastructure/VM/interface/ESMCI_VM_F.C:1994
 7 0x0000000000fc004e esmf_vmmod_mp_esmf_vmloggarbageinfo_()  /jobfs/111971718.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.5.0-ki52vmthtnxdyjeghtyphmrk5ju3yxuj/spack-src/src/Infrastructure/VM/interface/ESMF_VM.F90:5989
 8 0x00000000015e0642 nuopc_base_mp_nuopc_logintro_()  /jobfs/111971718.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.5.0-ki52vmthtnxdyjeghtyphmrk5ju3yxuj/spack-src/src/addon/NUOPC/src/NUOPC_Base.F90:3222
 9 0x0000000004b6eaff nuopc_modelbase_mp_initializeipdvxp07_()  /jobfs/111971718.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.5.0-ki52vmthtnxdyjeghtyphmrk5ju3yxuj/spack-src/src/addon/NUOPC/src/NUOPC_ModelBase.F90:1369
...

@aekiss
Contributor

aekiss commented Aug 27, 2024

@minghangli-uni have you looked at the performance impact of the coupler timestep?

I feel like it should be set to the ocean baroclinic timestep (equal to the sea ice dynamic timestep) to get the most accurate ocean-ice coupling and also the benefit of temporal interpolation of the surface stress from DATM.

However, I'm not sure if that would impose a significant performance cost (either the direct cost of the coupler, or the cost of more frequent waiting for synchronisation between MOM and CICE if their loads are unbalanced).

@minghangli-uni
Contributor Author

I haven't yet checked the impact of the coupler timestep, though it's something I've been meaning to do. For the time being, the coupler timestep is set to match the baroclinic timestep. I will do a performance test as well as a scientific check by varying the coupler timestep for the 1deg configuration first, and then proceed to the 0.25deg.
