MOM6-CICE6 0.25deg configurations - optimisation and scaling performance #148
It takes approximately 19 hours (with diagnostic output) to complete a one-year run when running sequentially on 288 cores. When running concurrently with a total of 1296 cores (864 for ocn, 288 for ice, and 144 for cpl), the runtime is reduced to around 5 hours without diagnostic output, or 6 hours with the current diagnostic output. The processor layout is not tuned at this stage, but is adjusted at runtime; for the current config it is (32, 27).
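As a rough back-of-the-envelope comparison of the two layouts quoted above (a sketch only, using the with-diagnostics timings so the comparison is like-for-like):

```python
# Back-of-the-envelope comparison of the sequential (288-core) and concurrent
# (1296-core) layouts, using the with-diagnostics timings quoted above.
seq_cores, seq_hours = 288, 19                 # sequential, with diagnostic output
conc_cores, conc_hours = 864 + 288 + 144, 6    # ocn + ice + cpl, with diagnostic output

speedup = seq_hours / conc_hours               # ~3.2x faster wall time
core_ratio = conc_cores / seq_cores            # 4.5x more cores
rel_efficiency = speedup / core_ratio          # ~0.70 relative efficiency

# The MOM6 (32, 27) layout corresponds to 32 * 27 = 864 ocean cores.
print(f"speedup {speedup:.2f}x on {core_ratio:.1f}x the cores "
      f"-> relative efficiency {rel_efficiency:.2f}")
print(f"core-hours per model year: {seq_cores * seq_hours} vs {conc_cores * conc_hours}")
```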
ACCESS-OM2-025 stats as a point of comparison:
So with diag output
So we're in the ballpark. What timestep are you using?
So the coupler gets dedicated cores? I didn't realise this was how it works.
Thank you @aekiss for providing the comparison case.
The timestep I am using is the same as that of OM2, where dt = dt_thermo = dt_cpl = dt_ice_thermo = 1350s.
Yes, the coupler component has its own dedicated processor set for computations such as mapping, merging, diagnostics, and flux calculation.
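To illustrate what a dedicated processor set means in a concurrent layout, here is a minimal sketch (not the actual run configuration) using the core counts quoted earlier:

```python
# Illustrative only: assign disjoint MPI rank ranges to each component, using the
# core counts quoted above (864 ocn, 288 ice, 144 cpl). In a sequential layout the
# components would instead share a single, overlapping rank set.
pe_counts = {"ocn": 864, "ice": 288, "cpl": 144}

layout, start = {}, 0
for comp, n in pe_counts.items():
    layout[comp] = range(start, start + n)   # each component gets its own rank set
    start += n

for comp, ranks in layout.items():
    print(f"{comp}: ranks {ranks.start}-{ranks.stop - 1} ({len(ranks)} cores)")
```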
Are we using |
This comment documents the scaling performance of multiple sequential runs, varying the core count from 1 to 384 on Gadi. I am still thinking about how to present the scaling results of concurrent runs and will share them in a later comment. Given that the 0.25deg configuration requires more memory than the 1deg configuration, runs with 48 cores or fewer use the Sapphire Rapids partition, while the remaining runs use the Cascade Lake partition. Profiling is conducted with the om3-utils tools, in a format similar to the profiling for the 1deg configuration (a minimal sketch of the speedup/parallel-efficiency calculation is given after the figure list). Summary of the figures below:
ESMF overall performance
ESMF RunPhase for each component
MOM6 profiling
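For reference, the speedup and parallel-efficiency numbers behind figures like these reduce to a simple calculation; a minimal sketch is below, with placeholder runtimes (the real timings come from the ESMF profile output parsed with om3-utils):

```python
# Minimal sketch of the speedup / parallel-efficiency calculation.
# The runtimes are hypothetical placeholders, not measured values.
runtimes = {1: 36000.0, 48: 900.0, 96: 470.0, 192: 255.0, 384: 150.0}  # cores -> seconds

ref = min(runtimes)                          # reference core count (1 here)
for n in sorted(runtimes):
    speedup = runtimes[ref] / runtimes[n]
    efficiency = speedup / (n / ref)         # measured speedup over ideal speedup
    print(f"{n:4d} cores: speedup {speedup:7.1f}x, parallel efficiency {efficiency:.2f}")
```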
Thanks @minghangli-uni, I agree it looks like init is the culprit for poor scaling at high core count, so longer runs would be more representative (although if we only look at the run phase I guess things will still look much the same).
Could be interesting to also do a few tests with Sapphire Rapids with >48 cores to assess the performance difference from Cascade Lake. For that matter, at some point it could be useful to also test with Broadwell or Skylake (will be slower but may be cheaper since they have a lower SU charge rate).
FYI, the scaling plots from our 2020 GMD paper were generated from these files. These are from @marshallward and were done on the previous NCI system (Raijin) with very old and inefficient configurations, so they aren't very relevant anymore. Things have improved a lot since then, as you can see from the plot I showed at Tuesday's meeting - see the end of https://github.com/aekiss/notebooks/blob/master/run_summary.ipynb
FYI, Marshall's performance study of NCI's old Raijin system: https://github.com/marshallward/optiflop/blob/main/doc/microbench.rst
More relevant is @penguian's 2020 ACCESS-OM2 scaling study on Gadi.
Very nice study @minghangli-uni, thanks for preparing this. This gradual decrease in efficiency is similar to what I saw in an earlier study, although you are seeing a steeper drop. The general trend is different from MOM5, which would show high efficiency and then suddenly drop like a rock. (Disregard the NEMO curve; it was based on a rather old codebase.) We should talk more at a future Monday dev call (Tuesday in AEST).
Thank you for the suggestion. I will do some tests using these two partitions.
Hi @marshallward, thank you for sharing your scaling results. As you may see from my plots, each runtime/speedup/parallel-efficiency plot begins from 1 core, although that is not strictly a serial run. There is a significant drop in parallel efficiency shortly after the initial scaling from 1 core (or 6 cores). However, when the starting core count is set to 96 cores, the scaling appears to improve considerably. So I am curious about the rationale for not starting from a very low core count. Could it be that starting from a higher core count helps alleviate overheads and enables a more efficient distribution of the workload? The following 3 figures show the average time (fig 1), speedup (fig 2) and parallel efficiency (fig 3) for 4 major regions.
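The dependence on the reference point can be made concrete with a small sketch (the runtimes are hypothetical, purely to show how the choice of baseline changes the apparent efficiency):

```python
# Hypothetical runtimes (cores -> seconds), only to illustrate how the choice of
# reference core count changes the apparent parallel efficiency.
runtimes = {1: 40000.0, 96: 700.0, 384: 200.0}

def parallel_efficiency(n, ref, t=runtimes):
    return (t[ref] / t[n]) / (n / ref)

print(f"384 cores vs 1-core baseline:  {parallel_efficiency(384, 1):.2f}")   # looks poor
print(f"384 cores vs 96-core baseline: {parallel_efficiency(384, 96):.2f}")  # looks much better
```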
Thanks for the updated figures! As for why we started at such high numbers, I could say something about single-core jobs being heavily RAM-bound, rather than more cache-vs-communication bound at higher core counts... but the real reason was that we just couldn't fit those single-core jobs in memory, since the RAM per node was way too small. 🤪 If I were to try and explain the difference in performance from 1 to 96 cores, I would guess it is due to two competing effects.
Generally the trend will go something like this:
So if someone were to ask about the "scalability" of a model: do they mean the speedup relative to N=1? Or are they simply asking about the threshold for communication scaling? That is, how long can we throw cores at the problem until it becomes a waste of time/energy/hardware? I don't think there is a single answer to this question, and we certainly should strive to maximize the benefits of cache-bound (if not cycle-bound peak) performance. As long as one is clear about what they mean by "scalability", I don't think there is any harm in using one or the other as a reference point.

By the way: I think these figures are fantastic, and I hope you will continue to include low-N values in these studies. We could only speculate on how the model behaved in this regime, and it's amazing to see the entire trend of the model. (And the reference "efficiency" and "ideal speedup" were always imaginary concepts anyway. Don't let them get in the way of the actual measurements!)
@marshallward Thank you for the detailed explanations! I couldn't agree more about the competing effects as N increases.
From my understanding, if a configuration (e.g. 0.25deg) can fit within a low core count job, yet its efficiency looks poor relative to that low core count while looking much better relative to a higher baseline, the code is still gaining considerable benefit from parallelisation and making efficient use of computational resources; it simply does not scale well from the low-core-count reference. I agree with the point you raised: we should aim to maximise the benefits of cache-bound performance.
This comment documents profiling plots for low-N values, starting from 48 cores (1 node) for the Cascade Lake (CL) partition, and from 32 cores (1 node) for the Skylake (SL) partition, which has a lower SU charge rate. For both partitions, the overall parallel efficiency for MOM6 remains above 80% up to 2304 cores. Better performance is expected from the Cascade Lake partition.
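Since the queues differ in both speed and SU charge rate, a rough cost comparison can be useful. This is a sketch only; the charge rates below are assumptions to be checked against the current NCI Gadi documentation, and the walltimes are placeholders:

```python
# Rough SU-cost comparison between queues. The charge rates are assumed values
# (check the current NCI Gadi queue documentation); walltimes are placeholders.
charge_rate = {"cascadelake": 2.0, "skylake": 1.5}   # SU per core-hour (assumed)

def su_cost(cores, walltime_hours, queue):
    return cores * walltime_hours * charge_rate[queue]

# A hypothetical run that is somewhat slower on Skylake can still be cheaper in SUs:
print(su_cost(960, 5.0, "cascadelake"))   # 9600.0 SU
print(su_cost(960, 6.0, "skylake"))       # 8640.0 SU
```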
The existing compiler options used for compiling our code are not compatible with the Broadwell partition, resulting in illegal instruction errors (presumably because the build targets newer instruction sets, e.g. AVX-512, that Broadwell does not support).
The 0.25deg configuration currently supports a maximum of 2736 CPU cores (57 nodes of 48 cores on the Cascade Lake partition). Exceeding this limit results in a memory error. In contrast, the OM2 report indicates that core counts can be extended beyond 8000.
@minghangli-uni have you looked at the performance impact of the coupler timestep? I feel like it should be set to the ocean baroclinic timestep (equal to the sea ice dynamic timestep) to get the most accurate ocean-ice coupling and also the benefit of temporal interpolation of the surface stress from DATM. However, I'm not sure if that would impose a significant performance cost (either the direct cost of the coupler, or the cost of more frequent waiting for synchronisation between MOM and CICE if their loads are unbalanced).
I haven't yet checked the impact of the coupler timestep, though it's something I've been meaning to do. For the time being, the coupler timestep is set to match the baroclinic timestep. I will do a performance test as well as a scientific check by varying the coupler timestep for the 1deg configuration first, and then proceed to the 0.25deg.
This issue is to monitor the progress of the profiling and optimization efforts aimed at efficiently running the 0.25deg configuration.