Disabling hyper-threading #2965

@AntonMFernando-NOAA AntonMFernando-NOAA commented Sep 30, 2024

Description

Type of change

  • Bug fix (fixes something broken)
  • New feature (adds functionality)
  • Maintenance (code refactor, clean-up, new CI test, etc.)

Change characteristics

  • Is this a breaking change (a change in existing functionality)? NO
  • Does this change require a documentation update? NO
  • Does this change require an update to any of the following submodules? NO (If YES, please add a link to any PRs that are pending.)

How has this been tested?

  • Will be tested on HERA, HERCULES, and ORION

Checklist

  • Any dependent changes have been merged and published
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have documented my code, including function, input, and output descriptions
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • This change is covered by an existing CI test or a new one has been added
  • I have made corresponding changes to the system documentation if necessary

@AntonMFernando-NOAA changed the title from "Feature/hyper threading" to "Disabling hyper-threading" on Sep 30, 2024
@AntonMFernando-NOAA

AntonMFernando-NOAA commented Oct 7, 2024

@DavidHuber-NOAA GEFS test results. Some of the tests are still running. Will update once the tests are completed.

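| Job (runtime, s) | ORION h/t off | ORION h/t on | HERCULES h/t off | HERCULES h/t on | HERA h/t off | HERA h/t on |
|---|---|---|---|---|---|---|
| stage_ic | 55 | 54 | 42 | 36 | 26 | 29 |
| wave_init | 45 | 44 | 29 | 30 | 31 | 38 |
| prep_emiss | 46 | 46 | 36 | 39 | 22 | 22 |
| fcst_m1 | 1258 | 1254 | 934 | 936 | 1182 | 1186 |
| fcst_m2 | 3012 | 3015 | 2274 | 2266 | 2873 | 2879 |
| fcst_m3 | 3024 | 3007 | 2260 | 2273 | 2840 | 2848 |
| atmos_prd_mem | 139 | 146 | 95 | 93 | 110 | 110 |
| atmos_enstat | 34 | 34 | 20 | 21 | 39 | 27 |
| ocn_prod_mem | 62 | 39 | 27 | 48 | 33 | 29 |
| ice_prod_mem | 54 | 47 | 18 | 45 | 22 | 20 |
| wave_post_grid | 427 | 394 | 247 | 248 | 440 | 359 |
| wave_post_pnt | 4411 | 4364 | 7078 | 4063 | 4540 | 4445 |
| cleanup | 45 | 42 | 17 | 18 | 21 | 21 |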
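
One way to read the table above is as a relative change per job and machine. The following is a minimal sketch using two rows copied from the table (runtimes presumably in seconds); the helper name `pct_change` is illustrative and not part of the workflow.

```python
# (h/t off, h/t on) runtimes copied from two rows of the GEFS table.
runtimes = {
    "fcst_m2": {
        "ORION": (3012, 3015),
        "HERCULES": (2274, 2266),
        "HERA": (2873, 2879),
    },
    "wave_post_pnt": {
        "ORION": (4411, 4364),
        "HERCULES": (7078, 4063),
        "HERA": (4540, 4445),
    },
}


def pct_change(off, on):
    """Percent change in runtime with h/t off, relative to h/t on."""
    return 100.0 * (off - on) / on


for job, machines in runtimes.items():
    for machine, (off, on) in machines.items():
        print(f"{job:15s} {machine:9s} {pct_change(off, on):+6.1f}%")
```

With these numbers, the forecast rows change by well under 1%, while the Hercules wave_post_pnt entry stands out at roughly +74%.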

@AntonMFernando-NOAA

AntonMFernando-NOAA commented Oct 7, 2024

GFS test results

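| Job (runtime, s) | ORION h/t off | ORION h/t on | HERCULES h/t off | HERCULES h/t on | HERA h/t off | HERA h/t on |
|---|---|---|---|---|---|---|
| gfsstage_ic | 58 | 48 | 45 | 45 | 22 | 31 |
| gfsfcst | 2903 | 2914 | 2185 | 2184 | 2750 | 2786 |
| gfsatmos_prod | 161 | 181 | 113 | 110 | 125 | 129 |
| gfstracker | - | - | - | - | 68 | 72 |
| gfsgenesis | - | - | - | - | 460 | 449 |
| gfsmetp | - | - | - | - | 84 | 87 |
| gfsarch | 96 | 96 | 64 | 74 | 158 | 121 |
| gfscleanup | 26 | 30 | 27 | 27 | | |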

@DavidHuber-NOAA

@DavidHuber-NOAA GEFS test results. Some of the tests are still running. Will update once the tests are completed.

ORION ORION HERCULES HERCULES HERA HERA
h/t off h/t on h/t off h/t on h/t off h/t on
...
wave_post_pnt 4411 4364 7078 4063 4540
cleanup 45 42 17 18 21

@AntonMFernando-NOAA Would you mind running the wave_post_pnt job again with hyper-threading disabled on Hercules just to verify that runtime?

@AntonMFernando-NOAA

AntonMFernando-NOAA commented Oct 8, 2024

I reran wave_post_pnt with h/t disabled on HERCULES, and it took 3197s.

@AntonMFernando-NOAA

AntonMFernando-NOAA commented Oct 9, 2024

@DavidHuber-NOAA cyc test results

| Job (runtime, s) | HERCULES h/t on | HERCULES h/t off |
|---|---|---|
| gdasstage_ic | 44 | 44 |
| gdasfcst | 323 | 322 |
| gdasatmos_prod | 66 | 66 |
| enkfgdasechgres | 42 | 27 |
| enkfgdasstage_ic | 19 | 15 |
| enkfgdasfcst | 258 | 243 |
| enkfgdasepmn | 36 | 37 |
| gdasprep | 137 | 136 |
| gdasanal | 921 | 917 |
| gdassfcanl | 37 | 39 |
| gdasanalcalc | 76 | 88 |
| gdasanaldiag | 118 | 121 |
| gdasatmanlupp | 51 | 51 |
| gdasatmanlprod | 68 | 64 |
| gdasfcst | 407 | 397 |
| gdasatmos_prod | 67 | 69 |
| gdasfit2obs | 17 | 17 |
| gdasverfozn | 53 | 43 |
| gdasverfrad | 665 | 576 |
| gdasvminmon | 22 | 20 |
| gdasarch | 45 | 42 |
| gdascleanup | 176 | 26 |
| enkfgdaseobs | 426 | 416 |
| enkfgdaseupd | 147 | 153 |
| enkfgdasechgres | 39 | 33 |
| enkfgdasediag | 137 | 135 |
| enkfgdasecmn | 47 | 37 |
| enkfgdasesfc | 115 | 118 |
| enkfgdasfcst | 330 | 309 |
| enkfgdasepmn | 31 | 30 |
| enkfgdaseamn | 15 | 16 |
| enkfgdascleanup | 158 | 26 |
| gfsprep | 132 | 127 |
| gfsanal | 561 | 540 |
| gfssfcanl | 40 | 40 |
| gfsanalcalc | 87 | 93 |
| gfsatmanlupp | 40 | 39 |
| gfsatmanlprod | 110 | 111 |
| gfsfcst | 3134 | 3115 |
| gfsatmos_prod | 111 | 110 |
| gfsvminmon | 112 | 20 |
| gfsmetp | 335 | 101 |
| gfsarch | 56 | 73 |
| gfscleanup | 23 | 27 |
| gdas_prep | 115 | 112 |
| gdas_anal | 1069 | 1060 |
| gdas_sfcanl | 40 | 42 |
| gdas_analcalc | 75 | 96 |
| gdas_analdiag | 117 | 121 |
| gdas_atmanlupp | 62 | 46 |
| gdas_atmanlprod | 66 | 66 |
| gdas_fcst | 401 | 395 |
| gdas_atmos_prod | 67 | 66 |
| gdas_fit2obs | 178 | 18 |
| gdas_verfozn | 297 | 40 |
| gdas_verfrad | 1333 | 671 |
| gdas_vminmon | 132 | 17 |
| gdas_arch | 19 | 19 |
| gdas_cleanup | 22 | 22 |
| enkfgdas_eobs | 422 | 413 |
| enkfgdas_eupd | 140 | 142 |
| enkfgdas_echgres | 30 | 35 |
| enkfgdas_ediag | 133 | 130 |
| enkfgdas_ecmn | 37 | 47 |
| enkfgdas_esfc | 115 | 117 |
| enkfgdas_fcst | 313 | 305 |
| enkfgdas_epmn | 33 | 31 |
| enkfgdas_eamn | 16 | 16 |
| enkfgdas_cleanup | 21 | 23 |
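
With this many rows, it may help to flag only the jobs whose runtimes differ substantially between the two settings. The sketch below uses a handful of rows copied from the Hercules table above; the helper `flag_large_changes` and the factor-of-two threshold are arbitrary, illustrative choices rather than anything the workflow defines.

```python
# (job, h/t on, h/t off) runtimes copied from a few rows of the Hercules table.
rows = [
    ("gdasfcst", 323, 322),
    ("gdascleanup", 176, 26),
    ("gfsmetp", 335, 101),
    ("gfsarch", 56, 73),
    ("gdas_fit2obs", 178, 18),
]


def flag_large_changes(rows, factor=2.0):
    """Return job names whose runtimes differ by at least `factor`.

    The comparison is symmetric (slower or faster), and runtimes are
    assumed to be positive.
    """
    flagged = []
    for job, on, off in rows:
        ratio = max(on, off) / min(on, off)
        if ratio >= factor:
            flagged.append(job)
    return flagged


print(flag_large_changes(rows))
```

Single-run timings on a shared system are noisy, so a flagged job is a candidate for a rerun rather than evidence of a real effect.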

@AntonMFernando-NOAA

@DavidHuber-NOAA The tests are complete. Any comments?

@DavidHuber-NOAA

DavidHuber-NOAA commented Oct 11, 2024

It's interesting that the cleanup jobs sped up without hyperthreading: they do not issue srun commands, so there shouldn't be any difference. The other jobs that show differences are also interesting (fit2obs, verfozn, verfrad, and metp). I think MET may make use of hyperthreading, but I will take a look at those logs. I doubt the verf or fit2obs jobs do, though.

Can you do a file comparison on the GRIB2 products as well to ensure they are all identical? Just a cmp should be fine. They should be in the ARCDIR for both the hyperthreaded and non-hyperthreaded cases.
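
The byte-for-byte comparison requested above could be scripted roughly as follows. This is a sketch only: the function name `compare_dirs` is hypothetical, and the caller has to supply the two ARCDIR paths and, if desired, a filename pattern that matches the GRIB2 products. It relies on `filecmp.cmp` with `shallow=False`, which compares file contents in the same spirit as `cmp`.

```python
import filecmp
from pathlib import Path


def compare_dirs(dir_a, dir_b, pattern="*"):
    """Compare files under dir_a against same-named files under dir_b.

    Returns (differing, missing): relative paths whose contents differ,
    and relative paths present in dir_a but absent from dir_b.
    """
    dir_a, dir_b = Path(dir_a), Path(dir_b)
    differing, missing = [], []
    for file_a in sorted(dir_a.rglob(pattern)):
        if not file_a.is_file():
            continue  # skip directories matched by the pattern
        rel = file_a.relative_to(dir_a)
        file_b = dir_b / rel
        if not file_b.is_file():
            missing.append(str(rel))
        elif not filecmp.cmp(file_a, file_b, shallow=False):
            differing.append(str(rel))
    return differing, missing
```

Calling `compare_dirs(arcdir_ht_on, arcdir_ht_off)` with the two archive directories should return two empty lists if every product is identical. The check is one-directional, so files that exist only in the second directory are not reported.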

@AntonMFernando-NOAA marked this pull request as ready for review October 11, 2024 20:15
@AntonMFernando-NOAA

AntonMFernando-NOAA commented Oct 11, 2024

@WalterKolczynski-NOAA @DavidHuber-NOAA I compared the GRIB2 files in the ARCDIRs with and without h/t and confirmed that they are identical. I think this is now ready for review.

@DavidHuber-NOAA DavidHuber-NOAA left a comment

Changes look good on Hercules. Disabling hyperthreading may give a speed boost for a few jobs, which is interesting. The METplus logs mention using up to 80 tasks/node with 8 threads in both the hyperthreaded and non-hyperthreaded cases. This is both interesting and a little disturbing, but does not seem to tie into hyperthreading. I did not look into the other jobs that showed significant differences in runtime (Fit2Obs, verfozn, and verfrad).
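
For context, Slurm provides a `--hint=nomultithread` option on `srun` and `sbatch` that asks the scheduler not to use the extra hardware threads of each core. This thread does not state whether the PR relies on that option, so the following is only a sketch of how a launcher command might be assembled; the helper `build_srun`, the executable name, and the task counts are hypothetical.

```python
import shlex


def build_srun(executable, ntasks, cpus_per_task=1, hyperthreading=False):
    """Assemble an srun command line as a list of arguments.

    When hyperthreading is False, Slurm's --hint=nomultithread option is
    added so that tasks are not placed on extra hardware threads.
    """
    cmd = ["srun", f"--ntasks={ntasks}", f"--cpus-per-task={cpus_per_task}"]
    if not hyperthreading:
        cmd.append("--hint=nomultithread")
    cmd.append(executable)
    return cmd


# Hypothetical executable and task count, for illustration only.
print(shlex.join(build_srun("./my_exec.x", ntasks=80)))
```

Keeping the hint behind a single flag makes it straightforward to produce matching h/t on and h/t off runs for timing comparisons like the ones in this thread.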

@emcbot added and removed the CI-Hera-Failed label on Oct 25, 2024
@emcbot

emcbot commented Oct 25, 2024

CI Failed on Hera in Build# 3
Built and ran in directory /scratch1/NCEPDEV/global/CI/2965


Experiment C96C48_hybatmaerosnowDA_3addff12 Terminated with 0
FAIL
FAIL tasks failed and 2 dead at Thu Oct 24 16:36:58 UTC 2024
Experiment C96C48_hybatmaerosnowDA_3addff12 Terminated: *FAIL*
Error logs:
/scratch1/NCEPDEV/global/CI/2965/RUNTESTS/COMROOT/C96C48_hybatmaerosnowDA_3addff12/logs/2021122018/gdas_aeroanlinit.log
/scratch1/NCEPDEV/global/CI/2965/RUNTESTS/COMROOT/C96C48_hybatmaerosnowDA_3addff12/logs/2021122018/gfs_aeroanlinit.log
Experiment C96_atm3DVar_3addff12 Terminated with  tasks failed and  dead at Thu Oct 24 19:12:49 UTC 2024
Experiment C96_atm3DVar_3addff12 Terminated: **
Experiment C96C48_hybatmDA_3addff12 Terminated with  tasks failed and  dead at Thu Oct 24 19:12:50 UTC 2024
Experiment C96C48_hybatmDA_3addff12 Terminated: **
Experiment C48mx500_3DVarAOWCDA_3addff12 Completed 2 Cycles: *SUCCESS* at Thu Oct 24 22:21:17 UTC 2024
Experiment C96_S2SWA_gefs_replay_ics_3addff12 Completed 1 Cycles: *SUCCESS* at Fri Oct 25 06:03:58 UTC 2024
Experiment C48_ATM_3addff12 Completed 2 Cycles: *SUCCESS* at Fri Oct 25 06:22:09 UTC 2024
Experiment C48_S2SW_3addff12 Completed 2 Cycles: *SUCCESS* at Fri Oct 25 06:26:00 UTC 2024
Experiment C48_S2SWA_gefs_3addff12 Completed 1 Cycles: *SUCCESS* at Fri Oct 25 07:41:40 UTC 2024
Experiment C96C48_ufs_hybatmDA_3addff12 Completed 3 Cycles: *SUCCESS* at Fri Oct 25 10:12:25 UTC 2024

@DavidHuber-NOAA removed the CI-Hera-Failed label on Oct 28, 2024
@DavidHuber-NOAA

I believe this PR is ready to be tested again. Launching on Hera.

@DavidHuber-NOAA added the CI-Hera-Ready label on Oct 28, 2024
@emcbot added the CI-Hera-Building, CI-Hera-Running, and CI-Hera-Passed labels and removed the CI-Hera-Ready, CI-Hera-Building, and CI-Hera-Running labels on Oct 28, 2024
@emcbot

emcbot commented Oct 29, 2024

CI Passed on Hera in Build# 1
Built and ran in directory /scratch1/NCEPDEV/global/CI/2965


Experiment C48mx500_3DVarAOWCDA_8d166b87 Completed 2 Cycles: *SUCCESS* at Mon Oct 28 22:36:56 UTC 2024
Experiment C48_ATM_8d166b87 Completed 2 Cycles: *SUCCESS* at Mon Oct 28 22:42:45 UTC 2024
Experiment C96_S2SWA_gefs_replay_ics_8d166b87 Completed 1 Cycles: *SUCCESS* at Mon Oct 28 22:43:06 UTC 2024
Experiment C96_atm3DVar_8d166b87 Completed 3 Cycles: *SUCCESS* at Mon Oct 28 23:56:45 UTC 2024
Experiment C48_S2SWA_gefs_8d166b87 Completed 1 Cycles: *SUCCESS* at Tue Oct 29 00:16:24 UTC 2024
Experiment C48_S2SW_8d166b87 Completed 2 Cycles: *SUCCESS* at Tue Oct 29 00:27:10 UTC 2024
Experiment C96C48_hybatmDA_8d166b87 Completed 3 Cycles: *SUCCESS* at Tue Oct 29 00:27:14 UTC 2024
Experiment C96C48_hybatmaerosnowDA_8d166b87 Completed 3 Cycles: *SUCCESS* at Tue Oct 29 00:33:17 UTC 2024
Experiment C96C48_ufs_hybatmDA_8d166b87 Completed 3 Cycles: *SUCCESS* at Tue Oct 29 01:03:56 UTC 2024

@WalterKolczynski-NOAA added the CI-Hercules-Ready and CI-Wcoss2-Building labels on Oct 29, 2024
@emcbot added the CI-Hercules-Building and CI-Hercules-Running labels and removed the CI-Hercules-Ready and CI-Hercules-Building labels on Oct 29, 2024
@WalterKolczynski-NOAA added the CI-Wcoss2-Running label and removed the CI-Wcoss2-Building label on Oct 29, 2024
@emcbot added the CI-Hercules-Passed label and removed the CI-Hercules-Running label on Oct 29, 2024
@emcbot

emcbot commented Oct 29, 2024

CI Passed on Hercules in Build# 2
Built and ran in directory /work2/noaa/global/CI/HERCULES/2965


Experiment C48_ATM_8d166b87 Completed 2 Cycles: *SUCCESS* at Tue Oct 29 03:24:15 CDT 2024
Experiment C96_S2SWA_gefs_replay_ics_8d166b87 Completed 1 Cycles: *SUCCESS* at Tue Oct 29 03:48:35 CDT 2024
Experiment C96_atm3DVar_8d166b87 Completed 3 Cycles: *SUCCESS* at Tue Oct 29 05:07:10 CDT 2024
Experiment C96C48_hybatmDA_8d166b87 Completed 3 Cycles: *SUCCESS* at Tue Oct 29 05:07:17 CDT 2024
Experiment C48_S2SW_8d166b87 Completed 2 Cycles: *SUCCESS* at Tue Oct 29 05:49:39 CDT 2024
Experiment C48_S2SWA_gefs_8d166b87 Completed 1 Cycles: *SUCCESS* at Tue Oct 29 05:50:36 CDT 2024

@WalterKolczynski-NOAA added the CI-Wcoss2-Passed label and removed the CI-Wcoss2-Running label on Oct 30, 2024
@WalterKolczynski-NOAA WalterKolczynski-NOAA merged commit fc2c5ea into NOAA-EMC:develop Oct 30, 2024
5 checks passed
Labels
CI-Hera-Passed, CI-Hercules-Passed, CI-Wcoss2-Passed
Development

Successfully merging this pull request may close these issues.

Explicitly disable hyper-threading on Hera, Hercules, and Orion
5 participants