
Visible Nagios alerts for available infra not meeting Temurin minimum SLA? #3372

Closed
andrew-m-leonard opened this issue Feb 7, 2024 · 26 comments
Labels: Nagios (Nagios monitoring issues)

@andrew-m-leonard (Contributor)

I'd like to see a visible alert if the number of infra nodes does not meet a defined SLA.
e.g. if the number of "online" ci.role.test&&aarch64 nodes is < 6, then raise a BIG alert on Slack that isn't hidden by 100s of other Nagios alerts.

@steelhead31 (Contributor)

Moved into iteration 4, whilst we scope the requirements.

@sxa (Member) commented Feb 7, 2024

Re the high CPU alerts, I think we can probably strip them down a bit. The only time I've seen a real problem was when the ppc64le boxes shot up to a level way in excess of what I would expect. I would tentatively propose that we set it to alert if the CPU usage is over 200% for 5 minutes (or perhaps over 90% for a 24-hour period, but I'm not sure if that's feasible).

@sxa (Member) commented Feb 7, 2024

We'll also need to consider how useful the memory warnings are, and specifically whether they are something we need to take any action on - we're getting about one of those every 1-2 weeks:

  • Nov 15: HOST: test-azure-win2012r2-x64-3 SERVICE: Memory Usage STATE: WARNING MESSAGE: Memory usage: total:6186.16 MB - used: 5449.47 MB (88%) - free: 736.69 MB (12%)
  • Nov 23: HOST: test-azure-win2012r2-x64-3 SERVICE: Memory Usage STATE: WARNING MESSAGE: Memory usage: total:6234.91 MB - used: 5459.21 MB (88%) - free: 775.70 MB (12%)
  • Dec 3: HOST: test-azure-win2012r2-x64-3 SERVICE: Memory Usage STATE: WARNING MESSAGE: Memory usage: total:5836.37 MB - used: 4786.06 MB (82%) - free: 1050.31 MB (18%)
  • Dec 5: HOST: test-azure-win2012r2-x64-3 SERVICE: Memory Usage STATE: WARNING MESSAGE: Memory usage: total:6416.96 MB - used: 5264.50 MB (82%) - free: 1152.46 MB (18%)
  • Dec 14 (Critical): HOST: build-ibmcloud-win2012r2-x64-2 SERVICE: Memory Usage STATE: CRITICAL MESSAGE: CRITICAL - Socket timeout
  • Jan 5: (Critical): HOST: build-azure-win2022-x64-2 SERVICE: Memory Usage STATE: CRITICAL MESSAGE: CRITICAL - Socket timeout
  • Feb 7: HOST: test-azure-win11-aarch64-1 SERVICE: Memory Usage STATE: WARNING MESSAGE: Memory usage: total:9695.75 MB - used: 7895.15 MB (81%) - free: 1800.60 MB (19%)
The critical ones are potentially not directly memory related (unless the machine had gone critical because of memory starvation).
test-azure-win2012r2-x64-3, which had most of the memory issues, no longer exists, so we won't see those again. OK, I've convinced myself we can leave the memory warnings as-is :-)

@sxa (Member) commented Feb 7, 2024

We've got a lot of these on the macincloud machines, which is concerning and likely needs remediation if it's correct for the macOS file system:
HOST: test-macincloud-macos1201-x64-1 SERVICE: Disk Space Root Partition STATE: WARNING MESSAGE: DISK WARNING - free space: / 16525 MiB (13.47% inode=100%)
HOST: test-macincloud-macos1201-x64-2 SERVICE: Disk Space Root Partition STATE: WARNING MESSAGE: DISK WARNING - free space: / 23837 MiB (19.43% inode=100%)

I'll raise an issue ...

@sxa (Member) commented Feb 7, 2024

We don't currently have a specific SLA, so I'm not sure we can be considered as not meeting one. We have a few actions here (noting that we already have some rules in Nagios checking for percentages of available machines, total numbers, etc., but these need to be enhanced). Plan of action:

  • Define the checks we want (and document them!)
  • Implement the checks
  • Confirm that everyone who needs the information is happy that #infrastructure-bot is producing useful information

@sxa (Member) commented Feb 7, 2024

Current checks and thresholds in Nagios. Note that this is just the raw info, without much extra detail, to give an idea of the sort of things being checked, as input into this discussion.

	check_command			check_local_disk!20%!10%!/
	check_command			check_local_users!10!15
	check_command			check_local_load!5.0,4.0,3.0!10.0,6.0,4.0
	check_command			check_local_swap!20!10
	check_command			check_ssh
	check_command			check_http
        check_command                   check_local_mem!15!5
        check_command			check_local_apt
#        check_command                   check_label!arm&&build!80!66
#        check_command                   check_label!arm&&ci.role.test!30!10
        check_command                   check_label!hw.arch.aarch32&&ci.role.test&&sw.os.linux!30!10
#        check_command                   check_label!ubuntu&&fpm!99!98
        check_command                   check_label!build&&linux&&s390x!75!30
        check_command                   check_label!test&&linux&&s390x!75!30
        check_command                   check_label!build&&windows&&x64!75!30
        check_command                   check_label!test&&windows&&x64!75!30
        check_command                   check_label!build&&openj9&&linux&&s390x!75!30
        check_command                   check_label!macos10.14&&build&&mac&&x64!75!30
        check_command                   check_label!mac&&macos10.14&&xcode10!65!33
        check_command                   check_label!ci.role.test&&sw.os.linux&&hw.arch.riscv!65!33
        check_command                   check_label!ci.role.test&&sw.os.alpine-linux&&hw.arch.aarch64!65!33
        check_command                   check_inventory
        check_command                   check_nagios_sync
        check_command                   check_label!wix!75!30
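
For context, check_label here is presumably a custom command backed by a script that asks Jenkins how many agents carrying a given label expression are online, then applies warning/critical percentage thresholds. A minimal sketch of that idea as a Nagios-style plugin is below; this is not the actual plugin used by the Adoptium Nagios server, and the Jenkins URL is only an assumed placeholder.

    #!/usr/bin/env python3
    # Hypothetical sketch: warn/alert when the percentage of online Jenkins
    # agents matching a label expression drops below given thresholds.
    import json
    import sys
    import urllib.request

    JENKINS_URL = "https://ci.adoptium.net"  # placeholder, not necessarily the real endpoint


    def label_status(label_expr, warn_pct, crit_pct):
        # Fetch all agents and their labels from the Jenkins REST API.
        with urllib.request.urlopen(f"{JENKINS_URL}/computer/api/json") as resp:
            computers = json.load(resp)["computer"]

        wanted = set(label_expr.split("&&"))
        matching = [c for c in computers
                    if wanted <= {l["name"] for l in c.get("assignedLabels", [])}]
        online = [c for c in matching if not c["offline"]]

        pct = 100 * len(online) / len(matching) if matching else 0
        msg = f"{label_expr}: {len(online)}/{len(matching)} agents online ({pct:.0f}%)"

        # Standard Nagios plugin exit codes: 0=OK, 1=WARNING, 2=CRITICAL.
        if pct <= crit_pct:
            print(f"CRITICAL - {msg}")
            return 2
        if pct <= warn_pct:
            print(f"WARNING - {msg}")
            return 1
        print(f"OK - {msg}")
        return 0


    if __name__ == "__main__":
        # e.g. ./check_label.py "ci.role.test&&hw.arch.aarch64" 75 30
        expr, warn, crit = sys.argv[1], float(sys.argv[2]), float(sys.argv[3])
        sys.exit(label_status(expr, warn, crit))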

@steelhead31 (Contributor)

I think the above covers it; the specific infrastructure checks based on the number of agents online in Jenkins with certain labels are shown below:

All of these count the total number of machines with the labels listed, and then warn and alert at various thresholds...

1) Check: Arm32 Linux Test Machines
   Labels: ci.role.test & sw.os.linux & hw.arch.aarch32
   Warn : 30% Online, Alert : 10% Online

2) Check: s390x Linux Build Machines
   Labels: build & linux & s390x
   Warn : 75% Online, Alert : 30% Online

3) Check: s390x Linux Test Machines
   Labels: test & linux & s390x
   Warn : 75% Online, Alert : 30% Online

4) Check: x64 Windows Build Machines
   Labels: build & windows & x64
   Warn : 75% Online, Alert : 30% Online

5) Check: x64 Windows Test Machines
   Labels: test & windows & x64
   Warn : 75% Online, Alert : 30% Online

6) Check: s390x build openj9 linux Machines
   Labels: build & linux & s390x & openj9
   Warn : 75% Online, Alert : 30% Online

7) Check: x64 Macos10.14 Build Machines
   Labels: macos10.14 & build & mac & x64
   Warn : 75% Online, Alert : 30% Online

8) Check: Macos10.14 with xcode10 Machines
   Labels: macos10.14 & mac & xcode10
   Warn : 65% Online, Alert : 33% Online

9) Check: Risc-V 64 test Machines
   Labels: ci.role.test & sw.os.linux & hw.arch.riscv
   Warn : 65% Online, Alert : 33% Online

10) Check: Alpine ARM64 Test Machines
    Labels: ci.role.test & sw.os.alpine-linux & hw.arch.aarch64
    Warn : 65% Online, Alert : 33% Online

11) Check: Wix Machines
    Labels: wix
    Warn : 75% Online, Alert : 30% Online

@steelhead31 (Contributor)

A complete list of checks across all the infra can be found on this link:
https://nagios.adoptopenjdk.net/nagios/cgi-bin/status.cgi?host=all

A complete list of the current warnings and alerts can be found on this link:
https://nagios.adoptopenjdk.net/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28

@steelhead31 (Contributor)

The base templates used by the automated configuration for nodes can be seen here:

https://github.com/adoptium/infrastructure/tree/master/ansible/playbooks/nagios/roles/Nagios_Config/files/templates

These can then be customised post installation, should that be required.

@andrew-m-leonard (Contributor, Author)

> [quoted: @steelhead31's list of label-based checks and thresholds above]

Thanks @steelhead31, this looks great. I'll review it, cheers.

@andrew-m-leonard (Contributor, Author)

@steelhead31 how often does Nagios "poll" for these thresholds?

@andrew-m-leonard (Contributor, Author)

Initial review @steelhead31

These can be removed, as they are replaced by dynamic Orka:

7) Check: x64 Macos10.14 Build Machines
   Labels: macos10.14 & build & mac & x64
   Warn : 75% Online, Alert : 30% Online

8) Check: Macos10.14 with xcode10 Machines
   Labels: macos10.14 & mac & xcode10
   Warn : 65% Online, Alert : 33% Online

Corrections?

ci.role.test&&hw.arch.x86&&sw.os.windows:  Warn : 75% Online, Alert : 30% Online

ci.role.test&&hw.arch.s390x&&sw.os.linux: Warn : 75% Online, Alert : 30% Online

Need?

build&&linux&&aarch64&&dockerBuild: Warn : 51% Online, Alert : 30% Online
ci.role.test&&sw.os.linux&&hw.arch.aarch64: Warn : 75% Online, Alert : 30% Online

build&&linux&&x64&&dockerBuild: Warn : 51% Online, Alert : 30% Online
ci.role.test&&hw.arch.x86&&sw.os.linux: Warn : 75% Online, Alert : 30% Online

build&&alpine-linux&&x64&&dockerBuild: Warn : 51% Online, Alert : 30% Online
ci.role.test&&hw.arch.x86&&sw.os.alpine-linux: Warn : 75% Online, Alert : 30% Online

build&&alpine-linux&&aarch64&&dockerBuild: Warn : 51% Online, Alert : 30% Online
ci.role.test&&hw.arch.aarch64&&sw.os.alpine-linux: Warn : 75% Online, Alert : 30% Online

aix720&&build&&aix&&ppc64: Warn : 51% Online, Alert : 30% Online
ci.role.test&&hw.arch.ppc64&&sw.os.aix&&sw.os.aix.7_2: Warn : 75% Online, Alert : 30% Online

build&&linux&&ppc64le&&dockerBuild: Warn : 51% Online, Alert : 30% Online
ci.role.test&&hw.arch.ppc64le&&sw.os.linux: Warn : 75% Online, Alert : 30% Online

build&&windows&&x86-32: Warn : 51% Online, Alert : 30% Online

@sxa (Member) commented Feb 8, 2024

@andrew-m-leonard To be clear, are your "corrections?" ones purely to change the labels on the existing checks, and is "need?" your proposal for things to add?

@steelhead31 (Contributor)

> @steelhead31 how often does Nagios "poll" for these thresholds?

Daily as it stands, though it's configurable... more often = more noise though :)
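
(For illustration only: in Nagios the polling frequency is the check_interval directive on the service definition; with the default interval_length of 60 seconds, 1440 means once a day. The host and service names below are placeholders rather than the live Adoptium config.)

    define service {
        use                  generic-service
        host_name            nagios-server                               ; placeholder
        service_description  Jenkins label ci.role.test&&hw.arch.aarch64 ; placeholder
        check_command        check_label!ci.role.test&&hw.arch.aarch64!75!30
        check_interval       1440    ; minutes between checks (daily)
        retry_interval       60      ; re-check after an hour when not OK
    }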

@andrew-m-leonard (Contributor, Author)

> @andrew-m-leonard To be clear, are your "corrections?" ones purely to change the labels on the existing checks, and is "need?" your proposal for things to add?

Yes please, correct.

@steelhead31 (Contributor)

One click overview of current issues: https://nagios.adoptopenjdk.net/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=16&hoststatustypes=15

@andrew-m-leonard (Contributor, Author)

> One click overview of current issues: https://nagios.adoptopenjdk.net/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=16&hoststatustypes=15

I like this view @steelhead31, that could be my go-to place for critical alerts.
If we could perhaps:

@sxa (Member) commented Feb 9, 2024

> * Make CurrentLoad and CPULoad alerts be more likely a Warning, maybe only Critical if say 99% for 6 hours?

To my mind:

CRITICAL - load average: 212.13, 161.58, 167.15 

is actually something we'd want to address, so it should be flagged, since the machine is overworked and hitting issues as a result (we've spotted the issue here and it will be dealt with under #3375).

But something like

CPU Load 99% (5 min average) 

isn't a real concern (in fact it's good that the jobs are making full use of the machines!), so it's something we should certainly avoid posting to Slack.

@steelhead31 (Contributor)

Adjust the Jenkins warning and alert thresholds for the jobs & workspace partitions to warn at 95% and go critical at 98%, due to the large filesystems.

@steelhead31 (Contributor)

@sxa & @andrew-m-leonard, I've rationalised the discussion above into an easy-to-follow list; I'll implement it when I'm back on Monday. Sadly I'm fairly limited in what I can do regarding Unix machine CPU load, but I think my proposed change of running the check with a slightly different setup should hopefully be more useful, and we can always review and adjust the thresholds again.

  1. Machine Load

Windows : Warn if over 90% for 12 hours (720), Critical if over 90% for 18 hours (1080)
Unix : Warn if the 1, 5 & 15 minute averages are above 95%, 90%, 85% across all cores
Critical if the 1, 5 & 15 minute averages are above 100%, 99%, 90%

Calculated by dividing the load average by the number of CPUs (1.0 is fully loaded); see the check_load sketch after this list.

  2. Label Checks (runs on the Nagios server against Jenkins)

Remove These Checks:

2.01) Check: x64 Macos10.14 Build Machines
Labels: macos10.14 & build & mac & x64
Warn : 75% Online, Alert : 30% Online

2.02) Check: Macos10.14 with xcode10 Machines
Labels: macos10.14 & mac & xcode10
Warn : 65% Online, Alert : 33% Online

Correct These Checks:

2.03) ci.role.test&&hw.arch.x86&&sw.os.windows: Warn : 75% Online, Alert : 30% Online

2.04) ci.role.test&&hw.arch.s390x&&sw.os.linux: Warn : 75% Online, Alert : 30% Online

Add These Checks:

2.05) build&&linux&&aarch64&&dockerBuild: Warn : 51% Online, Alert : 30% Online
2.06) ci.role.test&&sw.os.linux&&hw.arch.aarch64: Warn : 75% Online, Alert : 30% Online
2.07) build&&linux&&x64&&dockerBuild: Warn : 51% Online, Alert : 30% Online
2.08) ci.role.test&&hw.arch.x86&&sw.os.linux: Warn : 75% Online, Alert : 30% Online
2.09) build&&alpine-linux&&x64&&dockerBuild: Warn : 51% Online, Alert : 30% Online
2.10) ci.role.test&&hw.arch.x86&&sw.os.alpine-linux: Warn : 75% Online, Alert : 30% Online
2.11) build&&alpine-linux&&aarch64&&dockerBuild: Warn : 51% Online, Alert : 30% Online
2.12) ci.role.test&&hw.arch.aarch64&&sw.os.alpine-linux: Warn : 75% Online, Alert : 30% Online
2.13) aix720&&build&&aix&&ppc64: Warn : 51% Online, Alert : 30% Online
2.14) ci.role.test&&hw.arch.ppc64&&sw.os.aix&&sw.os.aix.7_2: Warn : 75% Online, Alert : 30% Online
2.15) build&&linux&&ppc64le&&dockerBuild: Warn : 51% Online, Alert : 30% Online
2.16) ci.role.test&&hw.arch.ppc64le&&sw.os.linux: Warn : 75% Online, Alert : 30% Online
2.17) build&&windows&&x86-32: Warn : 51% Online, Alert : 30% Online

  3. Adjust the Jenkins Server Disk Space Thresholds

On the jobs & workspace partitions, warn at 95% and go critical at 98%, due to the large filesystems involved.
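
For the Unix load thresholds in item 1, a minimal sketch of the equivalent monitoring-plugins invocation might be the line below (the -r flag divides the load averages by the number of cores, so 1.0 means fully loaded); this is illustrative rather than the exact command that will be deployed:

    check_load -r -w 0.95,0.90,0.85 -c 1.00,0.99,0.90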

@sxa (Member) commented Feb 15, 2024

> Unix : Warn if the 1, 5 & 15 minute averages are above 95%, 90%, 85% across all cores
> Critical if the 1, 5 & 15 minute averages are above 100%, 99%, 90%
>
> Calculated by dividing the load average by the number of CPUs (1.0 is fully loaded)

I honestly still think that'll be too noisy for non-dockerhost systems, since it'll trigger on every run of the system test suites (and probably more). If we're using load/cores then we should use higher numbers, e.g. warn if over 110% for 15 minutes, critical if over maybe 150% over either 5 or 15 minutes. I'm running a sanity.system on a 2-core system and the 1-minute load went above 2 just when building the material (up to about 2.5). During the actual test run at step 5 of TestJlmRemoteClassAuth_0 I was seeing numbers like 19.79, 9.87, 4.04, so that would trigger a critical alert regardless of what we use on a 2-core system (although the individual CPUs are showing only around 50% usage just now).

[EDIT: For a 4-core system I'm seeing similar figures during the test: 21.14, 13.83, 6.25, so that would still blow a "150% over 15 minutes" check]
[EDIT 2: That job, a bit further down the line, got to a load reading of 21.38, 18.99, 12.09. I think for the non-dockerhost systems we need to start by disabling the posting of the CPU alerts to Slack (or disabling them completely if there's no way of generating the alert in the UI without posting to Slack)]

However for dockerhost systems, the values you suggest are likely a reasonable starting point.

@steelhead31 (Contributor)

Update 1:
Machine load parameters have all been adjusted.
For Windows machines: Warn if over 90% for 12 hours (720), Critical if over 90% for 18 hours (1080)
For Unix/dockerhost machines, for the entire machine (all cores), the 1, 5 & 15 minute average thresholds are:
warn = 95%, 90%, 85%
critical = 100%, 99%, 90%

@steelhead31 (Contributor)

Update 2:
Checks have been removed/modified and added as per request, now visible in Nagios.

@steelhead31 (Contributor)

Update 3:
Jenkins disk space thresholds have been adjusted.
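
(For reference, with the check_local_disk command shown earlier, whose arguments are the minimum free-space warning and critical percentages plus a mount point, warning at 95% used and going critical at 98% used would correspond to something like the line below; the mount point is purely illustrative.)

    check_command    check_local_disk!5%!2%!/jenkins-workspace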

@steelhead31 (Contributor)

Tweaked Solaris & SLES load thresholds to:
check_load -r -w 3.00, 2.5, 1.50 -c 4.00, 3.00, 2.00

steelhead31 moved this from Todo to In Progress in 2024 1Q Adoptium Plan, Feb 19, 2024
@steelhead31 (Contributor)

All works completed. New issues should be raised for future improvement.

github-project-automation bot moved this from In Progress to Done in 2024 1Q Adoptium Plan, Feb 19, 2024
sxa added this to the 2024-02 (February) milestone, Feb 20, 2024
sxa added the Nagios label, Feb 20, 2024