
Slack alerts #557

Open · 3 tasks
fergusL opened this issue May 26, 2022 · 2 comments

fergusL (Contributor) commented May 26, 2022

Once Huntsman-Pocs is running reliably, we will enter a phase where the system will be running for consecutive nights without supervision. At this point we want a convenient alert system that notifies us via Slack when something has gone wrong and human intervention is required. A barebones system would simply poll for a tell-tale process indicating whether POCS is running, and alert us via Slack if it isn't.

Some potential alerts that could be implemented are:

  • Implement a basic v0.1 alert system that continuously polls for a tell-tale process indicating that POCS is running (e.g. run ps -fA | grep python, check whether any python process is POCS-related, and raise an alert if none is found); a minimal sketch is given after this list
  • Implement a similar system for nifi, e.g. by checking whether the archive directory's total size has exceeded some predefined threshold
  • Implement an alert system for dome shutter status (this may be a little trickier)
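
As a rough starting point for the first item, something like the following could run as a cron job or a small service. This is only a sketch: the Slack webhook URL, the "pocs" substring used to identify the process, and the polling interval are all placeholders that would need to match the real deployment.

```python
#!/usr/bin/env python3
"""Minimal v0.1 alert sketch: poll for a POCS-related python process and ping
Slack via an incoming webhook if none is found."""
import subprocess
import time

import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
PROCESS_PATTERN = "pocs"   # assumed tell-tale substring in the command line
POLL_INTERVAL = 300        # seconds between checks


def pocs_process_running() -> bool:
    """Return True if any python process command line mentions the pattern."""
    out = subprocess.run(["ps", "-fA"], capture_output=True, text=True).stdout
    return any(
        "python" in line and PROCESS_PATTERN in line and "grep" not in line
        for line in out.splitlines()
    )


def send_slack_alert(message: str) -> None:
    """Post a simple text message to a Slack incoming webhook."""
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)


if __name__ == "__main__":
    while True:
        if not pocs_process_running():
            send_slack_alert(":warning: No POCS-related python process found!")
        time.sleep(POLL_INTERVAL)
```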
fergusL (Contributor, Author) commented Jul 1, 2022

Here is what is visible outside the docker container:

(huntsman-pocs) huntsman@huntsman-control:/var/huntsman/huntsman-config/conf_files/pocs$ ps -fA | grep python           
root         706       1  0 Jun16 ?        00:00:00 /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
root         897       1  0 Jun16 ?        00:00:00 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
huntsman  420266  255289  0 Jun22 pts/4    00:00:00 /usr/bin/python3 /usr/bin/grc tail -F logs/minifi-app.log
huntsman  420276  420266  0 Jun22 pts/4    00:00:47 /usr/bin/python3 /usr/bin/grcat conf.log
huntsman  855335 2016308  1 07:26 pts/1    00:07:41 /opt/conda/bin/python /opt/conda/bin/ipython
huntsman 1138806 1138783  0 11:56 pts/0    00:00:07 python /huntsman/scripts/read_alt_weather.py --storage-dir /huntsman/json_store --source skymapper_weather --read-delay 60
huntsman 1268712 2022666  0 13:59 pts/5    00:00:00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn --exclude-dir=.idea --exclude-dir=.tox python
huntsman 1527561     913  0 Jun23 ?        00:03:02 /usr/bin/python3 /usr/bin/update-manager --no-update --no-focus-on-map
huntsman 2015226  202409  0 Jun27 pts/2    00:15:58 /home/huntsman/miniforge3/envs/huntsman-pocs/bin/python /home/huntsman/miniforge3/envs/huntsman-pocs/bin/docker-compose up
huntsman 2015323 2015300  0 Jun27 pts/0    00:00:16 /sbin/docker-init -- /usr/bin/env bash -ic python /app/scripts/read-aag.py --config-file /data/config.yaml --storage-dir /json_store --store-result
huntsman 2015436 2015323  0 Jun27 pts/0    00:08:09 python /app/scripts/read-aag.py --config-file /data/config.yaml --storage-dir /json_store --store-result
huntsman 2015502 2015387  0 Jun27 pts/0    00:11:50 python /huntsman/scripts/archive-images.py
huntsman 2015574 2015441  0 Jun27 pts/0    00:22:32 /opt/conda/bin/python /opt/conda/bin/panoptes-config-server --verbose run --no-save-local --no-load-local --config-file /huntsman/conf_files/huntsman.yaml
huntsman 2015637 2015545  0 Jun27 pts/0    00:00:03 /opt/conda/bin/python3.9 /opt/conda/bin/huntsman-pyro nameserver --auto-clean 90
huntsman 2015887 2015574  0 Jun27 pts/0    00:44:30 /opt/conda/bin/python /opt/conda/bin/panoptes-config-server --verbose run --no-save-local --no-load-local --config-file /huntsman/conf_files/huntsman.yaml
huntsman 2016021 2015637  0 Jun27 pts/0    00:01:41 /opt/conda/bin/python3.9 /opt/conda/bin/huntsman-pyro nameserver --auto-clean 90
huntsman 3227738  205990  0 Jun29 pts/3    00:02:39 python run_dash.py
huntsman 3859486 3859464  0 Jun29 pts/0    00:00:52 python /huntsman/scripts/read_alt_weather.py --storage-dir /huntsman/json_store --source aat_weather --read-delay 60

And from within the pocs-control docker container:

(huntsman-pocs) huntsman@huntsman-control:/var/huntsman/huntsman-config/conf_files/pocs$ docker exec -it pocs-control /bin/bash

(base) pocs-user@pocs-control:/huntsman$ ps -fA | grep python
pocs-us+    9897      41  1 07:26 pts/1    00:07:41 /opt/conda/bin/python /opt/conda/bin/ipython
pocs-us+   10507   10488  0 14:00 pts/2    00:00:00 grep --color=auto python

So unfortunately, this doesn't allow us to determine if POCS is still running, as we can only tell that an ipython session is open (if POCS fails within the session, the session doesn't end). If we instead run POCS via a script in a regular python shell, then we can use an alert system that monitors the output of ps -fA | grep python | grep "specific process" to determine whether an alert needs to be raised. This would need to run within the pocs-control container. We could have another alert service set up outside of Docker that monitors whether the containers are running; a sketch of that host-side check is below.
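
A host-side watcher along these lines could cover the container check. Again just a sketch: the webhook URL is a placeholder, and apart from pocs-control the container names are guesses that would need to match the actual docker-compose service names.

```python
#!/usr/bin/env python3
"""Sketch of a host-side watcher that checks whether the Huntsman containers
are still running and pings Slack if any have stopped."""
import subprocess

import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
CONTAINERS = ["pocs-control", "pocs-config-server"]  # assumed container names


def container_running(name: str) -> bool:
    """Use `docker inspect` to read the container's running state."""
    result = subprocess.run(
        ["docker", "inspect", "-f", "{{.State.Running}}", name],
        capture_output=True, text=True,
    )
    return result.returncode == 0 and result.stdout.strip() == "true"


if __name__ == "__main__":
    stopped = [name for name in CONTAINERS if not container_running(name)]
    if stopped:
        requests.post(
            SLACK_WEBHOOK_URL,
            json={"text": f":warning: Containers not running: {', '.join(stopped)}"},
            timeout=10,
        )
```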

fergusL (Contributor, Author) commented Aug 10, 2022

A better option would be to just parse the logs for critical errors, or note if there hasn't been a log entry for a while, and ping Slack. A rough sketch of this is below.
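
Something like the following could do this. The log path, error keywords, staleness threshold, and webhook URL are all assumptions and would need to be pointed at the actual POCS log files.

```python
#!/usr/bin/env python3
"""Sketch of the log-based approach: alert if the POCS log contains critical
errors or hasn't been written to for a while."""
import time
from pathlib import Path

import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
LOG_FILE = Path("/var/huntsman/logs/huntsman.log")  # assumed log location
STALE_AFTER = 15 * 60            # seconds without a new log entry before alerting
ERROR_KEYWORDS = ("CRITICAL", "ERROR")


def check_log() -> list:
    """Return a list of alert messages (empty if everything looks healthy)."""
    alerts = []
    if not LOG_FILE.exists():
        return [f"Log file {LOG_FILE} not found"]
    # Stale log: no write for longer than the threshold.
    age = time.time() - LOG_FILE.stat().st_mtime
    if age > STALE_AFTER:
        alerts.append(f"No log entry for {age / 60:.0f} minutes")
    # Critical errors in the most recent lines.
    tail = LOG_FILE.read_text(errors="ignore").splitlines()[-200:]
    for line in tail:
        if any(keyword in line for keyword in ERROR_KEYWORDS):
            alerts.append(f"Error in log: {line.strip()}")
    return alerts


if __name__ == "__main__":
    messages = check_log()
    if messages:
        requests.post(SLACK_WEBHOOK_URL, json={"text": "\n".join(messages)}, timeout=10)
```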
