infra: add watchdog for batcher (#1311)
Showing 3 changed files with 82 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
PROMETHEUS_URL=<ip>:<port> | ||
SYSTEMD_SERVICE=batcher | ||
PROMETHEUS_COUNTER=sent_batches | ||
PROMETHEUS_BOT=batcher | ||
PROMETHEUS_INTERVAL=20m | ||
SLACK_WEBHOOK_URL=<> |
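The six variables above are the ones the watchdog script lists in its header comment. As a minimal sketch, assuming the env file has already been sourced, a check like the following could report which of them are unset or empty before the watchdog runs. The function name `missing_watchdog_vars` is hypothetical and not part of this commit.

```shell
# Hypothetical helper (not part of the commit): print the names of the
# watchdog variables that are unset or empty, separated by spaces.
missing_watchdog_vars() {
    missing=""
    for name in PROMETHEUS_URL SYSTEMD_SERVICE PROMETHEUS_COUNTER \
                PROMETHEUS_BOT PROMETHEUS_INTERVAL SLACK_WEBHOOK_URL; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    # Drop the leading space before printing.
    echo "${missing# }"
}

# Possible usage after sourcing the env file:
#   missing=$(missing_watchdog_vars)
#   [ -z "$missing" ] || { echo "missing variables: $missing" >&2; exit 1; }
```

An empty output means every variable has a value; it does not confirm that the values are valid.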
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
# Batcher Watchdog | ||
|
||
The Batcher Watchdog checks a Prometheus metric and restarts the Batcher as needed. | ||
|
||
The metric is the number of batches sent in the last N minutes, where N is set by the PROMETHEUS_INTERVAL variable. Let's call this metric `sent_batches`. | ||
|
||
Since we are sending proofs constantly, the ideal behaviour is the creation of a task every 3 Ethereum blocks (~36 seconds). So, if the `sent_batches` metric is 0, there is a problem in the Batcher: for example, a transaction is stuck in Ethereum and the Batcher is blocked waiting for it. If this happens, the Watchdog restarts the Batcher. | ||
|
||
## Configuration | ||
|
||
You need to create a `.env` file with the following variables: | ||
|
||
``` | ||
PROMETHEUS_URL=<ip>:<port> | ||
SYSTEMD_SERVICE=batcher | ||
PROMETHEUS_COUNTER=sent_batches | ||
PROMETHEUS_BOT=batcher | ||
PROMETHEUS_INTERVAL=20m | ||
SLACK_WEBHOOK_URL=<> | ||
``` | ||
|
||
There is a `.env.example` file in this directory. | ||
|
||
## Run with Crontab | ||
|
||
Open the Crontab configuration with `crontab -e` and add the following line: | ||
|
||
``` | ||
*/10 * * * * /path/to/watchdog/batcher_watchdog.sh /path/to/config/.env >> /path/to/logs/folder/batcher_watchdog.log 2>&1 | ||
``` | ||
|
||
The cron interval has to be half of PROMETHEUS_INTERVAL (PROMETHEUS_INTERVAL/2), for example `*/10` for `PROMETHEUS_INTERVAL=20m`. | ||
|
||
You can check logs in the specified file, for example: | ||
|
||
``` | ||
Tue Oct 15 08:00:01 UTC 2024: tasks created in the last 20m: "25" | ||
Tue Oct 15 08:20:01 UTC 2024: tasks created in the last 20m: "2" | ||
Tue Oct 15 08:40:01 UTC 2024: tasks created in the last 20m: "0" | ||
Tue Oct 15 08:40:01 UTC 2024: restarting batcher | ||
Tue Oct 15 08:40:01 UTC 2024: batcher restarted by watchdog | ||
``` |
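The README's rule that the cron interval is half of PROMETHEUS_INTERVAL can be sketched as a small helper. This is only an illustration, assuming the interval is written in whole minutes with an `m` suffix as in the example configuration. The function name `cron_schedule_for` is hypothetical.

```shell
# Hypothetical helper (not part of the commit): print a cron schedule that
# fires at half of a Prometheus interval given in minutes, such as "20m".
cron_schedule_for() {
    # Only the "<number>m" form is handled in this sketch.
    case "$1" in
        *m) ;;
        *) echo "unsupported interval: $1" >&2; return 1 ;;
    esac
    minutes=${1%m}   # strip the trailing "m"
    # Reject empty values, non-digits and leading zeros.
    case "$minutes" in
        ''|*[!0-9]*|0*) echo "unsupported interval: $1" >&2; return 1 ;;
    esac
    half=$((minutes / 2))   # integer division, so odd values round down
    # A minute step must stay within 1..59.
    if [ "$half" -lt 1 ] || [ "$half" -gt 59 ]; then
        echo "interval out of range: $1" >&2
        return 1
    fi
    echo "*/$half * * * *"
}

# Possible usage:
#   cron_schedule_for "$PROMETHEUS_INTERVAL"
```

With `PROMETHEUS_INTERVAL=20m` this gives the `*/10 * * * *` schedule used in the crontab line above. Steps that do not divide 60 evenly leave a shorter gap at the top of each hour, so intervals such as 20m or 30m are the simplest choices.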
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
#!/bin/bash | ||
|
||
# Load env file from first parameter | ||
# Env variables: | ||
# - PROMETHEUS_URL | ||
# - SYSTEMD_SERVICE | ||
# - PROMETHEUS_COUNTER | ||
# - PROMETHEUS_BOT | ||
# - PROMETHEUS_INTERVAL | ||
# - SLACK_WEBHOOK_URL | ||
source "$1" | ||
|
||
# Function to send slack message | ||
# @param message | ||
function send_slack_message() { | ||
curl -X POST -H 'Content-type: application/json' \ | ||
--data "{\"text\":\"$1\"}" \ | ||
"$SLACK_WEBHOOK_URL" | ||
} | ||
|
||
# Get rate from prometheus | ||
rate=$(curl -gs "http://${PROMETHEUS_URL}/api/v1/query?query=floor(increase(${PROMETHEUS_COUNTER}{bot=\"${PROMETHEUS_BOT}\"}[${PROMETHEUS_INTERVAL}]))" | jq '.data.result[0].value[1]') | ||
|
||
echo "$(date): tasks created in the last $PROMETHEUS_INTERVAL: $rate" | ||
|
||
# Check if rate is 0 (jq keeps the JSON quotes, so the value is "0") | ||
if [ "$rate" = \"0\" ]; then | ||
# Restart systemd service | ||
echo "$(date): restarting $SYSTEMD_SERVICE" | ||
sudo systemctl restart "$SYSTEMD_SERVICE" | ||
message="$(date): $SYSTEMD_SERVICE restarted by watchdog" | ||
echo "$message" | ||
send_slack_message "$message" | ||
fi |
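The script above restarts the service only when the jq output is exactly `"0"`. When the query returns no series, the same jq filter yields `null`, and when curl fails the value is empty; in both cases the script does nothing. A minimal sketch of how the three cases could be told apart, for instance to alert when the value is unknown, might look like this. The function name `classify_rate` and the labels it prints are hypothetical.

```shell
# Hypothetical helper (not part of the commit): classify the value that the
# script stores in $rate. The jq filter is used without -r, so a string
# result keeps its JSON quotes, for example "0" or "25".
classify_rate() {
    case "$1" in
        '"0"')   echo restart ;;   # no batches sent in the interval
        ''|null) echo unknown ;;   # no response, or no matching series
        *)       echo healthy ;;   # any other value
    esac
}

# Possible usage inside the watchdog:
#   case "$(classify_rate "$rate")" in
#       restart) sudo systemctl restart "$SYSTEMD_SERVICE" ;;
#       unknown) send_slack_message "watchdog could not read the metric" ;;
#   esac
```

Whether an unknown value should trigger an alert, a restart, or nothing is a policy choice that the commit does not state.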