infra: add watchdog for batcher (#1311)
JuArce authored Nov 4, 2024
1 parent 7aa630c commit 9100ad0
Showing 3 changed files with 82 additions and 0 deletions.
6 changes: 6 additions & 0 deletions infra/watchdog/batcher/.env.example
@@ -0,0 +1,6 @@
PROMETHEUS_URL=<ip>:<port>
SYSTEMD_SERVICE=batcher
PROMETHEUS_COUNTER=sent_batches
PROMETHEUS_BOT=batcher
PROMETHEUS_INTERVAL=20m
SLACK_WEBHOOK_URL=<>
42 changes: 42 additions & 0 deletions infra/watchdog/batcher/README.md
@@ -0,0 +1,42 @@
# Batcher Watchdog

The Batcher Watchdog checks a Prometheus metric and restarts the Batcher as needed.

The metric is the number of batches sent in the last N minutes, where N is set by the `PROMETHEUS_INTERVAL` variable. Let's call this metric `sent_batches`.

Since we are sending proofs constantly, the ideal behaviour is the creation of a task every 3 Ethereum blocks (~36 secs). So, if the `sent_batches` metric is 0, there is a problem in the Batcher; for example, a transaction is stuck in Ethereum and the Batcher is locked waiting for it. If this happens, the Watchdog restarts the Batcher.
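
The restart rule above can be sketched as a small shell function. This is an illustration only: `should_restart` is a hypothetical name that does not appear in this commit, and it assumes the value is passed exactly as `jq` prints it without `-r` (a JSON string, quotes included). Under that assumption, a missing series shows up as `null` and does not trigger a restart.

```shell
# Hypothetical helper, not part of this commit: decide whether to restart,
# given the value exactly as jq prints it (a JSON string, quotes included).
should_restart() {
    case "$1" in
        '"0"') return 0 ;;  # no batches sent in the window: restart
        *)     return 1 ;;  # batches were sent, or no data (null): do nothing
    esac
}

# Example decisions for three possible query results.
for value in '"0"' '"25"' 'null'; do
    if should_restart "$value"; then
        echo "$value: restart"
    else
        echo "$value: ok"
    fi
done
```

Treating "no data" as "do nothing" is a design choice in this sketch; whether a missing metric should also trigger a restart or an alert depends on the deployment.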

## Configuration

You need to create a `.env` file with the following variables:

```
PROMETHEUS_URL=<ip>:<port>
SYSTEMD_SERVICE=batcher
PROMETHEUS_COUNTER=sent_batches
PROMETHEUS_BOT=batcher
PROMETHEUS_INTERVAL=20m
SLACK_WEBHOOK_URL=<>
```

There is a `.env.example` file in this directory.
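
Because the script sources the `.env` file without checking it, a pre-flight check can catch a missing variable before cron runs the watchdog. The following is a minimal sketch under that assumption; `missing_vars` is a hypothetical helper and is not part of this commit.

```shell
# Hypothetical pre-flight check, not part of this commit.
# Prints the name of every required variable that is unset or empty.
missing_vars() {
    for name in PROMETHEUS_URL SYSTEMD_SERVICE PROMETHEUS_COUNTER \
                PROMETHEUS_BOT PROMETHEUS_INTERVAL SLACK_WEBHOOK_URL; do
        # Indirect lookup of the variable called $name.
        eval "value=\${$name:-}"
        [ -n "$value" ] || printf '%s\n' "$name"
    done
}

# Usage: source the .env file first (for example `. /path/to/config/.env`),
# then report anything that is still missing.
missing=$(missing_vars)
if [ -n "$missing" ]; then
    echo "missing variables: $(echo $missing)" >&2
fi
```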

## Run with Crontab

Open the Crontab configuration with `crontab -e` and add the following line:

```
*/10 * * * * /path/to/watchdog/batcher_watchdog.sh /path/to/config/.env >> /path/to/logs/folder/batcher_watchdog.log 2>&1
```

The cron interval should be half of `PROMETHEUS_INTERVAL` (`PROMETHEUS_INTERVAL`/2). For example, with `PROMETHEUS_INTERVAL=20m` the job runs every 10 minutes (`*/10`).
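
The halving rule can be worked out mechanically. The sketch below assumes `PROMETHEUS_INTERVAL` is given in whole minutes (such as `20m`) and rejects anything else; `cron_step_minutes` is a hypothetical helper, not part of this commit, and odd intervals round down.

```shell
# Hypothetical helper, not part of this commit: derive the cron minute step
# from a PROMETHEUS_INTERVAL expressed in whole minutes (e.g. "20m").
cron_step_minutes() {
    case "$1" in
        *m) minutes=${1%m} ;;   # strip the trailing "m"
        *)  minutes= ;;         # other units are not handled in this sketch
    esac
    case "$minutes" in
        ''|*[!0-9]*) echo "unsupported interval: $1" >&2; return 1 ;;
    esac
    step=$((minutes / 2))
    if [ "$step" -lt 1 ]; then
        echo "interval too short: $1" >&2
        return 1
    fi
    echo "$step"
}

# Build the minute field of the crontab line for a 20m interval.
echo "*/$(cron_step_minutes 20m) * * * *"
```

The `*/N` form in the minute field only gives evenly spaced runs when N divides 60, so intervals such as 20m or 30m fit best.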

You can check logs in the specified file, for example:

```
Tue Oct 15 08:00:01 UTC 2024: tasks created in the last 20m: "25"
Tue Oct 15 08:20:01 UTC 2024: tasks created in the last 20m: "2"
Tue Oct 15 08:40:01 UTC 2024: tasks created in the last 20m: "0"
Tue Oct 15 08:40:01 UTC 2024: restarting batcher
Tue Oct 15 08:40:01 UTC 2024: batcher restarted by watchdog
```
34 changes: 34 additions & 0 deletions infra/watchdog/batcher/batcher_watchdog.sh
@@ -0,0 +1,34 @@
#!/bin/bash

# Load env file from first parameter
# Env variables:
# - PROMETHEUS_URL
# - SYSTEMD_SERVICE
# - PROMETHEUS_COUNTER
# - PROMETHEUS_BOT
# - PROMETHEUS_INTERVAL
# - SLACK_WEBHOOK_URL
source "$1"

# Function to send slack message
# @param message
function send_slack_message() {
    curl -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\"$1\"}" \
        "$SLACK_WEBHOOK_URL"
}

# Get rate from prometheus
rate=$(curl -gs 'http://'$PROMETHEUS_URL'/api/v1/query?query=floor(increase('$PROMETHEUS_COUNTER'{bot="'$PROMETHEUS_BOT'"}['$PROMETHEUS_INTERVAL']))' | jq '.data.result[0].value[1]')

echo "$(date): tasks created in the last $PROMETHEUS_INTERVAL: $rate"

# Check if rate is 0 (jq prints the value as a JSON string, hence the escaped quotes)
if [ "$rate" = \"0\" ]; then
    # Restart systemd service
    echo "$(date): restarting $SYSTEMD_SERVICE"
    sudo systemctl restart "$SYSTEMD_SERVICE"
    message="$(date): $SYSTEMD_SERVICE restarted by watchdog"
    echo "$message"
    send_slack_message "$message"
fi
