This repository has been archived by the owner on Feb 14, 2023. It is now read-only.

Add utility script to monitor the public dashboard and alert someone if the data seems frozen #77

Open
wants to merge 1 commit into
base: master

@npny npny commented Dec 17, 2020

So, given this is a temporary plug for an unidentified problem, there are two ideas here:

  • Test as far down the chain as possible, in this case, the XHR made by the public dashboard webpage itself
  • Make this script as direct and standalone as possible, hence the inlined/commented-out stuff

As I don't know the details of your environment there are still some finishing touches needed on your part. I have tested each separate feature myself though. Running it is simply node scripts/alertStaleDashboard.mjs.

This script periodically requests https://defi.delphidigital.io/chaosnet/thorchain/lastblock, which should change every few seconds. If the response stays identical for five minutes straight, we send an alert. Should the endpoint later resume normal operation, we send another notification. For completeness, if the script fails to reach the endpoint at all (for any reason, not necessarily a problem with the endpoint itself), we also send an alert, since that would still require investigation.
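To make the alerting rule above concrete, here is a minimal sketch of the staleness logic, not the actual contents of alertStaleDashboard.mjs. All names (`createStalenessTracker`, `observe`, `pollOnce`, `notify`) are hypothetical, and `pollOnce` assumes a runtime with a global `fetch` (Node 18+, or a polyfill such as node-fetch).

```javascript
// Sketch only: hypothetical names, not the code from this PR.
const STALE_AFTER_MS = 5 * 60 * 1000; // five minutes of identical responses

function createStalenessTracker(staleAfterMs = STALE_AFTER_MS) {
  let lastBody = null;       // last response body seen
  let unchangedSince = null; // time at which lastBody was first seen
  let alerted = false;       // true while a "stale" alert is active

  // Feed one poll result. Returns 'stale', 'recovered', 'unreachable', or null.
  return function observe(body, now, fetchFailed = false) {
    // Reported on every failed poll; a real script would likely rate-limit this.
    if (fetchFailed) return 'unreachable';

    if (body !== lastBody) {
      // Data changed: restart the clock, and report recovery if we had alerted.
      lastBody = body;
      unchangedSince = now;
      if (alerted) {
        alerted = false;
        return 'recovered';
      }
      return null;
    }

    // Data unchanged: alert once when the threshold is crossed.
    if (!alerted && now - unchangedSince >= staleAfterMs) {
      alerted = true;
      return 'stale';
    }
    return null;
  };
}

// One polling step. Assumes a global fetch; notify is any alert callback.
async function pollOnce(url, observe, notify) {
  let event;
  try {
    const res = await fetch(url);
    event = observe(await res.text(), Date.now());
  } catch (err) {
    event = observe(null, Date.now(), true);
  }
  if (event) notify(`${url}: ${event}`);
}
```

Keeping the decision logic in a pure function that takes the timestamp as an argument makes it possible to test the five-minute rule without waiting five minutes or touching the network.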

The reason I'm running three checks (thorchain/lastblock, v1/network, and int/extra) is that the ultimate data source for each is different: thorchain node, midgard api, and cache server respectively. This should increase the chances of detecting the problem and help pinpoint where it happens. For convenience, I have also left some extra lines commented out: one set runs the same checks against the data sources directly, and another runs them against your yarn develop server for debugging.
Those two other checks are also constantly changing every few seconds, whether on "Pool Overview" or "Network & Nodes" pages, making this selection a decent proxy for overall dashboard data responsiveness.
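As an illustration of how the three checks can map back to their data sources, here is a rough sketch. It assumes that v1/network and int/extra sit under the same URL prefix as the lastblock endpoint quoted above, which the description does not confirm; `CHECKS`, `checkUrl`, and `suspectSources` are hypothetical names.

```javascript
// Assumption: all three checks share the prefix of the lastblock URL.
const BASE_URL = 'https://defi.delphidigital.io/chaosnet';

// Each check is paired with the data source named in the description.
const CHECKS = [
  { path: 'thorchain/lastblock', source: 'thorchain node' },
  { path: 'v1/network', source: 'midgard api' },
  { path: 'int/extra', source: 'cache server' },
];

// Build the full URL for a check; pass another baseUrl to target a dev server.
function checkUrl(check, baseUrl = BASE_URL) {
  return `${baseUrl}/${check.path}`;
}

// Given the paths currently flagged as stale, list the sources to investigate.
function suspectSources(stalePaths) {
  return CHECKS
    .filter((check) => stalePaths.includes(check.path))
    .map((check) => check.source);
}
```

If only one path goes stale, its source is the first suspect; if all three freeze at once, the problem is more likely somewhere they share, such as the server answering the dashboard's requests.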

There are three ways you can choose to receive this alert:

  • With a local system tray notification popup, if you are running this script on your own machine (this is the most direct and standalone way)
  • Over Telegram, if the script is running on a host somewhere (less configuration to do than for email)
  • Over email (you will need to ensure the MTA/SMTP configuration yourself as I don't know the details of where this will be run)

I've provided all three and commented out the import / yarn add statements to use for the one you wish to enable. Feel free to choose one and edit the other ones out.
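For reference, the Telegram option can be done without any extra package, because the Bot API exposes a plain HTTPS `sendMessage` method. The sketch below is not necessarily how the script in this PR does it; the function names are hypothetical, the bot token and chat ID are placeholders you would supply, and the default `fetch` assumes Node 18+ (pass your own implementation otherwise).

```javascript
// Build the HTTP request for the Telegram Bot API sendMessage method.
// botToken and chatId are placeholders supplied by whoever runs the script.
function buildTelegramRequest(botToken, chatId, text) {
  return {
    url: `https://api.telegram.org/bot${botToken}/sendMessage`,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ chat_id: chatId, text }),
    },
  };
}

// Send the alert; resolves to true when Telegram answers with a 2xx status.
// fetchImpl defaults to the global fetch and can be swapped out for testing.
async function sendTelegramAlert(botToken, chatId, text, fetchImpl = fetch) {
  const { url, options } = buildTelegramRequest(botToken, chatId, text);
  const res = await fetchImpl(url, options);
  return res.ok;
}
```

Separating request construction from sending keeps the network call trivial and lets the payload be checked offline.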

Let me know if there's anything else to add in!

npny commented Dec 17, 2020

For what it's worth, regarding the dashboard problem itself, I think I may have reproduced the issue locally: after running for a while, my saveApiResponses.mjs script crashed and the dashboard stopped updating. Its dying breath looked like this:

[chaosnet]: starting data fetch...
[testnet]: starting data fetch...
[chaosnet]: ended data fetch in 8.504 seconds...
[chaosnet]: starting data fetch...
node:events:353
      throw er; // Unhandled 'error' event
      ^

ReplyError: Ready check failed: ERR max number of clients reached
    at parseError (/home/user/delphi-thorchain/node_modules/redis-parser/lib/parser.js:179:12)
    at parseType (/home/user/delphi-thorchain/node_modules/redis-parser/lib/parser.js:302:14)
Emitted 'error' event on RedisClient instance at:
    at RedisClient.on_info_cmd (/home/user/delphi-thorchain/node_modules/redis/index.js:431:14)
    at /home/user/delphi-thorchain/node_modules/redis/index.js:470:14
    at Object.callbackOrEmit [as callback_or_emit] (/home/user/delphi-thorchain/node_modules/redis/lib/utils.js:89:9)
    at /home/user/delphi-thorchain/node_modules/redis/lib/individualCommands.js:157:15
    at Object.callbackOrEmit [as callback_or_emit] (/home/user/delphi-thorchain/node_modules/redis/lib/utils.js:89:9)
    at RedisClient.return_error (/home/user/delphi-thorchain/node_modules/redis/index.js:641:11)
    at JavascriptRedisParser.returnError (/home/user/delphi-thorchain/node_modules/redis/index.js:141:18)
    at JavascriptRedisParser.execute (/home/user/delphi-thorchain/node_modules/redis-parser/lib/parser.js:542:14)
    at Socket.<anonymous> (/home/user/delphi-thorchain/node_modules/redis/index.js:218:27)
    at Socket.emit (node:events:376:20) {
  command: 'INFO',
  code: 'ERR'
}
error Command failed with exit code 1.

This looks like too many concurrent connections to the redis server. It could be unlucky request timing causing a sudden glut of connections all at once, but more likely previous connections are not being closed properly (or fast enough) for some reason and are using up a limited pool.
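If the leaked-connection hypothesis holds (it is unverified), one common mitigation is to create a single client and reuse it across fetch cycles, and to attach an 'error' listener so that a failed ready check is logged instead of crashing the process as an unhandled 'error' event. The sketch below is hypothetical and does not reflect how saveApiResponses.mjs is written; `getSharedClient` is an invented name, and `createClient` stands for whatever factory the project uses (for example `redis.createClient`).

```javascript
// Sketch only: one lazily created client shared by every caller.
let sharedClient = null;

function getSharedClient(createClient) {
  if (sharedClient === null) {
    sharedClient = createClient();
    // Without an 'error' listener, an emitted 'error' event is thrown,
    // which is what the stack trace above shows.
    sharedClient.on('error', (err) => {
      console.error('redis client error:', err.message);
    });
  }
  return sharedClient;
}
```

With this shape, each data fetch calls `getSharedClient(...)` rather than opening its own connection, so the number of connections stays at one no matter how many fetch cycles run.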

I spent a little bit of time looking into it, but given this was out of scope and that I had no easy way to test hypotheses on the live node, I dropped it for now. Just thought I'd write it up here though.
