"Ternary check mode" for more lightweight checks #1

Open
mholt opened this issue May 31, 2016 · 3 comments

Comments

@mholt
Collaborator

mholt commented May 31, 2016

Just jotting my thoughts down into an issue for discussion...

If you do 1 check every 10 minutes and your status page shows the last 24 hours of checks, the browser downloads 144 check files to render the status page. This isn't too bad, but if you distribute your checks across multiple instances, you multiply the number of check files by the number of instances you distribute your checks to. And if you want finer granularity in the reporting, you have to produce check files more frequently.
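To make the arithmetic concrete, here is a small Python sketch of the file count. The function name and parameters are made up for illustration and are not part of checkup.

```python
def files_per_page(interval_minutes: int, window_hours: int, instances: int = 1) -> int:
    """Number of check files the status page fetches for one time window."""
    checks_per_instance = (window_hours * 60) // interval_minutes
    return checks_per_instance * instances


print(files_per_page(10, 24))     # one instance, a check every 10 minutes -> 144
print(files_per_page(10, 24, 3))  # same schedule spread over three instances
print(files_per_page(1, 24, 3))   # three instances, a check every minute
```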

One way to alleviate this volume is to introduce an alternate mode of producing checks: a "ternary" or "discrete" mode (for lack of better words) that only reports healthy, degraded, or down. A healthy status is assumed unless a file exists to report degraded or down. Assuming an endpoint is usually healthy, this would drastically reduce the number of check files produced. Checks could be run every minute on multiple instances, if desired, and if the endpoint is reliably up, no check files would have to be downloaded.
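A rough sketch of how a status page might rebuild the timeline in such a mode, assuming it knows the expected check times and has fetched only the "exception" files. The names and the data layout here are hypothetical, not checkup's actual format.

```python
from typing import Dict, List

HEALTHY, DEGRADED, DOWN = "healthy", "degraded", "down"


def reconstruct(expected_times: List[int], exceptions: Dict[int, str]) -> List[str]:
    """Fill in 'healthy' for every expected check that has no exception file."""
    return [exceptions.get(t, HEALTHY) for t in expected_times]


# Ten checks, one per minute; only three of them produced a file.
times = list(range(0, 600, 60))
exceptions = {180: DOWN, 240: DOWN, 300: DEGRADED}
timeline = reconstruct(times, exceptions)
print(timeline)
```

One caveat with this sketch: a checker that did not run at all looks the same as a healthy endpoint, since both produce no file.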

You do lose the RTT (response time) value, so the graphs will report "Up", "Degraded", or "Down" instead of a number. But if the service is only down for 5 minutes, you'd only have to download ~5 check files, so the status pages load much faster and you use less storage.

@sqs also had the terrible, wonderful, no good, really great idea of encoding the results of the checks directly into the filenames on S3. 😄 That would allow us to download most of the results in just one or a few requests for file listings...
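As a sketch of what that could look like: the filename layout below (`<unix timestamp>-<instance>-<status>.json`) is purely an assumption for illustration; nothing in this thread fixes a format. With something like it, a single file listing would be enough to recover every result.

```python
from typing import NamedTuple


class Result(NamedTuple):
    timestamp: int
    instance: str
    status: str


def encode(r: Result) -> str:
    """Pack a result into a filename (hypothetical layout)."""
    return f"{r.timestamp}-{r.instance}-{r.status}.json"


def decode(name: str) -> Result:
    """Recover a result from a filename produced by encode()."""
    stem = name.rsplit(".", 1)[0]
    ts, rest = stem.split("-", 1)           # timestamp is everything before the first hyphen
    instance, status = rest.rsplit("-", 1)  # status is everything after the last hyphen
    return Result(int(ts), instance, status)


name = encode(Result(1464700000, "us-east-1", "down"))
print(name)
print(decode(name))
```

Splitting on the first and last hyphen lets the instance name itself contain hyphens, as long as the status does not.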

Anyway, it's too early to tell how people will be using this and whether this mode will be in demand. This change would definitely be a paradigm shift, so lots of code changes would be required, I think, unless there's a clever way for the checkup workers and the status page to mutually agree on what the mode is from the results of the checks. (I'd rather make the mode implicit than require explicit configuration. Going for the "just works" ideal.)

@sqs
Member

sqs commented May 31, 2016

The discrete mode is great if all you care about is uptime/downtime (as you say). But you lose the RTT (as you say).

Conversely, if all you care about is response time and not uptime/downtime, then you could selectively fetch only every other filename that you receive in the list operation, or every Nth file, or some other similar scheme to ensure you are pulling from all of the regions equally.
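A minimal sketch of the "every Nth file" idea, grouping by region first so each region is sampled equally. The `region/NNNN.json` names and the `region_of` helper are assumptions for the example only.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def sample_every_nth(names: Iterable[str], n: int,
                     region_of: Callable[[str], str]) -> List[str]:
    """Keep every nth filename within each region, in sorted order."""
    by_region: Dict[str, List[str]] = defaultdict(list)
    for name in sorted(names):
        by_region[region_of(name)].append(name)
    picked: List[str] = []
    for region in sorted(by_region):
        picked.extend(by_region[region][::n])
    return picked


# Pretend listing: six check files from each of two regions.
listing = [f"{region}/{t:04d}.json" for region in ("eu", "us") for t in range(6)]
picked = sample_every_nth(listing, 3, lambda name: name.split("/")[0])
print(picked)
```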

There seem to be quite attractive tradeoffs you can make if you only care about one of (uptime, RTT). Just noting that nice property.

Another way to think about it is to combine these approaches: write a degraded/down file each time a check fails (but nothing extra when it passes), and write a file with the RTT for every check. Then the number of files to fetch would be proportional to the number of down checks plus whatever granularity of RTT data you want.

@jsjohnst

jsjohnst commented Aug 8, 2016

Another approach is to write a single file per instance per day. When a check runs, you pull down the latest version, update it with the new value, then re-upload the file, overwriting the old one. If you are worried about race conditions (not an issue if you are using the "built in" cron behavior), you could write a new file each time and only pull down the latest file on the client side. This would reduce the number of files pulled down to N (with N being the number of instances running the checker), which would be an immense reduction.
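Roughly, the read-modify-write cycle could look like the sketch below. A plain dict stands in for the storage backend, and the key layout and JSON fields are invented for the example; a real version would download and upload the object instead.

```python
import json
from typing import Dict


def record_check(store: Dict[str, str], instance: str, day: str,
                 timestamp: int, rtt_ms: float) -> str:
    """Append one result to the instance's file for that day and write it back."""
    key = f"{instance}/{day}.json"
    results = json.loads(store.get(key, "[]"))  # "pull down" the latest version
    results.append({"t": timestamp, "rtt_ms": rtt_ms})
    store[key] = json.dumps(results)            # "re-upload", overwriting the old file
    return key


store: Dict[str, str] = {}
record_check(store, "a", "2016-08-08", 1470614400, 120.0)
record_check(store, "a", "2016-08-08", 1470615000, 135.5)
record_check(store, "b", "2016-08-08", 1470614400, 98.2)
print(sorted(store))  # one file per instance for the day
```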

@cuu508

cuu508 commented Aug 9, 2016

Per-day files would work great if the timeframe is a day (or a small number of days). But let's say you want a timeframe of the last year: there would be a lot of heavy lifting.

How about a command, say, checkup collate, which would prepare hourly summaries for the past day, daily summaries for the past month, and monthly summaries for the past year? If the summaries exist, the status page can use them; if they don't, it can still load individual check files.
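A sketch of what the bucketing inside such a collate step might do, here for hourly summaries. The input tuples and the summary fields are assumptions for illustration; no such command exists yet.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def collate(checks: List[Tuple[int, float, bool]],
            bucket_seconds: int = 3600) -> Dict[int, dict]:
    """Group (timestamp, rtt_ms, healthy) checks into fixed-size time buckets."""
    buckets = defaultdict(list)
    for ts, rtt_ms, healthy in checks:
        buckets[ts - ts % bucket_seconds].append((rtt_ms, healthy))
    return {
        start: {
            "checks": len(items),
            "down": sum(1 for _, ok in items if not ok),
            "mean_rtt_ms": mean(r for r, _ in items),
        }
        for start, items in sorted(buckets.items())
    }


checks = [(0, 100.0, True), (600, 200.0, True), (3600, 300.0, False)]
summary = collate(checks)
print(summary)
```

Daily or monthly summaries would follow the same pattern with a different bucket size (months need calendar-aware bucketing rather than a fixed number of seconds).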
