"Ternary check mode" for more lightweight checks #1

mholt · 2016-05-31T20:28:56Z

Just jotting my thoughts down into an issue for discussion...

If you do 1 check every 10 minutes and your status page shows the last 24 hours of checks, the browser downloads 144 check files to render the status page. This isn't too bad, but if you distribute your checks across multiple instances, you multiply the number of check files by the number of instances you distribute your checks to. And if you want finer granularity in the reporting, you have to produce check files more frequently.
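To make the arithmetic above concrete, here is a minimal sketch of the file count. The function name and parameters are made up for illustration and are not part of checkup.

```python
def files_to_download(interval_minutes, window_hours, instances=1):
    """Number of check files a status page would fetch for the window.

    Assumes one check file per check per instance, as described above.
    """
    checks_per_instance = (window_hours * 60) // interval_minutes
    return checks_per_instance * instances


print(files_to_download(10, 24))               # one instance, every 10 minutes
print(files_to_download(10, 24, instances=3))  # same schedule on three instances
print(files_to_download(1, 24, instances=3))   # every minute on three instances
```

The count grows linearly in both the check frequency and the number of instances, which is the volume problem the rest of this issue tries to address.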

One way to alleviate this volume is to introduce an alternate mode of producing checks: a "ternary" or "discrete" mode (for lack of better words) that only reports healthy, degraded, or down. A healthy status is assumed unless a file exists to report degraded or down. Assuming an endpoint is usually healthy, this would drastically reduce the number of check files produced. Checks could be run every minute on multiple instances, if desired, and if the endpoint is reliably up, no check files would have to be downloaded.
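A rough sketch of how the ternary mode could behave, using a plain dict as a stand-in for the storage bucket. The `Status` enum, the key format, and both function names are assumptions for illustration, not checkup's actual API.

```python
from enum import Enum


class Status(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    DOWN = "down"


def record(store, timestamp, status):
    """Write a check file only when the endpoint is not healthy."""
    if status is not Status.HEALTHY:
        # Hypothetical key format; the real naming scheme is undecided.
        store["%d-check.json" % timestamp] = status.value


def status_at(store, timestamp):
    """A missing file is read as healthy."""
    raw = store.get("%d-check.json" % timestamp)
    return Status(raw) if raw is not None else Status.HEALTHY


store = {}
record(store, 1, Status.HEALTHY)   # writes nothing
record(store, 2, Status.DOWN)      # writes one file
print(len(store), status_at(store, 1), status_at(store, 2))
```

One consequence of this design is that a missing file cannot distinguish "healthy" from "the checker did not run", so the reader has to trust that the workers are alive.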

You do lose the RTT (response time) value, so the graphs will report "Up", "Degraded", or "Down" instead of a number. But if the service is only down for 5 minutes, you'd only have to download ~5 check files, so the status pages load much faster and you use less storage.

@sqs also had the terrible, wonderful, no good, really great idea of encoding the results of the checks directly into the filenames on S3. 😄 That would allow us to download most of the results in just one or a few requests for file listings...
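As a sketch of the filename idea, assuming a made-up `<timestamp>-<status>-<rtt_ms>.json` naming scheme (nothing in this thread settles on a format), the status page could decode results straight from a listing without fetching any file bodies:

```python
def encode_name(timestamp, status, rtt_ms):
    """Pack a check result into an object name (hypothetical format)."""
    return "%d-%s-%d.json" % (timestamp, status, rtt_ms)


def decode_name(name):
    """Recover (timestamp, status, rtt_ms) from a listed object name."""
    if not name.endswith(".json"):
        raise ValueError("unexpected object name: %r" % name)
    stem = name[: -len(".json")]
    timestamp, status, rtt_ms = stem.split("-")
    return int(timestamp), status, int(rtt_ms)


listing = [encode_name(1464726536, "healthy", 87),
           encode_name(1464727136, "down", 0)]
for name in listing:
    print(decode_name(name))
```

This only works if the status strings never contain the separator, and a listing call typically returns a bounded number of names per request, hence "one or a few requests".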

Anyway, it's too early to tell yet how people will be using this and whether this mode will be in demand. This change would definitely be a paradigm shift, so lots of code changes would be required, I think, unless there's a clever way for the checkup workers and the status page to mutually agree on what the mode is from the results of the checks. (Would rather make the mode implicit than require explicit configuration. Going for the "just works" ideal.)

sqs · 2016-05-31T21:57:30Z

The discrete mode is great if all you care about is uptime/downtime (as you say). But you lose the RTT (as you say).

Conversely, if all you care about is response time and not uptime/downtime, then you could selectively fetch only every other filename that you receive in the list operation, or every Nth file, or some other similar scheme to ensure you are pulling from all of the regions equally.
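A minimal sketch of the every-Nth-file idea, assuming (hypothetically) that each object name starts with a region prefix such as `us/` or `eu/`. Sampling within each region keeps the regions equally represented:

```python
from collections import defaultdict


def sample_by_region(names, n):
    """Keep every n-th name within each region so no region is skipped.

    Names are sorted lexically, which matches time order only if the
    timestamps in the names have a fixed width.
    """
    groups = defaultdict(list)
    for name in sorted(names):
        region = name.split("/", 1)[0]  # hypothetical "<region>/<file>" layout
        groups[region].append(name)
    picked = []
    for region in sorted(groups):
        picked.extend(groups[region][::n])
    return picked


names = ["us/1.json", "us/2.json", "us/3.json", "us/4.json",
         "eu/1.json", "eu/2.json", "eu/3.json", "eu/4.json"]
print(sample_by_region(names, 2))
```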

There seem to be quite attractive tradeoffs you can make if you only care about one of (uptime, RTT). Just noting that nice property.

Another way to think about it is to combine these approaches...write a degraded/down file each time the check is down (but not up), and write a file for each check with the RTT. Then the number of files to fetch would be proportional to the number of down checks and the granularity of RTT data you want.

jsjohnst · 2016-08-08T17:03:49Z

Another approach is writing a single file per instance per day. When a check runs, you pull down the latest version, update it with the new value, then re-upload the file, overwriting the old one. If you are worried about race conditions (not an issue if you are using the "built in" cron behavior), you could write a new file each time and only pull down the latest file on the client side. This would reduce the number of files pulled down to N (with N being the number of instances running the checker), which would be an immense reduction.
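Roughly, the read-modify-write cycle could look like the sketch below. A dict stands in for the bucket, and the key layout and field names (`t`, `rtt_ms`, `healthy`) are assumptions for illustration only.

```python
import json
from datetime import datetime, timezone


def daily_key(instance, timestamp):
    """One object per instance per UTC day (hypothetical key layout)."""
    day = datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m-%d")
    return "%s/%s.json" % (instance, day)


def append_result(store, instance, timestamp, rtt_ms, healthy):
    """Pull down the day's file, add the new result, and overwrite it."""
    key = daily_key(instance, timestamp)
    results = json.loads(store.get(key, "[]"))
    results.append({"t": timestamp, "rtt_ms": rtt_ms, "healthy": healthy})
    store[key] = json.dumps(results)
    return key


store = {}
append_result(store, "a", 0, 12.5, True)
append_result(store, "a", 600, 14.0, True)
append_result(store, "a", 86400, 13.1, False)  # next UTC day, new file
print(sorted(store))
```

This version is not safe if two writers update the same key concurrently, which is the race condition mentioned above.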

cuu508 · 2016-08-09T16:53:20Z

Per-day files would work great if the timeframe is a day (or a small number of days). But let's say you want a timeframe of the last year: that is about 365 files per instance, so there would still be lots of heavy lifting.

How about a command, say, `checkup collate`, which would prepare hourly summaries for the past day, daily summaries for the past month, and monthly summaries for the past year? If the summaries exist, the status page can use them; if they don't exist, it can still load individual check files.
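To illustrate what such a collate step might compute, here is a sketch that rolls individual checks up into fixed-width buckets. The input shape (dicts with `t`, `rtt_ms`, `healthy`) and the summary fields are assumptions; no such command exists in this thread beyond the proposal.

```python
from collections import defaultdict


def collate(checks, bucket_seconds=3600):
    """Summarize checks into buckets of bucket_seconds (hourly by default)."""
    buckets = defaultdict(list)
    for check in checks:
        start = check["t"] - check["t"] % bucket_seconds
        buckets[start].append(check)
    summaries = []
    for start in sorted(buckets):
        group = buckets[start]
        summaries.append({
            "start": start,
            "checks": len(group),
            "down": sum(1 for c in group if not c["healthy"]),
            "avg_rtt_ms": sum(c["rtt_ms"] for c in group) / len(group),
        })
    return summaries


checks = [
    {"t": 0, "rtt_ms": 10.0, "healthy": True},
    {"t": 600, "rtt_ms": 30.0, "healthy": False},
    {"t": 3600, "rtt_ms": 20.0, "healthy": True},
]
for summary in collate(checks):
    print(summary)
```

Passing `bucket_seconds=86400` gives daily summaries. Monthly summaries would need calendar-aware bucketing, since months are not a fixed number of seconds.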
