Critical Tasks & Crons should log + alert when interrupted #10206

Open
11 tasks
mekarpeles opened this issue Dec 26, 2024 · 1 comment
Labels

  • Affects: Operations (Affects the IA DevOps folks)
  • Affects: Server (Issues with the server (olweb) or its plugins. [managed])
  • Lead: @jimchamp (Issues overseen by Jim (Front-end Lead, BookNotes) [managed])
  • Needs: Response (Issues which require feedback from lead)
  • Needs: Staff / Internal (Reviewed a PR but don't have merge powers? Use this.)
  • Priority: 2 (Important, as time permits. [managed])
  • Type: Epic (A feature or refactor that is big enough to require subissues. [managed])
  • Type: Feature Request (Issue describes a feature or enhancement we'd like to implement. [managed])

Comments

mekarpeles commented Dec 26, 2024

Proposal

Now

Later

  • web logs

  • cover archival

  • solr updater

  • Know that the cron / dump / service started (or was triggered)

  • Be alerted when a cron / dump fails or succeeds

  • Keep a historical view of how often failures occur (stats.inc?)

  • Look back at how often restarts were attempted / failed
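
The "started", "failed or succeeded", and "historical view" bullets could all be served by one counter per lifecycle event. The following is only a minimal sketch, not an existing Open Library helper: the `ol.cron` prefix and both function names are hypothetical, and the only thing taken as given is the standard statsd counter line format (`name:1|c`, sent over UDP, conventionally to port 8125).

```python
import socket

# Hypothetical metric prefix; this issue does not specify a naming scheme.
PREFIX = "ol.cron"
EVENTS = {"started", "succeeded", "failed"}


def lifecycle_metric(job: str, event: str) -> str:
    """Build a statsd counter line for one lifecycle event of a job."""
    if event not in EVENTS:
        raise ValueError(f"unknown event: {event}")
    # Dots separate path segments in Graphite, so keep them out of the job name.
    safe_job = job.replace(" ", "_").replace(".", "_")
    return f"{PREFIX}.{safe_job}.{event}:1|c"


def send_metric(line: str, host: str = "localhost", port: int = 8125) -> None:
    """Send one metric line over UDP (fire-and-forget, as statsd clients do)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

With counters like these, a "started" count that is not matched by a "succeeded" or "failed" count would hint at an interrupted run, and the failure counts give the look-back view over time.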

Justification

Problem: What problem does this proposal address & for what audience(s)?

Today, when crons or other critical tasks fail, we often learn about it from patrons rather than from our own workflows.

Impact: What's the predicted impact, how do we measure, & what does success look like?

Some of our biggest sources of value (bots that clean things up for us, like the solr restarter or the bot that fixes redirects) can go weeks without running if they break.

Breakdown

Requirements Checklist

  • We should be able to query errors over time for any cron or service using statsd, e.g. https://graphite.us.archive.org/render?
  • We should be pinged via Slack on abc-plus when a cron job fails (or potentially have an issue created for us, but let's start with Slack)
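
The Slack requirement could be approached with a thin wrapper around each cron command. This is a sketch under assumptions, not the project's implementation: `run_and_alert`, `describe_failure`, and `post_to_slack` are hypothetical names, and the webhook URL would be whatever incoming webhook is configured for the abc-plus channel. It relies on two known behaviours: Slack incoming webhooks accept a JSON body with a `text` field, and on POSIX `subprocess` reports a negative return code when the child was killed by a signal, which is how an interruption can be told apart from an ordinary failure.

```python
import json
import subprocess
import urllib.request
from typing import Callable, Sequence


def describe_failure(job: str, returncode: int, stderr: str) -> str:
    """Turn a non-zero exit into a short alert message."""
    if returncode < 0:
        # Negative return code: the child was terminated by signal -returncode.
        reason = f"was interrupted by signal {-returncode}"
    else:
        reason = f"failed with exit code {returncode}"
    tail = stderr.strip()[-500:]  # keep the alert short
    return f"cron {job} {reason}" + (f": {tail}" if tail else "")


def post_to_slack(webhook_url: str, text: str) -> None:
    """POST a message to a Slack incoming webhook (URL supplied by the caller)."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10):
        pass


def run_and_alert(job: str, cmd: Sequence[str], alert: Callable[[str], None]) -> int:
    """Run cmd; call alert(message) only if it exits non-zero or is killed."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        alert(describe_failure(job, result.returncode, result.stderr))
    return result.returncode
```

Passing the alert function in keeps the wrapper testable without network access; in production it might be something like `lambda msg: post_to_slack(WEBHOOK_URL, msg)`, where `WEBHOOK_URL` is a placeholder for the real secret.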

mekarpeles added the Type: Feature Request, Type: Epic, Priority: 2, Affects: Server, Affects: Operations, Lead: @jimchamp, and Needs: Staff / Internal labels Dec 26, 2024
mekarpeles added this to the 2025 (Provisional) milestone Dec 26, 2024

PredictiveManish commented

Is this a bunch of problems together or any big project?

github-actions bot added the Needs: Response label Dec 28, 2024