O+M 16-3-2022 #4224
Comments
Day 1 Summary
Inventory:
Dashboard:
Catalog:
Other news:
This was referenced Mar 3, 2023
Day 2 Summary
Delayed work because of large overhead
Day 3 Summary
Day 4 Summary
This was referenced Mar 9, 2023
Day 5 Summary
Day 6 Summary
Other news
Day 7 Summary
This was referenced Mar 13, 2023
Day 8 Summary
Day 9 Summary
Day 10 Summary
Final thoughts on this O&M shift
As part of the day-to-day operation of Data.gov, there are many Operation and Maintenance (O&M) responsibilities. Rather than having the entire team watch notifications and risk some slipping through the cracks, we have created an O&M Triage role. One person on the team is assigned the Triage role, which rotates each sprint. This is not meant to be a 24/7 responsibility; it covers East Coast business hours only. If you will be unavailable, please note the times in Slack and ask someone to take on the role for that period.
Routine Tasks
These repositories will automatically create failure tickets, so there is no need to check the Actions tab.
Snyk Scans
For Catalog and Inventory, Snyk will create PRs if a dependency needs to be updated.
If either of these actions failed and a PR was created, review and approve/triage it as needed.
If either of these actions failed and a PR was not created, an unfixable vulnerability was found. Check the Snyk UI Console to triage the vulnerability.
Daily Routine
GH Actions
Check the Actions tab of each active repository, as these will not create issues automatically on failure.
Misc
Weekly Routine
Solr
Use this command to find Solr URLs and credentials in the prod space.
Verify their Start time is in sync with Solr Memory Alert history at path /solr/#/
Verify each follower stays with Solr leader at path /solr/#/ckan/core-overview
Verify each Solr is responsive by running a few queries at /solr/#/ckan/query
Inspect each Solr's logging for abnormal errors at /solr/#/~logging
Examine the Solr Memory Utilization Graph to catch any abnormal incidents.
Log in to the tts-jump AWS account with role SSBDev@ssb-production, then go to the custom SolrAlarm dashboard to see the graph for the past 72 hours. No Solr instance should have MemoryUtilization go above the 90% threshold without getting restarted. No Solr instance should restart too often (more than a few times a week).
Acceptance criteria
You are responsible for all O&M responsibilities this week. We've highlighted a few so they're not forgotten.
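For the weekly Solr memory check, the two rules (no instance above the 90% MemoryUtilization threshold without a restart, and no instance restarting too often) could be expressed roughly as follows. This is only a sketch and not part of the team's tooling: the function name, the input shapes, the 30-minute grace period, and the limit of 3 restarts per week are all assumptions made for illustration, and the samples and restart times would have to be copied by hand or exported from the SolrAlarm dashboard.

```python
from datetime import datetime, timedelta

MEMORY_THRESHOLD = 90.0                 # percent, taken from the weekly routine
RESTART_GRACE = timedelta(minutes=30)   # assumption: a restart should follow a breach within this time
MAX_RESTARTS_PER_WEEK = 3               # assumption: stands in for "a few times a week"


def review_instance(samples, restarts):
    """Return a list of findings for one Solr instance (hypothetical helper).

    samples  -- list of (datetime, memory_percent) tuples
    restarts -- list of datetime values, one per restart, covering one week
    """
    findings = []
    for when, percent in samples:
        if percent <= MEMORY_THRESHOLD:
            continue
        # A breach is acceptable only if a restart followed within the grace period.
        restarted = any(when <= r <= when + RESTART_GRACE for r in restarts)
        if not restarted:
            findings.append(
                f"{when.isoformat()}: memory at {percent:.0f}% with no restart"
            )
    if len(restarts) > MAX_RESTARTS_PER_WEEK:
        findings.append(
            f"{len(restarts)} restarts, more than {MAX_RESTARTS_PER_WEEK} in the week"
        )
    return findings
```

Calling `review_instance` once per Solr instance and getting an empty list back would mean that neither rule was violated for that instance under these assumed limits.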