O+M 16-3-2022 #4224

hkdctol · 2023-03-02T21:43:54Z

As part of the day-to-day operation of Data.gov, there are many Operation and Maintenance (O&M) responsibilities. Instead of having the entire team watch notifications and risk some slipping through the cracks, we have created an O&M Triage role. One person on the team is assigned the Triage role, which rotates each sprint. This is not meant to be a 24/7 responsibility; it covers East Coast business hours only. If you will be unavailable, please note when in Slack and ask someone to take on the role for that time.

Routine Tasks

These repositories will automatically create failure tickets, so no need to check the Actions

Snyk Scans

For Catalog and Inventory, Snyk will create PRs if a dependency needs to be updated.

If either of these actions failed and a PR was created, review and approve/triage it as needed.

If either of these actions failed and a PR was not created, an unfixable vulnerability was found. Check the Snyk UI Console to triage the vulnerability.

Daily Routine

GH Actions

Check the Actions tab of each active repository, as these will not create issues automatically on failure.

Catalog DB-Solr-Sync Action: The action should finish in minutes. Examine the number of datasets affected if it takes long to finish.
Tracking Update Action: The action should take 1 to 2 hours to finish on prod. Examine the number of datasets affected or the Solr indexing speed if the time is way off.
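
The two duration expectations above can be expressed as a simple check. This is a minimal sketch, not part of the team's tooling: the function name `classify_run`, the action keys, and the 30-minute ceiling for DB-Solr-Sync are assumptions ("minutes" is not quantified above). Only the 1 to 2 hour window for the tracking update comes from the checklist.

```python
from datetime import timedelta

# Expected run-time windows per action.
# tracking-update: 1 to 2 hours on prod (from the checklist).
# db-solr-sync: the 30-minute ceiling is an ASSUMED stand-in for
# "should finish in minutes"; adjust it to the team's real baseline.
EXPECTED = {
    "db-solr-sync": (timedelta(0), timedelta(minutes=30)),
    "tracking-update": (timedelta(hours=1), timedelta(hours=2)),
}


def classify_run(action: str, duration: timedelta) -> str:
    """Return 'ok', 'too-short', or 'too-long' for a single action run."""
    low, high = EXPECTED[action]
    if duration > high:
        return "too-long"
    if duration < low:
        return "too-short"
    return "ok"


if __name__ == "__main__":
    print(classify_run("db-solr-sync", timedelta(minutes=4)))
    print(classify_run("tracking-update", timedelta(hours=3, minutes=10)))
```

A run classified as anything other than 'ok' is the cue to look at the number of datasets affected or the Solr indexing speed.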

Misc

Verify harvesting jobs are running, and go through the error reports to catch unusual errors that need attention [Wiki doc]
Watch for user email requests
Triage the daily DMARC report from Google sent to [email protected] (only for catalog in prod).
Watch #datagov-alerts and the vulnerable dependency notifications (daily email reports) for critical alerts.

Weekly Routine

Solr

Verify that the Solr leader and each follower are functional

Use these commands to find the Solr URLs and credentials in the prod space.

$ cf t -s prod
$ cf env catalog-web | grep solr -C 2 | grep "uri\|solr_follower_individual_urls\|password\|username"

Verify that each instance's Start time is in sync with the Solr Memory Alert history at path /solr/#/
Verify that each follower stays in sync with the Solr leader at path /solr/#/ckan/core-overview
Verify that each Solr is responsive by running a few queries at /solr/#/ckan/query
Inspect each Solr's logging for abnormal errors at /solr/#/~logging
Examine the Solr Memory Utilization graph to catch any abnormal incidents.
Log in to the tts-jump AWS account with role SSBDev@ssb-production and go to the custom SolrAlarm dashboard to see the graph for the past 72 hours. No Solr instance should have MemoryUtilization go above the 90% threshold without getting restarted, and no Solr instance should restart too often (more than a few times a week).
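
The two memory rules above (no breach of the 90% threshold without a restart, and no overly frequent restarts) can be sketched as a pair of checks. This is illustrative only and does not read from the SolrAlarm dashboard: the function names, the one-hour grace period for a restart to follow a breach, and the limit of three restarts per week are all assumptions. Only the 90% threshold comes from the checklist.

```python
from datetime import datetime, timedelta

MEMORY_THRESHOLD = 90.0              # percent, from the checklist
RESTART_GRACE = timedelta(hours=1)   # ASSUMPTION: restart expected within 1 hour
MAX_RESTARTS_PER_WEEK = 3            # ASSUMPTION for "a few times a week"


def unrestarted_breaches(samples, restarts, grace=RESTART_GRACE):
    """Return timestamps where memory exceeded the threshold and no
    restart followed within `grace`.

    samples:  list of (datetime, memory_percent) for one Solr instance
    restarts: list of datetime restart times for the same instance
    """
    return [
        ts
        for ts, pct in samples
        if pct > MEMORY_THRESHOLD
        and not any(ts <= r <= ts + grace for r in restarts)
    ]


def restarts_too_often(restarts, now, limit=MAX_RESTARTS_PER_WEEK):
    """True if more than `limit` restarts happened in the 7 days before `now`."""
    week_ago = now - timedelta(days=7)
    return sum(week_ago <= r <= now for r in restarts) > limit


if __name__ == "__main__":
    # Made-up sample data for illustration; not taken from the dashboard.
    now = datetime(2023, 3, 10, 12, 0)
    samples = [
        (datetime(2023, 3, 10, 3, 0), 95.0),
        (datetime(2023, 3, 10, 9, 0), 92.0),
        (datetime(2023, 3, 10, 10, 0), 60.0),
    ]
    restarts = [datetime(2023, 3, 10, 3, 30)]
    print(unrestarted_breaches(samples, restarts))
    print(restarts_too_often(restarts, now))
```

Any timestamp returned by `unrestarted_breaches`, or a `True` from `restarts_too_often`, would be worth a closer look at that instance.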

Acceptance criteria

You are responsible for all O&M responsibilities this week. We've highlighted a few so they're not forgotten.

Audit log updated for AU-6 Log auditing (Friday).
Any New Relic alerts have been addressed or GH issues created.
Weekly Duplicate check has been done, and any pertinent issues created.
Weekly Nessus scan has been triaged.
Weekly Snyk scan is complete.
Weekly resources.data.gov link scan
If received, the monthly Netsparker scan has been triaged.
Finishing the shift: Log the number of alerts


nickumia-reisys · 2023-03-03T23:07:34Z

Day 1 Summary

Inventory:

No events

Dashboard:

Did not check (and won't check for the rest of the week unless something prompts me to)
It loaded in the browser 🤷

Catalog:

All three followers restarted this morning (Follower 1 --> 4:45a, Follower 0 --> 6:19a, Follower 2 --> 7:02a EST)
- Each time, Solr recovered well and did not show any issues.
- Notable anomalies that are accepted as normal:
  - Follower 0 is using index.20221007160458545
  - Follower 1 is using index.20230130083830278
  - Follower 2 is using index.20230130083955899
catalog-web had multiple high error alerts (7 counts). @FuhuXia said these were all white noise, probably due to cyber traffic, and that error rates > 5% for a few hours are possible. Since we can't do anything about it, it's a no-op.
Harvesting audit
- 107 error emails
  - 6 Attention items (will update later)
- 9 success emails

Other news:

Started to review Invicti web scan reports
Optimized + Improved automated ckan tasks (catalog)
- Created new way of tracking db-solr-sync and tracking-update events
  - Automated CKAN Job Error Condition catalog.data.gov#845
- Optimize Automated CKAN Command Management catalog.data.gov#846
- Fix auto ckan start time catalog.data.gov#850
- Fix missing comma catalog.data.gov#852
Implemented quiet time for harvesting (with help from @FuhuXia)
No Snyk vulnerabilities

nickumia-reisys · 2023-03-06T23:53:55Z

Day 2 Summary

(Catalog) solr follower 0 restarted at 3:43a without core issues
Automated CKAN Job Error Condition catalog.data.gov#855
logstack-space-drain (staging/prod) app is crashing #4149
Sent email to AWS Support in response to SES automated email
Replied to Michael about Invicti scan
No issues with automated tasks/restarts
No real deployment issues, but concurrency and/or harvesting interruptions
Logged tracking update + db-solr-sync updates
- 📌 DB Solr Sync Auditing Log catalog.data.gov#848
- 📌 Tracking Update Auditing Log catalog.data.gov#847
Update O&M Issue template (Update automated ckan task architecture #4227)
91 Successful harvesting emails
35 Error harvesting emails (all validation errors)
Surprisingly no error rate NR notifications

Delayed work because of large overhead

Debugging JSON loading issues from harvesting on 3/3

nickumia-reisys · 2023-03-08T01:41:18Z

Day 3 Summary

Cleaned up aws (production, development) accounts (only listing things on prod in case I accidentally break something)
- Deleted certificates:
  - f71f8395-cf6e-47d6-a99c-b97aaf4e8561 --> 858a5eca-7f6d-4ca1-875e-3813c8ab6af7
  - c71fcee5-d302-4fd6-95a9-079f380a0c16 --> e3070ef7-b574-4c11-acc7-b8ab30bca246
- Investigated/inventoried what to clean up (will continue tomorrow)
No major events with production
Created a list of known broken harvest sources.
Mostly pairing with others.
28 Successful harvesting emails
11 Error harvesting emails (all validation errors)
@btylerburton + @FuhuXia helped respond to a user request about NOAA.
Triaged two new issues about NAT Gateway cost optimizations
- [Cost Improvements] Consolidate NAT Gateways per AZ GSA-TTS/datagov-brokerpak-solr#85
- [Cost Improvements] Consolidate NAT Gateways per AZ GSA-TTS/datagov-brokerpak-eks#109
Started to investigate Google DMARC reports + certain failed SPF tests.
- Understand SPF as it relates to DMARC Reports #4228
Catalog DCAT-US API dupe check:
- ~6 counts
Catalog Geospatial API duplicate check:
- 35 counts
Ran de-dupe organization overview job:

nickumia-reisys · 2023-03-08T18:59:40Z

Day 4 Summary

No major events on prod
Harvesting audit
- 27 Successful harvesting emails
- 11 Error harvesting emails (all validation errors)
Investigated and possibly solved
- Understand SPF as it relates to DMARC Reports #4228
Cleaned up resources related to the following IDs (prod)
- High Cost IDs
  - 8796cd0b-0066-4675-97e6-94ce402a9fef
  - b3d6f4b4-da69-4232-8571-eba8d8621c52
  - a11a422d-5650-4259-8ea7-785bd930ebd8
  - 3b2ec575-1e58-49c5-b3c0-d546eba12632
  - 69ec800a-4cdb-46a2-a90d-cc15c732a817
  - 858a5eca-7f6d-4ca1-875e-3813c8ab6af7
- Lower Cost IDs
  - c4c0319a-f8b9-4fd2-a83b-6818ec0139c1
  - 034a2f4d-2698-4657-b17b-3d0f64767454
  - 03ddcebb-09fb-4048-9131-3e8e137fd3bb
  - 1d78d98c-db0f-436b-8a67-815609309ffc
  - 3c2071d9-812c-40f8-97f3-1b1e10b6e786
  - 49337bda-ad9b-467a-a826-21bad09b1f15
  - 62c54669-a850-4495-b539-4633ba959d7d
  - 85274566-b7d4-4748-a68b-96ad156c6c1c
  - 8fe5a471-e566-4be6-8efc-c4611415608c
  - a4721a0d-3c58-4107-aeab-af7c3fe31aa1
  - a921e44b-4ba1-4a46-899a-e4484d87a536
  - aa259c0c-4f8e-47a5-98fc-651d87af9eee
  - d15f531b-8181-4382-933c-2364a545e289
  - dadcfa14-ea32-4ecb-9bd1-270a2c797b71
  - dbcbb946-037f-4331-923e-c4bdae9cff18
  - e22da168-d181-4db6-9da3-4ce936c3c753
  - e67cfaee-fafb-434a-abac-5783f84ca9a3
  - e96ff9c0-58d4-4e3e-bea2-6a6c53a4f2bf
  - ecc5f824-2d9d-4269-82c8-054d335b5198

nickumia-reisys · 2023-03-09T23:12:43Z

Day 5 Summary

(Catalog) solr follower 2 restarted at 3:52a + follower 0 restarted at 4:22a, both without core issues
Harvesting audit
- 29 Successful harvesting emails
- 13 Error harvesting emails (all validation errors)
Implemented manual fix for SPF and created new issue for long-term fix
- Fix SPF Verification #3971
- Update datagov-brokerpak-smtp #4232
Updated O&M Issue template
- Streamline O&M Checks #4230
Triaged Snyk vulnerabilities
- [Snyk] OWSLib Vulnerability #4231
- Fix ipdb inventory-app#564

nickumia-reisys · 2023-03-10T21:45:24Z

Day 6 Summary

(catalog) solr follower 1 restarted at 1:29a + follower 2 restarted at 6:27a, both without core issues
Automated CKAN Job Error Condition catalog.data.gov#863
Harvesting Audit
- 106 Successful harvesting emails
- 110 Error harvesting emails
  - 102 validation errors
  - 7 broken harvest source urls | (updated broken list)
    - 6 known
    - 1 new
  - 1 seemingly broken Transformation to ISO (needs investigation, will look into on Monday)
NR Log Review (need to get access to spreadsheet to update)
Possible DMARC incident
- @FuhuXia is running a test to validate that it's a false positive, but I think it's more likely a real positive.

Other news

Actually got to work on and "finish"
- Proof of Concept - SQS + Lambda datagov-harvester-test-aws#1

nickumia-reisys · 2023-03-13T13:28:29Z

Day 7 Summary

(catalog) solr follower 1 restarted 3/11@4:05a, no core issues
(catalog) solr follower 1 restarted today@3:38p, no core issues
(catalog) solr follower 2 restarted today@7:21p, no core issues, but [email protected] GB on leader + followers (maybe db-solr-sync will be a lot tomorrow?)
I noticed that the storage size of the followers dropped on Thursday to ~5.6G; today it's back up to ~7.84G. Given that DB-Solr-Sync had ~15k records to reindex on Thursday and ~13k to reindex on Friday, the two are directly correlated. I'm not sure how it happened, though.
Updated DMARC Inspections:
- @FuhuXia was not able to replicate the issue seen on Friday. However, any recipient that forwards emails will show up in the report.
- Generated the following results with this gist:
  - March 10: Success Rate 1270/1271 = 1.00
  - March 11: Success Rate 1668/1756 = 0.95
  - March 12: Success Rate 546/593 = 0.92
  - March 13: Success Rate 447/485 = 0.92
Harvesting Audit (including weekend)
- 85 Successful harvesting emails
- 56 Error harvesting emails
  - 55 validation errors
  - 1 failed harvest, but good link, NSF (will investigate) (according to @FuhuXia this server has nightly maintenance that gets in the way sometimes, it has since successfully run, so no real issues)
Talked through log review with @FuhuXia. There isn't a good way to find "discrepancies" right now, so we decided to make a ticket to direct the logs from each application to a specialized parsing rule, which will make the right type of information more searchable and recognizable.
- Implement Grok Rules in logstack application #4234
Fixed a typo in ckan_auto CI
- Prod auto tasks should have more ram 🦆 catalog.data.gov#865
- Refactor 2.5G to 2500M catalog.data.gov#868
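
The DMARC success rates listed in this summary are plain ratios of passing messages to total messages. A minimal sketch of that arithmetic follows. It is not the gist referenced above, and the function name `success_rate` is hypothetical; the counts are the ones reported in this summary.

```python
def success_rate(passed: int, total: int) -> float:
    """Fraction of messages that passed, rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(passed / total, 2)


# (passed, total) per day, as reported in the Day 7 summary.
daily_counts = {
    "March 10": (1270, 1271),
    "March 11": (1668, 1756),
    "March 12": (546, 593),
    "March 13": (447, 485),
}

if __name__ == "__main__":
    for day, (passed, total) in daily_counts.items():
        rate = success_rate(passed, total)
        print(f"{day}: Success Rate {passed}/{total} = {rate:.2f}")
```

Rounded to two decimals, March 10 shows as 1.00 even though one of the 1271 messages failed, so it helps to keep the raw counts next to the rate.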

nickumia-reisys · 2023-03-15T14:09:18Z

Day 8 Summary

DMARC Audit
- Success Rate 479/479 = 1.00
Harvesting Audit
- 28 Successful harvesting emails
- 13 Error harvesting emails (all validation errors)
Silence owslib temporarily catalog.data.gov#870
Fix Solr Error when editing NASA Harvest Source (prod) #4236

nickumia-reisys · 2023-03-15T15:26:24Z

Day 9 Summary

(catalog) solr follower 2 restarted today@5:57a, no core issues
DMARC Audit
- Success Rate 447/447 = 1.00
Harvesting Audit
- 26 Successful harvesting emails
- 26 Error harvesting emails (all validation errors)
I'm surprised there was nothing else (just paired with the team on other stuff).

nickumia-reisys · 2023-03-17T13:39:53Z

Day 10 Summary

DMARC Audit
- Success Rate 460/460 = 1.00
Harvesting Audit
- 25 Successful harvesting emails
- 10 Error harvesting emails (all validation errors)

Final thoughts on this O&M shift

We need to assign priorities to certain things. I found myself trying to evaluate how important things were, which slowed my response time.
We need to continue to find good baselines. There are a decent number of holes in what normal looks like for our system.
I started with pretty good momentum, but lost it after the first week.
The minor improvements I was able to make should continue to be expanded upon. Maybe a few months from now, O&M won't be so painful.

nickumia-reisys added this to data.gov team board Mar 2, 2023

hkdctol assigned nickumia-reisys and Jin-Sun-tts Mar 2, 2023

hkdctol added the O&M Operations and maintenance tasks for the Data.gov platform label Mar 2, 2023

hkdctol removed this from data.gov team board Mar 2, 2023

hkdctol added this to data.gov team board Mar 2, 2023

hkdctol moved this to 🏗 In Progress [8] in data.gov team board Mar 2, 2023

This was referenced Mar 3, 2023

Deployment Failure GSA/catalog.data.gov#844

Closed

Only update audit issue if on schedule GSA/catalog.data.gov#853

Merged

Update automated ckan task architecture #4227

Merged

This was referenced Mar 9, 2023

Streamline O&M Checks #4230

Merged

Fix ipdb GSA/inventory-app#564

Merged

nickumia-reisys moved this from 🏗 In Progress [8] to ✔ Done in data.gov team board Mar 16, 2023

hkdctol closed this as completed Mar 16, 2023

nickumia-reisys moved this from ✔ Done to 🗄 Closed in data.gov team board Oct 9, 2023

nickumia-reisys added the Explore label Oct 9, 2023
