O+M 2024-05-20 #4755

hkdctol · 2024-05-17T18:36:08Z

As part of day-to-day operation of Data.gov, there are many Operation and Maintenance (O&M) responsibilities. Instead of having the entire team watching notifications and risking some notifications slipping through the cracks, we have created an O&M Triage role. One person on the team is assigned the Triage role which rotates each sprint. This is not meant to be a 24/7 responsibility, only East Coast business hours. If you are unavailable, please note when you will be unavailable in Slack and ask for someone to take on the role for that time.

Check the O&M Rotation Schedule for future planning.

Acceptance criteria

You are responsible for all O&M responsibilities this week. We've highlighted a few so they're not forgotten. You can copy each checklist into your daily report.

Daily Checklist

Check Production State/Actions

Note: Catalog Auto Tasks
You will need to update the chart values manually. Click the Action link in each issue and grab the values from monitor task output and check runtime.

Check auto generated O&M tickets from no status column
Check Harvesting Emails
New Relic Alerts Triaged
Triage DMARC Report from Google

Weekly Checklist

DB-Solr Sync
Audit Log (more info on AU-3 and AU-6 Log auditing)
Tracking Update
- NOTE: This job will consistently time out, but it is processing results (more details: "change tracking update from nightly job to weekly", #4345)
Check Catalog Solr
Catalog Dupe Check
Check user management requests

Monthly Checklist

Invicti Scan

Ad-hoc Checklist

Audit/review applications on Cloud Foundry and determine what can be stopped and/or deleted.

Reference

Watch for user email requests
Watch in #datagov-alerts and Vulnerable dependency notifications (daily email reports) for critical alerts.
Monitor and improve Data.gov O&M Dashboard
Update and revise Data.gov O&M Tasks


Jin-Sun-tts · 2024-05-20T16:07:43Z

catalog-fetch is constantly crashing due to a Solr error.

2024-05-20T09:19:17.10-0400 [APP/PROC/WEB/3] ERR ckan.lib.search.common.SearchIndexError: Solr returned an error: Solr responded with an error (HTTP 400): [Reason: Error:[doc=0e4a1d3078ff068641f1cea20231812bc4d66124] Unknown operation for the an atomic update: fn]

There were 7 stuck jobs in total. Stopped the DOJ one, then saw the following error in the catalog-fetch log:

   2024-05-20T10:14:51.12-0400 [APP/PROC/WEB/0] ERR raise ParentNotHarvestedException('Unable to find parent dataset. Raising error to allow re-run later')
   2024-05-20T10:14:51.12-0400 [APP/PROC/WEB/0] ERR ckanext.datajson.exceptions.ParentNotHarvestedException: Unable to find parent dataset. Raising error to allow re-run later

The above error crashes the catalog-fetch process. It is a daily job, @hkdctol FYI. For now we have changed the frequency of this harvest source, hhs-cas-json, to Manual to avoid blocking other harvest jobs.

FuhuXia · 2024-05-20T16:18:46Z

In ticket #4223 we listed ParentNotHarvestedException as one of the scenarios that cause catalog-fetch to crash. This error has become a bigger issue over the past few weeks, blocking system-wide harvest jobs. We need to either tell agencies to fix their sources or fix the code so that catalog-fetch does not crash on this error.
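
A minimal sketch of the second option (not crashing on this error), under stated assumptions: the exception class below is a local stand-in for `ckanext.datajson.exceptions.ParentNotHarvestedException`, and `fetch_all`, `fetch_one`, and `demo_fetch` are hypothetical names used only for illustration. They are not the actual catalog-fetch code.

```python
class ParentNotHarvestedException(Exception):
    """Local stand-in for ckanext.datajson.exceptions.ParentNotHarvestedException."""


def fetch_all(jobs, fetch_one):
    """Run fetch_one on each job; defer jobs with a missing parent instead of crashing."""
    done, deferred = [], []
    for job in jobs:
        try:
            fetch_one(job)
            done.append(job)
        except ParentNotHarvestedException:
            # Parent dataset is not harvested yet: keep the job for a later re-run
            # and move on so the remaining jobs are not blocked.
            deferred.append(job)
    return done, deferred


def demo_fetch(job):
    # Hypothetical fetch step: child datasets fail until their parent exists.
    if job.startswith("child"):
        raise ParentNotHarvestedException("Unable to find parent dataset.")


done, deferred = fetch_all(["parent-1", "child-1", "dataset-2"], demo_fetch)
```

The idea is only that one source's missing parent is recorded for a re-run rather than taking down the whole process. Where the exception should really be caught depends on how catalog-fetch consumes its queue.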

Jin-Sun-tts · 2024-05-21T14:18:38Z

This morning catalog-fetch was crashing again, with errors similar to yesterday's. We also found a ParentNotHarvestedException error for the dot-socrata-data-json harvest source:

   2024-05-21T09:56:07.47-0400 [APP/PROC/WEB/1] ERR raise ParentNotHarvestedException('Unable to find parent dataset. Raising error to allow re-run later')
   2024-05-21T09:56:07.47-0400 [APP/PROC/WEB/1] ERR ckanext.datajson.exceptions.ParentNotHarvestedException: Unable to find parent dataset. Raising error to allow re-run later

This is a daily job too. We may need to fix the code so that catalog-fetch does not crash on this error.

Jin-Sun-tts · 2024-05-22T14:50:14Z

catalog-fetch is constantly crashing due to Solr errors. Restarting the Solr leader did not help.

Rebuilding an individual index has no issue.

Manually stopped 13 running jobs and scaled catalog-fetch to 1, then re-harvested sources one by one to try to reproduce the issue.

So far the doj-json harvest has reported the same Solr error. Continuing to test to narrow down which resource could cause this issue.
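
One possible way to narrow this down, offered as a sketch rather than a confirmed diagnosis: Solr treats a map-valued field in an update document as an atomic update, with the map key as the operation name, so a field carrying a nested object with an `fn` key could plausibly produce "Unknown operation for the an atomic update: fn". The helper below scans an index document for such fields. The function name and the sample document are hypothetical, and whether this is the actual cause here is an assumption.

```python
def find_map_valued_fields(doc):
    """Return (field, keys) pairs for fields whose value is, or contains, a dict."""
    suspects = []
    for field, value in doc.items():
        # Treat single values and multi-valued fields the same way.
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict):
                # Solr would read these keys as atomic update operations.
                suspects.append((field, sorted(item)))
    return suspects


# Hypothetical document: "contact" was left as a nested object instead of a string.
sample = {"id": "abc", "title": "Example", "contact": {"fn": "Jane Doe"}}
suspects = find_map_valued_fields(sample)
```

Running a check like this on the document that is about to be indexed, for the package named in the error, would show which field (if any) still holds an unflattened object.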

Jin-Sun-tts · 2024-05-23T13:35:01Z

Thursday 05/23

https://github.com/GSA/data.gov/

Check Catalog Auto Tasks

Check Harvesting Emails

Catalog:
DB-Solr Sync:
4 packages need to be removed from Solr
0 packages need to be updated/added to Solr
974 packages without harvest_object need to be mannually deleted
Finished 528s

The catalog-fetch service is frequently crashing due to the following issues:

Solr errors occurring during doj-json harvesting.

ERR ckan.lib.search.common.SearchIndexError: Solr returned an error: Solr responded with an error (HTTP 400): [Reason: Error:[doc=0e4a1d3078ff068641f1cea20231812bc4d66124] Unknown operation for the an atomic update: fn]

ParentNotHarvestedException errors encountered during dot-socrata-data-json and hhs-cas-json harvesting.

   2024-05-21T09:56:07.47-0400 [APP/PROC/WEB/1] ERR raise ParentNotHarvestedException('Unable to find parent dataset. Raising error to allow re-run later')
   2024-05-21T09:56:07.47-0400 [APP/PROC/WEB/1] ERR ckanext.datajson.exceptions.ParentNotHarvestedException: Unable to find parent dataset. Raising error to allow re-run later

We have set the dot-socrata-data-json and hhs-cas-json frequency to Manual for now to avoid blocking other harvest jobs.

FuhuXia · 2024-05-24T16:59:01Z

A ProxyError on prod catalog-gather was reported for multiple harvest sources. We went through some troubleshooting steps, including restarting the app proxy-gsa-datagov-prod-catalog-gather in space prod-egress. Eventually the issue was resolved, most likely thanks to the app restart.

btylerburton added this to data.gov team board May 17, 2024

hkdctol removed this from data.gov team board May 17, 2024

hkdctol added this to data.gov team board May 17, 2024

hkdctol moved this to 📟 Sprint Backlog [7] in data.gov team board May 17, 2024

hkdctol assigned Jin-Sun-tts May 17, 2024

Jin-Sun-tts moved this from 📟 Sprint Backlog [7] to 🏗 In Progress [8] in data.gov team board May 20, 2024

FuhuXia mentioned this issue May 20, 2024

Automated CKAN Job Error Condition GSA/catalog.data.gov#1347

Closed

FuhuXia closed this as completed May 28, 2024

github-project-automation bot moved this from 🏗 In Progress [8] to ✔ Done in data.gov team board May 28, 2024

FuhuXia mentioned this issue Jun 4, 2024

ParentNotHarvestedException error crashes catalog-fetch #4775

Closed

gujral-rei moved this from ✔ Done to 🗄 Closed in data.gov team board Jun 5, 2024
