"last_seen" date on publisher gets updated even when rss contents are missing #68
Comments
Which is the particular publisher?
@danstoner I've sent you a link that has additional info and logging.
We probably want to add a
We are using the feedparser library, so we can probably leverage its bozo detection: https://pythonhosted.org/feedparser/bozo.html#advanced-bozo For example, on content that is a web page instead of an XML RSS feed:
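A minimal sketch of what such a check could look like. The helper names feed_looks_valid and check_feed_url are hypothetical and not from the repository. It assumes feedparser's documented behavior: bozo is set to a truthy value for documents that are not well-formed, and version is an empty string when no feed format is recognized. Because bozo can also be set on feeds that are still usable (for example on an encoding mismatch), this sketch only rejects a bozo result when no entries were parsed.

```python
def feed_looks_valid(parsed):
    """Return True if a feedparser result looks like a real RSS/Atom feed.

    `parsed` is the dict-like object returned by feedparser.parse().
    Hypothetical helper for illustration only.
    """
    version = parsed.get("version") or ""
    entries = parsed.get("entries") or []
    bozo = bool(parsed.get("bozo"))

    # No recognized feed format: most likely an HTML page or other non-feed content.
    if not version:
        return False
    # Malformed document that also yielded no entries: treat as missing RSS contents.
    if bozo and not entries:
        return False
    return True


def check_feed_url(url):
    """Fetch and parse `url` with feedparser, then apply the check above."""
    import feedparser  # third-party; imported here so the helper above stays testable
    return feed_looks_valid(feedparser.parse(url))
```

With a check like this, an error page served with 200 OK would be expected to come back as not valid, so the caller could skip updating last_seen.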
Note to self, also check if we should be using
See: #72 to find where to correct this.
@mielliott noticed that the last_seen date for a publisher seemed to get updated even when it shouldn't. Did some digging and was able to confirm. The code handles any HTTP return code >400 properly, but one particular publisher's RSS URL was returning their custom "error" page with a 200 OK. Thus, the failure doesn't get handled or bubbled up properly, and no deeper inspection of the string contents occurs.
This field is mostly just for human use but should still be corrected.
The relevant code is at line 164 of update_db_from_rss in /idigbio_ingestion/update_publisher_recordset.py, which then passes to _do_rss in the same file.
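As a rough illustration of the "deeper inspection" mentioned above, the sketch below gates a last_seen update on both the HTTP status and the body actually being a feed. It uses only the standard library and is not the code in update_publisher_recordset.py; the names body_is_feed, should_update_last_seen, and FEED_ROOT_TAGS are hypothetical, and the accepted root elements (rss, feed, RDF) are an assumption about which feed formats matter here.

```python
import xml.etree.ElementTree as ET

# Root element names for RSS 2.0, Atom, and RSS 1.0 (RDF). Assumed set, for illustration.
FEED_ROOT_TAGS = {"rss", "feed", "RDF"}


def body_is_feed(body):
    """Return True if `body` is well-formed XML whose root element is a feed root."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Not well-formed XML at all, e.g. a typical HTML error page or empty content.
        return False
    # Strip a "{namespace}" prefix, if present, to get the local element name.
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name in FEED_ROOT_TAGS


def should_update_last_seen(status_code, body):
    """Only treat the publisher as seen when the response is 2xx and contains a feed."""
    return 200 <= status_code < 300 and body_is_feed(body)
```

For example, should_update_last_seen(200, "<html><body>Error</body></html>") returns False, because the body parses as XML but its root element is html rather than a feed root.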