Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix low quality BWB imports (when no ISBN or unlikely dates) #10157

Open
2 tasks
mekarpeles opened this issue Dec 17, 2024 · 6 comments
Open
2 tasks

Fix low quality BWB imports (when no ISBN or unlikely dates) #10157

mekarpeles opened this issue Dec 17, 2024 · 6 comments
Assignees
Labels
Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] Needs: Breakdown This big issue needs a checklist or subissues to describe a breakdown of work. [managed] Needs: Staff / Internal Reviewed a PR but don't have merge powers? Use this. Priority: 1 Do this week, receiving emails, time sensitive, . [managed] Type: Bug Something isn't working. [managed]

Comments

@mekarpeles
Copy link
Member

mekarpeles commented Dec 17, 2024

Problem

Subtask of #9440

BWB promise records without ISBN seem to have unreliable data. Reject promise items that

  • Have no ISBN or ASIN
  • Have ISBN but dates earlier than ~1966 (unlikely date as it is pre-isbn)

Otherwise, as to not block digitization, clear any other metadata field (such as date, title, etc) before importing.

@mekarpeles mekarpeles added Type: Bug Something isn't working. [managed] Priority: 1 Do this week, receiving emails, time sensitive, . [managed] Needs: Breakdown This big issue needs a checklist or subissues to describe a breakdown of work. [managed] Lead: @seabelis Issuses overseen by Lisa (Staff: Lead Community Librarian) [managed] labels Dec 17, 2024
@mekarpeles mekarpeles added this to the 2025 (Provisional) milestone Dec 17, 2024
@mekarpeles mekarpeles added the Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] label Dec 17, 2024
@hornc
Copy link
Collaborator

hornc commented Dec 17, 2024

Isn't this the issue #9440, with the flow chart , was meant to address?

@mekarpeles
Copy link
Member Author

mekarpeles commented Dec 27, 2024

@hornc yes, both addressing same thing. Can you help us project manage the effort based on the flow chart by providing a Needs: Breakdown This big issue needs a checklist or subissues to describe a breakdown of work. [managed] of the specific aspects / sub-issues that need to be addressed? Once this hit-list exists, I imagine lots of folks would be open to adding these improvements.

I believe the list of rules looks something like:

  • If no ISBN, don't import? (potential followup is to check/use ASIN?)
  • If ISBN but pre-1966, don't import
  • If ISBN but missing (any?) fields [title, date, author] then try to augment with e.g. amz or MARC, else don't import

I do see these seemingly solved:

This issue simply deals with the first two cases (if no ISBN don't import and if ISBN but pre-1966 don't import)

@mekarpeles mekarpeles removed the Lead: @seabelis Issuses overseen by Lisa (Staff: Lead Community Librarian) [managed] label Dec 27, 2024
@mekarpeles mekarpeles assigned scottbarnes and unassigned jimchamp Dec 27, 2024
@scottbarnes
Copy link
Collaborator

scottbarnes commented Dec 27, 2024

I was under the impression from what @seabelis said that the rules from #9440 must not be working as I thought they did, though the pre-1966 thing is a new requirement I am happy to add.

But it would be helpful to know which things are broken, because they must be evading the tests and it's not obvious to me which parts aren't working as intended. I think that may be more in the ballpark of @seabelis than @hornc, though, unless you, Charles, have also noticed ways the flowchart has been improperly implemented.

I am more than happy to fix this in any way, but I am unsure what exactly is not working, aside from the prohibition against importing records that have a publish_date of < 1966 and that also have an ISBN.

@mekarpeles
Copy link
Member Author

I don't know if/whether affiliate server is fetching things right now (is the service fully operational).

I'm not sure what happens to promise items if amz is down (I'm not sure we're rejecting them, in fact I think promise flow has several exemptions that allows it to proceed even if fields are missing).

@mekarpeles mekarpeles added the Needs: Staff / Internal Reviewed a PR but don't have merge powers? Use this. label Dec 27, 2024
@scottbarnes
Copy link
Collaborator

scottbarnes commented Dec 27, 2024

Ah, I see. That is a mistake, then. It was meant to not import records that are incomplete (by the definition in #9440) full stop. I will try to get that fixed by Monday.

@mekarpeles
Copy link
Member Author

This was originally added as a way to ensure promise items were imported (even if data was incomplete) to enable scribes to find the book record.

def load(rec: dict, account_key=None, from_marc_record: bool = False) -> dict:
"""Given a record, tries to add/match that edition in the system.
Record is a dictionary containing all the metadata of the edition.
The following fields are mandatory:
* title: str
* source_records: list
:param dict rec: Edition record to add
:param bool from_marc_record: whether the record is based on a MARC record.
:rtype: dict
:return: a dict to be converted into a JSON HTTP response, same as load_data()
"""
if not is_promise_item(rec):
validate_record(rec)

However, it seems like maybe this is the wrong call.

There are a lot of competing concerns, for instance, we want to:

  1. be able to determine which promise item(s) we've received v. not
  2. avoid duplicates
  3. have correct metadata for items (and not accidentally pull wrong data from 3rd party sources like amz)

These requirements often compromise each other.

Furthermore, a large number of promise items don't have date, so even though I agree with @hornc, I think it's possible promise items would become somewhat worthless to our digitization process if we enforced every requirement in that list (i.e. many records only have title and ISBN)... So we may want a compromise that still passes promise items through different validation or we may need to do more to augment promise records with amazon and google books.

I think for now, the right path forward is to:

  1. enable validation for promise items
  2. add a check before stage_incomplete_records_for_import to remove items that have ISBN but no date, or ISBN but date <1966 (https://github.com/internetarchive/openlibrary/blob/master/scripts/promise_batch_imports.py#L165-L167)

After this, we might consider...

  • Auditing the google + amazon data augmentation to be smarter about which data to choose (i.e. when should amz / google clobber the promise item, when should the data be thrown out e.g. if the fetched information disagrees, etc)
  • keeping track (during a batch import) of which items in a batch fail import (and why) and then upload some sort of json file or summary back to the promise item on archive.org so we have something of an import log to reference in the future. 🤷

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] Needs: Breakdown This big issue needs a checklist or subissues to describe a breakdown of work. [managed] Needs: Staff / Internal Reviewed a PR but don't have merge powers? Use this. Priority: 1 Do this week, receiving emails, time sensitive, . [managed] Type: Bug Something isn't working. [managed]
Projects
None yet
Development

No branches or pull requests

5 participants