Fix low quality BWB imports (when no ISBN or unlikely dates) #10157

mekarpeles · 2024-12-17T19:03:20Z

Problem

Subtask of #9440

BWB promise records without ISBN seem to have unreliable data. Reject promise items that

Have no ISBN or ASIN
Have ISBN but dates earlier than ~1966 (unlikely date as it is pre-isbn)

Otherwise, as to not block digitization, clear any other metadata field (such as date, title, etc) before importing.

hornc · 2024-12-17T20:53:53Z

Isn't this the issue #9440, with the flow chart , was meant to address?

mekarpeles · 2024-12-27T23:10:39Z

@hornc yes, both addressing same thing. Can you help us project manage the effort based on the flow chart by providing a Needs: Breakdown This big issue needs a checklist or subissues to describe a breakdown of work. [managed] of the specific aspects / sub-issues that need to be addressed? Once this hit-list exists, I imagine lots of folks would be open to adding these improvements.

I believe the list of rules looks something like:

If no ISBN, don't import? (potential followup is to check/use ASIN?)
If ISBN but pre-1966, don't import
If ISBN but missing (any?) fields [title, date, author] then try to augment with e.g. amz or MARC, else don't import

I do see these seemingly solved:

Augment non-ISBN ASIN BWB records with BookWorm data #8903
Augment non-ISBN ASIN BWB records with BookWorm data #9030
... though I'm not sure if bookworm / affiliate server is currently operating post new WAF & network rules.

This issue simply deals with the first two cases (if no ISBN don't import and if ISBN but pre-1966 don't import)

scottbarnes · 2024-12-27T23:19:42Z

I was under the impression from what @seabelis said that the rules from #9440 must not be working as I thought they did, though the pre-1966 thing is a new requirement I am happy to add.

But it would be helpful to know which things are broken, because they must be evading the tests and it's not obvious to me which parts aren't working as intended. I think that may be more in the ballpark of @seabelis than @hornc, though, unless you, Charles, have also noticed ways the flowchart has been improperly implemented.

I am more than happy to fix this in any way, but I am unsure what exactly is not working, aside from the prohibition against importing records that have a publish_date of < 1966 and that also have an ISBN.

mekarpeles · 2024-12-27T23:22:36Z

I don't know if/whether affiliate server is fetching things right now (is the service fully operational).

I'm not sure what happens to promise items if amz is down (I'm not sure we're rejecting them, in fact I think promise flow has several exemptions that allows it to proceed even if fields are missing).

scottbarnes · 2024-12-27T23:27:05Z

Ah, I see. That is a mistake, then. It was meant to not import records that are incomplete (by the definition in #9440) full stop. I will try to get that fixed by Monday.

mekarpeles · 2024-12-29T17:18:27Z

This was originally added as a way to ensure promise items were imported (even if data was incomplete) to enable scribes to find the book record.

openlibrary/openlibrary/catalog/add_book/__init__.py

Lines 919 to 934 in 81681fb

    
           def load(rec: dict, account_key=None, from_marc_record: bool = False) -> dict: 
        
               """Given a record, tries to add/match that edition in the system. 
        
               Record is a dictionary containing all the metadata of the edition. 
        
               The following fields are mandatory: 
        
                   * title: str 
        
                   * source_records: list 
        
               :param dict rec: Edition record to add 
        
               :param bool from_marc_record: whether the record is based on a MARC record. 
        
               :rtype: dict 
        
               :return: a dict to be converted into a JSON HTTP response, same as load_data() 
        
               """ 
        
               if not is_promise_item(rec): 
        
                   validate_record(rec)

However, it seems like maybe this is the wrong call.

There are a lot of competing concerns, for instance, we want to:

be able to determine which promise item(s) we've received v. not
avoid duplicates
have correct metadata for items (and not accidentally pull wrong data from 3rd party sources like amz)

These requirements often compromise each other.

Furthermore, a large number of promise items don't have date, so even though I agree with @hornc, I think it's possible promise items would become somewhat worthless to our digitization process if we enforced every requirement in that list (i.e. many records only have title and ISBN)... So we may want a compromise that still passes promise items through different validation or we may need to do more to augment promise records with amazon and google books.

I think for now, the right path forward is to:

enable validation for promise items
add a check before stage_incomplete_records_for_import to remove items that have ISBN but no date, or ISBN but date <1966 (https://github.com/internetarchive/openlibrary/blob/master/scripts/promise_batch_imports.py#L165-L167)

After this, we might consider...

Auditing the google + amazon data augmentation to be smarter about which data to choose (i.e. when should amz / google clobber the promise item, when should the data be thrown out e.g. if the fetched information disagrees, etc)
keeping track (during a batch import) of which items in a batch fail import (and why) and then upload some sort of json file or summary back to the promise item on archive.org so we have something of an import log to reference in the future. 🤷

mekarpeles assigned seabelis and jimchamp Dec 17, 2024

mekarpeles added this to the 2025 (Provisional) milestone Dec 17, 2024

mekarpeles added the Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] label Dec 17, 2024

mekarpeles unassigned seabelis Dec 17, 2024

mekarpeles removed the Lead: @seabelis Issuses overseen by Lisa (Staff: Lead Community Librarian) [managed] label Dec 27, 2024

mekarpeles assigned scottbarnes and unassigned jimchamp Dec 27, 2024

mekarpeles added the Needs: Staff / Internal Reviewed a PR but don't have merge powers? Use this. label Dec 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix low quality BWB imports (when no ISBN or unlikely dates) #10157

Fix low quality BWB imports (when no ISBN or unlikely dates) #10157

mekarpeles commented Dec 17, 2024 •

edited

Loading

hornc commented Dec 17, 2024

mekarpeles commented Dec 27, 2024 •

edited

Loading

scottbarnes commented Dec 27, 2024 •

edited

Loading

mekarpeles commented Dec 27, 2024

scottbarnes commented Dec 27, 2024 •

edited

Loading

mekarpeles commented Dec 29, 2024

Fix low quality BWB imports (when no ISBN or unlikely dates) #10157

Fix low quality BWB imports (when no ISBN or unlikely dates) #10157

Comments

mekarpeles commented Dec 17, 2024 • edited Loading

Problem

hornc commented Dec 17, 2024

mekarpeles commented Dec 27, 2024 • edited Loading

scottbarnes commented Dec 27, 2024 • edited Loading

mekarpeles commented Dec 27, 2024

scottbarnes commented Dec 27, 2024 • edited Loading

mekarpeles commented Dec 29, 2024

mekarpeles commented Dec 17, 2024 •

edited

Loading

mekarpeles commented Dec 27, 2024 •

edited

Loading

scottbarnes commented Dec 27, 2024 •

edited

Loading

scottbarnes commented Dec 27, 2024 •

edited

Loading