Bot Opportunities
Christian Clauss edited this page Mar 21, 2021 · 4 revisions
Also see Ideas for new bots
- A: Authors
- E: Editions
- W: Works
- 'mojibake' Ã© style encoding errors.
From multiple sources
- Some from Amazon bulk imports
- Some from external MARC records
- Some are lossy, others can be reconstructed.
- Important point to note: there are multiple styles that look similar, but require different translation to fix.
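A minimal sketch of the reversible case, in Python: text that was UTF-8 but got decoded with a single-byte codec can often be recovered by re-encoding with that codec and decoding as UTF-8. The function name `try_fix_mojibake` and the list of candidate codecs are assumptions for illustration, not existing Open Library code. Trying more than one codec reflects the point above that similar-looking damage can need a different translation. Lossy cases (for example, characters already replaced by `?`) cannot be recovered this way.

```python
def try_fix_mojibake(text, encodings=("cp1252", "latin-1")):
    """Return a repaired string if a clean round trip exists, else the input.

    Assumption: the damage came from UTF-8 bytes being decoded with one of
    the single-byte codecs in `encodings`.
    """
    for enc in encodings:
        try:
            repaired = text.encode(enc).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # This codec cannot explain the damage; try the next one.
            continue
        if repaired != text:
            return repaired
    # Nothing round-tripped cleanly: leave the text alone.
    return text


print(try_fix_mojibake("\u00c3\u00a9"))  # the two-character sequence "Ã©"
```

Correctly encoded text such as "café" fails the round trip and is returned unchanged, so the function is reasonably safe to run over mixed data, though any bot built on it would still want a dry-run report first.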
- Combining diacritics (GitHub issue)
- Some external MARC records have since been corrected; reimporting them could catch many such cases
- NFC normalisation of titles, descriptions, places, i.e. all text fields. Ensure all input mechanisms (UI and import paths) normalise correctly, and test each of them.
- Github issue
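As a sketch of what NFC normalisation of every text field could look like, the helper below walks a record and normalises every string value with the standard library's `unicodedata.normalize`. The helper name `nfc` and the record shape in the test data are assumptions for illustration; dictionary keys are left untouched.

```python
import unicodedata


def nfc(value):
    """Recursively NFC-normalise all strings inside a JSON-like value."""
    if isinstance(value, str):
        return unicodedata.normalize("NFC", value)
    if isinstance(value, list):
        return [nfc(item) for item in value]
    if isinstance(value, dict):
        return {key: nfc(item) for key, item in value.items()}
    # Numbers, booleans and None pass through unchanged.
    return value


# "e" followed by U+0301 (combining acute) becomes the single code point U+00E9.
record = {"title": "Cafe\u0301", "publish_places": ["Zu\u0308rich"], "revision": 3}
normalised = nfc(record)
```

Running the same helper in both the UI save path and the import path would give a single place to test, which is the concern raised above.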
- Outstanding items (ocaids with spaces) listed here
- Check importbot does not create more.
- Add input validation to UI. PR
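One possible shape for the whitespace check, usable both for listing outstanding records and for validating input: the function below sorts an ocaid into "ok", "strip" (only leading or trailing whitespace, so trimming is a plausible repair) or "review" (internal whitespace or empty, which needs a human). The function name, the category labels and the sample identifier are hypothetical, and the check covers whitespace only, not the full Internet Archive identifier rules.

```python
def classify_ocaid(ocaid):
    """Classify an ocaid by the kind of whitespace problem it has, if any."""
    stripped = ocaid.strip()
    if not stripped:
        return "review"  # empty or whitespace-only value
    if any(ch.isspace() for ch in stripped):
        return "review"  # whitespace inside the identifier
    if stripped != ocaid:
        return "strip"   # only leading/trailing whitespace
    return "ok"


for sample in ["bookofthings00smit", " bookofthings00smit\n", "book of things"]:
    print(repr(sample), classify_ocaid(sample))
```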
- Not many?
- Needs checking. Some URL ids appear to resolve correctly. We suspect that some users are trying to insert affiliate links into OL ids. What is our policy on this? Should we make more effort to ensure that only Internet Archive affiliate links are used?
- _10 and _13
- Other ISBN fields; can they be used for anything?
- repair if possible
- normalise
- remove if bad
- test for reuse on multiple editions
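The repair, normalise, remove and reuse steps above can be sketched with the standard ISBN check-digit rules (weights 10 down to 1 modulo 11 for ISBN-10, alternating weights 1 and 3 modulo 10 for ISBN-13). The function names and the `editions` mapping of edition key to raw ISBN strings are assumptions for illustration. Note that this sketch treats the ISBN-10 and ISBN-13 forms of the same number as different values.

```python
import re
from collections import defaultdict


def clean_isbn(raw):
    """Drop hyphens and whitespace, and upper-case a trailing 'x'."""
    return re.sub(r"[\s-]", "", raw).upper()


def is_valid_isbn10(isbn):
    if not re.fullmatch(r"[0-9]{9}[0-9X]", isbn):
        return False
    digits = [10 if ch == "X" else int(ch) for ch in isbn]
    return sum(w * d for w, d in zip(range(10, 0, -1), digits)) % 11 == 0


def is_valid_isbn13(isbn):
    if not re.fullmatch(r"[0-9]{13}", isbn):
        return False
    total = sum((3 if i % 2 else 1) * int(ch) for i, ch in enumerate(isbn))
    return total % 10 == 0


def normalise_isbn(raw):
    """Return the cleaned ISBN if its checksum is valid, else None."""
    isbn = clean_isbn(raw)
    if is_valid_isbn10(isbn) or is_valid_isbn13(isbn):
        return isbn
    return None


def find_reused_isbns(editions):
    """Map each valid ISBN that appears on more than one edition to those editions."""
    seen = defaultdict(set)
    for key, raw_isbns in editions.items():
        for raw in raw_isbns:
            isbn = normalise_isbn(raw)
            if isbn:
                seen[isbn].add(key)
    return {isbn: sorted(keys) for isbn, keys in seen.items() if len(keys) > 1}
```

A value for which `normalise_isbn` returns `None` is a candidate for manual repair or removal rather than automatic deletion, since a single mistyped digit is often recoverable from the source record.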
- Identify cases
- LCCN
- OCLC
- LibraryThing ids (should these be Work level?), see GitHub issue
- Other
- Fix
- Sometimes the ids are valid, but link to a different book. How to resolve?
- repair if possible
- delete if junk data
- The Open Library convention is natural name order as opposed to "Last, First".
- Is everyone happy with this standard? How should titles be handled, e.g. Sir, Lord, Lady, Mrs.? There seems to be variation in the data for this style of name.
- Also, how should aliases be handled? The author_role type https://openlibrary.org/type/author_role has an 'as' property. Is this to allow for name variations on a particular book?
- Fix Github issue
- Fix Github issue see plan for 5M Orphans
- Github Issue
- Appears to be a merging task now. These editionless works look to be left over from moving editions (translations are a common example) to another work without converting the original work to a redirect.
- Is there some way to go back to the original edition move and complete the work redirection? Perhaps the WorkBot log might be helpful?
- Appears to occur mainly with apparently editionless works. How common is it? If rare, it should be fairly simple to address manually in librarian mode given a list of such editions.
- TODO
- Where the author is known to IA from MARC records, update the records from the original MARC. Many of the missing authors seem to be on books that have editors rather than straight authors. Is this a cause of the problem? How should editors be represented correctly on Open Library works?
- There are 54,000 bad author records associated with Audio CD imports, see Github issue
- Each author is linked to at least one edition, and often a work. Many of these records are editions without works, as the bulk were imported from Amazon in 2008, when these sorts of data issues were common.