Better Author Resolution #10156

Open
6 tasks
mekarpeles opened this issue Dec 17, 2024 · 3 comments
Labels
  • Lead: @mekarpeles. Issues overseen by Mek (Staff: Program Lead) [managed]
  • Module: Identifier Resolution. For resolving records based on identifiers or patterns
  • Module: Import. Issues related to the configuration or use of importbot and other bulk import systems. [managed]
  • Needs: Breakdown. This big issue needs a checklist or subissues to describe a breakdown of work. [managed]
  • Priority: 2. Important, as time permits. [managed]
  • Theme: Identifiers. Issues related to ISBN's or other identifiers in metadata. [managed]
  • Type: Epic. A feature or refactor that is big enough to require subissues. [managed]
  • Type: Feature Request. Issue describes a feature or enhancement we'd like to implement. [managed]

Comments

@mekarpeles
Member

Proposal

This issue is meant to track and manage our various issues related to improving author resolution.

Examples of Past Improvements

We need to compile:

  • Documentation about how our existing author resolution process works
  • A list of existing issues related to this topic (see the label "Module: Identifier Resolution", for resolving records based on identifiers or patterns)
  • A list of common examples of where author resolution is failing
  • A list of agreed-upon improvements we could make (and opinions from stakeholders)
  • Basic tests that catch regressions
  • Basic metrics / instrumentation to know how often conflations are occurring per month or per some other volume metric.
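
As a starting point for the metrics item above, here is a minimal sketch that counts author merges per month. It only assumes that changes can be exported as records with a `kind` and an ISO `timestamp`; those field names and the sample data are hypothetical, and the `merge-authors` kind is taken from the recent-changes URL format. Merge volume is at best a rough proxy for conflations, so treat this as a first instrument, not a definition of the metric.

```python
from collections import Counter
from datetime import datetime


def merges_per_month(changes):
    """Count 'merge-authors' changes per calendar month (YYYY-MM)."""
    counts = Counter()
    for change in changes:
        # Assumption: each change is a dict with 'kind' and an ISO 'timestamp'.
        if change.get("kind") != "merge-authors":
            continue
        ts = datetime.fromisoformat(change["timestamp"])
        counts[ts.strftime("%Y-%m")] += 1
    return counts


# Illustrative records only; these are not real change-log entries.
sample_changes = [
    {"kind": "merge-authors", "timestamp": "2022-07-22T10:00:00"},
    {"kind": "merge-authors", "timestamp": "2022-07-30T08:15:00"},
    {"kind": "edit-book", "timestamp": "2022-07-30T09:00:00"},
    {"kind": "merge-authors", "timestamp": "2022-08-01T12:00:00"},
]

for month, count in sorted(merges_per_month(sample_changes).items()):
    print(month, count)
```

Counting reverted merges the same way would probably track conflations more closely, if reverts can be identified in the data.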

Justification

Authors report that conflations are extremely hard to undo, create a lot of work for librarians, and harm the catalog.

The quantifiable impact (how often this problem occurs) needs to be measured so we know how we're doing.

Breakdown

Requirements Checklist

  • [ ]

Related files

Stakeholders


Instructions for Contributors

Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue, and again each time after pushing code to GitHub, because the pre-commit bot may add commits to your PRs upstream.

@mekarpeles added the labels Type: Feature Request, Theme: Identifiers, Module: Import, Type: Epic, Needs: Breakdown, Priority: 2, Lead: @mekarpeles, and Module: Identifier Resolution on Dec 17, 2024
@seabelis
Collaborator

Here’s the Stephen King example.
The novelist: https://openlibrary.org/authors/OL19981A/Stephen_King
One of the lesser-known Stephen Kings: https://openlibrary.org/authors/OL7829294A/Stephen_King
I created this profile in 2020 when I split off one of the incorrectly attributed works, but the importer keeps importing new titles by the novelist to it. That caused the profile to be incorrectly merged in 2022 (https://openlibrary.org/recentchanges/2022/07/22/merge-authors/95382993), which I reverted, yet new titles belonging to the novelist continue to import to this profile. There are some other Stephen Kings, but this is the profile that keeps getting imported to. Note that the history on an author's profile does not capture items that are moved away from that author.

@hornc
Collaborator

hornc commented Dec 18, 2024

An undifferentiated Stephen King without dates is going to match all other undifferentiated Stephen Kings. (If there is more than one undifferentiated name record, the system will pick the first one that matches, which is effectively random and somewhat non-deterministic.)
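
To make the behaviour described above concrete, here is a minimal sketch. It is not the importer's actual code: `AuthorRecord`, `name_candidates`, `pick_match`, the record keys, and the matching rule itself are simplified assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AuthorRecord:
    # Hypothetical, simplified stand-in for an author record.
    key: str
    name: str
    birth_date: Optional[str] = None
    death_date: Optional[str] = None

    @property
    def undifferentiated(self) -> bool:
        # "Undifferentiated" here means: a name with no dates attached.
        return not (self.birth_date or self.death_date)


def name_candidates(incoming, existing):
    """Return the existing records an incoming author could match (assumed rule)."""
    same_name = [a for a in existing if a.name.casefold() == incoming.name.casefold()]
    if incoming.undifferentiated:
        # A dateless incoming name matches every dateless record with that name.
        return [a for a in same_name if a.undifferentiated]
    # With dates, only records carrying the same dates are candidates.
    return [
        a for a in same_name
        if (a.birth_date, a.death_date) == (incoming.birth_date, incoming.death_date)
    ]


def pick_match(incoming, existing):
    """Pick the first candidate, so the result depends on record order."""
    candidates = name_candidates(incoming, existing)
    return candidates[0] if candidates else None


# Illustrative data only; keys and dates are placeholders, not real records.
records = [
    AuthorRecord("A1", "Stephen King", birth_date="1947"),
    AuthorRecord("A2", "Stephen King"),
    AuthorRecord("A3", "Stephen King"),
]
incoming = AuthorRecord("new", "Stephen King")

print(pick_match(incoming, records).key)                  # A2
print(pick_match(incoming, list(reversed(records))).key)  # A3
```

The same dateless incoming name lands on a different record depending only on the order in which candidates come back, and it never reaches the dated record.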

There needs to be a concrete signal, represented in metadata, for the import system, and anyone else, to know that an author record with only a name has been disambiguated by someone. I see this has a description, which helps human readers, but it is (currently) invisible to the importer. A super basic solution might be to not match undifferentiated author records if the existing record has a description populated? The problem then will be that otherwise undifferentiated authors with descriptions will never be auto-matched by imports. That may be desirable?

Currently this can be achieved by adding the differentiated authors' dates.

If that is insufficient for some reason, alternatives can be proposed which address the specific use case (i.e. a correctly disambiguated author for which dates are not available).
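
For discussion, the "super basic" description-based rule floated above could be sketched roughly like this. It is only a sketch under assumptions: the dict layout, the field names `birth_date`, `death_date` and `description`, and the function names are hypothetical, and the description is assumed to be a plain string.

```python
def is_undifferentiated(author: dict) -> bool:
    """True when the record carries a name but no birth or death date."""
    return not (author.get("birth_date") or author.get("death_date"))


def eligible_for_auto_match(existing: dict) -> bool:
    """Decide whether an import may auto-match an existing author record.

    Assumed rule: a dateless record with a non-empty description is treated
    as manually disambiguated and is skipped by automatic matching.
    """
    if not is_undifferentiated(existing):
        # Dates already differentiate this record; normal matching applies.
        return True
    description = existing.get("description")
    return not (description and description.strip())


# Illustrative records only.
print(eligible_for_auto_match({"name": "Stephen King"}))
print(eligible_for_auto_match({"name": "Stephen King", "description": "Not the novelist."}))
```

This also shows the trade-off mentioned above: once a dateless record has a description, imports never auto-match it, so new works for that author would have to be attached by hand.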

A concrete example makes a specific problem clear and points to specific requirements soooo much better than "Make stuff better"

Thank you for the example @seabelis

@seabelis
Collaborator

@hornc In this case and in many others, there are no known dates to add. I can add a question mark in the date fields in such cases if that will help.

I expect this will also be an issue: https://openlibrary.org/authors/OL14817364A/Ray_Bradbury. This is a common print-on-demand author problem. Famous author names are used precisely so that these works are listed among the famous authors' works and gain visibility.
