Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Internet Archive imports often missing language #10141

Open
cdrini opened this issue Dec 10, 2024 · 16 comments · May be fixed by #10213
Open

Internet Archive imports often missing language #10141

cdrini opened this issue Dec 10, 2024 · 16 comments · May be fixed by #10213
Assignees
Labels
Lead: @scottbarnes Issues overseen by Scott (Community Imports) Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] Needs: Response Issues which require feedback from lead Priority: 2 Important, as time permits. [managed] Type: Bug Something isn't working. [managed]

Comments

@cdrini
Copy link
Collaborator

cdrini commented Dec 10, 2024

Problem

Eg:

Reproducing the bug

  1. Trigger an import of an IA book
  • Expected behavior: Language on IA is copied over to Open Library
  • Actual behavior: Language is just missing

Context

  • Browser (Chrome, Safari, Firefox, etc):
  • OS (Windows, Mac, etc):
  • Logged in (Y/N):
  • Environment (prod, dev, local): prod

Instructions for Contributors

  • Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue and each time after pushing code to Github, because the pre-commit bot may add commits to your PRs upstream.
@cdrini cdrini added Type: Bug Something isn't working. [managed] Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] Needs: Breakdown This big issue needs a checklist or subissues to describe a breakdown of work. [managed] Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] Needs: Lead labels Dec 10, 2024
@NickDevi
Copy link

hello @cdrini I would like to work on this issue. Thank you!

@github-actions github-actions bot added the Needs: Response Issues which require feedback from lead label Dec 12, 2024
@mekarpeles mekarpeles added Priority: 2 Important, as time permits. [managed] Lead: @scottbarnes Issues overseen by Scott (Community Imports) and removed Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] Needs: Lead labels Dec 16, 2024
@mekarpeles
Copy link
Member

@NickDevi thank you for offering! I think first we'll have @scottbarnes investigate and provide next steps if necessary

@mekarpeles mekarpeles added Needs: Investigation This issue/PR needs a root-cause analysis to determine a solution. [managed] Needs: Response Issues which require feedback from lead and removed Needs: Response Issues which require feedback from lead labels Dec 16, 2024
@scottbarnes
Copy link
Collaborator

scottbarnes commented Dec 24, 2024

@NickDevi, sorry for the inexcusable delay. If you are still interested in this, I think this is what's going on.

  1. Languages work for NEW imports via the /api/import/ia endpoint (see Direct Import by OCAID).
  2. Languages ARE NOT added to EXISTING items when using that endpoint to supplement an existing record.
  3. This is because 'languages' is not included in fields which are added to an existing record if the field is empty/absent in the existing record. See
    edition_list_fields = [
    'local_id',
    'lccn',
    'lc_classifications',
    'oclc_numbers',
    'source_records',
    ]
  4. HOWEVER, simply adding 'languages' there will cause an import error. Note: we here enter the realm of belief, where I may be mistaken. This seems to be because the language field is in the wrong format in rec at this point. We need it to look like [{"key": "/languages/eng"}], for example, but it seems to look like "languages": ["eng"]. That won't work.
  5. The reason for this seems to be related to why this would work as an initial import (when there is no matching record) but not when matched to an existing record. That is I think because build_query() is not called when re-importing, but is called upon an initial import, and I think that's the only place the language field gets formatted. See
    if k in ('languages', 'translated_from'):
    for language in v:
    if web.ctx.site.get('/languages/' + language.lower()) is None:
    raise InvalidLanguage(language.lower())
    book[k] = [{'key': '/languages/' + language.lower()} for language in v]
    continue

What this probably means is that:

  1. 'languages' needs to be added to add_book/__init__.py as mentioned in (3) above, and;
  2. the language formatting code in build_query() mentioned in (4) above needs to be broken out into an a helper function and called from both build_query() and update_edition_with_rec_data() (found in add_book/__init__.py).

I don't think any of this is necessarily hard in and of itself, but imports are a bit of a quagmire.

If you are interested in this, and I'm more than happy to do this if you think this is will involve too much wandering in the wilderness, you'd want to read:

  1. https://github.com/internetarchive/openlibrary/wiki/Developer's-Guide-to-Data-Importing#api-authentication-for-imports (for learning how to set a cookie when using imports locally) and
  2. https://github.com/internetarchive/openlibrary/wiki/Developer's-Guide-to-Data-Importing#direct-import-by-ocaid (for importing by OCAID/identifier, such as isbn_9781849353946.

And then test this all locally. This would get you started testing your code using curl:

$ curl -c cookies.txt -X POST "http://localhost:8080/account/login" -d "[email protected]&password=admin123"
$ curl -L -v -X POST "http://localhost:8080/api/import/ia" \
     -b ./cookies.txt \
     -d "identifier=isbn_9781849353946&require_marc=false&force_import=true"
...
{"authors": [{"key": "/authors/OL5A", "name": "Leslie Kaplan", "status": "created"}], "success": true, "edition": {"key": "/books/OL11M", "status": "created"}, "work": {"key": "/works/OL1W", "status": "created"}}%

That should import. If you remove the language (possibly using http://localhost:8080/books/OL11M.yml?m=edit, where OL11M is the edition identifier you got when importing), and then attempt to re-import the same record, it should match, but not add English back as a language, which is the problem and the reason this issue exists.

From there, the fun begins editing the Python code.

Again, please let me know if you're still interested. I would say because of the various parts here, this isn't a great first issue.

Edit: fix file names.

@scottbarnes scottbarnes removed Needs: Breakdown This big issue needs a checklist or subissues to describe a breakdown of work. [managed] Needs: Investigation This issue/PR needs a root-cause analysis to determine a solution. [managed] Needs: Response Issues which require feedback from lead labels Dec 24, 2024
@qotkdrn
Copy link

qotkdrn commented Dec 24, 2024

Hello, my name is Alex Bae. I was wondering if I could contribute to this issue?

@scottbarnes
Copy link
Collaborator

@NickDevi, is this still of interest to you?

@mekarpeles mekarpeles changed the title Internet Archive import often not setting language Internet Archive imports often missing language Dec 26, 2024
@mekarpeles
Copy link
Member

@qotkdrn why don't you give it a shot if you'd like

@scottbarnes
Copy link
Collaborator

@qotkdrn, just to follow up, if you have not noticed, you should have a Slack invite in your email (possibly in a spam folder).

@qotkdrn
Copy link

qotkdrn commented Dec 26, 2024

I got it!

@qotkdrn
Copy link

qotkdrn commented Dec 26, 2024

I’m currently experiencing an issue with connecting to the Open Library server running on my local machine. When I attempt to make a POST request to http://localhost:8080/account/login using curl, I get the following error:

curl: (7) Failed to connect to localhost port 8080 after 8 ms: Couldn't connect to server

It was working when I ran it earlier

@scottbarnes
Copy link
Collaborator

Hmmm, I can't speak to the error, but ideally once the cookie is set, it need not be set again. Does the process work without that step?

@qotkdrn
Copy link

qotkdrn commented Dec 27, 2024 via email

@scottbarnes
Copy link
Collaborator

scottbarnes commented Dec 27, 2024

That looks correct. /books/OL5M is the edition, so if you go to http://localhost:8080/books/OL5M you should see it there. If you try to import it again at this point, the results should mention success and the result should once again match, as it did in the example you shared, as the edition and work had already been created.

If you go to http://localhost:8080/books/OL5M.yml?m=edit and remove the language, and import it again, what will happen (I imagine) is it will match again, and yet the language will continue to be absent. That is the thing we're trying to solve.

The goal is that if the edition already exists, as OL5M does here, then if there is no language present, and the import record has one (as it does in this case), then the language should be added.

@qotkdrn
Copy link

qotkdrn commented Dec 27, 2024 via email

@scottbarnes
Copy link
Collaborator

Just open up a pull request and I will try to look at it tonight or over the weekend.

Please also include how you tested the pull request/fix, including before and after examples demonstrating it broken, and showing it fixed.

@qotkdrn
Copy link

qotkdrn commented Dec 28, 2024 via email

@github-actions github-actions bot added the Needs: Response Issues which require feedback from lead label Dec 28, 2024
@el4ctr0n
Copy link

@cdrini Please note that page counts are also missing.

Problem

Eg:

Reproducing the bug

  1. Trigger an import of an IA book
  • Expected behavior: Language on IA is copied over to Open Library
  • Actual behavior: Language is just missing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Lead: @scottbarnes Issues overseen by Scott (Community Imports) Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] Needs: Response Issues which require feedback from lead Priority: 2 Important, as time permits. [managed] Type: Bug Something isn't working. [managed]
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants