Daily updates #24

Closed
nichtich opened this issue Aug 25, 2022 · 5 comments

nichtich commented Aug 25, 2022

Related to #17, there should also be an update script that can handle partial updates. The update could be a .tsv or .tsv.gz file as well, but it may include rows with an empty vocabulary (just the PPN) to indicate removal of a record:

# collect the PPNs of all affected records (rows are grouped by PPN, so uniq suffices)
awk -F'\t' '{print $1}' update.tsv | uniq > ppns
# TODO: remove rows with PPN in file ppns
# keep only rows with a vocabulary, i.e. drop PPN-only rows (records without subject indexing)
awk -F'\t' '$2 != ""' update.tsv > import.tsv
# TODO: import import.tsv into database without purging database

Alternatively, keep a full dump as a file and apply the update to this file to get an updated full dump (this may even be faster, depending on the size of the updates).
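
The dump-based alternative can be sketched in JavaScript as follows. This is only an illustration under assumptions: the function name applyUpdate is hypothetical, both files are assumed to be already read into arrays of tab-separated lines, and a real dump may be too large to hold in memory.

```javascript
// Sketch: apply a partial update to a full dump held in memory.
// Assumes each line is "PPN\tvocabulary\tnotation" or just "PPN" (removal).
// applyUpdate is a hypothetical name, not part of the project.
function applyUpdate(dumpLines, updateLines) {
  const updateRows = updateLines
    .filter((line) => line.trim() !== "")
    .map((line) => line.split("\t"))
  // Every PPN mentioned in the update has its old rows dropped from the dump.
  const affected = new Set(updateRows.map((cols) => cols[0]))
  const kept = dumpLines.filter((line) => !affected.has(line.split("\t")[0]))
  // Rows with only a PPN carry no subject indexing and are not re-added.
  const added = updateRows
    .filter((cols) => cols[1])
    .map((cols) => cols.join("\t"))
  return [...kept, ...added]
}
```

Note that the result is no longer sorted by PPN, but rows of the same PPN stay together as long as the update itself is grouped by PPN.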

Use case: There are daily jobs at the K10plus CBS database that pass updated records to LBS and to the K10plus central Solr index.

@nichtich

TSV files are always grouped by PPN. The set of rows for each PPN is either in the known format, e.g.:

12345   rvk      XY 333
12345   bk       33.33

resulting in rows [{voc: "rvk", notation: "XY 333"}, {voc: "bk", notation: "33.33"}], or it is just one row with empty voc and notation to only delete the record (rows = []):

12345

See the method updateRecord in the SQLite backend (dev branch), which is to be passed this parsed TSV data.
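
A minimal sketch of how the TSV data described above could be parsed into per-PPN row sets. The function name groupByPPN and the { ppn, rows } result shape are assumptions for illustration; the actual signature of updateRecord is not shown in this thread, so the sketch does not call it.

```javascript
// Sketch: group TSV lines by PPN into { ppn, rows } objects.
// Relies on the guarantee that the file is grouped by PPN.
// groupByPPN and the result shape are hypothetical.
function groupByPPN(tsv) {
  const records = []
  for (const line of tsv.split("\n")) {
    if (line.trim() === "") continue
    const [ppn, voc, notation] = line.split("\t")
    let last = records[records.length - 1]
    if (!last || last.ppn !== ppn) {
      // A new PPN starts a new record with an initially empty row set.
      last = { ppn, rows: [] }
      records.push(last)
    }
    // A PPN-only line adds nothing, so rows stays [] (delete the record).
    if (voc && notation) last.rows.push({ voc, notation })
  }
  return records
}
```

Each resulting object could then be handed to the backend, one call per PPN.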

@stefandesu

So the next step would be to add an update script that calls methods in the SQLite backend, and that also allows both partial and full updates? Something like:

# partial update by default
./bin/import update.tsv
# full update with flag
./bin/import --full subjects.tsv

Full updates would clear the whole table instead of deleting records for single PPNs, so we would likely need an additional method in the backend.
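
The difference between the two modes might look roughly like this. It is a sketch only: a Map stands in for the database table, and importRecords and its full option are hypothetical names rather than the backend's actual methods.

```javascript
// Sketch: partial vs. full import against an in-memory stand-in for the table.
// store is a Map from PPN to its rows; records are { ppn, rows } objects.
// importRecords and the "full" option are hypothetical.
function importRecords(store, records, { full = false } = {}) {
  // A full update clears everything first instead of deleting per PPN.
  if (full) store.clear()
  for (const { ppn, rows } of records) {
    if (rows.length > 0) {
      // Replaces whatever was stored for this PPN before.
      store.set(ppn, rows)
    } else {
      // An empty row set means the record is removed.
      store.delete(ppn)
    }
  }
  return store
}
```

With a real database, the full variant would probably clear the table and insert all rows inside one transaction, so that readers never see an empty table.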

The script also needs a --modified flag for #25 and should update the modified metadata in the database.

stefandesu added a commit that referenced this issue Sep 2, 2022
Currently only supports full import via TSV file. No documentation yet.
@stefandesu

@nichtich I feel like partial imports are not yet 100% clear. My suggestion for the TSV format for partial import would be this:

12345

= delete all records for PPN 12345

12345	rvk

= delete all RVK records for PPN 12345

12345	rvk	XY 333

= add record for PPN 12345 (but do not delete anything)

For example, if the update were to 1) remove the existing DDC record, 2) replace the one existing RVK record, and 3) add an additional BK record, it would look like this:

12345	ddc
12345	rvk
12345	rvk	XY 333
12345	bk	33.33

Or would you prefer to do it differently? I think this would cover all cases, even though removal of a single record would mean all other records for that PPN/vocabulary would need to be listed again. (I think in your case, removal of a single record would mean ALL other records for that PPN, regardless of vocab, would need to be listed again.)
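
To make the suggested semantics concrete, here is a rough sketch that applies such update lines to a list of existing rows. The function name applyPartialRows and the { ppn, voc, notation } row shape are assumptions; lines are processed in file order, so deletions must precede the additions they are meant to make room for.

```javascript
// Sketch of the three line types suggested above, applied in file order:
//   "PPN"                -> delete all rows for that PPN
//   "PPN\tvoc"           -> delete all rows for that PPN and vocabulary
//   "PPN\tvoc\tnotation" -> add a row (nothing is deleted)
// applyPartialRows and the row shape are hypothetical.
function applyPartialRows(existing, updateLines) {
  let rows = [...existing]
  for (const line of updateLines) {
    if (line.trim() === "") continue
    const [ppn, voc, notation] = line.split("\t")
    if (!voc) {
      rows = rows.filter((r) => r.ppn !== ppn)
    } else if (!notation) {
      rows = rows.filter((r) => !(r.ppn === ppn && r.voc === voc))
    } else {
      rows.push({ ppn, voc, notation })
    }
  }
  return rows
}
```

Applied to the four-line example above, this removes the DDC row, replaces the RVK row, and adds the BK row.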

stefandesu commented Sep 5, 2022

There's now a basic working implementation of the import script. It will be finished in #27.

nichtich commented Aug 8, 2023

This is not part of the software but its deployment and configuration, so closing this issue.

nichtich closed this as completed Aug 8, 2023