Copyright (C) 2018-2022 The Open Library Foundation
This software is distributed under the terms of the Apache License, Version 2.0. See the file "LICENSE" for more information.
- Introduction
- Compiling
- Docker
- Installing the module
- Deploying the module
- Maximum upload file size and java heap memory setups
- Scalability
- File splitting configuration
- Interaction with AWS S3/Minio
- Queue prioritization algorithm
- Interaction with Kafka
- Other system properties
- Issue tracker
- Additional information
- Script to upload a batch of MARC records
## Introduction

mod-data-import is responsible for uploading files (see documentation for file uploading), initial handling, and sending records for further processing (see documentation for file processing).
## Compiling

```
mvn install
```

See that it says "BUILD SUCCESS" near the end.
## Docker

Build the docker container with:

```
docker build -t mod-data-import .
```

Test that it runs with:

```
docker run -t -i -p 8081:8081 mod-data-import
```
## Installing the module

Follow the guide of Deploying Modules sections of the Okapi Guide and Reference, which describe the process in detail.

First of all you need a running Okapi instance. (Note that specifying an explicit 'okapiurl' might be needed.)

```
cd .../okapi
java -jar okapi-core/target/okapi-core-fat.jar dev
```

We need to declare the module to Okapi:

```
curl -w '\n' -X POST -D - \
  -H "Content-type: application/json" \
  -d @target/ModuleDescriptor.json \
  http://localhost:9130/_/proxy/modules
```
That ModuleDescriptor tells Okapi what the module is called, what services it provides, and how to deploy it.
## Deploying the module

Next we need to deploy the module. There is a deployment descriptor in `target/DeploymentDescriptor.json`. It tells Okapi to start the module on 'localhost'.

Deploy it via Okapi discovery:

```
curl -w '\n' -D - -s \
  -X POST \
  -H "Content-type: application/json" \
  -d @target/DeploymentDescriptor.json \
  http://localhost:9130/_/discovery/modules
```

Then we need to enable the module for the tenant:

```
curl -w '\n' -X POST -D - \
  -H "Content-type: application/json" \
  -d @target/TenantModuleDescriptor.json \
  http://localhost:9130/_/proxy/tenants/<tenant_name>/modules
```
## Maximum upload file size and java heap memory setups

The current implementation supports storing uploaded files only in LOCAL_STORAGE (the file system of the module). This has a couple of implications:

- the request to process a file can be handled only by the same module instance that received the upload, which prevents mod-data-import from scaling
- the size of a file that can be uploaded is limited by the Java heap memory allocated to the module. The Java heap should be at least the expected maximum file size plus 10 percent:

File Size | Java Heap size
---|---
256 MB | 270+ MB
512 MB | 560+ MB
1 GB | 1.1+ GB
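For example, to support 1 GB uploads you might start the container with roughly 1.1 GB of heap. This is a sketch; it assumes the container image passes the `JAVA_OPTIONS` environment variable through to the JVM, as the standard FOLIO base images do:

```
# ~1.1 GB heap, matching the sizing guidance in the table above
docker run -t -i -p 8081:8081 \
  -e JAVA_OPTIONS="-Xmx1152m" \
  mod-data-import
```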
## Scalability

To initiate processing of a file, the user chooses a Job Profile; this information is crucial, as it essentially contains the instructions on what to do with the uploaded file. However, this happens after the file is uploaded and reaches mod-data-import as a separate request. External storage is required to make mod-data-import scalable. The module is able to read its configuration settings from mod-configuration. To allow multiple-instance deployment, the same persistent volume must be mounted for every instance at the mount point defined by the value of the `data.import.storage.path` property.

- `data.import.storage.type` - type of data storage used for uploaded files. The default value is LOCAL_STORAGE. Other storage implementations are yet to be added.
- `data.import.storage.path` - path where uploaded files will be stored
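A minimal sketch of passing these properties as JVM system properties when running the module directly; the mount point and jar name are illustrative:

```
# store uploads on the shared persistent volume mounted at /data/import-uploads
java -Ddata.import.storage.type=LOCAL_STORAGE \
     -Ddata.import.storage.path=/data/import-uploads \
     -jar target/mod-data-import-fat.jar
```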
## File splitting configuration

The file-splitting process may be configured with the following environment variables:
Name | Type | Required | Default | Description
---|---|---|---|---
`SPLIT_FILES_ENABLED` | `true` or `false` | yes, if enabling the feature | `false` | Whether files should be split into chunks and processed separately
`RECORDS_PER_SPLIT_FILE` | integer > 0 | no | `1000` | The maximum number of records to include in a single split file
`ASYNC_PROCESSOR_POLL_INTERVAL_MS` | integer (ms) ≥ 0 | no | `5000` | The number of milliseconds between checks of the queue for waiting jobs
`ASYNC_PROCESSOR_MAX_WORKERS_COUNT` | integer ≥ 1 | no | `1` | The maximum number of jobs this instance will process concurrently
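For example, to enable splitting with two concurrent workers on an instance (the values shown are illustrative, not recommendations):

```
docker run -t -i -p 8081:8081 \
  -e SPLIT_FILES_ENABLED=true \
  -e RECORDS_PER_SPLIT_FILE=1000 \
  -e ASYNC_PROCESSOR_POLL_INTERVAL_MS=5000 \
  -e ASYNC_PROCESSOR_MAX_WORKERS_COUNT=2 \
  mod-data-import
```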
For the polling interval, a lower number decreases the latency between when a job is added to the queue and when it is processed. However, it also results in more frequent database queries, which may impact performance. Note that the number set here is the "worst case": the average wait would be half of it, and a delay of a few seconds on a large import is hardly noticeable.
The worker count is useful for production/multi-tenant environments, where you might want to provide more capacity without additional instances. However, note that this may cause some odd behavior when only one user is running a job, as multiple parts may appear to complete together.
> **Note**
> For full information about this feature, please view the release notes.
## Interaction with AWS S3/Minio

This module uses S3-compatible storage as part of the file upload process. The following environment variables must be set with values for your S3-compatible storage (AWS S3, MinIO Server):
Name | Type | Required | Default | Description
---|---|---|---|---
`AWS_URL` | URL as string | yes | `http://127.0.0.1:9000/` | URL of the S3-compatible storage
`AWS_REGION` | string | yes | none | S3 region
`AWS_BUCKET` | string | yes | none | Bucket to store and retrieve data
`AWS_ACCESS_KEY_ID` | string | yes | none | S3 access key
`AWS_SECRET_ACCESS_KEY` | string | yes | none | S3 secret key
`AWS_SDK` | `true` or `false` | no, if using MinIO | `false` | Whether AWS S3 is being used (`true` for AWS S3, `false` for other platforms such as MinIO)
`S3_FORCEPATHSTYLE` | `true` or `false` | no | `false` | Whether path-style requests should be used instead of virtual-hosted-style requests
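As a sketch, a configuration for a local MinIO server might look like this (the bucket name and credentials are placeholders; note that MinIO typically requires path-style requests):

```
export AWS_URL=http://127.0.0.1:9000/
export AWS_REGION=us-east-1
export AWS_BUCKET=data-import
export AWS_ACCESS_KEY_ID=minioadmin      # placeholder credential
export AWS_SECRET_ACCESS_KEY=minioadmin  # placeholder credential
export AWS_SDK=false                     # not AWS S3
export S3_FORCEPATHSTYLE=true            # MinIO usually needs path-style requests
```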
Path-style and virtual-hosted-style requests are described in the AWS S3 documentation.
> **Warning**
> It is possible for files to be partially uploaded but abandoned in the UI. This module makes no effort to detect these cases and proactively delete the files. Instead, use the retention policies built into AWS S3 and MinIO, as described here.
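For example, a lifecycle rule that expires leftover objects and aborts incomplete multipart uploads after one day could be applied with the AWS CLI (a sketch; the bucket name and rule ID are illustrative, and MinIO supports equivalent rules through its `mc ilm` commands):

```
aws s3api put-bucket-lifecycle-configuration \
  --bucket data-import \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "expire-abandoned-uploads",
      "Status": "Enabled",
      "Filter": {},
      "Expiration": { "Days": 1 },
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 1 }
    }]
  }'
```

Adjust the filter and expiry to match your retention needs so in-progress uploads are not removed prematurely.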
## Queue prioritization algorithm

The queue prioritization algorithm is controlled by the following environment variables:
> **Note**
> None of these are required; if not set, the default values below will be used.
Name | Type (unit) | Default | Reasoning
---|---|---|---
`SCORE_JOB_SMALLEST` | integer | `40` |
`SCORE_JOB_LARGEST` | integer | `-40` | Larger jobs should be deprioritized
`SCORE_JOB_REFERENCE` | integer (records) | `100000` |
`SCORE_AGE_NEWEST` | integer | `0` | New jobs begin with no boost
`SCORE_AGE_OLDEST` | integer | `50` | As jobs age, their score increases rapidly, so this does not have to be too high. We want small jobs to "cut" in line effectively.
`SCORE_AGE_EXTREME_THRESHOLD_MINUTES` | integer (minutes) | `480` | 8 hours
`SCORE_AGE_EXTREME_VALUE` | integer | `10000` | Jump to the top of the queue if waiting more than 8 hours
`SCORE_TENANT_USAGE_MIN` | integer | `100` | If the tenant has no jobs running, it should be prioritized
`SCORE_TENANT_USAGE_MAX` | integer | `-200` | If the tenant is using all available workers, it should be significantly deprioritized. If no other tenants are competing, this will not matter (since all jobs would be offset by this)
`SCORE_PART_NUMBER_FIRST` | integer | `1` | Very small; we only want to order parts amongst others within a job (which would likely have the same score otherwise)
`SCORE_PART_NUMBER_LAST` | integer | `0` |
`SCORE_PART_NUMBER_LAST_REFERENCE` | integer | `100` | Does not really matter due to the small range
For information on what these mean, how to configure them, how scores are calculated, and even a playground to experiment with different values, please see this wiki page.
> **Important**
> To disable an individual metric (or the prioritization altogether), set the relevant value(s) to `0`.
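For example, to disable only the age-based scoring (a sketch using the variable names from the table above):

```
# zero out every age-related metric; other metrics remain in effect
export SCORE_AGE_NEWEST=0
export SCORE_AGE_OLDEST=0
export SCORE_AGE_EXTREME_VALUE=0
```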
> **Note**
> We recommend the suggested values above; however, there is a lot of room for customization and extension as needed. Please see the doc for more information.
## Interaction with Kafka

All modules involved in data import (mod-data-import, mod-source-record-manager, mod-source-record-storage, mod-inventory, mod-invoice) communicate directly via Kafka. Therefore, to enable data import, Kafka must be set up properly and all the necessary parameters must be set for the modules.

Properties that are required for mod-data-import to interact with Kafka:

- `KAFKA_HOST`
- `KAFKA_PORT`
- `OKAPI_URL`
- `ENV` (unique env ID)
There are other important properties: the numbers of partitions for the topics `DI_INITIALIZATION_STARTED` and `DI_RAW_RECORDS_CHUNK_READ`, which are created during tenant initialization. Their values can be customized with the `DI_INITIALIZATION_STARTED_PARTITIONS` and `DI_RAW_RECORDS_CHUNK_READ_PARTITIONS` environment variables respectively. The default value is `1`.
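A sketch of a container configuration combining these settings (the hostnames and values are illustrative):

```
docker run -t -i -p 8081:8081 \
  -e KAFKA_HOST=kafka \
  -e KAFKA_PORT=9092 \
  -e OKAPI_URL=http://okapi:9130 \
  -e ENV=folio-dev \
  -e DI_INITIALIZATION_STARTED_PARTITIONS=1 \
  -e DI_RAW_RECORDS_CHUNK_READ_PARTITIONS=4 \
  mod-data-import
```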
## Other system properties

Initial handling of an uploaded file means chunking it and sending the records for processing in other modules. The chunk size can be adjusted for different file types; otherwise default values will be used (as shown in the sketch after this list):

- `"file.processing.marc.raw.buffer.chunk.size": 50` - applicable to MARC files in binary format
- `"file.processing.marc.json.buffer.chunk.size": 50` - applicable to files with MARC data in JSON format
- `"file.processing.marc.xml.buffer.chunk.size": 10` - applicable to files with MARC data in XML format
- `"file.processing.edifact.buffer.chunk.size": 10` - applicable to EDIFACT files
## Issue tracker

See project MODDATAIMP at the FOLIO issue tracker.
## Additional information

The raml-module-builder framework.

Other modules.

See project MODDATAIMP at the FOLIO issue tracker.

Other FOLIO Developer documentation is at dev.folio.org.
## Script to upload a batch of MARC records

The `scripts` directory contains a shell script, `load-marc-data-into-folio.sh`, and a file with a sample of 100 MARC records, `sample100.marc`. This script can be used to upload any batch of MARC records automatically, using the same sequence of WSAPI operations as the Secret Button. First, log in to a FOLIO backend service using the Okapi command-line utility or any other means that leaves definitions of the Okapi URL, tenant, and token in the `.okapi` file in the home directory. Then run the script, naming the MARC file as its argument:
```
scripts$ echo OKAPI_URL=https://folio-snapshot-stable-okapi.dev.folio.org > ~/.okapi
scripts$ echo OKAPI_TENANT=diku >> ~/.okapi
scripts$ okapi login
username: diku_admin
password: ************
Login successful. Token saved to /Users/mike/.okapi
scripts$ ./load-marc-data-into-folio.sh sample100.marc
=== Stage 1 ===
=== Stage 2 ===
=== Stage 3 ===
HTTP/2 204
date: Thu, 27 Aug 2020 11:55:28 GMT
x-okapi-trace: POST mod-authtoken-2.6.0-SNAPSHOT.73 http://10.36.1.38:9178/data-import/uploadDefinitions/123a8d01-e389-4893-a53e-cc2de846471d/processFiles.. : 202 7078us
x-okapi-trace: POST mod-data-import-1.11.0-SNAPSHOT.140 http://10.36.1.38:9175/data-import/uploadDefinitions/123a8d01-e389-4893-a53e-cc2de846471d/processFiles.. : 204 6354us
scripts$
```