The Automation Service automatically generates a model from the data it is given and offers endpoints to interact with it. In this documentation, the examples follow a NER use case concerning names, i.e., the model tries to identify first names, middle names, and last names. However, you may define any entities to recognize that you want. A demo of the service can be accessed at http://demos.swe.htwk-leipzig.de.
The service can be run as standalone or within a Qanary-driven Question Answering system.
There are two options (requirements) for starting the service:

- option 1: a pre-trained model, or
- option 2: compatible datasets for training and testing.
If a pre-trained model is to be provided, it must be available in the folder `AutomationService/AutomationServiceBackend/data/model` (default configuration). The service works only with spaCy models; hence, your model needs to follow the spaCy standards (or should be trained using spaCy). Simply copy the contents of a trained model (usually found in the folder `model-best` or `model-last`) into the mentioned folder.
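A minimal sketch of this step, assuming the default folder layout and a locally installed spaCy; the source path is a placeholder:

```bash
# Copy the contents of a trained spaCy model (e.g. model-best) into the
# folder the service reads from (default configuration).
cp -r /path/to/training/output/model-best/. \
  AutomationService/AutomationServiceBackend/data/model/

# Optional sanity check: a model that follows the spaCy standards can be
# loaded directly from that folder.
python -c "import spacy; spacy.load('AutomationService/AutomationServiceBackend/data/model')"
```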
If no pre-trained model is provided, training and testing data must be provided to the system. Otherwise, the web service will not start. Both files must be provided in the folder `AutomationService/AutomationServiceBackend/data/trainingdata`. Additionally, the file names must be defined in the `.env` file. Both datasets must always be in CSV format and meet the following requirements:
- Each file contains a column for the input text (first column) and one column for each entity the model should be able to identify.
- Each text is written into the text column and, additionally, the values for each entity inside the text are given separately in the respective columns.
- If a text does not contain a value for a defined entity, the corresponding cell must be empty.
An exemplary CSV-formatted dataset for recognizing the names of people would look something like this:
Name | First_Name | Middle_Name | Last_Name |
---|---|---|---|
I am Ms Walters | | | Walters |
Do you think Silke will come? | Silke | | |
I do have a middlename, it’s Heinz-Wilhelm | | Heinz-Wilhelm | |
You can send the data to Ingetraut Renz | Ingetraut | | Renz |
When generating the training data, only one value per entity type can be given for each text; the training process will not work with multiple values. However, the resulting model is set up in code to recognize and work with multiple results.
Training and testing data must follow the same basic structure (i.e., they must have the same column names).
To start the service, docker-compose files are provided; therefore, you need to have Docker and docker-compose installed. Additionally, if you want to use a GPU to train the models, you might need further prerequisites depending on your drivers and hardware; if not, you need to remove the corresponding lines from the docker-compose files. Refer to the respective documentation for these. Nothing else is needed.
If you want to run the service standalone, build the images in the root directory. Please note that if the service runs standalone, it will listen on port 8002 by default, as opposed to 8080 and 8081.
docker-compose -f docker-compose_standalone.yml build
You can then run the service via:
docker-compose -f docker-compose_standalone.yml up
Add `-d` to the call to have it run in the background, not bound to the running console.
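Once the containers are up, a quick way to check that the standalone service is reachable is its health endpoint; this sketch assumes the default port 8002 on localhost:

```bash
# Should answer once the web service has started.
curl http://localhost:8002/health
```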
If you want to run the service as a Qanary component, build the images for it in the root directory. The setup in the docker-compose file automatically creates a Qanary instance as well as a Stardog server to interact with.
docker-compose -f docker-compose_qanary-example-local-stardog.yml build
You can then run it via:
docker-compose -f docker-compose_qanary-example-local-stardog.yml up
Add `-d` to the call to have it run in the background, not bound to the running console.
Using the file `docker-compose-full-example.yml` will connect the pipeline automatically to the HTWK Stardog server.
If you already have a Qanary pipeline, you might just want to add the component to it. In this case, you can build and start only the required component with the following command:
docker-compose -f docker-compose_QanaryComponent.yml build automation_component
You can then run it via:
docker-compose -f docker-compose_QanaryComponent.yml up automation_component
Add `-d` to the call to have it run in the background, not bound to the running console.
However, in that case, additional configuration is needed. To connect the service to an existing Qanary pipeline, the following steps must be taken:
- In the highest `.env` file, the following values have to be adjusted:
  - `SPRING_BOOT_ADMIN_URL`
  - `SPRING_BOOT_ADMIN_USERNAME`
  - `SPRING_BOOT_ADMIN_PASSWORD`
- In the same file, the component connection settings have to be adjusted:
  - `SERVICE_HOST`
  - `SERVICE_PORT`
- You can also find the component name and description in this file.
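For illustration, a hypothetical excerpt of such a top-level `.env` file; all values are placeholders to be replaced with those of your own pipeline:

```bash
# Placeholders only: point these at your existing Qanary pipeline
# (Spring Boot Admin) ...
SPRING_BOOT_ADMIN_URL=http://my-qanary-pipeline:8080
SPRING_BOOT_ADMIN_USERNAME=admin
SPRING_BOOT_ADMIN_PASSWORD=admin
# ... and at the host/port the component itself is reachable on.
SERVICE_HOST=http://my-automation-component
SERVICE_PORT=8081
```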
To connect the service with an already existing ML Flow Logger, the following steps must be taken:
- In the `.env` file of the component, the following value has to be adjusted:
  - `MLFLOW_URI`
- In the same file, if SFTP is used, the following values have to be adjusted:
  - `USE_SFTP = True`
  - `MLFLOW_HOST`
  - `MLFLOW_PORT`
- In the highest `.env` file, the ML Flow Logger values are only relevant for the complete system and do not need to be considered for the standalone component.
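Again for illustration, a hypothetical excerpt of the component's `.env` file; hosts and ports are placeholders:

```bash
# Placeholders only: URI of the existing ML Flow server ...
MLFLOW_URI=http://my-mlflow-server:5000
# ... and the SFTP settings, which are only relevant if USE_SFTP is True.
USE_SFTP=True
MLFLOW_HOST=my-mlflow-server
MLFLOW_PORT=22
```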
If your docker-compose version does not support the GPU configuration, the full error message might look like this:
ERROR: The Compose file './docker-compose_QanaryComponent.yml' is invalid because: services.automation_component.deploy.resources.reservations value Additional properties are not allowed ('devices' was unexpected)
Reason: The prepared docker-compose file integrates GPU capabilities. Following the Docker documentation, you need at least docker-compose version v1.28.0 to take advantage of this functionality (check by running the command `docker-compose --version`).
You might install the most recent version using pip:
pip install docker-compose --upgrade
Another full error message might look like this:
ERROR: for automation_component device_requests param is not supported in API versions < 1.40
Reason: the docker-compose version used is too outdated. In building this service, the lowest version used was 2.12.2, which worked fine.
If the error occurs, you might install the newest docker-compose version using your preferred installation method.
On Arch Linux, the call to install / update docker compose would be:
sudo pacman -S docker-compose
For Ubuntu and Debian you can run:
sudo apt-get install docker-compose-plugin
Once a Qanary service is started, you may interact with it through a handful of API endpoints that either provide some way of information extraction from the given data or enable you to retrain (i.e., exchange) the model at runtime.
You can access the Qanary interface using the following webpage:
http://demos.swe.htwk-leipzig.de:40111/startquestionansweringwithtextquestion
It allows you to ask questions and the recognized entities will be saved in the Stardog server. The page also allows you to interact with Stardog.
If you enter a question such as "My name is Annemarie Wittig." with the default model, two annotations will be created, one for the first name and one for the last name. The generated query will be something like this:
```sparql
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX qa: <http://www.wdaqua.eu/qa#>
PREFIX oa: <http://www.w3.org/ns/openannotation/core/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
INSERT {
  GRAPH <urn:graph:6ddac4c3-fbc1-4016-a107-d9126b806b65> {
    ?entityAnnotation0 a qa:AnnotationOfInstance .
    ?entityAnnotation0 oa:hasTarget [
      a oa:SpecificResource;
      oa:hasSource <http://localhost:8080/question/stored-question__text_dc03e843-a2bf-4de0-aec0-280fc8d4adb1> ;
      oa:hasSelector [
        a oa:TextPositionSelector ;
        oa:start "11"^^xsd:nonNegativeInteger ;
        oa:end "20"^^xsd:nonNegativeInteger
      ]
    ] .
    ?entityAnnotation0 oa:hasBody "FIRST_NAME"^^xsd:string ;
      oa:annotatedBy <urn:qanary:AutomationServiceComponent> ;
      oa:annotatedAt ?time ;
      qa:score "0.5"^^xsd:decimal .
    ?entityAnnotation1 a qa:AnnotationOfInstance .
    ?entityAnnotation1 oa:hasTarget [
      a oa:SpecificResource;
      oa:hasSource <http://localhost:8080/question/stored-question__text_dc03e843-a2bf-4de0-aec0-280fc8d4adb1> ;
      oa:hasSelector [
        a oa:TextPositionSelector ;
        oa:start "21"^^xsd:nonNegativeInteger ;
        oa:end "27"^^xsd:nonNegativeInteger
      ]
    ] .
    ?entityAnnotation1 oa:hasBody "MIDDLE_NAME"^^xsd:string ;
      oa:annotatedBy <urn:qanary:AutomationServiceComponent> ;
      oa:annotatedAt ?time ;
      qa:score "0.5"^^xsd:decimal .
  }
}
WHERE {
  BIND (IRI(str(RAND())) AS ?entityAnnotation0) .
  BIND (IRI(str(RAND())) AS ?entityAnnotation1) .
  BIND (now() as ?time)
}
```
Querying data from the Qanary triplestore with a query like the following will return the NER parts of the annotation:
```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX oa: <http://www.w3.org/ns/openannotation/core/>
PREFIX qa: <http://www.wdaqua.eu/qa#>
SELECT *
FROM <urn:graph:6ddac4c3-fbc1-4016-a107-d9126b806b65>
WHERE {
  ?annotationId rdf:type ?type .
  ?annotationId oa:hasBody ?body .
  ?annotationId oa:hasTarget ?target .
  ?target oa:hasSelector ?selector .
  ?selector oa:start ?start .
  ?selector oa:end ?end .
}
```
The result will then contain one row per annotation, with its type, body, target, and the start and end positions of its selector.
Alternatively, you can query the pipeline directly with a curl command such as:
curl --location --request POST 'http://demos.swe.htwk-leipzig.de:40170/questionanswering?textquestion=Who is Barack Obama?&language=en&componentlist%5B%5D=AutomationServiceComponent'
The `/api` endpoint offers two interfaces for interaction.
The GET interface offers the possibility to retrieve the NER results for a single text from your model. This endpoint is meant only for quick result checks and does not allow mlflow logging. You can interact with it using a call like:
curl -X 'GET' 'http://demos.swe.htwk-leipzig.de:40170/api?text={YOUR%20TEXT}'
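For example, a concrete call reusing the sample question from above:

```bash
# Spaces in the text parameter are URL-encoded as %20.
curl -X 'GET' 'http://demos.swe.htwk-leipzig.de:40170/api?text=My%20name%20is%20Annemarie%20Wittig'
```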
Remember to URL-encode spaces as '%20'. The result will be the original text and the recognized entities with their labels and contents:
[
{
"text": "text",
"results": [
{
"Entity-Label1": "value1",
"Entity-Label2": "value2"
}
]
}
]
The POST interface offers NER for multiple input options:

- upload a CSV file,
- upload a JSON file, or
- send raw JSON data within the body of your request.
In all cases, the matching "accept" header must be set within the HTTP request. It defines whether the output is of type `application/json` or `text/csv`. If a different or invalid "accept" header is given, the service will fall back to the "Content-Type" header of the uploaded file or, if no file was uploaded, to the "Content-Type" header of the request. If none of these are valid, the request will fail. Hence, if you encounter problems, check the headers defined in your Web service request.
You can also send the parameter `use_ml_logger` with the value `True` with these requests to activate logging via mlflow. This is recommended while using the component in a real Question Answering system to establish tracking of the component’s behavior (i.e., its quality).
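A minimal sketch of such a call, assuming the parameter is passed as an additional form field alongside the file upload:

```bash
# use_ml_logger=True activates mlflow logging for this request;
# the file path is a placeholder.
curl -X POST -H 'accept: application/json' \
  -F 'file_to_identify=@{YOUR CSV FILE PATH};type=text/csv' \
  -F 'use_ml_logger=True' \
  http://demos.swe.htwk-leipzig.de:40170/api
```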
You can upload a CSV file to the Web service that contains the texts to be run through NER in its first column. Other columns can be added if required; for example, the expected entities could be added to compare expected and actual results. The service will then extend the CSV file with one column for each of its recognizable entities and fill these with the entities found in each row.
The `curl` command would be:
curl -X POST -H 'accept: application/json' -F "file_to_identify=@{YOUR CSV FILE PATH};type=text/csv" http://demos.swe.htwk-leipzig.de:40170/api
The service will answer with the annotated CSV file. Additionally, the response file will be saved locally in the container in the folder `/code/app/spacy_model/intermediate/results/`.
As an example, if you want to upload a file such as:
Text | First_Name | Middle_Name | Last_Name |
---|---|---|---|
People call me Ida Clayton Henderson | Ida | Clayton | Henderson |
I am happy to meet you, too. You can call me Kira. | Kira | | |
You can send the data to Eberhard Rump | Eberhard | | Rump |
Please send all business inquiries to Jessie Edwin Fowler | Jessie | Edwin | Fowler |
Oh, I actually go by Lioba Alexandra. | Lioba | Alexandra | |
with `text/csv` as "accept" header, it would result in something like:
Text | First_Name | Middle_Name | Last_Name | FIRST_NAME | LAST_NAME | MIDDLE_NAME |
---|---|---|---|---|---|---|
People call me Ida Clayton Henderson | Ida | Clayton | Henderson | Ida | Henderson | Clayton |
I am happy to meet you, too. You can call me Kira. | Kira | | | Kira | | |
You can send the data to Eberhard Rump | Eberhard | | Rump | Eberhard | Rump | |
Please send all business inquiries to Jessie Edwin Fowler | Jessie | Edwin | Fowler | Jessie | Fowler | Edwin |
Oh, I actually go by Lioba Alexandra. | Lioba | Alexandra | | Lioba | | Alexandra |
However, with the "accept" header defined as `application/json`, the response of the Web service would be:
[
{
"text": "People call me Ida Clayton Henderson",
"entities": [
{
"First_Name": "Ida",
"Middle_Name": "Clayton",
"Last_Name": "Henderson"
}
],
"results": [
{
"FIRST_NAME": "Ida",
"LAST_NAME": "Henderson",
"MIDDLE_NAME": "Clayton"
}
]
},
{
"text": "I am happy to meet you, too. You can call me Kira.",
"entities": [
{
"First_Name": "Kira",
"Middle_Name": null,
"Last_Name": ""
}
],
"results": [
{
"FIRST_NAME": "Kira",
"LAST_NAME": "",
"MIDDLE_NAME": ""
}
]
},
...
]
Additionally, the endpoint allows applying NER to all texts given in a JSON file, much like the CSV upload. The JSON file must follow this structure:
[
{
"text": "{TEXT TO CLASSIFY}",
"language": "{LANGUAGE}",
"entities": {
"{ENTITY1}": "{VALUE1}",
"{ENTITY2}": "{VALUE2}",
...
}
}
]
However, both the language and the entities tags can be left out if wanted (they default to null). Much like the CSV file upload, NER via JSON file upload allows the freedom to add any additional information, as long as each object has the attribute "text". Hence, the request data for sending two elements might look like:
[
{
"text": "{TEXT TO CLASSIFY}"
},
{
"text": "{TEXT TO CLASSIFY}"
}
]
Example files to upload are the texts.json files found in the ./AutomationService/ExampleBodies/name and ./AutomationService/ExampleBodies/address directories.
A corresponding `curl` call would be:
curl -X POST -H 'accept: application/json' -F "file_to_identify=@{YOUR JSON FILE PATH};type=application/json" http://demos.swe.htwk-leipzig.de:40170/api
The response will be the annotated JSON, but it will also be stored locally in the container in the folder `/code/app/spacy_model/intermediate/results/`. The NER results can be found in the `results` array.
An example response object looks like this:
[
{
"text": "I am called Marilyn Monroe.",
"language": "en",
"entities": [
{
"First_Name": "Marilyn",
"Last_Name": "Monroe"
}
],
"results": [
{
"FIRST_NAME": "Marilyn",
"LAST_NAME": "Monroe"
}
]
}
]
If this was sent with `text/csv` as "accept" header, the result would be:
text | language | entities_First_Name | entities_Last_Name | results_FIRST_NAME | results_LAST_NAME |
---|---|---|---|---|---|
I am called Marilyn Monroe. | en | Marilyn | Monroe | Marilyn | Monroe |
The direct upload works exactly like the JSON file upload, with the difference that the request body is not a file but the JSON data as a string. It has the same structure and response as the JSON file upload, and all additional information can be referenced there. The only difference is the `curl` command, which will look something like this:
curl -X POST -H 'accept: application/json' -H "Content-Type: application/json" -d '{{YOUR JSON}}' http://demos.swe.htwk-leipzig.de:40170/api
Or, as an example of a `curl` call with content:
curl -X 'POST' \
'http://demos.swe.htwk-leipzig.de:40170/api' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '[
{
"text": "I am called Marilyn Monroe.",
"language": "en",
"entities": {
"First_Name": "Marilyn",
"Last_Name": "Monroe"
}
}
]'
Alternatively, the "accept" header can be set to `text/csv`, too.
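For example, the same direct upload requesting CSV output would only change the "accept" header:

```bash
# Identical to the JSON variant except for the accept header.
curl -X POST -H 'accept: text/csv' -H 'Content-Type: application/json' \
  -d '{YOUR JSON}' http://demos.swe.htwk-leipzig.de:40170/api
```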
The retraining endpoint uses the data you provide to train a new NER model which, if all is successful, replaces the original model. All following interactions will then use the new model; the original model will be deleted. "accept" headers are not relevant here, as the only return value is a success message in JSON format.
The retraining will, after formatting the input if needed, go through the data preparation as described in the documentation, save the created intermediate files within the container, and then use the created docbins to train a new model. All of this happens in a folder located in the container at `/code/app/spacy_model/intermediate/`. Once the training concludes successfully, the files are moved into the system and overwrite the existing files, either of the original model or the original intermediate files. Both the (formatted) training and testing data as well as the generated docbins will be saved in the container (until overwritten again). The model used will always be the `model-best` generated by spaCy.
After the training, you can find your files here:

- Training data is saved as `train.csv` in `/code/app/spacy_model/corpus/trainingdata/`
- Testing data is saved as `test.csv` in `/code/app/spacy_model/corpus/trainingdata/`
- The generated docbins are saved as `train.spacy` and `test.spacy` in `/code/app/spacy_model/corpus/spacy-docbins/`
- The model (only the contents of `model-best`) can be found in `/code/app/spacy_model/output/model-best/`

Everything else, such as the other trained model, will be deleted.
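If you want to inspect these artifacts, a sketch using docker exec; the container name depends on your docker-compose setup and may differ:

```bash
# List the retraining artifacts inside the running container; replace
# 'automation_component' with your actual container name.
docker exec automation_component ls /code/app/spacy_model/corpus/trainingdata/
docker exec automation_component ls /code/app/spacy_model/corpus/spacy-docbins/
docker exec automation_component ls /code/app/spacy_model/output/model-best/
```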
Please note that the process of retraining can, depending on your hardware, take some time to finish. The classification APIs can still be used with the original model while the training runs.
You can also send the parameter `use_ml_logger` with the value `True` with these requests to activate logging via mlflow. This is recommended when you use Qanary.
The endpoint allows uploading two CSV files, the `trainingdata` and the `testingdata`, as well as the options in a JSON file. You can name them however you like, as long as the CSV files have exactly the structure required in the Starting Conditions. The options JSON file contains a list of all possible `entities` the NER is supposed to recognize, as well as the model `language` and `modeltype`. None of these are optional; all must be provided. It has the following structure:
{
"entities": ["{ENTITY1}", "{ENTITY2}", ...],
"language": "en",
"modeltype": "spacy"
}
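As a concrete example matching the name-recognition datasets above, such an options file could be created like this:

```bash
# Writes an options.json for the First/Middle/Last name example;
# the entity names must match the column names of your CSV files.
cat > options.json <<'EOF'
{
  "entities": ["First_Name", "Middle_Name", "Last_Name"],
  "language": "en",
  "modeltype": "spacy"
}
EOF
```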
The corresponding `curl` call would be:
curl -X POST -F 'trainingdata=@{YOUR TRAININGDATA CSV};type=text/csv' -F 'testingdata=@{YOUR VALIDATION CSV};type=text/csv' -F 'options=@{YOUR OPTIONS JSON}' http://demos.swe.htwk-leipzig.de:40170/retrain
The endpoint also allows the upload of training files in JSON format. Three files are needed in total. The training data is structured like this:
{
"trainingdata": [
{
"text": "{TRAININGTEXT}",
"language": "{LANGUAGETEXT (not relevant for training and can be ignored, language is set in the model config)}",
"entities": {
"{ENTITY1}": "{VALUE1}",
"{ENTITY2}": "{VALUE2}",
...
}
}
]
}
The data for tests follows the same structure, but inside the file, the initial key is named `testingdata` (instead of `trainingdata`).
For the JSON upload, the same options file as before is needed; please refer to the CSV Upload for details.
Example files for `curl` commands can be found in the ExampleBodies/name and ExampleBodies/address directories.
Warning: Please note that those are minimal examples and will not generate a well-working NER model.
The following `curl` command would start the retraining of the component’s model:
curl -X POST -F 'trainingdata=@{YOUR TRAININGDATA JSON};type=application/json' -F 'testingdata=@{YOUR VALIDATION JSON};type=application/json' -F 'options=@{YOUR OPTIONS JSON};type=application/json' http://demos.swe.htwk-leipzig.de:40170/retrain
The `json/upload-direct` endpoint allows sending the data needed for retraining raw within the body of the request. The data itself is structured as for the JSON file upload, but all put into one object, like the following:
{
"trainingdata": [
{
"text": "{TRAININGTEXT}",
"language": "{LANGUAGETEXT (not relevant for training and can be ignored, language is set in the model config)}",
"entities": {
"{ENTITY1}": "{VALUE1}",
"{ENTITY2}": "{VALUE2}",
...
}
}
],
"testingdata": [
{
"text": "{TRAININGTEXT}",
"language": "{LANGUAGETEXT (not relevant for training and can be ignored, language is set in the model config)}",
"entities": {
"{ENTITY1}": "{VALUE1}",
"{ENTITY2}": "{VALUE2}",
...
}
}
],
"entities": ["{ENTITY1}", "{ENTITY2}", ...],
"language": "en",
"modeltype": "spacy"
}
It is generally not recommended to use this endpoint via `curl`, as the command easily gets chaotic and fairly long, but the general `curl` command would be:
curl -X POST -H "Content-Type: application/json" -d '{YOUR JSON OBJECT}' http://demos.swe.htwk-leipzig.de:40170/retrain
and a working example is:
curl -X 'POST' \
'http://demos.swe.htwk-leipzig.de:40170/retrain' \
-H 'Content-Type: application/json' \
-d '{
"testingdata": [
{
"text": "I am called Marilyn Monroe.",
"language": "en",
"entities": {
"First_Name": "Marilyn",
"Last_Name": "Monroe"
}
}
],
"trainingdata": [
{
"text": "I am called Marilyn Monroe.",
"language": "en",
"entities": {
"First_Name": "Marilyn",
"Last_Name": "Monroe"
}
}
],
"entities": [
"First_Name",
"Middle_Name",
"Last_Name"and this is
}'
To check if the service is active, just call: http://demos.swe.htwk-leipzig.de:40170/health
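For example, via curl:

```bash
# Answers once the service is up and running.
curl http://demos.swe.htwk-leipzig.de:40170/health
```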
You can use ML Flow Logging with this service.
For information on the setup and usage of an ML Flow Server, please refer to its Documentation.
ML Flow Logging is always activated for interactions with the service through the Qanary interface, triggering the NER Logging. It can also be used for interactions with the /retrain (Training Logging) and the /api (NER Logging) endpoints by setting the parameter `MLFLOW_ACTIVATED` to `True`. The parameter is found in the inner `.env` file.
When starting a training process via the `/retrain` endpoint with the `use_ml_logger` parameter set to `True`, the training will be logged once it has concluded. The logs can be found in the `AutoML Model Training` tab.
The logged data contains the following attributes:

- `component_name`: The name of the component that triggered this log
- `component_type`: The type of the component, in this case always NER
- `entities`: The entities this trained model can recognize
- `hardware`: The hardware the model was trained on
- `language`: The language of the model, specified by the user
- `model`: The model that was used. spaCy returns multiple models (the last and the best), but the component always takes "model-best", which performed best.
- `model_uuid`: The UUID assigned to this training run
- `modeltype`: The model type entered with the training options
- `time`: The time needed to conclude the training
Within the "Artifacts", there are some files logged:
-
Datasets
: In this directory, text files are stored that contain the training and testing data given -
config.json
: The configuration used to train the model -
model_metrics.json
: This file is the meta.json of the model, it contains all kinds of information such as the performance while training.
When the training is concluded, the test data is used to trigger the NER process and log the results for each given input. This logging happens within the NER Logging, and the UUID will be the same for the training logs as well as the NER logs.
When a POST request is sent to the `/api` endpoint with the `use_ml_logger` parameter set to `True`, the NER results will be logged for each of the given input texts (found in the `AutoML Model Testing` tab). Files are not logged as a whole; each input line is logged by itself. The logged values are:

- `input`: The given input text
- `model_uuid`: The UUID of this call; it will be the same for all input texts of the same file and, if the process is triggered through the training, it will also match the training run's UUID
- `runtime`: The time needed for the result of this text
Within the "Artifacts", two files are logged:

- `predicted_target.json`: The result of the NER
- `true_target.json`: The expected result, if provided with the input
When a text is entered in the Qanary interface, the created annotations are logged, too (found in the `AutoML Component Annotations` tab). There are no additional parameters to be set, as this logging is mandatory. The logged data is:

- `input`: The given input text
- `model_uuid`: The UUID of this call
- `predicted_target`: The result of the NER, containing the recognized entities and their positions within the input
- `qanary_graph_id`: The graph the annotations were saved to
Please note that the process of logging NER uploads can take some time if bigger datasets are provided.
There are Docker images available that come with pre-trained models for name and address recognition, using a spaCy model, a BERT model, or no base model at all. They can be found in the Qanary Dockerhub, named `qanary/qanary-component-ner-automl-pretrained-{the model you want}`. Note that these are built to be run as part of the Qanary pipeline. For example, you could replace the build call in the docker-compose files with one of these images (a pull example follows the list below):
- The image with a spaCy-based model for name (first, middle, and last name) recognition in GER
- The image with a spaCy-based model for name (first, middle, and last name) recognition in EN
- The image with a BERT-based model for name (first, middle, and last name) recognition in GER
- The image with a BERT-based model for name (first, middle, and last name) recognition in EN
- The image without a base model for name (first, middle, and last name) recognition in GER
- The image without a base model for name (first, middle, and last name) recognition in EN
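For example, one of these images could be pulled like this; the suffix selects the model variant from the list above:

```bash
# Pull a pre-built image from the Qanary Dockerhub; replace the suffix
# with the variant you want.
docker pull qanary/qanary-component-ner-automl-pretrained-{the model you want}
```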
There is one example image for the standalone version of the AutoML service. It works outside of Qanary systems and has the same NER and retrain endpoints, but does not offer any Qanary or MLFlow services.

- The standalone image with a spaCy-based model for address (street, house number, postal code, and city) recognition in GER. It is not updated as frequently as the Qanary images.