NMT: Usage
The tools described in this section are those most commonly used to set up and run an experiment.
The experiment tool runs the preprocess, train, and test tools in succession if none of the individual parts are specified.
```
usage: python -m silnlp.nmt.experiment [-h] [--stats] [--force-align] [--disable-mixed-precision] [--memory-growth]
                                       [--num-devices NUM_DEVICES] [--clearml-queue QUEUE] [--save-checkpoints]
                                       [--preprocess] [--train] [--test] [--translate] [--score-by-book] [--mt-dir DIR] [--debug]
                                       [--commit ID] [--scorers [scorer [scorer ...]]] [--multiple-translations]
                                       experiment
```
Arguments:
Argument | Purpose | Description |
---|---|---|
experiment | Experiment name | The name of the experiment subfolder where the configuration file will be generated. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder. |
--stats | Compute tokenization statistics | Compute tokenization statistics. |
--force-align | Force recalculation of all alignment scores | Only relevant when using the --stats option. |
--disable-mixed-precision | Disable mixed precision | Only use this option if your GPU doesn't support mixed precision. Mixed precision is considerably faster than full precision and has lower memory requirements, allowing you to train larger models; it has a negligible effect on the final model. More... |
--memory-growth | Enable memory growth | With this option, GPU memory is allocated to the model training as required. Without it, all of the available GPU memory is reserved for training from the start. Use this option to train multiple models simultaneously on a single GPU. |
--num-devices NUM_DEVICES | Number of devices to train on | To train a single model on multiple GPUs, use this option to set how many GPUs to use. Ensure that the environment variable CUDA_VISIBLE_DEVICES is also set so that multiple GPUs are visible, e.g., if using --num-devices 2, set CUDA_VISIBLE_DEVICES=0,1. |
--clearml-queue QUEUE | ClearML queue | Run remotely on the specified ClearML queue. Default: None - don't register with ClearML. The queue 'local' will run the job locally and register it with ClearML. |
--save-checkpoints | Save checkpoints to S3 bucket | Save checkpoints to the S3 bucket. |
--preprocess | Run the preprocess step | Run the preprocess step. |
--train | Run the train step | Run the train step. |
--test | Run the test step | Run the test step. |
--translate | Create drafts | See here for more details. |
--score-by-book | Score individual books | In addition to providing an overall score for all the books in the test set, provide individual scores for each book in the test set. |
--mt-dir DIR | The machine translation directory | Use an alternative machine translation directory for the location of the experiment. |
--debug | Show debug information | Show information about the environment variables and arguments. |
--commit ID | Commit ID | The silnlp git commit id with which to run a remote job. |
--scorers [scorer [scorer ...]] | Set scorers | Specifies the list of scorers to be used on the predictions. Default is ['bleu', 'sentencebleu', 'chrf3', 'chrf3++', 'wer', 'ter', 'spbleu']. Additional options are 'chrf3+' and 'meteor'. |
--multiple-translations | Produce multiple drafts | If the translate or test steps are being performed, produce multiple drafts of the input data or test data, respectively. When translating, the system produces multiple output files, one for each draft. When testing, a column is added to the output specifying the draft number (1, 2, etc.). See here for more details. |
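For example (the experiment name `MyProject/exp1` is a placeholder; substitute your own subfolder under SIL_NLP_DATA_PATH > MT > experiments):

```
# Run preprocess, train, and test in succession:
python -m silnlp.nmt.experiment MyProject/exp1

# Re-run only the test step, scoring each book individually:
python -m silnlp.nmt.experiment --test --score-by-book MyProject/exp1
```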
The preprocess tool prepares the various data files needed to train a model. Preprocessing steps include:
- splitting the source and target files into the training, validation, and test data sets;
- writing the train/validate/test data sets to files in the experiment subfolder;
- adapting the tokenizer of the parent model for use by this experiment;
- generating tokenization statistics about the data.
```
usage: python -m silnlp.nmt.preprocess [-h] [--stats] [--force-align] experiment
```
Arguments:
Argument | Purpose | Description |
---|---|---|
experiment | Experiment name | The name of the experiment subfolder where the configuration file will be generated. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder. |
--stats | Compute tokenization statistics | Compute tokenization statistics. |
--force-align | Force recalculation of all alignment scores | Only relevant when using the --stats option. |
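For example, to prepare the data files and compute tokenization statistics for a hypothetical experiment:

```
python -m silnlp.nmt.preprocess --stats MyProject/exp1
```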
The train tool trains a neural model for one or more specified experiments. The experiment's configuration file (config.yml) and the data files created by the preprocess tool are used to control the training process.
```
usage: python -m silnlp.nmt.train [-h] [--disable-mixed-precision] [--memory-growth]
                                  [--num-devices NUM_DEVICES] [--eager-execution]
                                  experiments [experiments ...]
```
Arguments:
Argument | Purpose | Description |
---|---|---|
experiments | Experiment names | The names of the experiments to train. Each experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
--disable-mixed-precision | Disable mixed precision | Only use this option if your GPU doesn't support mixed precision. Mixed precision is considerably faster than full precision and has lower memory requirements, allowing you to train larger models; it has a negligible effect on the final model. More... |
--memory-growth | Enable memory growth | With this option, GPU memory is allocated to the model training as required. Without it, all of the available GPU memory is reserved for training from the start. Use this option to train multiple models simultaneously on a single GPU. |
--num-devices NUM_DEVICES | Number of devices to train on | To train a single model on multiple GPUs, use this option to set how many GPUs to use. Ensure that the environment variable CUDA_VISIBLE_DEVICES is also set so that multiple GPUs are visible, e.g., if using --num-devices 2, set CUDA_VISIBLE_DEVICES=0,1. |
--eager-execution | Enable Tensorflow eager execution | More... |
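For example (experiment names are placeholders):

```
# Train two experiments in succession:
python -m silnlp.nmt.train MyProject/exp1 MyProject/exp2

# Train a single model on two GPUs:
CUDA_VISIBLE_DEVICES=0,1 python -m silnlp.nmt.train --num-devices 2 MyProject/exp1
```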
The test tool tests the neural model for an experiment. If no trained model exists in the experiment folder, the base model will be used.
```
usage: python -m silnlp.nmt.test [-h] [--memory-growth] [--checkpoint CHECKPOINT]
                                 [--last] [--best] [--avg] [--ref-projects [project [project ...]]]
                                 [--force-infer] [--scorers [scorer [scorer ...]]]
                                 [--books BOOKS] [--by-book] [--eager-execution]
                                 experiment
```
Arguments:
Argument | Purpose | Description |
---|---|---|
experiment | Experiment name | The name of the experiment to test. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
--memory-growth | Enable memory growth | With this option, GPU memory is allocated as required. Without it, all of the available GPU memory is reserved from the start. Use this option to run multiple jobs simultaneously on a single GPU. |
--checkpoint CHECKPOINT | Test specified checkpoint | Use the specified checkpoint (e.g., '--checkpoint 6000') to generate target language predictions from the test set. The specified checkpoint must be available in the run subfolder of the specified experiment. |
--last | Test the last checkpoint | Use the last training checkpoint to generate target language predictions. |
--best | Test the best checkpoint | Use the best training checkpoint to generate target language predictions. The best checkpoint must be available in the run > export subfolder of the specified experiment. |
--avg | Test the averaged checkpoint | Use the averaged training checkpoint to generate target language predictions. The averaged checkpoint must be available in the run > avg subfolder of the specified experiment. An averaged checkpoint can be generated automatically during training using the train: average_last_checkpoints: <n> option, or manually after training by using the average_checkpoints tool. |
--ref-projects [project [project ...]] | Reference projects | The generated target language predictions are typically scored using the target language test set as the reference. If multiple reference projects were configured, this option can be used to specify which of these reference projects should be considered when scoring the predictions. |
--force-infer | Force inferencing | If the test tool has already been used to generate and score predictions for an experiment's checkpoint, it will only re-score the existing predictions when it is run again on that same checkpoint. This option forces the tool to re-generate the target language predictions. |
--scorers [scorer [scorer ...]] | Set scorers | Specifies the list of scorers to be used on the predictions. Options are 'bleu' (default), 'sentencebleu', 'chrf3', 'chrf3+', 'chrf3++', 'meteor', 'ter', 'wer', and 'spbleu'. |
--books BOOKS | Books to score | Specifies one or more books/chapters to be scored. When this option is used, the test tool will generate predictions for the entire target language test set, but provide a score only for the specified book(s)/chapter(s). Books must be specified using the 3-character abbreviations from the USFM 3.0 standard (e.g., "GEN" for Genesis) and follow the syntax found here. |
--by-book | Score individual books | In addition to providing an overall score for all the books in the test set, provide individual scores for each book in the test set. If this option is used in combination with the --books option, individual scores are provided for each of the specified books. |
--eager-execution | Enable Tensorflow eager execution | More... |
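For example, two hypothetical invocations (the experiment name is a placeholder):

```
# Score the best checkpoint on the test set:
python -m silnlp.nmt.test --best MyProject/exp1

# Re-generate predictions at checkpoint 6000 and report per-book scores:
python -m silnlp.nmt.test --checkpoint 6000 --force-infer --by-book MyProject/exp1
```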
The translate tool uses a trained neural model to translate text to a new language. Three translation scenarios are supported, with differing command line arguments for each scenario. The supported scenarios are:
- Using a trained model to translate the text in a file from the source language to a target language.
- Using a trained model to translate the text in a sequence of files into a target language.
- Using a trained model to translate a USFM-formatted book in a Paratext project into a target language.
The command line arguments for each of these scenarios are described below.
```
usage: python -m silnlp.nmt.translate [-h] [--memory-growth] [--checkpoint CHECKPOINT]
                                      [--src SRC] [--trg TRG]
                                      [--src-prefix SRC_PREFIX] [--trg-prefix TRG_PREFIX] [--start-seq START_SEQ] [--end-seq END_SEQ]
                                      [--src-project SRC_PROJECT] [--trg-project TRG_PROJECT]
                                      [--books BOOKS] [--src-iso LANG] [--trg-iso LANG]
                                      [--include-inline-elements] [--stylesheet-field-update ACTION] [--multiple-translations]
                                      [--eager-execution] [--clearml-queue QUEUE] [--debug] [--commit ID]
                                      experiment
```
Using the combination of command line arguments described in this section, the translate command will translate the sentences in a text file from the source language to the target language, using the requested checkpoint from a trained model.
Arguments:
Argument | Purpose | Description |
---|---|---|
experiment | Experiment name | The name of the experiment folder with the model to be used for translating the source text. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. The model must be one that supports a single target language (i.e., there is no target language argument for this scenario). |
--memory-growth | Enable memory growth | With this option, GPU memory is allocated as required. Without it, all of the available GPU memory is reserved from the start. Use this option to run multiple jobs simultaneously on a single GPU. |
--eager-execution | Enable Tensorflow's eager execution | More... |
--checkpoint CHECKPOINT | Use specified checkpoint | Use the specified checkpoint to generate the translations. A particular checkpoint number can be specified (e.g., '--checkpoint 6000'), or a logical checkpoint can be specified ('best', 'last', or 'avg'). The requested checkpoint must be available in the run subfolder of the specified experiment. |
--src SRC | Source file | Name of a text file containing the source language sentences to be translated (one sentence per line). The translate tool looks for the file in the current working directory or, if a full/relative path is specified, in the specified folder. Each line in the source file is translated and written to the specified target file. |
--trg TRG | Target file | Name of the text file where the translated sentences will be written (one per line). |
--src-iso LANG | Source language ISO code | The ISO code for the source language. |
--trg-iso LANG | Target language ISO code | The ISO code for the target language. |
--multiple-translations | Produce multiple drafts | Produce a number of drafts equal to num_drafts in config.yml. The way that source and target files are specified does not need to change; instead, a suffix corresponding to the draft number is added to each output file name. For example, if you specified --trg output.txt, files named output.1.txt, output.2.txt, etc. will be created. See here for more details. |
--clearml-queue QUEUE | ClearML queue | Run remotely on the specified ClearML queue. Default: None - don't register with ClearML. The queue 'local' will run the job locally and register it with ClearML. |
--debug | Show debug information | Show information about the environment variables and arguments. |
--commit ID | Commit ID | The silnlp git commit id with which to run a remote job. |
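For example, assuming hypothetical experiment and file names:

```
# Translate sentences.txt line by line using the best checkpoint:
python -m silnlp.nmt.translate --checkpoint best --src sentences.txt --trg translated.txt MyProject/exp1
```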
Using the combination of command line arguments described in this section, the translate command will translate sentences from a sequence of source language text files. The sentences in these source language text files are translated to the target language using the requested checkpoint from a trained model, and written to a corresponding sequence of target language text files.
Arguments:
Argument | Purpose | Description |
---|---|---|
experiment | Experiment name | The name of the experiment folder with the model to be used for translating the source text. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. The model must be one that supports a single target language (i.e., there is no target language argument for this scenario). |
--memory-growth | Enable memory growth | With this option, GPU memory is allocated as required. Without it, all of the available GPU memory is reserved from the start. Use this option to run multiple jobs simultaneously on a single GPU. |
--eager-execution | Enable Tensorflow's eager execution | More... |
--checkpoint CHECKPOINT | Use specified checkpoint | Use the specified checkpoint to generate the translations. A particular checkpoint number can be specified (e.g., '--checkpoint 6000'), or a logical checkpoint can be specified ('best', 'last', or 'avg'). The requested checkpoint must be available in the run subfolder of the specified experiment. |
--src-prefix SRC_PREFIX | Source file prefix (e.g., de-news2019-) | The file name prefix for the source files. The translate tool looks for the sequence of source files in the current working directory. |
--trg-prefix TRG_PREFIX | Target file prefix (e.g., en-news2019-) | The file name prefix for the target files. The translate tool will write the translated text to a series of files with this specified file name prefix; the translated files will be written to the current working directory. |
--start-seq START_SEQ | Starting file sequence number | The first source language file to translate (e.g., '--start-seq 0'). The source files must use a 4-digit, zero-padded numbering sequence ('de-news2019-0000.txt', 'de-news2019-0001.txt', etc.). |
--end-seq END_SEQ | Ending file sequence number | The final source language file sequence number to translate. |
--src-iso LANG | Source language ISO code | The ISO code for the source language. |
--trg-iso LANG | Target language ISO code | The ISO code for the target language. |
--multiple-translations | Produce multiple drafts | Produce a number of drafts equal to num_drafts in config.yml. The way that source and target files are specified does not need to change; instead, a suffix corresponding to the draft number is added to each output file name. For example, if you specified --trg-prefix output_ and --end-seq 2, files named output_0000.1.txt, output_0000.2.txt, output_0001.1.txt, etc. will be created. See here for more details. |
--clearml-queue QUEUE | ClearML queue | Run remotely on the specified ClearML queue. Default: None - don't register with ClearML. The queue 'local' will run the job locally and register it with ClearML. |
--debug | Show debug information | Show information about the environment variables and arguments. |
--commit ID | Commit ID | The silnlp git commit id with which to run a remote job. |
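For example, using the file prefixes from the table above (the experiment name is a placeholder):

```
# Translate de-news2019-0000.txt through de-news2019-0004.txt, writing
# en-news2019-0000.txt, etc. to the current working directory:
python -m silnlp.nmt.translate --src-prefix de-news2019- --trg-prefix en-news2019- \
    --start-seq 0 --end-seq 4 MyProject/exp1
```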
Using the combination of command line arguments described in this section, the translate command will translate a book from a Paratext project into the requested target language. The translated text is written into a USFM-formatted file with markup that closely follows the markup in the source book.
Arguments:
Argument | Purpose | Description |
---|---|---|
experiment | Experiment name | The name of the experiment folder with the model to be used for translating. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
--memory-growth | Enable memory growth | With this option, GPU memory is allocated as required. Without it, all of the available GPU memory is reserved from the start. Use this option to run multiple jobs simultaneously on a single GPU. |
--eager-execution | Enable Tensorflow's eager execution | More... |
--checkpoint CHECKPOINT | Use specified checkpoint | Use the specified checkpoint to generate the translations. A particular checkpoint number can be specified (e.g., '--checkpoint 6000'), or a logical checkpoint can be specified ('best', 'last', or 'avg'). The requested checkpoint must be available in the run subfolder of the specified experiment. |
--src-project SRC_PROJECT | The source project to translate | The name of the source Paratext project. The project name must correspond to a subfolder in the SIL_NLP_DATA_PATH > Paratext > projects folder. |
--trg-project TRG_PROJECT | Target project | The name of the target Paratext project that will fill in missing text for books that are not entirely translated. The project name must correspond to a subfolder in the SIL_NLP_DATA_PATH > Paratext > projects folder. |
--books BOOKS | The books to translate | A list of the books/chapters in the source Paratext project to be translated. Book identifiers should follow the USFM 3.0 standard and the selections should follow the syntax found here. If multiple selections are being made, put the selections in quotes so that the semicolons are not misinterpreted. |
--trg-iso LANG | Target language ISO code | The ISO code for the target language. |
--include-inline-elements | Keep inline elements in USFM files | Keeps inline USFM elements such as footnotes and cross references. The default behavior is to remove these elements before translating. |
--stylesheet-field-update ACTION | Handle USFM style conflicts | What to do with the OccursUnder and TextProperties fields of a project's custom stylesheet. Possible values are 'replace', 'merge' (default), and 'ignore'. |
--multiple-translations | Produce multiple drafts | Produce a number of drafts equal to num_drafts in config.yml. The way that source and target files are specified does not need to change; instead, a suffix corresponding to the draft number is added to each output file name. For example, if you specified --books JOL, then in the target project's run directory, files named 29JOL.1.SFM, 29JOL.2.SFM, etc. will be created. See here for more details. |
--clearml-queue QUEUE | ClearML queue | Run remotely on the specified ClearML queue. Default: None - don't register with ClearML. The queue 'local' will run the job locally and register it with ClearML. |
--debug | Show debug information | Show information about the environment variables and arguments. |
--commit ID | Commit ID | The silnlp git commit id with which to run a remote job. |
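For example, a hypothetical invocation (the experiment and project names are placeholders):

```
# Draft Joel and Jonah from the source Paratext project SRC_PROJ:
python -m silnlp.nmt.translate --checkpoint best --src-project SRC_PROJ --books "JOL;JON" MyProject/exp1
```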
Gets verse counts and computes alignment scores for pairs of biblical texts. Outputs the raw counts/scores and optionally summarizes the information in Excel files.
Configuration information: The script functions the same way as an experiment, in that it operates within an experiment folder and uses a reduced version of an experiment's config.yml file. Only the "data" section of the config file is expected to exist*. Within the data section, it only looks at the "aligner" and "corpus_pairs" fields. Within each corpus pair, it uses the "src", "trg", "mapping", "corpus_books", and "score_threshold" fields. See here for definitions and default values for each field.
*It will also optionally look at the "model" field to check if the model was trained on any data with the same script as the given data.
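As a rough illustration, a reduced config.yml for this script might look like the sketch below. The section and field names are those listed above; the project names and values are placeholders, and eflomal is only an assumed choice of aligner.

```
data:
  aligner: eflomal          # assumed aligner choice; see the config documentation
  corpus_pairs:
    - src: SRC_PROJECT      # placeholder source project name
      trg: TRG_PROJECT      # placeholder target project name
      corpus_books: NT      # optional: restrict which books are compared
      score_threshold: 0.3  # optional: alignment score filter
```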
```
usage: python -m silnlp.nmt.analyze_project_pairs [-h] [--create-summaries] [--recalculate]
                                                  [--deutero] [--clearml-queue QUEUE]
                                                  experiment
```
Arguments:
Argument | Purpose | Description |
---|---|---|
experiment | Experiment folder | The name of the subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder containing the config.yml file and where outputs will be written. |
--create-summaries | Create summary Excel files | Creates two files: a more general file containing verse counts and high-level alignment stats, and another with a more in-depth breakdown of the alignment scores. |
--recalculate | Force recalculation of all verse counts and alignment scores | Verse counts are cached globally, but alignments are always created from scratch the first time a given experiment is run and are stored in the experiment folder. |
--deutero | Include books from the Deuterocanon | A warning message will be printed for each text that has books from the Deuterocanon when this option is not used. |
--clearml-queue QUEUE | ClearML queue | Run remotely on the specified ClearML queue. Default: None - don't register with ClearML. The queue 'local' will run the job locally and register it with ClearML. analyze_project_pairs is a CPU-intensive script that will not benefit from (and in fact will probably be slowed down by) a GPU-only queue. |
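For example (the experiment folder name is a placeholder):

```
python -m silnlp.nmt.analyze_project_pairs --create-summaries MyProject/analysis
```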
Calculates alphabet similarity between text corpora in a multilingual data set.
```
usage: python -m silnlp.nmt.alphabet_similarity [-h] experiment
```
Arguments:
Argument | Purpose | Description |
---|---|---|
experiment | Experiment name | The name of the experiment subfolder where the configuration file will be generated. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder. |
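For example (the experiment name is a placeholder):

```
python -m silnlp.nmt.alphabet_similarity MyProject/exp1
```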
Display a histogram of segment lengths in tokens.
```
usage: python -m silnlp.nmt.segment_length [-h] experiment filename
```
Arguments:
Argument | Purpose | Description |
---|---|---|
experiment | Experiment name | The name of the experiment subfolder where the configuration file will be generated. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder. |
filename | Tokenized file in experiment folder | The name of the tokenized file, located in the experiment folder, whose segment lengths will be plotted. |
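For example (the file name is a placeholder for a tokenized file in the experiment folder):

```
python -m silnlp.nmt.segment_length MyProject/exp1 train.src.txt
```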
Calculate the vocab overlap between two experiments.
```
usage: python -m silnlp.nmt.vocab_overlap [-h] exp1 exp2
```
Arguments:
Argument | Purpose | Description |
---|---|---|
exp1 | Experiment 1 name | The name of the first experiment to compare. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
exp2 | Experiment 2 name | The name of the second experiment to compare. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
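For example (the experiment names are placeholders):

```
python -m silnlp.nmt.vocab_overlap MyProject/exp1 MyProject/exp2
```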
After a model has been trained and used to generate predictions for the test set, the check_train_val_test_split tool can be used to analyze the word distributions across the train, validate, and test sets for the source and target corpora. By default, the tool will generate high-level statistics regarding the occurrence of "unknown" words (i.e., words that occur in the validation set or in the test set, but not in the training set). The tool can also be used to generate detailed listings of these unknown words and their occurrence counts. It is also possible to have the tool compare these unknown words to the valid words found in the training set to identify possible misspellings. Output is saved in the word_count.xlsx file in the specified experiment folder.
```
usage: python -m silnlp.nmt.check_train_val_test_split [-h]
                                                       [--details] [--similar-words]
                                                       [--distance DIST] [--detok-val]
                                                       experiment
```
Arguments:
Argument | Purpose | Description |
---|---|---|
experiment | Experiment name | The name of the experiment to check. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
--details | Show detailed word lists | Generate detailed lists of validation set and test set words that are not found in the training set. Separate lists are generated for the source and target corpora. Occurrence counts are provided for each identified word. |
--similar-words | Find similar words | Compare each unknown word to the valid words found in the training set and identify possible misspellings in the validation and test sets. Levenshtein distance is used to identify the possible misspellings. |
--distance DIST | Maximum Levenshtein distance for word similarity | By default, a Levenshtein distance of 1 is used to identify similar words in the training set. This parameter can be used to specify a different distance. |
--detok-val | Detokenize the target validation set | Detokenize the target validation set. |
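For example, a hypothetical invocation that lists the unknown words and searches for likely misspellings within a Levenshtein distance of 2:

```
python -m silnlp.nmt.check_train_val_test_split --details --similar-words --distance 2 MyProject/exp1
```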
The diff_predictions tool can be used to compare the test set predictions to the reference sentences for an experiment. The tool generates a spreadsheet (diff_predictions.xlsx) with multiple comparison tabs. The comparison includes the test set source text, the target language reference text, the predictions, and the sentence-level BLEU scores for the predictions. Optionally, the tool can mark up each prediction to identify the differences between the reference text and the prediction. The source text can also be marked up to highlight test set words that are not found in the training set. Optionally, the training set source/target sentence pairs can be included in the output spreadsheet on a separate tab.
```
usage: python -m silnlp.nmt.diff_predictions [-h] [--last]
                                             [--show-diffs] [--show-unknown] [--show-dict]
                                             [--include-train] [--include-dict] [--analyze-digits]
                                             [--preserve-case] [--tokenize TOK] [--scorers [scorer [scorer ...]]]
                                             exp1
```
Arguments:
Argument | Purpose | Description |
---|---|---|
exp1 | Experiment name | The name of the experiment to compare. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
--last | Use last result | Use the last result instead of the best one. |
--show-diffs | Show differences (predictions vs. reference) | Mark up the predictions to indicate where they differ from the reference text. |
--show-unknown | Show unknown words in source verse | Mark up the test set source sentences to indicate words that do not occur in the training set. |
--show-dict | Show dictionary words in source verse | Show dictionary words in source verse. |
--include-train | Include the src/trg training corpora in the spreadsheet | Include the parallel source/target training sentence pairs in another tab in the spreadsheet. |
--include-dict | Include the src/trg dictionary in the spreadsheet | Include the src/trg dictionary in the spreadsheet. |
--analyze-digits | Perform digits analysis | Perform digits analysis. |
--preserve-case | Score predictions with case preserved | Preserve case when calculating the sentence-level BLEU score for the source/target sentence pairs. By default, the tool will lower-case the source and target. Note that this behavior is secondary to the source/target case settings specified in the config.yml file; if those settings specify lower casing, then this argument has no effect. |
--tokenize TOK | Sacrebleu tokenizer (none, 13a, intl, zh, ja-mecab, char) | Specifies the Sacrebleu tokenizer that will be used to calculate the sentence-level BLEU score for each source/target sentence pair. (Default: 13a) |
--scorers [scorer [scorer ...]] | List of scorers | Specifies the list of scorers to be used on the predictions. Options are 'bleu' (default), 'sentencebleu', 'chrf3', 'chrf3+', 'chrf3++', 'meteor', 'ter', 'wer', and 'spbleu'. |
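For example, a hypothetical invocation that marks up both the prediction differences and the unknown source words:

```
python -m silnlp.nmt.diff_predictions --show-diffs --show-unknown MyProject/exp1
```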