NMT: Usage

Isaac Schifferer edited this page May 23, 2024 · 23 revisions

Setting up and running an experiment

The tools described in this section are the tools that are most commonly used in setting up and running an experiment.

experiment

The experiment tool runs the preprocess, train, and test tools in succession if none of the individual parts are specified.

usage: python -m silnlp.nmt.experiment [-h] [--stats] [--force-align] [--disable-mixed-precision] [--memory-growth]
[--num-devices NUM_DEVICES] [--clearml-queue QUEUE] [--save-checkpoints]
[--preprocess] [--train] [--test] [--translate] [--score-by-book] [--mt-dir DIR] [--debug]
[--commit ID] [--scorers [scorer [scorer ...]]]
experiment

Arguments:

Argument Purpose Description
experiment Experiment name The name of the experiment subfolder where the configuration file will be generated. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder.
--stats Output corpus statistics Using a statistical model, calculate an alignment score for the source and target texts. Use of this option requires the SIL.Machine library to be available.
--force-align Force recalculation of all alignment scores Only relevant when using the --stats option.
--disable-mixed-precision Disable mixed precision Only use this option if your GPU doesn't support mixed precision. Mixed precision is considerably faster than full precision and has lower memory requirements, allowing you to train larger models. It has a negligible effect on the final model. More...
--memory-growth Enable memory growth With this option GPU memory is allocated to the model training as required. Without this option all the available GPU memory will be reserved for training from the start. Use this option in order to simultaneously train multiple models on a single GPU.
--num-devices NUM_DEVICES Number of devices to train on To train a single model on multiple GPUs, use this option to set how many GPUs to use. Ensure that the environment variable CUDA_VISIBLE_DEVICES is also set so that multiple GPUs are visible. For example, if using --num-devices 2, set CUDA_VISIBLE_DEVICES=0,1.
--clearml-queue QUEUE ClearML queue Run remotely on ClearML queue. Default: None - don't register with ClearML. The queue 'local' will run it locally and register it with ClearML.
--save-checkpoints Save checkpoints to s3 bucket Save checkpoints to s3 bucket.
--preprocess Run the preprocess step Run the preprocess step.
--train Run the train step Run the train step.
--test Run the test step Run the test step.
--translate Create drafts See here for more details.
--score-by-book Score individual books In addition to providing an overall score for all the books in the test set, provide individual scores for each book in the test set.
--mt-dir DIR The machine translation directory Use an alternative machine translation directory for the location of the experiment.
--debug Show debug information Show information about the environment variables and arguments.
--commit ID Commit ID The silnlp git commit id with which to run a remote job.
--scorers [scorer [scorer ...]] Set scorers Specifies the list of scorers to be used on the predictions. Default is ['bleu', 'sentencebleu', 'chrf3', 'chrf3++', 'wer', 'ter', 'spbleu']. Additional options are 'chrf3+' and 'meteor'.
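
As a minimal sketch of how the flags above combine, the helper below assembles an argument list for the experiment tool. The helper name build_experiment_command and the experiment name my_experiment are hypothetical; only the module path and the flags come from the usage text above. The sketch assumes argparse-style parsing, so it places the experiment name before the options to keep a multi-value option such as --scorers from absorbing it.

```python
def build_experiment_command(experiment, steps=(), score_by_book=False, scorers=()):
    """Assemble an argv list for silnlp.nmt.experiment (hypothetical helper)."""
    allowed_steps = ("preprocess", "train", "test", "translate")
    # The positional argument goes first so that --scorers cannot consume it.
    command = ["python", "-m", "silnlp.nmt.experiment", experiment]
    for step in steps:
        if step not in allowed_steps:
            raise ValueError(f"unknown step: {step}")
        command.append(f"--{step}")
    if score_by_book:
        command.append("--score-by-book")
    if scorers:
        command.append("--scorers")
        command.extend(scorers)
    return command


# "my_experiment" is a placeholder experiment name.
command = build_experiment_command(
    "my_experiment", steps=["preprocess", "train"], scorers=["bleu", "chrf3"]
)
print(" ".join(command))
```

Passing the returned list to subprocess.run(command, check=True) would launch the tool. Leaving steps empty corresponds to the default behavior described above, where preprocess, train, and test run in succession.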

preprocess

The preprocess tool prepares the various data files needed to train a model. Preprocessing steps include:

  • splitting the source and target files into the training, validation, and test data sets;
  • writing the train/validate/test data sets to files in the subfolder;
  • adapting the tokenizer of the parent model to be used by this experiment;
  • generating tokenization statistics about the data.

usage: python -m silnlp.nmt.preprocess [-h] [--stats] [--force-align] experiment

Arguments:

Argument Purpose Description
experiment Experiment name The name of the experiment subfolder where the configuration file will be generated. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder.
--stats Output corpus statistics Using a statistical model, calculate an alignment score for the source and target texts. Use of this option requires the SIL.Machine library to be available.
--force-align Force recalculation of all alignment scores Only relevant when using the --stats option.
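
To make the first preprocessing step more concrete, here is a minimal sketch of splitting a parallel corpus into training, validation, and test sets. It is an illustration only: the function name, the seeded random shuffle, and the size parameters are assumptions, and the actual selection logic used by the preprocess tool is controlled by the experiment's configuration and may differ.

```python
import random


def split_parallel_corpus(pairs, val_size, test_size, seed=111):
    """Randomly split (source, target) sentence pairs into train/val/test lists."""
    if val_size + test_size > len(pairs):
        raise ValueError("not enough sentence pairs for the requested split")
    shuffled = list(pairs)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:test_size]
    val = shuffled[test_size:test_size + val_size]
    train = shuffled[test_size + val_size:]
    return train, val, test


# Toy data standing in for aligned source/target sentences.
pairs = [(f"source sentence {i}", f"target sentence {i}") for i in range(10)]
train, val, test = split_parallel_corpus(pairs, val_size=2, test_size=3)
print(len(train), len(val), len(test))
```

Keeping each source sentence paired with its target while shuffling is the important point: the three sets must stay aligned line by line when they are written to files.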

train

The train tool trains a neural model for one or more specified experiments. The experiment's configuration file (config.yml) and the data files created by the preprocess tool are used to control the training process.

usage: python -m silnlp.nmt.train [-h] [--disable-mixed-precision] [--memory-growth]
[--num-devices NUM_DEVICES] [--eager-execution]
experiments [experiments ...]

Arguments:

Argument Purpose Description
experiments Experiment names The names of the experiments to train. Each experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder.
--disable-mixed-precision Disable mixed precision Only use this option if your GPU doesn't support mixed precision. Mixed precision is considerably faster than full precision and has lower memory requirements, allowing you to train larger models. It has a negligible effect on the final model. More...
--memory-growth Enable memory growth With this option GPU memory is allocated to the model training as required. Without this option all the available GPU memory will be reserved for training from the start. Use this option in order to simultaneously train multiple models on a single GPU.
--num-devices NUM_DEVICES Number of devices to train on To train a single model on multiple GPUs, use this option to set how many GPUs to use. Ensure that the environment variable CUDA_VISIBLE_DEVICES is also set so that multiple GPUs are visible. For example, if using --num-devices 2, set CUDA_VISIBLE_DEVICES=0,1.
--eager-execution Enable TensorFlow eager execution More...
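
The pairing of --num-devices with CUDA_VISIBLE_DEVICES can be sketched as follows. The helper names and the experiment name are hypothetical, and the sketch assumes that the GPUs to expose are simply the first N device indices, as in the example given for --num-devices.

```python
import os


def multi_gpu_environment(num_devices, base_env=None):
    """Return a copy of the environment with the first num_devices GPUs visible."""
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    env = dict(os.environ if base_env is None else base_env)
    # For num_devices=2 this yields "0,1".
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in range(num_devices))
    return env


def train_command(experiments, num_devices=1):
    """Assemble an argv list for silnlp.nmt.train (hypothetical helper)."""
    command = ["python", "-m", "silnlp.nmt.train", *experiments]
    if num_devices > 1:
        command += ["--num-devices", str(num_devices)]
    return command


env = multi_gpu_environment(2, base_env={})
print(env["CUDA_VISIBLE_DEVICES"])
print(" ".join(train_command(["my_experiment"], num_devices=2)))
```

Both values would then be passed together, for example subprocess.run(train_command(["my_experiment"], 2), env=multi_gpu_environment(2), check=True).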

test

The test tool tests the neural model for an experiment. If no trained model exists in the experiment folder, the base model will be used.

usage: python -m silnlp.nmt.test [-h] [--memory-growth] [--checkpoint CHECKPOINT]
[--last] [--best] [--avg] [--ref-projects [project [project ...]]]
[--force-infer] [--scorers [scorer [scorer ...]]]
[--books BOOKS] [--by-book] [--eager-execution]
experiment

Arguments:

Argument Purpose Description
experiment Experiment name The name of the experiment to test. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder.
--memory-growth Enable memory growth With this option GPU memory is allocated to the model training as required. Without this option all the available GPU memory will be reserved for training from the start. Use this option in order to simultaneously train multiple models on a single GPU.
--checkpoint CHECKPOINT Test specified checkpoint Use the specified checkpoint (e.g., '--checkpoint 6000') to generate target language predictions from the test set. The specified checkpoint must be available in the run subfolder of the specified experiment.
--last Test the last checkpoint Use the last training checkpoint to generate target language predictions.
--best Test the best checkpoint Use the best training checkpoint to generate target language predictions. The best checkpoint must be available in the run > export subfolder of the specified experiment.
--avg Test the averaged checkpoint Use the averaged training checkpoint to generate target language predictions. The averaged checkpoint must be available in the 'run > avg' subfolder of the specified experiment. An averaged checkpoint can be automatically generated during training using the train: average_last_checkpoints: <n> option, or it can be manually generated after training by using the average_checkpoints tool.
--ref-projects [project [project ...]] Reference projects The generated target language predictions are typically scored using the target language test set as the reference. If multiple reference projects were configured, this option can be used to specify which of these reference projects should be considered when scoring the predictions.
--force-infer Force inferencing If the test tool has already been used to generate and score predictions for an experiment's checkpoint, it will only score the predictions when it is run again on that same checkpoint. This option can be used to force the tool to re-generate the target language predictions.
--scorers [scorer [scorer ...]] Set scorers Specifies the list of scorers to be used on the predictions. Options are 'bleu' (default), 'sentencebleu', 'chrf3', 'chrf3+', 'chrf3++', 'meteor', 'ter', 'wer', and 'spbleu'.
--books BOOKS Books to score Specifies one or more books/chapters to be scored. When this option is used, the test tool will generate predictions for the entire target language test set, but provide a score only for the specified book(s)/chapter(s). Books must be specified using the 3 character abbreviations from the USFM 3.0 standard (e.g., "GEN" for Genesis) and follow the syntax found here.
--by-book Score individual books In addition to providing an overall score for all the books in the test set, provide individual scores for each book in the test set. If this option is used in combination with the --books option, individual scores are provided for each of the specified books.
--eager-execution Enable TensorFlow eager execution More...
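
For readers unfamiliar with the averaged checkpoint used by --avg, the sketch below shows the general idea of checkpoint averaging: each parameter is replaced by its element-wise mean across several saved checkpoints. This is a conceptual illustration with plain Python lists, not the implementation of the average_checkpoints tool; the function name and data layout are assumptions.

```python
def average_checkpoints(checkpoints):
    """Average parameter vectors element-wise across checkpoints.

    Each checkpoint is a dict mapping a parameter name to a list of floats.
    """
    if not checkpoints:
        raise ValueError("at least one checkpoint is required")
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter from every checkpoint.
        columns = zip(*(checkpoint[name] for checkpoint in checkpoints))
        averaged[name] = [sum(values) / len(checkpoints) for values in columns]
    return averaged


# Two toy "checkpoints" with a single parameter vector each.
last_two = [{"weights": [1.0, 2.0]}, {"weights": [3.0, 4.0]}]
print(average_checkpoints(last_two))
```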

translate

The translate tool uses a trained neural model to translate text to a new language. Three translation scenarios are supported, with differing command line arguments for each scenario. The supported scenarios are:

  1. Using a trained model to translate the text in a file from the source language to a target language.
  2. Using a trained model to translate the text in a sequence of files into a target language.
  3. Using a trained model to translate a USFM-formatted book in a Paratext project into a target language.

The command line arguments for each of these scenarios are described below.

usage: python -m silnlp.nmt.translate [-h] [--memory-growth] [--checkpoint CHECKPOINT]
[--src SRC] [--trg TRG]
[--src-prefix SRC_PREFIX] [--trg-prefix TRG_PREFIX] [--start-seq START_SEQ] [--end-seq END_SEQ]
[--src-project SRC_PROJECT] [--trg-project TRG_PROJECT]
[--books BOOKS] [--src-iso LANG] [--trg-iso LANG]
[--include-inline-elements] [--stylesheet-field-update ACTION] [--eager-execution]
[--clearml-queue QUEUE] [--debug] [--commit ID]
experiment

Text file

Using the combination of command line arguments described in this section, the translate command will translate the sentences in a text file from the source language to the target language, using the requested checkpoint from a trained model.

Arguments:

Argument Purpose Description
experiment Experiment name The name of the experiment folder with the model to be used for translating the source text. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. The model must be one that supports a single target language (i.e., there is no target language argument for this scenario).
--memory-growth Enable memory growth With this option GPU memory is allocated to the model training as required. Without this option all the available GPU memory will be reserved for training from the start. Use this option in order to simultaneously train multiple models on a single GPU.
--eager-execution Enable TensorFlow's eager execution More...
--checkpoint CHECKPOINT Test specified checkpoint Use the specified checkpoint to generate target language predictions from the test set. A particular checkpoint number can be specified (e.g., '--checkpoint 6000'), or a logical checkpoint can be specified ('best', 'last', or 'avg'). The requested checkpoint must be available in the run subfolder of the specified experiment.
--src SRC Source file Name of a text file with the source language sentences to be translated (one sentence per line). The translate tool looks for the file in the current working directory or, if a full/relative path is specified, it looks for the file in the specified folder. Each line in the specified source file is translated and written to the specified target file.
--trg TRG Target file Name of the text file where the translated sentences will be written (one per line).
--src-iso LANG Source language ISO code The ISO code for the source language.
--trg-iso LANG Target language ISO code The ISO code for the target language.
--clearml-queue QUEUE ClearML queue Run remotely on ClearML queue. Default: None - don't register with ClearML. The queue 'local' will run it locally and register it with ClearML.
--debug Show debug information Show information about the environment variables and arguments.
--commit ID Commit ID The silnlp git commit id with which to run a remote job.

Sequence of Text Files

Using the combination of command line arguments described in this section, the translate command will translate sentences from a sequence of source language text files. The sentences in these source language text files are translated to the target language using the requested checkpoint from a trained model, and written to a corresponding sequence of target language text files.

Arguments:

Argument Purpose Description
experiment Experiment name The name of the experiment folder with the model to be used for translating the source text. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. The model must be one that supports a single target language (i.e., there is no target language argument for this scenario).
--memory-growth Enable memory growth With this option GPU memory is allocated to the model training as required. Without this option all the available GPU memory will be reserved for training from the start. Use this option in order to simultaneously train multiple models on a single GPU.
--eager-execution Enable TensorFlow's eager execution More...
--checkpoint CHECKPOINT Test specified checkpoint Use the specified checkpoint to generate target language predictions from the test set. A particular checkpoint number can be specified (e.g., '--checkpoint 6000'), or a logical checkpoint can be specified ('best', 'last', or 'avg'). The requested checkpoint must be available in the run subfolder of the specified experiment.
--src-prefix SRC_PREFIX Source file prefix (e.g., de-news2019-) The file name prefix for the source files. The translate tool looks for the sequence of source files in the current working directory.
--trg-prefix TRG_PREFIX Target file prefix (e.g., en-news2019-) The file name prefix for the target files. The translate tool will write the translated text to a series of files with this specified file name prefix; the translated files will be written to the current working directory.
--start-seq START_SEQ Starting file sequence number The first source language file to translate (e.g., '--start-seq 0'). The source files must use a 4 digit, zero-padded numbering sequence ('de-news2019-0000.txt', 'de-news2019-0001.txt', etc.).
--end-seq END_SEQ Ending file sequence number The final source language file sequence number to translate.
--src-iso LANG Source language ISO code The ISO code for the source language.
--trg-iso LANG Target language ISO code The ISO code for the target language.
--clearml-queue QUEUE ClearML queue Run remotely on ClearML queue. Default: None - don't register with ClearML. The queue 'local' will run it locally and register it with ClearML.
--debug Show debug information Show information about the environment variables and arguments.
--commit ID Commit ID The silnlp git commit id with which to run a remote job.
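
The file naming convention for a sequence of text files can be sketched as below. The helper name is hypothetical, and the sketch assumes that --end-seq is inclusive and that the files use a .txt extension, as the examples for --start-seq suggest.

```python
def sequence_file_names(prefix, start_seq, end_seq, extension=".txt"):
    """List file names using a 4 digit, zero-padded sequence number."""
    if start_seq < 0 or end_seq < start_seq:
        raise ValueError("expected 0 <= start_seq <= end_seq")
    # Assumption: end_seq is the last file translated, so it is included.
    return [f"{prefix}{seq:04d}{extension}" for seq in range(start_seq, end_seq + 1)]


# Source files expected in the current working directory.
print(sequence_file_names("de-news2019-", 0, 2))
# Target files that would be written alongside them.
print(sequence_file_names("en-news2019-", 0, 2))
```

A listing like this can be used to confirm that every expected source file exists in the current working directory before starting a long translation run.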

Paratext book (USFM file)

Using the combination of command line arguments described in this section, the translate command will translate a book from a Paratext project into the requested target language. The translated text is written into a USFM-formatted file with markup that closely follows the markup in the source book.

Arguments:

Argument Purpose Description
experiment Experiment name The name of the experiment folder with the model to be used for translating the source text. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder.
--memory-growth Enable memory growth With this option GPU memory is allocated to the model training as required. Without this option all the available GPU memory will be reserved for training from the start. Use this option in order to simultaneously train multiple models on a single GPU.
--eager-execution Enable TensorFlow's eager execution More...
--checkpoint CHECKPOINT Test specified checkpoint Use the specified checkpoint to generate target language predictions from the test set. A particular checkpoint number can be specified (e.g., '--checkpoint 6000'), or a logical checkpoint can be specified ('best', 'last', or 'avg'). The requested checkpoint must be available in the run subfolder of the specified experiment.
--src-project SRC_PROJECT The source project to translate The name of the source Paratext project. The project name must correspond to a subfolder in the SIL_NLP_DATA_PATH > Paratext > projects folder.
--trg-project TRG_PROJECT Target project The name of the target Paratext project that will fill in missing text for books that are not entirely translated. The project name must correspond to a subfolder in the SIL_NLP_DATA_PATH > Paratext > projects folder.
--books BOOKS The books to translate A list of the books/chapters in the source Paratext project to be translated. Book identifiers should follow the USFM 3.0 standard and the selections should follow the syntax found here. If multiple selections are being made, put the selections in quotes so that the semicolons are not misinterpreted.
--trg-iso LANG Target language ISO code The ISO code for the target language.
--include-inline-elements Keep inline elements in USFM files Keeps inline USFM elements such as footnotes and cross references. Default behavior is to remove these elements before translating.
--stylesheet-field-update ACTION Handle USFM style conflicts What to do with the OccursUnder and TextProperties fields of a project's custom stylesheet. Possible values are 'replace', 'merge' (default), and 'ignore'.
--clearml-queue QUEUE ClearML queue Run remotely on ClearML queue. Default: None - don't register with ClearML. The queue 'local' will run it locally and register it with ClearML.
--debug Show debug information Show information about the environment variables and arguments.
--commit ID Commit ID The silnlp git commit id with which to run a remote job.
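
The advice to quote multiple book selections can be illustrated with Python's shlex module, which adds shell quoting where it is needed. The helper name, the experiment name, the project name, and the selection "GEN;EXO" are placeholders chosen for illustration; the full selection syntax is described at the link referenced for --books.

```python
import shlex


def paratext_translate_command(experiment, src_project, trg_iso, books, checkpoint="last"):
    """Build a shell-safe command string for translating Paratext books."""
    argv = [
        "python", "-m", "silnlp.nmt.translate", experiment,
        "--checkpoint", checkpoint,
        "--src-project", src_project,
        "--trg-iso", trg_iso,
        "--books", books,
    ]
    # shlex.join quotes any argument containing shell metacharacters such as ';'.
    return shlex.join(argv)


# All names below are placeholders.
print(paratext_translate_command("my_experiment", "SRC_PROJECT", "en", "GEN;EXO"))
```

Without the quotes a shell would treat the semicolon as a command separator and pass only the first selection to the tool.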

Analyzing experiment metadata

alphabet_similarity

Calculates alphabet similarity between text corpora in a multilingual data set.

usage: python -m silnlp.nmt.alphabet_similarity [-h] experiment

Arguments:

Argument Purpose Description
experiment Experiment name The name of the experiment subfolder where the configuration file will be generated. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder.
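
As a rough illustration of what an alphabet similarity score can capture, the sketch below compares the sets of letters used in two texts with a Jaccard ratio. The metric actually computed by the alphabet_similarity tool is not described here, so the Jaccard ratio and the function names are assumptions used only as a stand-in.

```python
def alphabet(text):
    """Return the set of distinct letters in a text, ignoring case."""
    return {char.lower() for char in text if char.isalpha()}


def alphabet_jaccard(text_a, text_b):
    """Share of letters common to both texts out of all letters used by either."""
    letters_a, letters_b = alphabet(text_a), alphabet(text_b)
    if not letters_a and not letters_b:
        return 1.0
    return len(letters_a & letters_b) / len(letters_a | letters_b)


# Two of the four letters used overall are shared.
print(alphabet_jaccard("abc", "bcd"))
# Case differences are ignored.
print(alphabet_jaccard("abc", "ABC"))
```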

segment_length

Display a histogram of segment lengths in tokens.

usage: python -m silnlp.nmt.segment_length [-h] experiment filename

Arguments:

Argument Purpose Description
experiment Experiment name The name of the experiment subfolder where the configuration file will be generated. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder.
filename Tokenized file in experiment folder Tokenized file in experiment folder.
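
A histogram of segment lengths can be sketched with the standard library alone. The sketch assumes that the tokenized file holds one segment per line with tokens separated by whitespace; the function names and the bin width are illustrative and do not reflect the tool's own output format.

```python
from collections import Counter


def segment_length_histogram(lines, bin_width=10):
    """Count segments per length bin, where length is the number of tokens."""
    counts = Counter()
    for line in lines:
        length = len(line.split())
        counts[(length // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


def render_histogram(histogram, bin_width=10):
    """Render one text row per bin, with one '#' per segment."""
    rows = []
    for start, count in histogram.items():
        rows.append(f"{start}-{start + bin_width - 1}: {'#' * count}")
    return "\n".join(rows)


# Toy tokenized segments with 3, 12, and 1 tokens.
segments = ["a b c", "tok " * 12, "x"]
print(render_histogram(segment_length_histogram(segments)))
```

With real data the lines would come from the tokenized file, for example open(path, encoding="utf-8").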

vocab_overlap

Calculate the vocab overlap between two experiments.

usage: python -m silnlp.nmt.vocab_overlap [-h] exp1 exp2

Arguments:

Argument Purpose Description
exp1 Experiment 1 name The name of the first experiment to compare. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder.
exp2 Experiment 2 name The name of the second experiment to compare. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder.
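
Vocabulary overlap between two experiments can be thought of as a set comparison. The sketch below is an illustration under that assumption; the function name, the returned fields, and the toy vocabularies are hypothetical, and the statistics reported by the vocab_overlap tool may differ.

```python
def vocab_overlap(vocab_a, vocab_b):
    """Summarize how many vocabulary entries two experiments share."""
    set_a, set_b = set(vocab_a), set(vocab_b)
    shared = set_a & set_b
    return {
        "shared": len(shared),
        # Fraction of each vocabulary that also appears in the other one.
        "fraction_of_a": len(shared) / len(set_a) if set_a else 0.0,
        "fraction_of_b": len(shared) / len(set_b) if set_b else 0.0,
    }


# Toy vocabularies standing in for the token lists of two experiments.
print(vocab_overlap(["the", "cat", "s"], ["the", "dog", "s", "ing"]))
```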

Analyzing the results of an experiment

check_train_val_test_split

After a model has been trained and used to generate predictions for the test set, the check_train_val_test_split tool can be used to analyze the word distributions across the train, validate, and test sets for the source and target corpora. By default, the tool will generate high-level statistics regarding the occurrence of "unknown" words (i.e., words that occur in the validation set or in the test set, but not in the training set). The tool can also be used to generate detailed listings of these unknown words and their occurrence counts. It is also possible to have the tool compare these unknown words to the valid words found in the training set to identify possible misspellings. Output is saved in the word_count.xlsx file in the specified experiment folder.

usage: python -m silnlp.nmt.check_train_val_test_split [-h]
[--details] [--similar-words]
[--distance DIST] [--detok-val]
experiment

Arguments:

Argument Purpose Description
experiment Experiment name The name of the experiment to check. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder.
--details Show detailed word lists Generate detailed lists of validation set and test set words that are not found in the training set. Separate lists are generated for the source and target corpora. Occurrence counts are provided for each identified word.
--similar-words Find similar words Compare each unknown word to the valid words found in the training set and identify possible misspellings in the validation and test sets. Levenshtein distance is used to identify the possible misspellings.
--distance DIST Maximum Levenshtein distance for word similarity By default, a Levenshtein distance of 1 is used to identify similar words in the training set. This parameter can be used to specify a different distance.
--detok-val Detokenize the target validation set Detokenize the target validation set.
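
The unknown-word and similar-word analysis described above can be sketched as follows. This is a simplified illustration, not the tool's implementation: the function names are hypothetical, words are split on whitespace, and the Levenshtein distance is computed with a textbook dynamic-programming routine.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def unknown_words(train_sentences, other_sentences):
    """Count words in the validation or test set that never occur in training."""
    train_vocab = {word for sentence in train_sentences for word in sentence.split()}
    counts = Counter(word for sentence in other_sentences for word in sentence.split())
    return {word: count for word, count in counts.items() if word not in train_vocab}


def similar_words(unknown, train_vocab, max_distance=1):
    """Map each unknown word to training words within max_distance edits."""
    return {
        word: sorted(t for t in train_vocab if levenshtein(word, t) <= max_distance)
        for word in unknown
    }


train = ["the lord is good", "he went home"]
test = ["the lorde went hoem", "lorde"]
unknown = unknown_words(train, test)
vocab = {word for sentence in train for word in sentence.split()}
print(unknown)
print(similar_words(unknown, vocab, max_distance=1))
```

In this toy data "lorde" is flagged as a likely misspelling of "lord", while "hoem" is two edits away from "home" and is only matched when the maximum distance is raised, which mirrors the role of the --distance option.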

diff_predictions

The diff_predictions tool can be used to compare the test set predictions to the reference sentences for an experiment. The tool generates a spreadsheet (diff_predictions.xlsx) with multiple comparison tabs. The comparison includes the test set source text, the target language reference text, the predictions, and the sentence-level BLEU scores for the predictions. Optionally, the tool can mark up each prediction to identify the differences between the reference text and the prediction. The source text can also be marked up to highlight test set words that are not found in the training set. Optionally, the training set source/target sentence pairs can be included in the output spreadsheet on a separate tab.

usage: python -m silnlp.nmt.diff_predictions [-h] [--last]
[--show-diffs] [--show-unknown] [--show-dict]
[--include-train] [--include-dict] [--analyze-digits]
[--preserve-case] [--tokenize TOK] [--scorers [scorer [scorer ...]]]
exp1

Arguments:

Argument Purpose Description
exp1 Experiment name The name of the experiment to compare. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder.
--last Use last result Use last result instead of best one.
--show-diffs Show differences (predictions vs reference) Mark up the predictions to indicate where they differ from the reference text.
--show-unknown Show unknown words in source verse Mark up the test set source sentences to indicate words that do not occur in the training set.
--show-dict Show dictionary words in source verse Show dictionary words in source verse.
--include-train Include the src/trg training corpora in the spreadsheet Include the parallel source/target training sentence pairs in another tab in the spreadsheet.
--include-dict Include the src/trg dictionary in the spreadsheet Include the src/trg dictionary in the spreadsheet.
--analyze-digits Perform digits analysis Perform digits analysis.
--preserve-case Score predictions with case preserved Preserve case when calculating the sentence-level BLEU score for the source/target sentence pairs. By default, the tool will lower case the source and target. Note that this behavior is secondary to the source / target case settings specified in the config.yml file; if those settings specified lower casing, then this argument has no effect.
--tokenize TOKENIZE Sacrebleu tokenizer (none,13a,intl,zh,ja-mecab,char) Specifies the Sacrebleu tokenizer that will be used to calculate the sentence-level BLEU score for each source/target sentence pair. (Default: 13a)
--scorers [scorer [scorer ...]] List of scorers Specifies the list of scorers to be used on the predictions. Options are 'bleu' (default), 'sentencebleu', 'chrf3', 'chrf3+', 'chrf3++', 'meteor', 'ter', 'wer', and 'spbleu'.
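
The kind of markup produced by --show-diffs can be approximated with Python's difflib module. The sketch below is an illustration only: the function name and the bracket notation are assumptions, and the spreadsheet generated by diff_predictions uses its own formatting.

```python
import difflib


def mark_differences(reference, prediction):
    """Mark words dropped from the reference and words added by the prediction."""
    ref_tokens = reference.split()
    pred_tokens = prediction.split()
    matcher = difflib.SequenceMatcher(a=ref_tokens, b=pred_tokens, autojunk=False)
    marked = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            marked.extend(pred_tokens[j1:j2])
            continue
        if i2 > i1:
            # Reference words that are missing from the prediction.
            marked.append("[-" + " ".join(ref_tokens[i1:i2]) + "-]")
        if j2 > j1:
            # Prediction words that are absent from the reference.
            marked.append("{+" + " ".join(pred_tokens[j1:j2]) + "+}")
    return " ".join(marked)


print(mark_differences("in the beginning God created", "in the beginning God made"))
```

Comparing whole words rather than characters keeps the markup readable for sentence-length text.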