NMT: Usage

Setting up and running an experiment

This section describes the tools most commonly used to set up and run an experiment.

config

The config tool sets up a simple configuration file (config.yml) for an experiment. The configuration settings are specified on the command line, and the tool generates a valid config.yml file with those settings in the specified experiment subfolder (SIL_NLP_DATA_PATH > MT > experiments > <experiment>).
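As a rough illustration of where the generated file ends up, the sketch below builds the expected path to config.yml. It assumes that SIL_NLP_DATA_PATH is available as an environment variable; the helper name `experiment_config_path` and the experiment name `my-experiment` are hypothetical and are not part of the tool.

```python
import os
from pathlib import Path
from typing import Optional


def experiment_config_path(experiment: str, data_root: Optional[str] = None) -> Path:
    """Return the expected location of config.yml for an experiment."""
    # Assumption: the data root comes from the SIL_NLP_DATA_PATH environment
    # variable unless it is passed in explicitly.
    root = Path(data_root or os.environ.get("SIL_NLP_DATA_PATH", "."))
    return root / "MT" / "experiments" / experiment / "config.yml"


# Hypothetical experiment name, used only for illustration.
print(experiment_config_path("my-experiment", data_root="/data"))
```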

usage: config.py [-h] [--src-langs [lang [lang ...]]]
[--trg-langs [lang [lang ...]]] [--vocab-size VOCAB_SIZE]
[--src-vocab-size SRC_VOCAB_SIZE]
[--trg-vocab-size TRG_VOCAB_SIZE] [--parent PARENT]
[--mirror] [--force] [--seed SEED] [--model MODEL]
experiment
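
To show how the arguments in the usage line combine, here is a minimal `argparse` sketch that accepts the same options. It is a reconstruction from the usage text only, not the actual source of config.py; the argument types (for example, an integer seed) and the file and experiment names are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction of the documented interface,
    # not the real config.py implementation.
    parser = argparse.ArgumentParser(prog="config.py")
    parser.add_argument("experiment", help="Experiment name")
    parser.add_argument("--src-langs", nargs="*", metavar="lang", default=[])
    parser.add_argument("--trg-langs", nargs="*", metavar="lang", default=[])
    parser.add_argument("--vocab-size", type=int)
    parser.add_argument("--src-vocab-size", type=int)
    parser.add_argument("--trg-vocab-size", type=int)
    parser.add_argument("--parent")
    parser.add_argument("--mirror", action="store_true")
    parser.add_argument("--force", action="store_true")
    parser.add_argument("--seed", type=int)  # assumed to be an integer
    parser.add_argument("--model", default="TransformerBase")
    return parser


# Equivalent to the command line:
#   config.py --src-langs abp-ABP --trg-langs en-ABPBTE --vocab-size 32000 --mirror my-experiment
args = build_parser().parse_args(
    ["--src-langs", "abp-ABP", "--trg-langs", "en-ABPBTE",
     "--vocab-size", "32000", "--mirror", "my-experiment"]
)
print(args.experiment, args.src_langs, args.trg_langs, args.vocab_size, args.mirror)
```

Because `--src-langs` and `--trg-langs` each accept several values, placing the experiment name last (or after a flag such as `--mirror`) keeps it from being read as another language file.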

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| `experiment` | Experiment name | The name of the experiment subfolder where the configuration file will be generated. The subfolder must be located in the `SIL_NLP_DATA_PATH > MT > experiments` folder. |
| `--src-langs [lang [lang ...]]` | Source language files | The names of one or more files in the source language(s). Each file must be located in the `SIL_NLP_DATA_PATH > MT > corpora` folder or the `SIL_NLP_DATA_PATH > MT > scripture` folder. Only the base of the file name is specified; e.g., to use the file `abp-ABP.txt`, specify `abp-ABP`. |
| `--trg-langs [lang [lang ...]]` | Target language files | The names of one or more files in the target language(s). Each file must be located in the `SIL_NLP_DATA_PATH > MT > corpora` folder or the `SIL_NLP_DATA_PATH > MT > scripture` folder. Only the base of the file name is specified; e.g., to use the file `en-ABPBTE.txt`, specify `en-ABPBTE`. |
| `--vocab-size VOCAB_SIZE` | Shared vocabulary size | Specifies the size (e.g., `32000`) of the shared SentencePiece vocabulary that will be constructed from the text in the source and target files. |
| `--src-vocab-size SRC_VOCAB_SIZE` | Source vocabulary size | Specifies the size (e.g., `32000`) of a SentencePiece vocabulary that will be constructed from the text in the source files only. This option should be used in combination with the `--trg-vocab-size` argument. |
| `--trg-vocab-size TRG_VOCAB_SIZE` | Target vocabulary size | Specifies the size (e.g., `32000`) of a SentencePiece vocabulary that will be constructed from the text in the target files only. This option should be used in combination with the `--src-vocab-size` argument. |
| `--parent PARENT` | Parent experiment name | The name of an experiment subfolder with a trained parent model. The subfolder must be located in the `SIL_NLP_DATA_PATH > MT > experiments` folder. |
| `--mirror` | Mirror train and validation data sets (default: False) | Specifies that the training and validation data sets constructed from the source and target files should be mirrored. With mirroring, each source/target sentence pair is added to the training (or validation) data set both as a source/target pair and as a target/source pair. Without mirroring, each sentence pair is added only as a source/target pair. |
| `--force` | Overwrite existing config file | By default, the tool reports an error if a configuration file already exists in the specified experiment subfolder. With this argument, the tool overwrites the existing configuration file instead. |
| `--seed SEED` | Randomization seed | Specifies the randomization seed that will be used during preprocessing and training. |
| `--model MODEL` | Neural network model | Specifies the neural network model that will be trained. Options: `TransformerBase` (default), `TransformerBig`, `SILTransformerBaseNoResidual`, or `SILTransformerBaseAlignmentEnhanced`. |
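
The effect of `--mirror` can be illustrated with a small sketch. This is only a model of the behavior described for that argument, not the tool's own code; the function name `mirror_pairs` and the sample sentences are hypothetical.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def mirror_pairs(pairs: List[Pair]) -> List[Pair]:
    """Add each (source, target) pair in both directions."""
    mirrored: List[Pair] = []
    for src, trg in pairs:
        mirrored.append((src, trg))  # original source/target direction
        mirrored.append((trg, src))  # mirrored target/source direction
    return mirrored


# Hypothetical sentence pairs, used only for illustration.
pairs = [("source sentence 1", "target sentence 1"),
         ("source sentence 2", "target sentence 2")]
for src, trg in mirror_pairs(pairs):
    print(src, "->", trg)
```

With mirroring, the data set doubles in size, since every pair appears once in each direction.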

preprocess

The preprocess tool prepares the various data files needed to train a model. Preprocessing steps include:

- creating SentencePiece vocabulary models from the experiment's source and target files;
- splitting the source and target files into the training, validation, and test data sets;
- writing the training, validation, and test data sets to files in the experiment subfolder;
- adapting the parent model (if one is specified) to be used by this experiment.
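
The splitting step listed above can be sketched as follows. This is a simplified illustration under stated assumptions, not the preprocess tool's actual logic: the function name `split_pairs`, the `val_size` and `test_size` parameters, and the shuffle-then-slice strategy are all hypothetical. It does show why a fixed seed makes the split reproducible.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def split_pairs(
    pairs: Sequence[Pair], val_size: int, test_size: int, seed: int = 0
) -> Tuple[List[Pair], List[Pair], List[Pair]]:
    """Shuffle sentence pairs with a fixed seed and split them three ways."""
    if val_size + test_size > len(pairs):
        raise ValueError("validation and test sizes exceed the corpus size")
    shuffled = list(pairs)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:test_size]
    val = shuffled[test_size:test_size + val_size]
    train = shuffled[test_size + val_size:]
    return train, val, test


# Hypothetical ten-sentence corpus, used only for illustration.
corpus = [(f"src {i}", f"trg {i}") for i in range(10)]
train, val, test = split_pairs(corpus, val_size=2, test_size=3, seed=1)
print(len(train), len(val), len(test))
```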

usage: preprocess.py [-h] [--stats] experiment

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| `experiment` | Experiment name | The name of the experiment subfolder that contains the configuration file. The subfolder must be located in the `SIL_NLP_DATA_PATH > MT > experiments` folder. |
| `--stats` | Output corpus statistics | Calculates an alignment score for the source and target texts using a statistical model. This option requires the SIL.Machine library to be available. |
