NMT: Usage

Michael A. Martin edited this page Jun 11, 2021 · 23 revisions

Setting up and running an experiment

The tools described in this section are the ones most commonly used to set up and run an experiment.

config

The config tool sets up a simple configuration file (config.yml) for an experiment. Settings are specified on the command line, and the tool generates a valid config.yml file with those settings in the specified experiment subfolder (SIL_NLP_DATA_PATH > MT > experiments > <experiment>).
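The folder notation above maps onto an ordinary filesystem path. The following is a minimal sketch, assuming SIL_NLP_DATA_PATH is an environment variable that names the data root; the helper name `experiment_config_path` is hypothetical and is not part of the tools.

```python
import os
from pathlib import Path
from typing import Optional


def experiment_config_path(experiment: str, data_root: Optional[str] = None) -> Path:
    """Return <data root>/MT/experiments/<experiment>/config.yml."""
    # Assumption: the data root comes from the SIL_NLP_DATA_PATH environment
    # variable unless the caller passes one explicitly.
    root = Path(data_root if data_root is not None else os.environ["SIL_NLP_DATA_PATH"])
    return root / "MT" / "experiments" / experiment / "config.yml"


if __name__ == "__main__":
    # "/data" and "demo" are placeholder values for illustration only.
    print(experiment_config_path("demo", data_root="/data"))
```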

usage: config.py [-h] [--src-langs [lang [lang ...]]]
[--trg-langs [lang [lang ...]]] [--vocab-size VOCAB_SIZE]
[--src-vocab-size SRC_VOCAB_SIZE]
[--trg-vocab-size TRG_VOCAB_SIZE] [--parent PARENT]
[--mirror] [--force] [--seed SEED] [--model MODEL]
experiment
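The usage message above has the shape that Python's argparse module prints. As an illustration only, a parser with the same command-line signature could be declared as follows; the types, defaults, and the `build_parser` name are assumptions and may differ from the actual config.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of a parser that accepts the arguments listed in the usage message."""
    parser = argparse.ArgumentParser(prog="config.py")
    parser.add_argument("experiment", help="Experiment name")
    # nargs="*" accepts zero or more file base names after the flag.
    parser.add_argument("--src-langs", nargs="*", metavar="lang", default=[])
    parser.add_argument("--trg-langs", nargs="*", metavar="lang", default=[])
    # Assumption: vocabulary sizes and the seed are parsed as integers.
    parser.add_argument("--vocab-size", type=int)
    parser.add_argument("--src-vocab-size", type=int)
    parser.add_argument("--trg-vocab-size", type=int)
    parser.add_argument("--parent")
    parser.add_argument("--mirror", action="store_true")
    parser.add_argument("--force", action="store_true")
    parser.add_argument("--seed", type=int)
    parser.add_argument(
        "--model",
        default="TransformerBase",
        choices=[
            "TransformerBase",
            "TransformerBig",
            "SILTransformerBaseNoResidual",
            "SILTransformerBaseAlignmentEnhanced",
        ],
    )
    return parser


if __name__ == "__main__":
    # Example values taken from the argument descriptions; "demo" is a placeholder.
    args = build_parser().parse_args(
        ["--src-langs", "abp-ABP", "--trg-langs", "en-ABPBTE", "--vocab-size", "32000", "demo"]
    )
    print(args)
```

Because the language options take a variable number of values, the experiment name in this sketch should not directly follow --src-langs or --trg-langs, or it would be read as another file name.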

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| experiment | Experiment name | The name of the experiment subfolder where the configuration file will be generated. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| --src-langs [lang [lang ...]] | Source language files | The names of one or more files in the source language(s). Each file must be located in the SIL_NLP_DATA_PATH > MT > corpora folder or the SIL_NLP_DATA_PATH > MT > scripture folder. Only the base of the file name is specified; e.g., to use the file 'abp-ABP.txt', specify 'abp-ABP'. |
| --trg-langs [lang [lang ...]] | Target language files | The names of one or more files in the target language(s). Each file must be located in the SIL_NLP_DATA_PATH > MT > corpora folder or the SIL_NLP_DATA_PATH > MT > scripture folder. Only the base of the file name is specified; e.g., to use the file 'en-ABPBTE.txt', specify 'en-ABPBTE'. |
| --vocab-size VOCAB_SIZE | Shared vocabulary size | Specifies the size (e.g., '32000') of the shared SentencePiece vocabulary that will be constructed from the text in the source and target files. |
| --src-vocab-size SRC_VOCAB_SIZE | Source vocabulary size | Specifies the size (e.g., '32000') of a SentencePiece vocabulary that will be constructed from the text in the source files only. This option should be used in combination with the --trg-vocab-size argument. |
| --trg-vocab-size TRG_VOCAB_SIZE | Target vocabulary size | Specifies the size (e.g., '32000') of a SentencePiece vocabulary that will be constructed from the text in the target files only. This option should be used in combination with the --src-vocab-size argument. |
| --parent PARENT | Parent experiment name | The name of an experiment subfolder with a trained parent model. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| --mirror | Mirror train and validation data sets (default: False) | Specifies that the training and validation data sets constructed from the source and target files should be mirrored. With mirroring, each source/target sentence pair is added to the training (or validation) data set both as a source/target pair and as a target/source pair. Without mirroring, each sentence pair is added only as a source/target pair. |
| --force | Overwrite existing config file | By default, the tool reports an error if a configuration file already exists in the specified experiment subfolder. If this argument is provided, the tool overwrites the existing configuration file instead. |
| --seed SEED | Randomization seed | Specifies the randomization seed that will be used during preprocessing and training. |
| --model MODEL | Neural network model | Specifies the neural network model that will be trained. Options: TransformerBase (default), TransformerBig, SILTransformerBaseNoResidual, or SILTransformerBaseAlignmentEnhanced. |
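The effect of the --mirror option on a data set can be illustrated with a few lines of Python. This is a sketch of the behavior as described, not the tool's implementation; the function name `mirror_pairs` and the ordering of the output pairs are assumptions.

```python
from typing import List, Tuple

SentencePair = Tuple[str, str]


def mirror_pairs(pairs: List[SentencePair]) -> List[SentencePair]:
    """Return each (source, target) pair followed by its (target, source) mirror."""
    mirrored: List[SentencePair] = []
    for src, trg in pairs:
        mirrored.append((src, trg))  # original direction
        mirrored.append((trg, src))  # mirrored direction
    return mirrored


if __name__ == "__main__":
    # Placeholder sentences for illustration only.
    for pair in mirror_pairs([("source one", "target one"), ("source two", "target two")]):
        print(pair)
```

With mirroring, the data set doubles in size; without it, the list of pairs is used unchanged.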

preprocess

The preprocess tool prepares the various data files needed to train a model. Preprocessing steps include:

  • creating SentencePiece vocabulary models from the experiment's source and target files;
  • splitting the source and target files into the training, validation, and test data sets;
  • writing the train/validate/test data sets to files in the experiment subfolder;
  • adapting the parent model (if one is specified) to be used by this experiment.
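The splitting step can be sketched as a seeded random partition of aligned sentence pairs. The page does not describe how the tool chooses its split, so the strategy, the size parameters, and the name `split_pairs` below are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

SentencePair = Tuple[str, str]
Split = Tuple[List[SentencePair], List[SentencePair], List[SentencePair]]


def split_pairs(pairs: Sequence[SentencePair], val_size: int, test_size: int, seed: int) -> Split:
    """Partition pairs into (train, validation, test) using a seeded shuffle of indices."""
    if val_size + test_size > len(pairs):
        raise ValueError("val_size + test_size exceeds the number of pairs")
    rng = random.Random(seed)  # the same seed always yields the same split
    indices = list(range(len(pairs)))
    rng.shuffle(indices)
    test_idx = set(indices[:test_size])
    val_idx = set(indices[test_size:test_size + val_size])
    train: List[SentencePair] = []
    val: List[SentencePair] = []
    test: List[SentencePair] = []
    # Walk the pairs in their original order so each subset keeps corpus order.
    for i, pair in enumerate(pairs):
        if i in test_idx:
            test.append(pair)
        elif i in val_idx:
            val.append(pair)
        else:
            train.append(pair)
    return train, val, test


if __name__ == "__main__":
    # Synthetic placeholder data for illustration only.
    data = [(f"src {i}", f"trg {i}") for i in range(10)]
    train, val, test = split_pairs(data, val_size=2, test_size=2, seed=111)
    print(len(train), len(val), len(test))
```

Fixing the seed is what makes an experiment repeatable: rerunning the split with the same seed and inputs reproduces the same three subsets.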

usage: preprocess.py [-h] [--stats] experiment

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| experiment | Experiment name | The name of the experiment subfolder that contains the configuration file. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| --stats | Output corpus statistics | Uses a statistical model to calculate an alignment score for the source and target texts. This option requires the SIL.Machine library to be available. |

train

test

translate

Analyzing the results of an experiment

analyze

check_train_val_test_split

diff_predictions

Miscellaneous commands

average_checkpoints

export_embeddings