NMT: Usage
This section describes the tools most commonly used to set up and run an experiment.
The config tool can be used to set up a simple configuration file (config.yml) for an experiment. The configuration settings are specified on the command line, and the tool generates a valid config.yml file with those settings in the specified experiment subfolder (SIL_NLP_DATA_PATH > MT > experiments > <experiment>).
usage: config.py [-h] [--src-langs [lang [lang ...]]]
[--trg-langs [lang [lang ...]]] [--vocab-size VOCAB_SIZE]
[--src-vocab-size SRC_VOCAB_SIZE]
[--trg-vocab-size TRG_VOCAB_SIZE] [--parent PARENT]
[--mirror] [--force] [--seed SEED] [--model MODEL]
experiment
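As a rough illustration of the usage line above, the sketch below declares an equivalent command-line interface with Python's standard `argparse` module. It is not the source of config.py: the function name `build_parser`, the integer types for the vocabulary sizes and the seed, and the empty-list defaults are assumptions made for the example. Only the argument names, the model options, and the TransformerBase default come from this page.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser shaped like the documented config.py usage (illustrative only)."""
    parser = argparse.ArgumentParser(prog="config.py")
    parser.add_argument("experiment", help="Experiment name")
    # Zero or more file base names, e.g. abp-ABP
    parser.add_argument("--src-langs", nargs="*", metavar="lang", default=[],
                        help="Source language files")
    parser.add_argument("--trg-langs", nargs="*", metavar="lang", default=[],
                        help="Target language files")
    # Assumption: sizes and seed are parsed as integers
    parser.add_argument("--vocab-size", type=int, help="Shared vocabulary size")
    parser.add_argument("--src-vocab-size", type=int, help="Source vocabulary size")
    parser.add_argument("--trg-vocab-size", type=int, help="Target vocabulary size")
    parser.add_argument("--parent", help="Parent experiment name")
    parser.add_argument("--mirror", action="store_true",
                        help="Mirror train and validation data sets")
    parser.add_argument("--force", action="store_true",
                        help="Overwrite existing config file")
    parser.add_argument("--seed", type=int, help="Randomization seed")
    parser.add_argument(
        "--model",
        default="TransformerBase",
        choices=[
            "TransformerBase",
            "TransformerBig",
            "SILTransformerBaseNoResidual",
            "SILTransformerBaseAlignmentEnhanced",
        ],
        help="Neural network model",
    )
    return parser


if __name__ == "__main__":
    # "my_experiment" is a hypothetical experiment name.
    args = build_parser().parse_args(
        ["--src-langs", "abp-ABP", "--trg-langs", "en-ABPBTE",
         "--vocab-size", "32000", "--mirror", "my_experiment"]
    )
    print(args)
```

In this sketch the experiment name is placed after a flag that takes no value (`--mirror`). With `nargs="*"`, a bare word placed directly after `--src-langs` or `--trg-langs` would be read as another language file rather than as the experiment name.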
Arguments:
| Argument | Purpose | Description |
|---|---|---|
| experiment | Experiment name | The name of the experiment subfolder where the configuration file will be generated. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| --src-langs [lang [lang ...]] | Source language files | The names of one or more files in the source language(s). Each file must be located in the SIL_NLP_DATA_PATH > MT > corpora folder or the SIL_NLP_DATA_PATH > MT > scripture folder. Only the base of the file name is specified; e.g., to use the file `abp-ABP.txt`, specify `abp-ABP`. |
| --trg-langs [lang [lang ...]] | Target language files | The names of one or more files in the target language(s). Each file must be located in the SIL_NLP_DATA_PATH > MT > corpora folder or the SIL_NLP_DATA_PATH > MT > scripture folder. Only the base of the file name is specified; e.g., to use the file `en-ABPBTE.txt`, specify `en-ABPBTE`. |
| --vocab-size VOCAB_SIZE | Shared vocabulary size | Specifies the size (e.g., '32000') of the shared SentencePiece vocabulary that will be constructed from the text in the source and target files. |
| --src-vocab-size SRC_VOCAB_SIZE | Source vocabulary size | Specifies the size (e.g., '32000') of a SentencePiece vocabulary that will be constructed from the text in the source files only. This option should be used in combination with the --trg-vocab-size argument. |
| --trg-vocab-size TRG_VOCAB_SIZE | Target vocabulary size | Specifies the size (e.g., '32000') of a SentencePiece vocabulary that will be constructed from the text in the target files only. This option should be used in combination with the --src-vocab-size argument. |
| --parent PARENT | Parent experiment name | The name of an experiment subfolder with a trained parent model. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| --mirror | Mirror training and validation data sets (default: False) | Specifies that the training and validation data sets constructed from the source and target files should be mirrored. With mirroring, each source/target sentence pair is added to the training (or validation) data set both as a source/target pair and as a target/source pair. Without mirroring, each sentence pair is added only as a source/target pair. |
| --force | Overwrite existing config file | By default, if a configuration file already exists in the specified experiment subfolder, the tool reports an error. When this argument is provided, the tool overwrites the existing configuration file instead. |
| --seed SEED | Randomization seed | Specifies the randomization seed that will be used during preprocessing and training. |
| --model MODEL | Neural network model | Specifies the neural network model that will be trained. Options: TransformerBase (default), TransformerBig, SILTransformerBaseNoResidual, or SILTransformerBaseAlignmentEnhanced. |
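The effect of the --mirror option can be pictured with a small sketch. The helper name `mirror_pairs` and the sample sentences are hypothetical, and the order in which the real tool writes the mirrored pairs is not stated on this page. The sketch only shows that each sentence pair is added in both directions.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def mirror_pairs(pairs: List[Pair]) -> List[Pair]:
    """Return each (source, target) pair followed by its (target, source) reversal."""
    mirrored: List[Pair] = []
    for src, trg in pairs:
        mirrored.append((src, trg))  # original direction
        mirrored.append((trg, src))  # reversed direction added by mirroring
    return mirrored


if __name__ == "__main__":
    # Placeholder sentences, not real corpus data.
    pairs = [("source sentence 1", "target sentence 1"),
             ("source sentence 2", "target sentence 2")]
    for src, trg in mirror_pairs(pairs):
        print(f"{src!r} -> {trg!r}")
```

Under this reading, a mirrored data set has twice as many pairs as an unmirrored one.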
The preprocess tool prepares the various data files needed to train a model. Preprocessing steps include:
- creating SentencePiece vocabulary models from the experiment's source and target files;
- splitting the source and target files into the training, validation, and test data sets;
- writing the train/validate/test data sets to files in the experiment subfolder;
- adapting the parent model (if one is specified) to be used by this experiment.
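The splitting step listed above can be sketched as a seeded random split of sentence pairs. This page does not describe how preprocess.py actually chooses the split, so the function `split_pairs`, its `val_size` and `test_size` parameters, and the shuffle-then-slice strategy are assumptions made for illustration. The sketch does show why a fixed seed (see the --seed argument of the config tool) makes the split reproducible.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def split_pairs(
    pairs: Sequence[Pair], val_size: int, test_size: int, seed: int
) -> Tuple[List[Pair], List[Pair], List[Pair]]:
    """Shuffle pairs with a fixed seed and slice them into train/validation/test."""
    if val_size + test_size > len(pairs):
        raise ValueError("val_size + test_size exceeds the number of sentence pairs")
    shuffled = list(pairs)
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:test_size]
    val = shuffled[test_size:test_size + val_size]
    train = shuffled[test_size + val_size:]
    return train, val, test


if __name__ == "__main__":
    # Placeholder pairs; a real experiment would read them from the corpus files.
    pairs = [(f"src {i}", f"trg {i}") for i in range(10)]
    train, val, test = split_pairs(pairs, val_size=2, test_size=2, seed=111)
    print(len(train), len(val), len(test))
```

Running the function twice with the same seed yields the same three data sets, which is what makes experiments comparable across runs.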
usage: preprocess.py [-h] [--stats] experiment
Arguments:
| Argument | Purpose | Description |
|---|---|---|
| experiment | Experiment name | The name of the experiment subfolder that contains the configuration file. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| --stats | Output corpus statistics | Calculates an alignment score for the source and target texts using a statistical model. This option requires the SIL.Machine library to be available. |