
Translate Word Documents with NLLB

David Baines edited this page Feb 8, 2023 · 2 revisions

SILNLP can translate docx files. The translate.py script supports three formats: txt, docx, and USFM. For docx files, the paragraph structure is preserved but any inline formatting is lost; preserving the formatting would require fine-tuning NLLB to recognize docx markup. The primary reason for supporting docx was to make it possible to translate documents between any of the 200 languages supported by NLLB. The translate script can use an NLLB model without fine-tuning.
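To illustrate why paragraph structure survives while inline formatting does not, here is a minimal sketch using only the Python standard library. It is not the code used by translate.py; the function names `paragraph_texts` and `make_demo_docx` are hypothetical and exist only for this illustration. A docx file is a zip archive whose main text lives in word/document.xml, where each paragraph (`w:p`) contains one or more runs (`w:r`) that carry the formatting. Reading a paragraph as a single string merges its runs, so properties such as bold are dropped.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a docx archive.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraph_texts(docx_bytes: bytes) -> list[str]:
    """Return one plain-text string per <w:p> paragraph (hypothetical helper)."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Joining the text of every run discards run-level properties
        # (bold, italic, ...), which is why inline formatting is lost.
        paragraphs.append("".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t")))
    return paragraphs


def make_demo_docx() -> bytes:
    """Build a tiny in-memory archive with two paragraphs, one containing a bold run."""
    body = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        '<w:p><w:r><w:t xml:space="preserve">Hello </w:t></w:r>'
        '<w:r><w:rPr><w:b/></w:rPr><w:t>world</w:t></w:r></w:p>'
        '<w:p><w:r><w:t>Second paragraph.</w:t></w:r></w:p>'
        '</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", body)
    return buffer.getvalue()


if __name__ == "__main__":
    for text in paragraph_texts(make_demo_docx()):
        print(text)
```

Each string returned here corresponds to one paragraph, which is the unit that would be passed to the model and written back in the same order.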

Setup for translating docx files

Create an experiment folder with a config file as usual, but don't specify any corpus pairs. The config file lets you set the decoding hyperparameters and choose which model to use. Here is an example of a config file:

model: facebook/nllb-200-1.3B
data:
  seed: 111
  lang_codes:
    en: eng_Latn
    es: spa_Latn
params:
  label_smoothing_factor: 0.2
infer:
  infer_batch_size: 16
  num_beams: 2

Here is an example of how to call the translate script:

python -m silnlp.nmt.translate <source_folder> --src MT/experiments/experiment --src-iso en --trg-iso es

This translates every file in the MT/experiments/<source_folder> directory recursively and writes the results to the infer directory inside the experiment directory. The --src parameter also accepts a file path. Use --trg to specify the target directory or file. The new changes to the translate script should work on ClearML.
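If you need to run the script for several target languages, the call can be scripted. The following is a rough sketch, not part of SILNLP: `build_translate_command` is a hypothetical helper that only assembles the flags shown on this page (--src, --trg, --src-iso, --trg-iso) in the same order as the example command. It assumes that every ISO code you pass has a matching entry under lang_codes in the config file.

```python
import shlex
import sys


def build_translate_command(first_arg, src, src_iso, trg_iso, trg=None):
    """Assemble the argument list for one silnlp.nmt.translate call.

    `first_arg` is the positional argument and `src` the value of --src,
    in the same order as the example command on this page (hypothetical helper).
    """
    command = [
        sys.executable, "-m", "silnlp.nmt.translate", first_arg,
        "--src", src,
        "--src-iso", src_iso,
        "--trg-iso", trg_iso,
    ]
    if trg is not None:
        # --trg is optional; add it only when an output location is given.
        command += ["--trg", trg]
    return command


if __name__ == "__main__":
    # Print one command per target language; to execute a command instead,
    # pass the list to subprocess.run(cmd, check=True).
    for trg_iso in ["es"]:
        cmd = build_translate_command(
            "<source_folder>", "MT/experiments/experiment", "en", trg_iso
        )
        print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems when paths contain spaces.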