Code for the paper: "Dodging the Data Bottleneck: Automatic Subtitling with Automatically Segmented ST Corpora".

Here we release the multilingual multimodal models with parallel attention (Figure 1 in the paper). The original model used in the paper is de_en_fr_it. We also release another version of the segmenter, trained on all MuST-Cinema languages, which achieves overall higher scores for the languages not covered by the de_en_fr_it model.
- de_en_fr_it: checkpoint | config.yaml | spm_model | fairseq_vocabulary
- all_langs: checkpoint | config.yaml | spm_model | fairseq_vocabulary
| Results (Sigma / CPL%) | de | en | es | fr | it | nl | pt | ro |
|---|---|---|---|---|---|---|---|---|
| de_en_fr_it | 86.4 / 89.1 | 88.2 / 94.6 | 81.2 / 89.3 | 86.7 / 93.3 | 85.5 / 89.3 | 81.2 / 89.0 | 81.4 / 89.3 | 75.3 / 83.3 |
| all_langs | 86.3 / 88.4 | 87.1 / 94.4 | 84.5 / 93.0 | 87.1 / 93.2 | 85.2 / 89.6 | 86.5 / 86.0 | 87.2 / 89.6 | 86.6 / 91.4 |
Preprocess the MuST-Cinema dataset as already explained here. Then, run the following code:
```bash
for subset in train dev amara; do
  # Extract the target text (5th column) of the segmented tsv
  cut -f 5 ${DATA_ROOT}/en-${LANG}/${subset}_st_src.tsv > \
    ${DATA_ROOT}/en-${LANG}/${subset}.${LANG}.multimod
  # Remove the <eob>/<eol> tags and clean up the extra spaces they leave behind
  sed 's/<eob>//g; s/<eol>//g; s/  / /g; s/^ //g; s/ $//g' \
    ${DATA_ROOT}/en-${LANG}/${subset}.${LANG}.multimod > \
    ${DATA_ROOT}/en-${LANG}/${subset}.${LANG}.multimod.unsegm
  # Append the unsegmented text as a new column and drop the original source text
  paste ${DATA_ROOT}/en-${LANG}/${subset}_st_src.tsv \
    ${DATA_ROOT}/en-${LANG}/${subset}.${LANG}.multimod.unsegm \
    | cut -f 1,2,3,5,6,7 > ${DATA_ROOT}/en-${LANG}/${subset}_multi_segm.tsv
  # Fix the header: the appended unsegmented text column is the new src_text
  sed -i '1s/tgt_text$/src_text/g' ${DATA_ROOT}/en-${LANG}/${subset}_multi_segm.tsv
done
```
where `DATA_ROOT` is the folder containing the preprocessed data and `LANG` is the language (en, de, fr, and it for the train, dev, and amara sets; es and nl only for the amara set).
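As a quick sanity check, the sed command above only strips the subtitle boundaries and normalizes the spacing; here is a toy illustration (the input sentence is made up):

```bash
# Toy example: remove <eob>/<eol> tags and the double spaces they leave behind
echo "Ladies and gentlemen, <eol> welcome <eob> Thank you." \
  | sed 's/<eob>//g; s/<eol>//g; s/  / /g; s/^ //g; s/ $//g'
# Output: Ladies and gentlemen, welcome Thank you.
```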
Lastly, add the target language as a tsv column to enable Fairseq-ST multilingual training/inference. Run the command below for each subset and each language (see the loop sketch after the command):
```bash
# Append a tgt_lang column: the header gets "tgt_lang", every row gets the language code
awk 'NR==1 {printf("%s\t%s\n", $0, "tgt_lang")} NR>1 {printf("%s\t%s\n", $0, "'"${LANG}"'")}' \
  ${DATA_ROOT}/en-${LANG}/${subset}_multi_segm.tsv > ${DATA_ROOT}/${subset}_${LANG}_multi.tsv
```
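Since this command operates on a single (subset, language) pair, a driver loop like the following can be used (a minimal sketch under the directory layout above; adapt as needed):

```bash
# Sketch of a driver loop: en/de/fr/it have train, dev, and amara subsets,
# while es and nl are only available as amara sets
for LANG in en de fr it es nl; do
  case ${LANG} in
    es|nl) subsets="amara" ;;
    *)     subsets="train dev amara" ;;
  esac
  for subset in ${subsets}; do
    awk 'NR==1 {printf("%s\t%s\n", $0, "tgt_lang")} NR>1 {printf("%s\t%s\n", $0, "'"${LANG}"'")}' \
      ${DATA_ROOT}/en-${LANG}/${subset}_multi_segm.tsv > ${DATA_ROOT}/${subset}_${LANG}_multi.tsv
  done
done
```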
To generate a single SentencePiece model and, consequently, a shared vocabulary for all the training languages (as we do in our paper), run the script below:
```bash
python ${FBK_fairseq}/examples/speech_to_text/scripts/gen_multilang_spm_vocab.py \
  --data-root ${DATA_ROOT} --save-dir ${DATA_ROOT} \
  --langs en,de,fr,it --splits train_en_multi,train_de_multi,train_fr_multi,train_it_multi \
  --vocab-type unigram --vocab-size 10000
```
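The resulting SentencePiece model and vocabulary are then referenced from the yaml config passed as `CONFIG_YAML` below. As a rough illustration, a fairseq S2T config uses keys like these (a sketch with assumed file names, not the exact config we release):

```yaml
# Hypothetical config.yaml sketch (standard fairseq S2T keys; file names are assumptions)
vocab_filename: spm_unigram10000.txt
bpe_tokenizer:
  bpe: sentencepiece
  sentencepiece_model: spm_unigram10000.model
input_feat_per_channel: 80
# Prepend the target-language tag to the target text; this is why the training
# command below uses --ignore-prefix-size 1
prepend_tgt_lang_tag: true
```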
To train the multilingual multimodal model with parallel attention, run the code below:
```bash
python ${FBK_fairseq}/train.py ${DATA_ROOT} \
  --train-subset train_de_multi,train_en_multi,train_fr_multi,train_it_multi \
  --valid-subset dev_de_multi,dev_en_multi,dev_fr_multi,dev_it_multi \
  --save-dir ${SAVE_DIR} \
  --num-workers 2 --max-update 200000 \
  --max-tokens 40000 \
  --user-dir examples/speech_to_text \
  --task speech_to_text_multimodal --config-yaml ${CONFIG_YAML} \
  --criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy \
  --label-smoothing 0.1 \
  --arch s2t_transformer_dual_encoder_s \
  --ctc-encoder-layer 8 --ctc-compress-strategy avg --ctc-weight 0.5 \
  --context-encoder-layers 12 --decoder-layers 3 \
  --context-dropout 0.3 --context-ffn-embed-dim 1024 \
  --share-encoder-decoder-embed \
  --context-decoder-attention-type parallel \
  --optimizer adam --lr 1e-3 --lr-scheduler inverse_sqrt \
  --warmup-updates 10000 \
  --clip-norm 10.0 \
  --seed 1 --update-freq 4 \
  --patience 15 \
  --ignore-prefix-size 1 \
  --skip-invalid-size-inputs-valid-test \
  --log-format simple --find-unused-parameters
```
where `FBK_fairseq` is the folder containing our repository, `DATA_ROOT` is the folder containing the preprocessed data, `SAVE_DIR` is the folder in which to save the checkpoints of the model, and `CONFIG_YAML` is the path to the yaml config file.
This training setup is intended for 2 NVIDIA A40 48GB GPUs. Please adjust `--max-tokens` and `--update-freq` so that max_tokens * update_freq * number of GPUs = 320,000 (with the values above: 40,000 * 4 * 2 = 320,000; on a single GPU, for instance, set `--update-freq 8`).
First, average the checkpoints as already explained in our repository here.
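For reference, with fairseq's standard averaging script this step looks roughly as follows (a sketch: the number of averaged checkpoints, here 5, is an assumption, so check the linked instructions for the value used in the paper):

```bash
# Average the last 5 epoch checkpoints into a single model file
# (5 is an assumed value, not necessarily the one used in the paper)
python ${FBK_fairseq}/scripts/average_checkpoints.py \
  --inputs ${SAVE_DIR} \
  --num-epoch-checkpoints 5 \
  --output ${SAVE_DIR}/avg5.pt
CHECKPOINT_FILENAME=avg5.pt
```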
Second, run the code below:
```bash
python ${FBK_fairseq}/generate.py ${DATA_ROOT} \
  --config-yaml ${CONFIG_YAML} --gen-subset amara_${LANG}_multi \
  --task speech_to_text_multimodal \
  --criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy \
  --user-dir examples/speech_to_text \
  --path ${SAVE_DIR}/${CHECKPOINT_FILENAME} \
  --max-tokens 25000 --beam 5 --scoring sacrebleu \
  --results-path ${SAVE_DIR}
```
where `LANG` is the language selected for inference and `CHECKPOINT_FILENAME` is the file containing the average of the checkpoints obtained in the previous step.
Please use sacrebleu to obtain BLEU scores and EvalSubtitle to obtain Sigma and CPL.
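For instance, assuming the standard fairseq generate output format, BLEU can be computed along these lines (a sketch; `${REF}` is a hypothetical path to the reference translations):

```bash
# Extract the detokenized hypotheses (D-* lines) from the fairseq output,
# restore the original sentence order, and score with sacreBLEU
grep ^D- ${SAVE_DIR}/generate-amara_${LANG}_multi.txt \
  | sed 's/^D-//' | sort -n | cut -f3 > ${SAVE_DIR}/hyp.${LANG}
sacrebleu ${REF} -i ${SAVE_DIR}/hyp.${LANG}
```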
If you use this work, please cite:

```bibtex
@inproceedings{papi-etal-2022-dodging,
    title = "Dodging the Data Bottleneck: Automatic Subtitling with Automatically Segmented {ST} Corpora",
    author = "Papi, Sara  and
      Karakanta, Alina  and
      Negri, Matteo  and
      Turchi, Marco",
    booktitle = "Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)",
    month = nov,
    year = "2022",
    address = "Online only",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.aacl-short.59",
    pages = "480--487",
}
```