- This script is forked from @lukeorland's implementation.
- A README written in Japanese is available as `README_jp.md`.
- This script aims to select in-domain data from a monolingual corpus.
- We use KenLM for the n-gram LMs.
  - The original implementation uses SRILM.
- We use the difference of cross entropy (the negative log probability normalized by sentence length) as the selection filter; see the formula after this list.
  - The original implementation uses the difference of perplexity.
- We do not restrict the size of the general-domain text when training the general LM.
  - In the original, this size is capped at the size of the specific-domain text.
- KenLM: https://github.com/kpu/kenlm
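For reference, the selection score can be written as follows (this is the Moore-Lewis criterion; the sign convention below is one common choice, with higher meaning more in-domain):

$$
\mathrm{score}(s) = H_{\mathrm{gen}}(s) - H_{\mathrm{spec}}(s),
\qquad
H(s) = -\frac{1}{|s|} \log P(s)
$$

where $H_{\mathrm{spec}}(s)$ and $H_{\mathrm{gen}}(s)$ are the per-word cross entropies of sentence $s$ under the specific-domain and general-domain LMs, and $|s|$ is the length of $s$ in tokens. Moore and Lewis (2010) write the same criterion as $H_{\mathrm{spec}}(s) - H_{\mathrm{gen}}(s)$ and keep sentences below a threshold; the two are equivalent up to sign.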
```
./lm_select.sh \
    GENERAL_DOMAIN_CORPUS \
    SPECIFIC_DOMAIN_CORPUS \
    DEST_DIR \
    [KENLM_BIN_DIR]
```
- `GENERAL_DOMAIN_CORPUS`: tokenized general-domain text data; one sentence per line.
- `SPECIFIC_DOMAIN_CORPUS`: tokenized specific-domain text data; one sentence per line.
- `DEST_DIR`: path to the directory where the sorted general-domain text data is written.
- `[KENLM_BIN_DIR]` (optional): path to the directory where the KenLM binaries have been compiled on your system. By default this path is assumed to be on `$PATH`. `KENLM_BIN_DIR` should include `lmplz` and `build_binary`.
Outputs:

- `DEST_DIR/sorted-uniq-scores_general.tsv`: TSV file; field 1 is the difference of cross entropy and field 2 is the sentence.
- `DEST_DIR/general_corpus_sorted.txt`: the sorted general-domain text.
Example:

```
./lm_select.sh \
    /path/to/general-domain.txt \
    /path/to/specific-domain.txt \
    selection_result \
    /path/to/kenlm_bin
```
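Because the output is sorted most-similar-first, subsampling (as described in the original README below) amounts to taking the first N lines. For example, to keep the top one million sentences (the count is arbitrary):

```
head -n 1000000 selection_result/general_corpus_sorted.txt > selected.txt
```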
- We use `sort -n -r` as the sorting command. If you need scores in E notation to be handled, use `sort -g -r` at line 163 instead (`-g` parses general floating-point values; GNU sort does not allow combining `-n` with `-g`). A short illustration follows this list.
- If the input texts include escape characters, use `script/escape-special-chars.perl` to escape them for KenLM.
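A quick illustration of the difference: `-n` stops parsing a number at the `e`, so values in E notation compare incorrectly, while `-g` parses them as floating point:

```
$ printf '1e3\n5\n' | sort -n -r   # wrong: 1e3 is parsed as 1
5
1e3
$ printf '1e3\n5\n' | sort -g -r   # correct: 1e3 = 1000
1e3
5
```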
This sorting method is an implementation of the following paper:

Intelligent Selection of Language Model Training Data.
Moore and Lewis, ACL 2010.
The original README follows.
This script applies Moore and Lewis's approach to intelligently selecting training data for the domain adaptation of translation models for machine translation.
The result of running this script is a copy of a randomly-ordered general-domain corpus, sorted such that the sentences at the top are most "similar" to the domain-specific data and the least "similar" sentences are at the bottom. It is then possible to subsample by selecting the first N sentences from the sorted files.
This script operates on parallel translation corpora stored as two files: one sentence per line in the source-language file, with the corresponding translation on each line of the target-side file.
The ranking computations can be done on one language side or both. When sorting bilingually, the ranking of a training sentence pair is the sum of its rankings for both languages.
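For instance, if the per-sentence scores for the two sides have been written to a pair of line-aligned files (the file names here are hypothetical), the bilingual score is just the per-line sum:

```
paste scores.fr scores.en | awk '{ print $1 + $2 }' > scores.sum
```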
A recommended first step is to tokenize and normalize the general-domain training data and domain-specific data before processing them with this script.
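For example, with the Moses `tokenizer.perl` and `lowercase.perl` scripts (one common choice; any equivalent tokenizer and normalizer will do):

```
perl tokenizer.perl -l fr < general-domain.fr | perl lowercase.perl > general-domain.lc.tok.fr
```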
```
./ml_select.sh \
    GENERAL_DOMAIN_CORPUS_PREFIX \
    SPECIFIC_DOMAIN_CORPUS_PREFIX \
    DEST_DIR \
    SOURCE_LANG \
    TARGET_LANG \
    RANK_BY_SOURCE_LANG \
    RANK_BY_TARGET_LANG \
    [SRILM_BIN_DIR]
```
- `GENERAL_DOMAIN_CORPUS_PREFIX`: the path to the pair of (normalized, tokenized) parallel general-domain files, with the final "." and language extension removed (e.g. `.fr`, `.en`). This is the corpus that will be re-sorted. E.g. if the general-domain files are located at `/path/to/general-domain.lc.tok.fr` and `/path/to/general-domain.lc.tok.en`, then the value to pass for this parameter is `/path/to/general-domain.lc.tok`.
- `SPECIFIC_DOMAIN_CORPUS_PREFIX`: the path to the pair of (normalized, tokenized) parallel specific-domain files, with the final "." and language extension removed (e.g. `.fr`, `.en`). The general-domain corpus is re-sorted by similarity to this corpus. E.g. if the specific-domain files are located at `/path/to/specific-domain.lc.tok.fr` and `/path/to/specific-domain.lc.tok.en`, then the value to pass for this parameter is `/path/to/specific-domain.lc.tok`.
- `DEST_DIR`: the path to an existing directory where the sorted general-domain files should be written.
- `SOURCE_LANG`: the (probably two-letter) file extension of the source language. E.g. for the examples above, the value to pass is `fr`.
- `TARGET_LANG`: the (probably two-letter) file extension of the target language. E.g. for the examples above, the value to pass is `en`.
- `RANK_BY_SOURCE_LANG`: whether to calculate the specific-domain language model's perplexity for each source-language sentence in the general domain. The value to pass is either `true` or `false`.
- `RANK_BY_TARGET_LANG`: whether to calculate the specific-domain language model's perplexity for each target-language sentence in the general domain. The value to pass is either `true` or `false`.
- `[SRILM_BIN_DIR]` (optional): the path to the directory where the SRILM binary tools (e.g. `ngram`, `ngram-count`) have been compiled on your system. By default this location is assumed to be `$SRILM/bin/i686_m64`, where `SRILM` is an environment variable pointing to the SRILM source code directory.
```
./ml_select.sh \
    /path/to/general-domain.lc.tok \
    /path/to/specific-domain.lc.tok \
    /path/to/destination_directory \
    fr \
    en \
    false \
    true \
    $HOME/code/srilm/bin/i686_m64
```
The two parallel files in a corpus should have the same path prefix, and the filename extension should be a period followed by the language abbreviation (e.g. `.fr`, `.en`).
After calling the above command, the general-domain corpus is sorted, duplicates are removed, and the result is copied into two files such as the following:

```
/dest/dir/general_corpus_sorted.fr
/dest/dir/general_corpus_sorted.en
```
This Bash script takes the following steps to accomplish its task. The computation is performed first on the source-language side, then on the target-language side if desired. (A minimal KenLM-based sketch of the corresponding monolingual pipeline follows the list.)

- Extract the vocabulary from the specific-domain corpus. The vocabulary consists of all non-singleton types.
- Randomly select the same number of segments from the general domain as there are in the specific domain, for building a language model.
- Build a language model from the general-domain text, with the vocabulary restricted to the non-singleton tokens from the specific-domain corpus.
- Build a language model from the specific-domain text, with the same vocabulary restriction.
- Calculate the perplexity of the general-domain LM for each general-domain text segment.
- Calculate the perplexity of the specific-domain LM for each general-domain text segment.
- Subtract the perplexity of the general-domain LM from the perplexity of the specific-domain LM for each sentence.
- Combine the perplexity differences and the unprocessed/raw source- and target-side text segments into a single tab-delimited file.
- Add together the perplexity differences for the source and target languages.
- Sort the training data by the (summed) perplexity difference scores and delete consecutive duplicate training candidates.
- Write separate source- and target-language training corpus files in sorted order.
- Sort files aligned by Berkeley Aligner or GIZA++.
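To make the core procedure concrete, the following is a minimal, hypothetical sketch of the monolingual KenLM variant used by this fork, covering LM training, scoring, and sorting but omitting the vocabulary restriction and the random subselection. It assumes `lmplz`, `build_binary`, and `query` are on `$PATH`, and that `query -v sentence` prints one `Total: <log10 prob> OOV: <n>` line per input sentence; check your KenLM version and adjust the parsing if its output differs. The actual `lm_select.sh` may differ in details such as n-gram order and smoothing options.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of Moore-Lewis selection with KenLM; not the real lm_select.sh.
set -euo pipefail

GENERAL=$1    # tokenized general-domain corpus, one sentence per line
SPECIFIC=$2   # tokenized specific-domain corpus, one sentence per line
DEST=$3       # output directory
mkdir -p "$DEST"

# Train a 5-gram LM on each corpus and binarize it for fast querying.
# (Very small corpora may need lmplz's --discount_fallback option.)
lmplz -o 5 < "$SPECIFIC" > "$DEST/specific.arpa"
build_binary "$DEST/specific.arpa" "$DEST/specific.klm"
lmplz -o 5 < "$GENERAL" > "$DEST/general.arpa"
build_binary "$DEST/general.arpa" "$DEST/general.klm"

# Score every general-domain sentence under both LMs; keep the per-sentence
# total log10 probability (field 2 of the assumed "Total: ... OOV: ..." line).
query -v sentence "$DEST/specific.klm" < "$GENERAL" | awk '{ print $2 }' > "$DEST/spec.lp"
query -v sentence "$DEST/general.klm"  < "$GENERAL" | awk '{ print $2 }' > "$DEST/gen.lp"

# Per-word cross-entropy difference: (logP_spec - logP_gen) / length.
# Higher = more in-domain. (Normalization ignores the implicit </s> token.)
paste "$DEST/spec.lp" "$DEST/gen.lp" "$GENERAL" \
  | awk -F'\t' '{ n = split($3, w, " "); printf "%f\t%s\n", ($1 - $2) / n, $3 }' \
  | sort -g -r | uniq > "$DEST/sorted-uniq-scores_general.tsv"

# Drop the score column to get the sorted corpus.
cut -f2 "$DEST/sorted-uniq-scores_general.tsv" > "$DEST/general_corpus_sorted.txt"
```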