This repository provides instructions for aligning transcriptions with corresponding audio files containing Slovene speech using the Montreal Forced Aligner (MFA). Follow the steps below to set up the environment and align your speech corpus.
- Create a virtual environment and install MFA by running the following commands:
conda create -n aligner -c conda-forge montreal-forced-aligner
conda activate aligner
mfa --help
- The next steps assume that your speech corpus is located at ~/mfa_data/corpus and the pronunciation dictionary at ~/mfa_data/dictionary.txt. Adjust the paths to match your setup. The speech corpus should follow the format described in the MFA documentation.
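For reference, MFA pairs each sound file with a transcription file of the same name (.txt, .lab, or .TextGrid). A minimal corpus layout, with hypothetical file names and optional per-speaker subdirectories, might look like this:
```
~/mfa_data/corpus/
├── speaker1/
│   ├── utt1.wav
│   ├── utt1.txt
│   ├── utt2.wav
│   └── utt2.txt
└── speaker2/
    ├── utt3.wav
    └── utt3.txt
```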
Before aligning the data, you can validate whether your dataset is in the proper format for MFA. Run the following command:
mfa validate <~/mfa_data/corpus> <~/mfa_data/dictionary.txt> --clean
To train the acoustic model and export it to a zip file, use the following command:
mfa train <~/mfa_data/corpus> <~/mfa_data/dictionary.txt> <~/mfa_data/acoustic_model.zip>
Alternatively, one can download a pretrained model and the pronunciation dictionary. If using the pretrained model, skip the training step.
To save the zipped acoustic model for later convenience, so that it can be referenced by name as in the commands below, run:
mfa model save acoustic <~/mfa_data/acoustic_model.zip>
To inspect the acoustic model, run the following command:
mfa model inspect acoustic acoustic_model
Finally, you can align your input data using a pronunciation dictionary and the trained acoustic model by running:
mfa align </path/to/input/wavs/and/txt/> <~/mfa_data/dictionary.txt> <acoustic_model> </path/to/aligned/outputs/>
Make sure to replace /path/to/input/wavs/and/txt/ with the actual path to your input data and /path/to/aligned/outputs/ with the desired location for the aligned output files. The command outputs a TextGrid file with word- and phoneme-level alignments for each wav/txt input pair.
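As a quick sanity check, the exported alignments can be read back programmatically. A minimal sketch using the textgrid Python package (an assumed dependency; the tier name "words" is MFA's default):
```python
import textgrid  # pip install textgrid

# Load one aligned output file (path is hypothetical)
tg = textgrid.TextGrid.fromFile("/path/to/aligned/outputs/utt1.TextGrid")

# List the tiers MFA exported (typically "words" and "phones")
for tier in tg.tiers:
    print(tier.name, len(tier))

# Print the first few word intervals: start time, end time, label
words = tg.getFirst("words")
for interval in words[:5]:
    print(f"{interval.minTime:.2f}\t{interval.maxTime:.2f}\t{interval.mark}")
```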
For more details and advanced usage, please refer to the official documentation of Montreal Forced Aligner.
MFA delivers alignments at both the word and phoneme levels. To introduce a syllable level, the add_cnvrstl-syllables_tier.py script can be used as follows:
python add_cnvrstl-syllables_tier.py </path/to/input.TextGrid> </path/to/input.trs> </path/to/output.TextGrid>
Similarly, other tiers can be added. A list of the implemented tiers and brief explanations follows; a minimal sketch of the pattern these scripts share appears after the list:
- add_speaker-ID_tier.py: The script parses speaker intervals from an XML (TEI) file and adds them as a new 'speaker-ID' tier to an existing TextGrid file, which is then saved to a specified output TextGrid file.
- add_standardized-trs_tier.py: The script aligns a standardized text transcription with speaker intervals in a TextGrid file by concatenating words within each speaker's interval, and then adds this speaker-specific, aligned transcription as a new tier named 'standardized-trs' to the output TextGrid file.
- add_conversational-trs_tier.py: The script aligns a conversational text transcription with speaker intervals in a TextGrid file, concatenating words within each speaker interval, and then adds this aligned speaker-specific transcription as a new tier to the output TextGrid file.
- add_cnvrstl-wrd-sgmnt_tier.py: The script aligns the input conversational transcript with the word intervals from the 'strd-wrd-sgmnt' tier in a TextGrid file by matching words from the transcription to their corresponding time intervals and then adds this aligned data as a new tier to the output TextGrid file. The number of words in the input transcription should be the same as in 'strd-wrd-sgmnt' tier for the script to work as intended.
- add_discourse-marker_tier.py: The script processes a TextGrid file to detect and label discourse markers in speech, based on a list loaded from an external file, and then adds these labeled intervals to a new tier in the output TextGrid file.
- add_pitch-reset_tier.py: The script features two methods for detecting pitch resets: the "average-neighboring" method, which compares a syllable's mean pitch with the average of its neighbors and labels significant differences as pitch resets, and the "intrasyllabic" method, which examines pitch changes within a single syllable, identifying a pitch reset if the difference exceeds 4 semitones.
- add_intensity-reset_tier.py: The script employs two methods to detect intensity resets. The "near" method compares a syllable's mean intensity with the average of its two closest neighbors, labeling significant differences as intensity resets. Conversely, the "extended" method contrasts a syllable's mean intensity with the average of its four closest neighbors, identifying significant differences as intensity resets.
- add_speech-rate-reduction_tier.py: The script includes two methods for detecting speech rate reduction. The "near" method compares the syllable length with the average length of its immediate neighbors, while the "extended" method uses the average length of the four closest neighbors. The script then labels syllables that are significantly longer than this average, as determined by the reduction_threshold argument.
- add_pause_tier.py: The script identifies and labels pauses in speech by analyzing a TextGrid file, marking intervals as 'POS' for pauses (empty or whitespace-only intervals) in the 'strd-wrd-sgmnt' tier.
- add_speaker-change_tier.py: The script analyzes a TextGrid file to identify and label speaker changes within conversational syllable intervals, marking these changes as 'POS' when a change occurs or 'NEG' otherwise.
- add_word-ID_tier.py: The script extracts and synchronizes word identifiers from the input XML, then adds these as a new tier to the output TextGrid file.
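All of these scripts follow the same basic pattern: read a TextGrid, derive labels from an existing tier, and append the result as a new tier. A minimal sketch of that pattern, using the pause-tier logic as the example and the textgrid Python package (an assumed dependency; the repository scripts may use a different reader):
```python
import sys
import textgrid  # pip install textgrid

# Sketch only, not the repository's exact code: scan an existing tier,
# derive labels, and write the result back as a new tier.
in_path, out_path = sys.argv[1], sys.argv[2]

tg = textgrid.TextGrid.fromFile(in_path)
src = tg.getFirst("strd-wrd-sgmnt")

pause_tier = textgrid.IntervalTier(name="pause", minTime=tg.minTime, maxTime=tg.maxTime)
for interval in src:
    # Empty or whitespace-only intervals are treated as pauses ('POS')
    label = "POS" if not interval.mark.strip() else "NEG"
    pause_tier.add(interval.minTime, interval.maxTime, label)

tg.append(pause_tier)
tg.write(out_path)
```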
The script acoustic_measurements.py computes various acoustic measurements from a given TextGrid file and WAV audio file, extracting phoneme durations, pitch-related features, formants, intensity, VOT (Voice Onset Time), COG (Center of Gravity), and related annotations. The computed values are then stored in a CSV file for analysis and further processing.
Usage:
python acoustic_measurements.py <input.TextGrid> <input.wav> <output.csv>
Input:
- input.TextGrid: A TextGrid file containing phoneme boundaries and other annotations
- input.wav: The corresponding audio file
- output.csv: The output CSV file to save the acoustic measurements
Output:
The output CSV file will contain the following columns for each phoneme:
- Phone: The phoneme label
- Duration: The duration of the phoneme (seconds)
- AvgPitch: The average pitch of the phoneme (Hz)
- PitchTrend: The pitch trend of the phoneme (rising, falling, or mixed)
- F1Formant: The F1 formant frequency (Hz)
- F2Formant: The F2 formant frequency (Hz)
- F3Formant: The F3 formant frequency (Hz)
- F4Formant: The F4 formant frequency (Hz)
- Intensity: The average intensity of the phoneme (dB)
- VOT: The voice onset time (seconds)
- COG: The spectral center of gravity of the phoneme (Hz)
- PreviousPhone: The phoneme preceding the current phoneme
- NextPhone: The phoneme following the current phoneme
- Word: The word containing the current phoneme
- Sentence: The sentence containing the current phoneme
- AudioID: The ID of the audio file
- SpeakerID: The ID of the speaker
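The signal-processing backend of acoustic_measurements.py is not detailed here; as an illustration only, per-phoneme pitch and intensity averages could be computed with Praat via parselmouth (an assumed dependency):
```python
import parselmouth  # pip install praat-parselmouth

snd = parselmouth.Sound("input.wav")

def avg_pitch(start, end):
    """Mean F0 (Hz) of a segment, ignoring unvoiced frames."""
    pitch = snd.extract_part(from_time=start, to_time=end).to_pitch()
    freqs = pitch.selected_array["frequency"]
    voiced = freqs[freqs > 0]  # unvoiced frames are reported as 0
    return voiced.mean() if voiced.size else 0.0

def avg_intensity(start, end):
    """Mean intensity (dB) of a segment."""
    intensity = snd.extract_part(from_time=start, to_time=end).to_intensity()
    return intensity.values.mean()

# Hypothetical phoneme interval from the TextGrid
print(avg_pitch(0.10, 0.25), avg_intensity(0.10, 0.25))
```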
Processing forced alignments
This Bash script, align.sh, automates forced alignment and the addition of multiple tiers to TextGrid files using the set of Python scripts above. It iterates through the WAV files in a specified directory, performs forced alignment over short time intervals of each audio/transcription pair, and adds the various tiers to the resulting TextGrid files for detailed analysis. Execute the script using the following command:
./align.sh <wav_dir> <out_dir> <lexicon> <xml_dir> <duration>
The script accepts these input arguments:
- wav_dir: Path to the directory containing WAV files.
- out_dir: Path to the directory where output and intermediate files will be stored.
- lexicon: Path to the lexicon file used for MFA forced alignment.
- xml_dir: Path to the directory containing XML files, i.e. transcriptions in TEI format.
- duration: A floating-point number defining the length, in seconds, of the audio segments created from the input audio and transcriptions prior to MFA forced alignment. The value 'Inf' means no segmentation is performed.
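For example, to segment the audio into 30-second chunks before alignment (all paths hypothetical):
./align.sh data/wav out data/lexicon.txt data/xml 30.0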
Processing acoustic measurements
The script acoustics.sh is designed to facilitate the processing of audio files for acoustic measurements. It takes three directory paths as input arguments: one for TextGrid files, one for WAV files, and one for output CSV files. The script iterates over each TextGrid file in the specified directory, locates its corresponding WAV file, performs acoustic measurements using a Python script acoustic_measurements.py, and outputs the results in CSV format. It can be called as follows:
./acoustics.sh <textgrid_dir> <wav_dir> <csv_dir>
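For example (hypothetical paths):
./acoustics.sh out/textgrids data/wav out/csv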
GOS database (korpus GOvorjene Slovenščine)
- Spoken corpus Gos 2.1 (transcriptions): https://www.clarin.si/repository/xmlui/handle/11356/1863
- Spoken corpus Gos VideoLectures 4.0 (audio): https://www.clarin.si/repository/xmlui/handle/11356/1222
- ASR database ARTUR 1.0 (audio): https://www.clarin.si/repository/xmlui/handle/11356/1776
- IRISS, SST and SPOG subsets: https://nl.ijs.si/nikola/mezzanine/
The accuracy of forced alignment can be assessed using the aligner_eval.py script. This script compares the word intervals produced by the alignment against the intervals found in the XML (TEI) files from the GOS database:
python aligner_eval.py <xml_dir> <textgrid_or_ctm_dir>
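The script's exact metric is not documented here; one common way to quantify agreement is the mean absolute difference between reference and hypothesis word boundaries, sketched below (illustrative only, not necessarily what aligner_eval.py computes):
```python
def mean_boundary_error(ref, hyp):
    """Mean absolute start/end-time difference (seconds) between reference
    and hypothesis word intervals, matched by position. Both arguments are
    lists of (start, end, word) tuples."""
    assert len(ref) == len(hyp), "interval lists must be matched one-to-one"
    diffs = [abs(r[0] - h[0]) + abs(r[1] - h[1]) for r, h in zip(ref, hyp)]
    return sum(diffs) / (2 * len(diffs))

# Hypothetical example: two words with slightly shifted boundaries
ref = [(0.00, 0.42, "dober"), (0.42, 0.90, "dan")]
hyp = [(0.02, 0.40, "dober"), (0.40, 0.93, "dan")]
print(mean_boundary_error(ref, hyp))  # ≈ 0.0225
```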
To install NeMo, follow the official installation instructions in the NVIDIA NeMo repository (https://github.com/NVIDIA/NeMo).
Pretrained model available at: https://www.clarin.si/repository/xmlui/handle/11356/1737.
Forced alignment using NeMo can be performed by the following commands:
python nemo_manifest.py <wav_dir> <xml_dir> <manifest_dir>
python nemo_align.py <nemo_dir> <model_path> <manifest_dir> <output_dir>
Alternatively, use the nemo_align.sh script to perform alignment on shorter audio sections. To convert NeMo *.ctm files to *.TextGrid files, use the following command:
python ctm2textgrid.py <input.ctm> <output.TextGrid>
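A CTM file records one aligned token per line in the standard format <utterance_id> <channel> <start_time> <duration> <token>. A hypothetical input:
```
# input.ctm
utt1 1 0.00 0.42 dober
utt1 1 0.42 0.48 dan
```
These lines correspond to word intervals [0.00, 0.42] and [0.42, 0.90] on a TextGrid tier in the converted output.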