Audio-visual speaker recognition on VoxCeleb2. This repo includes the speaker recognition code (speaker folder), the face recognition code (face folder), and the speaker-face recognition code (speaker_face folder). We separate the code into three folders for convenient usage.
Our paper is here. This project contains the code for audio-visual speaker recognition and the cleansed training list.
This code uses Mixed Precision Training (torch.cuda.amp).
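For reference, here is a minimal, self-contained sketch of the torch.cuda.amp pattern (the toy model and data below are placeholders, not the actual networks in this repo):

```python
import torch
import torch.nn as nn

model = nn.Linear(80, 10).cuda()           # placeholder network, not ECAPA/ResNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()       # scales the loss to avoid fp16 underflow

for step in range(3):                      # toy loop on random data
    x = torch.randn(32, 80).cuda()
    y = torch.randint(0, 10, (32,)).cuda()
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():        # forward pass runs in mixed precision
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()          # backward on the scaled loss
    scaler.step(optimizer)                 # unscales gradients, then steps
    scaler.update()
```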
pip install -r requirements.txt
The pretrained models can be downloaded here.
A-Vox2.model: The speaker network (ECAPA-L) trained on VoxCeleb2
V-Vox2.model: The face network (ResNet18) trained on VoxCeleb2
V-Glint.model: The face network (ResNet50) trained on Glint360K
Create a pretrain folder in the root directory and put these models into it.
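Loading a checkpoint then looks roughly like this (a sketch only; the real loading code is in each folder's scripts, and the key names inside the checkpoint may differ):

```python
import torch

# Assumes the models sit in pretrain/ as described above.
state = torch.load("pretrain/A-Vox2.model", map_location="cpu")
# speaker_net.load_state_dict(state)  # with a matching ECAPA-L definition
```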
The .txt files can be found here.
The faces of VoxCeleb1 can be downloaded here.
There is no official link for downloading the videos of VoxCeleb2, so sorry, I cannot help with that - -.
The structure of the dataset looks like:
# VoxCeleb2
# ├── frame_align (face frames dataset, after alignment)
# │   ├── id00012 (speaker id)
# │   │   ├── 21Uxsk56VDQ (video id)
# │   │   │   ├── 00001 (utterance id)
# │   │   │   │   ├── 0010.jpg (face frames, I extract one frame every 0.4 seconds)
# │   │   │   │   ├── 0020.jpg
# │   │   │   │   ├── 0030.jpg
# │   │   │   │   ├── ...
# ├── wav (speech dataset)
# │   ├── id00012 (speaker id)
# │   │   ├── 21Uxsk56VDQ (video id)
# │   │   │   ├── 00001.wav (utterance id)
# │   │   │   ├── ...
# ├── train_all.txt (speaker_id-wav_file_name-duration)
# ├── train_all_clean.txt (speaker_id-wav_file_name-duration-audio_sim_score-visual_sim_score-clean_or_noisy)
# VoxCeleb1
# ├── frame_align (face frames dataset, similar to Vox2)
# ├── wav (speech dataset, similar to Vox2)
# ├── O_list.txt (data list of VoxCeleb1-O, wav_file_name-duration)
# ├── E_list.txt (data list of VoxCeleb1-E)
# ├── H_list.txt (data list of VoxCeleb1-H)
# ├── O_trials.txt (original test trials of VoxCeleb1-O)
# ├── E_trials.txt (original test trials of VoxCeleb1-E)
# ├── H_trials.txt (original test trials of VoxCeleb1-H)
The O_list.txt, E_list.txt and H_list.txt are used to speed up the testing process; train_all.txt is the original training list; train_all_clean.txt is the cleansed training list.
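For illustration, here is a toy parser for train_all_clean.txt, assuming the six fields listed above are whitespace-separated (check the actual file for the real delimiter and label values):

```python
def load_clean_list(path, keep_only_clean=True):
    # Assumed line format:
    # speaker_id wav_file_name duration audio_sim_score visual_sim_score clean_or_noisy
    entries = []
    with open(path) as f:
        for line in f:
            spk, wav, dur, a_score, v_score, label = line.split()
            if keep_only_clean and label != "clean":  # the "clean" label value is an assumption
                continue
            entries.append((spk, wav, float(dur), float(a_score), float(v_score)))
    return entries
```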
For face alignment, I do it based on here. My code is provided for reference (not written very well -_-...).
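Conceptually, the standard 5-landmark alignment is a similarity transform that warps the detected landmarks onto a fixed template; a rough sketch (assuming the landmarks are already detected by some external detector) could look like:

```python
import cv2
import numpy as np

# Widely used ArcFace-style 5-point template for a 112x112 crop
# (left eye, right eye, nose tip, left/right mouth corners).
TEMPLATE = np.array([[38.2946, 51.6963], [73.5318, 51.5014],
                     [56.0252, 71.7366], [41.5493, 92.3655],
                     [70.7299, 92.2041]], dtype=np.float32)

def align_face(image, landmarks):
    """Warp `image` so its 5 detected landmarks match the template."""
    M, _ = cv2.estimateAffinePartial2D(np.asarray(landmarks, np.float32), TEMPLATE)
    return cv2.warpAffine(image, M, (112, 112))
```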
In the speaker folder, we train the ECAPA-TDNN speaker network on VoxCeleb2. The details of speaker recognition can be found here: ECAPA-TDNN.
Modality | System | Vox1-O (EER%) | Vox1-E (EER%) | Vox1-H (EER%) |
---|---|---|---|---|
Speech | (1) ECAPA-L-Vox2 | 0.98 | 1.21 | 2.30 |
Note that the results in our paper are the mean of three training runs, so they may differ slightly from these results.
In the face folder, we train a face recognition model on VoxCeleb2. Here are the results:
Modality | System | Vox1-O (EER%) | Vox1-E (EER%) | Vox1-H (EER%) |
---|---|---|---|---|
Face | (2) ResNet18-Vox2 | 0.97 | 0.81 | 1.16 |
Face | (3) ResNet50-Glint | 0.03 | 0.07 | 0.09 |
Note that (3) is a model pretrained on the Glint360K dataset. We did not check for identity overlap between Glint360K (360K persons) and VoxCeleb (6K persons), since we lack the identity files. This result is only used to show that the multi-modal system is strong; it does not affect our cleansing purpose, since our final target is a cleansed VoxCeleb2.
We did not run experiments without alignment for both training and evaluation. For the face model pretrained on Glint360K, the results are poor if evaluation is done without alignment.
The pipeline of speaker-face recognition is in the speaker_face folder. We suggest training the two networks separately. Here are the results:
Modality | System | Vox1-O (EER%) | Vox1-E (EER%) | Vox1-H (EER%) |
---|---|---|---|---|
Fusion | (1)+(2) | 0.15 | 0.23 | 0.41 |
Fusion | (1)+(3) | 0.01 | 0.08 | 0.15 |
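For intuition, score-level fusion can be as simple as z-normalizing each modality's trial scores and mixing them. This is only an illustrative sketch (the function name and weighting here are ours; the repo's fusion code may normalize or weight differently):

```python
import numpy as np

def fuse_scores(audio_scores, face_scores, w=0.5):
    a = np.asarray(audio_scores, dtype=float)
    v = np.asarray(face_scores, dtype=float)
    a = (a - a.mean()) / a.std()   # z-norm so the two score scales are comparable
    v = (v - v.mean()) / v.std()
    return w * a + (1 - w) * v     # weighted sum over all trials
```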
Get into the speaker, face or speaker_face folder.
For training: bash run_train.sh
Note: set the training data path, the evaluation data path, and the output path.
For evaluation: bash run_eval.sh
Note: set the evaluation data path.
I have optimized the code to speed up the evaluation process (': minutes, '': seconds):
Modality | System | Vox1-O | Vox1-E | Vox1-H |
---|---|---|---|---|
Speech | (1) | 0'14'' | 5'30'' | 5'16'' |
Face | (2) | 0'13'' | 5'33'' | 5'30'' |
Face | (3) | 0'24'' | 11'26'' | 10'52'' |
Fusion | (1)+(2) | 0'25'' | 11'26'' | 10'55'' |
Fusion | (1)+(3) | 0'38'' | 16'57'' | 16'11'' |
For the speech modality only, we can evaluate Vox1-E and Vox1-H within 6 minutes (on one RTX 3090).
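A common way to reach this kind of speed (sketched here with hypothetical helper names, not necessarily the exact trick used in this code) is to embed each unique file once and score all trials from a cache:

```python
import torch
import torch.nn.functional as F

def score_trials(trials, embed_fn):
    """trials: (label, file1, file2) tuples; embed_fn maps a file to a 1-D embedding."""
    cache, scores = {}, []
    for _, f1, f2 in trials:
        for f in (f1, f2):
            if f not in cache:
                cache[f] = F.normalize(embed_fn(f), dim=0)      # embed each file only once
        scores.append(torch.dot(cache[f1], cache[f2]).item())   # cosine similarity
    return scores
```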