
FOP (ICASSP 2022)

Official implementation of FOP method as described in "Fusion and Orthogonal Projection for Improved Face-Voice Association".

Paper: https://arxiv.org/abs/2112.10483

Presentation: https://youtu.be/mnV7FSsKIuM

Proposed Methodology

(Left) Overall method. Fundamentally, it is a two-stream pipeline that generates face and voice embeddings. We propose a fusion and orthogonal projection (FOP) mechanism (dotted red box). (Right) The architecture of multimodal fusion.
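The fusion module itself is defined in the repository code; purely as an illustration, a minimal PyTorch sketch of a gated two-stream fusion is shown below. All names and dimensions here are assumptions for the sketch, not the repository's API.

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    # Hedged sketch: combine a face embedding and a voice embedding with a
    # learned per-dimension sigmoid gate. Dimensions are illustrative only.
    def __init__(self, dim_face=4096, dim_voice=512, dim_embed=128):
        super().__init__()
        self.proj_face = nn.Linear(dim_face, dim_embed)
        self.proj_voice = nn.Linear(dim_voice, dim_embed)
        self.gate = nn.Linear(2 * dim_embed, dim_embed)

    def forward(self, face, voice):
        f = self.proj_face(face)
        v = self.proj_voice(voice)
        g = torch.sigmoid(self.gate(torch.cat([f, v], dim=1)))  # per-dimension gate in (0, 1)
        return g * f + (1.0 - g) * v  # gated convex combination of the two streams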

Installation

We used python==3.6.5 and torch==1.8.0 for these experiments; the code may not run on other versions of Python/PyTorch. To install the dependencies, run:

pip install -r requirements.txt

To install PyTorch with CUDA support (for GPU):

conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch

Feature Extraction

We use the VoxCeleb1 dataset for the experiments in this work. The dataset and train/test splits can be downloaded here.

Facial Feature Extraction

For face embeddings we use VGGFace. We use the Keras implementation of this paper, available from this repository.
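For illustration, a hedged sketch of extracting an embedding with the keras_vggface package follows; the model variant and preprocessing used by the linked repository may differ, so treat this as an assumption.

# Hedged sketch: one way to extract a VGGFace embedding with keras_vggface.
# The linked repository's exact model variant/preprocessing may differ.
import numpy as np
from keras.preprocessing import image
from keras_vggface.vggface import VGGFace
from keras_vggface import utils

model = VGGFace(include_top=False, input_shape=(224, 224, 3), pooling='avg')

img = image.load_img('face.jpg', target_size=(224, 224))
x = image.img_to_array(img)[np.newaxis]   # shape (1, 224, 224, 3)
x = utils.preprocess_input(x, version=1)  # version=1 matches the VGG16 weights
embedding = model.predict(x)              # pooled conv features, e.g. shape (1, 512)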

Voice Feature Extraction

For voice embeddings we use the method described in Utterance Level Aggregator. The code, released by the authors, is publicly available here.

Once the features are extracted, write them to .csv files in the features directory. The .csv files for the train and test splits (random_unseen_unheard) can be downloaded here.
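One plausible layout is one row per sample, an identity label followed by the flattened embedding values. This is an assumption for the sketch below; verify it against the downloadable .csv files before use.

# Hedged sketch: write features as "label, v0, v1, ..." rows.
# The actual layout expected by main.py may differ; check the provided csv files.
import csv
import numpy as np

def write_features(path, labels, embeddings):
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        for label, emb in zip(labels, embeddings):
            writer.writerow([label] + emb.tolist())

write_features('features/train_faces.csv',
               ['id10001', 'id10002'],
               [np.random.rand(512), np.random.rand(512)])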

Training and Testing

Training

  • Linear Fusion:
python main.py --cuda 1 --save_dir ./model --lr 1e-5 --batch_size 128 --max_num_epoch 50 --alpha_list "[0.0, 0.1, 0.5, 1.0, 2.0, 5.0]" --dim_embed 128 --fusion linear --test_epoch 5
  • Gated Fusion:
python main.py --cuda 1 --save_dir ./model --lr 1e-5 --batch_size 128 --max_num_epoch 50 --alpha_list "[0.0, 0.1, 0.5, 1.0, 2.0, 5.0]" --dim_embed 128 --fusion gated --test_epoch 5
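The swept --alpha_list values presumably weight the orthogonality term against the classification loss, as described in the paper; this is an assumption, and the repository's exact formulation may differ. A minimal sketch of such a combined objective:

import torch
import torch.nn.functional as F

def fop_style_loss(fused, labels, logits, alpha=1.0):
    # Hedged sketch of a cross-entropy + orthogonality objective (not the
    # repository's exact loss). Same-identity embeddings are pulled together
    # (cosine -> 1); different identities are pushed toward orthogonality
    # (cosine -> 0), weighted by alpha.
    z = F.normalize(fused, p=2, dim=1)
    sim = z @ z.t()                                        # pairwise cosine similarities
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    diff = 1.0 - same
    pos = (sim * same).sum() / same.sum()
    neg = (sim * diff).abs().sum() / diff.sum().clamp(min=1.0)
    return F.cross_entropy(logits, labels) + alpha * ((1.0 - pos) + neg)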

Testing

  • Linear Fusion:
python test.py --cuda 1 --ckpt <path to checkpoint.pth.tar> --dim_embed 128 --fusion linear --alpha 1
  • Gated Fusion:
python test.py --cuda 1 --ckpt <path to checkpoint.pth.tar> --dim_embed 128 --fusion gated --alpha 1

Comparison

Cross-modal matching results: (Left) FOP vs. other losses used in face-voice association methods. (Right) Our method vs. state-of-the-art (SOTA) methods.

Citing FOP

@inproceedings{saeed2022fusion,
  title={Fusion and Orthogonal Projection for Improved Face-Voice Association},
  author={Saeed, Muhammad Saad and Khan, Muhammad Haris and Nawaz, Shah and Yousaf, Muhammad Haroon and Del Bue, Alessio},
  booktitle={ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={7057--7061},
  year={2022},
  organization={IEEE}
}
@article{saeed2022learning,
  title={Learning Branched Fusion and Orthogonal Projection for Face-Voice Association},
  author={Saeed, Muhammad Saad and Nawaz, Shah and Khan, Muhammad Haris and Javed, Sajid and Yousaf, Muhammad Haroon and Del Bue, Alessio},
  journal={arXiv preprint arXiv:2208.10238},
  year={2022}
}
