Scalable Multimodal Approach for Face Generation and Super Resolution using a Conditional Diffusion Model
This repository contains the code used to build, train, and evaluate the Speaking the Language of Faces (SLF) systems.
**Note:** I used two different machines, but only the second machine was able to produce all results due to VRAM requirements (VRAM >= 24 GB).
The machines' specifications:

| | Machine 1 | Machine 2 |
| --- | --- | --- |
| CPU | AMD Ryzen 9 6900HS | 12th Gen Intel(R) Core(TM) i9-12900KF |
| GPU | AMD Radeon RX 6700S (8 GB VRAM) | Quadro RTX 6000 (24 GB VRAM) |
| RAM | 16 GB | 32 GB |
| OS | [Ubuntu 22.04.2 LTS](https://ubuntu.com/download/desktop), kernel 5.15.0-67-generic | [Ubuntu 22.04.2 LTS](https://ubuntu.com/download/desktop), kernel 5.15.0-25-generic |
| GPU compute stack | [ROCm 5.4.3](https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.4.3/page/How_to_Install_ROCm.html) | [CUDA 12](https://developer.nvidia.com/cuda-downloads) |

**Note:** The GPU compute stack will dictate a certain kernel version that you will have to install. Please review its requirements carefully.
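Because the two machines use different GPU stacks (ROCm vs. CUDA), it is worth verifying up front that the GPU is visible. Below is a minimal sketch, assuming a PyTorch-based environment (an assumption; this README does not name the framework here). PyTorch's ROCm builds reuse the `torch.cuda` API, so the same check works on both machines:

```python
# Quick sanity check that the GPU is visible. Assumes PyTorch is
# installed; PyTorch's ROCm builds expose the same torch.cuda API,
# so this works on both the AMD and the NVIDIA machine.
import torch

print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("Device:", props.name)
    # The note above says >= 24 GB VRAM is needed to produce all results
    print(f"VRAM: {props.total_memory / 1024**3:.1f} GB")
```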
Software used:

- Python v3.10.6
- [Anaconda](https://www.anaconda.com/products/distribution) v22.9.0
- Git v2.34.1 (`sudo apt-get install git-all`)
- Visual Studio Code v1.77.0 (Ubuntu Software --> Visual Studio Code --> Install)
In a folder of your choosing, run the following in a terminal:

```
git clone --recurse-submodules https://github.com/AhmedGamal411/DiffusionSpeech2Face
```

This clones the repo with all of its submodules. (If you cloned without `--recurse-submodules`, you can fetch them afterwards with `git submodule update --init --recursive`.)

Open that folder with Visual Studio Code.

View the readme of every submodule to download and configure its dependencies.
Use [voxceleb_trainer](https://github.com/AhmedGamal411/voxceleb_trainer) to download the VoxCeleb2 dataset.
```
conda create -n ds2f python=3.8
conda activate ds2f
```

Install the modules listed in requirements.txt using either conda or pip (e.g., `pip install -r requirements.txt`).
Open `configuration.txt` and change the path of the dataset (`datasetPathVideo`). The directory structure doesn't matter, as long as it contains only videos. Then change `datasetPathDatabase`, `datasetPathAudio`, `datasetPathFrames` and `datasetPathFaces` to where you want the data to be extracted. The `configuration.txt` file is divided into parts.
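For orientation, here is a hypothetical sketch of reading such a key/value configuration in Python. The key names come from the step above, but the actual layout of `configuration.txt` may differ:

```python
# Hypothetical configuration reader. Assumes a simple "key = value"
# layout, which may not match the actual configuration.txt format.
def read_config(path="configuration.txt"):
    config = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):  # skip blanks/comments
                continue
            key, _, value = line.partition("=")
            config[key.strip()] = value.strip()
    return config

cfg = read_config()
print(cfg.get("datasetPathVideo"), cfg.get("datasetPathFaces"))
```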
Make sure you have run `conda activate ds2f` before running the scripts below.
[1] Creates a database file at `datasetPathDatabase` that holds the paths to the videos. This database file is used by the other scripts to store further information, such as paths to face images and the results of various analyses.
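To illustrate what step [1] produces, here is a minimal sketch that scans `datasetPathVideo` for videos and stores their paths. The real script's name, storage format, and schema are not shown in this README, so pandas/CSV here is only an assumption:

```python
# Hypothetical sketch of step [1]: collect all video paths under
# datasetPathVideo and persist them so later scripts can attach
# face-image paths and analysis results to each entry.
import os
import pandas as pd

datasetPathVideo = "/data/voxceleb2/videos"      # from configuration.txt
datasetPathDatabase = "/data/slf/database.csv"   # from configuration.txt

video_paths = []
for root, _dirs, files in os.walk(datasetPathVideo):
    for name in files:
        if name.lower().endswith((".mp4", ".avi", ".mkv", ".webm")):
            video_paths.append(os.path.join(root, name))

db = pd.DataFrame({"videoPath": sorted(video_paths)})
db.to_csv(datasetPathDatabase, index=False)
print(f"Stored {len(db)} video paths in {datasetPathDatabase}")
```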
[2] Extracts audio from the videos, then creates 3-, 6-, 12- and 24-second versions of each clip by either trimming or looping the audio, and then extracts speaker embeddings.
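The trim-or-loop idea of step [2] can be sketched as below (pure numpy on an already-decoded waveform; the actual extraction of audio from the videos and the speaker-embedding step via voxceleb_trainer are omitted):

```python
# Hypothetical sketch of the trim-or-loop step: produce fixed-length
# versions (3, 6, 12, 24 s) of a waveform by cutting long clips and
# tiling (looping) short ones.
import numpy as np

def fix_length(waveform: np.ndarray, sample_rate: int, seconds: int) -> np.ndarray:
    target = seconds * sample_rate
    if len(waveform) >= target:
        return waveform[:target]                # trim long clips
    repeats = -(-target // len(waveform))       # ceiling division
    return np.tile(waveform, repeats)[:target]  # loop short clips, then cut

# Example: a 10 s clip at 16 kHz is trimmed to 3/6 s and looped to 12/24 s
clip = np.random.randn(10 * 16000).astype(np.float32)
for s in (3, 6, 12, 24):
    print(s, "->", len(fix_length(clip, 16000, s)) / 16000, "s")
```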
[3] Run `python extractFaces.py` to extract faces from the video frames.
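Since DeepFace is listed under References, `extractFaces.py` presumably builds on it. Here is a minimal sketch of cropping faces from a single extracted frame, assuming a recent `deepface` release that provides `DeepFace.extract_faces` (the frame path is hypothetical):

```python
# Hypothetical sketch of face extraction from one frame with DeepFace.
# Assumes `pip install deepface`; the actual extractFaces.py may work
# differently (e.g., batch over every frame in the database).
import numpy as np
from PIL import Image
from deepface import DeepFace

frame_path = "frame_000.jpg"  # hypothetical file under datasetPathFrames

# Returns one dict per detected face: "face" (float image in [0, 1]),
# "facial_area" (bounding box) and "confidence".
faces = DeepFace.extract_faces(img_path=frame_path, enforce_detection=False)

for i, f in enumerate(faces):
    img = Image.fromarray((np.asarray(f["face"]) * 255).astype("uint8"))
    img.save(f"face_{i:02d}.png")  # would go under datasetPathFaces
    print(i, f["facial_area"], f["confidence"])
```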
References
=======
DeepFace
https://github.com/serengil/deepface
@inproceedings{serengil2020lightface,
  title = {LightFace: A Hybrid Deep Face Recognition Framework},
  author = {Serengil, Sefik Ilkin and Ozpinar, Alper},
  booktitle = {2020 Innovations in Intelligent Systems and Applications Conference (ASYU)},
  pages = {23-27},
  year = {2020},
  doi = {10.1109/ASYU50717.2020.9259802},
  url = {https://doi.org/10.1109/ASYU50717.2020.9259802},
  organization = {IEEE}
}
@inproceedings{serengil2021lightface,
  title = {HyperExtended LightFace: A Facial Attribute Analysis Framework},
  author = {Serengil, Sefik Ilkin and Ozpinar, Alper},
  booktitle = {2021 International Conference on Engineering and Emerging Technologies (ICEET)},
  pages = {1-4},
  year = {2021},
  doi = {10.1109/ICEET53442.2021.9659697},
  url = {https://doi.org/10.1109/ICEET53442.2021.9659697},
  organization = {IEEE}
}
@misc{serengil2023db,
  author = {Serengil, Sefik Ilkin and Ozpinar, Alper},
  title = {An Evaluation of SQL and NoSQL Databases for Facial Recognition Pipelines},
  year = {2023},
  publisher = {Cambridge Open Engage},
  doi = {10.33774/coe-2023-18rcn},
  url = {https://doi.org/10.33774/coe-2023-18rcn},
  howpublished = {https://www.cambridge.org/engage/coe/article-details/63f3e5541d2d184063d4f569},
  note = {Preprint}
}
voxceleb_trainer
https://github.com/clovaai/voxceleb_trainer
[1] In defence of metric learning for speaker recognition
@inproceedings{chung2020in,
  title = {In defence of metric learning for speaker recognition},
  author = {Chung, Joon Son and Huh, Jaesung and Mun, Seongkyu and Lee, Minjae and Heo, Hee Soo and Choe, Soyeon and Ham, Chiheon and Jung, Sunghwan and Lee, Bong-Jin and Han, Icksang},
  booktitle = {Proc. Interspeech},
  year = {2020}
}
[2] The ins and outs of speaker recognition: lessons from VoxSRC 2020
@inproceedings{kwon2021ins,
  title = {The ins and outs of speaker recognition: lessons from {VoxSRC} 2020},
  author = {Kwon, Yoohwan and Heo, Hee Soo and Lee, Bong-Jin and Chung, Joon Son},
  booktitle = {Proc. ICASSP},
  year = {2021}
}
[3] Pushing the limits of raw waveform speaker recognition
@inproceedings{jung2022pushing,
  title = {Pushing the limits of raw waveform speaker recognition},
  author = {Jung, Jee-weon and Kim, You Jin and Heo, Hee-Soo and Lee, Bong-Jin and Kwon, Youngki and Chung, Joon Son},
  booktitle = {Proc. Interspeech},
  year = {2022}
}
TODO: finish documentation
Thank you for using and supporting this project! If you have found this code helpful or have used it in your own work, please consider acknowledging the project. Here are a few ways you can do this:
- Mention in Your Project's Documentation:
This project uses code from ["Speaking the Language of Faces (SLF)" Scalable Multimodal Approach for Face Generation and Super Resolution using a Conditional Diffusion Model](https://github.com/AhmedGamal411/DiffusionSpeech2Face).
- Cite in Academic Papers:
Abotaleb, A., "Speaking the Language of Faces (SLF)": Scalable Multimodal Approach for Face Generation and Super Resolution using a Conditional Diffusion Model, GitHub repository, 2024. Available at: https://github.com/AhmedGamal411/DiffusionSpeech2Face
- Link Back to This Repository:
If you have a website or a blog where you share your projects, consider adding a link back to this repository.
"Speaking the Language of Faces (SLF)" Scalable Multimodal Approach for Face Generation and Super Resolution using a Conditional Diffusion Model © 2024 by Ahmed Abotaleb is licensed under CC BY 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/