"Speaking the Language of Faces (SLF)"

Scalable Multimodal Approach for Face Generation and Super Resolution using a Conditional Diffusion Model

This repository contains the code used to build, train and evaluate the SLF systems.

Requirements

** Note: Two machines were used, but only the second could produce all results because of VRAM requirements (VRAM >= 24 GB). **

CPU:

AMD Ryzen 9 6900HS CPU.
12th Gen Intel(R) Core(TM) i9-12900KF

GPU:

AMD Radeon RX 6700S, VRAM = 8GB
Quadro RTX 6000, VRAM = 24GB

RAM:

16 GB
32 GB

Ubuntu


https://ubuntu.com/download/desktop

Ubuntu 22.04.2 LTS, Kernel: 5.15.0-67-generic
Ubuntu 22.04.2 LTS, Kernel: 5.15.0-25-generic

CUDA (NVIDIA GPU) or ROCm (AMD GPU)

** Note: Your choice dictates which kernel version you have to install. Please review the requirements carefully. **

ROCm 5.4.3 https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.4.3/page/How_to_Install_ROCm.html
CUDA 12 https://developer.nvidia.com/cuda-downloads
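
As a rough way to check the VRAM requirement noted above on an NVIDIA machine, the sketch below reads the total memory per GPU from nvidia-smi. It is an illustration, not part of this repository: the function names and the 24,000 MiB threshold are assumptions (a "24 GB" card may report slightly less than 24 × 1024 MiB), and it does not cover AMD GPUs.

```python
import subprocess

# Assumption: a lenient threshold for a "24 GB" card, in MiB.
REQUIRED_MIB = 24_000


def parse_vram_mib(smi_output):
    """Parse one integer (MiB) per non-empty line of nvidia-smi CSV output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def query_vram_mib():
    """Ask nvidia-smi for the total memory of each GPU (needs the NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_vram_mib(result.stdout)


def has_enough_vram(vram_mib, required_mib=REQUIRED_MIB):
    """Return True if at least one GPU meets the threshold."""
    return any(size >= required_mib for size in vram_mib)


if __name__ == "__main__":
    try:
        sizes = query_vram_mib()
    except (FileNotFoundError, subprocess.CalledProcessError):
        sizes = []  # nvidia-smi missing or failed: treat as no usable GPU
    print("GPUs (MiB):", sizes, "| enough VRAM:", has_enough_vram(sizes))
```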

Python

v3.10.6

Anaconda Environment

v22.9.0


https://www.anaconda.com/products/distribution

Git

v2.34.1


sudo apt-get install git-all

Visual Studio Code (you can use any development environment, or none if you prefer)

v1.77.0


On Ubuntu Software --> Visual Studio Code --> Install

In a folder of your choosing, run this in a terminal:


git clone --recurse-submodules https://github.com/AhmedGamal411/DiffusionSpeech2Face

This clones the repository together with all its submodules.

Open that folder with Visual Studio Code.

Read the readme of every submodule to download and configure its dependencies.
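
If the repository was cloned without --recurse-submodules, the submodule folders exist but are empty. The sketch below is one way to detect that, assuming the usual .gitmodules file at the repository root; the function names are illustrative and not part of this project.

```python
import re
from pathlib import Path


def submodule_paths(repo_root):
    """Return the 'path = ...' entries listed in .gitmodules, if the file exists."""
    gitmodules = Path(repo_root) / ".gitmodules"
    if not gitmodules.is_file():
        return []
    pattern = re.compile(r"^\s*path\s*=\s*(.+?)\s*$")
    paths = []
    for line in gitmodules.read_text().splitlines():
        match = pattern.match(line)
        if match:
            paths.append(match.group(1))
    return paths


def uninitialised_submodules(repo_root):
    """Return submodule paths whose folder is missing or empty."""
    root = Path(repo_root)
    missing = []
    for rel in submodule_paths(root):
        folder = root / rel
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    # Run from the repository root; an empty list means every submodule has content.
    print(uninitialised_submodules("."))
```

If the list is not empty, running git submodule update --init --recursive inside the repository fetches the missing submodules.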

Submodules

Voxceleb Trainer


https://github.com/AhmedGamal411/voxceleb_trainer

Used to download the VoxCeleb2 dataset.

Python Environment


conda create -n ds2f python=3.8

conda activate ds2f

Install the modules listed in requirements.txt using either conda or pip.
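
After installing, a quick check inside the ds2f environment can confirm that nothing from requirements.txt was skipped. The following is only a sketch: the helper names are made up for illustration, and it handles simple "name==version" style lines rather than every requirements.txt feature. Note that conda and pip occasionally use different names for the same package, so a reported miss is a hint, not proof.

```python
import re
from importlib import metadata


def requirement_names(text):
    """Extract bare package names from simple requirements.txt content."""
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names):
    """Return the names that have no installed distribution metadata."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    try:
        with open("requirements.txt") as handle:
            print("Missing:", missing_packages(requirement_names(handle.read())))
    except FileNotFoundError:
        print("requirements.txt not found in the current folder")
```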

Configuration

Open configuration.txt and change the path of the dataset (datasetPathVideo). The folder structure does not matter, as long as it contains only videos. Then set datasetPathDatabase, datasetPathAudio, datasetPathFrames and datasetPathFaces to the locations where you want the data to be extracted.

The 'configuration.txt' file is divided into parts.
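
The exact layout of configuration.txt is not described here, so the sketch below assumes an INI-style file in which the path settings live under a section called [paths]. Both the section name and the sample values are hypothetical; only the five key names come from the text above. It shows how such a file could be read and checked for missing keys.

```python
import configparser

# Key names taken from the instructions above.
REQUIRED_KEYS = (
    "datasetPathVideo",
    "datasetPathDatabase",
    "datasetPathAudio",
    "datasetPathFrames",
    "datasetPathFaces",
)

# Hypothetical content: the [paths] section and all values are assumptions.
SAMPLE = """
[paths]
datasetPathVideo = /data/slf/videos
datasetPathDatabase = /data/slf/dataset.db
datasetPathAudio = /data/slf/audio
datasetPathFrames = /data/slf/frames
datasetPathFaces = /data/slf/faces
"""


def load_paths(text, section="paths"):
    """Parse INI-style text and return the chosen section as a dict."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep camelCase keys instead of lowercasing them
    parser.read_string(text)
    return dict(parser[section])


def missing_keys(paths):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not paths.get(key)]


if __name__ == "__main__":
    paths = load_paths(SAMPLE)
    print(paths["datasetPathVideo"], "| missing:", missing_keys(paths))
```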

Running files

Make sure you run conda activate ds2f before running any of the files below.

dbCreateAndPopulate.py

Creates a database file at "datasetPathDatabase" to hold the paths to the videos. Other scripts use this database file to store further information, such as paths to face images and the results of various analyses.
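
To make the idea concrete, here is a minimal sketch of such a step. It is not the code of dbCreateAndPopulate.py: it assumes SQLite as the database engine, and the table name (videos), column names and function name are invented for illustration. Because the dataset folder may have any structure, every file under it is collected recursively.

```python
import sqlite3
from pathlib import Path


def create_and_populate(db_path, video_root):
    """Create the database if needed and record every file found under video_root.

    Returns the number of video paths stored after the run.
    """
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS videos ("
            "id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL)"
        )
        files = sorted(str(p) for p in Path(video_root).rglob("*") if p.is_file())
        # INSERT OR IGNORE keeps the script safe to re-run on the same folder.
        conn.executemany(
            "INSERT OR IGNORE INTO videos (path) VALUES (?)",
            [(f,) for f in files],
        )
    count = conn.execute("SELECT COUNT(*) FROM videos").fetchone()[0]
    conn.close()
    return count
```

Later scripts could then add their own columns or tables (for example face image paths) keyed on the video id, which is one plausible reason to keep everything in a single database file.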

extractAudio.py

Extracts the audio from the videos, creates 3, 6, 12 and 24 second versions of each audio clip by either trimming or looping it, then extracts speaker embeddings.
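
The trimming and looping step can be sketched as follows, assuming the figures 3, 6, 12 and 24 are target durations in seconds. This is an illustration on plain lists of samples, not the code of extractAudio.py; the function names are hypothetical.

```python
TARGET_SECONDS = (3, 6, 12, 24)  # assumption: durations in seconds


def fit_to_length(samples, sample_rate, seconds):
    """Trim a clip that is too long, or loop one that is too short."""
    if not samples:
        raise ValueError("cannot loop an empty clip")
    target = int(sample_rate * seconds)
    if len(samples) >= target:
        return list(samples[:target])  # trim
    repeats = -(-target // len(samples))  # ceiling division
    return (list(samples) * repeats)[:target]  # loop, then cut to size


def all_versions(samples, sample_rate):
    """Return one fixed-length version of the clip per target duration."""
    return {s: fit_to_length(samples, sample_rate, s) for s in TARGET_SECONDS}


if __name__ == "__main__":
    clip = [0.1, 0.2, 0.3, 0.4]  # toy clip: 4 samples at 2 samples per second
    print(fit_to_length(clip, sample_rate=2, seconds=3))
```

Each fixed-length version would then be passed to the speaker embedding model.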

[3] python extractFaces.py

References


Mentions of Projects used

DeepFace


https://github.com/serengil/deepface


@inproceedings{serengil2020lightface,
  title = {LightFace: A Hybrid Deep Face Recognition Framework},
  author = {Serengil, Sefik Ilkin and Ozpinar, Alper},
  booktitle = {2020 Innovations in Intelligent Systems and Applications Conference (ASYU)},
  pages = {23-27},
  year = {2020},
  doi = {10.1109/ASYU50717.2020.9259802},
  url = {https://doi.org/10.1109/ASYU50717.2020.9259802},
  organization = {IEEE}
}


@inproceedings{serengil2021lightface,
  title = {HyperExtended LightFace: A Facial Attribute Analysis Framework},
  author = {Serengil, Sefik Ilkin and Ozpinar, Alper},
  booktitle = {2021 International Conference on Engineering and Emerging Technologies (ICEET)},
  pages = {1-4},
  year = {2021},
  doi = {10.1109/ICEET53442.2021.9659697},
  url = {https://doi.org/10.1109/ICEET53442.2021.9659697},
  organization = {IEEE}
}


@misc{serengil2023db,
  author = {Serengil, Sefik Ilkin and Ozpinar, Alper},
  title = {An Evaluation of SQL and NoSQL Databases for Facial Recognition Pipelines},
  year = {2023},
  publisher = {Cambridge Open Engage},
  doi = {10.33774/coe-2023-18rcn},
  url = {https://doi.org/10.33774/coe-2023-18rcn},
  howpublished = {https://www.cambridge.org/engage/coe/article-details/63f3e5541d2d184063d4f569},
  note = {Preprint}
}

Mentions of Projects and Libraries used

Voxceleb Trainer


https://github.com/clovaai/voxceleb_trainer

[1] In defence of metric learning for speaker recognition


@inproceedings{chung2020in,
  title={In defence of metric learning for speaker recognition},
  author={Chung, Joon Son and Huh, Jaesung and Mun, Seongkyu and Lee, Minjae and Heo, Hee Soo and Choe, Soyeon and Ham, Chiheon and Jung, Sunghwan and Lee, Bong-Jin and Han, Icksang},
  booktitle={Proc. Interspeech},
  year={2020}
}

[2] The ins and outs of speaker recognition: lessons from VoxSRC 2020


@inproceedings{kwon2021ins,
  title={The ins and outs of speaker recognition: lessons from {VoxSRC} 2020},
  author={Kwon, Yoohwan and Heo, Hee Soo and Lee, Bong-Jin and Chung, Joon Son},
  booktitle={Proc. ICASSP},
  year={2021}
}

[3] Pushing the limits of raw waveform speaker recognition


@inproceedings{jung2022pushing,
  title={Pushing the limits of raw waveform speaker recognition},
  author={Jung, Jee-weon and Kim, You Jin and Heo, Hee-Soo and Lee, Bong-Jin and Kwon, Youngki and Chung, Joon Son},
  booktitle={Proc. Interspeech},
  year={2022}
}

TODO: finish documentation

Contributing and Attribution

Thank you for using and supporting this project! If you have found this code helpful or have used it in your own work, please consider acknowledging the project. Here are a few ways you can do this:

Mention in your Project's Documentation.

This project uses code from ["Speaking the Language of Faces (SLF)" Scalable Multimodal Approach for Face Generation and Super Resolution using a Conditional Diffusion Model](https://github.com/AhmedGamal411/DiffusionSpeech2Face).

Cite in Academic Papers:

Abotaleb, A., "Speaking the Language of Faces (SLF)" Scalable Multimodal Approach for Face Generation and Super Resolution using a Conditional Diffusion Model, GitHub repository, 2024. Available at: https://github.com/AhmedGamal411/DiffusionSpeech2Face

Link Back to This Repository:

If you have a website or a blog where you share your projects, consider adding a link back to this repository.

License

"Speaking the Language of Faces (SLF)" Scalable Multimodal Approach for Face Generation and Super Resolution using a Conditional Diffusion Model © 2024 by Ahmed Abotaleb is licensed under CC BY 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/
