Scalable Multimodal Approach for Face Generation and Super Resolution using a Conditional Diffusion Model
This repository contains the code used to build, train, and evaluate the Speaking the Language of Faces (SLF) systems.
**Note:** I used two different machines, but only the second machine was able to produce all results due to VRAM requirements (VRAM >= 24 GB).
The machines' specifications:

| | Machine 1 | Machine 2 |
| --- | --- | --- |
| CPU | AMD Ryzen 9 6900HS | 12th Gen Intel(R) Core(TM) i9-12900KF |
| GPU | AMD Radeon RX 6700S (8 GB VRAM) | Quadro RTX 6000 (24 GB VRAM) |
| RAM | 16 GB | 32 GB |
| OS | [Ubuntu 22.04.2 LTS](https://ubuntu.com/download/desktop), kernel 5.15.0-67-generic | [Ubuntu 22.04.2 LTS](https://ubuntu.com/download/desktop), kernel 5.15.0-25-generic |
| GPU compute stack | [ROCm 5.4.3](https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.4.3/page/How_to_Install_ROCm.html) | [CUDA 12](https://developer.nvidia.com/cuda-downloads) |

**Note:** The GPU compute stack will dictate a certain kernel version that you will have to install. Please review its requirements carefully.
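Because the two machines use different GPU stacks (ROCm vs. CUDA), it is worth verifying up front that the GPU is visible. Below is a minimal sketch, assuming a PyTorch-based environment (an assumption; this README does not name the framework here). PyTorch's ROCm builds reuse the `torch.cuda` API, so the same check works on both machines:

```python
# Quick sanity check that the GPU is visible. Assumes PyTorch is
# installed; PyTorch's ROCm builds expose the same torch.cuda API,
# so this works on both the AMD and the NVIDIA machine.
import torch

print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("Device:", props.name)
    # The note above says >= 24 GB VRAM is needed to produce all results
    print(f"VRAM: {props.total_memory / 1024**3:.1f} GB")
```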
Software used:

- Python v3.10.6
- [Anaconda](https://www.anaconda.com/products/distribution) v22.9.0
- Git v2.34.1 (`sudo apt-get install git-all`)
- Visual Studio Code v1.77.0 (Ubuntu Software --> Visual Studio Code --> Install)
In a folder of your choosing, run the following in a terminal:

```
git clone --recurse-submodules https://github.com/AhmedGamal411/DiffusionSpeech2Face
```

This clones the repo with all of its submodules. (If you cloned without `--recurse-submodules`, you can fetch them afterwards with `git submodule update --init --recursive`.)

Open that folder with Visual Studio Code.

View the readme of every submodule to download and configure its dependencies.
Use [voxceleb_trainer](https://github.com/AhmedGamal411/voxceleb_trainer) to download the VoxCeleb2 dataset.
```
conda create -n ds2f python=3.8
conda activate ds2f
```

Install the modules listed in requirements.txt using either conda or pip (e.g., `pip install -r requirements.txt`).
Open `configuration.txt` and change the path of the dataset (`datasetPathVideo`). The directory structure doesn't matter, as long as it contains only videos. Then change `datasetPathDatabase`, `datasetPathAudio`, `datasetPathFrames` and `datasetPathFaces` to where you want the data to be extracted. The `configuration.txt` file is divided into parts.
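For orientation, here is a hypothetical sketch of reading such a key/value configuration in Python. The key names come from the step above, but the actual layout of `configuration.txt` may differ:

```python
# Hypothetical configuration reader. Assumes a simple "key = value"
# layout, which may not match the actual configuration.txt format.
def read_config(path="configuration.txt"):
    config = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):  # skip blanks/comments
                continue
            key, _, value = line.partition("=")
            config[key.strip()] = value.strip()
    return config

cfg = read_config()
print(cfg.get("datasetPathVideo"), cfg.get("datasetPathFaces"))
```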
Make sure you have run `conda activate ds2f` before running the scripts below.
[1] Creates a database file at `datasetPathDatabase` that holds the paths to the videos. This database file is used by the other scripts to store further information, such as paths to face images and the results of various analyses.
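To illustrate what step [1] produces, here is a minimal sketch that scans `datasetPathVideo` for videos and stores their paths. The real script's name, storage format, and schema are not shown in this README, so pandas/CSV here is only an assumption:

```python
# Hypothetical sketch of step [1]: collect all video paths under
# datasetPathVideo and persist them so later scripts can attach
# face-image paths and analysis results to each entry.
import os
import pandas as pd

datasetPathVideo = "/data/voxceleb2/videos"      # from configuration.txt
datasetPathDatabase = "/data/slf/database.csv"   # from configuration.txt

video_paths = []
for root, _dirs, files in os.walk(datasetPathVideo):
    for name in files:
        if name.lower().endswith((".mp4", ".avi", ".mkv", ".webm")):
            video_paths.append(os.path.join(root, name))

db = pd.DataFrame({"videoPath": sorted(video_paths)})
db.to_csv(datasetPathDatabase, index=False)
print(f"Stored {len(db)} video paths in {datasetPathDatabase}")
```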
[2] Extracts audio from the videos, then creates 3-, 6-, 12- and 24-second versions of each clip by either trimming or looping the audio, and then extracts speaker embeddings.
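The trim-or-loop idea of step [2] can be sketched as below (pure numpy on an already-decoded waveform; the actual extraction of audio from the videos and the speaker-embedding step via voxceleb_trainer are omitted):

```python
# Hypothetical sketch of the trim-or-loop step: produce fixed-length
# versions (3, 6, 12, 24 s) of a waveform by cutting long clips and
# tiling (looping) short ones.
import numpy as np

def fix_length(waveform: np.ndarray, sample_rate: int, seconds: int) -> np.ndarray:
    target = seconds * sample_rate
    if len(waveform) >= target:
        return waveform[:target]                # trim long clips
    repeats = -(-target // len(waveform))       # ceiling division
    return np.tile(waveform, repeats)[:target]  # loop short clips, then cut

# Example: a 10 s clip at 16 kHz is trimmed to 3/6 s and looped to 12/24 s
clip = np.random.randn(10 * 16000).astype(np.float32)
for s in (3, 6, 12, 24):
    print(s, "->", len(fix_length(clip, 16000, s)) / 16000, "s")
```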
[3] Run `python extractFaces.py` to extract faces from the video frames.
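Since DeepFace is listed under References, `extractFaces.py` presumably builds on it. Here is a minimal sketch of cropping faces from a single extracted frame, assuming a recent `deepface` release that provides `DeepFace.extract_faces` (the frame path is hypothetical):

```python
# Hypothetical sketch of face extraction from one frame with DeepFace.
# Assumes `pip install deepface`; the actual extractFaces.py may work
# differently (e.g., batch over every frame in the database).
import numpy as np
from PIL import Image
from deepface import DeepFace

frame_path = "frame_000.jpg"  # hypothetical file under datasetPathFrames

# Returns one dict per detected face: "face" (float image in [0, 1]),
# "facial_area" (bounding box) and "confidence".
faces = DeepFace.extract_faces(img_path=frame_path, enforce_detection=False)

for i, f in enumerate(faces):
    img = Image.fromarray((np.asarray(f["face"]) * 255).astype("uint8"))
    img.save(f"face_{i:02d}.png")  # would go under datasetPathFaces
    print(i, f["facial_area"], f["confidence"])
```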
References
=======
DeepFace
https://github.com/serengil/deepface
@inproceedings{serengil2020lightface,
  title = {LightFace: A Hybrid Deep Face Recognition Framework},
  author = {Serengil, Sefik Ilkin and Ozpinar, Alper},
  booktitle = {2020 Innovations in Intelligent Systems and Applications Conference (ASYU)},
  pages = {23-27},
  year = {2020},
  doi = {10.1109/ASYU50717.2020.9259802},
  url = {https://doi.org/10.1109/ASYU50717.2020.9259802},
  organization = {IEEE}
}
@inproceedings{serengil2021lightface,
  title = {HyperExtended LightFace: A Facial Attribute Analysis Framework},
  author = {Serengil, Sefik Ilkin and Ozpinar, Alper},
  booktitle = {2021 International Conference on Engineering and Emerging Technologies (ICEET)},
  pages = {1-4},
  year = {2021},
  doi = {10.1109/ICEET53442.2021.9659697},
  url = {https://doi.org/10.1109/ICEET53442.2021.9659697},
  organization = {IEEE}
}
@misc{serengil2023db,
  author = {Serengil, Sefik Ilkin and Ozpinar, Alper},
  title = {An Evaluation of SQL and NoSQL Databases for Facial Recognition Pipelines},
  year = {2023},
  publisher = {Cambridge Open Engage},
  doi = {10.33774/coe-2023-18rcn},
  url = {https://doi.org/10.33774/coe-2023-18rcn},
  howpublished = {https://www.cambridge.org/engage/coe/article-details/63f3e5541d2d184063d4f569},
  note = {Preprint}
}
voxceleb_trainer
https://github.com/clovaai/voxceleb_trainer
[1] In defence of metric learning for speaker recognition
@inproceedings{chung2020in,
  title = {In defence of metric learning for speaker recognition},
  author = {Chung, Joon Son and Huh, Jaesung and Mun, Seongkyu and Lee, Minjae and Heo, Hee Soo and Choe, Soyeon and Ham, Chiheon and Jung, Sunghwan and Lee, Bong-Jin and Han, Icksang},
  booktitle = {Proc. Interspeech},
  year = {2020}
}
[2] The ins and outs of speaker recognition: lessons from VoxSRC 2020
@inproceedings{kwon2021ins,
  title = {The ins and outs of speaker recognition: lessons from {VoxSRC} 2020},
  author = {Kwon, Yoohwan and Heo, Hee Soo and Lee, Bong-Jin and Chung, Joon Son},
  booktitle = {Proc. ICASSP},
  year = {2021}
}
[3] Pushing the limits of raw waveform speaker recognition
@inproceedings{jung2022pushing,
  title = {Pushing the limits of raw waveform speaker recognition},
  author = {Jung, Jee-weon and Kim, You Jin and Heo, Hee-Soo and Lee, Bong-Jin and Kwon, Youngki and Chung, Joon Son},
  booktitle = {Proc. Interspeech},
  year = {2022}
}
TODO: finish documentation
Thank you for using and supporting this project! If you have found this code helpful or have used it in your own work, please consider acknowledging the project. Here are a few ways you can do this:
- Mention in Your Project's Documentation:
This project uses code from ["Speaking the Language of Faces (SLF)" Scalable Multimodal Approach for Face Generation and Super Resolution using a Conditional Diffusion Model](https://github.com/AhmedGamal411/DiffusionSpeech2Face).
- Cite in Academic Papers:
Abotaleb, A., "Speaking the Language of Faces (SLF)": Scalable Multimodal Approach for Face Generation and Super Resolution using a Conditional Diffusion Model, GitHub repository, 2024. Available at: https://github.com/AhmedGamal411/DiffusionSpeech2Face
- Link Back to This Repository:
If you have a website or a blog where you share your projects, consider adding a link back to this repository.
"Speaking the Language of Faces (SLF)" Scalable Multimodal Approach for Face Generation and Super Resolution using a Conditional Diffusion Model © 2024 by Ahmed Abotaleb is licensed under CC BY 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/