This program classifies an MP3 file of human speech into one of two classes: American English or British English.

Put an MP3 file of someone's speech into the data/inference/mp3/ directory. The recording should ideally be 17-25 seconds long.

The algorithm works as follows:

Convert the MP3 file into a WAV file with FFmpeg.
Trim the first n and the last m seconds (default: n = 1, m = 1).
Remove the silent gaps and noise between sentences and words so that the speech becomes continuous.
Cut the resulting audio into pieces of a fixed length (default: 4 seconds) and save them.
Create spectrograms on different scales and combine them into one image. The short-time Fourier transform (STFT) and a variation of it are used to compute the spectrograms.

[Figure: example of a combined multi-scale spectrogram image]
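
As a rough illustration of the preprocessing steps above, here is a minimal NumPy sketch. It is not the code from this repository: the function names, the RMS threshold used to drop quiet frames, the STFT window sizes, and the choice to stack the spectrograms as image channels are all assumptions made for the example. Only the defaults n = 1, m = 1 and the 4-second chunk length come from the description.

```python
import subprocess

import numpy as np


def mp3_to_wav(src, dst):
    """Convert an MP3 file to WAV by calling the ffmpeg CLI (-y overwrites dst)."""
    subprocess.run(["ffmpeg", "-y", "-i", src, dst], check=True)


def trim_edges(samples, rate, n=1.0, m=1.0):
    """Drop the first n and the last m seconds of a mono signal."""
    start = int(round(n * rate))
    stop = len(samples) - int(round(m * rate))
    return samples[start:stop]


def drop_quiet_frames(samples, rate, frame_s=0.02, threshold=0.02):
    """Keep only short frames whose RMS exceeds a threshold.

    Assumption: a simple energy gate stands in for the real silence removal.
    """
    frame = int(round(frame_s * rate))
    usable = len(samples) - len(samples) % frame
    frames = samples[:usable].reshape(-1, frame)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return frames[rms > threshold].ravel()


def split_into_chunks(samples, rate, chunk_s=4.0):
    """Cut the signal into full chunks of chunk_s seconds.

    Assumption: a trailing remainder shorter than one chunk is discarded.
    """
    size = int(round(chunk_s * rate))
    return [samples[i * size:(i + 1) * size] for i in range(len(samples) // size)]


def stft_magnitude(samples, window, hop):
    """Log-magnitude STFT with a Hann window, shape (window // 2 + 1, n_frames)."""
    taper = np.hanning(window)
    starts = range(0, len(samples) - window + 1, hop)
    frames = np.stack([samples[s:s + window] * taper for s in starts])
    return np.log1p(np.abs(np.fft.rfft(frames, axis=1))).T


def multiscale_image(chunk, windows=(256, 512, 1024), size=(128, 128)):
    """Compute one spectrogram per window size and stack them as channels."""
    layers = []
    for window in windows:
        spec = stft_magnitude(chunk, window, hop=window // 4)
        rows = np.linspace(0, spec.shape[0] - 1, size[0]).astype(int)
        cols = np.linspace(0, spec.shape[1] - 1, size[1]).astype(int)
        layers.append(spec[np.ix_(rows, cols)])  # nearest-neighbour resize
    return np.stack(layers, axis=-1)
```

Given a mono signal `samples` with sample rate `rate`, the calls would be chained as `trim_edges`, then `drop_quiet_frames`, then `split_into_chunks`, and finally `multiscale_image` on each chunk. Short windows give finer time resolution and long windows finer frequency resolution, which is one common reason for combining several scales.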

Load a network that I have already trained. I took the ResNet50 network pre-trained on the ImageNet dataset and added a custom output layer (transfer learning), so it is a convolutional model.
