This repository contains the code, labels, and metadata for the AVASpeech-SMAD dataset, presented as a late-breaking demo at ISMIR 2021.
AVASpeech-SMAD: A Strongly Labelled Speech and Music Activity Detection Dataset with Label Co-Occurrence (arXiv:2111.01320)
- Yun-Ning Hung, Karn N. Watcharasupat, Chih-Wei Wu, Iroro Orife, Kelian Li, Pavan Seshadri, Junyoung Lee
@article{avaspeechSMAD,
  title   = {AVASpeech-SMAD: A Strongly Labelled Speech and Music Activity Detection Dataset with Label Co-Occurrence},
  author  = {Hung, Yun-Ning and Watcharasupat, Karn N. and Wu, Chih-Wei and Orife, Iroro and Li, Kelian and Seshadri, Pavan and Lee, Junyoung},
  year    = {2021},
  journal = {arXiv preprint arXiv:2111.01320}
}
Download audio:
- Install youtube-dl
- Run the download script:
  python3 process.py
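As an illustration of what the download step involves, the sketch below builds a youtube-dl invocation per clip. The output directory, WAV format, and URL pattern are assumptions for illustration; process.py may use different options.

```python
# Sketch: build the youtube-dl command that extracts audio for one clip.
# The video ID, output directory, and audio format here are illustrative
# assumptions, not necessarily the options process.py actually uses.

def build_command(video_id: str, out_dir: str = "audio") -> list:
    """Return a youtube-dl invocation extracting WAV audio for `video_id`."""
    return [
        "youtube-dl",
        "--extract-audio",          # keep the audio track only
        "--audio-format", "wav",    # decode to WAV
        "--output", f"{out_dir}/{video_id}.%(ext)s",
        f"https://www.youtube.com/watch?v={video_id}",
    ]

# Example: subprocess.run(build_command("SOME_VIDEO_ID"), check=True)
```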
Labels (in labels/):
- Speech labels: from the original AVA-Speech dataset [1]
- Music labels: manually created by the authors
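The exact label file layout is not specified above; assuming each file is a CSV of (start_time, end_time, label) rows with times in seconds, a minimal loader might look like this. The three-column format is a guess for illustration, not the repository's documented spec.

```python
import csv
from typing import List, Tuple

# Assumed layout: one "start,end,label" row per event, times in seconds.
# This is an illustrative assumption about the files in labels/.

def load_labels(path: str) -> List[Tuple[float, float, str]]:
    """Read strongly-labelled (start, end, label) events from a CSV file."""
    events = []
    with open(path, newline="") as f:
        for start, end, label in csv.reader(f):
            events.append((float(start), float(end), label.strip()))
    return events
```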
Benchmark results from existing models [2, 3] (in evaluation/):
Statistics:
- statistic.csv: music, speech, and overlap label percentages for each song
- distribution/: distribution of music, speech, and overlap label percentages over the entire dataset
- process.py: code to download the audio and compute the statistics
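For reference, the per-song music/speech/overlap percentages can be estimated frame-wise from the strong labels. The (start, end, label) event format and the "music"/"speech" label names below are assumptions for illustration; statistic.csv holds the values actually shipped with the dataset.

```python
# Sketch: frame-wise music/speech/overlap percentages from strong labels.
# Event format and label names are illustrative assumptions.

def percentages(events, duration, hop=0.01):
    """Return (music %, speech %, overlap %) over `duration` seconds,
    rasterised onto `hop`-second frames."""
    n = int(round(duration / hop))
    music = [False] * n
    speech = [False] * n
    for start, end, label in events:
        if label not in ("music", "speech"):
            continue  # ignore any other label names
        track = music if label == "music" else speech
        for i in range(int(round(start / hop)), min(n, int(round(end / hop)))):
            track[i] = True
    pct = lambda flags: 100.0 * sum(flags) / n
    overlap = [m and s for m, s in zip(music, speech)]
    return pct(music), pct(speech), pct(overlap)
```

For example, `percentages([(0.0, 5.0, "music"), (3.0, 8.0, "speech")], 10.0)` gives 50% music, 50% speech, and 20% overlap (the 3-5 s region carries both labels).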
Sounds labelled as music:
- Pitched sounds with more than one note
- Singing voice
- Idents (short broadcast jingles)
- Melodic ringtones
- Multiple instrumental sounds played simultaneously
- Any rhythmic sequence of musical elements (a moving melody, or drums/percussion)
Sounds not labelled as music:
- Ambient sound effects (e.g., low-frequency sounds)
- Pitched sounds with only one note (no moving melody)
- Traditional phone bell rings or buzzes with no apparent musical elements
Sounds labelled as speech:
- Human voice in any language
- Oh (considered speech, as in "Oh my!" or "Oh no!")
Sounds not labelled as speech:
- Singing with lyrics
- Sighing
- Screaming
- Laughing
- Ah, Hm, Uh-hum, Uh, Err
- Groaning, moaning, heavy breathing
[1] S. Chaudhuri, J. Roth, D. P. Ellis, A. Gallagher, L. Kaver, R. Marvin, C. Pantofaru, N. Reale, L. G. Reid, K. Wilson et al., "AVA-speech: A densely labeled dataset of speech activity in movies," in Proceedings of the 19th Annual Conference of the International Speech Communication Association, 2018.
[2] S. Venkatesh, D. Moffat, and E. R. Miranda, "Investigating the effects of training set synthesis for audio segmentation of radio broadcast," Electronics, vol. 10, no. 7, p. 827, 2021.
[3] D. Doukhan, E. Lechapt, M. Evrard, and J. Carrive, "INA's MIREX 2018 music and speech detection system," in 14th Music Information Retrieval Evaluation eXchange, 2018.
Contact: Yun-Ning (Amy) Hung