May: analyzing and report
matthijs van keirsbilck edited this page May 4, 2017 · 3 revisions
- Overview: why?
- SR in general
- audio
- lipreading
- General overview: history and research
- probabilistic models/Markov Chain/HMM/NN/RNN/LSTM/Deep networks/Language modeling/Attention mechanisms/...
- Dataset: TIMIT + TCDTIMIT
- overview (amount and quality of data, phoneme classes, train/test split)
- preprocessing (resampling, format, MFCC generation, normalization)
- Used network:
- why LSTM for speech
- why bidirectional & deep
- Training method (train/validation, LR, weight regularization, stopping)
- Performance analysis:
- inputs: number of MFCCs, number of derivatives, window length
- 61 vs 39 phonemes
- unidirectional vs bidirectional
- LSTM computational complexity: trade-offs for embedded HW applications (memory, operations, ...), e.g. conv layers: many ops, few params; FC layers: many params, few ops
- network depth
- number of units per layer
- TIMIT vs TCDTIMIT vs combined.
- Optional:
- impact of noisy audio data: white noise, different speakers
- General: history, viseme vs phoneme, research.
- Datasets:
- VidTimit, GRID, TCDTIMIT
- preprocessing (mouth extraction from video using labels, normalization, etc.)
- Used networks:
- why CNN for images
- different CNN networks: Deepmind, Cifar10, ResNet
- CNN-LSTM networks
- Training method (train/validation, LR, weight regularization, stopping)
- Performance analysis:
- CNN network comparison: # params/layers vs performance, train/evaluation time
- CNN-LSTM performance
- speaker dependence; lipspeaker vs volunteers. -> more data needed (like Deepmind)
- Optional:
- impact of noisy images
- binary Nets
- Research overview
- Used network:
- CNN / CNN-LSTM for lipreading
- LSTM for audio
- deep FC for combining
- Training method, specifics for combinedSR
- networks and performance
- lip CNN (conv features vs softmax vs intermediate) + audio LSTM
- lip CNN-LSTM + audio LSTM
- Performance analysis:
- audio/lip/combined performance on each phoneme
- weight comparison audio vs lip
- Optional:
- impact of bad audio
- inference when the loaded weights have reduced precision
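
The embedded-hardware trade-off listed under the audio performance analysis (conv layers need many operations but few parameters, FC layers the reverse) can be made concrete with a rough counting sketch. This is only an illustration under stated assumptions: the function names and all layer sizes are hypothetical, the LSTM is the standard formulation without peepholes, and operations are counted as multiply-accumulates (MACs) only, ignoring activations and element-wise gate products.

```python
def fc_cost(n_in, n_out):
    """Fully connected layer: one weight per input/output pair, plus biases."""
    params = n_in * n_out + n_out
    macs = n_in * n_out  # each weight is used once per input vector
    return params, macs


def conv2d_cost(c_in, c_out, k, out_h, out_w):
    """2D convolution with square k x k kernels.

    The same weights are reused at every output position, so the
    parameter count stays small while the MAC count grows with the
    output size.
    """
    params = c_out * (c_in * k * k + 1)
    macs = c_out * c_in * k * k * out_h * out_w
    return params, macs


def lstm_cost(n_in, n_hidden, bidirectional=False):
    """Standard LSTM (no peepholes): 4 gates, each with input weights,
    recurrent weights and a bias. MACs are counted per time step."""
    params = 4 * (n_in * n_hidden + n_hidden * n_hidden + n_hidden)
    macs = 4 * (n_in * n_hidden + n_hidden * n_hidden)
    if bidirectional:
        # forward and backward layers have independent weights
        params, macs = 2 * params, 2 * macs
    return params, macs


if __name__ == "__main__":
    # Hypothetical sizes, chosen only for illustration.
    layers = [
        ("conv 3x3, 1->32 channels, 120x120 output", conv2d_cost(1, 32, 3, 120, 120)),
        ("FC 512->39", fc_cost(512, 39)),
        ("LSTM 39->256", lstm_cost(39, 256)),
        ("BiLSTM 39->256", lstm_cost(39, 256, bidirectional=True)),
    ]
    for name, (params, macs) in layers:
        print(f"{name}: {params} params, {macs} MACs")
```

With these example sizes the conv layer has only a few hundred parameters but millions of MACs per image, while the LSTM parameter and MAC counts are nearly equal and double when the layer is made bidirectional.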