May: analysis and report

Report Thesis

General overview: history and research
- probabilistic models/Markov Chain/HMM/NN/RNN/LSTM/Deep networks/Language modeling/Attention mechanisms/...
Dataset: TIMIT + TCDTIMIT
- overview (amount and quality of data, phoneme classes, train/test split)
- preprocessing (resampling, format, MFCC generation, normalization)
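A minimal sketch of the normalization step above, assuming it means zero mean and unit variance per MFCC coefficient. Whether the statistics come from one utterance, one speaker, or the whole training set is left open here; the function name and input layout are hypothetical.

```python
import math


def normalize_features(frames):
    """Zero-mean, unit-variance normalization per coefficient.

    frames: list of equal-length lists, one list of MFCCs per frame
    (hypothetical layout, not taken from the thesis code).
    """
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    means = [sum(f[i] for f in frames) / n_frames for i in range(n_coeffs)]
    stds = []
    for i in range(n_coeffs):
        var = sum((f[i] - means[i]) ** 2 for f in frames) / n_frames
        # A constant coefficient has zero variance; divide by 1 instead.
        stds.append(math.sqrt(var) or 1.0)
    return [[(f[i] - means[i]) / stds[i] for i in range(n_coeffs)]
            for f in frames]
```

The same statistics would have to be reused at test time if they are estimated on the training set.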
Used network:
- why LSTM for speech
- why bidirectional & deep
Training method (train/validation, LR, weight regularization, stopping)
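One possible reading of "LR" and "stopping" above is learning-rate decay plus early stopping on the validation loss. The sketch below assumes that scheme; `run_epoch`, `lr_decay`, and `patience` are hypothetical names and defaults, not the settings actually used.

```python
def train_with_early_stopping(run_epoch, lr=0.01, lr_decay=0.5,
                              patience=3, max_epochs=100):
    """Train until the validation loss stops improving.

    run_epoch(lr) is a hypothetical callback that trains for one epoch
    at the given learning rate and returns the validation loss.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    history = []
    for epoch in range(max_epochs):
        val_loss = run_epoch(lr)
        history.append((epoch, lr, val_loss))
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            # No improvement: lower the learning rate and count the miss.
            epochs_without_improvement += 1
            lr *= lr_decay
            if epochs_without_improvement >= patience:
                break
    return best_loss, history
```

In practice the weights from the best epoch would also be saved and restored; that is omitted here.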
Performance analysis:
- inputs: number of MFCCs, number of derivatives, window length
- 61 vs 39 phonemes
- uni vs bidirectional
- LSTM computational complexity: trade-offs for embedded HW applications (memory, operations, ...), e.g. conv: many ops, few params; FC: many params, few ops
  - network depth
  - number of units per layer
- TIMIT vs TCDTIMIT vs combined.
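The memory/operations trade-off in the complexity item above can be made concrete with rough per-layer counts. This is a sketch under stated assumptions: a standard LSTM cell without peepholes or projection, operations counted as multiply-accumulates (MACs), and hypothetical function names.

```python
def lstm_layer_params(n_in, n_hidden, bidirectional=False):
    """Parameter count of one LSTM layer (no peepholes, no projection)."""
    # 4 gates, each with input weights, recurrent weights, and a bias.
    params = 4 * (n_in * n_hidden + n_hidden * n_hidden + n_hidden)
    # A bidirectional layer holds two independent LSTMs.
    return 2 * params if bidirectional else params


def fc_layer_cost(n_in, n_out):
    """(params, MACs) of a fully connected layer: each weight is used once."""
    params = n_in * n_out + n_out
    macs = n_in * n_out
    return params, macs


def conv_layer_cost(k, c_in, c_out, h_out, w_out):
    """(params, MACs) of a k x k conv layer: weights reused at every output pixel."""
    params = k * k * c_in * c_out + c_out
    macs = k * k * c_in * c_out * h_out * w_out
    return params, macs
```

For example, `conv_layer_cost(3, 3, 32, 32, 32)` has fewer than a thousand parameters but close to a million MACs, while `fc_layer_cost(1024, 10)` has roughly as many MACs as parameters. In a stacked network, the input size of each deeper LSTM layer is the hidden size of the previous one (doubled if that layer is bidirectional).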
Optional:
- impact of noisy audio data: white noise, different speakers
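For the white-noise experiment, one option is to add Gaussian noise scaled to a chosen signal-to-noise ratio. A minimal sketch, assuming the audio is a list of float samples; the function name, the seed, and the choice of Gaussian noise are illustrative assumptions.

```python
import math
import random


def add_white_noise(signal, snr_db, seed=0):
    """Return signal plus white Gaussian noise at the requested SNR (dB)."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    # SNR_dB = 10 * log10(signal_power / noise_power)
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(signal, noise)]
```

Scaling by the measured noise power, rather than the nominal variance, makes the resulting SNR match the target for every utterance.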

General: history, viseme vs phoneme, research.
Datasets:
- VidTimit, GRID, TCDTIMIT
- preprocessing (mouth extraction from video using labels, normalization, etc.)
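A possible sketch of the mouth extraction step, assuming the "labels" are (x, y) mouth landmark coordinates per frame. The function name, the relative margin, and the `frame_size` argument are hypothetical.

```python
def mouth_bounding_box(landmarks, margin=0.2, frame_size=None):
    """Bounding box (left, top, right, bottom) around mouth landmarks.

    landmarks: list of (x, y) points (assumed label format).
    margin: padding as a fraction of the box width/height.
    frame_size: optional (width, height) used to clamp the box.
    """
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)
    left = min(xs) - margin * width
    right = max(xs) + margin * width
    top = min(ys) - margin * height
    bottom = max(ys) + margin * height
    if frame_size is not None:
        # Keep the crop inside the frame.
        frame_w, frame_h = frame_size
        left, top = max(0, left), max(0, top)
        right, bottom = min(frame_w, right), min(frame_h, bottom)
    return tuple(int(round(v)) for v in (left, top, right, bottom))
```

The resulting box would then be cropped from the frame and resized to the fixed input size of the CNN.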
Used networks:
- why CNN for images
- different CNN networks: DeepMind, CIFAR-10, ResNet
- CNN-LSTM networks
Training method (train/validation, LR, weight regularization, stopping)
Performance analysis:
- CNN network comparison: # params/layers vs performance, train/evaluation time
- CNN-LSTM performance
- speaker dependence: lipspeakers vs volunteers; more data needed (like DeepMind)
Optional:
- impact of noisy images
- binary nets

Research overview
Used network:
- CNN / CNN-LSTM for lipreading
- LSTM for audio
- deep FC for combining
Training method, specifics for combinedSR
Networks and performance:
- lip CNN (conv features vs softmax vs intermediate) + audio LSTM
- lip CNN-LSTM + audio LSTM
Performance analysis:
- audio/lip/combined performance on each phoneme
- weight comparison audio vs lip
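The per-phoneme comparison above could be computed from aligned target and predicted label sequences. A minimal sketch, assuming frame-level labels as strings; the function name is hypothetical.

```python
from collections import defaultdict


def per_phoneme_accuracy(targets, predictions):
    """Fraction of correctly classified frames for each target phoneme."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for target, predicted in zip(targets, predictions):
        total[target] += 1
        if target == predicted:
            correct[target] += 1
    return {phoneme: correct[phoneme] / total[phoneme] for phoneme in total}
```

Running this once per model (audio, lip, combined) on the same test set gives three dictionaries that can be compared phoneme by phoneme.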
Optional:
- impact of bad audio
- inference with reduced-precision weights
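One way to emulate reduced-precision weights is uniform symmetric quantization of the trained weights before inference. The sketch below assumes that scheme and a flat list of floats; the function name and the per-list scale are illustrative choices.

```python
def quantize_weights(weights, n_bits=8):
    """Round weights to a symmetric n_bits grid; returns dequantized floats."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    levels = 2 ** (n_bits - 1) - 1  # e.g. 127 for 8 bits
    scale = max_abs / levels
    # Quantize to an integer level, then map back to the float range.
    return [round(w / scale) * scale for w in weights]
```

Evaluating the network with `quantize_weights` applied per layer, for decreasing `n_bits`, would show how much precision can be dropped before accuracy degrades.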