
May: analysis and report

matthijs van keirsbilck edited this page May 4, 2017 · 3 revisions

Thesis Report

Introduction

  • Overview: why?
    • SR in general
    • audio
    • lipreading

Audio SR

  • General overview: history and research
    • probabilistic models/Markov Chain/HMM/NN/RNN/LSTM/Deep networks/Language modeling/Attention mechanisms/...
  • Dataset: TIMIT + TCDTIMIT
    • overview (amount and quality of data, phoneme classes, train/test split)
    • preprocessing (resampling, format, MFCC generation, normalization)
  • Used network:
    • why LSTM for speech
    • why bidirectional & deep
  • Training method (train/validation, LR, weight regularization, stopping)
  • Performance analysis:
    • inputs: number of MFCCs, number of derivatives, window length
    • 61 vs 39 phonemes
    • unidirectional vs bidirectional
    • LSTM computational complexity: trade-offs for embedded hardware applications (memory, operations, ...), e.g. conv layers: many operations, few parameters; FC layers: many parameters, few operations
      • network depth
      • number of units per layer
    • TIMIT vs TCDTIMIT vs combined.
  • Optional:
    • impact of noisy audio data: white noise, different speakers
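The complexity item above (conv layers need many operations but few parameters, FC layers the reverse) can be checked with simple counting. The following is a minimal sketch, not the thesis code: the function names and all layer sizes are hypothetical, biases are counted as parameters, and the LSTM is assumed to be a standard four-gate layer without peephole connections.

```python
def fc_cost(n_in, n_out):
    """Fully connected layer: (parameters, multiply-accumulates per input)."""
    params = n_in * n_out + n_out      # weights + biases
    macs = n_in * n_out                # every weight is used exactly once
    return params, macs


def conv2d_cost(c_in, c_out, kernel, out_h, out_w):
    """Square-kernel 2D convolution: (parameters, multiply-accumulates per image)."""
    weights = c_in * c_out * kernel * kernel
    params = weights + c_out           # weights + biases
    macs = weights * out_h * out_w     # every weight is reused at each output pixel
    return params, macs


def lstm_cost(n_in, n_hidden):
    """Standard LSTM layer (4 gates, no peepholes): (parameters, MACs per time step)."""
    weights = 4 * n_hidden * (n_in + n_hidden)
    params = weights + 4 * n_hidden    # weights + biases
    macs = weights                     # every weight is used once per time step
    return params, macs


if __name__ == "__main__":
    # Hypothetical layer sizes, chosen only for illustration.
    for name, (params, macs) in [
        ("FC 1024 -> 1024", fc_cost(1024, 1024)),
        ("conv 3x3, 32 -> 64 channels, 60x60 output", conv2d_cost(32, 64, 3, 60, 60)),
        ("LSTM, 39 inputs, 256 units", lstm_cost(39, 256)),
    ]:
        print(f"{name}: {params} parameters, {macs} MACs")
```

A bidirectional layer runs two such LSTMs, so both counts double. Network depth and units per layer enter through `n_in` and `n_hidden`, because each deeper layer takes the previous layer's output as its input.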

Lipreading

  • General: history, viseme vs phoneme, research.
  • Datasets:
    • VidTimit, GRID, TCDTIMIT
    • preprocessing (mouth extraction from video using labels, normalization, etc.)
  • Used networks:
    • why CNN for images
    • different CNN networks: DeepMind, CIFAR-10, ResNet
    • CNN-LSTM networks
  • Training method (train/validation, LR, weight regularization, stopping)
  • Performance analysis:
    • CNN network comparison: # params/layers vs performance, train/evaluation time
    • CNN-LSTM performance
    • speaker dependence: lipspeakers vs volunteers, suggesting more data is needed (like DeepMind)
  • Optional:
    • impact of noisy images
    • binary networks

Multimodal SR

  • Research overview
  • Used network:
    • CNN / CNN-LSTM for lipreading
    • LSTM for audio
    • deep FC for combining
  • Training method, specifics for combinedSR
  • Networks and performance
    • lip CNN (conv features vs softmax vs intermediate) + audio LSTM
    • lip CNN-LSTM + audio LSTM
  • Performance analysis:
    • audio/lip/combined performance on each phoneme
    • weight comparison audio vs lip
  • Optional:
    • impact of bad audio
    • inference when the loaded weights have reduced precision
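One way to explore the reduced-precision item above is to quantize the stored weights before inference and compare the results with the full-precision run. The outline does not say which scheme would be used, so this minimal sketch assumes uniform symmetric quantization to signed integers; the function name `quantize` and the example weights are hypothetical.

```python
def quantize(weights, n_bits):
    """Round weights to an n_bits signed uniform grid and map them back to floats.

    Assumption: symmetric quantization with a single scale for the whole list,
    taken from the largest absolute weight.
    """
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)               # all-zero weights need no rounding
    levels = 2 ** (n_bits - 1) - 1         # e.g. 127 for 8 bits
    scale = max_abs / levels               # float value of one integer step
    return [round(w / scale) * scale for w in weights]


if __name__ == "__main__":
    weights = [0.0, 0.3, -1.0, 0.71]       # illustrative values only
    for bits in (8, 4, 2):
        approx = quantize(weights, bits)
        worst = max(abs(a - b) for a, b in zip(weights, approx))
        print(f"{bits} bits: {approx} (largest error {worst})")
```

The rounding error of each weight is at most half a step, so it grows as the bit width shrinks; the effect on phoneme accuracy would still have to be measured on the trained networks.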