
10.04 - 16.04

  1. More network evaluations: different depths, different numbers of units, trained on different data, etc.
  2. Combining Audio/Lipreading:
    • get valid frames from audio outputs

    • put the images per video file instead of all together. The network will then need to produce one output per timeframe: (frameNo, phonemePredictions) (see the first sketch after this list)

    • batch processing is an issue: each video has a different number of valid frames, and audio and video need different kinds of padding. For now, use batch_size = 1 to sidestep the issue.

    • general structure of the combinedSR program: buildNetworks -> buildAudioNetwork, buildLipreadingNetwork, combineNetworks (see the second sketch after this list)

    • combineNetworks can use Lasagne's ConcatLayer to concatenate the features, giving something like (# valid frames, 78): the first 39 predictions come from audio, the other 39 from lipreading. Add some dense layers behind it to combine them optimally, then output the 39 phonemes.

    • batch sizes:
      I need a batch size of len(valid_frames) for the lipreading part, to reuse the net for all the images of one video. I'm not sure whether I can also add a second batch size for processing multiple videos in parallel; that would require a lot of memory, possibly too much.
      If I preload the networks with the weights from the pre-trained lipreading and audio parts, only the dense layers need to be trained, and training should go quite fast (see the third sketch after this list).
      If it doesn't, I'll need a top-level batch size, which complicates things since padding is then needed for both the audio and the images.
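
A minimal sketch of the per-video processing described above, assuming batch_size = 1 (one video at a time). The helper `predictPhonemes` and the exact array layouts are hypothetical placeholders, not the actual audio/lipreading code:

```python
import numpy as np

def process_video(images, validFrameIdxs, predictPhonemes):
    """Return (frameNo, phonemePredictions) for every valid frame of one video.

    images          : all mouth images of this video file, indexed by frame number
    validFrameIdxs  : the valid frames, taken from the audio outputs/labels of this video
    predictPhonemes : hypothetical lipreading prediction function for a batch of images
    """
    # batch_size = 1 video: run the lipreading net on just this video's valid frames
    batch = np.stack([images[frameNo] for frameNo in validFrameIdxs])
    predictions = predictPhonemes(batch)   # assumed shape: (len(validFrameIdxs), 39)
    return list(zip(validFrameIdxs, predictions))
```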
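
A rough sketch of the buildNetworks -> combineNetworks structure in Lasagne. The two sub-network builders are only stubs standing in for the real audio and lipreading networks, and the image shape and dense-layer sizes are assumptions:

```python
import lasagne

def buildAudioNetwork(input_var, num_output=39):
    # stub: the real audio network is built in the existing audio code
    l_in = lasagne.layers.InputLayer(shape=(None, 39), input_var=input_var)
    return lasagne.layers.DenseLayer(l_in, num_units=num_output,
                                     nonlinearity=lasagne.nonlinearities.softmax)

def buildLipreadingNetwork(input_var, num_output=39):
    # stub: the real lipreading CNN is built in the existing lipreading code;
    # the (1, 120, 120) image shape is just an assumption here
    l_in = lasagne.layers.InputLayer(shape=(None, 1, 120, 120), input_var=input_var)
    l_flat = lasagne.layers.FlattenLayer(l_in)
    return lasagne.layers.DenseLayer(l_flat, num_units=num_output,
                                     nonlinearity=lasagne.nonlinearities.softmax)

def combineNetworks(audio_net, lip_net, num_phonemes=39, dense_sizes=(256, 256)):
    # concatenate the two 39-dim outputs -> (# valid frames, 78)
    l_concat = lasagne.layers.ConcatLayer([audio_net, lip_net], axis=1)
    l_dense = l_concat
    for n in dense_sizes:  # a few dense layers to combine the two streams
        l_dense = lasagne.layers.DenseLayer(l_dense, num_units=n,
                                            nonlinearity=lasagne.nonlinearities.rectify)
    return lasagne.layers.DenseLayer(l_dense, num_units=num_phonemes,
                                     nonlinearity=lasagne.nonlinearities.softmax)

def buildNetworks(audio_var, lip_var):
    audio_net = buildAudioNetwork(audio_var)
    lip_net = buildLipreadingNetwork(lip_var)
    return audio_net, lip_net, combineNetworks(audio_net, lip_net)
```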
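
A sketch of the weight-preloading idea, assuming the pre-trained parameters were saved with np.savez from lasagne.layers.get_all_param_values; the .npz file names are hypothetical:

```python
import numpy as np
import lasagne

def load_pretrained(layer, npz_path):
    # restore parameters that were saved earlier with
    #   np.savez(npz_path, *lasagne.layers.get_all_param_values(layer))
    with np.load(npz_path) as f:
        values = [f['arr_%d' % i] for i in range(len(f.files))]
    lasagne.layers.set_all_param_values(layer, values)

def setup_finetuning(audio_net, lip_net, combined_net):
    """Preload the pre-trained parts; return only the new dense-layer params."""
    load_pretrained(audio_net, 'audio_pretrained.npz')      # hypothetical file names
    load_pretrained(lip_net, 'lipreading_pretrained.npz')

    # train only the parameters that are NOT in the pre-trained sub-networks,
    # i.e. the dense/softmax layers added by combineNetworks
    all_params = lasagne.layers.get_all_params(combined_net, trainable=True)
    pretrained = set(lasagne.layers.get_all_params(audio_net, trainable=True) +
                     lasagne.layers.get_all_params(lip_net, trainable=True))
    return [p for p in all_params if p not in pretrained]
```

The returned list can then be passed as the params argument of e.g. lasagne.updates.adam, so only the combining dense layers get updated while the pre-trained audio and lipreading parts stay fixed.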
