
10.04 - 16.04

  1. More network evaluations: different depths, different numbers of units, trained on different data, etc.
  2. Combining Audio/Lipreading:
    • get valid frames from audio outputs

    • put the images per video file instead of all together. The network will then need to produce one output per timeframe: (frameNo, phonemePredictions) (see the first sketch after this list)

    • batch processing is an issue: each video has a different number of valid frames, and audio and video need different kinds of padding. For now, use batch_size = 1 to sidestep the issue.

    • general structure of the combinedSR program: buildNetworks -> buildAudioNetwork, buildLipreadingNetwork, combineNetworks (see the second sketch after this list)

    • combineNetworks can use Lasagne's ConcatLayer to concatenate the features, giving something like (# valid frames, 78): the first 39 predictions come from audio, the other 39 from lipreading. Add some dense layers behind it to combine them optimally, then output the 39 phonemes.

    • batch sizes:
      I need a batch size of len(valid_frames) for the lipreading part, to reuse the net for all the images of one video. I'm not sure whether I can also add a second batch size for processing multiple videos in parallel; that would require a lot of memory, possibly too much.
      If I preload the networks with the weights from the pre-trained lipreading and audio parts, only the dense layers need to be trained, and training should go quite fast (see the third sketch after this list).
      If it doesn't, I'll need a top-level batch size, which complicates things since padding is then needed for both the audio and the images.
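
A minimal sketch of the per-video processing described above, assuming batch_size = 1 (one video at a time). The helper `predictPhonemes` and the exact array layouts are hypothetical placeholders, not the actual audio/lipreading code:

```python
import numpy as np

def process_video(images, validFrameIdxs, predictPhonemes):
    """Return (frameNo, phonemePredictions) for every valid frame of one video.

    images          : all mouth images of this video file, indexed by frame number
    validFrameIdxs  : the valid frames, taken from the audio outputs/labels of this video
    predictPhonemes : hypothetical lipreading prediction function for a batch of images
    """
    # batch_size = 1 video: run the lipreading net on just this video's valid frames
    batch = np.stack([images[frameNo] for frameNo in validFrameIdxs])
    predictions = predictPhonemes(batch)   # assumed shape: (len(validFrameIdxs), 39)
    return list(zip(validFrameIdxs, predictions))
```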
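
A rough sketch of the buildNetworks -> combineNetworks structure in Lasagne. The two sub-network builders are only stubs standing in for the real audio and lipreading networks, and the image shape and dense-layer sizes are assumptions:

```python
import lasagne

def buildAudioNetwork(input_var, num_output=39):
    # stub: the real audio network is built in the existing audio code
    l_in = lasagne.layers.InputLayer(shape=(None, 39), input_var=input_var)
    return lasagne.layers.DenseLayer(l_in, num_units=num_output,
                                     nonlinearity=lasagne.nonlinearities.softmax)

def buildLipreadingNetwork(input_var, num_output=39):
    # stub: the real lipreading CNN is built in the existing lipreading code;
    # the (1, 120, 120) image shape is just an assumption here
    l_in = lasagne.layers.InputLayer(shape=(None, 1, 120, 120), input_var=input_var)
    l_flat = lasagne.layers.FlattenLayer(l_in)
    return lasagne.layers.DenseLayer(l_flat, num_units=num_output,
                                     nonlinearity=lasagne.nonlinearities.softmax)

def combineNetworks(audio_net, lip_net, num_phonemes=39, dense_sizes=(256, 256)):
    # concatenate the two 39-dim outputs -> (# valid frames, 78)
    l_concat = lasagne.layers.ConcatLayer([audio_net, lip_net], axis=1)
    l_dense = l_concat
    for n in dense_sizes:  # a few dense layers to combine the two streams
        l_dense = lasagne.layers.DenseLayer(l_dense, num_units=n,
                                            nonlinearity=lasagne.nonlinearities.rectify)
    return lasagne.layers.DenseLayer(l_dense, num_units=num_phonemes,
                                     nonlinearity=lasagne.nonlinearities.softmax)

def buildNetworks(audio_var, lip_var):
    audio_net = buildAudioNetwork(audio_var)
    lip_net = buildLipreadingNetwork(lip_var)
    return audio_net, lip_net, combineNetworks(audio_net, lip_net)
```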
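
A sketch of the weight-preloading idea, assuming the pre-trained parameters were saved with np.savez from lasagne.layers.get_all_param_values; the .npz file names are hypothetical:

```python
import numpy as np
import lasagne

def load_pretrained(layer, npz_path):
    # restore parameters that were saved earlier with
    #   np.savez(npz_path, *lasagne.layers.get_all_param_values(layer))
    with np.load(npz_path) as f:
        values = [f['arr_%d' % i] for i in range(len(f.files))]
    lasagne.layers.set_all_param_values(layer, values)

def setup_finetuning(audio_net, lip_net, combined_net):
    """Preload the pre-trained parts; return only the new dense-layer params."""
    load_pretrained(audio_net, 'audio_pretrained.npz')      # hypothetical file names
    load_pretrained(lip_net, 'lipreading_pretrained.npz')

    # train only the parameters that are NOT in the pre-trained sub-networks,
    # i.e. the dense/softmax layers added by combineNetworks
    all_params = lasagne.layers.get_all_params(combined_net, trainable=True)
    pretrained = set(lasagne.layers.get_all_params(audio_net, trainable=True) +
                     lasagne.layers.get_all_params(lip_net, trainable=True))
    return [p for p in all_params if p not in pretrained]
```

The returned list can then be passed as the params argument of e.g. lasagne.updates.adam, so only the combining dense layers get updated while the pre-trained audio and lipreading parts stay fixed.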
