10.04 - 16.04
- More network evaluations: different depths, different numbers of units, trained on different data, etc.
- Combining Audio/Lipreading:
  - Get the valid frames from the audio outputs.
  - Store the images per video file instead of all together. The network will then need to produce output per timeframe: (frameNo, phonemePredictions).
  - Batch processing is an issue: each video has a different number of valid frames, and audio and video need different kinds of padding. For now, use batch_size = 1 to sidestep the issue.
  - General structure of the combinedSR program: buildNetworks -> buildAudioNetwork, buildLipreadingNetwork, combineNetworks.
  - combineNetworks can use a Lasagne concatenate layer to concatenate the features, giving something like (# valid frames, 78): the first 39 predictions come from audio, the other 39 from lipreading. Add some dense layers behind that to combine them optimally, then output the 39 phonemes (see the first sketch below).
  - Batch sizes: I need a batch size of len(valid_frames) for the lipreading part, so the net can be reused for all the images of a video. Not sure if I can also add a second batch size for processing multiple videos in parallel; that would require a lot of memory, possibly too much. If I preload the networks with the weights from the pre-trained lipreading and audio parts, only the dense layers need to be trained, and it should go quite fast (see the second sketch below). If it doesn't, I'll need a top-level batch size, which complicates things since padding is then needed for both the audio and the images...
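
A minimal sketch of what combineNetworks could look like with Lasagne's concatenate layer. The dense-layer sizes and the `l_audio` / `l_lip` InputLayer stand-ins are my own assumptions, and both inputs are assumed to already be restricted to the valid frames:

```python
from lasagne.layers import InputLayer, ConcatLayer, DenseLayer
from lasagne.nonlinearities import rectify, softmax

def combineNetworks(l_audio_out, l_lip_out, num_phonemes=39, dense_sizes=(256, 256)):
    """Concatenate per-frame audio and lipreading predictions, add dense layers on top."""
    # (num_valid_frames, 39) + (num_valid_frames, 39) -> (num_valid_frames, 78)
    l_concat = ConcatLayer([l_audio_out, l_lip_out], axis=1)

    # a few dense layers to learn the optimal combination of the two streams
    l_dense = l_concat
    for n in dense_sizes:
        l_dense = DenseLayer(l_dense, num_units=n, nonlinearity=rectify)

    # one softmax over the 39 phonemes per valid frame
    return DenseLayer(l_dense, num_units=num_phonemes, nonlinearity=softmax)

# With batch_size = 1 video, the batch dimension is simply the number of valid frames.
l_audio = InputLayer(shape=(None, 39))   # stand-in for the audio network's output
l_lip = InputLayer(shape=(None, 39))     # stand-in for the lipreading network's output
l_combined = combineNetworks(l_audio, l_lip)
```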
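And a sketch of the weight-preloading / partial-training idea: here `l_audio_out` and `l_lip_out` are assumed to be the output layers of the real pre-trained sub-networks (not the InputLayer stand-ins above), `l_combined = combineNetworks(l_audio_out, l_lip_out)`, and the .npz file names are hypothetical, assuming the pre-trained parts were saved with np.savez as in the Lasagne examples:

```python
import numpy as np
import theano.tensor as T
import lasagne

# 1. Preload the pre-trained weights (assumed saved with np.savez, one array per parameter).
with np.load('audio_model.npz') as f:          # hypothetical file name
    values = [f['arr_%d' % i] for i in range(len(f.files))]
lasagne.layers.set_all_param_values(l_audio_out, values)
with np.load('lipreading_model.npz') as f:     # hypothetical file name
    values = [f['arr_%d' % i] for i in range(len(f.files))]
lasagne.layers.set_all_param_values(l_lip_out, values)

# 2. Train only the new dense layers: take all trainable parameters of the combined
#    network and drop those belonging to the pre-trained parts.
pretrained = set(lasagne.layers.get_all_params(l_audio_out, trainable=True) +
                 lasagne.layers.get_all_params(l_lip_out, trainable=True))
dense_params = [p for p in lasagne.layers.get_all_params(l_combined, trainable=True)
                if p not in pretrained]

# 3. Standard loss/updates, but only over the dense-layer parameters.
targets = T.ivector('targets')                 # one phoneme label per valid frame
predictions = lasagne.layers.get_output(l_combined)
loss = lasagne.objectives.categorical_crossentropy(predictions, targets).mean()
updates = lasagne.updates.adam(loss, dense_params, learning_rate=1e-3)
# compile with theano.function(..., updates=updates) using the networks' input variables
```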