Home
The feature extractor is responsible for turning PCM-encoded audio data into 8-bit mel-spectrogram features. You might ask why the feature extractor is a separate component and why it uses only 8 bits. First, some applications, such as verifying whether a hotword was spoken by a certain speaker, require two models to run on the same audio data. Keeping the feature extractor as a separate entity avoids computing the mel features twice. Second, it is a convenient way of compressing and transmitting data. One second of audio yields 40x98 mel features. You can capture audio on a lightweight system (such as an ESP32) and transmit the features to a more powerful system. This requires only 40 x 98 x 8 bit = 31,360 bit, or roughly 31 kbit (about 3.9 kB), per second. Quantizing to 8 bits loses almost no audible information.
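The bandwidth figure can be checked with a few lines of arithmetic. The sketch below uses only the numbers given on this page (40x98 features per second at 8 bits, and 16 kHz 16-bit mono PCM as input); the constant names are illustrative.

```python
# Feature dimensions stated in the text.
MEL_BINS = 40            # mel features per frame
FRAMES_PER_SECOND = 98   # feature frames in one second of audio
BITS_PER_FEATURE = 8

# Raw PCM input for comparison: 16 kHz, 16-bit, mono.
SAMPLE_RATE_HZ = 16_000
BITS_PER_SAMPLE = 16

feature_bits = MEL_BINS * FRAMES_PER_SECOND * BITS_PER_FEATURE
pcm_bits = SAMPLE_RATE_HZ * BITS_PER_SAMPLE

print(f"features: {feature_bits} bit/s ({feature_bits // 8} bytes/s)")
print(f"raw PCM:  {pcm_bits} bit/s ({pcm_bits // 8} bytes/s)")
print(f"reduction: {pcm_bits / feature_bits:.1f}x")
```

Under these numbers, sending features instead of raw PCM cuts the data rate by roughly a factor of eight.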
Note: The C interface of the feature extractor expects 16-bit unsigned integers. The Python interface, however, expects the audio as bytes. Because each 16-bit sample occupies two bytes, the length of the input data in Python is detector.GetInputDataSize() * 2.
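As a rough illustration of the factor of two, the sketch below packs 16-bit samples into a bytes object with the standard array module. The value 3200 is a hypothetical stand-in for what detector.GetInputDataSize() returns (it corresponds to 200 ms at 16 kHz); the real value should be read from the detector.

```python
from array import array

# Hypothetical stand-in for detector.GetInputDataSize();
# 3200 samples corresponds to 200 ms at 16 kHz.
input_data_size = 3200

# One frame of silence as 16-bit samples. 'h' is a signed 16-bit type;
# the byte count would be identical for unsigned 'H'.
samples = array("h", [0] * input_data_size)

# Serialize the samples to raw bytes in native byte order.
frame_bytes = samples.tobytes()

print(samples.itemsize)   # bytes per sample
print(len(frame_bytes))   # bytes per frame

# Round trip: the bytes decode back to the same samples.
decoded = array("h")
decoded.frombytes(frame_bytes)
print(decoded == samples)
```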
The audio recognition module detects audio events. Depending on the model, this can be a hotword, a command, or any other audio event. Currently, all models look at a 1-second sliding window (40x98 mel features) that advances in 200 ms steps, so five predictions are made per second. The recognition module returns nothing if the audio does not match a known event, and the index of the event if a known one occurred.
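The timing of the sliding window can be sketched as follows. This only illustrates how a 1-second window advancing in 200 ms steps produces one prediction per step; it is not the library's internal buffering, and the function name sliding_windows is made up for this example.

```python
from collections import deque

STEP_MS = 200
WINDOW_MS = 1000
STEPS_PER_WINDOW = WINDOW_MS // STEP_MS  # 5 chunks of 200 ms per window


def sliding_windows(chunks):
    """Yield one full 1-second window for every new 200 ms chunk."""
    window = deque(maxlen=STEPS_PER_WINDOW)  # oldest chunk drops out automatically
    for chunk in chunks:
        window.append(chunk)
        if len(window) == STEPS_PER_WINDOW:
            yield list(window)


# Ten 200 ms chunks (2 seconds), labelled by their start time in ms.
chunks = [i * STEP_MS for i in range(10)]
windows = list(sliding_windows(chunks))

print(len(windows))  # 6
print(windows[0])    # [0, 200, 400, 600, 800]
print(windows[1])    # [200, 400, 600, 800, 1000]
```

Once the first second has been buffered, every additional 200 ms chunk yields a new window, which gives the five predictions per second.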
The feature extractor works on 16-bit signed integer PCM data with one channel at a sample rate of 16 kHz. It expects input frames of 200 ms length.
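The frame size follows directly from these parameters. The sketch below derives it and cuts a raw PCM byte buffer into 200 ms frames; the helper split_into_frames is hypothetical and not part of the library.

```python
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2   # 16-bit PCM
FRAME_MS = 200

SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MS // 1000   # 3200 samples
BYTES_PER_FRAME = SAMPLES_PER_FRAME * BYTES_PER_SAMPLE  # 6400 bytes


def split_into_frames(pcm: bytes):
    """Return (list of complete 200 ms frames, leftover bytes)."""
    n_full = len(pcm) // BYTES_PER_FRAME
    frames = [
        pcm[i * BYTES_PER_FRAME:(i + 1) * BYTES_PER_FRAME]
        for i in range(n_full)
    ]
    return frames, pcm[n_full * BYTES_PER_FRAME:]


# 0.5 s of silence: two full frames plus 100 ms left over.
pcm = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE // 2)
frames, rest = split_into_frames(pcm)
print(len(frames), len(rest))  # 2 3200
```

The leftover bytes would be kept and prepended to the next chunk of audio so that the extractor always receives complete frames.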
In Python, two recording implementations are available. cross_record.py uses pyaudio and works on multiple platforms. record.py uses arecord and works only on Linux. The arecord version causes less trouble with buffer underruns and overruns under heavy CPU load and should be used where possible.
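For orientation, the following is a minimal sketch of how raw audio in the expected format could be read from arecord through a pipe. It is not the project's record.py; the function names are hypothetical, and the sketch assumes arecord is installed and that a default capture device exists.

```python
import subprocess

FRAME_BYTES = 6400  # 200 ms of 16 kHz, 16-bit, mono audio


def arecord_command():
    """Build an arecord invocation that writes raw PCM to stdout."""
    return [
        "arecord",
        "-q",            # suppress status messages
        "-t", "raw",     # headerless PCM
        "-f", "S16_LE",  # signed 16-bit little-endian
        "-r", "16000",   # sample rate in Hz
        "-c", "1",       # mono
    ]


def read_frames(n_frames):
    """Yield up to n_frames chunks of FRAME_BYTES from arecord (Linux only)."""
    proc = subprocess.Popen(arecord_command(), stdout=subprocess.PIPE)
    try:
        for _ in range(n_frames):
            frame = proc.stdout.read(FRAME_BYTES)
            if len(frame) < FRAME_BYTES:
                break  # the stream ended early
            yield frame
    finally:
        proc.terminate()
        proc.wait()


# Only print the command here; read_frames() needs a capture device.
print(" ".join(arecord_command()))
```

Calling read_frames(5) on a Linux machine with a microphone would yield about one second of audio in 200 ms frames.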
This Python module detects the hardware the software is running on. If the architecture is detected correctly, it can set the default library path.
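The general idea can be sketched with the standard platform module. Everything project-specific here is an assumption: the mapping, the paths, the library file name and the function default_lib_path are hypothetical and only show the lookup pattern.

```python
import platform

# Hypothetical mapping from (system, machine) to a library location.
# The real module's paths and supported platforms may differ.
HYPOTHETICAL_LIB_PATHS = {
    ("Linux", "x86_64"): "lib/linux_x86_64/libexample.so",
    ("Linux", "armv7l"): "lib/linux_armv7/libexample.so",
    ("Linux", "aarch64"): "lib/linux_aarch64/libexample.so",
}


def default_lib_path(system=None, machine=None):
    """Return a default library path, or None if the platform is unknown."""
    system = system or platform.system()     # e.g. "Linux"
    machine = machine or platform.machine()  # e.g. "x86_64"
    return HYPOTHETICAL_LIB_PATHS.get((system, machine))


print(default_lib_path("Linux", "x86_64"))
print(default_lib_path("Linux", "unknown-cpu"))  # None
```

Returning None for an unrecognized platform leaves it to the caller to supply the library path explicitly.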