Streaming discussion #7
I know @jpsamaroo was experimenting along these lines.
As a quick overview of what I implemented: I use PortAudio.jl to provide the input stream in 4-second increments, and write it into a rotating buffer with a total length of 5 seconds (these periods are configurable; they just seem to work for me). I convert all audio to 16 kHz with this code from the README:

    # Whisper expects 16kHz sample rate and Float32 data
    sout = SampleBuf(Float32, 16000, round(Int, length(s)*(16000/samplerate(s))), nchannels(s))
    write(SampleBufSink(sout), SampleBufSource(s)) # Resample

All this happens continuously in one task, and a copy of the 5-second buffer is copied into a Gist here: https://gist.github.com/jpsamaroo/aff348ae04f392f1e8683b59cbe6bda7 One thing you'll notice is the
So there's not really a streaming API, more like a POC from https://github.com/ggerganov/whisper.cpp/blob/master/examples/stream/stream.cpp

The main idea is this:

- You start with some buffer (`audio_async` is a thin wrapper around a circular buffer): https://github.com/ggerganov/whisper.cpp/blob/70567eff232773d6786c91585d040f53c36b87a4/examples/common-sdl.h#L15
- In the `!use_vad` case, you simply wait until enough audio is available, and `audio.get(params.length_ms, pcmf32)` dumps it into the float32 vector `pcmf32`.
- Run `whisper_full(ctx, wparams, pcmf32.data(), pcmf32.size())` as usual.
- Use `whisper_full_n_segments(ctx)` and `whisper_full_get_segment_text(ctx, i)` as usual.
- The only difference is that afterwards you want to add the tokens from the last full segment to `wparams.prompt_tokens` for the next segment.

The general idea of the `audio` buffer is to pad `n` seconds (`n < 30`) to 30 s, so as you speak, you run inference on `1s + 29s silence`, then `2s + 28s silence`, etc., depending on how large `step_ms` is.

In the `use_vad` case, there are more `pcmf32`-related vectors for swapping audio data around (roughly a sliding window): https://github.com/ggerganov/whisper.cpp/blob/70567eff232773d6786c91585d040f53c36b87a4/examples/stream/stream.cpp#L162-L164

`pcmf32` and friends hold the actual samples that you copy to and from for direct use.
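To make the buffer idea above concrete, here is a minimal, library-free sketch of a circular sample buffer plus the "pad `n` seconds with silence" step. `RingBuffer` and `pad_with_silence` are hypothetical names made up for illustration; they are not the actual `audio_async` class or anything from the whisper.cpp API, and the sketch takes no position on whether the padding happens in the caller or inside the library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a circular audio buffer (NOT whisper.cpp's
// audio_async): keeps only the most recent `capacity` samples.
class RingBuffer {
public:
    explicit RingBuffer(std::size_t capacity) : data_(capacity), pos_(0), len_(0) {
        assert(capacity > 0);  // assumption: a zero-sized buffer is never wanted
    }

    // Append samples, overwriting the oldest ones once the buffer is full.
    void push(const std::vector<float>& samples) {
        for (float s : samples) {
            data_[pos_] = s;
            pos_ = (pos_ + 1) % data_.size();
            len_ = std::min(len_ + 1, data_.size());
        }
    }

    // Copy out the most recent n samples (fewer if not yet available),
    // oldest first.
    std::vector<float> latest(std::size_t n) const {
        n = std::min(n, len_);
        std::vector<float> out(n);
        const std::size_t start = (pos_ + data_.size() - n) % data_.size();
        for (std::size_t i = 0; i < n; ++i) {
            out[i] = data_[(start + i) % data_.size()];
        }
        return out;
    }

private:
    std::vector<float> data_;
    std::size_t pos_;  // index where the next sample will be written
    std::size_t len_;  // number of valid samples currently stored
};

// Zero-pad a chunk up to `window` samples, e.g. 30 s * 16000 Hz = 480000.
// Chunks that are already at least `window` long are returned unchanged.
std::vector<float> pad_with_silence(std::vector<float> pcm, std::size_t window) {
    if (pcm.size() < window) {
        pcm.resize(window, 0.0f);
    }
    return pcm;
}
```

In a streaming loop along the lines described above, the capture side would call `push` with each new chunk, and the inference side would periodically take `latest(...)` and hand the (padded) samples to the transcription call.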