Some way to allow for real time noise reduction, or direct audio input? #7
Yeah, Whisper has a tendency to hallucinate on non-speech segments. That's why I added extra voice activity detection with Silero VAD. You might improve the issue by setting the default threshold for the VAD to something higher than 0.5 here. Currently, segments are only discarded if they contain no speech at all. I have an idea to use the VAD to cut non-speech parts out of segments that contain only a little bit of speech; maybe that would also help. I don't know what the best way would be to feed the audio in directly. This repo uses a separate tool to capture the stream's audio output. There are plenty of Whisper showcases that use mic input; I'll see if I can add something like that as an option when I have some free time.
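To illustrate the thresholding idea: a minimal, hypothetical sketch of segment filtering, assuming per-chunk speech probabilities like the ones Silero VAD produces (the function name and data layout are made up for illustration, not the repo's actual code):

```python
# Hypothetical helper: keep only segments whose peak speech probability
# clears the VAD threshold. With the default 0.5 a segment survives if any
# chunk looks even vaguely like speech; raising the threshold discards more
# borderline (likely hallucination-prone) segments.

def filter_segments(segments, speech_probs, threshold=0.5):
    kept = []
    for seg, probs in zip(segments, speech_probs):
        if max(probs, default=0.0) > threshold:
            kept.append(seg)
    return kept
```

Raising `threshold` trades missed quiet speech for fewer hallucinated captions on music or game audio.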
I frankenstein'd an absurd way to do this: restream the stream to localhost.

1. Install OBS Studio.
2. Install the NVIDIA Audio Effects SDK.
3. In OBS, set up your stream to go to a custom RTMP server on localhost.
4. Right-click 'Desktop Audio' in the Audio Mixer and select 'Filters'.

Then I changed this ffmpeg call in translator.py to keep trying to open a stream instead of bailing:
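The exact change depends on how translator.py launches ffmpeg, but the retry idea can be sketched as a small wrapper (hypothetical names; in the real script the `opener` would be the existing ffmpeg invocation, which fails while OBS isn't streaming yet):

```python
import time

def open_with_retry(opener, retries=30, delay=1.0):
    """Call `opener` until it succeeds, instead of bailing on the first failure.

    `opener` is any zero-argument callable that raises OSError while the
    local RTMP endpoint is not live yet (e.g. launching ffmpeg against
    rtmp://127.0.0.1:1234 before OBS has started streaming).
    """
    last_err = None
    for attempt in range(retries):
        try:
            return opener()
        except OSError as err:
            last_err = err
            if attempt < retries - 1:
                time.sleep(delay)  # wait for OBS to start streaming
    raise RuntimeError("stream never came up") from last_err
```

This lets you start the translator first and click Start Streaming in OBS afterwards.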
(Not sure all the ffmpeg parameters are needed; I brute-forced it and stopped when it worked.) Then run `python translator.py --direct_url rtmp://127.0.0.1:1234`, wait a second until the script is listening, and click Start Streaming in OBS. This will stream your regular computer desktop audio, whatever you have playing, from any source. I'm sure someone who knows OBS can set it up in a more selective way so it's not ALL your desktop audio.

On the one hand, restreaming a stream to your own computer is an abomination of a solution. On the other hand, OBS is a state-of-the-art real-time audio processing powerhouse, so you could leverage any other audio processing in the chain here; the sky's the limit. It's also kind of nice to freely browse different streams while leaving the translator open. Sometimes it even works fine jumping between languages!

Also, OBS should be able to use streamlink directly via a plugin. That plugin should separate stream and desktop audio cleanly, because OBS pulls the audio in directly using streamlink. I tried it and the audio kept stuttering, though, so I just used my desktop audio for now.

While doing this I was thinking: if you do cleanly separate the desktop/stream audio (via the streamlink plugin, a 'virtual audio cable', or whatever), then a second OBS instance could restream the original audio and video, but with subtitles overlaid on top using OBS text/greenscreen features. I briefly tested just showing the terminal window with the translation on screen, with the background made transparent, like a typical stream chat. You could potentially delay the video so the subs line up. A little overkill, though.
I'm trying out OBS for the first time, but I found that you can select "Application audio capture (BETA)" as a source, and it does exactly what you want, so there's no need for streamlink in that case. Integrating with OBS seems like an interesting idea. It would have quite a few nice features:
We would want a pipeline like: There don't seem to be any ways to pipe data in and out of OBS directly; the best we get is streaming to a URL. It would definitely be possible to do this with plugins/scripts, but I couldn't find any useful ones so far, so I might have to build that myself. At that point, I think it would be easier to turn the entire pipeline into an OBS script. I toyed around with that idea for a bit, and unfortunately there are some roadblocks:
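For the "streaming to a URL" hand-off, the receiving side would still need to decode the OBS stream into raw audio for Whisper. A hedged sketch of that step (the function name is made up; the flags are standard ffmpeg options for dumping raw PCM to stdout):

```python
def ffmpeg_pcm_cmd(url, sample_rate=16000):
    """Build an ffmpeg command that decodes `url` into mono 16 kHz
    signed 16-bit PCM on stdout, which the transcription process can read.
    """
    return [
        "ffmpeg",
        "-i", url,           # e.g. rtmp://127.0.0.1:1234 from the OBS instance
        "-f", "s16le",       # raw signed 16-bit little-endian samples
        "-ac", "1",          # downmix to mono
        "-ar", str(sample_rate),
        "-loglevel", "quiet",
        "pipe:1",            # write raw audio to stdout
    ]
```

The transcriber would then read fixed-size chunks from the subprocess's stdout and feed them into the VAD/Whisper loop.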
I've been poking at this. I think just outputting the captions to a webpage gets you a lot of the value by itself, without OBS. As long as you can point a browser at it, you can position it on screen wherever you want (even from Colab, with a little work), and even make it transparent with some browser extensions. And since it's just a web page, it's super easy for anyone to change the size, appearance, or layout, use browser tools to auto-translate the words, copy and paste, whatever.

BTW, faster_whisper integrated Silero VAD. It seems to have some rough spots so far, but it could be a nice upgrade, and they'll improve the integration over time.

I'll post my translator.py experiments, but I ripped the guts out just messing around, so it's sort of a mess, sort of refactored, and probably somewhat broken. Some of it was refactored by ChatGPT, because I was trying out different refactoring processes and this project was the perfect small size that fits entirely in the context window. There are some cool actual upgrades, like using numpy_ringbuffer, but also way too many list comprehensions added, and an amazing 100% failure rate every single time it tried to extract functions out of the long main() function, for some reason.

BTW, there are some cool features in this project that in a perfect world could be integrated. You can see the first-pass caption, and then the update when Whisper reruns given more context:
https://github.com/davabase/transcriber_app |
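The "captions on a webpage" idea is small enough to sketch. This is a hypothetical illustration (function name, styling, and polling interval are all made up): the translator rewrites this page with its latest lines, and any browser, including an OBS browser source, polls it. Because it's plain HTML/CSS, size, color, and transparency are trivial for anyone to edit.

```python
import html

def render_caption_page(lines, refresh_seconds=1):
    """Render the last few caption lines as a self-refreshing HTML page."""
    body = "<br>".join(html.escape(line) for line in lines[-3:])  # newest captions
    return (
        "<!doctype html><html><head>"
        f'<meta http-equiv="refresh" content="{refresh_seconds}">'
        "<style>body{background:transparent;color:white;font-size:2em;"
        "text-shadow:0 0 4px black;}</style></head>"
        f"<body>{body}</body></html>"
    )
```

Serving it is one `http.server` handler (or just writing the string to a file a browser has open); meta-refresh polling is crude but avoids any JavaScript.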
I'm pretty sure I made things worse, but I left the code here anyway, as an example of using OBS like this for anyone else searching: https://github.com/JonathanFly/faster-whisper-livestream-translator
This is a pretty sweet repo; I've been using it a couple of times a week recently. faster-whisper lets you actually run the large model in real time with good latency on a 3090. Actually, it's even more insane: I run TWO LARGE MODELS AT THE SAME TIME, two stream-translators, so that I can have dual subtitles: one transcribed, one translated. It works fine on a 3090 as long as you're just doing normal desktop stuff! Wild.
But when streams have a lot of background noise (music, game sounds), I found you NEED to add some decent real-time noise reduction or Whisper just faceplants over and over.
Mainly I've used the NVIDIA Broadcast tool to do this in real time, with a virtual cable if needed to get the audio routed correctly. Whisper is back at full power when I do this. But since stream-translator streams the audio directly, I have to use something else instead.
If this could take in mic/speaker device audio as an alternative to streamlink, that would do it. Using this option loses the simplicity and latency benefits of streaming directly, but the alternative is Whisper collapsing in confusion on some streams. I know other repos already take direct audio input, but ideally I want to stop bouncing between them...
Maybe there's a more elegant way to accept direct audio input that doesn't require a wacky virtual cable or whatever? OBS Studio integrates NVIDIA noise reduction via the Broadcast SDK. Or there could be a good open-source solution; I tried a couple, but none of the real-time ones were close to good enough compared to the NVIDIA version.
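For reference, the core idea behind most open-source denoisers (spectral gating, the technique libraries like noisereduce are built on) is small enough to sketch. This is an illustrative toy, nowhere near the NVIDIA tool's quality, and the function names are made up:

```python
import numpy as np

def estimate_noise_floor(noise_frames):
    """Average per-bin magnitude spectrum over frames known to be noise-only."""
    return np.mean([np.abs(np.fft.rfft(f)) for f in noise_frames], axis=0)

def spectral_gate(frame, noise_floor, factor=1.5):
    """Crude single-frame noise gate: zero out FFT bins whose magnitude
    is not well above the estimated per-bin noise floor, then resynthesize.
    """
    spec = np.fft.rfft(frame)
    keep = np.abs(spec) > factor * noise_floor  # boolean mask over bins
    return np.fft.irfft(spec * keep, n=len(frame))
```

A real-time version would add overlap-add windowing and a smoother (soft) mask; hard gating like this produces musical-noise artifacts, which is part of why the off-the-shelf open-source options struggle against NVIDIA's learned model.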