Adjusting segment length for generation of subtitles #1665
-
I'd gladly consider a PR adding this feature, though I am not quite sure how it would be implemented. How would you ensure that long speaker segments are split at the right time?
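One conceivable answer, sketched below purely as an assumption and not as existing pyannote API: run a VAD pass over each long speaker turn and cut at the longest non-speech gap inside each maximum-length window, falling back to a hard cut when no gap is available. Only `pyannote.core.Segment` is real API here; `split_at_gaps`, `MAX_LEN`, and the 6-second limit are hypothetical names and values.

```python
# Hypothetical sketch: split a long speaker turn at non-speech gaps.
# Only pyannote.core.Segment is real API; split_at_gaps and MAX_LEN
# are made-up names for illustration.
from pyannote.core import Segment

MAX_LEN = 6.0  # assumed maximum subtitle duration, in seconds

def split_at_gaps(turn: Segment, gaps: list[Segment], max_len: float = MAX_LEN):
    """Split `turn` into chunks of at most `max_len` seconds,
    preferring cut points that fall inside detected non-speech `gaps`."""
    chunks, start = [], turn.start
    while turn.end - start > max_len:
        window_end = start + max_len
        # gaps whose midpoint lies inside the current window are candidate cuts
        candidates = [g for g in gaps if start < g.middle < window_end]
        # cut at the longest pause if there is one, otherwise cut hard
        cut = max(candidates, key=lambda g: g.duration).middle if candidates else window_end
        chunks.append(Segment(start, cut))
        start = cut
    chunks.append(Segment(start, turn.end))
    return chunks
```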
-
From my experience, I got more accurate results when using pyannote; that is why I did not use Whisper segments in the first place. I am not sure what you mean by postprocessing. Something like matching Whisper's output text against the pyannote segments to get more timestamps? I am not sure how to implement that without retraining the whole model, but I will think about it and come back to the topic if I can come up with something. Maybe the VAD can be adjusted to interpret shorter stretches without speech as a break.
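For what it's worth, the closest existing knob for that last idea seems to be the `min_duration_off` hyperparameter of the `VoiceActivityDetection` pipeline. A minimal sketch, assuming pyannote.audio 3.x and the `pyannote/segmentation-3.0` checkpoint; the audio file name and the 100 ms value are placeholders:

```python
# Hedged sketch: make the VAD treat shorter pauses as segment breaks.
# Assumes pyannote.audio 3.x; pyannote/segmentation-3.0 is a gated model,
# so Model.from_pretrained may need a Hugging Face access token.
from pyannote.audio import Model
from pyannote.audio.pipelines import VoiceActivityDetection

model = Model.from_pretrained("pyannote/segmentation-3.0")
pipeline = VoiceActivityDetection(segmentation=model)
pipeline.instantiate({
    "min_duration_on": 0.0,   # keep even very short speech regions
    "min_duration_off": 0.1,  # pauses longer than ~100 ms end a segment
})
speech = pipeline("audio.wav")  # pyannote.core.Annotation of speech regions
```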
-
So, as many people probably do, I use pyannote as a pre-step for transcription services. One current disadvantage compared to end-to-end models like Whisper is how segment timestamps are generated. Let's say, for example, that I want to provide the transcribed text of a pyannote run as subtitles: this will not work, mainly because pyannote segments follow the speaker, which can produce very long segments.
Whisper, on the other hand, was trained on a dataset with subtitles, so it naturally produces segments that can easily be displayed as such.
I think it would be beneficial if there were a hyperparameter in pyannote to adjust the maximum length of a segment, so that a speaker segment can be divided into multiple subsegments.
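To make the proposal concrete, here is a hedged post-processing sketch, not existing pyannote API, that caps every speaker turn at a maximum duration by subdividing long turns evenly. `cap_segment_length` and the 6-second default are hypothetical, and a real implementation would rather cut at pauses, as discussed in the replies above.

```python
# Hypothetical post-processing sketch of the proposed hyperparameter:
# cap every speaker turn at `max_len` seconds by evenly subdividing it.
# `diarization` is a pyannote.core.Annotation, e.g. from a diarization pipeline.
import math

from pyannote.core import Annotation, Segment

def cap_segment_length(diarization: Annotation, max_len: float = 6.0) -> Annotation:
    capped = Annotation(uri=diarization.uri)
    for segment, track, label in diarization.itertracks(yield_label=True):
        n = max(1, math.ceil(segment.duration / max_len))  # number of subsegments
        step = segment.duration / n
        for i in range(n):
            sub = Segment(segment.start + i * step, segment.start + (i + 1) * step)
            capped[sub, f"{track}-{i}"] = label  # keep the original speaker label
    return capped
```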