Adjusting segment length for generation of subtitles #1665
-
I'd gladly consider a PR adding this feature, though I am not quite sure how it would be implemented. How would you ensure that long speaker segments are split at the right time?
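One conceivable answer, sketched below purely as an assumption and not as existing pyannote API: run a VAD pass over each long speaker turn and cut at the longest non-speech gap inside each maximum-length window, falling back to a hard cut when no gap is available. Only `pyannote.core.Segment` is real API here; `split_at_gaps`, `MAX_LEN`, and the 6-second limit are hypothetical names and values.

```python
# Hypothetical sketch: split a long speaker turn at non-speech gaps.
# Only pyannote.core.Segment is real API; split_at_gaps and MAX_LEN
# are made-up names for illustration.
from pyannote.core import Segment

MAX_LEN = 6.0  # assumed maximum subtitle duration, in seconds

def split_at_gaps(turn: Segment, gaps: list[Segment], max_len: float = MAX_LEN):
    """Split `turn` into chunks of at most `max_len` seconds,
    preferring cut points that fall inside detected non-speech `gaps`."""
    chunks, start = [], turn.start
    while turn.end - start > max_len:
        window_end = start + max_len
        # gaps whose midpoint lies inside the current window are candidate cuts
        candidates = [g for g in gaps if start < g.middle < window_end]
        # cut at the longest pause if there is one, otherwise cut hard
        cut = max(candidates, key=lambda g: g.duration).middle if candidates else window_end
        chunks.append(Segment(start, cut))
        start = cut
    chunks.append(Segment(start, turn.end))
    return chunks
```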
-
From my experience, I got more accurate results when using pyannote; that is why I did not use Whisper segments in the first place. I am not sure what you mean by postprocessing. Something like matching Whisper's output text against the pyannote segments to get more timestamps? I am not sure how to implement that without retraining the whole model, but I will think about it and come back to the topic if I can come up with something. Maybe the VAD can be adjusted to interpret shorter stretches without speech as a break.
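For what it's worth, the closest existing knob for that last idea seems to be the `min_duration_off` hyperparameter of the `VoiceActivityDetection` pipeline. A minimal sketch, assuming pyannote.audio 3.x and the `pyannote/segmentation-3.0` checkpoint; the audio file name and the 100 ms value are placeholders:

```python
# Hedged sketch: make the VAD treat shorter pauses as segment breaks.
# Assumes pyannote.audio 3.x; pyannote/segmentation-3.0 is a gated model,
# so Model.from_pretrained may need a Hugging Face access token.
from pyannote.audio import Model
from pyannote.audio.pipelines import VoiceActivityDetection

model = Model.from_pretrained("pyannote/segmentation-3.0")
pipeline = VoiceActivityDetection(segmentation=model)
pipeline.instantiate({
    "min_duration_on": 0.0,   # keep even very short speech regions
    "min_duration_off": 0.1,  # pauses longer than ~100 ms end a segment
})
speech = pipeline("audio.wav")  # pyannote.core.Annotation of speech regions
```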
-
So, as many people probably do, I use pyannote as a pre-step for transcription services. One current disadvantage compared to end-to-end models like Whisper is how segment timestamps are generated. Let's say, for example, that I want to provide the transcribed text of a pyannote run as subtitles: this will not work, mainly because pyannote segments follow the speaker, which can produce very long segments.
Whisper, on the other hand, was trained on a dataset with subtitles, so it naturally produces segments that can easily be displayed as such.
I think it would be beneficial if there were a hyperparameter in pyannote to adjust the maximum length of a segment, so that a speaker segment can be divided into multiple subsegments.
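To make the proposal concrete, here is a hedged post-processing sketch, not existing pyannote API, that caps every speaker turn at a maximum duration by subdividing long turns evenly. `cap_segment_length` and the 6-second default are hypothetical, and a real implementation would rather cut at pauses, as discussed in the replies above.

```python
# Hypothetical post-processing sketch of the proposed hyperparameter:
# cap every speaker turn at `max_len` seconds by evenly subdividing it.
# `diarization` is a pyannote.core.Annotation, e.g. from a diarization pipeline.
import math

from pyannote.core import Annotation, Segment

def cap_segment_length(diarization: Annotation, max_len: float = 6.0) -> Annotation:
    capped = Annotation(uri=diarization.uri)
    for segment, track, label in diarization.itertracks(yield_label=True):
        n = max(1, math.ceil(segment.duration / max_len))  # number of subsegments
        step = segment.duration / n
        for i in range(n):
            sub = Segment(segment.start + i * step, segment.start + (i + 1) * step)
            capped[sub, f"{track}-{i}"] = label  # keep the original speaker label
    return capped
```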