This is related to #1205, #1218, #1085, and many other issues/discussions. Same answer from my side: feel free to contribute a PR adding this feature, as this seems to be a recurrent request...
Hi everyone!
I am using the speaker diarization module to match podcast episode transcripts with the speakers, and it works very well, but it would be really nice to only have to match speaker names to the recognized speakers once.
So I was wondering: instead of calling the pretrained speaker-diarization pipeline on each separate audio file and thus re-fitting the clusters every time, is it possible to fit the speaker embedding cluster centroids once and reuse them for new audio files, simply assigning the new embeddings to the original clusters?
The diarization time also doesn't seem to scale linearly (it takes 3:50 min to diarize a 1 h audio sample on an RTX 3060, but 20 min for a 3 h one), and for the full 5 h episodes it sometimes fails outright, so I suspect the clustering step is the cause.
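The centroid-reuse idea above can be sketched roughly as follows. This is not a pyannote API; it is a minimal illustration assuming you have already extracted per-segment speaker embeddings (e.g. with an embedding model), using random vectors as placeholders: fit k-means once on a reference episode, store the centroids, then label new segments by nearest centroid instead of re-clustering.

```python
# Hedged sketch: fit speaker centroids once, then reuse them on new episodes.
# `ref_embeddings` and `new_embeddings` stand in for real per-segment speaker
# embeddings; the shapes and the speaker count (3) are arbitrary assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_distances

rng = np.random.default_rng(0)
ref_embeddings = rng.normal(size=(200, 512))  # placeholder embeddings

# Fit the clusters ONCE, on the reference episode's segments.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(ref_embeddings)
centroids = km.cluster_centers_  # persist these for later episodes

def assign_speakers(embeddings, centroids):
    """Label each new segment with the index of its closest stored centroid."""
    distances = cosine_distances(embeddings, centroids)
    return distances.argmin(axis=1)

# On a new episode: no re-clustering, just nearest-centroid assignment.
new_embeddings = rng.normal(size=(50, 512))  # placeholder embeddings
labels = assign_speakers(new_embeddings, centroids)
print(labels.shape)  # one speaker id per segment
```

Whether this works in practice depends on the embeddings being comparable across recordings (same model, similar channel conditions); a distance threshold would also be needed to flag speakers that were not present in the reference episode.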