
Possible to use reference speaker embeddings in Pyannote diarization pipeline? #1750

Open
Arche151 opened this issue Aug 8, 2024 · 3 comments

Comments

@Arche151

Arche151 commented Aug 8, 2024

Hey everyone,

I am trying to use Pyannote with Whisper to transcribe meetings between my business partner and me, but the results haven't been great: about 50% of the time, the wrong speaker is assigned.

So, I thought about ways to improve the accuracy of the diarization and found the Pyannote API docs on creating Voiceprints from reference audio and then using them in the diarization pipeline.

But since I want to do everything locally, I searched for the open-source Pyannote equivalent of the Voiceprint feature, which seems to be https://huggingface.co/pyannote/embedding.

The problem: While I was able to extract embeddings from reference audios of my business partner and me, I have no idea how to use them in the diarization pipeline.

I didn't find any docs about this approach and was wondering if it's even possible, or only available via the Pyannote API.

I would greatly appreciate any kind of help/clarification :)

@hbredin
Member

hbredin commented Aug 20, 2024

The diarization pipeline has a return_embeddings option that might help you in this endeavour:

```python
# perform diarization and get one representative embedding per speaker
diarization, embeddings = pipeline("/path/to/audio.wav", return_embeddings=True)
for s, speaker in enumerate(diarization.labels()):
    # embeddings[s] is the embedding of speaker `speaker`
    ...
```

  • Step 1. Run speaker diarization pipeline with return_embeddings=True on your reference audio files to get corresponding embeddings
  • Step 2. Run speaker diarization pipeline on your test file and get one embedding per speaker
  • Step 3. Match embeddings based on cosine similarity.
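Step 3 could be sketched roughly as follows. The arrays below are toy stand-ins for the per-speaker embedding matrices that `pipeline(..., return_embeddings=True)` returns (one row per diarized speaker); `match_speakers` and the speaker names are hypothetical helpers for illustration, not part of pyannote's API.

```python
import numpy as np

def match_speakers(reference, test, names):
    """Map each test-file speaker index to the name of the most
    cosine-similar reference speaker.

    reference: (n_ref, d) array, one embedding per known speaker
    test:      (n_test, d) array, one embedding per diarized speaker
    names:     list of n_ref speaker names, aligned with `reference`
    """
    # L2-normalize rows so the dot product equals cosine similarity
    ref = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    tst = test / np.linalg.norm(test, axis=1, keepdims=True)
    similarity = tst @ ref.T  # shape (n_test, n_ref)
    return {i: names[int(np.argmax(row))] for i, row in enumerate(similarity)}

# Toy 3-dimensional embeddings standing in for the real ones
reference = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
test = np.array([[0.0, 0.9, 0.1],
                 [0.9, 0.1, 0.0]])
print(match_speakers(reference, test, ["me", "partner"]))
# {0: 'partner', 1: 'me'}
```

A greedy argmax like this can assign two test speakers to the same reference speaker; with only two known speakers that is usually fine, but for larger sets a one-to-one assignment (e.g. the Hungarian algorithm on the similarity matrix) would be safer.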

@Arche151
Author

@hbredin Omg, can't believe I got an answer from Mr. Pyannote himself.

I will try out your suggested approach and report back. Thanks a lot! :)

@Aduomas

Aduomas commented Dec 13, 2024

@hbredin Is there a way to load these returned embeddings back into the pipeline for the next inference?

The main goal is to preserve the state of the pipeline across multiple audio files.
