Internet Archive Podcasts #54

Open
nkandpa2 opened this issue Jan 29, 2024 · 1 comment

@nkandpa2
Collaborator

There are many podcasts published on Internet Archive under permissive licenses that can be transcribed with an ASR system like whisperX.

Below are the Internet Archive search queries for English podcasts under different licenses (see IA Search Guide for how to filter by license):

I have not tested and cannot verify whisperX transcription quality on non-English audio, but here are the search queries for permissively licensed podcasts in all languages:

With these search queries we can follow these instructions to bulk download the podcasts and then pass them through an ASR system.
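
As a rough sketch of the first step, a query like the ones above could be turned into a request against the Internet Archive's `advancedsearch.php` endpoint to list item identifiers. The field values used here (`mediatype:audio`, `subject:podcast`, the `licenseurl` wildcard pattern, `language:eng`) are assumptions for illustration and are not the exact queries from this issue; `build_search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Internet Archive advanced search endpoint (returns JSON when output=json).
ADVANCED_SEARCH = "https://archive.org/advancedsearch.php"


def build_search_url(license_pattern, language=None, rows=100, page=1):
    """Build a search URL listing identifiers of podcast items.

    license_pattern is matched against the `licenseurl` metadata field.
    The clauses below are illustrative assumptions, not the issue's queries.
    """
    clauses = [
        "mediatype:audio",
        "subject:podcast",
        f"licenseurl:{license_pattern}",
    ]
    if language:
        clauses.append(f"language:{language}")
    params = [
        ("q", " AND ".join(clauses)),
        ("fl[]", "identifier"),  # only return the item identifier
        ("rows", rows),
        ("page", page),
        ("output", "json"),
    ]
    return f"{ADVANCED_SEARCH}?{urlencode(params)}"


# Example: English items whose license URL mentions "publicdomain" (assumed pattern).
url = build_search_url("*publicdomain*", language="eng")
print(url)
```

The identifiers returned by such a request could then be fed to the bulk download step, with the audio files going to the ASR system afterwards.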

@craffel
Collaborator

craffel commented May 6, 2024

Napkin math: 20k podcasts, ~30 minutes/podcast, 150 tokens/minute = 90M tokens? Not sure we should put a ton of effort into this but if it's easy to reuse the YouTube pipeline then it wouldn't hurt.
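
For anyone who wants to vary the inputs, the estimate above can be written out as a small script. The three numbers are the rough guesses from the comment, not measured values, and `estimate_tokens` is just an illustrative helper name.

```python
def estimate_tokens(num_podcasts, minutes_per_podcast, tokens_per_minute):
    """Back-of-the-envelope token count for transcribed audio."""
    return num_podcasts * minutes_per_podcast * tokens_per_minute


# Rough guesses from the comment above: 20k podcasts, ~30 min each, 150 tokens/min.
total = estimate_tokens(20_000, 30, 150)
print(f"{total / 1e6:.0f}M tokens")  # 90M tokens
```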
