Napkin math: 20k podcasts × ~30 minutes/podcast × 150 tokens/minute ≈ 90M tokens? Not sure we should put a ton of effort into this, but if it's easy to reuse the YouTube pipeline then it wouldn't hurt.
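For anyone who wants to rerun the estimate with different guesses, here is a minimal sketch of the same arithmetic. The function name `estimate_tokens` is made up for illustration, and the three inputs are the rough figures from the comment above, not measured values.

```python
def estimate_tokens(num_podcasts: int, minutes_per_podcast: float, tokens_per_minute: float) -> float:
    """Back-of-the-envelope token count for a transcribed audio corpus."""
    return num_podcasts * minutes_per_podcast * tokens_per_minute


# Rough figures from the comment: 20k podcasts, ~30 min each, 150 tokens/min.
total = estimate_tokens(20_000, 30, 150)
print(f"{total:,} tokens (~{total / 1e6:.0f}M)")
```

Halving the average length or the tokens-per-minute rate halves the total, so the estimate is only as good as those two guesses.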
There are many podcasts published on Internet Archive under permissive licenses that can be transcribed with an ASR system like whisperX.
Below are the Internet Archive search queries for English podcasts under different licenses (see IA Search Guide for how to filter by license):
I have not tested and cannot verify whisperX transcription quality on non-English audio, but here are the search queries for permissively licensed podcasts in all languages:
With these search queries, we can follow these instructions to bulk download the podcasts and then pass them through an ASR system.
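As a rough sketch of the first step (turning a search query into a list of item identifiers), something like the following could work. It assumes the Internet Archive `advancedsearch.php` JSON endpoint and uses only the standard library. The helper names `build_search_url` and `fetch_identifiers` are hypothetical, and the query string is left to the caller, since the actual queries are the ones listed in this issue.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Internet Archive advanced search endpoint.
SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"


def build_search_url(query: str, rows: int = 100, page: int = 1) -> str:
    """Build a search URL that returns only item identifiers as JSON."""
    params = [
        ("q", query),              # the search query, supplied by the caller
        ("fl[]", "identifier"),    # only return the identifier field
        ("rows", rows),            # results per page
        ("page", page),            # 1-based page number
        ("output", "json"),
    ]
    return SEARCH_ENDPOINT + "?" + urllib.parse.urlencode(params)


def fetch_identifiers(query: str, rows: int = 100, page: int = 1) -> list[str]:
    """Fetch one page of results and return the item identifiers (needs network access)."""
    with urllib.request.urlopen(build_search_url(query, rows, page)) as resp:
        payload = json.load(resp)
    return [doc["identifier"] for doc in payload["response"]["docs"]]
```

Calling `fetch_identifiers(query)` with one of the queries above should give a list of identifiers that can be written to a file and handed to whichever bulk download method the linked instructions describe. Paging through large result sets and rate limiting are not handled here.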