Internet Archive Podcasts #54

nkandpa2 · 2024-01-29T16:49:19Z

There are many podcasts published on the Internet Archive under permissive licenses that can be transcribed with an ASR system like whisperX.

Below are the Internet Archive search queries for English podcasts under different licenses (see IA Search Guide for how to filter by license):

Public Domain - 8,131 results
CC-BY-SA - 10,061 results
CC-BY - 4,530 results

I have not tested and cannot verify whisperX transcription quality on non-English audio, but here are the search queries for permissively licensed podcasts in all languages:

Public Domain - 33,826 results
CC-BY-SA - 18,270 results
CC-BY - 17,720 results

With these search queries we can follow these instructions to bulk download the podcasts and then pass them through an ASR system.
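As a rough sketch of what the search step could look like in code, the snippet below builds a query and a URL for the Internet Archive advanced search endpoint (`https://archive.org/advancedsearch.php`) and reads item identifiers from the JSON response. The query clauses are assumptions, not the exact queries linked above: the `subject:podcast` filter, the `licenseurl` wildcard patterns, and the `language` value are illustrative guesses, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"

# Assumed license filters: wildcard patterns matched against each item's
# licenseurl field. These are illustrative, not the queries from the issue.
LICENSE_PATTERNS = {
    "public-domain": "*publicdomain*",
    "cc-by-sa": "*creativecommons.org/licenses/by-sa/*",
    "cc-by": "*creativecommons.org/licenses/by/*",
}


def build_query(license_key, language=None):
    """Build a search query string for podcasts under one license."""
    clauses = [
        "mediatype:audio",
        "subject:podcast",  # assumption: podcasts are tagged with this subject
        f"licenseurl:{LICENSE_PATTERNS[license_key]}",
    ]
    if language:
        clauses.append(f"language:{language}")  # assumption: e.g. "eng"
    return " AND ".join(clauses)


def build_search_url(query, rows=100, page=1):
    """Build an advanced search URL that returns item identifiers as JSON."""
    params = [
        ("q", query),
        ("fl[]", "identifier"),
        ("rows", rows),
        ("page", page),
        ("output", "json"),
    ]
    return SEARCH_ENDPOINT + "?" + urllib.parse.urlencode(params)


def fetch_identifiers(query, rows=100, page=1):
    """Fetch one page of results (needs network access)."""
    with urllib.request.urlopen(build_search_url(query, rows, page)) as resp:
        payload = json.load(resp)
    return [doc["identifier"] for doc in payload["response"]["docs"]]
```

Calling `fetch_identifiers(build_query("cc-by", language="eng"))` would return one page of item identifiers, which could then be handed to whatever bulk-download tool the linked instructions describe before running ASR on the audio files.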

craffel · 2024-05-06T17:37:47Z

Napkin math: 20k podcasts × ~30 minutes/podcast × 150 tokens/minute ≈ 90M tokens? Not sure we should put a ton of effort into this, but if it's easy to reuse the YouTube pipeline then it wouldn't hurt.
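The napkin math above can be written out as a tiny script. The 30 minutes/podcast and 150 tokens/minute figures are the rough guesses from this comment, not measured values, and treating each search result as one podcast episode is also an assumption. The function name is hypothetical.

```python
def estimate_tokens(num_podcasts, minutes_per_podcast=30, tokens_per_minute=150):
    """Back-of-the-envelope token count for transcribed podcast audio."""
    return num_podcasts * minutes_per_podcast * tokens_per_minute


# Result counts for the English search queries listed earlier in this issue.
english_counts = {"Public Domain": 8131, "CC-BY-SA": 10061, "CC-BY": 4530}

print(estimate_tokens(20_000))                        # 90000000
print(estimate_tokens(sum(english_counts.values())))  # 102249000
```

Using the summed English result counts (22,722) instead of the rounded 20k gives roughly 102M tokens, so the estimate stays in the same ballpark either way.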

craffel added the low priority label May 6, 2024
