YouTube Transcripts #53

Open
nkandpa2 opened this issue Jan 29, 2024 · 6 comments
Labels
external project (We will be including this data, but the work will be done primarily by someone else.) · high priority

Comments

@nkandpa2
Collaborator

Videos on YouTube can optionally be published under a CC-BY license. We can identify these videos with the YouTube API, download them, and transcribe them with an ASR system like whisperX.
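
As a minimal sketch of the identification step, the YouTube Data API v3 `search.list` method accepts a `videoLicense=creativeCommon` filter when `type=video` is set. The helper below only builds the request URL; the function name, the query string, and the API key are placeholders for illustration, and this is not necessarily how the project's cataloging code works.

```python
from urllib.parse import urlencode

# Endpoint of the YouTube Data API v3 search.list method.
SEARCH_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"


def build_cc_search_url(query, api_key, page_token=None, max_results=50):
    """Build a search.list URL restricted to Creative Commons videos.

    Hypothetical helper: it only assembles the URL and sends nothing.
    """
    params = {
        "part": "snippet",
        "type": "video",  # videoLicense is only valid together with type=video
        "videoLicense": "creativeCommon",
        "q": query,
        "maxResults": max_results,  # the API allows at most 50 results per page
        "key": api_key,
    }
    if page_token:
        # Token from a previous response, used to fetch the next page.
        params["pageToken"] = page_token
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"
```

Fetching the returned URL with any HTTP client and following `nextPageToken` from each response would page through the results for one query.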

nkandpa2 added the external project label Mar 4, 2024
@nkandpa2
Collaborator Author

nkandpa2 commented Mar 4, 2024

Code for cataloging CC YouTube videos can be found in this repo: https://github.com/nkandpa2/youtube-commons. About 800K CC videos adding up to about 300K hours of primarily speech-based video have been cataloged.

@storytracer

storytracer commented Apr 8, 2024

The US government offers a search of government YouTube videos: https://find.search.gov/search/news?affiliate=usagov_all_gov&channel=10448&query= . All of these videos are produced by government agencies, are in the public domain, and seem to be hosted on YouTube. Public domain is not an option you can select in the upload interface when you upload a video to YouTube: you can only choose between the standard YouTube license and CC, which is why these videos can't be found through the YouTube API. So the roughly 200K public domain videos that can be found through this search index could also be added to your YouTube Commons collection, @nkandpa2, and processed with Whisper.

I can't find a public-facing search API for find.search.gov; the only APIs appear to be for internal use. So the search.gov video index would have to be web-scraped to get a list of links to all the government YouTube videos.
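
A rough sketch of the link-extraction part of such a scrape is below. It assumes the result pages link to videos as `youtube.com/watch?v=<id>` or `youtu.be/<id>`, which has not been verified against find.search.gov; the function name is made up for illustration, and fetching and paginating the result pages is left out.

```python
import re

# YouTube video IDs are 11 characters drawn from [A-Za-z0-9_-].
# Assumption: links appear as youtube.com/watch?v=<id> or youtu.be/<id>.
_VIDEO_ID = re.compile(
    r"(?:youtube\.com/watch\?v=|youtu\.be/)([A-Za-z0-9_-]{11})"
)


def extract_video_ids(html):
    """Return the unique video IDs found in a page, in order of first appearance."""
    # dict.fromkeys removes duplicates while preserving insertion order.
    return list(dict.fromkeys(_VIDEO_ID.findall(html)))
```

Running this over each scraped result page and merging the lists would give the link list described above.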

@craffel
Collaborator

craffel commented Apr 12, 2024

Can we find any explicit statement that these videos are public domain, or do we need to rely on the reasoning that "these were made by the government, therefore they are public domain"?

@storytracer

storytracer commented Apr 12, 2024

No. So far I have not been able to find any explicit statement on the government YouTube channels that the videos they host are in the public domain. The only exception is using the full-text search to search for "Public Domain", which finds only 3,000 videos: https://find.search.gov/search/news?affiliate=usagov_all_gov&channel=10448&sort_by=&query=%22public+domain%22

The only explicit statement about a channel that I have been able to find so far is on the USDA website. A privacy impact assessment (p. 4) states that all the content on their YouTube channel is in the public domain: "Video content published to the USDA YouTube page will be previously approved by relevant Department and Office of Communications leadership, and will be available in the public domain." But while USDA makes this statement in an obscure document on its website, it does not include the statement on its YouTube channel or in the videos themselves, which are indexed through find.search.gov.

Generally speaking, government videos are rarely explicitly declared to be in the public domain. But that's the standard case for federal government documents as well. Outside the cultural heritage (CH) sector, works are rarely explicitly declared to be in the public domain with something like the PD Mark, and even in the CH sector the use of a PD Mark is much less common than CC licenses.

US Federal Government works are "born in the public domain" by law; they don't become part of the public domain at some later point through a waiver or expiration of rights. But such a general legal status is of course harder to document than an explicit rights statement, so it's worth considering whether transcribing government videos in addition to CC videos justifies the effort.

@nkandpa2
Collaborator Author

This is a great idea @storytracer. I think we can cover many of these videos by simply including the major US agencies' YouTube channels in the dataset. Looking through a couple of them from the search results you provided, channels sometimes will explicitly state in the description something along the lines of "Here you will find original content produced by [AGENCY]". So to me it feels reasonable to count this data as PD. We can always remove this later if we are unsure of the PD status.
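
One way to apply that check across many channels is a simple text match on channel descriptions. This is only a heuristic sketch: the phrase pattern is modeled on the example wording above, the function name and input shape are assumptions, and a match is a hint for manual review rather than proof of public domain status.

```python
import re

# Hypothetical pattern based on the wording quoted above; real channel
# descriptions may phrase this differently.
_ORIGINAL_CONTENT = re.compile(r"original content produced by", re.IGNORECASE)


def flag_channels(descriptions):
    """Return the names of channels whose description contains the phrase.

    `descriptions` maps a channel name to its description text.
    """
    return [
        name
        for name, text in descriptions.items()
        if _ORIGINAL_CONTENT.search(text)
    ]
```

Channels that are not flagged would still need a manual look, since a missing phrase says nothing about the license either way.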

@storytracer

@nkandpa2 What's the current state of cataloging? Do you have an ETA, and do you think you could upload the catalog as a dataset to HF? I would like to help with the distributed processing of the audio files on our infrastructure!

3 participants