Internet Archive #13

Open
upintheairsheep opened this issue Feb 14, 2023 · 0 comments
http://archive.org/ - Contact the Internet Archive and ask for a listing of all the data you want. The Internet Archive is a giant library of documents (books, manuals, and other miscellaneous PDF files) and other interesting files, including JSON files from mirrored online videos, which hold the video metadata and sometimes the comments, plus other assorted important documents. For PDF files, it provides a variety of derived formats, such as OCR text (.txt) and OCR XML. See https://archive.org/download/andrus-thesis for an example. See https://archive.org/download/youtube-DPMluEVUqS0 for an example of a mirrored video and https://archive.org/download/instagram-apple for another commonly used format. The archive also provides directory listings of the contents of common compressed files, so you can scrape those for documents too. See #11 for formats.
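
A rough sketch of how the per-item file listing could be scraped, in case it helps. It assumes the Internet Archive's public metadata endpoint (`https://archive.org/metadata/<identifier>`), which as far as I know returns a JSON object with a `files` list whose entries each have a `name`. The function names and the default suffix list are my own placeholders, not anything agreed in this project.

```python
import json
import urllib.request
from urllib.parse import quote

# Assumed URL patterns: the metadata endpoint plus the /download/ paths
# used in the examples above.
METADATA_URL = "https://archive.org/metadata/{identifier}"
DOWNLOAD_URL = "https://archive.org/download/{identifier}/{name}"


def fetch_metadata(identifier):
    """Fetch and parse the metadata JSON for one item (needs network access)."""
    url = METADATA_URL.format(identifier=identifier)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def text_file_urls(identifier, metadata, suffixes=(".txt", ".xml", ".json")):
    """Return download URLs for files whose names end in one of `suffixes`.

    `metadata` is the parsed metadata dict; entries without a name are skipped.
    The suffix tuple is a hypothetical default, adjust it to the formats in #11.
    """
    urls = []
    for entry in metadata.get("files", []):
        name = entry.get("name", "")
        if name.lower().endswith(suffixes):
            # quote() percent-encodes spaces etc. but keeps "/" for nested paths.
            urls.append(DOWNLOAD_URL.format(identifier=identifier, name=quote(name)))
    return urls
```

Usage would be something like `text_file_urls("andrus-thesis", fetch_metadata("andrus-thesis"))`. Filtering by file name suffix is only one option; the entries may also carry a `format` field that could be a better filter, but I have not checked how consistent it is across items.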
