Internet Archive #13

upintheairsheep · 2023-02-14T17:13:58Z

http://archive.org/ - Contact the Internet Archive to request a listing of all the data you want. The Internet Archive is a giant library of documents (books, manuals, and other assorted PDF files) and other interesting files, including assorted important documents and mirrored online videos with their JSON metadata and sometimes their comments. For PDF files, they provide several derived formats, such as an OCR TXT and an OCR XML. See https://archive.org/download/andrus-thesis as an example. For mirrored videos, see https://archive.org/download/youtube-DPMluEVUqS0 as an example, and https://archive.org/download/instagram-apple for another commonly used format. The archive also provides directory listings for the contents of common compressed files, so you can scrape those for documents too. See #11 for formats.
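As a rough illustration of how the per-item file listings could be scraped, here is a minimal Python sketch. It assumes the public metadata endpoint `https://archive.org/metadata/<identifier>`, whose JSON response carries a `files` list with a `name` per entry, and it builds links in the same `https://archive.org/download/<identifier>/...` form as the examples above. The function names and the suffix filter are my own choices for illustration, not anything the Internet Archive prescribes.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoints: the metadata API and the /download/ path used in the links above.
METADATA_URL = "https://archive.org/metadata/{identifier}"
DOWNLOAD_URL = "https://archive.org/download/{identifier}/{name}"


def fetch_file_list(identifier, timeout=30):
    """Fetch the item's metadata record and return its list of file entries."""
    url = METADATA_URL.format(identifier=urllib.parse.quote(identifier))
    with urllib.request.urlopen(url, timeout=timeout) as response:
        record = json.load(response)
    # An unknown identifier yields a record without "files", so default to empty.
    return record.get("files", [])


def select_files(identifier, files, suffixes):
    """Return download URLs for entries whose name ends with one of the suffixes."""
    wanted = tuple(suffix.lower() for suffix in suffixes)
    urls = []
    for entry in files:
        name = entry.get("name", "")
        if name.lower().endswith(wanted):
            urls.append(
                DOWNLOAD_URL.format(
                    identifier=urllib.parse.quote(identifier),
                    # quote() leaves "/" alone, so nested paths stay intact.
                    name=urllib.parse.quote(name),
                )
            )
    return urls


if __name__ == "__main__":
    item = "andrus-thesis"  # identifier taken from the example link above
    for link in select_files(item, fetch_file_list(item), [".txt", ".xml", ".json"]):
        print(link)
```

Filtering on file name suffix is only one option; each file entry also carries a `format` field, but its exact values vary by item, so check a few records before relying on it.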
