docs and README: add wacz format doc, tweak links, tweak README
ikreymer committed Jun 11, 2020
1 parent 53bb291 commit ae872cd
Showing 6 changed files with 42 additions and 12 deletions.
2 changes: 1 addition & 1 deletion Gemfile.lock
@@ -8,7 +8,7 @@ GEM
eventmachine (>= 0.12.9)
http_parser.rb (~> 0.6.0)
eventmachine (1.2.7)
ffi (1.13.0)
ffi (1.13.1)
forwardable-extended (2.6.0)
http_parser.rb (0.6.0)
i18n (0.9.5)
16 changes: 10 additions & 6 deletions README.md
@@ -5,14 +5,18 @@
## Serverless Web Archive Replay

ReplayWeb.page provides a full web archive replay system running directly in the browser,
available at: https://replayweb.page/
available at: [https://replayweb.page/](https://replayweb.page)

For full user docs, see: https://replayweb.page/docs
For full user docs, see: [https://replayweb.page/docs](https://replayweb.page/docs)

The ReplayWeb.page App can be downloaded from: https://replayweb.page/releases
The ReplayWeb.page App can be downloaded from the [Releases](https://replayweb.page/releases) page.

### Embedding Guide

## Architecture / What's in this repo
See the [Embedding Guide](https://replayweb.page/docs/embedding) for more info on embedding web archives in other sites.


## What's in this repo

ReplayWeb.page is a static web site / offline web app + Electron app.

@@ -59,12 +63,12 @@ For service workers to work, they must be served from either localhost or an HTT

See the [user docs](https://replayweb.page/docs/) for additional info about using ReplayWeb.page



## LICENSE

ReplayWeb.page is made available under the AGPLv3 License.

[Embedding ReplayWeb.page](https://replayweb.page/docs/embedding) from published releases is encouraged.

If you would like to use it under a different license or have a question, please reach out as that may be a possibility.


4 changes: 2 additions & 2 deletions docs/exploring.md
@@ -24,12 +24,12 @@ The archive view presents several tabs:
- **Story** - This Story view presents lists of curated pages, as developed by the creator of the web archive.
This option is only shown if there is a curated story. As curated lists are not a standard part of WARC, only WARCs exported from Webrecorder.io/Conifer can have this option.

The new [Web Archive Collection (WACZ)](web-archive-collection-format) can also include curated lists.
The new [Web Archive Collection (WACZ)](wacz-format) can also include curated lists.

- **Pages** - The Pages view presents all pages in the web archive. As pages are not a standard part of WARC format,
generally only WARCs from Webrecorder.io/Conifer will have pages.

The new [Web Archive Collection (WACZ)](web-archive-collection-format) can also store pages.
The new [Web Archive Collection (WACZ)](wacz-format) can also store pages.


- **Page Resources** - This view allows searching the archive by URLs, as well as by common MIME type.
2 changes: 1 addition & 1 deletion docs/formats.md
@@ -12,7 +12,7 @@ parent: Reference
ReplayWeb.Page supports the archive formats listed below.
Format is currently determined based on the file extension.

The `.wacz` refers to the newly proposed [Web Archive Collection Zip Format](web-archive-collection).
The `.wacz` refers to the newly proposed [Web Archive Collection Zip Format](wacz-format).


| Format | Extensions | Status |
4 changes: 2 additions & 2 deletions docs/loading.md
@@ -44,7 +44,7 @@ To load a remote archive, simply enter the URL of the archive and click `Load`.
{: .fs-3 .pad .bg-grey-lt-100}
See [Supported Locations](locations) for details on where archives can be loaded from.

The archive will be downloaded, either fully or [only as needed (if possible)](streaming-archives.md) and presented on the archive page.
The archive will be downloaded, either fully or on-demand (if possible) and presented on the archive page.

The system supports WARC files, as well as several other formats

@@ -74,7 +74,7 @@ Due to the nature of the WARC format, the entire file must be read on first use
For WARC files **>25MB**, only the index is initially stored in the browser, and the actual content is loaded 'on-demand',
when the content is first accessed. This leads to faster loading and saves memory when dealing with large archives.

[Web Archive Collection (WACZ)](web-archive-collection-format) are always loaded on-demand, as no indexing is required.
[Web Archive Collection (WACZ)](wacz-format) are always loaded on-demand, as no indexing is required.
The initial archive view should load almost instantly as a result.

If an archive could not be loaded, an error will be displayed instead of the progress.
26 changes: 26 additions & 0 deletions docs/wacz-format.md
@@ -0,0 +1,26 @@
---
layout: default
title: 'Web Archive Collection Zipped (WACZ) Format'
nav_order: 1
permalink: /docs/wacz-format
parent: Reference
---

## Web Archive Collection Format Specification

ReplayWeb.page supports a new format for bundling raw web archive data (usually WARC files), indices,
page lists and other metadata into a single ZIP file.

The full spec for this format is available at: [https://github.com/webrecorder/web-archive-collection-format/blob/master/README.md](https://github.com/webrecorder/web-archive-collection-format/blob/master/README.md)

Files bundled into this format can use the .wacz (web archive collection zipped) file extension.

ReplayWeb.page will recognize this extension (as well as regular .zip) and will also load it from Google Drive when the
[Google Drive Integration](https://gsuite.google.com/u/2/marketplace/app/replaywebpage/160798412227) is installed.

The key benefit of this format is that large web archive collections can be loaded very quickly, to show the page list
and other key metadata, by downloading only parts of the WACZ file, as [outlined here](https://github.com/webrecorder/web-archive-collection-format/blob/master/README.md#appendix-a-use-case-random-access-to-web-archives-in-zip)

The actual raw content is loaded on-demand when the user requests each page.

With a WARC file, the entire contents must be loaded or indexed to determine the contents of the web archive collection.
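
The partial-loading idea above can be illustrated with a minimal Python sketch. It is not part of the committed file and is not ReplayWeb.page's implementation; it only shows the general ZIP technique of reading the central directory from the end of the file and then fetching individual members by offset. `RangeReader`, `list_entries`, `read_entry`, `build_demo_archive` and the member names are hypothetical, the in-memory `RangeReader` stands in for HTTP Range requests, and ZIP64 archives are not handled.

```python
import io
import struct
import zipfile
import zlib

EOCD_SIG = b"PK\x05\x06"   # end-of-central-directory record signature
CDIR_SIG = b"PK\x01\x02"   # central-directory file header signature


class RangeReader:
    """Hypothetical stand-in for HTTP Range requests; counts bytes fetched."""

    def __init__(self, data: bytes):
        self._data = data
        self.size = len(data)
        self.bytes_fetched = 0

    def fetch(self, start: int, length: int) -> bytes:
        chunk = self._data[start:start + length]
        self.bytes_fetched += len(chunk)
        return chunk


def list_entries(reader, tail_size=65536 + 22):
    """Read only the tail of the ZIP and return {name: (offset, size, method)}."""
    tail_len = min(tail_size, reader.size)
    tail_start = reader.size - tail_len
    tail = reader.fetch(tail_start, tail_len)
    eocd = tail.rfind(EOCD_SIG)
    if eocd < 0:
        raise ValueError("end-of-central-directory record not found")
    # Fixed 22-byte record; we need entry count, directory size and offset.
    _, _, _, _, total, cd_size, cd_offset, _ = struct.unpack(
        "<4sHHHHIIH", tail[eocd:eocd + 22])
    if cd_offset >= tail_start:          # directory already inside the tail
        cdir = tail[cd_offset - tail_start:cd_offset - tail_start + cd_size]
    else:                                # otherwise one more ranged fetch
        cdir = reader.fetch(cd_offset, cd_size)
    entries, pos = {}, 0
    for _ in range(total):
        f = struct.unpack("<4sHHHHHHIIIHHHHHII", cdir[pos:pos + 46])
        if f[0] != CDIR_SIG:
            raise ValueError("corrupt central directory")
        name = cdir[pos + 46:pos + 46 + f[10]].decode("utf-8")
        entries[name] = (f[16], f[8], f[4])  # local offset, compressed size, method
        pos += 46 + f[10] + f[11] + f[12]    # skip name, extra field, comment
    return entries


def read_entry(reader, entry):
    """Fetch one member on demand using its local header offset."""
    offset, comp_size, method = entry
    header = struct.unpack("<4sHHHHHIIIHH", reader.fetch(offset, 30))
    data_start = offset + 30 + header[9] + header[10]  # skip name + extra field
    raw = reader.fetch(data_start, comp_size)
    return zlib.decompress(raw, -15) if method == 8 else raw


def build_demo_archive():
    """Build an in-memory ZIP with made-up member names and contents."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("archive/data.warc", bytes(1_000_000))  # stored, 1 MB
        zf.writestr("pages.jsonl", b'{"url": "https://example.com/"}\n',
                    compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


reader = RangeReader(build_demo_archive())
entries = list_entries(reader)
pages = read_entry(reader, entries["pages.jsonl"])
print(sorted(entries), pages, reader.bytes_fetched, "of", reader.size, "bytes")
```

In this sketch the member list and the small page list are recovered after fetching only the tail of the archive plus one small member, while the large `archive/data.warc` member is never read. A real loader would replace `RangeReader.fetch` with ranged network requests.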