Consider adding status code and digest to Memento TimeMaps #345

Open
anjackson opened this issue Apr 28, 2017 · 2 comments

@anjackson

anjackson commented Apr 28, 2017

Following this conversation, we should consider putting the HTTP status code and perhaps also the payload digest (if known) in the Memento TimeMap. e.g.

...
<https://www.webarchive.org.uk/wayback/archive/20151106002344/http://www.bl.uk/>; rel="memento"; datetime="Fri, 06 Nov 2015 00:23:44 GMT",
<https://www.webarchive.org.uk/wayback/archive/20151106004051/http://www.bl.uk/>; rel="memento"; datetime="Fri, 06 Nov 2015 00:40:51 GMT",
...

could be something like...

...
<https://www.webarchive.org.uk/wayback/archive/20151106002344/http://www.bl.uk/>; rel="memento"; datetime="Fri, 06 Nov 2015 00:23:44 GMT"; status="404",
<https://www.webarchive.org.uk/wayback/archive/20151106004051/http://www.bl.uk/>; rel="memento"; datetime="Fri, 06 Nov 2015 00:40:51 GMT"; status="200",
...

This information is generally in the CDX index/service so it should be easy enough to add.
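As a rough illustration of the proposed format, here is a minimal sketch of how a TimeMap entry could be rendered from CDX fields that have already been parsed. The function name `memento_link`, the `WAYBACK_PREFIX` constant, and the `status` attribute itself are assumptions for illustration; `status` is the extension proposed in this issue, not something the Memento link format currently defines.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumed archive prefix, taken from the example URLs above.
WAYBACK_PREFIX = "https://www.webarchive.org.uk/wayback/archive/"


def memento_link(timestamp, original_url, status=None):
    """Render one TimeMap entry from a 14-digit CDX timestamp.

    `status` is the proposed extra attribute; it is left out when None.
    """
    captured = datetime.strptime(timestamp, "%Y%m%d%H%M%S").replace(
        tzinfo=timezone.utc
    )
    attrs = [
        'rel="memento"',
        # RFC 1123 style date with a literal "GMT" suffix.
        f'datetime="{format_datetime(captured, usegmt=True)}"',
    ]
    if status is not None:
        attrs.append(f'status="{status}"')
    return f"<{WAYBACK_PREFIX}{timestamp}/{original_url}>; " + "; ".join(attrs)


print(memento_link("20151106002344", "http://www.bl.uk/", status="404"))
```

Clients that do not know about the extra attribute should be able to ignore it, since link-format parsers generally skip parameters they do not recognise, but that would need checking against real Memento clients.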

Are there any downsides?

EDIT: I've just realised one possible source of issues. The TimeMap returns the status code from the CDX, i.e. from the original server, but our service can override that to return a 451 status. In our case this doesn't really matter, because we block at URI resolution, so the whole TimeMap returns 451. But if anyone blocks individual instances of a resource, the status in the TimeMap would not match what the archive actually serves. Not sure anyone does that though?

@kris-sigur

Revisit records do not include a status code in the CDX. Usually they represent a 200, but there are cases of deduplication of 301, 302 and 404s.

The actual status code is in the WARC file of course, but it is more expensive to fetch. Maybe this is really a case where the CDX should be fixed.

@anjackson

For the original use case, it might be sufficient to just omit the status code unless we're sure of it.
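A minimal sketch of that "omit unless sure" rule, assuming the CDX status field arrives as a string and that records without a usable status (such as revisit records) carry a placeholder like `-` or an empty value. Both helper names are hypothetical.

```python
import re


def reliable_status(cdx_status):
    """Return the status only if it is a three-digit HTTP code, else None.

    Assumption: records without a known status carry "-" or an empty field.
    """
    if cdx_status and re.fullmatch(r"[1-5]\d\d", cdx_status):
        return cdx_status
    return None


def link_attributes(http_date, cdx_status):
    """Build the attribute part of a TimeMap entry, omitting unknown statuses."""
    attrs = ['rel="memento"', f'datetime="{http_date}"']
    status = reliable_status(cdx_status)
    if status is not None:
        attrs.append(f'status="{status}"')
    return "; ".join(attrs)


print(link_attributes("Fri, 06 Nov 2015 00:23:44 GMT", "404"))
print(link_attributes("Fri, 06 Nov 2015 00:40:51 GMT", "-"))
```

With this approach a revisit record simply produces an entry in today's format, so consumers would have to treat a missing `status` as "unknown" rather than as a 200.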
