CDX-index-based playback: redirect URLs with www clobbering ones without www in them #308

Open
christianleger opened this issue Feb 10, 2016 · 0 comments


I'm running the 2.3.0 distribution of CDX-indexer and OpenWayback.

When replaying a WARC file via a CDX index, I get an 'unavailable' response for a particular URL even though the record is definitely in the WARC file. The CDX index contains an entry for the URL, and that entry points to the correct WARC file and the correct offset.

The URL in question is the redirect target of an otherwise identical www-prefixed URL. In the WARC file, the response for the www-prefixed URL is a 301 (pointing to the URL without www), and the response for the non-www URL is a 200.

Specifically:

http://mgerc-ceegm.gc.ca/index-eng.html has a 200 response entry in the WARC file.
http://www.mgerc-ceegm.gc.ca/index-eng.html has a 301 response entry, which points to http://mgerc-ceegm.gc.ca/index-eng.html.

The cause appears to be in the CDX index, where the www prefix has been dropped and the two captures appear as consecutive entries with the same URL key:

mgerc-ceegm.gc.ca/index-eng.html .... 200
mgerc-ceegm.gc.ca/index-eng.html .... 301
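
To illustrate how the two captures can end up under one key, here is a minimal sketch of a canonicalizer that drops the scheme and a leading `www.`. The function `simple_url_key` is hypothetical and far simpler than whatever the indexer actually does; it only reproduces the collision shown above.

```python
from urllib.parse import urlsplit


def simple_url_key(url: str) -> str:
    """Hypothetical, simplified URL key: lowercase host, no scheme, no leading 'www.'."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    key = host + parts.path
    if parts.query:
        key += "?" + parts.query
    return key


key_200 = simple_url_key("http://mgerc-ceegm.gc.ca/index-eng.html")
key_301 = simple_url_key("http://www.mgerc-ceegm.gc.ca/index-eng.html")

# Both captures collapse onto the same key, so the key alone cannot tell them apart.
print(key_200, key_301, key_200 == key_301)
```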

While debugging, I saw that the replay software finds both the 200 and the 301 entries, but always behaves (towards the webapp frontend) as if only the 301 exists. It picks the 301, tries to follow the redirect, and then fails to display the 200 record. The result states that the requested page is unavailable.

Removing the line having the 301 result from the CDX index file allows me to view the record!
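
This manual workaround could be automated roughly as follows. It is only a sketch under stated assumptions: CDX lines are space-separated, the URL key is the first field, and the status code sits at `status_index` (a hypothetical parameter that should be set from the CDX header line). The function name and the sample lines are made-up placeholders, not taken from a real index.

```python
def drop_shadowing_redirects(lines, status_index=4):
    """Drop 3xx CDX lines whose URL key also has a 200 line.

    Assumes space-separated fields with the URL key first and the status
    code at `status_index` (an assumption; check the CDX header line).
    """
    records = [line.split(" ") for line in lines]
    keys_with_200 = {
        r[0] for r in records
        if len(r) > status_index and r[status_index] == "200"
    }
    kept = []
    for line, r in zip(lines, records):
        is_redirect = len(r) > status_index and r[status_index].startswith("3")
        if is_redirect and r[0] in keys_with_200:
            continue  # this redirect would shadow the 200 capture
        kept.append(line)
    return kept


# Placeholder lines for illustration only.
sample = [
    "example.com/ 20160101000000 http://example.com/ text/html 200 DIGEST1 - - 100 0 sample.warc.gz",
    "example.com/ 20160101000001 http://www.example.com/ text/html 301 DIGEST2 - - 100 500 sample.warc.gz",
]
kept = drop_shadowing_redirects(sample)
print(len(kept))
```

Note that this discards the redirect capture from the index, so the www-prefixed URL would no longer replay as a 301.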

It would be nice if both entries in the CDX index could be found, possibly by changing the CDX index generation so that the www prefix is kept in the key, for example:

example.com .... 200
www.example.com .... 301
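
A minimal sketch of what such key generation might look like, assuming a hypothetical helper `url_key_keep_www` that lowercases the host but leaves any `www.` prefix in place. The toy dictionary stands in for the index and is not how the real index is stored.

```python
from urllib.parse import urlsplit


def url_key_keep_www(url: str) -> str:
    """Hypothetical URL key that lowercases the host but keeps any 'www.' prefix."""
    parts = urlsplit(url)
    return (parts.hostname or "").lower() + parts.path


# Toy index mapping each key to the status code of its capture.
index = {
    url_key_keep_www("http://example.com/"): 200,
    url_key_keep_www("http://www.example.com/"): 301,
}
print(index)
```

With keys like these the two captures stay separate. The trade-off is that requests for the www and non-www forms would no longer share captures.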

Many thanks!
