I'm running the 2.3.0 distribution of CDX-indexer and OpenWayback.
When replaying a WARC file via a CDX index, I'm getting an 'unavailable' response to a particular URL request even though the record is definitely in the WARC file. A reference to the URL exists in the CDX index, which points to the correct WARC file, and the correct offset.
The URL in question is also the new location of a previously www-prefixed identical URL. In the WARC file, the response for the www-prefixed URL is a 301 (pointing to the URL without www), and the one for non-www is a 200.
Specifically:
http://mgerc-ceegm.gc.ca/index-eng.html has a 200 response entry in the WARC file.
http://www.mgerc-ceegm.gc.ca/index-eng.html has a 301 response entry, which points to http://mgerc-ceegm.gc.ca/index-eng.html.
The cause appears to be in the CDX index, where both URLs end up under the same key (without the www prefix). There are two consecutive entries:
mgerc-ceegm.gc.ca/index-eng.html .... 200
mgerc-ceegm.gc.ca/index-eng.html .... 301
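To illustrate how the two captures can end up under one key, here is a minimal sketch. It is not OpenWayback's actual canonicalizer; `simple_url_key` is a hypothetical helper that only assumes what the entries above suggest, namely that the scheme and a leading `www.` are dropped when the key is built.

```python
from urllib.parse import urlsplit


def simple_url_key(url: str) -> str:
    """Hypothetical, simplified index key (an assumption, not the real indexer logic).

    Lowercases the host, strips one leading 'www.', and appends the path.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return host + parts.path


# The two URLs from this report.
key_200 = simple_url_key("http://mgerc-ceegm.gc.ca/index-eng.html")
key_301 = simple_url_key("http://www.mgerc-ceegm.gc.ca/index-eng.html")

# Both captures collapse onto the same key, so a lookup for the
# non-www URL sees the 301 and the 200 as captures of one resource.
print(key_200)
print(key_301)
print(key_200 == key_301)
```

Under that assumption, the 301 for the www URL and the 200 for the non-www URL are indistinguishable by key, which matches the two consecutive entries shown above.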
In debugging, I saw that the replay software finds both the 200 and 301, but always acts (to the webapp frontend) as if only the 301 exists. It finds it, then tries to redirect, then fails to display the 200 record - the result states that the requested page is unavailable.
Removing the line having the 301 result from the CDX index file allows me to view the record!
It would be nice if both entries in the CDX index were findable, possibly by adjusting the CDX index generation to produce something such as:
example.com .... 200
www.example.com .... 301
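A rough sketch of what such key generation could look like, using the two URLs from this report. `host_preserving_key` is a hypothetical helper for illustration only, not an existing cdx-indexer option, and it ignores everything else a real canonicalizer would handle (query strings, ports, and so on).

```python
from urllib.parse import urlsplit


def host_preserving_key(url: str) -> str:
    """Hypothetical key that keeps the host as captured, including any 'www.' prefix."""
    parts = urlsplit(url)
    return (parts.hostname or "").lower() + parts.path


urls = [
    "http://mgerc-ceegm.gc.ca/index-eng.html",      # 200 in the WARC
    "http://www.mgerc-ceegm.gc.ca/index-eng.html",  # 301 in the WARC
]
keys = [host_preserving_key(u) for u in urls]

# With the prefix preserved, the two captures get distinct keys.
for key in keys:
    print(key)
```

With distinct keys, the 301 would only be returned for the www URL, and the redirect target would resolve to the separate 200 entry.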
Many thanks!