The metadata API allows querying with a partial URL prefix and (optionally) a crawl name. Subdomains are ignored.

$ curl "http://statmt.org:8030/query_domain?domain=hettahuskies&crawl=2013_20"

{
  "unique_urls": [
    "http://hettahuskies.com/farm&dogs/AboutHuskies/famous.php",
    "http://hettahuskies.com/",
    "http://hettahuskies.com/landpgru.php",
    "http://hettahuskies.com/Location/When2come/w2cintro.php",
    "http://hettahuskies.com/activities/MultiActivity/MAWAW.php",
    "http://hettahuskies.com/activities/SingleDay/SDlss.php",
    "http://hettahuskies.com/activities/MultiActivity/MASAD.php",
    "http://hettahuskies.com/Location/Location&maps/Scandinavia.php",
    "http://www.hettahuskies.com/",
    "http://www.hettahuskies.com/landpgfr.php",
    "http://hettahuskies.com/activities/FarmVisits/FVintro.php",
    "http://www.hettahuskies.com/landpggr.php",
    "http://hettahuskies.com/activities/MultiDay/MDintro.php",
    "http://hettahuskies.com/Location/Areaattractions/aaintro.php"
  ],
  "query_domain": "hettahuskies",
  "query_path": "",
  "query_crawl": "2013_20"
}
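
The same query works from any HTTP client. Below is a minimal Python sketch using the requests library (an assumption for illustration, not part of the project); the endpoint and parameters are exactly those of the curl call above:

import requests

# Query the metadata API for one domain within one crawl; the
# parameters mirror the curl example above.
resp = requests.get(
    "http://statmt.org:8030/query_domain",
    params={"domain": "hettahuskies", "crawl": "2013_20"},
    timeout=30,
)
resp.raise_for_status()
for url in resp.json()["unique_urls"]:
    print(url)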

Note how subdomains are (intentionally) ignored. If there were URLs with suffixes other than .com, they would all appear here as well. Include the suffix if you need to restrict the query:

$ curl "http://statmt.org:8030/query_domain?domain=hettahuskies.de&crawl=2013_20"

{
  "unique_urls": [],
  "query_domain": "hettahuskies",
  "query_path": "",
  "query_crawl": "2013_20"
}

No results here, because we only have .com URLs for that domain.

If 'crawl' is not specified, we get results from all crawls at once. For now, only 2013_20 is indexed. Get a list of the available crawls using the crawls endpoint:

$ curl "http://statmt.org:8030/crawls"
{
  "crawls": [ "2013_20" ]
}
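
If you prefer to query each crawl explicitly rather than omitting the parameter, the two endpoints combine naturally. A small sketch along the same lines (again assuming the requests library):

import requests

# List all indexed crawls, then run one per-crawl query for a domain.
crawls = requests.get("http://statmt.org:8030/crawls", timeout=30).json()["crawls"]
for crawl in crawls:
    resp = requests.get(
        "http://statmt.org:8030/query_domain",
        params={"domain": "hettahuskies", "crawl": crawl},
        timeout=30,
    )
    print(crawl, len(resp.json()["unique_urls"]))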

You can also query using a prefix of the path:

$ curl "http://statmt.org:8030/query_domain?domain=hettahuskies.com/Loc"

{
  "unique_urls": [
    "http://hettahuskies.com/Location/Location&maps/Scandinavia.php",
    "http://hettahuskies.com/Location/When2come/w2cintro.php",
    "http://hettahuskies.com/Location/Areaattractions/aaintro.php"
  ],
  "query_domain": "hettahuskies",
  "query_path": "/Loc",
  "query_crawl": ""
}

To get the full metadata stored for each URL, just add &full:

curl "http://statmt.org:8030/query_domain?domain=hettahuskies.com/Loc&full"
{
  "unique_urls": [
    "http://hettahuskies.com/Location/Location&maps/Scandinavia.php", 
    "http://hettahuskies.com/Location/When2come/w2cintro.php", 
    "http://hettahuskies.com/Location/Areaattractions/aaintro.php"
  ], 
  "query_domain": "hettahuskies", 
  "query_path": "/Loc", 
  "data": {
    "http://hettahuskies.com/Location/Location&maps/Scandinavia.php": [
      [
        "2013_20", 
        {
          "length": "3763", 
          "offset:": "122159591", 
          "filename": "https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368698238192/warc/CC-MAIN-20130516095718-00091-ip-10-60-113-184.ec2.internal.warc.gz"
        }
      ]
    ], 
    "http://hettahuskies.com/Location/When2come/w2cintro.php": [
      [
        "2013_20", 
        {
          "length": "7763", 
          "offset:": "124940270", 
          "filename": "https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368698411148/warc/CC-MAIN-20130516100011-00070-ip-10-60-113-184.ec2.internal.warc.gz"
        }
      ]
    ], 
    "http://hettahuskies.com/Location/Areaattractions/aaintro.php": [
      [
        "2013_20", 
        {
          "length": "4140", 
          "offset:": "123806691", 
          "filename": "https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368700212265/warc/CC-MAIN-20130516103012-00009-ip-10-60-113-184.ec2.internal.warc.gz"
        }
      ]
    ]
  }, 
  "query_crawl": ""
}

For now this includes the full location of the source file, plus the length and offset at which the data is located. For example, to get the last entry (http://hettahuskies.com/Location/Areaattractions/aaintro.php), compute the byte range as (offset, offset + length - 1), here (123806691, 123806691 + 4140 - 1) = (123806691, 123810830), and download it using e.g. curl:

$ curl -r 123806691-123810830 https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368700212265/warc/CC-MAIN-20130516103012-00009-ip-10-60-113-184.ec2.internal.warc.gz | \
zcat | head -n 25

WARC/1.0
WARC-Type: response
WARC-Date: 2013-05-21T16:36:12Z
WARC-Record-ID: <urn:uuid:fa940850-e840-4dbc-9dda-8a12a983ea7f>
Content-Length: 11265
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:0e0dc031-834b-4214-8734-436e417840b7>
WARC-Concurrent-To: <urn:uuid:1c9a8321-669c-44af-9d26-162048519f6d>
WARC-IP-Address: 85.13.248.54
WARC-Target-URI: http://hettahuskies.com/Location/Areaattractions/aaintro.php
WARC-Payload-Digest: sha1:FJYCE63U6QKI5EH7N5NVMBQB2GJA4TRE
WARC-Block-Digest: sha1:SBSHBG6E2ED3BPT5CU3QLMFLSF5WXF5G

HTTP/1.0 200 OK
Date: Tue, 21 May 2013 16:36:05 GMT
Content-Type: text/html; charset=UTF-8
Connection: close
Server: Apache
X-Powered-By: PHP/5.2.17

<U+FEFF>

<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dt
d">
<html [...]

The format is: WARC header, empty line, HTTP response header, empty line, HTML content. Check out download_candidates.py for Python downloading code that uses connection pools.
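
For a rough idea of what such download code looks like, here is a minimal Python sketch (not download_candidates.py itself; it assumes the requests library and a filename/offset/length triple taken from a &full response as above). A single requests.Session reuses TCP connections, which is the same connection-pooling idea:

import gzip
import requests

# One Session reuses TCP connections (urllib3 pooling underneath).
session = requests.Session()

def fetch_record(filename, offset, length):
    # Each record in a .warc.gz is an independent gzip member, so the
    # byte range [offset, offset + length - 1] decompresses on its own.
    headers = {"Range": "bytes=%d-%d" % (offset, offset + length - 1)}
    resp = session.get(filename, headers=headers, timeout=60)
    resp.raise_for_status()  # expect 206 Partial Content
    return gzip.decompress(resp.content).decode("utf-8", errors="replace")

# Values from the last entry of the &full example above.
record = fetch_record(
    "https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/"
    "CC-MAIN-2013-20/segments/1368700212265/warc/"
    "CC-MAIN-20130516103012-00009-ip-10-60-113-184.ec2.internal.warc.gz",
    offset=123806691,
    length=4140,
)
print(record[:300])  # WARC header, HTTP headers, then the HTML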

Results from the metadata API are limited to 10,000 per request, just to keep the result size reasonable. It's not super fast, but I assume that's due to either the Python interface or, more likely, to touching many files over NFS. For batch processing we can access the DB locally and use the C++ interface.
