Scripts for generating content inventory-style page stats from a Nutch crawl.
This is a set of scripts for generating a CSV file of per-page stats from a single-website crawl with Nutch, of the kind you would use for a content audit. It was developed against Nutch 1.11, but may work with other versions.
Assuming you have generated a Nutch crawl via the crawl command, e.g.
bin/crawl urls/ MyCrawl 4
you can generate a dump of per-page stats with the following commands:
perl makedump.pl MyCrawl
perl processcrawl.pl MyCrawl 4 > MyCrawlStats.csv
The output (MyCrawlStats.csv) is a CSV file formatted for Excel with the following fields:
Url, Type, Extension, Host, Page Title, Word Count, In Links Count, Out Links Count, Self Links Count, Crawl Status, Depth, Visit Count
For more information on generating the Visit Count field, see the Visit Counts section below.
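As an entirely hypothetical example, a single row might look like this (the exact field values and formats depend on your crawl and on the scripts' output):
http://www.mysite.org/about.html,text/html,html,www.mysite.org,About Us,412,12,25,3,db_fetched,2,87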
The makedump.pl script automates dumping the crawl db, link db, and crawl segments generated by the Nutch crawl into text format.
Usage: makedump.pl <Crawl Directory>
Crawl Directory: Name of the crawl directory given as the 2nd argument to Nutch's bin/crawl.
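For reference, the dumps it produces correspond roughly to what you would get from the standard Nutch reader tools run by hand, along these lines (the exact options and output paths used by the script may differ):
bin/nutch readdb MyCrawl/crawldb -dump crawldb_dump
bin/nutch readlinkdb MyCrawl/linkdb -dump linkdb_dump
bin/nutch readseg -dump MyCrawl/segments/<segment> segment_dump
(readseg -dump is run once per segment directory.)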
The processcrawl.pl script uses the output of makedump.pl to generate the per-page stats for each page in the crawl.
Usage: processcrawl.pl <Crawl Directory> <N Segments> [Visit File] [Host]
Crawl Directory: Name of the crawl directory given as the 2nd argument to Nutch's bin/crawl.
N Segments: Number of segments, i.e. the crawl depth (the 3rd argument to bin/crawl).
Visit File: (Optional) File of visit counts from the website.
Host: (Optional, used when processing a visit file) Host name to prefix to the paths in the visit file (without a trailing slash).
Crawling with Nutch via the bin/crawl script provided with the distribution is relatively simple. The bin/crawl script takes three arguments: a seed directory, a crawl directory, and a crawl depth (also called "num rounds"), e.g.,
bin/crawl urls/ MyCrawl 4
The seed directory (urls/ in the example) should contain a file seed.txt with one seed URL per line. If you are doing a content inventory for a website, this will probably be the URL of your website (e.g. www.mysite.org).
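For example, urls/seed.txt for a single-site inventory might contain just one line (Nutch expects full URLs, including the protocol):
http://www.mysite.org/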
The crawl directory (MyCrawl in the example) shouldn't already exist. This is where Nutch will store the data files it needs to orchestrate its crawl.
Your crawl depth setting will depend on how large your website is, how content is organized on the site, and how extensively you want to crawl it. I find that a crawl depth of 4 is usually sufficient; for a medium to large website, such a crawl can take several hours and result in thousands of crawled pages. You can think of the crawl depth as the number of successive clicks you expect a user to make to find content: a crawl depth of 4 will find all content that is reachable within 4 successive clicks starting from the home page.
For full documentation on crawling with Nutch, see the Nutch website and the Nutch tutorial.
If you are doing a content inventory, you will probably want to change some of the default Nutch settings. The primary places where Nutch settings are configured are nutch-default.xml (or nutch-site.xml, if you want site-specific rules kept separate from the default rule set) and regex-urlfilter.txt.
N.B. The versions of the config files that should be edited are the ones located in runtime/local/conf, not the ones in the top-level conf directory of the distribution.
You must set an agent name (http.agent.name) for your crawler in order for the crawl to work; see the Nutch documentation for more details.
By default Nutch will not add internal links to the links database. In order to be able to count in links from within your website, you must change db.ignore.internal.links to false.
By default Nutch limits the number of Outlinks that will be processed for a page to 100. For a full content inventory, set db.max.outlinks.per.page to -1.
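A minimal sketch of these three overrides in runtime/local/conf/nutch-site.xml might look like the following (the agent name value is a placeholder; use your own):
<?xml version="1.0"?>
<configuration>
  <!-- Required: identify your crawler -->
  <property>
    <name>http.agent.name</name>
    <value>MyContentAuditBot</value>
  </property>
  <!-- Record internal links so in links from your own site are counted -->
  <property>
    <name>db.ignore.internal.links</name>
    <value>false</value>
  </property>
  <!-- Process all outlinks on a page instead of only the first 100 -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
</configuration>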
You may want to ignore robots.txt rules for the host you are crawling in order to get a full inventory of your site. This is controlled with the http.robot.rules.whitelist setting and should be used very carefully. Also note that in Nutch 1.11 the robots whitelist feature is broken; to get it to work, check out a snapshot release from the Nutch repository.
By default, Nutch skips URLs with image and document suffixes. If you want to see these files in your crawl, modify the regex filter in regex-urlfilter.txt to remove the appropriate extensions (gif/GIF/jpg/JPG/etc.).
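In the stock regex-urlfilter.txt this is a single rule that looks roughly like the following (the exact extension list varies by Nutch version); delete the extensions you want included in your inventory:
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$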
If you only want to crawl one domain, you should remove the "accept anything else" rule from the end of the regex filter list in regex-urlfilter.txt and add a rule that limits the crawl to your domain and/or its subdomains, e.g.:
# Filter out the data subdomain of my site
-^http://data.mysite.org/
# Allow from any other subdomain of my site
+^http://.*.mysite.org/
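With the domain rules in place, the catch-all rule at the end of the stock filter file can simply be removed or commented out:
# accept anything else
# +.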
The processcrawl.pl script is designed to incorporate visit counts from an Apache log when given a preprocessed file of counts and URLs, as would be output by a command-line uniq -c. For example, to generate this file from a directory of log files for a single month (June 2015):
cut -f4,5 logs/access_log_2015-06-* | grep -v POST | grep -v OPTIONS | cut -f2 | sort | uniq -c > visited-raw-june-2015.txt
Note: The format of your log file and log file name may vary.
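The resulting visit file should contain a count followed by a path on each line, in standard uniq -c style; a purely hypothetical example:
    312 /
     87 /about.html
     15 /news/2015/june-update.html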
To process the crawl with the log file information, run
perl makedump.pl MyCrawl
perl processcrawl.pl MyCrawl 4 visited-raw-june-2015.txt http://yourhostname.com
The final argument prefixes the host name to each path from the log file so that the URLs can be matched up with the crawl, since the host is typically not present in the access log entries.