This project intends to document the pitfalls and tricks of running Nutch in an AWS EMR Hadoop cluster.
The first step is to get one machine set up with the basic tools and configuration. I find that the simplest thing to do is to launch a t1.micro EC2 instance with the Amazon Linux AMI, as it comes with most of the tools I need (the AWS command-line tools among them), it is very easy to replicate, and it is very inexpensive. I won't cover this step here as it is well documented on the web. (Hint: the Amazon Linux AMI is the first choice in the Quick Start tab of the web-based "Classic Wizard" EC2 launcher.) Make sure your security group and key pair allow you to ssh into this machine.
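If you prefer the command line to the web wizard, something along these lines should work. This is only a sketch, assuming the AWS CLI is configured with your credentials; the AMI id, key pair name and security group below are placeholders you must replace:
aws ec2 run-instances --image-id ami-XXXXXXXX --count 1 --instance-type t1.micro --key-name YOUR_KEY_NAME --security-groups YOUR_SSH_GROUP   # launch one t1.micro instance from an Amazon Linux AMI
aws ec2 describe-instances --query "Reservations[].Instances[].[InstanceId,PublicDnsName,State.Name]"   # find the public DNS name of the new instance once it is running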
-
ssh to the Amazon Linux AMI instance and create a working folder, e.g. ~/nutch-aws; we will refer to it as NUTCH_AWS_HOME from now on.
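For example (the key file and public DNS name below are placeholders; ec2-user is the default user on the Amazon Linux AMI):
ssh -i YOUR_KEY_NAME.pem ec2-user@ec2-XX-XX-XX-XX.compute-1.amazonaws.com   # log into the instance
mkdir -p ~/nutch-aws   # this folder is what we call NUTCH_AWS_HOME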
-
scp your key-pair (.pem) file to this instance under NUTCH_AWS_HOME
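Run this from your local machine, not from the instance (placeholders again):
scp -i YOUR_KEY_NAME.pem YOUR_KEY_NAME.pem ec2-user@ec2-XX-XX-XX-XX.compute-1.amazonaws.com:~/nutch-aws/   # copy the key pair into NUTCH_AWS_HOME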
-
ssh back to the Amazon Linux AMI instance
-
Yum install ant
sudo yum install ant -y
-
Get the Makefile* from github into NUTCH_AWS_HOME
wget https://raw.github.com/eleflow/nutch-aws/master/Makefile
-
Fill in the blanks in the Makefile
ACCESS_KEY_ID = ## YOUR ACCESS KEY ID
SECRET_ACCESS_KEY = ## YOUR ACCESS KEY SECRET
AWS_REGION = us-east-1 ## CHANGE IT IF YOU WANT
EC2_KEY_NAME = ## YOUR EC2 KEY PAIR NAME
KEYPATH = ${HOME}/${EC2_KEY_NAME}.pem ## YOUR KEY PAIR FILE (IF IT'S DIFFERENT THAN ${HOME}/${EC2_KEY_NAME}.pem)
S3_BUCKET = ## THE S3 BUCKET WHERE FILES WILL BE READ FROM AND WRITTEN TO
CLUSTERSIZE = 3 ## NUMBER OF MACHINES IN THE CLUSTER
DEPTH = 3 ## HOW MANY LINK HOPS THE CRAWLER WILL GO
TOPN = 5 ## HOW MANY OUTLINKS WILL BE FOLLOWED
MASTER_INSTANCE_TYPE = m1.small
SLAVE_INSTANCE_TYPE = m1.small
-
Checking the configuration:
make s3.list
This should list the S3 buckets associated with your account if the configuration is correct
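If make s3.list does not work, you can sanity-check the credentials themselves with the AWS CLI, independently of the Makefile (a sketch, assuming the AWS CLI is installed; use the same key id and secret you put in the Makefile):
AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY=YOUR_ACCESS_KEY_SECRET aws s3 ls   # should print the same bucket list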
-
Create a NUTCH_AWS_HOME/urls/seed.txt file with the URLs that will be the starting point for the crawler.
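seed.txt is just a plain text file with one URL per line, for example:
http://nutch.apache.org/
http://wiki.apache.org/nutch/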
make bootstrap
This make target will:
- download the Nutch 1.6 source code
- build the Nutch 1.6 MapReduce job jar
- copy the Nutch 1.6 MapReduce job jar to s3://S3_BUCKET/lib
- copy the contents of the NUTCH_AWS_HOME/urls folder to s3://S3_BUCKET/url (you can verify both uploads as shown below)
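To verify the uploads (a sketch, assuming the AWS CLI is available; YOUR_BUCKET stands for the S3_BUCKET value from the Makefile):
aws s3 ls s3://YOUR_BUCKET/lib/   # the Nutch job jar should show up here
aws s3 ls s3://YOUR_BUCKET/url/   # the seed file(s) should show up here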
make create
This make target will:
- start an EMR cluster and run the Nutch crawl MapReduce jobs on it
- copy the logs to s3://S3_BUCKET/logs
If everything went well, the content of the ./jobflowid file should be a jobflow id (e.g. j-IR4OQTH2HE7Z); you can use it to check on the cluster, as shown below.
Note: the cluster is launched with "keep_job_flow_alive_when_no_steps" set to false which means it will be destroyed after the steps are completed.
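While the cluster is up you can check its state with that jobflow id. This is a sketch using the AWS CLI's emr commands, which are not part of the Makefile itself:
aws emr describe-cluster --cluster-id $(cat jobflowid) --query "Cluster.Status.State"   # e.g. STARTING, RUNNING, TERMINATED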
make ssh
This will ssh into the master node, giving you access to the hadoop command line tool and to the logs at /mnt/var/log/hadoop.
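Once on the master node, a couple of things worth trying (the HDFS layout depends on what your crawl produced, so this only lists the root):
hadoop fs -ls /          # browse what the crawl wrote to HDFS
ls /mnt/var/log/hadoop   # the Hadoop and step logs mentioned above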
make destroy
This will kill any job that the cluster may be running and terminate the cluster.
[*] Yeah, it's a Makefile. I based it on Karan's, and it may not be the best tool for the job, but it was an easy way to get things rolling quickly.