This guide assumes the following:
- Users are allowed to ssh into and between nodes allocated for their job without having to provide a password
- Users are allowed to open ports > 1024 on their jobs' nodes
- Each node has a local scratch space that is not shared between nodes† and that users can write to.
- For the sake of simplicity, we also assume nodes are not shared between jobs. Hadoop stresses aspects of the system (like I/O) that are not typically considered in resource allocation, so having Hadoop and non-Hadoop jobs on a shared node introduces a lot of unnecessary complexity.
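If you want to sanity-check the first two assumptions from inside a running job, something like the following should work. This is only a rough sketch: $PBS_NODEFILE is Torque-specific, and /scratch/$USER is just a stand-in for whatever your site calls its node-local scratch space.
$ for node in $(sort -u $PBS_NODEFILE); do
>     ssh -o BatchMode=yes $node "touch /scratch/$USER/.myhadoop-test && echo $node OK"
> done
Using BatchMode=yes makes ssh fail outright instead of prompting, so any node that still wants a password shows up immediately.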
We are also installing Apache Hadoop version 1, as opposed to version 2. This is a critical distinction because Hadoop version 2 is architecturally (and configurationally) very different from version 1. Hadoop 1 is what is in widespread production today, and it is the version for which most of the Hadoop application ecosystem has been developed. I will be incorporating Hadoop 2/YARN support into myHadoop in the future, but it's not there yet.
The exact version of Hadoop 1 is less important; I am using the latest (as of this writing) release, Hadoop 1.2.1, but there is no reason this procedure would not work for earlier versions; we've tested it with 1.0.4, 1.1.1, and the ancient 0.20 releases.
† Although this guide assumes each node has its own user-writeable local scratch, this is not a requirement for deploying Hadoop on traditional supercomputers. I will try to cover the more advanced topic of running Hadoop on shared/parallel filesystems in a future guide.
You can download the tarballs from the following locations:
Neither of these tarballs requires any sort of proper "installation"; simply unpack them wherever you'd like them installed, e.g.,
$ mkdir ~/hadoop-stack
$ cd ~/hadoop-stack
$ tar zxvf hadoop-1.2.1-bin.tar.gz
hadoop-1.2.1/
hadoop-1.2.1/.eclipse.templates/
hadoop-1.2.1/.eclipse.templates/.externalToolBuilders/
hadoop-1.2.1/.eclipse.templates/.launches/
hadoop-1.2.1/bin/
hadoop-1.2.1/c++/
...
$ tar zxvf myhadoop-0.30b.tar.gz
myhadoop-0.30b/
myhadoop-0.30b/CHANGELOG
myhadoop-0.30b/LICENSE
myhadoop-0.30b/README.md
myhadoop-0.30b/bin/
myhadoop-0.30b/bin/myhadoop-bootstrap.sh
...
myHadoop ships with a patch file, myhadoop-1.2.1.patch, that converts the default configuration files that ship with Apache Hadoop into templates that myHadoop can then copy and modify when a user wants to run a job. To apply that patch,
$ cd hadoop-1.2.1/conf
$ patch < ../../myhadoop-0.30b/myhadoop-1.2.1.patch
patching file core-site.xml
patching file hdfs-site.xml
patching file mapred-site.xml
...and that's it. You now have Hadoop installed on your cluster.
When you want to spin up a Hadoop cluster, you will need to do the following from within a job environment: either a non-interactive job script or an interactive job.
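For example, on a Torque/PBS system (which is what the $PBS_JOBID references below assume) you might grab an interactive job like this. The node count, ppn, and walltime are just placeholders for whatever your site requires:
$ qsub -I -l nodes=4:ppn=16 -l walltime=01:00:00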
Define $HADOOP_HOME, which, in the previous section's example, would be $HOME/hadoop-stack/hadoop-1.2.1:
$ export HADOOP_HOME=$HOME/hadoop-stack/hadoop-1.2.1
myHadoop needs to know where this is because $HADOOP_HOME/conf contains the patched configuration templates it will use to build actual cluster configurations.
Adding $HADOOP_HOME/bin and your myHadoop installation directory to $PATH is just a matter of additional convenience:
$ export PATH=$HADOOP_HOME/bin:$PATH
$ export PATH=$HOME/hadoop-stack/myhadoop-0.30b/bin:$PATH
If your system does not have $JAVA_HOME properly set, you must also define that:
$ export JAVA_HOME=/usr/java/latest
Define $HADOOP_CONF_DIR, which will be where your personal cluster's configuration will be located. This is an arbitrary choice, but I like to use the jobid, e.g.,
$ export HADOOP_CONF_DIR=$HOME/mycluster-conf-$PBS_JOBID
to ensure that running multiple Hadoop cluster jobs on the same supercomputer doesn't cause them all to step on each other's config directories.
The myhadoop-configure.sh script (which is in the bin/ directory of your myHadoop install directory) extracts all of the job information (node names, node count, etc) from your supercomputer's resource manager (Torque, Grid Engine, SLURM, etc) and populates the configuration directory ($HADOOP_CONF_DIR) with the Hadoop config files necessary to make the Hadoop cluster use the nodes and resources assigned to it by the resource manager.
The general syntax for myhadoop-configure.sh is
$ myhadoop-configure.sh -c $HADOOP_CONF_DIR -s /scratch/$USER/$PBS_JOBID
where
- -c $HADOOP_CONF_DIR specifies the directory where your Hadoop cluster configuration should reside. Provided you defined the $HADOOP_CONF_DIR environment variable as recommended above, this will always be what you specify with the -c flag.
- -s /scratch/$USER/$PBS_JOBID specifies the location of the scratch space local to each compute node. This directory should reside on a local filesystem on each compute node, and it cannot be a shared filesystem.
Upon running this myhadoop-configure.sh script, $HADOOP_CONF_DIR will be created and populated with the necessary configuration files. In addition, the HDFS filesystem for this new Hadoop cluster will be created (in the directory specified after the -s flag on each compute node) and formatted.
Once myhadoop-configure.sh has been run, the Hadoop cluster is ready to go. Once again, ensure that the $HADOOP_CONF_DIR environment variable is defined, and then use the start-all.sh script that comes with Hadoop (it's located in $HADOOP_HOME/bin).
$ $HADOOP_HOME/bin/start-all.sh
Warning: $HADOOP_HOME is deprecated.
starting namenode, logging to /scratch/username/1234567.master/logs/hadoop-username-namenode-node-6-78.sdsc.edu.out
node-6-78: Warning: $HADOOP_HOME is deprecated.
node-6-78:
node-6-78: starting datanode, logging to /scratch/username/1234567.master/logs/hadoop-username-datanode-node-6-78.sdsc.edu.out
...
The warnings about $HADOOP_HOME being deprecated are harmless and the result of Hadoop 1.x (as opposed to 0.20) preferring $HADOOP_PREFIX instead of $HADOOP_HOME. If you find it extremely annoying, you can add export HADOOP_HOME_WARN_SUPPRESS=TRUE to your ~/.bashrc to suppress it.
To verify that your Hadoop cluster is up and running, you can do something like
$ hadoop dfsadmin -report
Configured Capacity: 899767603200 (837.97 GB)
Present Capacity: 899662729261 (837.88 GB)
DFS Remaining: 899662704640 (837.88 GB)
DFS Used: 24621 (24.04 KB)
DFS Used%: 0%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 3 (3 total, 0 dead)
Name: 10.5.101.146:50010
...
Or you can try making a few directories on HDFS and loading up some examples...
$ hadoop dfs -mkdir data
$ wget http://www.gutenberg.org/cache/epub/2701/pg2701.txt
...
2014-02-09 14:01:12 (939 KB/s) - "pg2701.txt" saved [1257274/1257274]
$ hadoop dfs -put pg2701.txt data/
$ hadoop dfs -ls data
Found 1 items
-rw-r--r-- 3 username supergroup 1257274 2014-02-09 14:01 /user/username/data/pg2701.txt
And running the wordcount example that comes with Hadoop:
$ hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar wordcount data/pg2701.txt wordcount-output
14/02/09 14:03:21 INFO input.FileInputFormat: Total input paths to process : 1
14/02/09 14:03:21 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/02/09 14:03:21 WARN snappy.LoadSnappy: Snappy native library not loaded
14/02/09 14:03:21 INFO mapred.JobClient: Running job: job_201402091353_0001
14/02/09 14:03:22 INFO mapred.JobClient: map 0% reduce 0%
14/02/09 14:03:27 INFO mapred.JobClient: map 100% reduce 0%
14/02/09 14:03:34 INFO mapred.JobClient: map 100% reduce 33%
14/02/09 14:03:36 INFO mapred.JobClient: map 100% reduce 100%
14/02/09 14:03:37 INFO mapred.JobClient: Job complete: job_201402091353_0001
14/02/09 14:03:37 INFO mapred.JobClient: Counters: 29
...
And verify the output:
$ hadoop dfs -cat wordcount-output/part-r-00000
"'A 3
"'Also 1
"'Are 1
"'Aye, 2
...
Once you are done with your Hadoop cluster, you can stop it using the stop-all.sh script provided with Hadoop:
$ stop-all.sh
stopping jobtracker
gcn-6-78: stopping tasktracker
gcn-7-85: stopping tasktracker
...
stopping namenode
gcn-7-85: stopping datanode
gcn-6-78: stopping datanode
...
gcn-6-78: stopping secondarynamenode
Then run the myhadoop-cleanup.sh script included with myHadoop. It will copy the Hadoop logfiles off of your jobtracker (useful for debugging failed jobs) and put them in your $HADOOP_CONF_DIR, then delete all of the temporary files that Hadoop creates all over each compute node.
$ myhadoop-cleanup.sh
Copying Hadoop logs back to /home/username/mycluster-conf-1234567.master/logs...
...
removed directory: `/scratch/username/1234567.master/mapred_scratch/taskTracker'
removed directory: `/scratch/username/1234567.master/mapred_scratch/ttprivate'
...
Strictly speaking, neither of these steps (stop-all.sh and myhadoop-cleanup.sh) is necessary, since most supercomputers will clean up temporary user files on each node after your job ends, but it doesn't hurt to do these two steps explicitly to save on potential headaches.
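Putting all of the above together, a complete non-interactive job script might look something like the following. This is only a sketch: it assumes Torque/PBS, the install paths from the examples earlier in this guide, and an input file (pg2701.txt) already sitting in your home directory, so adjust the paths and resource requests for your own site.
#!/bin/bash
#PBS -N myhadoop-test
#PBS -l nodes=4:ppn=16
#PBS -l walltime=01:00:00

# locations of the Hadoop and myHadoop installs from earlier in this guide
export HADOOP_HOME=$HOME/hadoop-stack/hadoop-1.2.1
export PATH=$HADOOP_HOME/bin:$HOME/hadoop-stack/myhadoop-0.30b/bin:$PATH
export JAVA_HOME=/usr/java/latest

# per-job cluster configuration directory
export HADOOP_CONF_DIR=$HOME/mycluster-conf-$PBS_JOBID

# build the per-job configuration, format HDFS, and start the cluster
myhadoop-configure.sh -c $HADOOP_CONF_DIR -s /scratch/$USER/$PBS_JOBID
$HADOOP_HOME/bin/start-all.sh

# run a simple map/reduce job and pull the results back out of HDFS
hadoop dfs -mkdir data
hadoop dfs -put $HOME/pg2701.txt data/
hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar wordcount data/pg2701.txt wordcount-output
hadoop dfs -get wordcount-output $HOME/wordcount-output.$PBS_JOBID

# shut down the cluster and clean up the compute nodes
$HADOOP_HOME/bin/stop-all.sh
myhadoop-cleanup.sh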
Although the preferred way of using myHadoop is to put HDFS on node-local disk, this has the drawback that the HDFS state does not persist once your myHadoop job ends and the node-local scratch space is (presumably) purged by the resource manager.
To address this limitation, myHadoop also provides a "persistent" mode whereby the namenode and datanode data are actually stored on some persistent, shared filesystem (like Lustre) and linked into the same node-local location on each Hadoop node.
To configure a "persistent" myHadoop cluster, add the -p flag to specify a location on the shared filesystem to act as the true storage backend for all of your datanodes and namenode, e.g.,
myhadoop-configure.sh -p /path/to/shared/filesystem \
-c /your/new/config/dir \
-s /path/to/node/local/storage
In this case, /path/to/shared/filesystem would be your space on a filesystem accessible to all of your Hadoop nodes, /your/new/config/dir is the same arbitrary configuration directory for the cluster you want to spin up (this does not need to be the same as any previous persistent states), and /path/to/node/local/storage remains a path to a filesystem that is NOT shared across nodes (e.g., /tmp).
Persistent mode then creates directories under /path/to/shared/filesystem that store the HDFS data for each datanode. In addition, it creates a /path/to/shared/filesystem/namenode_data directory which contains the namenode state (e.g., fsimage). Finally, it creates symlinks in /path/to/node/local/storage on each datanode that point back to this shared filesystem.
You can then safely run your map/reduce jobs, stop-all.sh to shut down the cluster, and even myhadoop-cleanup.sh to wipe out your compute nodes. At a later time, you can request a new set of nodes from your supercomputer's batch scheduler, run the same myhadoop-configure.sh command with the -p option pointing to the same /path/to/shared/filesystem, and myHadoop will detect an existing persistent HDFS state and adjust the resulting Hadoop cluster configurations accordingly. You can use this mechanism to store data on HDFS even when you have no jobs running in the batch system.
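In practice, re-attaching to a persistent HDFS from a later job is just a matter of repeating the same commands with the same -p directory. A sketch, reusing the placeholder paths from above:
$ export HADOOP_CONF_DIR=$HOME/mycluster-conf-$PBS_JOBID
$ myhadoop-configure.sh -p /path/to/shared/filesystem \
>                       -c $HADOOP_CONF_DIR \
>                       -s /scratch/$USER/$PBS_JOBID
$ $HADOOP_HOME/bin/start-all.sh
$ hadoop dfs -ls data
If the persistent state was detected correctly, the data you loaded in the earlier job should still be listed.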
IMPORTANT NOTE: Use of persistent mode is not recommended, as Hadoop's performance and resiliency arise from the fact that HDFS resides on physically discrete storage devices. By pointing all of your datanodes' HDFS blocks at the same persistent storage device (a SAN, NFS-mounted storage, etc), you lose the data parallelism and resulting performance that make Hadoop useful. You are, in effect, shooting yourself in the foot by doing this. The only potential exception is if you use a parallel clustered filesystem (like Lustre) as the persistent storage device; the parallelism underneath that filesystem may allow you to recover some of the performance loss because it will store your HDFS blocks on different object storage targets. However, other bottlenecks and limitations also enter the picture.
myHadoop provides a simple facility to run all HDFS and mapreduce traffic over IPoIB (IP over InfiniBand) interfaces. By specifying the -i option followed by a regular expression transformation (which is passed to sed), the list of hostnames provided by the resource manager can be modified to be IPoIB hostnames.
For example, if the IPoIB interface for "node-0-1" is "node-0-1.ibnet", then the necessary transformation would be
-i 's/$/.ibnet/'
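Putting this together with the other flags, the invocation might look like the following. The .ibnet suffix is just an example; use whatever naming convention your site's IPoIB interfaces actually follow:
$ myhadoop-configure.sh -c $HADOOP_CONF_DIR \
>                       -s /scratch/$USER/$PBS_JOBID \
>                       -i 's/$/.ibnet/'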
The myhadoop-configure.sh command has the following command-line options that can also be set via the environment variables listed below:
Switch | Environment Variable | Description |
---|---|---|
-n | NODES | number of nodes to use for the Hadoop cluster |
-p | MH_PERSIST_DIR | location of persistent HDFS data |
-c | HADOOP_CONF_DIR | location to use when building the new Hadoop config directory |
-s | MH_SCRATCH_DIR | location of node-local storage for HDFS data |
-h | HADOOP_HOME | location of the Hadoop installation containing the myHadoop templates in $HADOOP_HOME/conf |
-i | MH_IPOIB_TRANSFORM | regex (passed to sed -e) to transform hostnames from the resource manager into IP over InfiniBand hostnames |
The order of precedence, from lowest to highest, is:
- etc/myhadoop.conf is loaded
- Environment variables are loaded
- Command line switches are evaluated
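For example, if your site's myhadoop.conf sets MH_SCRATCH_DIR=/scratch/$USER, you can still override it for a single job either through the environment or on the command line, and the command-line switch wins. A sketch (the /tmp and /state/partition1 paths are just examples of node-local filesystems your system may or may not have):
$ export MH_SCRATCH_DIR=/tmp/$USER
$ myhadoop-configure.sh -c $HADOOP_CONF_DIR -s /state/partition1/$USER/$PBS_JOBID
Here the environment variable overrides the site default, and the -s flag overrides both.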
myHadoop allows systems administrators to set up default configurations that make the most sense for an entire system via the myhadoop.conf file, located at ../etc/myhadoop.conf relative to the myhadoop-configure.sh script. This file contains simple bash lines which are evaluated before myhadoop-configure.sh builds its configurations. As mentioned above, the variables set in this file are the defaults, which can be overridden by the user environment OR command-line switches when the myhadoop-configure.sh command is called.
All lines in myhadoop.conf are evaluated as-is, so significant system-specific modification of the myhadoop-configure.sh script can be made via myhadoop.conf. Systems administrators may wish to define the following variables (a sketch of such a file follows the list below):
- MH_SCRATCH_DIR - sets the location of the node-local storage filesystem. Can contain other variables, e.g., MH_SCRATCH_DIR=/scratch/$USER
- HADOOP_HOME - sets the location of the Hadoop installation. This will propagate through the entire Hadoop configuration process
- MH_IPOIB_TRANSFORM - sets the transformation to ensure that Hadoop runs over the IP over InfiniBand interfaces.
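A minimal myhadoop.conf might look something like this. It is only a sketch; the paths and the .ibnet suffix are assumptions about your site, not shipped defaults:
# etc/myhadoop.conf -- site-wide defaults, evaluated as plain bash
# by myhadoop-configure.sh before it builds a cluster configuration
MH_SCRATCH_DIR=/scratch/$USER
HADOOP_HOME=/opt/hadoop/hadoop-1.2.1
MH_IPOIB_TRANSFORM='s/$/.ibnet/'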
After running the first hello-world samples for myHadoop on an HPC cluster, you will likely want to see how well it actually performs. Out of the box, Hadoop's default configuration will most likely not make full use of the big iron it is now running on, so a couple of tuning variables have been added to help Hadoop run better in its new home.
Hadoop was designed to run in an environment where failure was deemed common, and one of its tricks is a default replication factor of 3 for all HDFS blocks. With myHadoop, you are either running in a non-persistent environment, where this replication is not very useful since the HDFS storage is deleted right after the cluster is torn down, or you are using persistent storage on something like Lustre or GPFS, where your friendly cluster admins have already taken plenty of care to keep that storage safe. In both cases, it is safe to choose a replication factor of 1 to lower the total amount of I/O.
Another option to play with is the HDFS block size. The best block size depends very much on the amount of data your job handles and the number of mappers/reducers available to handle it. When handling larger data sizes, going up from the default 64 MB to 128 MB or 256 MB may give you better performance.
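If you just want to experiment with these two settings before committing them to your site-wide templates, both dfs.replication and dfs.block.size can, as far as I know, be overridden client-side at the time data is written into HDFS (134217728 bytes is 128 MB). This is only a sketch, so verify the behavior against your own Hadoop version:
$ hadoop dfs -D dfs.replication=1 -D dfs.block.size=134217728 -put pg2701.txt data/
$ hadoop fsck /user/$USER/data/pg2701.txt -blocks
The fsck output should show a single replica per block and the larger block size for that file.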
To make use of the multi-core nodes you are likely running on, Hadoop must also be told that it may start more than the default of 2 mappers/reducers per node. In the example etc/myhadoop.conf-tuning, the script checks the number of cores available on each node via the resource manager's variables and uses half of them for mappers and a quarter for reducers. The datanode and tasktracker daemons on each node each need a core of their own, but it is safe to play around with these variables to optimize.
Hadoop is also given a hint about the total number of mappers/reducers available to a job, based on the above per-node values multiplied by the number of nodes on which myHadoop is running. On an HPC cluster in particular, it only takes one variable to change this in a submit script, so it makes sense to make Hadoop aware of it.
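You can experiment with the per-job side of this without editing any configuration files by computing the same numbers inside your job script and passing them as hints to a map/reduce job. This is only a sketch: mapred.reduce.tasks is honored directly, mapred.map.tasks is merely a hint (the number of maps is ultimately driven by the input splits), the -D options only work for programs that use Hadoop's GenericOptionsParser (the bundled wordcount example does, as far as I know), and the per-node slot limits themselves still have to be set before the tasktrackers start, which is what the myhadoop.conf-tuning example handles.
$ NODES=$(sort -u $PBS_NODEFILE | wc -l)
$ CORES_PER_NODE=$(grep -cx "$(head -1 $PBS_NODEFILE)" $PBS_NODEFILE)
$ hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar wordcount \
>     -D mapred.map.tasks=$((NODES * CORES_PER_NODE / 2)) \
>     -D mapred.reduce.tasks=$((NODES * CORES_PER_NODE / 4)) \
>     data/pg2701.txt wordcount-output-tuned
Here $PBS_NODEFILE lists one line per allocated core (Torque convention), so counting unique hostnames gives the node count and counting repetitions of the first hostname gives the cores per node.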