-
Notifications
You must be signed in to change notification settings - Fork 11
Part 1
The lesson today is how to write a simple Cascalog app. The goal is clear and concise: create the simplest application possible in Cascalog, while following best practices. No bangs, no whistles, just good solid code.
Here's a brief Cascalog program, about a dozen lines long. It copies lines of text from file "A" to file "B". It uses 1 mapper in Apache Hadoop. No reducer needed. A conceptual diagram for this implementation in Cascading, in which Cascalog is built on, is shown as:
Certainly this same work could be performed in much quicker ways, such as using cp
on Linux. However this Cascalog example is merely a starting point. We'll build on this example, adding new pieces of code to explore features and strengths of Cascalog. We'll keep building until we have a MapReduce implementation of TF-IDF for scoring the relative "importance" of keywords in a set of documents. In other words, Text Mining 101. What you might find when you peek inside Lucene for example, or some other text indexing framework. Moreover, we'll show how to use TDD features of Cascalog, to build robust MapReduce apps for scale.
Download source for this example on GitHub.
First, we create a -main
method with arguments in
and out
to specify the paths of our input and output data.
(defn -main [in out & args]
...
)
First, we create a generator to source the input data. That data happens to be formatted as tab-separated values (TSV) with a header row:
(hfs-delimited in :skip-header? true)
Notice that in
is merely the input path passed from the command line from -main
.
However, we always need to name the vars from a generator.
((hfs-delimited in :skip-header? true) ?doc ?line)
As there are only two fields separated by a tab in the input, we name the first ?doc
and the second ?line
.
Next we create a sink to write the output data, which is also TSV:
(hfs-delimited out)
Then we specify the output fields to complete the fact:
[?doc ?line]
Putting these together in a Cascalog query:
(?<- (hfs-delimited out)
[?doc ?line]
((hfs-delimited in :skip-header? true) ?doc ?line))
The notion of a query lives at the heart of Cascalog. Instead of thinking in terms of mapper and reducer steps in a MapReduce job, we prefer to think about vars and predicates.
Place those lines all into a -main
method, then build a JAR file. You should be good to go.
The build for this example is based on using Leiningen. To build the sample app from the command line use:
lein uberjar
What you should have at this point is a JAR file which is nearly ready to drop into your Maven repo -- almost. Actually, we provide a community jar repository for Cascading libraries and extensions at http://conjars.org
Before running this sample app, you'll need to have a supported release of Apache Hadoop installed. Here's what was used to develop and test our example code:
$ hadoop version
Hadoop 1.0.3
Be sure to set your HADOOP_HOME
environment variable. Then clear the output
directory (Apache Hadoop insists, if you're running in standalone mode) and run the app:
rm -rf output
hadoop jar ./target/impatient.jar data/rain.txt output/rain
Notice how those command line arguments align with args[]
in the source. The file data/rain.txt
gets copied, TSV row by TSV row. Output text gets stored in the partition file output/rain
which you can then verify:
more output/rain/part-00000
Again, here's a log file from our run of the sample app. If your run looks terribly different, something is probably not set up correctly. Drop us a line on the cascalog-user email forum.
That's it in a nutshell, our simplest app possible in Cascalog. Not quite a "Hello World", but more like a "Hi there, bus stop". Or something. Proceed to the next installment of our Cascalog for the Impatient series.