Part 1 Cascalog for the Impatient, part 1

The lesson today is how to write a simple Cascalog app. The goal is clear and concise: create the simplest application possible in Cascalog, while following best practices. No bangs, no whistles, just good solid code.

Here's a brief Cascalog program, about a dozen lines long. It copies lines of text from file "A" to file "B". It uses 1 mapper in Apache Hadoop. No reducer needed. A conceptual diagram for this implementation in Cascading, in which Cascalog is built on, is shown as:

Conceptual Workflow Diagram

Certainly this same work could be performed in much quicker ways, such as using cp on Linux. However this Cascalog example is merely a starting point. We'll build on this example, adding new pieces of code to explore features and strengths of Cascalog. We'll keep building until we have a MapReduce implementation of TF-IDF for scoring the relative "importance" of keywords in a set of documents. In other words, Text Mining 101. What you might find when you peek inside Lucene for example, or some other text indexing framework. Moreover, we'll show how to use TDD features of Cascalog, to build robust MapReduce apps for scale.

Source

Download source for this example on GitHub.

First, we create a -main method with arguments in and out to specify the paths of our input and output data.

(defn -main [in out & args]
  ...
)

First, we create a generator to source the input data. That data happens to be formatted as tab-separated values (TSV) with a header row:

(hfs-delimited in :skip-header? true)

Notice that in is merely the input path passed from the command line from -main.

However, we always need to name the vars from a generator.

((hfs-delimited in :skip-header? true) ?doc ?line)

As there are only two fields separated by a tab in the input, we name the first ?doc and the second ?line.

Next we create a sink to write the output data, which is also TSV:

(hfs-delimited out)

Then we specify the output fields to complete the fact:

[?doc ?line]

Putting these together in a Cascalog query:

(?<- (hfs-delimited out)
     [?doc ?line]
     ((hfs-delimited in :skip-header? true) ?doc ?line))

The notion of a query lives at the heart of Cascalog. Instead of thinking in terms of mapper and reducer steps in a MapReduce job, we prefer to think about vars and predicates.

Place those lines all into a -main method, then build a JAR file. You should be good to go.

Build

The build for this example is based on using Leiningen. To build the sample app from the command line use:

lein uberjar

What you should have at this point is a JAR file which is nearly ready to drop into your Maven repo -- almost. Actually, we provide a community jar repository for Cascading libraries and extensions at http://conjars.org

Run

Before running this sample app, you'll need to have a supported release of Apache Hadoop installed. Here's what was used to develop and test our example code:

$ hadoop version
Hadoop 1.0.3

Be sure to set your HADOOP_HOME environment variable. Then clear the output directory (Apache Hadoop insists, if you're running in standalone mode) and run the app:

rm -rf output
hadoop jar ./target/impatient.jar data/rain.txt output/rain

Notice how those command line arguments align with args[] in the source. The file data/rain.txt gets copied, TSV row by TSV row. Output text gets stored in the partition file output/rain which you can then verify:

more output/rain/part-00000

Again, here's a log file from our run of the sample app. If your run looks terribly different, something is probably not set up correctly. Drop us a line on the cascalog-user email forum.

That's it in a nutshell, our simplest app possible in Cascalog. Not quite a "Hello World", but more like a "Hi there, bus stop". Or something. Proceed to the next installment of our Cascalog for the Impatient series.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Part 1

Cascalog for the Impatient, part 1

Source

Build

Run

Clone this wiki locally