Skip to content

Commit

Permalink
SAMZA-2124: Add Beam API doc to the website (#948)
Browse files Browse the repository at this point in the history
* SAMZA-2124: Add Beam API doc to the website

* Address pr feedback
  • Loading branch information
xinyuiscool authored and shanthoosh committed Mar 12, 2019
1 parent 8dea84c commit 6711a9f
Show file tree
Hide file tree
Showing 8 changed files with 434 additions and 104 deletions.
111 changes: 111 additions & 0 deletions docs/learn/documentation/versioned/api/beam-api.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,111 @@
---
layout: page
title: Apache Beam API
---
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

### Table Of Contents
- [Introduction](#introduction)
- [Basic Concepts](#basic-concepts)
- [Apache Beam - A Samza’s Perspective](#apache-beam---a-samza’s-perspective)

### Introduction

Apache Beam brings an easy-to-usen but powerful API and model for state-of-art stream and batch data processing with portability across a variety of languages. The Beam API and model has the following characteristics:

- *Simple constructs, powerful semantics*: the whole beam API can be simply described by a `Pipeline` object, which captures all your data processing steps from input to output. Beam SDK supports over [20 data IOs](https://beam.apache.org/documentation/io/built-in/), and data transformations from simple [Map](https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/transforms/MapElements.html) to complex [Combines and Joins](https://beam.apache.org/releases/javadoc/2.11.0/index.html?org/apache/beam/sdk/transforms/Combine.html).

- *Strong consistency via event-time*: Beam provides advanced [event-time support](https://beam.apache.org/documentation/programming-guide/#watermarks-and-late-data) so you can perform windowing and aggregations based on when the events happen, instead of when they arrive. The event-time mechanism improves the accuracy of processing results, and guarantees repeatability in results when reprocessing the same data set.

- *Comprehensive stream processing semantics*: Beam supports an up-to-date stream processing model, including [tumbling/sliding/session windows](https://beam.apache.org/documentation/programming-guide/#windowing), joins and aggregations. It provides [triggers](https://beam.apache.org/documentation/programming-guide/#triggers) based on conditions of early and late firings, and late arrival handling with accumulation mode and allowed lateness.

- *Portability with multiple programming languages*: Beam supports a consistent API in multiple languages, including [Java, Python and Go](https://beam.apache.org/roadmap/portability/). This allows you to leverage the rich ecosystem built for different languages, e.g. ML libs for Python.

### Basic Concepts

Let's walk through the WordCount example to illustrate the Beam basic concepts. A Beam program often starts by creating a [Pipeline](https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/Pipeline.html) object in your `main()` function.

{% highlight java %}

// Start by defining the options for the pipeline.
PipelineOptions options = PipelineOptionsFactory.create();

// Then create the pipeline.
Pipeline p = Pipeline.create(options);

{% endhighlight %}

The `PipelineOptions` supported by SamzaRunner is documented in detail [here](https://beam.apache.org/documentation/runners/samza/).

Let's apply the first data transform to read from a text file using [TextIO.read()](https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/io/TextIO.html):

{% highlight java %}

PCollection<String> lines = p.apply(
"ReadLines", TextIO.read().from("/path/to/inputData"));

{% endhighlight %}

To break down each line into words, you can use a [FlatMap](https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/transforms/FlatMapElements.html):

{% highlight java %}

PCollection<String> words = lines.apply(
FlatMapElements.into(TypeDescriptors.strings())
.via((String word) -> Arrays.asList(word.split("\\W+"))));

{% endhighlight %}

Beam provides a build-in transform [Count.perElement](https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/transforms/Count.html) to count the number of elements based on each value. Let's use it here to count the words:

{% highlight java %}

PCollection<KV<String, Long>> counts = pipeline.apply(Count.perElement());

{% endhighlight %}

Finally we format the counts into strings and write to a file using `TextIO.write()`:

{% highlight java %}

counts.apply(ToString.kvs())
.apply(TextIO.write().to("/path/to/output").withoutSharding());

{% endhighlight %}

To run your pipeline and wait for the results, just do:

{% highlight java %}

pipeline.run().waitUntilFinish();

{% endhighlight %}

Or you can run your pipeline asynchronously, e.g. when you submit it to a remote cluster:

{% highlight java %}

pipeline.run();

{% endhighlight %}

To run this Beam program with Samza, you can simply provide "--runner=SamzaRunner" as a program argument. You can follow our [quick start](/startup/quick-start/{{site.version}}/beam.html) to set up your project and run different examples. For more details on writing the Beam program, please refer the [Beam programming guide](https://beam.apache.org/documentation/programming-guide/).

### Apache Beam - A Samza’s Perspective

The goal of Samza is to provide large-scale streaming processing capabilities with first-class state support. This does not contradict with Beam. In fact, while Samza lays out a solid foundation for large-scale stateful stream processing, Beam adds the cutting-edge stream processing API and model on top of it. The Beam API and model allows further optimization in the Samza platform, including multi-stage distributed computation and parallel processing on the per-key basis. The performance enhancements from these optimizations will benefit both Samza and its users. Samza can also further improve Beam model by providing various use cases. We firmly believe Samza will benefit from collaborating with Beam.
Original file line number Diff line number Diff line change
Expand Up @@ -62,10 +62,11 @@ A _stream application_ processes messages from input streams, transforms them an

![diagram-medium](/img/{{site.version}}/learn/documentation/core-concepts/stream-application.png)

Samza offers three top-level APIs to help you build your stream applications: <br/>
Samza offers foure top-level APIs to help you build your stream applications: <br/>
1. The [High Level Streams API](/learn/documentation/{{site.version}}/api/high-level-api.html), which offers several built-in operators like map, filter, etc. This is the recommended API for most use-cases. <br/>
2. The [Low Level Task API](/learn/documentation/{{site.version}}/api/low-level-api.html), which allows greater flexibility to define your processing-logic and offers greater control <br/>
3. [Samza SQL](/learn/documentation/{{site.version}}/api/samza-sql.html), which offers a declarative SQL interface to create your applications <br/>
4. [Apache Beam API](/learn/documentation/{{site.version}}/api/beam-api.html), which offers the full Java API from [Apache beam](https://beam.apache.org/) while Python and Go are work-in-progress.

### State
Samza supports both stateless and stateful stream processing. _Stateless processing_, as the name implies, does not retain any state associated with the current message after it has been processed. A good example of this is filtering an incoming stream of user-records by a field (eg:userId) and writing the filtered messages to their own stream.
Expand Down
2 changes: 1 addition & 1 deletion docs/learn/documentation/versioned/index.html
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,7 @@ <h4>API</h4>
<li><a href="api/table-api.html">Table API</a></li>
<li><a href="api/test-framework.html">Testing Samza</a></li>
<li><a href="api/samza-sql.html">Samza SQL</a></li>
<li><a href="https://beam.apache.org/documentation/runners/samza/">Apache BEAM</a></li>
<li><a href="api/beam-api.html">Apache BEAM</a></li>
</ul>

<h4>DEPLOYMENT</h4>
Expand Down
152 changes: 152 additions & 0 deletions docs/startup/code-examples/versioned/beam.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,152 @@
---
layout: page
title: Beam Code Examples
---
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

The [samza-beam-examples](https://github.com/apache/samza-beam-examples) project contains examples to demonstrate running Beam pipelines with SamzaRunner locally, in Yarn cluster, or in standalone cluster with Zookeeper. More complex pipelines can be built from this project and run in similar manner.

### Example Pipelines
The following examples are included:

1. [`WordCount`](https://github.com/apache/samza-beam-examples/blob/master/src/main/java/org/apache/beam/examples/WordCount.java) reads a file as input (bounded data source), and computes word frequencies.

2. [`KafkaWordCount`](https://github.com/apache/samza-beam-examples/blob/master/src/main/java/org/apache/beam/examples/KafkaWordCount.java) does the same word-count computation but reading from a Kafka stream (unbounded data source). It uses a fixed 10-sec window to aggregate the counts.

### Run the Examples

Each example can be run locally, in Yarn cluster or in standalone cluster. Here we use KafkaWordCount as an example.

#### Set Up
1. Download and install [JDK version 8](https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html). Verify that the JAVA_HOME environment variable is set and points to your JDK installation.

2. Download and install [Apache Maven](http://maven.apache.org/download.cgi) by following Maven’s [installation guide](http://maven.apache.org/install.html) for your specific operating system.

Check out the `samza-beam-examples` repo:

```
$ git clone https://github.com/apache/samza-beam-examples.git
$ cd samza-beam-examples
```

A script named "grid" is included in this project which allows you to easily download and install Zookeeper, Kafka, and Yarn.
You can run the following to bring them all up running in your local machine:

```
$ scripts/grid bootstrap
```

All the downloaded package files will be put under `deploy` folder. Once the grid command completes,
you can verify that Yarn is up and running by going to http://localhost:8088. You can also choose to
bring them up separately, e.g.:

```
$ scripts/grid install zookeeper
$ scripts/grid start zookeeper
```
Now let's create a Kafka topic named "input-text" for this example:

```
$ ./deploy/kafka/bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic input-text --partitions 10 --replication-factor 1
```

#### Run Locally
You can run directly within the project using maven:

```
$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.KafkaWordCount \
-Dexec.args="--runner=SamzaRunner" -P samza-runner
```

#### Packaging Your Application
To execute the example in either Yarn or standalone, you need to package it first.
After packaging, we deploy and explode the tgz in the deploy folder:

```
$ mkdir -p deploy/examples
$ mvn package && tar -xvf target/samza-beam-examples-0.1-dist.tar.gz -C deploy/examples/
```

#### Run in Standalone Cluster with Zookeeper
You can use the `run-beam-standalone.sh` script included in this repo to run an example
in standalone mode. The config file is provided as `config/standalone.properties`. Note by
default we create one single split for the whole input (--maxSourceParallelism=1). To
set each Kafka partition in a split, we can set a large "maxSourceParallelism" value which
is the upper bound of the number of splits.

```
$ deploy/examples/bin/run-beam-standalone.sh org.apache.beam.examples.KafkaWordCount \
--configFilePath=$PWD/deploy/examples/config/standalone.properties --maxSourceParallelism=1024
```

#### Run Yarn Cluster
Similar to running standalone, we can use the `run-beam-yarn.sh` to run the examples
in Yarn cluster. The config file is provided as `config/yarn.properties`. To run the
KafkaWordCount example in yarn:

```
$ deploy/examples/bin/run-beam-yarn.sh org.apache.beam.examples.KafkaWordCount \
--configFilePath=$PWD/deploy/examples/config/yarn.properties --maxSourceParallelism=1024
```

#### Validate the Pipeline Results
Now the pipeline is deployed to either locally, standalone or Yarn. Let's check out the results. First we start a kakfa consumer to listen to the output:

```
$ ./deploy/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic word-count --property print.key=true
```

Then let's publish a few lines to the input Kafka topic:

```
$ ./deploy/kafka/bin/kafka-console-producer.sh --topic input-text --broker-list localhost:9092
Nory was a Catholic because her mother was a Catholic, and Nory’s mother was a Catholic because her father was a Catholic, and her father was a Catholic because his mother was a Catholic, or had been.
```

You should see the word count shows up in the consumer console in about 10 secs:

```
a 6
br 1
mother 3
was 6
Catholic 6
his 1
Nory 2
s 1
father 2
had 1
been 1
and 2
her 3
or 1
because 3
```

### Beyond Examples
Feel free to build more complex pipelines based on the examples above, and reach out to us:

* Subscribe and mail to [[email protected]](mailto:[email protected]) for any Beam questions.

* Subscribe and mail to [[email protected]](mailto:[email protected]) for any Samza questions.

### More Information

* [Apache Beam](http://beam.apache.org)
* [Apache Samza](https://samza.apache.org/)
* Quickstart: [Java](https://beam.apache.org/get-started/quickstart-java), [Python](https://beam.apache.org/get-started/quickstart-py), [Go](https://beam.apache.org/get-started/quickstart-go)
Loading

0 comments on commit 6711a9f

Please sign in to comment.