This project is composed of several samples whose purpose is to download and analyze GDELT Project data using Apache Beam pipelines.
The objectives are:
- Show how to implement Apache Beam pipelines for both streaming and batch analyses.
- Show how to run those pipelines on several runners.
The GDELT Project stores all news articles as "events": http://data.gdeltproject.org/events/index.html
Every day, a zip file is created containing a CSV file with all of that day's events in the following format:
545037848 20150530 201505 2015 2015.4110 JPN TOKYO JPN 1 046 046 04 1 7.0 15 1 15 -1.06163552535792 0 4 Tokyo, Tokyo, Japan JA JA40 35.685 139.751 -246227 4 Tokyo, Tokyo, Japan JA JA40 35.685 139.751 -246227 20160529 http://deadline.com/print-article/1201764227/
The format is described at: http://data.gdeltproject.org/documentation/GDELT-Data_Format_Codebook.pdf
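For reference, here is a minimal sketch of how such a line could be parsed inside a Beam `DoFn`, assuming tab-delimited fields as in the codebook. The column index below is a placeholder, not the actual position; check the codebook for the field you need:

```java
import org.apache.beam.sdk.transforms.DoFn;

/** Extracts one location field from a raw GDELT event line (illustrative sketch). */
class ExtractLocationFn extends DoFn<String, String> {
  // Hypothetical column index; look up the real position of the
  // desired field (e.g. ActionGeo_CountryCode) in the GDELT codebook.
  private static final int LOCATION_COLUMN = 51;

  @ProcessElement
  public void processElement(@Element String line, OutputReceiver<String> out) {
    String[] fields = line.split("\\t");
    if (fields.length > LOCATION_COLUMN && !fields[LOCATION_COLUMN].isEmpty()) {
      out.output(fields[LOCATION_COLUMN]);
    }
  }
}
```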
Building the project is simple, just use:
mvn clean package
We have prepared Maven profiles to execute the pipelines on each supported runner.
You must activate the corresponding profile and choose the appropriate runner:
Direct Runner
mvn exec:java -Dexec.mainClass=org.apache.beam.samples.EventsByLocation -Pdirect-runner -Dexec.args="--runner=DirectRunner --input=/home/dataset/gdelt/2014-2016/201605*.zip --output=/tmp/gdelt/output/"
Spark Runner
mvn exec:java -Dexec.mainClass=org.apache.beam.samples.EventsByLocation -Pspark3-runner -Dexec.args="--runner=SparkRunner --input=/home/dataset/gdelt/2014-2016/201605*.zip --output=/tmp/gdelt/output/"
Flink Runner
mvn exec:java -Dexec.mainClass=org.apache.beam.samples.EventsByLocation -Pflink-runner -Dexec.args="--runner=FlinkRunner --input=/home/dataset/gdelt/2014-2016/201605*.zip --output=/tmp/gdelt/output/"
Google Dataflow Runner
mvn exec:java -Dexec.mainClass=org.apache.beam.samples.EventsByLocation -Pflink-runner -Dexec.args="--runner=DataflowRunner --input=/home/dataset/gdelt/2014-2016/201605*.zip --output=/tmp/gdelt/output/"
Google Dataflow Runner (blocking)
mvn exec:java -Dexec.mainClass=org.apache.beam.samples.EventsByLocation -Pflink-runner -Dexec.args="--runner=BlockingDataflowRunner --input=/home/dataset/gdelt/2014-2016/201605*.zip --output=/tmp/gdelt/output/"
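Under the hood, all of these commands rely on Beam's standard pipeline options mechanism: the `--runner` flag selects the runner at launch time, while the Maven profile puts the matching runner artifact on the classpath. A minimal sketch of this pattern (illustrative, not the sample's actual code):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerSelectionSketch {
  public static void main(String[] args) {
    // Parses flags such as --runner=DirectRunner from the command line.
    // The chosen runner class must be on the classpath, which is what the
    // Maven profiles above take care of.
    PipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline pipeline = Pipeline.create(options);
    // ... apply the sample's transforms here ...
    pipeline.run().waitUntilFinish();
  }
}
```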
Some of the samples require additional infrastructure to be available, e.g. message brokers, filesystems and databases.
We provide a convenient way to make such infrastructure available using docker-compose.
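For example, assuming the docker-compose.yml file lives at the repository root, the required services can be started and stopped with:

```
docker-compose up -d   # start the services in the background
docker-compose down    # stop and remove them when done
```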