Skip to content

Latest commit

 

History

History
249 lines (172 loc) · 11.3 KB

README.md

File metadata and controls

249 lines (172 loc) · 11.3 KB

Apache Flink Training Exercises

Exercises that accompany the training content in the documentation.

Table of Contents

Set up your development environment

  1. Software requirements
  2. Clone and build the flink-training project
  3. Import the flink-training project into your IDE

Use the taxi data streams

  1. Schema of taxi ride events
  2. Schema of taxi fare events

How to do the lab exercises

  1. Learn about the data
  2. Run and debug Flink programs in your IDE
  3. Exercises, tests, and solutions

Lab exercises

Contributing

License

Set up your development environment

You will need to set up your environment in order to develop, debug, and execute solutions to the training exercises and examples.

Software requirements

Flink supports Linux, OS X, and Windows as development environments for Flink programs and local execution. The following software is required for a Flink development setup and should be installed on your system:

  • Git
  • a JDK for Java 8 or Java 11 (a JRE is not sufficient; other versions of Java are currently not supported)
  • an IDE for Java (and/or Scala) development with Gradle support

ℹ️ Note for Windows users: The shell command examples provided in the training instructions are for UNIX systems. You may find it worthwhile to setup cygwin or WSL. For developing Flink jobs, Windows works reasonably well: you can run a Flink cluster on a single machine, submit jobs, run the webUI, and execute jobs in the IDE.

Clone and build the flink-training project

This flink-training repository contains exercises, tests, and reference solutions for the programming exercises.

ℹ️ Repository Layout: This repository has several branches set up pointing to different Apache Flink versions, similarly to the apache/flink repository with:

  • a release branch for each minor version of Apache Flink, e.g. release-1.10, and
  • a master branch that points to the current Flink release (not flink:master!)

If you want to work on a version other than the current Flink release, make sure to check out the appropriate branch.

Clone the flink-training repository from GitHub, navigate into the project repository, and build it:

git clone https://github.com/ververica/flink-training.git ververica-flink-training
cd ververica-flink-training
./gradlew test shadowJar

If this is your first time building it, you will end up downloading all of the dependencies for this Flink training project. This usually takes a few minutes, depending on the speed of your internet connection.

If all of the tests pass and the build is successful, you are off to a good start.

🇨🇳 Users in China: click here for instructions on using a local Maven mirror.

If you are in China, we recommend configuring the Maven repository to use a mirror. You can do this by uncommenting this section in our build.gradle file:

    repositories {
        // for access from China, you may need to uncomment this line
        maven { url 'https://maven.aliyun.com/repository/public/' }
        mavenCentral()
        maven {
            url "https://repository.apache.org/content/repositories/snapshots/"
            mavenContent {
                snapshotsOnly()
            }
        }
    }
Enable Scala (optional)

The exercises in this project are also available in Scala but due to a couple of reported problems from non-Scala users, we decided to disable these by default. You can re-enable all Scala exercises and solutions adapting the gradle.properties file like this:

#...

# Scala exercises can be enabled by setting this to true
org.gradle.project.enable_scala = true

You can also selectively apply this plugin in a single subproject if desired.

Import the flink-training project into your IDE

The project needs to be imported as a gradle project into your IDE.

Then you should be able to open RideCleansingTest and run this test.

ℹ️ Note for Scala users: You will need to use IntelliJ with the JetBrains Scala plugin, and you will need to add a Scala 2.12 SDK to the Global Libraries section of the Project Structure as well as to the module you are working on. IntelliJ will ask you for the latter when you open a Scala file. Please note that Scala 2.12.8 and above are not supported (see [Flink Scala Versions](https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/project-configuration/#scala-versions for details))!

Use the taxi data streams

These exercises use data generators that produce simulated event streams. The data is inspired by the New York City Taxi & Limousine Commission's public data set about taxi rides in New York City.

Schema of taxi ride events

Our taxi data set contains information about individual taxi rides in New York City.

Each ride is represented by two events: a trip start, and a trip end.

Each event consists of eleven fields:

rideId         : Long      // a unique id for each ride
taxiId         : Long      // a unique id for each taxi
driverId       : Long      // a unique id for each driver
isStart        : Boolean   // TRUE for ride start events, FALSE for ride end events
eventTime      : Instant   // the timestamp for this event
startLon       : Float     // the longitude of the ride start location
startLat       : Float     // the latitude of the ride start location
endLon         : Float     // the longitude of the ride end location
endLat         : Float     // the latitude of the ride end location
passengerCnt   : Short     // number of passengers on the ride

Schema of taxi fare events

There is also a related data set containing fare data about those same rides, with the following fields:

rideId         : Long      // a unique id for each ride
taxiId         : Long      // a unique id for each taxi
driverId       : Long      // a unique id for each driver
startTime      : Instant   // the start time for this ride
paymentType    : String    // CASH or CARD
tip            : Float     // tip for this ride
tolls          : Float     // tolls for this ride
totalFare      : Float     // total fare collected

How to do the lab exercises

In the hands-on sessions, you will implement Flink programs using various Flink APIs.

The following steps guide you through the process of using the provided data streams, implementing your first Flink streaming program, and executing your program in your IDE.

We assume you have set up your development environment according to our setup guide.

Learn about the data

The initial set of exercises are all based on data streams of events about taxi rides and taxi fares. These streams are produced by source functions which reads data from input files. Read the instructions to learn how to use them.

Run and debug Flink programs in your IDE

Flink programs can be executed and debugged from within an IDE. This significantly eases the development process and provides an experience similar to working on any other Java (or Scala) application.

To start a Flink program in your IDE, run its main() method. Under the hood, the execution environment will start a local Flink instance within the same process. Hence, it is also possible to put breakpoints in your code and debug it.

If you have an IDE with this flink-training project imported, you can run (or debug) a streaming job by:

  • opening the org.apache.flink.training.examples.ridecount.RideCountExample class in your IDE
  • running (or debugging) the main() method of the RideCountExample class using your IDE

Exercises, tests, and solutions

Each of these exercises include:

  • an ...Exercise class with most of the necessary boilerplate code for getting started
  • a JUnit Test class (...Test) with a few tests for your implementation
  • a ...Solution class with a complete solution

There are Java and Scala versions of all the exercise, test, and solution classes. They can each be run from IntelliJ.

ℹ️ Note: As long as your ...Exercise class is throwing a MissingSolutionException, the provided JUnit test classes will ignore that failure and verify the correctness of the solution implementation instead.

You can run exercises, solutions, and tests with the gradlew command.

To run tests:

./gradlew test
./gradlew :<subproject>:test

For Java/Scala exercises and solutions, we provide special tasks that can be listed with:

./gradlew printRunTasks

👇 Now you are ready to begin the lab exercises. 👇

Lab exercises

  1. Filtering a Stream (Ride Cleansing)
  2. Stateful Enrichment (Rides and Fares)
  3. Windowed Analytics (Hourly Tips)
  4. ProcessFunction and Timers (Long Ride Alerts)
  5. Tuning & Troubleshooting Labs

Contribute

If you would like to contribute to this repository or add new exercises, please read the contributing guide.

License

The code in this repository is licensed under the Apache Software License 2.