Scalaflow

Scalaflow is a Scala DSL for Google's Cloud Dataflow's Java SDK.

Dataflow is a cool service but it's even cooler if we can write data pipelines and distributed data transformations in Scala with a fluent style:

object ScalaStyleWordCount extends App {
  // pipeline definition
  val options = PipelineOptionsFactory
    .fromArgs(args)
    .withValidation()
    .as(classOf[WordCountOptions])
  val job = new DataflowJob(options)

  // input
  val input = job.text(options.getInput())

  // transformations
  val words = input.flatMap(line => line.split("[^a-zA-Z']+"))
  val wordCounts = words.applyTransform(Count.perElement())
  val results = wordCounts.map(
    count => count.getKey + "\t" + count.getValue.toString)

  // output
  results.persist(options.getOutput(), Some("writeCounts"))

  job.run
}

The Java version of this code can be found here. Compare them and make a judge yourself!

Features

a DList abstraction similar to Spark's RDD
a DataflowJob abstraction that wraps the Pipeline in Java SDK
a Scala side CoderRegistry for data type serialization and desrialization

TODOs

Integrate the KV data operations (groupByKey, Combine, Join etc.)
More tests on CoderRegistry (port Java SDK's test)
Use Kryo for user class ser/deser

Feel free to share your ideas with me on this project: ju.han.felix at gmail.com or @juhanlol on twitter.

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
project		project
src		src
.gitignore		.gitignore
README.md		README.md
logging.properties		logging.properties

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Scalaflow

Features

TODOs

About

Releases

Packages

Contributors 2

Languages

darkjh/scalaflow

Folders and files

Latest commit

History

Repository files navigation

Scalaflow

Features

TODOs

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages