# Spark
| | |
| --- | --- |
| Website | http://spark.apache.org/ |
| Supported versions | 2.3.1 for Hadoop 2.7+ with OpenJDK 8<br>2.3.0 for Hadoop 2.7+ with OpenJDK 8<br>2.2.1 for Hadoop 2.7+ with OpenJDK 8<br>2.2.0 for Hadoop 2.7+ with OpenJDK 8<br>2.1.1 for Hadoop 2.7+ with OpenJDK 8<br>2.1.0 for Hadoop 2.7+ with OpenJDK 8<br>2.0.2 for Hadoop 2.7+ with OpenJDK 8<br>2.0.1 for Hadoop 2.7+ with OpenJDK 8<br>2.0.0 for Hadoop 2.7+ with Hive support and OpenJDK 8<br>2.0.0 for Hadoop 2.7+ with Hive support and OpenJDK 7<br>1.6.2 for Hadoop 2.6<br>1.5.1 for Hadoop 2.6 |
| Current responsible(s) | Erika Pauwels @ TenForce -- [email protected]<br>Aad Versteden @ TenForce -- [email protected]<br>Gezim Sejdiu @ UBO -- [email protected]<br>Ivan Ermilov @ InfAI -- [email protected] |
| Docker image(s) | bde2020/spark-master:latest<br>bde2020/spark-worker:latest<br>bde2020/spark-java-template:latest<br>bde2020/spark-python-template:latest |
| More info | http://spark.apache.org/docs/latest/programming-guide.html |
Apache Spark is an in-memory data processing engine. It provides APIs in Java, Python and Scala that simplify distributed programming by introducing the abstraction of Resilient Distributed Datasets (RDDs): logical collections of data partitioned across machines. Applications manipulate RDDs much as they would manipulate local collections of data.
On top of its core, Apache Spark provides four libraries:
- Spark SQL - Library that lets Spark work with (semi-)structured data by providing the SchemaRDD/DataFrame abstraction on top of Spark core. The library also provides SQL support and a domain-specific language to manipulate SchemaRDDs/DataFrames.
- Spark Streaming - Library that adds stream processing to Spark core. Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications by ingesting data in mini-batches. Moreover, application code written for batch processing can be reused for stream processing.
- MLlib - Library that provides a machine learning framework on top of Spark core.
- GraphX - Library that provides a distributed graph processing framework on top of Spark core. GraphX comes with a variety of graph algorithms, unifying ETL, exploratory analysis and iterative graph computation within a single system.
Building and running your Spark application on top of the Spark cluster is as simple as extending a template Docker image. Check the template's README for further documentation.
- Java template
- Scala template (will be added soon)
- Python template
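As a sketch of what extending a template looks like, a Java application image could be built from the Java template roughly as follows. The tag and environment variable names reflect typical usage of these images and should be checked against the template's README; the main class and arguments are hypothetical:

```Dockerfile
FROM bde2020/spark-java-template:2.3.1-hadoop2.7

# Fully qualified main class of your application (hypothetical example)
ENV SPARK_APPLICATION_JAVA_CLASS com.example.MyApp
# Arguments passed to the application on spark-submit
ENV SPARK_APPLICATION_ARGS "input.csv output"
```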
The repository big-data-europe/demo-spark-sensor-data contains a demo application in Java which also integrates with HDFS.
RDDs are fault-tolerant collections of elements that can be operated on in parallel. As a consequence, Spark applications scale out transparently as Spark worker nodes are added to the cluster.
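For instance, with the Docker images above and a Compose file that defines a `spark-worker` service (an assumed service name), extra workers can be started without touching application code:

```shell
# Scale the (hypothetical) spark-worker service to three containers;
# each new worker registers with the master and starts picking up tasks.
docker-compose up -d --scale spark-worker=3
```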