# Spark
| | |
| --- | --- |
| Website | http://spark.apache.org/ |
| Supported versions | 2.3.1 for Hadoop 2.7+ with OpenJDK 8<br>2.3.0 for Hadoop 2.7+ with OpenJDK 8<br>2.2.1 for Hadoop 2.7+ with OpenJDK 8<br>2.2.0 for Hadoop 2.7+ with OpenJDK 8<br>2.1.1 for Hadoop 2.7+ with OpenJDK 8<br>2.1.0 for Hadoop 2.7+ with OpenJDK 8<br>2.0.2 for Hadoop 2.7+ with OpenJDK 8<br>2.0.1 for Hadoop 2.7+ with OpenJDK 8<br>2.0.0 for Hadoop 2.7+ with Hive support and OpenJDK 8<br>2.0.0 for Hadoop 2.7+ with Hive support and OpenJDK 7<br>1.6.2 for Hadoop 2.6<br>1.5.1 for Hadoop 2.6 |
| Current responsible(s) | Erika Pauwels @ TenForce -- [email protected]<br>Aad Versteden @ TenForce -- [email protected]<br>Gezim Sejdiu @ UBO -- [email protected]<br>Ivan Ermilov @ InfAI -- [email protected] |
| Docker image(s) | bde2020/spark-master:latest<br>bde2020/spark-worker:latest<br>bde2020/spark-java-template:latest<br>bde2020/spark-python-template:latest |
| More info | http://spark.apache.org/docs/latest/programming-guide.html |
Apache Spark is an in-memory data processing engine. It provides APIs in Java, Python and Scala that simplify distributed programming by introducing the abstraction of Resilient Distributed Datasets (RDDs): logical collections of data partitioned across machines. Applications manipulate RDDs much as they would manipulate local collections of data.
On top of its core, Apache Spark provides four libraries:
- Spark SQL - Library that lets Spark work with (semi-)structured data by providing the SchemaRDD/DataFrame abstraction on top of Spark core. The library also provides SQL support and a domain-specific language to manipulate SchemaRDDs/DataFrames.
- Spark Streaming - Library that adds stream processing to Spark core. Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications by ingesting data in mini-batches. Moreover, application code written for batch processing can be reused for stream processing.
- MLlib - Library that provides a machine learning framework on top of Spark core.
- GraphX - Library that provides a distributed graph processing framework on top of Spark core. GraphX comes with a variety of graph algorithms, unifying ETL, exploratory analysis and iterative graph computation within a single system.
Building and running your Spark application on top of the Spark cluster is as simple as extending a template Docker image. Check the template's README for further documentation.
- Java template
- Scala template (will be added soon)
- Python template
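As a sketch of what extending a template looks like, a Java application image could be built from the Java template roughly as follows. The tag and environment variable names reflect typical usage of these images and should be checked against the template's README; the main class and arguments are hypothetical:

```Dockerfile
FROM bde2020/spark-java-template:2.3.1-hadoop2.7

# Fully qualified main class of your application (hypothetical example)
ENV SPARK_APPLICATION_JAVA_CLASS com.example.MyApp
# Arguments passed to the application on spark-submit
ENV SPARK_APPLICATION_ARGS "input.csv output"
```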
The repository big-data-europe/demo-spark-sensor-data contains a demo application in Java which also integrates with HDFS.
RDDs are fault-tolerant collections of elements that can be operated on in parallel. As a consequence, Spark applications scale out transparently as Spark worker nodes are added to the cluster.
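For instance, with the Docker images above and a Compose file that defines a `spark-worker` service (an assumed service name), extra workers can be started without touching application code:

```shell
# Scale the (hypothetical) spark-worker service to three containers;
# each new worker registers with the master and starts picking up tasks.
docker-compose up -d --scale spark-worker=3
```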