Gradoop is an open-source (ALv2) research framework for scalable graph analytics built on top of Apache Flink™. It offers a graph data model that extends the widespread property graph model with the concept of logical graphs, and it provides operators that can be applied to single logical graphs and to collections of logical graphs. Combining these operators allows a flexible, declarative definition of graph analytical workflows. Gradoop can be easily integrated into a workflow that already uses Flink™ operators and Flink™ libraries (e.g. Gelly, ML and Table).
Gradoop is a work in progress, which means APIs may change. It is currently a proof-of-concept implementation and is far from production-ready.
- Cypher-based Graph Pattern Matching in Apache Flink, FlinkForward, September 2017
- Cypher-based Graph Pattern Matching in GRADOOP, SIGMOD GRADES Workshop, May 2017
- DIMSpan - Transactional Frequent Subgraph Mining with Distributed In-Memory Dataflow Systems, arXiv, March 2017
- Distributed Grouping of Property Graphs with GRADOOP, BTW Conf., March 2017
- Graph Mining for Complex Data Analytics, ICDM Demo, December 2016
- [german] Graph Mining für Business Intelligence, data2day, October 2016
- [german] Verteilte Graphanalyse mit Gradoop, JavaSPEKTRUM, October 2016
- Extended Property Graphs with Apache Flink, SIGMOD NDA Workshop, June 2016
- Gradoop @Flink/Neo4j Meetup Berlin, March 2016
- Gradoop @FOSDEM GraphDevroom, January 2016
- Gradoop @FlinkForward, September 2015 (YouTube)
In the extended property graph model (EPGM), a database consists of multiple property graphs which are called logical graphs. These graphs describe application-specific subsets of vertices and edges, i.e. a vertex or an edge can be contained in multiple logical graphs. Additionally, not only vertices and edges but also logical graphs have a type label and can have different properties.
Data model elements (logical graphs, vertices and edges) have a unique identifier, a single label (e.g. User) and a number of key-value properties (e.g. name = Alice). No schema is enforced: each element can have an arbitrary number of properties, and two elements with the same label may carry different property keys.
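The element structure described above can be sketched in plain Java. This is only an illustration of the model, not the Gradoop API: the class names `Element` and `Vertex`, the helper methods `with` and `in`, and the sample labels and property values are all assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Plain-Java illustration of EPGM elements; hypothetical names, not the Gradoop API. */
class EpgmModelSketch {

  /** Shape shared by logical graphs, vertices and edges: id, one label, schema-free properties. */
  static class Element {
    final long id;
    final String label;
    final Map<String, Object> properties = new LinkedHashMap<>();

    Element(long id, String label) {
      this.id = id;
      this.label = label;
    }

    /** Adds a key-value property; any key is allowed because there is no schema. */
    Element with(String key, Object value) {
      properties.put(key, value);
      return this;
    }
  }

  /** A vertex also records the ids of all logical graphs that contain it. */
  static class Vertex extends Element {
    final Set<Long> graphIds = new LinkedHashSet<>();

    Vertex(long id, String label) {
      super(id, label);
    }

    Vertex in(long graphId) {
      graphIds.add(graphId);
      return this;
    }
  }

  public static void main(String[] args) {
    // Logical graphs carry a label and properties, just like vertices and edges.
    Element community = new Element(100L, "Community").with("topic", "Graphs");
    Element forum = new Element(101L, "Forum");

    // Alice is contained in two logical graphs, Bob in one.
    Vertex alice = new Vertex(1L, "User").in(community.id).in(forum.id);
    alice.with("name", "Alice");
    Vertex bob = new Vertex(2L, "User").in(community.id);
    // Same label as Alice, but a different set of property keys.
    bob.with("name", "Bob").with("age", 42);

    System.out.println(alice.label + " " + alice.properties + " in graphs " + alice.graphIds);
    System.out.println(bob.label + " " + bob.properties + " in graphs " + bob.graphIds);
  }
}
```

Storing graph membership on the element is one possible design choice for this sketch; it makes "a vertex can be contained in multiple logical graphs" directly visible.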
The EPGM provides operators for both single logical graphs and collections of logical graphs; operators may return single graphs or graph collections. The following tables contain an overview (GC = Graph Collection, G = Logical Graph).

Unary logical graph operators (one graph as input):

Operator | Output | Output description | Impl |
---|---|---|---|
Aggregation | G | Graph with result of an aggregate function as a new property | Yes |
Matching | GC | Graphs that match a given graph pattern | Yes |
Transformation | G | Graph with transformed (graph, vertex, edge) data | Yes |
Grouping | G | Structurally condensed version of the input graph | Yes |
Subgraph | G | Subgraph that fulfils given vertex and edge predicates | Yes |
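
As a rough illustration of two of the operators listed above, the following plain-Java sketch implements Subgraph and Aggregation on a small in-memory graph. It does not use the Gradoop API; the classes, the method names `subgraph` and `aggregateVertexCount`, the property key `vertexCount`, and the decision to drop edges whose endpoints were filtered out are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

/** In-memory sketch of two unary graph operators; hypothetical names, not the Gradoop API. */
class UnaryOperatorSketch {

  static class Vertex {
    final long id;
    final String label;

    Vertex(long id, String label) {
      this.id = id;
      this.label = label;
    }
  }

  static class Edge {
    final long source;
    final long target;
    final String label;

    Edge(long source, long target, String label) {
      this.source = source;
      this.target = target;
      this.label = label;
    }
  }

  static class Graph {
    final Map<String, Object> properties = new LinkedHashMap<>();
    final List<Vertex> vertices = new ArrayList<>();
    final List<Edge> edges = new ArrayList<>();
  }

  /** Subgraph: keeps vertices and edges that fulfil the predicates (assumption: dangling edges are dropped). */
  static Graph subgraph(Graph g, Predicate<Vertex> vertexFilter, Predicate<Edge> edgeFilter) {
    Graph result = new Graph();
    Set<Long> keptIds = new HashSet<>();
    for (Vertex v : g.vertices) {
      if (vertexFilter.test(v)) {
        result.vertices.add(v);
        keptIds.add(v.id);
      }
    }
    for (Edge e : g.edges) {
      if (edgeFilter.test(e) && keptIds.contains(e.source) && keptIds.contains(e.target)) {
        result.edges.add(e);
      }
    }
    return result;
  }

  /** Aggregation: returns a copy of the graph with the vertex count stored as a new graph property. */
  static Graph aggregateVertexCount(Graph g) {
    Graph result = new Graph();
    result.properties.putAll(g.properties);
    result.vertices.addAll(g.vertices);
    result.edges.addAll(g.edges);
    result.properties.put("vertexCount", (long) g.vertices.size());
    return result;
  }

  public static void main(String[] args) {
    Graph g = new Graph();
    g.vertices.add(new Vertex(1L, "User"));
    g.vertices.add(new Vertex(2L, "User"));
    g.vertices.add(new Vertex(3L, "Forum"));
    g.edges.add(new Edge(1L, 2L, "knows"));
    g.edges.add(new Edge(1L, 3L, "memberOf"));

    // Compose the operators: restrict to users, then count the remaining vertices.
    Graph users = subgraph(g, v -> v.label.equals("User"), e -> true);
    Graph counted = aggregateVertexCount(users);
    System.out.println(users.edges.size() + " edge(s), properties " + counted.properties);
  }
}
```

Both operators return a new graph instead of modifying their input, which is what makes them composable into a workflow.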

Binary logical graph operators (two graphs as input):

Operator | Output | Output description | Impl |
---|---|---|---|
Combination | G | Graph with vertices and edges from both input graphs | Yes |
Overlap | G | Graph with vertices and edges that exist in both input graphs | Yes |
Exclusion | G | Graph with vertices and edges that exist only in the first graph | Yes |
Equality | {true, false} | Compare graphs in terms of identity or equality of contained elements | Yes |
VertexFusion | G | The second graph is fused to a single vertex within the first graph | Yes |
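
Combination, Overlap and Exclusion behave like set operations on the vertices and edges of two graphs. The sketch below shows this on element ids in plain Java. It is not the Gradoop API; the `Graph` class, its builder methods and the rule that Exclusion drops edges that lose an endpoint are assumptions made for this example.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Set-based sketch of binary graph operators; hypothetical names, not the Gradoop API. */
class BinaryOperatorSketch {

  static class Graph {
    final Set<Long> vertexIds = new TreeSet<>();
    /** Maps an edge id to {source vertex id, target vertex id}. */
    final Map<Long, long[]> edges = new TreeMap<>();

    Graph vertex(long id) {
      vertexIds.add(id);
      return this;
    }

    Graph edge(long id, long source, long target) {
      edges.put(id, new long[] {source, target});
      return this;
    }
  }

  /** Combination: vertices and edges from both input graphs. */
  static Graph combination(Graph a, Graph b) {
    Graph result = new Graph();
    result.vertexIds.addAll(a.vertexIds);
    result.vertexIds.addAll(b.vertexIds);
    result.edges.putAll(a.edges);
    result.edges.putAll(b.edges);
    return result;
  }

  /** Overlap: vertices and edges that exist in both input graphs. */
  static Graph overlap(Graph a, Graph b) {
    Graph result = new Graph();
    result.vertexIds.addAll(a.vertexIds);
    result.vertexIds.retainAll(b.vertexIds);
    result.edges.putAll(a.edges);
    result.edges.keySet().retainAll(b.edges.keySet());
    return result;
  }

  /** Exclusion: vertices only in the first graph (assumption: edges that lose an endpoint are dropped). */
  static Graph exclusion(Graph a, Graph b) {
    Graph result = new Graph();
    result.vertexIds.addAll(a.vertexIds);
    result.vertexIds.removeAll(b.vertexIds);
    for (Map.Entry<Long, long[]> entry : a.edges.entrySet()) {
      long[] endpoints = entry.getValue();
      if (result.vertexIds.contains(endpoints[0]) && result.vertexIds.contains(endpoints[1])) {
        result.edges.put(entry.getKey(), endpoints);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Graph a = new Graph().vertex(1).vertex(2).vertex(3).edge(10, 1, 2).edge(11, 2, 3);
    Graph b = new Graph().vertex(2).vertex(3).vertex(4).edge(11, 2, 3).edge(12, 3, 4);

    System.out.println("combination: " + combination(a, b).vertexIds);
    System.out.println("overlap:     " + overlap(a, b).vertexIds);
    System.out.println("exclusion:   " + exclusion(a, b).vertexIds);
  }
}
```

Elements are compared by id here, so a vertex shared by two logical graphs counts as the same vertex in both.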

Unary graph collection operators (one collection as input):

Operator | Output | Output description | Impl |
---|---|---|---|
Matching | GC | Graphs that match a given graph pattern | Yes |
Selection | GC | Filter graphs based on their attached data (i.e. label, properties) | Yes |
Distinct | GC | Collection with no duplicate graphs | Yes |
SortBy | GC | Collection sorted by values of a given property key | No |
Limit | GC | The first n arbitrary elements of the input collection | Yes |

Binary graph collection operators (two collections as input):

Operator | Output | Output description | Impl |
---|---|---|---|
Union | GC | All graphs from both input collections | Yes |
Intersection | GC | Only graphs that exist in both collections | Yes |
Difference | GC | Graphs that exist only in the first collection | Yes |
Equality | {true, false} | Compare collections in terms of identity or equality of contained elements | Yes |
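
The collection operators can be illustrated in the same spirit. The plain-Java sketch below models a graph collection as a list of graph heads and identifies graphs by their id. It is not the Gradoop API; the `GraphHead` class, the method names and the assumption that Union removes duplicates by id are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** List-based sketch of graph collection operators; hypothetical names, not the Gradoop API. */
class CollectionOperatorSketch {

  static class GraphHead {
    final long id;
    final String label;

    GraphHead(long id, String label) {
      this.id = id;
      this.label = label;
    }

    @Override
    public String toString() {
      return label + "#" + id;
    }
  }

  /** Selection: keeps graphs whose attached data (label, properties) fulfils the predicate. */
  static List<GraphHead> select(List<GraphHead> collection, Predicate<GraphHead> filter) {
    List<GraphHead> result = new ArrayList<>();
    for (GraphHead g : collection) {
      if (filter.test(g)) {
        result.add(g);
      }
    }
    return result;
  }

  /** Distinct: removes duplicates, comparing graphs by id and keeping the first occurrence. */
  static List<GraphHead> distinct(List<GraphHead> collection) {
    Map<Long, GraphHead> byId = new LinkedHashMap<>();
    for (GraphHead g : collection) {
      byId.putIfAbsent(g.id, g);
    }
    return new ArrayList<>(byId.values());
  }

  static boolean containsId(List<GraphHead> collection, long id) {
    for (GraphHead g : collection) {
      if (g.id == id) {
        return true;
      }
    }
    return false;
  }

  /** Union: all graphs from both collections (assumption: duplicates by id are removed). */
  static List<GraphHead> union(List<GraphHead> a, List<GraphHead> b) {
    List<GraphHead> all = new ArrayList<>(a);
    all.addAll(b);
    return distinct(all);
  }

  /** Intersection: graphs that exist in both collections. */
  static List<GraphHead> intersection(List<GraphHead> a, List<GraphHead> b) {
    return select(distinct(a), g -> containsId(b, g.id));
  }

  /** Difference: graphs that exist only in the first collection. */
  static List<GraphHead> difference(List<GraphHead> a, List<GraphHead> b) {
    return select(distinct(a), g -> !containsId(b, g.id));
  }

  public static void main(String[] args) {
    GraphHead g1 = new GraphHead(1L, "Community");
    GraphHead g2 = new GraphHead(2L, "Community");
    GraphHead g3 = new GraphHead(3L, "Forum");
    List<GraphHead> a = Arrays.asList(g1, g2);
    List<GraphHead> b = Arrays.asList(g2, g3);

    System.out.println("union:        " + union(a, b));
    System.out.println("intersection: " + intersection(a, b));
    System.out.println("difference:   " + difference(a, b));
    System.out.println("selection:    " + select(union(a, b), g -> g.label.equals("Forum")));
  }
}
```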

Auxiliary operators:

Operator | In | Out | Output description | Impl |
---|---|---|---|---|
Apply | GC | GC | Applies unary operator (e.g. aggregate) on each graph in the collection | Yes |
Reduce | GC | G | Reduces collection to single graph using binary operator (e.g. combine) | Yes |
Call | GC/G | GC/G | Applies external algorithm on graph or graph collection | Yes |
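
Apply and Reduce lift graph operators to collections. A minimal plain-Java sketch, assuming a graph is reduced to a set of vertex ids: `apply` runs a unary operator on every graph, and `reduce` folds the collection into one graph with a binary operator. The names `graphOf`, `combine`, `apply` and `reduce` are hypothetical and are not the Gradoop API; Call is not covered here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.BinaryOperator;
import java.util.function.UnaryOperator;

/** Sketch of the auxiliary operators Apply and Reduce; hypothetical names, not the Gradoop API. */
class AuxiliaryOperatorSketch {

  /** Builds a toy graph that consists only of vertex ids. */
  static Set<Long> graphOf(Long... vertexIds) {
    return new TreeSet<>(Arrays.asList(vertexIds));
  }

  /** Toy binary operator: combination of two graphs as the union of their vertex ids. */
  static Set<Long> combine(Set<Long> a, Set<Long> b) {
    Set<Long> result = new TreeSet<>(a);
    result.addAll(b);
    return result;
  }

  /** Apply: runs a unary graph operator on each graph and returns a new collection. */
  static <G> List<G> apply(List<G> collection, UnaryOperator<G> operator) {
    List<G> result = new ArrayList<>();
    for (G graph : collection) {
      result.add(operator.apply(graph));
    }
    return result;
  }

  /** Reduce: folds the collection into a single graph; empty for an empty collection. */
  static <G> Optional<G> reduce(List<G> collection, BinaryOperator<G> operator) {
    return collection.stream().reduce(operator);
  }

  public static void main(String[] args) {
    List<Set<Long>> collection = Arrays.asList(graphOf(1L, 2L), graphOf(2L, 3L), graphOf(4L));

    // Apply a toy unary operator that removes vertex 2 from every graph.
    List<Set<Long>> trimmed = apply(collection, g -> {
      Set<Long> copy = new TreeSet<>(g);
      copy.remove(2L);
      return copy;
    });

    // Reduce the whole collection to one graph by combining pairwise.
    Optional<Set<Long>> merged = reduce(collection, AuxiliaryOperatorSketch::combine);

    System.out.println("apply:  " + trimmed);
    System.out.println("reduce: " + merged.orElse(graphOf()));
  }
}
```

Keeping `apply` and `reduce` generic over the graph type mirrors the idea that any unary or binary graph operator can be plugged in.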
- Add one of the following dependencies to your Maven project
Stable:

    <dependency>
      <groupId>org.gradoop</groupId>
      <artifactId>gradoop-flink</artifactId>
      <version>0.3.2</version>
    </dependency>

Latest nightly build (additional repository is required):

    <repositories>
      <repository>
        <id>oss.sonatype.org-snapshot</id>
        <url>http://oss.sonatype.org/content/repositories/snapshots</url>
        <releases><enabled>false</enabled></releases>
        <snapshots><enabled>true</enabled></snapshots>
      </repository>
    </repositories>

    <dependency>
      <groupId>org.gradoop</groupId>
      <artifactId>gradoop-flink</artifactId>
      <version>0.3.3-SNAPSHOT</version>
    </dependency>

- Gradoop requires Java 8
- Clone Gradoop into your local file system
- Build and execute tests

    cd gradoop
    mvn clean install

The gradoop-common module mainly contains the EPGM data model and a corresponding POJO implementation which is used in Flink™. The persistent representation of the EPGM, together with its mapping to HBase™, is also contained in gradoop-common.
Input and output formats for reading graph collections from and writing them to Apache HBase™.
This module contains reference implementations of the EPGM operators. The EPGM is mapped to Flink™ DataSets while the operators are implemented using DataSet transformations. The module also contains implementations of general graph algorithms (e.g. Label Propagation, Frequent Subgraph Mining) adapted to be used with the EPGM model.
Contains example pipelines showing use cases for Gradoop.
- Graph grouping example (build structural aggregates of property graphs)
- Social network examples (composition of multiple operators to analyze social network graphs)
- Input/Output examples (usage of DataSource and DataSink implementations)
- Benchmarks used for cluster evaluations
Used to maintain the code style for the whole project.
- 0.0.1 first prototype using Hadoop MapReduce and Apache Giraph for operator processing
- 0.0.2 support for HBase as distributed graph storage
- 0.0.3 Apache Flink replaces MapReduce and Giraph as operator implementation layer and distributed execution engine
- 0.1 Major refactoring of internal EPGM representation (e.g. ID and property handling), Equality Operators, GDL-based unit testing
- 0.2.0 Pattern Matching and Frequent Subgraph Mining algorithms
- 0.3.1 Bug fixes and support for more Gelly algorithms
Apache®, Apache Flink™, Flink™, Apache HBase™ and HBase™ are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.