GitHub - sgolecha/gradoop: Distributed Graph Analytics with Apache Flink

Gradoop: Distributed Graph Analytics on Hadoop

Gradoop is an open source (ALv2) research framework for scalable graph analytics built on top of Apache Flink™. It offers a graph data model which extends the widespread property graph model by the concept of logical graphs, and it provides operators that can be applied to single logical graphs and to collections of logical graphs. Combining these operators allows the flexible, declarative definition of graph analytical workflows. Gradoop can be easily integrated into a workflow that already uses Flink™ operators and Flink™ libraries (e.g. Gelly, ML and Table).

Gradoop is a work in progress, which means APIs may change. It is currently used as a proof-of-concept implementation and is far from production-ready.

Further Information (articles and talks)

Data Model

In the extended property graph model (EPGM), a database consists of multiple property graphs which are called logical graphs. These graphs describe application-specific subsets of vertices and edges, i.e. a vertex or an edge can be contained in multiple logical graphs. Additionally, not only vertices and edges but also logical graphs have a type label and can have different properties.

Data model elements (logical graphs, vertices and edges) have a unique identifier, a single label (e.g. User) and a number of key-value properties (e.g. name = Alice). There is no schema involved, meaning elements can have an arbitrary number of properties even when they share the same label.
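To make the data model more concrete, the following is a minimal plain-Java sketch of the element structure described above. The class and field names (`EpgmSketch`, `Element`, `GraphHead`, `Vertex`, `Edge`, `graphIds`) are chosen for illustration only and are not meant to mirror Gradoop's actual classes; the sketch uses only the Java standard library.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative model only; these types are assumptions, not Gradoop's API.
public class EpgmSketch {

  /** Shape shared by every element: unique id, single label, schema-free properties. */
  static class Element {
    final long id;
    final String label;
    final Map<String, Object> properties = new HashMap<>();

    Element(long id, String label) {
      this.id = id;
      this.label = label;
    }
  }

  /** A logical graph is itself an element, so it has a label and properties. */
  static class GraphHead extends Element {
    GraphHead(long id, String label) {
      super(id, label);
    }
  }

  /** A vertex records the ids of all logical graphs that contain it. */
  static class Vertex extends Element {
    final Set<Long> graphIds = new HashSet<>();

    Vertex(long id, String label) {
      super(id, label);
    }
  }

  /** An edge connects two vertices and can also belong to several logical graphs. */
  static class Edge extends Element {
    final long sourceId;
    final long targetId;
    final Set<Long> graphIds = new HashSet<>();

    Edge(long id, String label, long sourceId, long targetId) {
      super(id, label);
      this.sourceId = sourceId;
      this.targetId = targetId;
    }
  }

  public static void main(String[] args) {
    GraphHead community = new GraphHead(100L, "Community");
    community.properties.put("interest", "Graphs");
    GraphHead forum = new GraphHead(101L, "Forum");

    Vertex alice = new Vertex(1L, "User");
    alice.properties.put("name", "Alice");
    Vertex bob = new Vertex(2L, "User"); // same label, different set of properties
    bob.properties.put("name", "Bob");
    bob.properties.put("age", 42);

    // Alice is contained in both logical graphs, Bob in only one.
    alice.graphIds.add(community.id);
    alice.graphIds.add(forum.id);
    bob.graphIds.add(community.id);

    Edge knows = new Edge(10L, "knows", alice.id, bob.id);
    knows.graphIds.add(community.id);

    System.out.println(alice.label + " is in " + alice.graphIds.size() + " graphs");
  }
}
```

Storing graph membership on the vertex and edge is one possible way to let an element belong to several logical graphs without duplicating it.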

Graph operators

The EPGM provides operators for single logical graphs as well as for collections of logical graphs; operators may return either a single graph or a graph collection. The following tables contain an overview (GC = Graph Collection, G = Logical Graph).

Unary logical graph operators (one graph as input):

Operator	Output	Output description	Impl
Aggregation	G	Graph with result of an aggregate function as a new property	Yes
Matching	GC	Graphs that match a given graph pattern	Yes
Transformation	G	Graph with transformed (graph, vertex, edge) data	Yes
Grouping	G	Structurally condensed version of the input graph	Yes
Subgraph	G	Subgraph that fulfils given vertex and edge predicates	Yes
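As a rough illustration of two of the unary operators in the table above, the sketch below models Subgraph and Aggregation over a small in-memory graph in plain Java. All types and method names (`UnaryOperatorSketch`, `subgraph`, `aggregateVertexCount`, the `vertexCount` property key) are hypothetical and do not come from Gradoop. The sketch also assumes that an edge is kept only if both of its endpoints are kept, which may differ from the behaviour of the actual operator.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Illustrative model only; none of these types or methods are Gradoop's API.
public class UnaryOperatorSketch {

  static class Vertex {
    final long id;
    final String label;

    Vertex(long id, String label) {
      this.id = id;
      this.label = label;
    }
  }

  static class Edge {
    final long id;
    final String label;
    final long source;
    final long target;

    Edge(long id, String label, long source, long target) {
      this.id = id;
      this.label = label;
      this.source = source;
      this.target = target;
    }
  }

  static class Graph {
    final Map<String, Object> properties = new HashMap<>();
    final List<Vertex> vertices = new ArrayList<>();
    final List<Edge> edges = new ArrayList<>();
  }

  /** Subgraph: keep the vertices and edges that fulfil the given predicates. */
  static Graph subgraph(Graph in, Predicate<Vertex> vertexPred, Predicate<Edge> edgePred) {
    Graph out = new Graph();
    Set<Long> keptIds = new HashSet<>();
    for (Vertex v : in.vertices) {
      if (vertexPred.test(v)) {
        out.vertices.add(v);
        keptIds.add(v.id);
      }
    }
    for (Edge e : in.edges) {
      // Sketch assumption: an edge survives only if both endpoints survive.
      if (edgePred.test(e) && keptIds.contains(e.source) && keptIds.contains(e.target)) {
        out.edges.add(e);
      }
    }
    return out;
  }

  /** Aggregation: return a graph that carries the aggregate value as a new property. */
  static Graph aggregateVertexCount(Graph in) {
    Graph out = new Graph();
    out.vertices.addAll(in.vertices);
    out.edges.addAll(in.edges);
    out.properties.putAll(in.properties);
    out.properties.put("vertexCount", in.vertices.size());
    return out;
  }

  public static void main(String[] args) {
    Graph g = new Graph();
    g.vertices.add(new Vertex(1L, "User"));
    g.vertices.add(new Vertex(2L, "User"));
    g.vertices.add(new Vertex(3L, "Forum"));
    g.edges.add(new Edge(10L, "knows", 1L, 2L));
    g.edges.add(new Edge(11L, "memberOf", 1L, 3L));

    // Compose the two operators: restrict to users, then count them.
    Graph users = subgraph(g, v -> v.label.equals("User"), e -> true);
    Graph counted = aggregateVertexCount(users);
    System.out.println(counted.properties.get("vertexCount") + " users, "
        + counted.edges.size() + " edge(s)");
  }
}
```

Both functions return a new graph instead of modifying their input, which is what makes it possible to chain them into a workflow.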

Binary logical graph operators (two graphs as input):

Operator	Output	Output description	Impl
Combination	G	Graph with vertices and edges from both input graphs	Yes
Overlap	G	Graph with vertices and edges that exist in both input graphs	Yes
Exclusion	G	Graph with vertices and edges that exist only in the first graph	Yes
Equality	{true, false}	Compare graphs in terms of identity or equality of contained elements	Yes
VertexFusion	G	The second graph is fused to a single vertex within the first graph	Yes
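The set semantics of Combination, Overlap and Exclusion can be sketched with plain Java collections. The code below is only an illustration of the descriptions in the table: the `Graph` class, its fluent `vertex`/`edge` helpers and the three static methods are hypothetical names, not Gradoop's API. For Exclusion, the sketch additionally drops edges whose endpoints are no longer present, which is an assumption made here to avoid dangling edges.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative model only; none of these types or methods are Gradoop's API.
public class BinaryOperatorSketch {

  static class Graph {
    final Set<Long> vertices = new TreeSet<>();
    // edge id -> {source vertex id, target vertex id}
    final Map<Long, long[]> edges = new TreeMap<>();

    Graph vertex(long id) {
      vertices.add(id);
      return this;
    }

    Graph edge(long id, long source, long target) {
      edges.put(id, new long[] {source, target});
      return this;
    }
  }

  /** Combination: vertices and edges from both input graphs. */
  static Graph combine(Graph a, Graph b) {
    Graph r = new Graph();
    r.vertices.addAll(a.vertices);
    r.vertices.addAll(b.vertices);
    r.edges.putAll(a.edges);
    r.edges.putAll(b.edges);
    return r;
  }

  /** Overlap: vertices and edges that exist in both input graphs. */
  static Graph overlap(Graph a, Graph b) {
    Graph r = new Graph();
    r.vertices.addAll(a.vertices);
    r.vertices.retainAll(b.vertices);
    for (Map.Entry<Long, long[]> e : a.edges.entrySet()) {
      if (b.edges.containsKey(e.getKey())) {
        r.edges.put(e.getKey(), e.getValue());
      }
    }
    return r;
  }

  /** Exclusion: vertices and edges that exist only in the first graph. */
  static Graph exclude(Graph a, Graph b) {
    Graph r = new Graph();
    r.vertices.addAll(a.vertices);
    r.vertices.removeAll(b.vertices);
    for (Map.Entry<Long, long[]> e : a.edges.entrySet()) {
      long[] ends = e.getValue();
      // Sketch assumption: drop edges that would dangle after vertex removal.
      boolean endpointsKept = r.vertices.contains(ends[0]) && r.vertices.contains(ends[1]);
      if (!b.edges.containsKey(e.getKey()) && endpointsKept) {
        r.edges.put(e.getKey(), ends);
      }
    }
    return r;
  }

  public static void main(String[] args) {
    Graph a = new Graph().vertex(1).vertex(2).vertex(3).edge(10, 1, 2).edge(11, 2, 3);
    Graph b = new Graph().vertex(2).vertex(3).vertex(4).edge(11, 2, 3).edge(12, 3, 4);

    System.out.println("combine: " + combine(a, b).vertices + " " + combine(a, b).edges.keySet());
    System.out.println("overlap: " + overlap(a, b).vertices + " " + overlap(a, b).edges.keySet());
    System.out.println("exclude: " + exclude(a, b).vertices + " " + exclude(a, b).edges.keySet());
  }
}
```

Elements are compared by identifier here, so two graphs overlap only where they share the same vertex and edge ids.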

Unary graph collection operators (one collection as input):

Operator	Output	Output description	Impl
Matching	GC	Graphs that match a given graph pattern	Yes
Selection	GC	Filter graphs based on their attached data (i.e. label, properties)	Yes
Distinct	GC	Collection with no duplicate graphs	Yes
SortBy	GC	Collection sorted by values of a given property key	No
Limit	GC	The first n arbitrary elements of the input collection	Yes

Binary graph collection operators (two collections as input):

Operator	Output	Output description	Impl
Union	GC	All graphs from both input collections	Yes
Intersection	GC	Only graphs that exist in both collections	Yes
Difference	GC	Graphs that exist only in the first collection	Yes
Equality	{true, false}	Compare collections in terms of identity or equality of contained elements	Yes

Auxiliary operators:

Operator	In	Out	Output description	Impl
Apply	GC	GC	Applies unary operator (e.g. aggregate) on each graph in the collection	Yes
Reduce	GC	G	Reduces collection to single graph using binary operator (e.g. combine)	Yes
Call	GC/G	GC/G	Applies external algorithm on graph or graph collection	Yes
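Apply and Reduce correspond closely to the familiar map and fold patterns. The following plain-Java sketch shows that correspondence under simplifying assumptions: a graph is reduced to a set of vertex ids plus a property map, a collection is a `List`, and the operators passed in (`vertexCount`, `combine`) are hypothetical stand-ins for a unary aggregate and a binary combination. None of the names are taken from Gradoop.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.BinaryOperator;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Illustrative model only; none of these types or methods are Gradoop's API.
public class AuxiliaryOperatorSketch {

  static class Graph {
    final Set<Long> vertices = new TreeSet<>();
    final Map<String, Object> properties = new HashMap<>();

    static Graph of(long... ids) {
      Graph g = new Graph();
      for (long id : ids) {
        g.vertices.add(id);
      }
      return g;
    }
  }

  /** Example unary operator: attach the vertex count as a new property. */
  static Graph vertexCount(Graph in) {
    Graph out = new Graph();
    out.vertices.addAll(in.vertices);
    out.properties.putAll(in.properties);
    out.properties.put("vertexCount", in.vertices.size());
    return out;
  }

  /** Example binary operator: union of the vertices of both graphs. */
  static Graph combine(Graph a, Graph b) {
    Graph out = new Graph();
    out.vertices.addAll(a.vertices);
    out.vertices.addAll(b.vertices);
    return out;
  }

  /** Apply: run a unary graph operator on every graph of the collection (GC -> GC). */
  static List<Graph> apply(List<Graph> collection, UnaryOperator<Graph> op) {
    return collection.stream().map(op).collect(Collectors.toList());
  }

  /** Reduce: fold the collection into one graph with a binary operator (GC -> G). */
  static Graph reduce(List<Graph> collection, BinaryOperator<Graph> op) {
    // An empty collection yields an empty graph in this sketch.
    return collection.stream().reduce(op).orElseGet(Graph::new);
  }

  public static void main(String[] args) {
    List<Graph> collection = Arrays.asList(Graph.of(1, 2), Graph.of(2, 3), Graph.of(3, 4));

    List<Graph> counted = apply(collection, AuxiliaryOperatorSketch::vertexCount);
    Graph merged = reduce(collection, AuxiliaryOperatorSketch::combine);

    System.out.println("first graph vertexCount = " + counted.get(0).properties.get("vertexCount"));
    System.out.println("merged vertices = " + merged.vertices);
  }
}
```

Passing the operator as a function argument keeps Apply and Reduce independent of any particular unary or binary operator.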

Setup

Use gradoop via Maven

Add one of the following dependencies to your Maven project:

Stable:

<dependency>
    <groupId>org.gradoop</groupId>
    <artifactId>gradoop-flink</artifactId>
    <version>0.3.2</version>
</dependency>

Latest nightly build (additional repository is required):

<repositories>
    <repository>
        <id>oss.sonatype.org-snapshot</id>
        <url>http://oss.sonatype.org/content/repositories/snapshots</url>
        <releases><enabled>false</enabled></releases>
        <snapshots><enabled>true</enabled></snapshots>
    </repository>
</repositories>

<dependency>
    <groupId>org.gradoop</groupId>
    <artifactId>gradoop-flink</artifactId>
    <version>0.3.3-SNAPSHOT</version>
</dependency>

Build gradoop from source

Gradoop requires Java 8.

Clone Gradoop into your local file system:

git clone https://github.com/dbs-leipzig/gradoop.git

Build and execute tests:

cd gradoop
mvn clean install

Gradoop modules

gradoop-common

The main contents of this module are the EPGM data model and a corresponding POJO implementation which is used in Flink™. The persistent representation of the EPGM is also contained in gradoop-common, together with its mapping to HBase™.

gradoop-hbase

Input and output formats for reading and writing graph collections from Apache HBase.

gradoop-flink

This module contains reference implementations of the EPGM operators. The EPGM is mapped to Flink™ DataSets while the operators are implemented using DataSet transformations. The module also contains implementations of general graph algorithms (e.g. Label Propagation, Frequent Subgraph Mining) adapted to be used with the EPGM model.

gradoop-examples

Contains example pipelines showing use cases for Gradoop.

Graph grouping example (build structural aggregates of property graphs)
Social network examples (composition of multiple operators to analyze social network graphs)
Input/Output examples (usage of DataSource and DataSink implementations)
Benchmarks used for cluster evaluations

gradoop-checkstyle

Used to maintain the code style for the whole project.

Version History

0.0.1 first prototype using Hadoop MapReduce and Apache Giraph for operator processing
0.0.2 support for HBase as distributed graph storage
0.0.3 Apache Flink replaces MapReduce and Giraph as operator implementation layer and distributed execution engine
0.1 Major refactoring of internal EPGM representation (e.g. ID and property handling), Equality Operators, GDL-based unit testing
0.2.0 Pattern Matching and Frequent Subgraph Mining algorithms
0.3.1 Bug fixes and support for more Gelly algorithms

Disclaimer

Apache®, Apache Flink™, Flink™, Apache HBase™ and HBase™ are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
