HugeGraph Spark Connector

HugeGraph Spark Connector is a Spark connector for reading and writing HugeGraph data in Spark's standard format.

Building

Required:

  • Java 8+
  • Maven 3.6+

To build without executing tests:

mvn clean package -DskipTests

To build with default tests:

mvn clean package

How to use

Suppose we have a graph whose schema is defined as follows:

Schema

schema.propertyKey("name").asText().ifNotExist().create()
schema.propertyKey("age").asInt().ifNotExist().create()
schema.propertyKey("city").asText().ifNotExist().create()
schema.propertyKey("weight").asDouble().ifNotExist().create()
schema.propertyKey("lang").asText().ifNotExist().create()
schema.propertyKey("date").asText().ifNotExist().create()
schema.propertyKey("price").asDouble().ifNotExist().create()

schema.vertexLabel("person")
        .properties("name", "age", "city")
        .useCustomizeStringId()
        .nullableKeys("age", "city")
        .ifNotExist()
        .create()

schema.vertexLabel("software")
        .properties("name", "lang", "price")
        .primaryKeys("name")
        .ifNotExist()
        .create()

schema.edgeLabel("knows")
        .sourceLabel("person")
        .targetLabel("person")
        .properties("date", "weight")
        .ifNotExist()
        .create()

schema.edgeLabel("created")
        .sourceLabel("person")
        .targetLabel("software")
        .properties("date", "weight")
        .ifNotExist()
        .create()

Then we can insert graph data through Spark. First, add the dependency to your pom.xml:

<dependency>
    <groupId>org.apache.hugegraph</groupId>
    <artifactId>hugegraph-spark-connector</artifactId>
    <version>${revision}</version>
</dependency>

Vertex Sink

import org.apache.spark.sql.SaveMode

// Assumes an existing SparkSession named sparkSession.
val df = sparkSession.createDataFrame(Seq(
  Tuple3("marko", 29, "Beijing"),
  Tuple3("vadas", 27, "HongKong"),
  Tuple3("Josh", 32, "Beijing"),
  Tuple3("peter", 35, "ShangHai"),
  Tuple3("li,nary", 26, "Wu,han"),
  Tuple3("Bob", 18, "HangZhou")
)).toDF("name", "age", "city")

df.show()

df.write
  .format("org.apache.hugegraph.spark.connector.DataSource")
  .option("host", "127.0.0.1")
  .option("port", "8080")
  .option("graph", "hugegraph")
  .option("data-type", "vertex")
  .option("label", "person")
  .option("id", "name")
  .option("batch-size", 2)
  .mode(SaveMode.Overwrite)
  .save()
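
The software label in the schema above uses primary keys instead of a customized id, so according to the Graph Data Configs the id option is left unset for it. The following is a minimal sketch of that case, not taken from the project: the object name, the local SparkSession setup and the sample rows are assumptions made for illustration, and the options are passed as a map of strings.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Hypothetical example object; not part of the connector.
object SoftwareVertexSink {
  // Same connection settings as the person example. There is no "id" entry
  // because the "software" label uses the PRIMARY_KEY id policy ("name").
  val options: Map[String, String] = Map(
    "host" -> "127.0.0.1",
    "port" -> "8080",
    "graph" -> "hugegraph",
    "data-type" -> "vertex",
    "label" -> "software",
    "batch-size" -> "2"
  )

  def main(args: Array[String]): Unit = {
    // Assumption: a local SparkSession; adjust appName/master for a real cluster.
    val sparkSession = SparkSession.builder()
      .appName("software-vertex-sink")
      .master("local[*]")
      .getOrCreate()

    // Illustrative rows; the columns match the properties of the "software" label.
    val df = sparkSession.createDataFrame(Seq(
      ("lop", "java", 328.0),
      ("ripple", "java", 199.0)
    )).toDF("name", "lang", "price")

    df.write
      .format("org.apache.hugegraph.spark.connector.DataSource")
      .options(options)
      .mode(SaveMode.Overwrite)
      .save()

    sparkSession.stop()
  }
}
```

Running main requires a reachable HugeGraphServer with the schema above already created.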

Edge Sink

val df = sparkSession.createDataFrame(Seq(
  Tuple4("marko", "vadas", "20160110", 0.5),
  Tuple4("peter", "Josh", "20230801", 1.0),
  Tuple4("peter", "li,nary", "20130220", 2.0)
)).toDF("source", "target", "date", "weight")

df.show()

df.write
  .format("org.apache.hugegraph.spark.connector.DataSource")
  .option("host", "127.0.0.1")
  .option("port", "8080")
  .option("graph", "hugegraph")
  .option("data-type", "edge")
  .option("label", "knows")
  .option("source-name", "source")
  .option("target-name", "target")
  .option("batch-size", 2)
  .mode(SaveMode.Overwrite)
  .save()
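
The created edge label connects a person (customized id) to a software vertex (primary key name). A sketch of writing such edges might look as follows. It is an illustration under assumptions: the sample rows are made up, the SparkSession is local, and the target column is deliberately named after the primary key (name), since this README does not state whether the connector requires the column name to match the primary key.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Hypothetical example object; not part of the connector.
object CreatedEdgeSink {
  val options: Map[String, String] = Map(
    "host" -> "127.0.0.1",
    "port" -> "8080",
    "graph" -> "hugegraph",
    "data-type" -> "edge",
    "label" -> "created",
    // Source vertex: "person" with CUSTOMIZE id, one column holds the id.
    "source-name" -> "source",
    // Target vertex: "software" with PRIMARY_KEY id, built from the "name" column.
    // Assumption: naming the column like the primary key is the safe choice.
    "target-name" -> "name",
    "batch-size" -> "2"
  )

  def main(args: Array[String]): Unit = {
    // Assumption: a local SparkSession; adjust appName/master for a real cluster.
    val sparkSession = SparkSession.builder()
      .appName("created-edge-sink")
      .master("local[*]")
      .getOrCreate()

    // Illustrative rows; "date" and "weight" are the edge properties.
    val df = sparkSession.createDataFrame(Seq(
      ("marko", "lop", "20171210", 0.4),
      ("peter", "lop", "20170324", 0.2)
    )).toDF("source", "name", "date", "weight")

    df.write
      .format("org.apache.hugegraph.spark.connector.DataSource")
      .options(options)
      .mode(SaveMode.Overwrite)
      .save()

    sparkSession.stop()
  }
}
```

Both the person and the software vertices referenced by the rows are expected to exist before the edges are written.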

Configs

Client Configs are used to configure hugegraph-client.

Client Configs

| Params | Default Value | Description |
|---|---|---|
| host | localhost | Address of HugeGraphServer |
| port | 8080 | Port of HugeGraphServer |
| graph | hugegraph | Graph space name |
| protocol | http | Protocol for sending requests to the server, either http or https |
| username | null | Username of the current graph when HugeGraphServer has permission authentication enabled |
| token | null | Token of the current graph when HugeGraphServer has permission authentication enabled |
| timeout | 60 | Timeout (seconds) for an insert request to return |
| max-conn | CPUS * 4 | The maximum number of HTTP connections between HugeClient and HugeGraphServer |
| max-conn-per-route | CPUS * 2 | The maximum number of HTTP connections for each route between HugeClient and HugeGraphServer |
| trust-store-file | null | The client's certificate file path when the request protocol is https |
| trust-store-token | null | The client's certificate password when the request protocol is https |
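
As an illustration of the client options, the sketch below writes one person vertex to a server that uses https with authentication enabled. Only the option keys come from the table; the object and helper names, the host, port, user name, trust store path, timeout value and environment variable names are all hypothetical placeholders.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Hypothetical example object; not part of the connector.
object SecureVertexSink {
  // Builds the client options for an https server with authentication enabled.
  // Every value is supplied by the caller or chosen here for illustration.
  def clientOptions(host: String, port: Int, username: String, token: String,
                    trustStoreFile: String, trustStoreToken: String): Map[String, String] =
    Map(
      "host" -> host,
      "port" -> port.toString,
      "graph" -> "hugegraph",
      "protocol" -> "https",
      "username" -> username,
      "token" -> token,
      "timeout" -> "120",
      "trust-store-file" -> trustStoreFile,
      "trust-store-token" -> trustStoreToken
    )

  def main(args: Array[String]): Unit = {
    // Assumption: a local SparkSession; adjust appName/master for a real cluster.
    val sparkSession = SparkSession.builder()
      .appName("secure-vertex-sink")
      .master("local[*]")
      .getOrCreate()

    val df = sparkSession.createDataFrame(Seq(
      ("marko", 29, "Beijing")
    )).toDF("name", "age", "city")

    // Placeholder values; secrets are read from hypothetical environment variables.
    val client = clientOptions(
      host = "graph.example.com",
      port = 8443,
      username = "admin",
      token = sys.env.getOrElse("HUGEGRAPH_TOKEN", ""),
      trustStoreFile = "/path/to/truststore",
      trustStoreToken = sys.env.getOrElse("HUGEGRAPH_TRUST_STORE_TOKEN", "")
    )

    df.write
      .format("org.apache.hugegraph.spark.connector.DataSource")
      .options(client)
      .option("data-type", "vertex")
      .option("label", "person")
      .option("id", "name")
      .mode(SaveMode.Overwrite)
      .save()

    sparkSession.stop()
  }
}
```

Keeping the client options in one helper makes it easy to reuse the same connection settings for both vertex and edge writes.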

Graph Data Configs

Graph Data Configs are used to set graph space configuration.

| Params | Default Value | Description |
|---|---|---|
| data-type | | Graph data type, must be vertex or edge |
| label | | Label to which the vertex/edge data to be imported belongs |
| id | | Column to use as the id of the vertex. Required when the vertex id policy is CUSTOMIZE; must be empty when the id policy is PRIMARY_KEY |
| source-name | | Columns of the input source used as the id of the source vertex. When the id policy of the source vertex is CUSTOMIZE, one column must be specified as the vertex id; when it is PRIMARY_KEY, one or more columns must be specified, and their values are spliced together to generate the vertex id. This option is required whichever id policy is used |
| target-name | | Columns used as the id of the target vertex, analogous to source-name |
| selected-fields | | Columns to insert; unselected columns are not inserted. Cannot be set together with ignored-fields |
| ignored-fields | | Columns to exclude from insertion. Cannot be set together with selected-fields |
| batch-size | 500 | The number of data items in each batch when importing data |
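
Because selected-fields and ignored-fields are mutually exclusive, it can help to build them through one small helper that rejects setting both. The sketch below does that and uses ignored-fields to keep an extra comment column out of the person vertices. The object, the helper and the comment column are hypothetical, and the check only mirrors the rule from the table; how the connector itself reports a conflict is not covered here.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Hypothetical example object; not part of the connector.
object FieldFilterSink {
  // Returns the option for either selected-fields or ignored-fields.
  // The two lists are mutually exclusive, so setting both is rejected up front.
  // Column names are joined with the default delimiter ",".
  def fieldOptions(selected: Seq[String], ignored: Seq[String]): Map[String, String] = {
    require(selected.isEmpty || ignored.isEmpty,
      "selected-fields and ignored-fields cannot be set at the same time")
    if (selected.nonEmpty) Map("selected-fields" -> selected.mkString(","))
    else if (ignored.nonEmpty) Map("ignored-fields" -> ignored.mkString(","))
    else Map.empty
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local SparkSession; adjust appName/master for a real cluster.
    val sparkSession = SparkSession.builder()
      .appName("field-filter-sink")
      .master("local[*]")
      .getOrCreate()

    // "comment" is not a property of the "person" label, so it is ignored on write.
    val df = sparkSession.createDataFrame(Seq(
      ("marko", 29, "Beijing", "imported from csv"),
      ("vadas", 27, "HongKong", "imported from csv")
    )).toDF("name", "age", "city", "comment")

    df.write
      .format("org.apache.hugegraph.spark.connector.DataSource")
      .option("host", "127.0.0.1")
      .option("port", "8080")
      .option("graph", "hugegraph")
      .option("data-type", "vertex")
      .option("label", "person")
      .option("id", "name")
      .options(fieldOptions(selected = Nil, ignored = Seq("comment")))
      .mode(SaveMode.Overwrite)
      .save()

    sparkSession.stop()
  }
}
```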

Common Configs

Common Configs contain some common configurations.

| Params | Default Value | Description |
|---|---|---|
| delimiter | , | Separator used in source-name, target-name, selected-fields and ignored-fields |
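
When a column list has to use a separator other than the default comma, the list and the delimiter option need to agree. A small sketch of a helper that keeps them together is shown below; the object and function names are hypothetical, and ";" is chosen only as an example separator.

```scala
// Hypothetical helper; not part of the connector.
object DelimitedColumns {
  // Joins column names with the chosen delimiter and returns both the list option
  // and the matching "delimiter" option. Names containing the delimiter are rejected,
  // since the joined string could not be split back into the same columns.
  def columnListOptions(key: String, columns: Seq[String], delimiter: String): Map[String, String] = {
    require(columns.forall(!_.contains(delimiter)),
      s"column names must not contain '$delimiter'")
    Map(key -> columns.mkString(delimiter), "delimiter" -> delimiter)
  }
}
```

The resulting map can be passed to the writer with `.options(DelimitedColumns.columnListOptions("ignored-fields", Seq("comment", "updated_at"), ";"))`, where comment and updated_at stand for whatever columns should be skipped.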

License

Like HugeGraph, hugegraph-spark-connector is licensed under the Apache 2.0 License.