HugeGraph Spark Connector is a Spark connector application for reading and writing HugeGraph data in Spark standard format.
Required:
- Java 8+
- Maven 3.6+
To build without executing tests:
mvn clean package -DskipTests
To build with default tests:
mvn clean packge
If we have a graph, the schema is defined as follows:
schema.propertyKey("name").asText().ifNotExist().create()
schema.propertyKey("age").asInt().ifNotExist().create()
schema.propertyKey("city").asText().ifNotExist().create()
schema.propertyKey("weight").asDouble().ifNotExist().create()
schema.propertyKey("lang").asText().ifNotExist().create()
schema.propertyKey("date").asText().ifNotExist().create()
schema.propertyKey("price").asDouble().ifNotExist().create()
schema.vertexLabel("person")
.properties("name", "age", "city")
.useCustomizeStringId()
.nullableKeys("age", "city")
.ifNotExist()
.create()
schema.vertexLabel("software")
.properties("name", "lang", "price")
.primaryKeys("name")
.ifNotExist()
.create()
schema.edgeLabel("knows")
.sourceLabel("person")
.targetLabel("person")
.properties("date", "weight")
.ifNotExist()
.create()
schema.edgeLabel("created")
.sourceLabel("person")
.targetLabel("software")
.properties("date", "weight")
.ifNotExist()
.create()
Then we can insert graph data through Spark, first add dependency in your pom.
<dependency>
<groupId>org.apache.hugegraph</groupId>
<artifactId>hugegraph-spark-connector</artifactId>
<version>${revision}</version>
</dependency>
val df = sparkSession.createDataFrame(Seq(
Tuple3("marko", 29, "Beijing"),
Tuple3("vadas", 27, "HongKong"),
Tuple3("Josh", 32, "Beijing"),
Tuple3("peter", 35, "ShangHai"),
Tuple3("li,nary", 26, "Wu,han"),
Tuple3("Bob", 18, "HangZhou"),
)) toDF("name", "age", "city")
df.show()
df.write
.format("org.apache.hugegraph.spark.connector.DataSource")
.option("host", "127.0.0.1")
.option("port", "8080")
.option("graph", "hugegraph")
.option("data-type", "vertex")
.option("label", "person")
.option("id", "name")
.option("batch-size", 2)
.mode(SaveMode.Overwrite)
.save()
val df = sparkSession.createDataFrame(Seq(
Tuple4("marko", "vadas", "20160110", 0.5),
Tuple4("peter", "Josh", "20230801", 1.0),
Tuple4("peter", "li,nary", "20130220", 2.0)
)).toDF("source", "target", "date", "weight")
df.show()
df.write
.format("org.apache.hugegraph.spark.connector.DataSource")
.option("host", "127.0.0.1")
.option("port", "8080")
.option("graph", "hugegraph")
.option("data-type", "edge")
.option("label", "knows")
.option("source-name", "source")
.option("target-name", "target")
.option("batch-size", 2)
.mode(SaveMode.Overwrite)
.save()
Client Configs are used to configure hugegraph-client.
Params | Default Value | Description |
---|---|---|
host |
localhost |
Address of HugeGraphServer |
port |
8080 |
Port of HugeGraphServer |
graph |
hugegraph |
Graph space name |
protocol |
http |
Protocol for sending requests to the server, optional http or https |
username |
null |
Username of the current graph when HugeGraphServer enables permission authentication |
token |
null |
Token of the current graph when HugeGraphServer has enabled authorization authentication |
timeout |
60 |
Timeout (seconds) for inserting results to return |
max-conn |
CPUS * 4 |
The maximum number of HTTP connections between HugeClient and HugeGraphServer |
max-conn-per-route |
CPUS * 2 |
The maximum number of HTTP connections for each route between HugeClient and HugeGraphServer |
trust-store-file |
null |
The client’s certificate file path when the request protocol is https |
trust-store-token |
null |
The client's certificate password when the request protocol is https |
Graph Data Configs are used to set graph space configuration.
Params | Default Value | Description |
---|---|---|
date-type |
Graph data type, must be vertex or edge |
|
label |
Label to which the vertex/edge data to be imported belongs | |
id |
Specify a column as the id column of the vertex. When the vertex id policy is CUSTOMIZE, it is required; when the id policy is PRIMARY_KEY, it must be empty | |
source-name |
Select certain columns of the input source as the id column of source vertex. When the id policy of the source vertex is CUSTOMIZE, a certain column must be specified as the id column of the vertex; when the id policy of the source vertex is When PRIMARY_KEY, one or more columns must be specified for splicing the id of the generated vertex, that is, no matter which id strategy is used, this item is required | |
target-name |
Specify certain columns as the id columns of target vertex, similar to source | |
selected-fields |
Select some columns to insert, other unselected ones are not inserted, cannot exist at the same time as ignored | |
ignored-fields |
Ignore some columns so that they do not participate in insertion, cannot exist at the same time as selected | |
batch-size |
500 |
The number of data items in each batch when importing data |
Common Configs contains some common configurations.
Params | Default Value | Description |
---|---|---|
delimiter |
, |
Separator of source-name , target-name , selected-fields or ignore-fields |
The same as HugeGraph, hugegraph-spark-connector is also licensed under Apache 2.0 License.