All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog.
- Detect and partition sparse region of UIDs (pull #224)
- Estimator `maxLeaseId` renamed to `maxUid`, as used with option `dgraph.partitioner.uidRange.estimator` (pull #221).
- Upgraded gson and requests dependencies (pull #225).
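The renamed estimator is selected through the reader option named in this entry; a minimal sketch (only the option name and value are taken from this changelog, everything else about the read call is assumed):

```properties
# sketch: select the uid estimator by its new name (was "maxLeaseId")
dgraph.partitioner.uidRange.estimator=maxUid
```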
- Work with `maxUid` values that cannot be parsed (pull #216).
- Handle `maxUid` values larger than `Long.MaxValue` (pull #216).
- Handle Dgraph data type `default` as plain strings (pull #223).
- Supports full unsigned long (64 bits) value range of Dgraph uids, mapped into signed longs (pull #222).
- Moved to shaded Java Dgraph client (`uk.co.gresearch.dgraph:dgraph4j-shaded:21.12.0-0`).
- Moved Java Dgraph client to 21.12.0.
- Support latest dgraph release 21.12.0 (pull #147)
- Moved Java Dgraph client to 21.03.1.
- Support latest dgraph release 21.03.0 (pull #101)
- Adds support to read string predicates with language tags like `<http://www.w3.org/2000/01/rdf-schema#label@en>` (issue #63). This works with any source and mode except the node source in wide mode. Note that reading into GraphFrames is based on the wide mode, so only the untagged language strings can be read there. Filter pushdown is not supported for multi-language predicates yet (issue #68).
- Adds readable exception and suggests next steps when GRPC fails with `RESOURCE_EXHAUSTED` code.
- Missing `maxLeaseId` in cluster state response defaults to `1000L` to avoid an exception.
- Improves predicate partitioning on projection pushdown as it creates full partitions.
- Fixes bug that did not push predicate value filter correctly down to Dgraph causing incorrect results (issue #82)
- Fixes bug in reading `geo` and `password` data types.
- Tests against Dgraph 20.03, 20.07 and 20.11.
- Moved Java Dgraph client to 20.11.0.
- Upgraded all dependencies to latest versions.
- Optionally reads all partitions within the same transaction. This guarantees a consistent snapshot of the graph (issue #6). However, concurrent mutations reduce the lifetime of such a transaction and cause an exception when that lifetime is exceeded.
- Add Python API that mirrors the Scala API. The README.md fully documents how to load Dgraph data in PySpark.
- Fixed dependency conflicts between connector dependencies and Spark by shading the Java Dgraph client and all its dependencies.
- Refactored connector API, renamed `spark.read.dgraph*` methods to `spark.read.dgraph.*`.
- Moved `triples`, `edges` and `nodes` sources from package `uk.co.gresearch.spark.dgraph.connector` to `uk.co.gresearch.spark.dgraph`.
- Moved Java Dgraph client to 20.03.1 and Dgraph test cluster to 20.07.0.
- Add Spark filter pushdown and projection pushdown to improve efficiency when loading only subgraphs. Filters like `.where($"revenue".isNotNull)` and projections like `.select($"subject", $"`dgraph.type`", $"revenue")` will be pushed to Dgraph and only the relevant graph data will be read (issue #7).
- Improve performance of `PredicatePartitioner` for multiple predicates per partition. Restoring default number of predicates per partition of `1000` from before 0.3.0 (issue #22).
- The `PredicatePartitioner` combined with `UidRangePartitioner` is the default partitioner now.
- Add stream-like reading of partitions from Dgraph. Partitions are split into smaller chunks. This makes Spark read Dgraph partitions of any size.
- Add Dgraph metrics to measure throughput, visible in Spark UI Stages page and through `SparkListener`.
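The partitioning behaviour described above is driven by reader options; a minimal sketch (the option name and its `1000` default are taken from these entries, the value shown is illustrative):

```properties
# sketch: number of predicates grouped into one partition (default 1000 since this release)
dgraph.partitioner.predicate.predicatesPerPartition=1000
```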
- Move Google Guava dependency version to 24.1.1-jre due to a known security vulnerability fixed in 24.1.1.
- Load data from Dgraph cluster as GraphFrames `GraphFrame`.
- Use exact uid cardinality for uid range partitioning. Combined with predicate partitioning, large predicates get split into more partitions than small predicates (issue #2).
- Improve performance of `PredicatePartitioner` for a single predicate per partition (`dgraph.partitioner.predicate.predicatesPerPartition=1`). This becomes the new default for this partitioner.
- Move to Spark 3.0.0 release (was 3.0.0-preview2).
- Dgraph groups with no predicates caused a `NullPointerException`.
- Predicate names need to be escaped in Dgraph queries.
- Load nodes from Dgraph cluster as wide nodes (fully typed property columns).
- Added `dgraph.type` and `dgraph.graphql.schema` predicates to be loaded from Dgraph cluster.
Initial release of the project
- Load data from Dgraph cluster as triples (as strings or fully typed), edges or node `DataFrame`s.
- Load data from Dgraph cluster as Apache Spark GraphX `Graph`.
- Partitioning by Dgraph Group, Alpha node, predicates and uids.