A library for querying SPARQL endpoints with Apache Spark, for Spark SQL and DataFrames.
The SPARQL query types SELECT, CONSTRUCT, ASK and DESCRIBE are supported.
This library requires Spark 1.5+.
This library is built with sbt.
Use sbt assembly
to build the assembly jar, or sbt +assembly
to cross-build it for each configured Scala version.
This package can be added to Spark using the --jars
command line option. For example, to include it when starting the Spark shell (use the jar that matches your Scala version, 2.11 or 2.10):
$SPARK_HOME/bin/spark-shell --jars spark-sparql-connector-spark1.5.2-scala2.11-1.0.0-SNAPSHOT.jar
$SPARK_HOME/bin/spark-shell --jars spark-sparql-connector-spark1.5.2-scala2.10-1.0.0-SNAPSHOT.jar
import de.usu.research.sake.sparksparql.SparqlContext
val service = "http://dbpedia.org/sparql"
val query = """SELECT ?property ?hasValue
{ <http://dbpedia.org/resource/Le_Figaro> ?property ?hasValue }"""
val dataFrame = sqlContext.sparqlQuery(service, query)
When using the PySpark shell, this package must be added both to the driver and the executors by using the --driver-class-path
and --jars
command line options. For example, to include it when starting the PySpark shell (use the jar that matches your Scala version, 2.11 or 2.10):
$SPARK_HOME/bin/pyspark --driver-class-path spark-sparql-connector-spark1.5.2-scala2.11-1.0.0-SNAPSHOT.jar --jars spark-sparql-connector-spark1.5.2-scala2.11-1.0.0-SNAPSHOT.jar
$SPARK_HOME/bin/pyspark --driver-class-path spark-sparql-connector-spark1.5.2-scala2.10-1.0.0-SNAPSHOT.jar --jars spark-sparql-connector-spark1.5.2-scala2.10-1.0.0-SNAPSHOT.jar
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
query = """SELECT ?property ?hasValue
{ <http://dbpedia.org/resource/Le_Figaro> ?property ?hasValue }"""
df = sqlContext.read.format('de.usu.research.sake.sparksparql').options(service='http://dbpedia.org/sparql', query=query).load()
- SPARQL queries are parsed using Apache Jena ARQ to extract the result variables. The queries are then executed using the Apache Jena JDBC driver.
- Partitioning of the query results is currently not supported.
- All result columns are mapped to StringType by default. To map columns to other types such as IntegerType, BooleanType or TimestampType, take a look at the test suite.
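Because result columns arrive as StringType by default, one simple workaround on the PySpark side is to cast selected columns after loading. The sketch below is not the library's own type-mapping mechanism (that is shown in the test suite). It relies only on standard DataFrame calls (`columns`, `select`, `Column.cast`, `Column.alias`). The helper name `cast_columns` and the column names in the usage note are hypothetical.

```python
def cast_columns(df, casts):
    """Return a new DataFrame in which the columns named in `casts`
    are cast to the given Spark SQL type names (e.g. "int", "boolean",
    "timestamp"). All other columns are passed through unchanged.

    `cast_columns` is a hypothetical helper, not part of this library.
    """
    columns = []
    for name in df.columns:
        column = df[name]
        if name in casts:
            # Cast the string column and keep its original name.
            column = column.cast(casts[name]).alias(name)
        columns.append(column)
    return df.select(*columns)
```

For instance, for a hypothetical query with a `?population` variable holding numeric strings, `cast_columns(df, {"population": "int"})` would return a DataFrame whose `population` column is an integer column. Values that cannot be parsed as the target type become null after the cast, so this approach suits columns with uniform values.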