SPARQL connector for Spark

A library for querying SPARQL endpoints with Apache Spark, for Spark SQL and DataFrames.

The SPARQL query types SELECT, CONSTRUCT, ASK and DESCRIBE are supported.

Requirements

This library requires Spark 1.5 or later.

Building

This library is built with sbt. Use sbt assembly to build the jar, or sbt +assembly to cross-build it for every configured Scala version.

Using with Spark shell

This package can be added to Spark using the --jars command line option. For example, to include it when starting the Spark shell:

Spark compiled with Scala 2.11

$SPARK_HOME/bin/spark-shell --jars spark-sparql-connector-spark1.5.2-scala2.11-1.0.0-SNAPSHOT.jar

Spark compiled with Scala 2.10

$SPARK_HOME/bin/spark-shell --jars spark-sparql-connector-spark1.5.2-scala2.10-1.0.0-SNAPSHOT.jar

Scala example

import de.usu.research.sake.sparksparql.SparqlContext

val service = "http://dbpedia.org/sparql"
val query = """SELECT ?property ?hasValue
WHERE {
  { <http://dbpedia.org/resource/Le_Figaro> ?property ?hasValue }
}"""

val dataFrame = sqlContext.sparqlQuery(service, query)
dataFrame.show()

Using with PySpark shell

When using the PySpark shell, this package must be added to both the driver and the executors by using the --driver-class-path and --jars command line options. For example, to include it when starting the PySpark shell:

PySpark shell compiled with Scala 2.11

$SPARK_HOME/bin/pyspark --driver-class-path spark-sparql-connector-spark1.5.2-scala2.11-1.0.0-SNAPSHOT.jar --jars spark-sparql-connector-spark1.5.2-scala2.11-1.0.0-SNAPSHOT.jar

PySpark shell compiled with Scala 2.10

$SPARK_HOME/bin/pyspark --driver-class-path spark-sparql-connector-spark1.5.2-scala2.10-1.0.0-SNAPSHOT.jar --jars spark-sparql-connector-spark1.5.2-scala2.10-1.0.0-SNAPSHOT.jar
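Because the same jar has to appear in both options, it can be convenient to build the command from the Scala version. The following is a minimal sketch and not part of this library: the helper names connector_jar and pyspark_command are hypothetical, and the file name pattern is an assumption inferred from the two jar names above.

```python
import os


def connector_jar(scala_version, spark_version="1.5.2", version="1.0.0-SNAPSHOT"):
    """Return the assembly jar name for the given Scala and Spark versions.

    Assumption: the naming pattern is inferred from the jar names shown above.
    """
    return "spark-sparql-connector-spark{0}-scala{1}-{2}.jar".format(
        spark_version, scala_version, version)


def pyspark_command(scala_version, spark_home=None):
    """Build the pyspark argument list, passing the same jar to driver and executors."""
    if spark_home is None:
        spark_home = os.environ.get("SPARK_HOME", "")
    jar = connector_jar(scala_version)
    return [
        os.path.join(spark_home, "bin", "pyspark"),
        "--driver-class-path", jar,  # makes the jar visible to the driver
        "--jars", jar,               # ships the jar to the executors
    ]


if __name__ == "__main__":
    print(" ".join(pyspark_command("2.11")))
```

The returned list can be passed to subprocess.call, or joined with spaces to get a command line like the ones shown above.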

Python example

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

query = """SELECT ?property ?hasValue
WHERE {
  { <http://dbpedia.org/resource/Le_Figaro> ?property ?hasValue }
}"""
df = sqlContext.read.format('de.usu.research.sake.sparksparql').options(service='http://dbpedia.org/sparql', query=query).load()
df.collect()

Implementation details

SPARQL queries are parsed using Apache Jena ARQ to extract the result variables. The queries are then executed using the Apache Jena JDBC driver.
Partitioning of the query results is currently not supported.
All result columns are mapped to StringType by default. For a mapping to other types such as IntegerType, BooleanType or TimestampType, see the test suite SparqlSuite.
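
As an alternative to the connector's own type mapping, string columns can also be cast after loading with the standard DataFrame API. This is a minimal sketch, not the mechanism used in SparqlSuite: cast_columns is a hypothetical helper, and it relies only on the withColumn and cast methods that PySpark DataFrames and Columns provide.

```python
def cast_columns(df, types):
    """Return a DataFrame whose named columns are cast to the given Spark SQL types.

    `types` maps a column name to a type name such as "int", "boolean"
    or "timestamp". Columns not listed in `types` are left unchanged.
    """
    for name, type_name in types.items():
        # withColumn with an existing name replaces that column.
        df = df.withColumn(name, df[name].cast(type_name))
    return df
```

For a DataFrame df loaded as in the Python example, a call such as cast_columns(df, {"population": "int"}) would return a new DataFrame with that column as an integer; the column name population is only an illustration and depends on the variables in the query.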
