A library for parsing and querying shapefile data with Apache Spark, for Spark SQL and DataFrames.
This library requires Spark 2.0+. You must also clone and install https://github.com/Esri/geometry-api-java.git
$SPARK_HOME/bin/spark-shell --packages com.esri:spark-shp:0.30
This package reads shapefiles from a local or distributed filesystem as Spark DataFrames. When reading files, the API accepts several options:
- path: The location of the shapefile(s). Like Spark, it can accept standard Hadoop globbing expressions.
- shape: An optional name of the shape column. Default value is shape.
- columns: An optional list of comma-separated attribute column names. Default value is blank, indicating all attribute fields.
- format: An optional parameter to define the output format of the shape field. Default value is SHP. Possible values are:
  - SHP: Esri binary shape format.
  - WKT: Well-Known Text.
  - WKB: Well-Known Binary.
  - GEOJSON: GeoJSON.
- repair: An optional parameter to repair the read geometry. Possible values are:
  - None: No repair.
  - Esri: Apply Esri repair operator.
  - OGC: Apply OGC repair operator.
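The options above can be assembled and checked in Python before they are handed to Spark. The following is a minimal sketch, not part of this library: `shp_options` and `read_shp` are hypothetical helper names, and the strict, case-sensitive check against the documented values is an assumption about how the reader treats them. Options that are not set are left out so the library defaults apply.

```python
# Hypothetical helpers (not part of spark-shp). They only assemble the
# documented option names into a dict for DataFrameReader.options().

SHAPE_FORMATS = {"SHP", "WKT", "WKB", "GEOJSON"}
REPAIR_MODES = {"None", "Esri", "OGC"}


def shp_options(path, shape=None, columns=None, fmt=None, repair=None):
    """Build the reader options; unset options fall back to library defaults."""
    opts = {"path": path}
    if shape is not None:
        opts["shape"] = shape
    if columns:
        # The columns option is a single comma-separated string.
        opts["columns"] = ",".join(columns)
    if fmt is not None:
        # Assumption: values must match the documented spelling exactly.
        if fmt not in SHAPE_FORMATS:
            raise ValueError("format must be one of %s" % sorted(SHAPE_FORMATS))
        opts["format"] = fmt
    if repair is not None:
        if repair not in REPAIR_MODES:
            raise ValueError("repair must be one of %s" % sorted(REPAIR_MODES))
        opts["repair"] = repair
    return opts


def read_shp(spark, path, **kwargs):
    """Read a shapefile with an existing SparkSession.

    Requires the spark-shp package on the Spark classpath.
    """
    return spark.read.format("shp").options(**shp_options(path, **kwargs)).load()
```

With a running session, `read_shp(spark, "data/gps.shp", fmt="WKT")` would then return a DataFrame whose shape column holds WKT, assuming the package is loaded.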
CREATE TABLE gps
USING com.esri.spark.shp
OPTIONS
(
path "data/gps.shp"
)
df = spark.read \
.format("shp") \
.options(path="data/gps.shp", columns="atext,adate", format="GEOJSON") \
.load() \
.cache()
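With format="GEOJSON", the shape column carries GeoJSON. Assuming each value reaches Python as a JSON string (the column type is an assumption here), the standard json module is enough to inspect it. The sketch below uses a made-up sample string rather than data from gps.shp, and `point_xy` is a hypothetical helper name.

```python
import json


def point_xy(geojson_text):
    """Return (x, y) for a GeoJSON Point string, or None for other geometry types."""
    geom = json.loads(geojson_text)
    if geom.get("type") != "Point":
        return None
    x, y = geom["coordinates"][:2]
    return x, y


# Made-up sample value for illustration only.
sample = '{"type": "Point", "coordinates": [-77.03, 38.9]}'
print(point_xy(sample))
```

Against the DataFrame above, a call such as `point_xy(df.first()["shape"])` would apply the same parsing to the first row, assuming the default shape column name.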
This library is built using Apache Maven. To build the jar, execute the following command:
mvn clean install
- Download the shapefile of Metro Stations in DC
export ENV=spark-shp
conda remove --yes --all --name $ENV
conda create --yes --name $ENV python=3.6
source activate $ENV
conda install --yes --quiet -c conda-forge \
    jupyterlab \
    tqdm \
    future \
    matplotlib=3.1 \
    gdal=2.4 \
    pyproj=2.2 \
    shapely=1.6 \
    pyshp=2.1