Shapefile Data Source for Apache Spark

A library for parsing and querying shapefile data with Apache Spark, for Spark SQL and DataFrames.

Requirements

This library requires Spark 2.0+. Before building, clone and install https://github.com/Esri/geometry-api-java.git

Using with Spark shell

$SPARK_HOME/bin/spark-shell --packages com.esri:spark-shp:0.30

Features

This package reads shapefiles from a local or distributed filesystem as Spark DataFrames. When reading files, the API accepts the following options:

path The location of the shapefile(s). As with Spark, standard Hadoop globbing expressions are accepted.
shape An optional name of the shape column. Default value is shape.
columns An optional list of comma-separated attribute column names. Default value is blank, which selects all attribute fields.
format An optional parameter to define the output format of the shape field. Default value is SHP. Possible values are:
- SHP Esri binary shape format.
- WKT Well-Known Text.
- WKB Well-Known Binary.
- GEOJSON GeoJSON.
repair An optional parameter to repair the read geometry. Possible values are:
- None No repair.
- Esri Apply Esri repair operator.
- OGC Apply OGC repair operator.
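
The options above can be assembled in Python before handing them to the reader. The following is a minimal sketch, not part of this library: shp_options is a hypothetical helper that only checks format and repair against the values listed above and joins column names into the comma-separated string the columns option expects. Whether the library matches these values case-sensitively is not stated here, so the sketch passes them through exactly as listed.

```python
# Hypothetical helper, not provided by the library. It only assembles the
# documented option names into a dict of strings.
SHAPE_FORMATS = {"SHP", "WKT", "WKB", "GEOJSON"}
REPAIR_MODES = {"None", "Esri", "OGC"}


def shp_options(path, shape=None, columns=None, format=None, repair=None):
    """Build reader options, rejecting values that are not documented."""
    if format is not None and format not in SHAPE_FORMATS:
        raise ValueError("format must be one of %s" % sorted(SHAPE_FORMATS))
    if repair is not None and repair not in REPAIR_MODES:
        raise ValueError("repair must be one of %s" % sorted(REPAIR_MODES))

    options = {"path": path}
    if shape is not None:
        options["shape"] = shape
    if columns:
        # The columns option is a single comma-separated string.
        options["columns"] = ",".join(columns)
    if format is not None:
        options["format"] = format
    if repair is not None:
        options["repair"] = repair
    return options


# Intended use with an existing SparkSession named `spark` (assumed):
# df = spark.read.format("shp").options(**shp_options("data/*.shp", format="WKT")).load()
```

Options that are left unset are omitted from the dict, so the defaults described above still apply.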

SQL API

CREATE TABLE gps
    USING com.esri.spark.shp
    OPTIONS
(
    path "data/gps.shp"
)

Python API

df = spark.read \
    .format("shp") \
    .options(path="data/gps.shp", columns="atext,adate", format="GEOJSON") \
    .load() \
    .cache()
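
With format="GEOJSON", each value in the shape column can be decoded with the standard json module. The sketch below assumes the column holds a standard GeoJSON Point geometry serialized as a string; the exact structure the library emits is not described here, so treat point_coordinates (a hypothetical helper) and the sample value as illustrations only.

```python
import json


def point_coordinates(geojson_text):
    """Return (x, y) from a GeoJSON Point geometry string."""
    geometry = json.loads(geojson_text)
    if geometry.get("type") != "Point":
        raise ValueError("expected a Point geometry, got %r" % geometry.get("type"))
    x, y = geometry["coordinates"][:2]
    return x, y


# Stand-in value for illustration. With Spark it would come from a row,
# e.g. df.first()["shape"], assuming the default shape column name.
sample = '{"type": "Point", "coordinates": [-77.03, 38.89]}'
print(point_coordinates(sample))
```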

Building From Source

This library is built using Apache Maven. To build the jar, execute the following command:

mvn clean install

Data

Download the shapefile of Metro Stations in DC

Create Conda Env

export ENV=spark-shp
conda remove --yes --all --name $ENV
conda create --yes --name $ENV python=3.6
source activate $ENV
conda install --yes --quiet -c conda-forge\
    jupyterlab\
    tqdm\
    future\
    matplotlib=3.1\
    gdal=2.4\
    pyproj=2.2\
    shapely=1.6\
    pyshp=2.1
