A library for parsing and querying shapefile data with Apache Spark, for Spark SQL and DataFrames.

mraad/spark-shp

Shapefile Data Source for Apache Spark

A library for parsing and querying shapefile data with Apache Spark, for Spark SQL and DataFrames.

Requirements

This library requires Spark 2.0+. You must also clone and install the Esri Geometry API for Java from https://github.com/Esri/geometry-api-java.git

Using with Spark shell

$SPARK_HOME/bin/spark-shell --packages com.esri:spark-shp:0.30

Features

This package reads shapefiles from a local or distributed filesystem into Spark DataFrames. When reading files, the API accepts several options:

  • path: The location of the shapefile(s). As in Spark, standard Hadoop globbing expressions are accepted.
  • shape: An optional name for the shape column. Default value is shape.
  • columns: An optional comma-separated list of attribute column names. Default value is blank, which selects all attribute fields.
  • format: An optional parameter that defines the output format of the shape field. Default value is SHP. Possible values include SHP and GEOJSON.
  • repair: An optional parameter to repair the geometry as it is read. Possible values are:
    • None: No repair.
    • Esri: Apply Esri repair operator.
    • OGC: Apply OGC repair operator.
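
The options above can be collected in one place before they are handed to the reader. The following is a minimal sketch, not part of spark-shp: `shp_options` and `REPAIR_VALUES` are hypothetical names, and the case-sensitive check on `repair` is an assumption based only on the three values listed above.

```python
# Hypothetical helper (not part of spark-shp): gathers the documented reader
# options into a dict and rejects repair values that this README does not list.
REPAIR_VALUES = {"None", "Esri", "OGC"}


def shp_options(path, shape=None, columns=None, fmt=None, repair=None):
    """Build keyword options for spark.read.format("shp").options(**opts)."""
    # Assumption: repair values are matched exactly as written in the README.
    if repair is not None and repair not in REPAIR_VALUES:
        raise ValueError(
            "repair must be one of {}, got {!r}".format(sorted(REPAIR_VALUES), repair)
        )
    opts = {"path": path}
    if shape is not None:
        opts["shape"] = shape
    if columns:
        # The columns option is a single comma-separated string.
        opts["columns"] = ",".join(columns)
    if fmt is not None:
        opts["format"] = fmt
    if repair is not None:
        opts["repair"] = repair
    return opts


opts = shp_options("data/*.shp", columns=["atext", "adate"], fmt="GEOJSON", repair="OGC")
print(opts)

# With a SparkSession named `spark` and the package on the classpath:
# df = spark.read.format("shp").options(**opts).load()
```

Options left as None are omitted from the dict, so the data source falls back to its own defaults.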

SQL API

CREATE TABLE gps
    USING com.esri.spark.shp
    OPTIONS
(
    path "data/gps.shp"
)

Python API

df = spark.read \
    .format("shp") \
    .options(path="data/gps.shp", columns="atext,adate", format="GEOJSON") \
    .load() \
    .cache()
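
Once the shape field is requested as GEOJSON, a value can be decoded with the standard library. This is a sketch under one assumption: each shape value is a GeoJSON geometry string. The sample string below is illustrative and is not real spark-shp output, and `point_xy` is a hypothetical helper.

```python
import json

# Illustrative value only. Assumes the shape column holds a GeoJSON geometry
# string when format="GEOJSON"; this is not taken from real spark-shp output.
sample_shape = '{"type": "Point", "coordinates": [-117.19, 34.05]}'


def point_xy(geojson_text):
    """Return (x, y) for a GeoJSON Point, or None for any other geometry type."""
    geom = json.loads(geojson_text)
    if geom.get("type") != "Point":
        return None
    x, y = geom["coordinates"][:2]
    return x, y


print(point_xy(sample_shape))
```

A function like this could be wrapped with `pyspark.sql.functions.udf` to run on the shape column, provided the assumption about the column contents holds for your data.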

Building From Source

This library is built using Apache Maven. To build the jar, execute the following command:

mvn clean install

Data

Create Conda Env

export ENV=spark-shp
conda remove --yes --all --name $ENV
conda create --yes --name $ENV python=3.6
source activate $ENV
conda install --yes --quiet -c conda-forge \
    jupyterlab \
    tqdm \
    future \
    matplotlib=3.1 \
    gdal=2.4 \
    pyproj=2.2 \
    shapely=1.6 \
    pyshp=2.1
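
After activating the environment, a quick check can confirm that the packages are importable. The sketch below is an optional sanity check, not part of this repository; `IMPORT_NAMES` and `missing_packages` are hypothetical names. It only looks up each module, without importing it.

```python
from importlib.util import find_spec

# Conda package name -> Python import name. gdal and pyshp install modules
# whose names differ from the package names.
IMPORT_NAMES = {
    "jupyterlab": "jupyterlab",
    "tqdm": "tqdm",
    "future": "future",
    "matplotlib": "matplotlib",
    "gdal": "osgeo",
    "pyproj": "pyproj",
    "shapely": "shapely",
    "pyshp": "shapefile",
}


def missing_packages(import_names=IMPORT_NAMES):
    """Return the sorted conda package names whose module cannot be found."""
    return sorted(pkg for pkg, mod in import_names.items() if find_spec(mod) is None)


missing = missing_packages()
print("missing:", missing if missing else "none")
```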
