Commit 0014122

Merge branch 'databrickslabs:main' into main

a0x8o authored May 14, 2024
2 parents 8868a1f + 66a0bc1

Showing 29 changed files with 444 additions and 480 deletions.
5 changes: 2 additions & 3 deletions .github/workflows/build_main.yml
@@ -28,8 +28,7 @@ jobs:
uses: ./.github/actions/scala_build
- name: build python
uses: ./.github/actions/python_build
# CRAN FLAKY (502 'Bad Gateway' ERRORS)
# - name: build R
# uses: ./.github/actions/r_build
- name: build R
uses: ./.github/actions/r_build
- name: upload artefacts
uses: ./.github/actions/upload_artefacts
10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,13 @@
## v0.4.2 [DBR 13.3 LTS]
- Geopandas is now pinned to "<0.14.4,>=0.14" due to a conflict with the minimum numpy version required by geopandas 0.14.4.
- The H3 python requirement changed from "==3.7.0" to "<4.0,>=3.7" to pick up patches.
- Fixed an issue with fallback logic when deserializing subdatasets from a zip.
- Adjusted data used to speed up a long-running test.
- Streamlined setup_gdal and setup_fuse_install:
  - init script and resource copy logic now pinned to repo "main" (.so) / "latest" (.jar).
  - added apt-get lock handling for native installs.
  - removed support for the UbuntuGIS PPA, as its GDAL version is no longer compatible with the jammy default (3.4.x).
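
For illustration only, the pins above translate to pip specifiers roughly like this (a sketch; the authoritative constraints live in the package's setup files):

```shell
pip install 'geopandas<0.14.4,>=0.14' 'h3<4.0,>=3.7' databricks-mosaic
```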

## v0.4.1 [DBR 13.3 LTS]
- Fixed python bindings for MosaicAnalyzer functions.
- Added tiler functions, ST_AsGeoJSONTile and ST_AsMVTTile, for creating GeoJSON and MVT tiles as aggregations of geometries.
5 changes: 5 additions & 0 deletions CONTRIBUTING.md
@@ -83,6 +83,9 @@ The repository is structured as follows:

## Test & build Mosaic

Given that DBR 13.3 is based on Ubuntu 22.04, we recommend using docker;
see [mosaic-docker.sh](https://github.com/databrickslabs/mosaic/blob/main/scripts/mosaic-docker.sh).
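
A minimal invocation might look like this (a sketch; the script's actual flags and behavior are defined in the repo):

```shell
# From the repo root: launch the dev container matching DBR 13.3's Ubuntu 22.04 base
sh scripts/mosaic-docker.sh
```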

### Scala JAR

We use the [Maven](https://maven.apache.org/install.html) build tool to manage and build the Mosaic scala project.
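
As a hedged sketch (standard Maven usage, not project-specific flags), a local build typically looks like:

```shell
# Compile the Scala project and run its tests
mvn clean install
# Or package without running the test suite
mvn clean install -DskipTests
```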
@@ -115,6 +118,8 @@ To build the docs:
- Install the pandoc library (follow the instructions for your platform [here](https://pandoc.org/installing.html)).
- Install the python requirements from `docs/docs-requirements.txt`.
- Build the HTML documentation by running `make html` from `docs/` (see the combined sketch after this list).
- For nbconvert you may have to symlink your jupyter share folder,
e.g. `sudo ln -s /opt/homebrew/share/jupyter /usr/local/share`.
- You can locally host the docs by running the `reload.py` script in the `docs/source/` directory.
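
Putting the list above together, a minimal end-to-end sketch (the Homebrew path is an assumption for macOS setups; pandoc must already be installed):

```shell
pip install -r docs/docs-requirements.txt
# Optional: only if nbconvert cannot locate jupyter assets (macOS/Homebrew example)
sudo ln -s /opt/homebrew/share/jupyter /usr/local/share
cd docs && make html
```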

## Style
2 changes: 1 addition & 1 deletion R/sparkR-mosaic/sparkrMosaic/DESCRIPTION
@@ -1,6 +1,6 @@
Package: sparkrMosaic
Title: SparkR bindings for Databricks Mosaic
Version: 0.4.1
Version: 0.4.2
Authors@R:
person("Robert", "Whiffin", , "[email protected]", role = c("aut", "cre")
)
2 changes: 1 addition & 1 deletion R/sparklyr-mosaic/sparklyrMosaic/DESCRIPTION
@@ -1,6 +1,6 @@
Package: sparklyrMosaic
Title: sparklyr bindings for Databricks Mosaic
Version: 0.4.1
Version: 0.4.2
Authors@R:
person("Robert", "Whiffin", , "[email protected]", role = c("aut", "cre")
)
2 changes: 1 addition & 1 deletion R/sparklyr-mosaic/tests.R
@@ -9,7 +9,7 @@ library(sparklyr.nested)
spark_home <- Sys.getenv("SPARK_HOME")
spark_home_set(spark_home)

install.packages("sparklyrMosaic_0.4.1.tar.gz", repos = NULL)
install.packages("sparklyrMosaic_0.4.2.tar.gz", repos = NULL)
library(sparklyrMosaic)

# find the mosaic jar in staging
26 changes: 18 additions & 8 deletions README.md
@@ -8,7 +8,6 @@ An extension to the [Apache Spark](https://spark.apache.org/) framework that all
[![codecov](https://codecov.io/gh/databrickslabs/mosaic/branch/main/graph/badge.svg?token=aEzZ8ITxdg)](https://codecov.io/gh/databrickslabs/mosaic)
[![build](https://github.com/databrickslabs/mosaic/actions/workflows/build_main.yml/badge.svg)](https://github.com/databrickslabs/mosaic/actions?query=workflow%3A%22build+main%22)
[![docs](https://github.com/databrickslabs/mosaic/actions/workflows/docs.yml/badge.svg)](https://github.com/databrickslabs/mosaic/actions/workflows/docs.yml)
[![Language grade: Python](https://img.shields.io/lgtm/grade/python/g/databrickslabs/mosaic.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/databrickslabs/mosaic/context:python)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![lines of code](https://tokei.rs/b1/github/databrickslabs/mosaic)](https://github.com/databrickslabs/mosaic)

@@ -33,7 +32,8 @@ The supported languages are Scala, Python, R, and SQL.

## How does it work?

The Mosaic library is written in Scala (JVM) to guarantee maximum performance with Spark and when possible, it uses code generation to give an extra performance boost.
The Mosaic library is written in Scala (JVM) to guarantee maximum performance with Spark and when possible,
it uses code generation to give an extra performance boost.

__The other supported languages (Python, R and SQL) are thin wrappers around the Scala (JVM) code.__

@@ -42,6 +42,13 @@ Image1: Mosaic logical design.

## Getting started

:warning: **geopandas 0.14.4 not supported**

For Mosaic <= 0.4.1, `%pip install databricks-mosaic` will no longer install "as-is" on DBRs because Mosaic
left geopandas unpinned in those versions. With geopandas 0.14.4, its numpy dependency conflicts with the limits of
scikit-learn in DBRs. The workaround is `%pip install geopandas==0.14.3 databricks-mosaic`.
Mosaic 0.4.2+ limits the geopandas version.
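
For convenience, the workaround above as a single notebook cell:

```shell
%pip install geopandas==0.14.3 databricks-mosaic
```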

### Mosaic 0.4.x Series [Latest]

We recommend using Databricks Runtime version 13.3 LTS with Photon enabled.
@@ -56,18 +63,21 @@ We recommend using Databricks Runtime versions 13.3 LTS with Photon enabled.
__Language Bindings__

As of Mosaic 0.4.0 (subject to change in follow-on releases)...
As of Mosaic 0.4.0 / DBR 13.3 LTS (subject to change in follow-on releases)...

* [Assigned Clusters](https://docs.databricks.com/en/compute/configure.html#access-modes): Mosaic Python, SQL, R, and Scala APIs.
* [Shared Access Clusters](https://docs.databricks.com/en/compute/configure.html#access-modes): Mosaic Scala API (JVM) with Admin [allowlisting](https://docs.databricks.com/en/data-governance/unity-catalog/manage-privileges/allowlist.html); _Python bindings to Mosaic Scala APIs are blocked by Py4J Security on Shared Access Clusters._
* [Assigned Clusters](https://docs.databricks.com/en/compute/configure.html#access-modes)
* Mosaic Python, SQL, R, and Scala APIs.
* [Shared Access Clusters](https://docs.databricks.com/en/compute/configure.html#access-modes)
* Mosaic Scala API (JVM) with Admin [allowlisting](https://docs.databricks.com/en/data-governance/unity-catalog/manage-privileges/allowlist.html).
* Mosaic Python bindings (to Mosaic Scala APIs) are blocked by Py4J Security on Shared Access Clusters.
* Mosaic SQL expressions cannot yet be registered with [Unity Catalog](https://www.databricks.com/product/unity-catalog) due to API changes affecting DBRs >= 13, more [here](https://docs.databricks.com/en/udf/index.html).

__Additional Notes:__

As of Mosaic 0.4.0 (subject to change in follow-on releases)...
Mosaic is a custom JVM library that extends Spark, which has the following implications in DBR 13.3 LTS:

1. [Unity Catalog](https://www.databricks.com/product/unity-catalog): Enforces process isolation which is difficult to accomplish with custom JVM libraries; as such only built-in (aka platform provided) JVM APIs can be invoked from other supported languages in Shared Access Clusters.
2. [Volumes](https://docs.databricks.com/en/connect/unity-catalog/volumes.html): Along the same principle of isolation, clusters (both assigned and shared access) can read Volumes via relevant built-in readers and writers or via custom python calls which do not involve any custom JVM code.
2. [Volumes](https://docs.databricks.com/en/connect/unity-catalog/volumes.html): Along the same principle of isolation, clusters can read Volumes via relevant built-in (aka platform provided) readers and writers or via custom python calls which do not involve any custom JVM code.

### Mosaic 0.3.x Series

@@ -142,7 +152,7 @@ import com.databricks.labs.mosaic.JTS
val mosaicContext = MosaicContext.build(H3, JTS)
mosaicContext.register(spark)
```
__Note: Mosaic 0.4.x SQL bindings for DBR 13 can register with Assigned clusters (as Hive UDFs), but not Shared Access due to API changes, more [here](https://docs.databricks.com/en/udf/index.html).__
__Note: Mosaic 0.4.x SQL bindings for DBR 13 can register with Assigned clusters (as Spark Expressions), but not Shared Access due to API changes, more [here](https://docs.databricks.com/en/udf/index.html).__

## Examples

6 changes: 3 additions & 3 deletions docs/source/conf.py
@@ -18,11 +18,11 @@
# -- Project information -----------------------------------------------------

project = 'Mosaic'
copyright = '2022, Databricks Inc'
author = 'Stuart Lynn, Milos Colic, Erni Durdevic, Robert Whiffin, Timo Roest'
copyright = '2024, Databricks Inc'
author = 'Milos Colic, Stuart Lynn, Michael Johns, Robert Whiffin'

# The full version, including alpha/beta/rc tags
release = "v0.4.1"
release = "v0.4.2"


# -- General configuration ---------------------------------------------------
82 changes: 37 additions & 45 deletions docs/source/index.rst
@@ -37,84 +37,73 @@
:target: https://github.com/databrickslabs/mosaic/actions/workflows/docs.yml
:alt: Mosaic sphinx docs

.. image:: https://img.shields.io/lgtm/grade/python/g/databrickslabs/mosaic.svg?logo=lgtm&logoWidth=18
:target: https://lgtm.com/projects/g/databrickslabs/mosaic/context:python
:alt: Language grade: Python

.. image:: https://img.shields.io/badge/code%20style-black-000000.svg
:target: https://github.com/psf/black
:alt: Code style: black



Mosaic is an extension to the `Apache Spark <https://spark.apache.org/>`_ framework that allows easy and fast processing of very large geospatial datasets.

We currently recommend using Databricks Runtime with Photon enabled;
this will leverage the Databricks H3 expressions when using H3 grid system.

Mosaic provides:

* easy conversion between common spatial data encodings (WKT, WKB and GeoJSON);
* constructors to easily generate new geometries from Spark native data types;
* many of the OGC SQL standard :code:`ST_` functions implemented as Spark Expressions for transforming, aggregating and joining spatial datasets;
* high performance through implementation of Spark code generation within the core Mosaic functions;
* optimisations for performing point-in-polygon joins using an approach we co-developed with Ordnance Survey (`blog post <https://databricks.com/blog/2021/10/11/efficient-point-in-polygon-joins-via-pyspark-and-bng-geospatial-indexing.html>`_); and
* the choice of a Scala, SQL and Python API.
| Mosaic is an extension to the `Apache Spark <https://spark.apache.org/>`_ framework for fast + easy processing
of very large geospatial datasets. It provides:
|
| [1] The choice of Scala, SQL and Python language bindings (written in Scala).
| [2] Raster and Vector APIs.
| [3] Easy conversion between common spatial data encodings (WKT, WKB and GeoJSON).
| [4] Constructors to easily generate new geometries from Spark native data types.
| [5] Many of the OGC SQL standard :code:`ST_` functions implemented as Spark Expressions for transforming,
| aggregating and joining spatial datasets.
| [6] High performance through implementation of Spark code generation within the core Mosaic functions.
| [7] Performing point-in-polygon joins using an approach we co-developed with Ordnance Survey
(`blog post <https://databricks.com/blog/2021/10/11/efficient-point-in-polygon-joins-via-pyspark-and-bng-geospatial-indexing.html>`_).
.. note::
For Mosaic versions < 0.4 please use the `0.3 docs <https://databrickslabs.github.io/mosaic/v0.3.x/index.html>`_.

.. warning::
At times, it is useful to "hard refresh" pages to ensure your cached local version matches the latest live,
more `here <https://www.howtogeek.com/672607/how-to-hard-refresh-your-web-browser-to-bypass-your-cache/>`_.
We recommend using Databricks Runtime with Photon enabled to leverage the Databricks H3 expressions.

Version 0.4.x Series
====================

We recommend using Databricks Runtime versions 13.3 LTS with Photon enabled.
.. warning::
For Mosaic <= 0.4.1, :code:`%pip install databricks-mosaic` will no longer install "as-is" on DBRs because Mosaic
left geopandas unpinned in those versions. With geopandas 0.14.4, its numpy dependency conflicts with the limits of
scikit-learn in DBRs. The workaround is :code:`%pip install geopandas==0.14.3 databricks-mosaic`.
Mosaic 0.4.2+ limits the geopandas version.

Mosaic 0.4.x series only supports DBR 13.x. If run on a different DBR, it will throw an exception:

DEPRECATION ERROR: Mosaic v0.4.x series only supports Databricks Runtime 13.
You can specify `%pip install 'databricks-mosaic<0.4,>=0.3'` for DBR < 13.
DEPRECATION ERROR: Mosaic v0.4.x series only supports Databricks Runtime 13.
You can specify :code:`%pip install 'databricks-mosaic<0.4,>=0.3'` for DBR < 13.

Mosaic 0.4.x series issues an ERROR on standard, non-Photon clusters `ADB <https://learn.microsoft.com/en-us/azure/databricks/runtime/>`_ |
`AWS <https://docs.databricks.com/runtime/index.html/>`_ |
`GCP <https://docs.gcp.databricks.com/runtime/index.html/>`_:

DEPRECATION ERROR: Please use a Databricks Photon-enabled Runtime for performance benefits or Runtime ML for
spatial AI benefits; Mosaic 0.4.x series restricts executing this cluster.

As of Mosaic 0.4.0 (subject to change in follow-on releases)
DEPRECATION ERROR: Please use a Databricks Photon-enabled Runtime for performance benefits or Runtime ML for
spatial AI benefits; Mosaic 0.4.x series restricts executing this cluster.

* `Assigned Clusters <https://docs.databricks.com/en/compute/configure.html#access-modes>`_: Mosaic Python, SQL, R, and Scala APIs.
* `Shared Access Clusters <https://docs.databricks.com/en/compute/configure.html#access-modes>`_: Mosaic Scala API (JVM) with
Admin `allowlisting <https://docs.databricks.com/en/data-governance/unity-catalog/manage-privileges/allowlist.html>`_;
Python bindings to Mosaic Scala APIs are blocked by Py4J Security on Shared Access Clusters.
As of Mosaic 0.4.0 / DBR 13.3 LTS (subject to change in follow-on releases):

.. warning::
Mosaic 0.4.x SQL bindings for DBR 13 can register with Assigned clusters (as Hive UDFs), but not Shared Access due
to `Unity Catalog <https://www.databricks.com/product/unity-catalog>`_ API changes, more `here <https://docs.databricks.com/en/udf/index.html>`_.
* `Assigned Clusters <https://docs.databricks.com/en/compute/configure.html#access-modes>`_
* Mosaic Python, SQL, R, and Scala APIs.
* `Shared Access Clusters <https://docs.databricks.com/en/compute/configure.html#access-modes>`_
* Mosaic Scala API (JVM) with Admin `allowlisting <https://docs.databricks.com/en/data-governance/unity-catalog/manage-privileges/allowlist.html>`_.
* Mosaic Python bindings (to Mosaic Scala APIs) are blocked by Py4J Security on Shared Access Clusters.
* Mosaic SQL expressions cannot yet be registered due to `Unity Catalog <https://www.databricks.com/product/unity-catalog>`_
API changes, more `here <https://docs.databricks.com/en/udf/index.html>`_.

.. note::
As of Mosaic 0.4.0 (subject to change in follow-on releases)
Mosaic is a custom JVM library that extends Spark, which has the following implications in DBR 13.3 LTS:

* `Unity Catalog <https://www.databricks.com/product/unity-catalog>`_ enforces process isolation which is difficult
to accomplish with custom JVM libraries; as such only built-in (aka platform provided) JVM APIs can be invoked from
other supported languages in Shared Access Clusters.
* Along the same principle of isolation, clusters (both Assigned and Shared Access) can read
`Volumes <https://docs.databricks.com/en/connect/unity-catalog/volumes.html>`_ via relevant built-in readers and
writers or via custom python calls which do not involve any custom JVM code.
* Clusters can read `Volumes <https://docs.databricks.com/en/connect/unity-catalog/volumes.html>`_ via relevant
built-in (aka platform provided) readers and writers or via custom python calls which do not involve any custom JVM code.


Version 0.3.x Series
====================

We recommend using Databricks Runtime version 12.2 LTS with Photon enabled.
For Mosaic versions < 0.4.0 please use the `0.3.x docs <https://databrickslabs.github.io/mosaic/v0.3.x/index.html>`_.

.. warning::
Mosaic 0.3.x series does not support DBR 13.x.

As of the 0.3.11 release, Mosaic issues the following WARNING when initialized on a cluster that is neither Photon Runtime
nor Databricks Runtime ML `ADB <https://learn.microsoft.com/en-us/azure/databricks/runtime/>`_ |
`AWS <https://docs.databricks.com/runtime/index.html/>`_ |
@@ -128,6 +117,9 @@ making this change is that we are streamlining Mosaic internals to be more align
powered by Photon. Along this direction of change, Mosaic has standardized to JTS as its default and supported Vector
Geometry Provider.

.. note::
For Mosaic versions < 0.4 please use the `0.3 docs <https://databrickslabs.github.io/mosaic/v0.3.x/index.html>`_.


Documentation
=============
2 changes: 1 addition & 1 deletion docs/source/usage/automatic-sql-registration.rst
@@ -11,7 +11,7 @@ to your Spark / Databricks cluster to perform spatial queries or integrating Spa
with a geospatial middleware component such as `GeoServer <https://geoserver.org/>`_.

.. warning::
Mosaic 0.4.x SQL bindings for DBR 13 can register with Assigned clusters (as Hive UDFs), but not Shared Access due
Mosaic 0.4.x SQL bindings for DBR 13 can register with Assigned clusters (as Spark Expressions), but not Shared Access due
to `Unity Catalog <https://www.databricks.com/product/unity-catalog>`_ API changes, more `here <https://docs.databricks.com/en/udf/index.html>`_.

Pre-requisites
Expand Down