Circular dependency for installing GDAL using mosaic.setup_gdal() #524

Open
smartkiwi opened this issue Jan 26, 2024 · 8 comments

@smartkiwi

The GDAL installation helper is not usable as part of the mosaic library.
The mosaic.setup_gdal() helper currently requires GDAL to already be installed.
This makes it difficult for users to get started.

Versions:

  • DBR - 13.3
  • mosaic - 0.4.0
  • GDAL =

The install-GDAL documentation doesn't work because of this:
https://github.com/databrickslabs/mosaic/blob/main/docs/source/usage/install-gdal.rst

To Reproduce
Steps to reproduce the behavior:
Running %pip install databricks-mosaic in a Databricks Notebook on vanilla DBR 13.3 fails with an error that GDAL is not found.

Expected behavior
The documentation and tooling should be improved so that users can install GDAL first without having to install mosaic.
Alternatively, there should be a way to install the mosaic library without its GDAL dependencies, so that users can run the mosaic.setup_gdal function.
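One way to surface the chicken-and-egg problem up front is to check for the native GDAL tooling before attempting the pip install. A minimal, stdlib-only sketch (the helper name is hypothetical, not part of mosaic):

```python
import shutil

def native_gdal_present(tool: str = "gdal-config") -> bool:
    """Hypothetical pre-flight check: `pip install databricks-mosaic` pulls in
    fiona, whose source build shells out to `gdal-config`. If that binary is
    not on the PATH, the install fails before mosaic (and setup_gdal) exist."""
    return shutil.which(tool) is not None
```

On a vanilla DBR 13.3 node this would return False, which is exactly why the pip install aborts before mos.setup_gdal() can ever run.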

@mjohns-databricks
Contributor

These are the instructions - https://databrickslabs.github.io/mosaic/usage/install-gdal.html; it is not circular.

@smartkiwi
Author

Maybe the subject line wasn't the best; I've updated it.

Let me clarify the problem I face with DBR 13.3.
Currently, https://databrickslabs.github.io/mosaic/usage/install-gdal.html says to run the following steps to install GDAL on the worker nodes.

But the instructions do not explain how to install mosaic on the driver node.

On DBR 13.3 (which ships without the GDAL library), the user cannot install mosaic, and therefore cannot run mos.setup_gdal():

import mosaic as mos

mos.enable_mosaic(spark, dbutils)
mos.setup_gdal()

@arr175

arr175 commented Aug 22, 2024

@smartkiwi did you ever figure this out? I'm stuck at the same location.

@mjohns-databricks
Contributor

Again, not a circular dependency. The following is what the docs are conveying:

  1. %pip install databricks-mosaic
  2. (1x setup)

import mosaic as mos

mos.enable_mosaic(spark, dbutils)
mos.setup_gdal()

  3. add the generated init script path to your cluster and restart your cluster
  4. (after restart)

import mosaic as mos

mos.enable_mosaic(spark, dbutils)
mos.enable_gdal(spark)

Providing the signature of setup_gdal under gdal.py to further demystify:

def setup_gdal(
        to_fuse_dir: str = "/Workspace/Shared/geospatial/mosaic/gdal/jammy/0.4.2",
        script_out_name: str = "mosaic-gdal-init.sh",
        jni_so_copy: bool = False,
        test_mode: bool = False
) -> bool:
    """
    Prepare GDAL init script and shared objects required for GDAL to run on spark.
    This function will generate the init script that will install GDAL on each worker node.
    After the setup_gdal is run, the init script must be added to the cluster; also,
    a cluster restart is required.

    Notes:
      (a) This is close in behavior to Mosaic < 0.4 series (prior to DBR 13),
          now using jammy default (3.4.1)
      (b) `to_fuse_dir` can be one of `/Volumes/..`, `/Workspace/..`, `/dbfs/..`;
           however, you should use `setup_fuse_install()` for Volume based installs

    Parameters
    ----------
    to_fuse_dir : str
            Path to write out the init script for GDAL installation;
            default is '/Workspace/Shared/geospatial/mosaic/gdal/jammy/0.4.2'.
    script_out_name : str
            name of the script to be written;
            default is 'mosaic-gdal-init.sh'.
    jni_so_copy : bool
            if True, copy shared object to fuse dir and config script to use;
            default is False
    test_mode : bool
            Only for unit tests.

    Returns
    -------
    True unless resources fail to download.
    """
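Per the docstring above, the helper's job is essentially to render an init script into the fuse directory for the cluster to pick up. A rough stdlib-only sketch of that behavior (illustrative only, not the real mosaic implementation; the package list is abbreviated):

```python
import os

def write_init_script(to_fuse_dir: str,
                      script_out_name: str = "mosaic-gdal-init.sh") -> str:
    """Illustrative sketch only -- not mosaic's actual code. Renders a minimal
    GDAL init script and writes it to the chosen fuse directory, returning
    the path to add to the cluster configuration before restarting."""
    script = "\n".join([
        "#!/bin/bash",
        "apt-get update -y",
        "apt-get install -y gdal-bin libgdal-dev python3-gdal",
        "pip install gdal==3.4.1",
    ]) + "\n"
    os.makedirs(to_fuse_dir, exist_ok=True)
    path = os.path.join(to_fuse_dir, script_out_name)
    with open(path, "w") as f:
        f.write(script)
    return path
```

This also makes clear why a cluster restart is required: the script only takes effect when the cluster runs it as an init script at startup.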

@mjohns-databricks
Contributor

If you are running on a "Single Node" spark instance (vs a cluster) and do not want to set up an init script, just manually run the contents of the generated script in a cell in your notebook, from here, e.g. something like the following (you are root when running in the notebook, so no sudo):

%sh
apt-add-repository -y "deb http://archive.ubuntu.com/ubuntu $(lsb_release -sc)-backports main universe multiverse restricted"
apt-add-repository -y "deb http://archive.ubuntu.com/ubuntu $(lsb_release -sc)-updates main universe multiverse restricted"
apt-add-repository -y "deb http://archive.ubuntu.com/ubuntu $(lsb_release -sc)-security main multiverse restricted universe"
apt-add-repository -y "deb http://archive.ubuntu.com/ubuntu $(lsb_release -sc) main multiverse restricted universe"
apt-get update -y

apt-get -o DPkg::Lock::Timeout=-1 install -y unixodbc libcurl3-gnutls libsnappy-dev libopenjp2-7
apt-get -o DPkg::Lock::Timeout=-1 install -y gdal-bin libgdal-dev python3-numpy python3-gdal

pip install --upgrade pip
pip install gdal==3.4.1

GITHUB_REPO_PATH=databrickslabs/mosaic/main/resources/gdal/jammy
wget -nv -P /usr/lib -nc https://raw.githubusercontent.com/$GITHUB_REPO_PATH/libgdalalljni.so
wget -nv -P /usr/lib -nc https://raw.githubusercontent.com/$GITHUB_REPO_PATH/libgdalalljni.so.30
wget -nv -P /usr/lib -nc https://raw.githubusercontent.com/$GITHUB_REPO_PATH/libgdalalljni.so.30.0.3

Then no cluster restart is needed before mos.enable_gdal(spark) ...

@arr175

arr175 commented Aug 23, 2024

Hi Michael,

Thanks for sharing this so quickly. The issue I'm having is on the first step, %pip install databricks-mosaic: it does not install mosaic due to missing GDAL. This is on a new DBR 13.3 shared compute. If there's a specific setting I need to request from our IT team, please let me know. See below for the error I'm getting on the first step.

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
Collecting databricks-mosaic
  Downloading databricks_mosaic-0.4.2-py3-none-any.whl.metadata (828 bytes)
Collecting geopandas<0.14.4,>=0.14 (from databricks-mosaic)
  Downloading geopandas-0.14.3-py3-none-any.whl.metadata (1.5 kB)
Collecting h3<4.0,>=3.7 (from databricks-mosaic)
  Downloading h3-3.7.7-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.metadata (4.9 kB)
Requirement already satisfied: ipython>=7.22.0 in /databricks/python3/lib/python3.10/site-packages (from databricks-mosaic) (8.10.0)
Collecting keplergl==0.3.2 (from databricks-mosaic)
  Downloading keplergl-0.3.2.tar.gz (9.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.7/9.7 MB 74.0 MB/s eta 0:00:00
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting pyspark<3.5,>=3.4 (from databricks-mosaic)
  Downloading pyspark-3.4.3.tar.gz (311.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 311.4/311.4 MB 49.2 MB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Requirement already satisfied: ipywidgets<8,>=7.0.0 in /databricks/python3/lib/python3.10/site-packages (from keplergl==0.3.2->databricks-mosaic) (7.7.2)
Collecting traittypes>=0.2.1 (from keplergl==0.3.2->databricks-mosaic)
  Downloading traittypes-0.2.1-py2.py3-none-any.whl.metadata (1.0 kB)
Requirement already satisfied: pandas>=0.23.0 in /databricks/python3/lib/python3.10/site-packages (from keplergl==0.3.2->databricks-mosaic) (1.4.4)
Collecting Shapely>=1.6.4.post2 (from keplergl==0.3.2->databricks-mosaic)
  Downloading shapely-2.0.6-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.metadata (7.0 kB)
Collecting fiona>=1.8.21 (from geopandas<0.14.4,>=0.14->databricks-mosaic)
  Downloading fiona-1.9.6.tar.gz (411 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'error'
  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [3 lines of output]
      <string>:86: DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
      WARNING:root:Failed to get options via gdal-config: [Errno 2] No such file or directory: 'gdal-config'
      CRITICAL:root:A GDAL API version must be specified. Provide a path to gdal-config using a GDAL_CONFIG environment variable or use a GDAL_VERSION environment variable.
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
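The CRITICAL line above comes from fiona's source build, which (per its own error text) looks for `gdal-config` and falls back to the GDAL_CONFIG and GDAL_VERSION environment variables. A rough sketch of that discovery order (illustrative, not fiona's actual code):

```python
import os
import shutil
from typing import Optional

def resolve_gdal_version() -> Optional[str]:
    """Illustrative sketch of the discovery order implied by the error above:
    1) GDAL_CONFIG pointing at a gdal-config binary, 2) gdal-config on PATH,
    3) a GDAL_VERSION environment variable. None means the build aborts."""
    cfg = os.environ.get("GDAL_CONFIG") or shutil.which("gdal-config")
    if cfg:
        # fiona would run `<cfg> --version` here; omitted in this sketch
        return "from gdal-config"
    return os.environ.get("GDAL_VERSION")
```

This is why the error only appears when pip falls back to building fiona from source: a prebuilt wheel never runs this check.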

@mjohns-databricks
Contributor

Referencing the Installation Guide, please run on an Assigned Cluster and see if that clears up your issue. Also refer to the pending 0.4.3 release (PR #568) for any additional python library version pinning that may now be required on DBR 13.3 (notably, we are going to specify a range for numpy, as version 2.0 is no longer compatible with the installed scikit-learn version).

@arr175

arr175 commented Sep 5, 2024

@mjohns-databricks thanks for recommending running on an Assigned Cluster. Initially it didn't work either, but after running

%sh
sudo apt update
sudo apt install -y cmake libgdal-dev

we were able to run %pip install databricks-mosaic followed by mos.setup_gdal(). Now we're up and running on DBR 13.3.

Thanks again.
