Commit 1411aab: Release 1.7.0
Merge pull request #308 from sentinel-hub/develop
zigaLuksic authored Nov 22, 2023
2 parents dd13783 + 1d1b9ff
Showing 75 changed files with 550 additions and 2,835 deletions.
22 changes: 22 additions & 0 deletions .github/workflows/ci_action.yml
@@ -90,6 +90,11 @@ jobs:
pip install -e .[DEV,ML]
pip install gdal==$(gdal-config --version)
- name: Set up local cluster # we need to install async-timeout until ray 2.9.0 fixes the issue
run: |
pip install async-timeout
ray start --head
- name: Run fast tests
if: ${{ !matrix.full_test_suite }}
run: pytest -m "not integration"
@@ -113,3 +118,20 @@ jobs:
files: coverage.xml
fail_ci_if_error: true
verbose: false

mirror-to-gitlab:
if: github.event_name == 'push'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v1
- name: Mirror + trigger CI
uses: SvanBoxel/gitlab-mirror-and-ci-action@master
with:
args: "https://git.sinergise.com/eo/code/eo-grow"
env:
FOLLOW_TAGS: "true"
GITLAB_HOSTNAME: "git.sinergise.com"
GITLAB_USERNAME: "github-action"
GITLAB_PASSWORD: ${{ secrets.GITLAB_PASSWORD }}
GITLAB_PROJECT_ID: "878"
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
30 changes: 10 additions & 20 deletions .github/workflows/ci_trigger.yml
@@ -1,29 +1,19 @@
-name: mirror_and_trigger
+name: trigger

on:
-  pull_request:
-  push:
-    branches:
-      - "master"
-      - "develop"
-  workflow_call:
release:
types:
- published

jobs:
-  mirror-to-gitlab:
+  trigger:
runs-on: ubuntu-latest
steps:
-      - uses: actions/checkout@v1
-      - name: Mirror + trigger CI
-        uses: SvanBoxel/gitlab-mirror-and-ci-action@master
-        with:
-          args: "https://git.sinergise.com/eo/code/eo-grow"
-        env:
-          FOLLOW_TAGS: "true"
-          GITLAB_HOSTNAME: "git.sinergise.com"
-          GITLAB_USERNAME: "github-action"
-          GITLAB_PASSWORD: ${{ secrets.GITLAB_PASSWORD }}
-          GITLAB_PROJECT_ID: "878"
-          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+      - name: Trigger API
+        run: >
+          curl -X POST --fail \
+          -F token=${{ secrets.GITLAB_PIPELINE_TRIGGER_TOKEN }} \
+          -F ref=main \
+          -F variables[CUSTOM_RUN_TAG]=auto \
+          -F variables[LAYER_NAME]=dotai-eo \
+          https://git.sinergise.com/api/v4/projects/1031/trigger/pipeline
18 changes: 0 additions & 18 deletions .gitlab-ci.yml

This file was deleted.

6 changes: 3 additions & 3 deletions .pre-commit-config.yaml
@@ -13,20 +13,20 @@ repos:
- id: debug-statements

- repo: https://github.com/pre-commit/mirrors-prettier
rev: "v3.0.3"
rev: "v3.1.0"
hooks:
- id: prettier
exclude: "tests/(test_stats|test_project)/"
types_or: [json]

- repo: https://github.com/psf/black
-    rev: 23.10.1
+    rev: 23.11.0
hooks:
- id: black
language_version: python3

- repo: https://github.com/charliermarsh/ruff-pre-commit
rev: "v0.1.4"
rev: "v0.1.6"
hooks:
- id: ruff

11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,14 @@
## [Version 1.7.0] - 2023-11-22
With this release we push `eo-grow` towards a more `ray`-centered execution model.

- The local `EOExecutor` execution modes with multiprocessing/multithreading have been removed. (Most) pipelines no longer have the `use_ray` and `workers` parameters. In order to run instances locally one has to set up a local cluster (via `ray start --head`). We included a `debug` parameter that uses `EOExecutor` instead of `RayExecutor` so that IDE breakpoints work in most pipelines.
- Pipeline chain configs have been adjusted. The user can now specify what kind of resources the main pipeline process requires. This also allows one to run pipelines entirely on worker instances.
- The `ray_worker_type` field was replaced with `worker_resources`, which allows for precise resource-request specifications.
- Fixed a bug where CLI variables were not applied to config chains.
- Removed `TestPipeline` and the `eogrow-test` command.
- Some `ValueError` exceptions were changed to `TypeError`.


## [Version 1.6.3] - 2023-11-07

- Pipelines can request a specific type of worker when run on a ray cluster with the `ray_worker_type` field.
61 changes: 45 additions & 16 deletions docs/source/common-configuration-patterns.md
@@ -9,8 +9,6 @@ Invoking `eogrow-template "eogrow.pipelines.zipmap.ZipMapPipeline" "zipmap.json"`
{
"pipeline": "eogrow.pipelines.zipmap.ZipMapPipeline",
"pipeline_name": "<< Optional[str] >>",
"workers": "<< 1 : int >>",
"use_ray": "<< 'auto' : Union[Literal['auto'], bool] >>",
"input_features": {
"<< type >>": "List[InputFeatureSchema]",
"<< nested schema >>": "<class 'eogrow.pipelines.zipmap.InputFeatureSchema'>",
@@ -104,11 +102,11 @@ In certain use cases we have multiple pipelines that are meant to be run in a certain order.
But the user still needs to run them in the correct order, by hand. We can automate this with a simple pipeline chain that links them together:
```
[ // end_to_end_run.json
{"**download": "${config_path}/01_download.json"},
{"**preprocess": "${config_path}/02_preprocess_data.json"},
{"**predict": "${config_path}/03_use_model.json"},
{"**export": "${config_path}/04_export_maps.json"},
{"**ingest": "${config_path}/05_ingest_byoc.json"},
{"pipeline_config": {"**download": "${config_path}/01_download.json"}},
{"pipeline_config": {"**preprocess": "${config_path}/02_preprocess_data.json"}},
{"pipeline_config": {"**predict": "${config_path}/03_use_model.json"}},
{"pipeline_config": {"**export": "${config_path}/04_export_maps.json"}},
{"pipeline_config": {"**ingest": "${config_path}/05_ingest_byoc.json"}},
]
```

@@ -121,28 +119,59 @@ In experimentation we often want to run the same pipeline for multiple parameters.
```
[ // run_threshold_experiments.json
{
"variables": {"threshold": 0.1},
"**pipeline": "${config_path}/extract_trees.json"
"pipeline_config:{
"variables": {"threshold": 0.1},
"**pipeline": "${config_path}/extract_trees.json"
},
},
{
"variables": {"threshold": 0.2},
"**pipeline": "${config_path}/extract_trees.json"
"pipeline_config:{
"variables": {"threshold": 0.2},
"**pipeline": "${config_path}/extract_trees.json"
},
},
{
"variables": {"threshold": 0.3},
"**pipeline": "${config_path}/extract_trees.json"
"pipeline_config:{
"variables": {"threshold": 0.3},
"**pipeline": "${config_path}/extract_trees.json"
},
},
{
"variables": {"threshold": 0.4},
"**pipeline": "${config_path}/extract_trees.json"
"pipeline_config:{
"variables": {"threshold": 0.4},
"**pipeline": "${config_path}/extract_trees.json"
}
}
]
```

-### Using variables with pipelines
+### Using variables with pipeline chains

While there is no syntactic sugar for specifying pipeline-chain-wide variables in JSON files, one can do that through the CLI. Running `eogrow end_to_end_run.json -v "year:2019"` will set the variable `year` to 2019 for all pipelines in the chain.
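As a minimal copy-ready sketch (the `year` variable is only an example; the pipelines in the chain must actually reference it for the flag to have an effect):

```
# sets the variable `year` to 2019 for all pipelines in the chain
eogrow end_to_end_run.json -v "year:2019"
```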

### Specifying resources for pipeline execution

Pipeline chains also allow the user to specify resources needed by the main process of each pipeline, in a similar way to how a pipeline config specifies the resources needed by its workers.

```
[ // end_to_end_run.json
  {
    "pipeline_config": {"**download": "${config_path}/01_download.json"}
  },
  {
    "pipeline_config": {"**predict": "${config_path}/03_use_model.json"},
    "pipeline_resources": {"memory": 2e9} // ~ 2GB RAM reserved for the main process
  },
  {
    "pipeline_config": {"**export": "${config_path}/04_export_maps.json"}
  }
]
```

This also allows us to run certain pipelines on specially tagged workers. When setting up the cluster, one can tag workers with custom resources, for instance an `r5.4xlarge` worker with `big_RAM_worker: 1`. If we set `"pipeline_resources": {"resources": {"big_RAM_worker": 1}}` then the pipeline will run ONLY on such workers, and the whole worker instance will be assigned to it. This is great for pipelines that have a large workload in the main process.

Pipeline chains can be 1 pipeline long, so this can also be used with a single pipeline.
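For instance, a minimal sketch of a one-pipeline chain pinned to such a tagged worker (the `big_RAM_worker` tag and the file names here are illustrative assumptions, not fixed eo-grow names):

```
[ // run_training.json: a pipeline chain with a single entry
  {
    "pipeline_config": {"**train": "${config_path}/train_model.json"},
    // request the custom resource; the main process is scheduled only on workers tagged with it
    "pipeline_resources": {"resources": {"big_RAM_worker": 1}}
  }
]
```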

## Path modification via variables

In some cases one wants fine-grained control over path specifications. The following is a simplified example of how one can provide separate download paths for a large number of batch pipelines.
23 changes: 16 additions & 7 deletions docs/source/config-language.md
@@ -26,19 +26,28 @@ Additional notes:

### Pipeline chains

-A typical configuration is a dictionary with pipeline parameters. However, it can also be a list of dictionaries. In this case each dictionary must contain parameters of a single pipeline. The order of dictionaries defines the consecutive order in which pipelines will be run. Example:
+A typical configuration is a dictionary with pipeline parameters. However, it can also be a list of pipeline-execution dictionaries that specify:
+- `pipeline_config`: a configuration for a single pipeline,
+- `pipeline_resources` (optional): a dictionary that is passed to `ray.remote` to configure which resources the main pipeline process will request from the cluster (see [here](https://docs.ray.io/en/latest/ray-core/api/doc/ray.remote_function.RemoteFunction.options.html) for options). The pipeline requests 1 CPU by default (and nothing else).

+The order of dictionaries defines the consecutive order in which pipelines will be run. Example:

```
[
{
"pipeline": "FirstPipeline",
"param1": "value1",
...
"pipeline_config": {
"pipeline": "FirstPipeline",
"param1": "value1",
...
},
},
{
"pipeline": "SecondPipeline",
"param2": "value2",
...
"pipeline_config": {
"pipeline": "SecondPipeline",
"param2": "value2",
...
},
"pipeline_resources": {"num_cpus": 2}
},
...
]
2 changes: 1 addition & 1 deletion eogrow/__init__.py
@@ -1,3 +1,3 @@
"""The main module of the eo-grow package."""

__version__ = "1.6.3"
__version__ = "1.7.0"