Subsequent interaction (starting the jobs, polling their status, requesting the result assets, ...)
can be done through the "master" `job` object created above, in the same way as with normal batch jobs.

### Validity of signed URLs in batch job results

Batch job results are accessible to the user through signed URLs stored in the result assets. Within the platform,
these URLs are valid for 7 days (their expiry time); during these 7 days, anyone with the URL can access
the results of the batch job. Each time a user requests the results from the job endpoint (`GET /jobs/{job_id}/results`),
freshly signed URLs (again valid for 7 days) are created for the result assets.
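
As an illustration, fetching results with the openeo Python client goes through this endpoint, so every fetch yields freshly signed URLs. A minimal sketch, assuming an existing finished batch job (the job id below is a placeholder):

```python
import openeo

connection = openeo.connect("openeo.cloud").authenticate_oidc()
job = connection.job("j-1234")  # placeholder id of a finished batch job

# get_results() requests `GET /jobs/{job_id}/results` under the hood, so the
# asset hrefs printed here are freshly signed URLs, valid for 7 days.
for asset in job.get_results().get_assets():
    print(asset.name, asset.href)
```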

### Customizing batch job resources on Terrascope

Jobs running on the (Terrascope) cluster are assigned a default amount of CPU and memory resources. This
may not always be enough for your job, for instance when using UDFs. For very large jobs, you may also want
to tune your resource settings to optimize for cost.

The example below shows how to start a job with all options set to their default values. Note that these
defaults are subject to change by the back-end whenever needed.

```python
job_options = {
    "executor-memory": "2G",
    "executor-memoryOverhead": "3G",
    "executor-cores": 2,
    "task-cpus": 1,
    "executor-request-cores": "400m",
    "max-executors": "100",
    "driver-memory": "8G",
    "driver-memoryOverhead": "2G",
    "driver-cores": 5,
}
cube.execute_batch(job_options=job_options)
```

This is a short overview of the various options:

- `executor-memory`: memory assigned to your workers (executors), for the JVM that executes most predefined processes
- `executor-memoryOverhead`: memory assigned on top of the JVM, for instance to run UDFs
- `executor-cores`: number of CPUs per worker (executor). The number of parallel tasks per worker is `executor-cores/task-cpus`
- `task-cpus`: CPUs assigned to a single task. UDFs using libraries like TensorFlow can benefit from further parallelization at the level of individual tasks.
- `executor-request-cores`: this setting is only relevant for Kubernetes-based back-ends; it allows overcommitting CPU
- `max-executors`: the maximum number of workers assigned to your job. The maximum number of parallel tasks is `max-executors*executor-cores/task-cpus` (see the sketch below this list). Increasing this can inflate your costs, while not necessarily improving performance!
- `driver-memory`: memory assigned to the Spark 'driver' JVM that controls the execution of your batch job
- `driver-memoryOverhead`: memory assigned to the Spark 'driver' on top of the JVM memory, for Python processes.
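
A quick back-of-the-envelope sketch of that upper bound on parallelism, using the default values from the example above:

```python
# Maximum number of parallel tasks = max-executors * executor-cores / task-cpus,
# evaluated here with the default job options shown earlier.
max_executors = 100
executor_cores = 2
task_cpus = 1

max_parallel_tasks = max_executors * executor_cores // task_cpus
print(max_parallel_tasks)  # 200
```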

#### Learning more

Resource optimization is a complex topic, and we only give a short summary here. The goal of openEO is to hide most of these
details from the user, but we realize that advanced users sometimes want a bit more insight, so in the spirit of being open, we give some hints.

To learn more about these options, we point to the piece of code that handles this:

https://github.com/Open-EO/openeo-geopyspark-driver/blob/faf5d5364a82e870e42efd2a8aee9742f305da9f/openeogeotrellis/backend.py#L1213

Most memory-related options are translated to Apache Spark configuration settings, which are documented here:

https://spark.apache.org/docs/3.3.1/configuration.html#application-properties
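
As a rough illustration only (the linked `backend.py` is the authoritative translation, and the exact mapping is an assumption here), the job options roughly correspond to Spark application properties along these lines:

```python
# Hypothetical sketch of how the job options may map onto Spark application
# properties; consult the linked backend.py for the actual translation.
SPARK_PROPERTY_FOR_JOB_OPTION = {
    "executor-memory": "spark.executor.memory",
    "executor-memoryOverhead": "spark.executor.memoryOverhead",
    "executor-cores": "spark.executor.cores",
    "task-cpus": "spark.task.cpus",
    "max-executors": "spark.dynamicAllocation.maxExecutors",
    "driver-memory": "spark.driver.memory",
    "driver-memoryOverhead": "spark.driver.memoryOverhead",
    "driver-cores": "spark.driver.cores",
}
```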

### Batch job results on Sentinel Hub

If you are processing data and the underlying back-end is Sentinel Hub, the output extent of your batch job results is currently larger than your input extent because Sentinel Hub processes whole tiles (this may change in the future and the data will be cropped to your input extent).
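
If the extra margin matters for your use case, a possible workaround is to crop the downloaded result back to your requested extent locally, for example with the `rioxarray` package; the file name and bounding box below are placeholders:

```python
# Hypothetical post-processing: clip a downloaded GeoTIFF back to the
# originally requested bounding box (requires the rioxarray package).
import rioxarray

data = rioxarray.open_rasterio("result.tif")
cropped = data.rio.clip_box(minx=5.0, miny=51.0, maxx=5.1, maxy=51.1)
cropped.rio.to_raster("result_cropped.tif")
```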
