add load_stac process #402

Closed
jdries opened this issue Apr 18, 2023 · 18 comments · Fixed by #430 or Open-EO/openeo-opensearch-client#11

@jdries
Contributor

jdries commented Apr 18, 2023

https://processes.openeo.org/draft/#load_stac

This probably has a lot in common with load_result.
There is one important change: we want to use FileLayerProvider, with a STAC client.

Specific case:
A STAC API Collection that allows filtering items and downloading assets.

@jdries jdries assigned bossie and unassigned soxofaan May 5, 2023
@bossie
Collaborator

bossie commented May 15, 2023

This is a v2.x feature: Open-EO/openeo-processes#439

@bossie
Collaborator

bossie commented May 15, 2023

Differences with load_result:

  • only accepts URLs, not batch job IDs
  • supports static STAC catalogs as well as dynamic STAC APIs
  • supports the properties argument

@bossie
Collaborator

bossie commented May 15, 2023

From the process description:

Batch job results can be specified in two ways:

For Batch job results at the same back-end, a URL pointing to the corresponding batch job results endpoint should be provided. The URL usually ends with /jobs/{id}/results and {id} is the corresponding batch job ID.
For external results, a signed URL must be provided. Not all back-ends support signed URLs, which are provided as a link with the link relation canonical in the batch job result metadata.

@jdries does this imply that the logged-in user should be able to load his own batch job results by providing a /jobs/{id}/results URL that is not signed?

@jdries
Contributor Author

jdries commented May 15, 2023

Good question. I thought the canonical link was the only way to retrieve the correct URL, but I think that this is indeed implied here. Note that for this ticket, implementing the remote URL option is actually the main goal.

@soxofaan
Member

@jdries does this imply that the logged-in user should be able to load his own batch job results by providing a /jobs/{id}/results URL that is not signed?

Indeed, I pushed for that because the signed URLs have an expiry, could be invalidated/rotated, ..., while the /jobs/{id}/results URL is static and easily predictable (but requires auth headers). I think both options are valuable to have (for different use cases).

@bossie
Collaborator

bossie commented May 16, 2023

@jdries does this imply that the logged-in user should be able to load his own batch job results by providing a /jobs/{id}/results URL that is not signed?

Indeed, I pushed for that because the signed URLs have an expiry, could be invalidated/rotated, ..., while the /jobs/{id}/results URL is static and easily predictable (but requires auth headers). I think both options are valuable to have (for different use cases).

@soxofaan ok. I was thinking about the implications of this.

In load_result, we can simply assume that if the id starts with http(s)://, it's a signed URL and we can point our STAC client there; if it doesn't, it's a batch job ID and we load the results directly instead of invoking an HTTP request.

In the case of load_stac, doesn't this mean that we have to detect whether or not the URL is our own and act accordingly? In this case: pass auth headers if it is. But if we can determine that the URL is our own, we might as well parse the job ID from it and avoid HTTP again.

Thoughts?

@soxofaan
Member

But if we can determine that the URL is our own, we might as well parse the job ID from it and avoid HTTP again.

Indeed, that was also my thought: if we detect a static (non-signed) job result URL, we just have to check that the current user owns the referenced job (which is equivalent to going the full HTTP route with auth headers).
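
A minimal sketch of the detection step discussed above, not the actual back-end code. `OWN_API_ROOT` and `own_job_id` are hypothetical names, and the sketch assumes that signed URLs carry extra path segments after `/results`, so only the plain `/jobs/{id}/results` form matches:

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Hypothetical value: the public root URL of this back-end.
OWN_API_ROOT = "https://openeo.example/openeo/1.1"

# Unsigned job results path, relative to the API root.
_JOB_RESULTS_PATH = re.compile(r"/jobs/(?P<job_id>[^/]+)/results/?")


def own_job_id(url: str, api_root: str = OWN_API_ROOT) -> Optional[str]:
    """Return the batch job ID if `url` is an unsigned job results URL
    of this back-end, else None (treat it as an external STAC URL)."""
    parsed, root = urlparse(url), urlparse(api_root)
    if (parsed.scheme, parsed.netloc) != (root.scheme, root.netloc):
        return None  # different host: not our own URL
    root_path = root.path.rstrip("/")
    if not parsed.path.startswith(root_path + "/"):
        return None  # same host, but outside our API root
    match = _JOB_RESULTS_PATH.fullmatch(parsed.path[len(root_path):])
    return match.group("job_id") if match else None
```

If this returns a job ID, the caller still has to verify that the current user owns that job before loading its results directly.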

@bossie
Collaborator

bossie commented May 22, 2023

It looks like, at some point in the code, we have to determine whether the provided URL represents a static STAC catalog (incl. batch job results) or a dynamic STAC API Collection (which supports search requests).

A possible way to accomplish this seems to consist of:

  • fetching the STAC Collection document that the URL points to;
  • fetching the Collection's root Catalog by navigating to its "root" link;
  • looking at the root Catalog document's "conformsTo" property to determine whether it's a STAC API that supports item search;
  • if it does, we can send search requests to the Catalog's "search" URL for this collection; otherwise it's a static Catalog and we browse it.
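
The steps above can be sketched roughly as follows. This is an illustration, not the implementation that was merged: `link_href` and `search_url` are hypothetical helpers, and `fetch` is an assumed callable that maps a URL to its parsed JSON document (for example a thin wrapper around an HTTP client). The version segment of the Item Search conformance URI is matched loosely because it differs between deployments.

```python
import re
from typing import Callable, Optional
from urllib.parse import urljoin

# Conformance class URI of STAC API Item Search; the version part varies.
_ITEM_SEARCH = re.compile(r"^https://api\.stacspec\.org/v[^/]+/item-search$")


def link_href(doc: dict, rel: str, base_url: str) -> Optional[str]:
    """Return the absolute href of the first link with this relation type."""
    for link in doc.get("links", []):
        if link.get("rel") == rel and "href" in link:
            return urljoin(base_url, link["href"])
    return None


def search_url(collection_url: str, fetch: Callable[[str], dict]) -> Optional[str]:
    """Return the item search URL if the Collection belongs to a STAC API
    that supports item search, else None (browse it as a static catalog)."""
    collection = fetch(collection_url)
    root_url = link_href(collection, "root", base_url=collection_url)
    if root_url is None:
        return None
    root = fetch(root_url)
    if not any(_ITEM_SEARCH.match(uri) for uri in root.get("conformsTo", [])):
        return None
    return link_href(root, "search", base_url=root_url)
```

Injecting `fetch` keeps the decision logic free of network code, so it can be exercised against canned Collection and Catalog documents.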

@m-mohr does this approach seem right to you?

@m-mohr
Member

m-mohr commented May 22, 2023

Sounds reasonable.

bossie added a commit that referenced this issue May 23, 2023
Download this cube to test it:

data_cube = (connection
             .load_stac(url="https://tamn.snapplanet.io/collections/S2",
                        spatial_extent={"west": -87.83465281740789, "south": 42.57836607418331, "east": -87.80890361086492, "north": 42.59100512331456},
                        temporal_extent=["2022-05-10", "2022-05-10"],
                        bands=["B04", "B03", "B02"])
             .save_result("GTiff"))
@bossie bossie mentioned this issue May 23, 2023
@bossie bossie linked a pull request May 23, 2023 that will close this issue
bossie added a commit to Open-EO/openeo-opensearch-client that referenced this issue May 24, 2023
bossie added a commit to Open-EO/openeo-python-driver that referenced this issue May 24, 2023
bossie added a commit to Open-EO/openeo-opensearch-client that referenced this issue May 25, 2023
bossie added a commit to Open-EO/openeo-opensearch-client that referenced this issue May 25, 2023
bossie added a commit to Open-EO/openeo-opensearch-client that referenced this issue May 25, 2023
@bossie bossie linked a pull request May 25, 2023 that will close this issue
bossie added a commit that referenced this issue May 25, 2023
ERROR    openeo_driver.views.error:views.py:268 Py4JJavaError('An error occurred while calling None.org.openeo.geotrellis.file.PyramidFactory.\n', JavaObject id=o115)
Traceback (most recent call last):
  File "/home/bossie/PycharmProjects/openeo/venv38/lib/python3.8/site-packages/flask/app.py", line 1516, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/bossie/PycharmProjects/openeo/venv38/lib/python3.8/site-packages/flask/app.py", line 1502, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/home/bossie/PycharmProjects/openeo/openeo-python-driver/openeo_driver/users/auth.py", line 88, in decorated
    return f(*args, **kwargs)
  File "/home/bossie/PycharmProjects/openeo/openeo-python-driver/openeo_driver/views.py", line 624, in result
    result = backend_implementation.processing.evaluate(process_graph=process_graph, env=env)
  File "/home/bossie/PycharmProjects/openeo/openeo-python-driver/openeo_driver/ProcessGraphDeserializer.py", line 271, in evaluate
    return evaluate(process_graph=process_graph, env=env)
  File "/home/bossie/PycharmProjects/openeo/openeo-python-driver/openeo_driver/ProcessGraphDeserializer.py", line 342, in evaluate
    result = convert_node(result_node, env=env)
  File "/home/bossie/PycharmProjects/openeo/openeo-python-driver/openeo_driver/ProcessGraphDeserializer.py", line 362, in convert_node
    process_result = apply_process(process_id=process_id, args=processGraph.get('arguments', {}),
  File "/home/bossie/PycharmProjects/openeo/openeo-python-driver/openeo_driver/ProcessGraphDeserializer.py", line 1578, in apply_process
    args = {name: convert_node(expr, env=env) for (name, expr) in sorted(args.items())}
  File "/home/bossie/PycharmProjects/openeo/openeo-python-driver/openeo_driver/ProcessGraphDeserializer.py", line 1578, in <dictcomp>
    args = {name: convert_node(expr, env=env) for (name, expr) in sorted(args.items())}
  File "/home/bossie/PycharmProjects/openeo/openeo-python-driver/openeo_driver/ProcessGraphDeserializer.py", line 374, in convert_node
    return convert_node(processGraph['node'], env=env)
  File "/home/bossie/PycharmProjects/openeo/openeo-python-driver/openeo_driver/ProcessGraphDeserializer.py", line 362, in convert_node
    process_result = apply_process(process_id=process_id, args=processGraph.get('arguments', {}),
  File "/home/bossie/PycharmProjects/openeo/openeo-python-driver/openeo_driver/ProcessGraphDeserializer.py", line 1673, in apply_process
    return process_function(args=args, env=env)
  File "/home/bossie/PycharmProjects/openeo/openeo-python-driver/openeo_driver/ProcessGraphDeserializer.py", line 2191, in load_stac
    return env.backend_implementation.load_stac(url=url, load_params=load_params, env=env)
  File "/home/bossie/PycharmProjects/openeo/openeo-geopyspark-driver/openeogeotrellis/backend.py", line 836, in load_stac
    pyramid_factory = jvm.org.openeo.geotrellis.file.PyramidFactory(stac_api_client,
  File "/home/bossie/PycharmProjects/openeo/venv38/lib/python3.8/site-packages/py4j/java_gateway.py", line 1585, in __call__
    return_value = get_return_value(
  File "/home/bossie/PycharmProjects/openeo/venv38/lib/python3.8/site-packages/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.openeo.geotrellis.file.PyramidFactory.
: java.lang.NullPointerException
	at org.openeo.geotrellis.file.PyramidFactory.<init>(PyramidFactory.scala:42)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:238)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.lang.Thread.run(Thread.java:750)
bossie added a commit that referenced this issue May 26, 2023
bossie added a commit that referenced this issue Aug 17, 2023
bossie pushed a commit to Open-EO/openeo-geopyspark-integrationtests that referenced this issue Aug 18, 2023
bossie pushed a commit to Open-EO/openeo-geopyspark-integrationtests that referenced this issue Aug 18, 2023
@bossie bossie reopened this Aug 18, 2023
bossie added a commit that referenced this issue Aug 21, 2023
bossie added a commit that referenced this issue Aug 21, 2023
… URL #402

In a batch job started from an async_task (think: SHub batch process), there's no access_token
in the user's internal_auth_data; this was erroneously passed as the string "None" to the batch job.

Instead of an expected KeyError upon reconstructing the Bearer token, the Bearer token
was actually "basic//None", failing as well but with an unexpected 403 Forbidden instead.

File "batch_job.py", line 1298, in <module>
    main(sys.argv)
  File "batch_job.py", line 1034, in main
    run_driver()
  File "batch_job.py", line 1004, in run_driver
    run_job(
  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/utils.py", line 52, in memory_logging_wrapper
    return function(*args, **kwargs)
  File "batch_job.py", line 1099, in run_job
    result = ProcessGraphDeserializer.evaluate(process_graph, env=env, do_dry_run=tracer)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 348, in evaluate
    result = convert_node(result_node, env=env)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 368, in convert_node
    process_result = apply_process(process_id=process_id, args=processGraph.get('arguments', {}),
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 1480, in apply_process
    args = {name: convert_node(expr, env=env) for (name, expr) in sorted(args.items())}
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 1480, in <dictcomp>
    args = {name: convert_node(expr, env=env) for (name, expr) in sorted(args.items())}
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 380, in convert_node
    return convert_node(processGraph['node'], env=env)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 368, in convert_node
    process_result = apply_process(process_id=process_id, args=processGraph.get('arguments', {}),
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 1480, in apply_process
    args = {name: convert_node(expr, env=env) for (name, expr) in sorted(args.items())}
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 1480, in <dictcomp>
    args = {name: convert_node(expr, env=env) for (name, expr) in sorted(args.items())}
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 380, in convert_node
    return convert_node(processGraph['node'], env=env)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 368, in convert_node
    process_result = apply_process(process_id=process_id, args=processGraph.get('arguments', {}),
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 1512, in apply_process
    return process_function(args=ProcessArgs(args, process_id=process_id), env=env)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 2088, in load_stac
    return env.backend_implementation.load_stac(url=url, load_params=load_params, env=env)
  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/backend.py", line 937, in load_stac
    url = signed_results_url()  # FIXME: remove HTTP workaround, load job results directly (~ load_result)
  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/backend.py", line 858, in signed_results_url
    resp.raise_for_status()
  File "/opt/venv/lib64/python3.8/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://openeo-dev.vito.be/openeo/1.1/jobs/j-2802b19806a84c3eb8772c97e4abb2b7/results
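
The failure mode described in this commit message (a missing access_token that ends up as the string "None") can be illustrated with a small sketch. `basic_bearer_token` is a hypothetical helper and the shape of `internal_auth_data` is an assumption; only the `access_token` key and the `basic//` prefix come from the message above.

```python
from typing import Mapping, Optional


def basic_bearer_token(internal_auth_data: Mapping[str, Optional[str]]) -> str:
    """Build a "basic//<token>" Bearer token, failing loudly when the
    access token is absent instead of silently producing "basic//None"."""
    access_token = internal_auth_data.get("access_token")
    # str(None) == "None": guard against a missing value and against
    # its accidental string form after serialization.
    if access_token in (None, "", "None"):
        raise KeyError("access_token")
    return f"basic//{access_token}"
```

With this guard, the missing token surfaces as the expected KeyError at the point of reconstruction rather than as a 403 from the HTTP request.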
bossie added a commit that referenced this issue Aug 21, 2023
Removes workaround where a Bearer token was reconstructed to be able
to obtain a canonical URL and load STAC from there.

Fixes the combination of load_stac and SHub batch processes.
bossie added a commit that referenced this issue Aug 24, 2023
Traceback (most recent call last):
  File "/opt/venv/lib64/python3.8/site-packages/flask/app.py", line 1516, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/venv/lib64/python3.8/site-packages/flask/app.py", line 1502, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/users/auth.py", line 88, in decorated
    return f(*args, **kwargs)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/views.py", line 619, in result
    result = backend_implementation.processing.evaluate(process_graph=process_graph, env=env)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 277, in evaluate
    return evaluate(process_graph=process_graph, env=env)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 348, in evaluate
    result = convert_node(result_node, env=env)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 368, in convert_node
    process_result = apply_process(process_id=process_id, args=processGraph.get('arguments', {}),
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 1480, in apply_process
    args = {name: convert_node(expr, env=env) for (name, expr) in sorted(args.items())}
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 1480, in <dictcomp>
    args = {name: convert_node(expr, env=env) for (name, expr) in sorted(args.items())}
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 380, in convert_node
    return convert_node(processGraph['node'], env=env)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 368, in convert_node
    process_result = apply_process(process_id=process_id, args=processGraph.get('arguments', {}),
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 1512, in apply_process
    return process_function(args=ProcessArgs(args, process_id=process_id), env=env)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 2088, in load_stac
    return env.backend_implementation.load_stac(url=url, load_params=load_params, env=env)
  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/backend.py", line 1031, in load_stac
    itm.properties.get("datetime") or itm.properties["start_datetime"],
KeyError: 'start_datetime'
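
The KeyError above comes from an Item whose properties contain neither a usable "datetime" nor a "start_datetime". A minimal sketch of a more defensive lookup, assuming Items are plain dictionaries; `item_datetime` is a hypothetical helper, not the fix that was committed:

```python
def item_datetime(item: dict) -> str:
    """Return the Item's "datetime", falling back to "start_datetime"
    when "datetime" is null or absent; raise a descriptive error otherwise."""
    properties = item.get("properties", {})
    value = properties.get("datetime") or properties.get("start_datetime")
    if value is None:
        raise ValueError(
            f"STAC Item {item.get('id')!r} has neither 'datetime' nor 'start_datetime'"
        )
    return value
```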
@bossie
Collaborator

bossie commented Aug 24, 2023

With the temporary catalog workarounds applied, this batch job runs successfully and produces 3 GeoTiffs with 3 bands each:

import openeo

connection = openeo.connect("openeo-3-1.openeo-vlcc-prod").authenticate_oidc()

data_cube = (connection
             .load_stac(url="https://geoville/resto/collections/BVLPROBA_v1",
                        spatial_extent={"west": 20.579494542018466, "south": 54.31120577537291,
                                        "east": 20.631426035739594, "north": 54.33263361375995},
                        temporal_extent=["2019-01-01", "2021-12-31"],
                        bands=["band1", "band2", "band3"])
             .save_result("GTiff"))

job = data_cube.execute_batch()
job.download_results("/tmp")

# openEO_2019-01-01Z.tif
# openEO_2020-01-01Z.tif
# openEO_2021-01-01Z.tif

bossie added a commit that referenced this issue Aug 28, 2023
bossie added a commit to Open-EO/openeo-python-driver that referenced this issue Aug 28, 2023
Open-EO/openeo-geopyspark-driver#402

Traceback (most recent call last):
  File "/home/bossie/PycharmProjects/openeo/venv38/lib/python3.8/site-packages/flask/app.py", line 1516, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/bossie/PycharmProjects/openeo/venv38/lib/python3.8/site-packages/flask/app.py", line 1502, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/home/bossie/PycharmProjects/openeo/openeo-python-driver/openeo_driver/users/auth.py", line 88, in decorated
    return f(*args, **kwargs)
  File "/home/bossie/PycharmProjects/openeo/openeo-python-driver/openeo_driver/views.py", line 619, in result
    result = backend_implementation.processing.evaluate(process_graph=process_graph, env=env)
  File "/home/bossie/PycharmProjects/openeo/openeo-python-driver/openeo_driver/ProcessGraphDeserializer.py", line 277, in evaluate
    return evaluate(process_graph=process_graph, env=env)
  File "/home/bossie/PycharmProjects/openeo/openeo-python-driver/openeo_driver/ProcessGraphDeserializer.py", line 342, in evaluate
    convert_node(result_node, env=env.push({ENV_DRY_RUN_TRACER: dry_run_tracer, ENV_SAVE_RESULT:[], "node_caching":False}))
  File "/home/bossie/PycharmProjects/openeo/openeo-python-driver/openeo_driver/ProcessGraphDeserializer.py", line 368, in convert_node
    process_result = apply_process(process_id=process_id, args=processGraph.get('arguments', {}),
  File "/home/bossie/PycharmProjects/openeo/openeo-python-driver/openeo_driver/ProcessGraphDeserializer.py", line 1480, in apply_process
    args = {name: convert_node(expr, env=env) for (name, expr) in sorted(args.items())}
  File "/home/bossie/PycharmProjects/openeo/openeo-python-driver/openeo_driver/ProcessGraphDeserializer.py", line 1480, in <dictcomp>
    args = {name: convert_node(expr, env=env) for (name, expr) in sorted(args.items())}
  File "/home/bossie/PycharmProjects/openeo/openeo-python-driver/openeo_driver/ProcessGraphDeserializer.py", line 380, in convert_node
    return convert_node(processGraph['node'], env=env)
  File "/home/bossie/PycharmProjects/openeo/openeo-python-driver/openeo_driver/ProcessGraphDeserializer.py", line 368, in convert_node
    process_result = apply_process(process_id=process_id, args=processGraph.get('arguments', {}),
  File "/home/bossie/PycharmProjects/openeo/openeo-python-driver/openeo_driver/ProcessGraphDeserializer.py", line 1512, in apply_process
    return process_function(args=ProcessArgs(args, process_id=process_id), env=env)
  File "/home/bossie/PycharmProjects/openeo/openeo-python-driver/openeo_driver/ProcessGraphDeserializer.py", line 752, in reduce_dimension
    dimension = args.get_required(
  File "/home/bossie/PycharmProjects/openeo/openeo-python-driver/openeo_driver/processes.py", line 309, in get_required
    self._check_value(name=name, value=value, expected_type=expected_type, validator=validator)
  File "/home/bossie/PycharmProjects/openeo/openeo-python-driver/openeo_driver/processes.py", line 336, in _check_value
    raise ProcessParameterInvalidException(parameter=name, process=self.process_id, reason=reason)
openeo_driver.errors.ProcessParameterInvalidException: The value passed for parameter 'dimension' in process 'reduce_dimension' is invalid: Must be one of [] but got 't'.
bossie added a commit that referenced this issue Sep 5, 2023
From https://github.com/stac-api-extensions/query:

"It is recommended to implement the Filter Extension instead of the Query Extension. Filter Extension is more well-defined, more expressive, and uses the standardized CQL2 query language instead of the proprietary language defined here. There is no plan to deprecate this extension, but it is also unlikely to see any further refinement or changes."
@bossie
Collaborator

bossie commented Sep 5, 2023

As discussed: since client-side filtering on Item properties tends to work better than the STAC API Filter extension (which returns 400 "Unknown property in filter" most of the time), it was decided to postpone this TODO.
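
For illustration, client-side filtering on Item properties could look like the sketch below. It assumes Items are plain dictionaries and that each property filter has already been turned into a Python predicate; `filter_items` is a hypothetical name, and in openEO the `properties` argument actually carries process graphs rather than callables.

```python
from typing import Callable, Dict, Iterable, Iterator

PropertyFilters = Dict[str, Callable[[object], bool]]


def filter_items(items: Iterable[dict], property_filters: PropertyFilters) -> Iterator[dict]:
    """Yield only the Items whose properties satisfy every filter.
    An Item that lacks a filtered property is dropped."""
    for item in items:
        props = item.get("properties", {})
        if all(
            name in props and predicate(props[name])
            for name, predicate in property_filters.items()
        ):
            yield item
```

This trades extra data transfer (all Items are fetched first) for independence from whatever properties the STAC API accepts in its filter requests.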

@m-mohr
Member

m-mohr commented Sep 5, 2023

Are you aware of the queryables endpoint, which you can use to retrieve the available properties for the filters?

@bossie
Collaborator

bossie commented Sep 5, 2023

This particular STAC API returns "additionalProperties": true but still rejects most properties.
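
A small sketch of what a queryables document can and cannot tell a client, assuming it has already been fetched and parsed as a JSON Schema dictionary; `queryable_status` is a hypothetical helper. It shows why `"additionalProperties": true` is inconclusive here: the document does not rule a property out, yet the API may still reject it.

```python
def queryable_status(queryables: dict, name: str) -> str:
    """Classify a property name against a queryables JSON Schema document."""
    if name in queryables.get("properties", {}):
        return "advertised"  # explicitly listed as queryable
    # In JSON Schema, additionalProperties defaults to true when absent.
    if queryables.get("additionalProperties", True) is not False:
        return "unspecified"  # not ruled out, but not guaranteed to work
    return "rejected"  # only the listed properties are allowed
```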

@m-mohr
Member

m-mohr commented Sep 5, 2023

Is it a public API? If yes, which?

@bossie
Collaborator

bossie commented Sep 5, 2023

The API is not public.

@jdries
Contributor Author

jdries commented Sep 12, 2023

done!

@jdries jdries closed this as completed Sep 12, 2023