
load_result: Load by URL and filter by extents and bands #292

Merged: 8 commits merged into draft on Dec 1, 2021

Conversation

@m-mohr (Member) commented Oct 26, 2021

Implements #220; see #220 (comment) for some details.

@m-mohr m-mohr added the minor label Oct 26, 2021
@m-mohr m-mohr added this to the 1.3.0 milestone Oct 26, 2021
@m-mohr m-mohr self-assigned this Oct 26, 2021
@m-mohr m-mohr linked an issue Oct 26, 2021 that may be closed by this pull request
@m-mohr m-mohr modified the milestones: 1.3.0, 1.2.0 Oct 26, 2021
@sophieherrmann

Adding filtering by temporal / spatial extent and bands sounds good. For clarification, how would load_result via signed URL work? Do we just download the job result, or could we somehow access the job result metadata as well? I think the metadata could be helpful for correctly loading the data.

@m-mohr (Member, Author) commented Oct 27, 2021

load_result expects the signed URL of the metadata, so you can use that for filtering, as described in the process:

The URL to the STAC metadata for a batch job result. This needs to be the signed URL that is provided by some back-ends through the canonical link relation in the batch job result metadata. Please note that this is only supported since openEO API version 1.1.0.
If supported by the underlying metadata and file format, the data that is added to the data cube can be restricted with the parameters spatial_extent, temporal_extent and bands.
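For illustration only: a process graph node calling load_result by URL with the new filter parameters could look roughly like the sketch below. This assumes the existing id parameter also accepts the URL, as the description implies; the URL, extents and band names are placeholders, and the spatial_extent/temporal_extent structures mirror the load_collection conventions.

{
  "load1": {
    "process_id": "load_result",
    "arguments": {
      "id": "https://example.openeo.org/v1/results/signed-placeholder",
      "spatial_extent": {"west": 5.0, "south": 50.0, "east": 6.0, "north": 51.0},
      "temporal_extent": ["2021-06-01", "2021-07-01"],
      "bands": ["B02", "B03", "B04"]
    },
    "result": true
  }
}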

@sophieherrmann

Then that's perfect

@@ -1,7 +1,7 @@
 {
     "id": "load_result",
     "summary": "Load batch job results",
-    "description": "Loads batch job results by job id from the server-side user workspace. The job must have been stored by the authenticated user on the back-end currently connected to.",
+    "description": "Loads batch job results and returns them as a processable data cube. A batch job result can be loaded by ID or URL:\n\n* **ID**: A batch job identifier available on the server-side user workspace. The job must have been stored by the authenticated user on the back-end currently connected to.\n* **URL**: The URL to the STAC metadata for a batch job result. This needs to be the signed URL that is provided by some back-ends through the `canonical` link relation in the batch job result metadata. Please note that this is only supported since openEO API version 1.1.0.\n\nIf supported by the underlying metadata and file format, the data that is added to the data cube can be restricted with the parameters `spatial_extent`, `temporal_extent` and `bands`.\n\n**Remarks:**\n\n* The bands (and all dimensions that specify nominal dimension labels) are expected to be ordered as specified in the metadata if the `bands` parameter is set to `null`.\n* If no additional parameter is specified this would imply that the whole data set is expected to be loaded. Due to the large size of many data sets, this is not recommended and may be optimized by back-ends to only load the data that is actually required after evaluating subsequent processes such as filters. This means that the pixel values should be processed only after the data has been limited to the required extent and as a consequence also to a manageable size.",
Review comment (Member):

batch job identifier available on the server-side user workspace. The job must have been stored by the ...

"id available in workspace" is a bit weird here I think.
The concept "workspace" is used in the API in the "files" sub-API, but not for batch jobs. I don't think we specify that batch job results should be accessible directly through this "files" workspace. Also, the identifier itself is just a string listed in job listing, I don't think it's necessary to describe it in terms of workspaces and storage.

Sidenote: I think it should also be mentioned that the job must have completed successfully before results can be loaded.

Review comment (Member):

This needs to be the signed URL

Why this requirement?
The problem with the signed URL is that it can expire or be invalidated. If you don't want your process graph to break when signed URLs expire, you probably want to use a more persistent URL, e.g. a non-signed but authenticated URL.

Review comment (Member):

... note that this is only supported since ...

It's not clear what this "this" refers to: signed URLs, the canonical link, the fact that load_result accepts URLs, ...?

Review comment (Member):

If no additional parameter is specified this would imply that the whole data set is expected to be loaded. Due to the large size of many data sets, this is not recommended and may be optimized by back-ends to only load the data that is actually required after evaluating subsequent processes such as filters. This means that the pixel values should be processed only after the data has been limited to the required extent and as a consequence also to a manageable size.

There are quite some implementation details here, and I'm not sure it is necessary to include them. We're talking about the results of a successful batch job, so the data size is probably already "manageable".

@m-mohr (Member, Author) commented Oct 28, 2021

I added "thumbs up" to the comments I've already fixed, see the latest commit for details.

Why this requirement? The problem with the signed URL is that it can expire or be invalidated.

When writing the original batch job result spec, we had e.g. Google Docs in mind, where signed URLs don't expire unless you invalidate them specifically.

If you don't want your process graph to break when signed URLs expire, you probably want to use a more persistent URL, e.g. a non-signed but authenticated URL

The signed URL was meant to be exactly this, as there's no good alternative yet. Or what would a "non-signed but authenticated URL" be for you? We can't really start an OIDC flow inside a process, I think.

There are quite some implementation details here, and I'm not sure it is necessary to include them. We're talking about the results of a successful batch job, so the data size is probably already "manageable"

It's basically the same level of detail we already have in load_collection (mostly copy & paste), and that was requested to be there to give users an understanding of the implications of being "lazy" and requesting everything. So I think it's good to have it there. A batch job could still be continental-scale...
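To make the laziness argument concrete, here is a hedged sketch of a process graph in which a back-end could, after evaluating the subsequent filter, load only the data inside the bounding box instead of the whole batch job result. The job ID and extent are placeholders, and filter_bbox is just one example of such a filter:

{
  "load1": {
    "process_id": "load_result",
    "arguments": {"id": "abc123"}
  },
  "filter1": {
    "process_id": "filter_bbox",
    "arguments": {
      "data": {"from_node": "load1"},
      "extent": {"west": 5.0, "south": 50.0, "east": 6.0, "north": 51.0}
    },
    "result": true
  }
}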

Review comment (Member):

fine-tuning suggestion:

Not specifying any additional parameter implies to load the whole data set, which might be expensive performance-wise. Back-ends may optimize this with lazy loading techniques, but it is generally recommended to limit the loaded data as soon as possible to the relevant spatio-temporal extent.

@m-mohr (Member, Author)

expensive performance-wise

It might be expensive in different regards; I'm not sure I'd specifically mention performance here.

lazy loading techniques

Is this detail important? I thought we are going to be less technical. ;-)

to the relevant spatio-temporal extent.

This doesn't capture the bands (and in load_collection: properties).

A merge of both?

Not specifying any additional parameter implies loading the whole data set into a data cube, which might be expensive. Back-ends may optimize this, but it is generally recommended to limit the data cubes to the required size as soon as possible.

Review comment (Member):

A bare "expensive" sounds money-related to me. In the end it will be reflected in the billing, but I guess users should be worried first about performance or computational cost.

"lazy loading" is indeed implementation detail. Fine to omit.

"required size": I wanted to avoid "size", because that is not exposed explicitly, let alone controllable by the user, they control spatio-temporal extents. The fact that the description doesn't cover bands or properties is more something "left as exercise to the reader". Possibile alternative: "data subset".

@m-mohr (Member, Author) commented Nov 4, 2021

expensive: I added performance/cost.

size: I've checked the xarray docs as an example and they also seem to call it "size", so I'd expect that a good number of users understand it. I had "subset" in the description for a moment, but felt that size is easier to understand (probably very subjective).

Review comment (Member):

It's just that "the required size" is confusing to me as a simple user: I have no idea about the actual size (is it in pixels, tiles, file bytes, memory bytes?) and I don't know what is "required" (is it required for my use case or is it a back-end limitation?). To play the devil's advocate: a naive user might also read the "size" recommendation as "I can load the whole world, if I down-sample it early enough". If you talk about extents, you don't have that interpretation.

Proposal:

... to limit the data cube to the relevant extent as soon as possible.

@m-mohr (Member, Author) commented Oct 29, 2021

FYI: I've intentionally left out property filtering for now, as it increases the complexity and I don't see us filtering on asset properties right now. We could surely add it later once Open-EO/openeo-api#398 has been implemented. With an API in the background, property filtering is much more feasible. At that point you could even do a "load_from_stac_api" process. ;-)

@m-mohr m-mohr requested review from SerRichard and removed request for sophieherrmann November 3, 2021 10:29
@m-mohr m-mohr mentioned this pull request Nov 4, 2021
@m-mohr (Member, Author) commented Nov 29, 2021

I plan to release v1.2.0 soon. Is this ready to go into this version or should we postpone and just put it into 1.3.0?

@soxofaan (Member)

I think the discussion is mainly about fine-tuning the docs, so I would not let that prevent including the current state in 1.2.0.

Current state looks fine to me.

One minor note about:

Back-ends may optimize this, but it is generally recommended to limit the data cubes to the required size as soon as possible.

I don't think it is clear for a novice user what "back-ends may optimize this" is about.
We could elaborate on that, but I think the sentence is good enough without that part:

It is generally recommended to limit the data cubes to the required size as soon as possible.

@SerRichard

I don't think it is clear for a novice user what "back-ends may optimize this" is about. We could elaborate on that, but I think the sentence is good enough without that part:

Does this not also suggest to a user that they may get different rates for the same processing on different back-ends, depending on which is optimized the best? In a distributed platform like this it's bound to be the case, but I wonder if it is beneficial to advertise it to the user (at least at the current stage).

Rest looks good to me though!

@m-mohr (Member, Author) commented Nov 30, 2021

I'm not exactly sure how to solve this, because I think it's still important to say that users should carefully choose the amount of data to load. For now, I have reverted to the original load_collection description, as that's the one that had been agreed on by all openEO.org consortium partners before, and load_collection and load_result are similar in their purpose (apart from the different data source), so they can share the same wording.

@SerRichard Yes, that could mean that costs differ between back-ends. We are defining the overall processes for openEO here, so there's no fine-tuning for openEO Platform. I don't think this is really a concern for Platform, but if it is, the Platform is free to remove this from the description.

@jdries (Contributor) commented Nov 30, 2021

So this would go into the draft for 1.2.0? Or really a full release? Do we have a confirmed implementation that supports the additional parameters?

Regarding docs: your problem is that you're writing for different audiences: backend developers vs end users.
For me, the primary docs for end users are the various guides and the dedicated client documentation. The process can and should also cater towards backend developers. If an end user chooses to use our spec as documentation, then he shouldn't be too confused when backend guidelines pop up in the description...

Regarding optimization differences: we should be very clear at this stage that users can and should try different backends to be sure that they optimize for cost, especially when they think the cost is too much or have some other bad experience.

@m-mohr (Member, Author) commented Nov 30, 2021

So this would go into the draft for 1.2.0? Or really a full release? Do we have a confirmed implementation that supports the additional parameters?

Planned to go into draft now for a full release in the next 4 weeks, but of course, as an experimental proposal (where we already have several processes included without any implementation). I can also hold it back and only release it in 1.3.0 next year if that's important for anyone. Nevertheless, the process is simply a "merge" of a well-established process (load_collection) and the old load_result, so I don't think the description can be overly wrong. Of course, we may figure out that this won't work as of now, but then I think the main mitigation is to add STAC API support instead of removing the parameters again.

The primary reason for releasing soon is that once I start the greater set of changes for vector processes, I'd like to get the existing changes out first and have a clean basis so that we don't need to wait for the vector stuff to be completed.

Regarding docs: your problem is that you're writing for different audiences: backend developers vs end users. For me, the primary docs for end users are the various guides and the dedicated client documentation. The process can and should also cater towards backend developers. If an end user chooses to use our spec as documentation, then he shouldn't be too confused when backend guidelines pop up in the description...

Don't agree here. The process descriptions are primarily targeted towards end users and were always written as such. It's what they see first when looking for a specific process in most of our tooling (the only exception could be the Python client, not sure). Back-end implementation details can be added to the implementation guide we recently added to the repository. If there's anything I should add for this process to the implementation guide, please let me know what could be useful to have there.

@m-mohr m-mohr merged commit 5abad4b into draft Dec 1, 2021
@m-mohr m-mohr deleted the issue-220 branch December 1, 2021 14:19
Successfully merging this pull request may close these issues.

load_result: Align with load_collection
5 participants