
load_result: Load by URL and filter by extents and bands #292

Merged: 8 commits merged into draft on Dec 1, 2021

Conversation

@m-mohr (Member) commented Oct 26, 2021

Implements #220; see #220 (comment) for some details.

@m-mohr m-mohr added the minor label Oct 26, 2021
@m-mohr m-mohr added this to the 1.3.0 milestone Oct 26, 2021
@m-mohr m-mohr self-assigned this Oct 26, 2021
@m-mohr m-mohr linked an issue Oct 26, 2021 that may be closed by this pull request
@m-mohr m-mohr modified the milestones: 1.3.0, 1.2.0 Oct 26, 2021
@sophieherrmann

Adding filtering by temporal / spatial extent and bands sounds good. For clarification, how would load_result via signed URL work? Do we just download the job result, or could we somehow access the job result metadata as well? I think the metadata could be helpful for correctly loading the data.

@m-mohr (Member, Author) commented Oct 27, 2021

load_result expects the signed URL of the metadata, so you can use that for filtering, as described in the process:

The URL to the STAC metadata for a batch job result. This needs to be the signed URL that is provided by some back-ends through the canonical link relation in the batch job result metadata. Please note that this is only supported since openEO API version 1.1.0.
If supported by the underlying metadata and file format, the data that is added to the data cube can be restricted with the parameters spatial_extent, temporal_extent and bands.
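For illustration only: a process graph node calling load_result by URL with the new filter parameters could look roughly like the sketch below. This assumes the existing id parameter also accepts the URL, as the description implies; the URL, extents and band names are placeholders, and the spatial_extent/temporal_extent structures mirror the load_collection conventions.

{
  "load1": {
    "process_id": "load_result",
    "arguments": {
      "id": "https://example.openeo.org/v1/results/signed-placeholder",
      "spatial_extent": {"west": 5.0, "south": 50.0, "east": 6.0, "north": 51.0},
      "temporal_extent": ["2021-06-01", "2021-07-01"],
      "bands": ["B02", "B03", "B04"]
    },
    "result": true
  }
}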

@sophieherrmann

Then that's perfect

@@ -1,7 +1,7 @@
 {
     "id": "load_result",
     "summary": "Load batch job results",
-    "description": "Loads batch job results by job id from the server-side user workspace. The job must have been stored by the authenticated user on the back-end currently connected to.",
+    "description": "Loads batch job results and returns them as a processable data cube. A batch job result can be loaded by ID or URL:\n\n* **ID**: A batch job identifier available on the server-side user workspace. The job must have been stored by the authenticated user on the back-end currently connected to.\n* **URL**: The URL to the STAC metadata for a batch job result. This needs to be the signed URL that is provided by some back-ends through the `canonical` link relation in the batch job result metadata. Please note that this is only supported since openEO API version 1.1.0.\n\nIf supported by the underlying metadata and file format, the data that is added to the data cube can be restricted with the parameters `spatial_extent`, `temporal_extent` and `bands`.\n\n**Remarks:**\n\n* The bands (and all dimensions that specify nominal dimension labels) are expected to be ordered as specified in the metadata if the `bands` parameter is set to `null`.\n* If no additional parameter is specified this would imply that the whole data set is expected to be loaded. Due to the large size of many data sets, this is not recommended and may be optimized by back-ends to only load the data that is actually required after evaluating subsequent processes such as filters. This means that the pixel values should be processed only after the data has been limited to the required extent and as a consequence also to a manageable size.",
Review comment (Member):

batch job identifier available on the server-side user workspace. The job must have been stored by the ...

"id available in workspace" is a bit weird here I think.
The concept "workspace" is used in the API in the "files" sub-API, but not for batch jobs. I don't think we specify that batch job results should be accessible directly through this "files" workspace. Also, the identifier itself is just a string listed in job listing, I don't think it's necessary to describe it in terms of workspaces and storage.

Sidenote: I think it should also be mentioned that the job must have completed successfully before results can be loaded.

Review comment (Member):

This needs to be the signed URL

Why this requirement?
The problem with the signed URL is that it can expire or be invalidated. If you don't want your process graph to break when signed URLs expire, you probably want to use a more persistent URL, e.g. a non-signed but authenticated URL.

Review comment (Member):

... note that this is only supported since ...

It's not clear what this "this" refers to: signed URLs, the canonical link, the fact that load_result accepts URLs, ...?

Review comment (Member):

If no additional parameter is specified this would imply that the whole data set is expected to be loaded. Due to the large size of many data sets, this is not recommended and may be optimized by back-ends to only load the data that is actually required after evaluating subsequent processes such as filters. This means that the pixel values should be processed only after the data has been limited to the required extent and as a consequence also to a manageable size.

There are quite some implementation details here, and I'm not sure it is necessary to include them. We're talking about the results of a successful batch job, so the data size is probably already "manageable".

@m-mohr (Member, Author) commented Oct 28, 2021

I added "thumbs up" to the comments I've already fixed, see the latest commit for details.

Why this requirement? The problem with the signed URL is that it can expire or be invalidated.

When writing the original batch job result spec, we had e.g. Google Docs in mind, where signed URLs don't expire unless you invalidate them specifically.

If you don't want your process graph to break when signed URLs expire, you probably want to use a more persistent URL, e.g. a non-signed but authenticated URL

The signed URL was meant to be exactly this, as there's no good alternative yet. Or what would a "non-signed but authenticated URL" be for you? We can't really start an OIDC flow inside a process, I think.

There are quite some implementation details here, and I'm not sure it is necessary to include them. We're talking about the results of a successful batch job, so the data size is probably already "manageable"

It's basically the same level of detail we already have in load_collection (mostly copy & paste), and that was requested to be there to give users an understanding of the implications of being "lazy" and requesting everything. So I think it's good to have it there. A batch job could still be continental-scale...
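To make the laziness argument concrete, here is a hedged sketch of a process graph in which a back-end could, after evaluating the subsequent filter, load only the data inside the bounding box instead of the whole batch job result. The job ID and extent are placeholders, and filter_bbox is just one example of such a filter:

{
  "load1": {
    "process_id": "load_result",
    "arguments": {"id": "abc123"}
  },
  "filter1": {
    "process_id": "filter_bbox",
    "arguments": {
      "data": {"from_node": "load1"},
      "extent": {"west": 5.0, "south": 50.0, "east": 6.0, "north": 51.0}
    },
    "result": true
  }
}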

Review comment (Member):

fine-tuning suggestion:

Not specifying any additional parameter implies to load the whole data set, which might be expensive performance-wise. Back-ends may optimize this with lazy loading techniques, but it is generally recommended to limit the loaded data as soon as possible to the relevant spatio-temporal extent.

@m-mohr (Member, Author)

expensive performance-wise

It might be expensive in different regards; I'm not sure I'd specifically mention performance here.

lazy loading techniques

Is this detail important? I thought we are going to be less technical. ;-)

to the relevant spatio-temporal extent.

This doesn't capture the bands (and in load_collection: properties).

A merge of both?

Not specifying any additional parameter implies loading the whole data set into a data cube, which might be expensive. Back-ends may optimize this, but it is generally recommended to limit the data cubes to the required size as soon as possible.

Review comment (Member):

A bare "expensive" sounds money-related to me. In the end it will be reflected in the billing, but I guess users should be worried first about performance or computational cost.

"lazy loading" is indeed implementation detail. Fine to omit.

"required size": I wanted to avoid "size", because that is not exposed explicitly, let alone controllable by the user, they control spatio-temporal extents. The fact that the description doesn't cover bands or properties is more something "left as exercise to the reader". Possibile alternative: "data subset".

@m-mohr (Member, Author) commented Nov 4, 2021

expensive: I added performance/cost.

size: I've checked the xarray docs as an example and they also seem to call it "size", so I'd expect that a good number of users understand it. I had "subset" in the description for a moment, but felt that size is easier to understand (probably very subjective).

Review comment (Member):

It's just that "the required size" is confusing to me as a simple user: I have no idea about the actual size (is it in pixels, tiles, file bytes, memory bytes?) and I don't know what is "required" (is it required for my use case or is it a back-end limitation?). To play the devil's advocate: a naive user might also read the "size" recommendation as "I can load the whole world, if I down-sample it early enough". If you talk about extents, you don't have that interpretation.

Proposal:

... to limit the data cube to the relevant extent as soon as possible.

@m-mohr (Member, Author) commented Oct 29, 2021

FYI: I've intentionally left out property filtering for now, as it increases the complexity and I don't see us filtering on asset properties right now. We could surely add it later once Open-EO/openeo-api#398 has been implemented. With an API in the background, property filtering is much more feasible. At that point you could even do a "load_from_stac_api" process. ;-)

@m-mohr m-mohr requested review from SerRichard and removed request for sophieherrmann November 3, 2021 10:29
@m-mohr m-mohr mentioned this pull request Nov 4, 2021
@m-mohr (Member, Author) commented Nov 29, 2021

I plan to release v1.2.0 soon. Is this ready to go into this version or should we postpone and just put it into 1.3.0?

@soxofaan (Member)

I think the discussion is mainly about fine-tuning the docs, so I would not let that prevent including the current state in 1.2.0.

Current state looks fine to me.

One minor note about:

Back-ends may optimize this, but it is generally recommended to limit the data cubes to the required size as soon as possible.

I don't think it is clear for a novice user what "back-ends may optimize this" is about.
We could elaborate on that, but I think the sentence is good enough without that part:

It is generally recommended to limit the data cubes to the required size as soon as possible.

@SerRichard

I don't think it is clear for a novice user what "back-ends may optimize this" is about. We could elaborate on that, but I think the sentence is good enough without that part:

Does this not also suggest to a user that they may get different rates for the same processing on different back-ends, depending on which is optimized the best? In a distributed platform like this it's bound to be the case, but I wonder if it is beneficial to advertise it to the user (at least at the current stage).

Rest looks good to me though!

@m-mohr (Member, Author) commented Nov 30, 2021

I'm not exactly sure how to solve this, because I think it's still important to say that users should carefully choose the amount of data to load. For now, I have reverted to the original load_collection description, as that's the one that had been agreed on by all openEO.org consortium partners before, and load_collection and load_result are similar in their purpose (apart from the different data source), so they can share the same wording.

@SerRichard Yes, that could mean that costs differ between back-ends. We are defining the overall processes for openEO here, so there's no fine-tuning for openEO Platform. I don't think this is really a concern for Platform, but if it is, the Platform is free to remove this from the description.

@jdries (Contributor) commented Nov 30, 2021

So this would go into the draft for 1.2.0? Or really a full release? Do we have a confirmed implementation that supports the additional parameters?

Regarding docs: your problem is that you're writing for different audiences: backend developers vs end users.
For me, the primary docs for end users are the various guides and the dedicated client documentation. The process can and should also cater towards backend developers. If an end user chooses to use our spec as documentation, then he shouldn't be too confused when backend guidelines pop up in the description...

Regarding optimization differences: we should be very clear at this stage that users can and should try different backends to be sure that they optimize for cost, especially when they think the cost is too much or have some other bad experience.

@m-mohr (Member, Author) commented Nov 30, 2021

So this would go into the draft for 1.2.0? Or really a full release? Do we have a confirmed implementation that supports the additional parameters?

Planned to go into draft now for a full release in the next 4 weeks, but of course, as an experimental proposal (where we already have several processes included without any implementation). I can also hold it back and only release it in 1.3.0 next year if that's important for anyone. Nevertheless, the process is simply a "merge" of a well-established process (load_collection) and the old load_result, so I don't think the description can be overly wrong. Of course, we may figure out that this won't work as of now, but then I think the main mitigation is to add STAC API support instead of removing the parameters again.

The primary reason for releasing soon is that once I start the greater set of changes for vector processes, I'd like to get the existing changes out first and have a clean basis so that we don't need to wait for the vector stuff to be completed.

Regarding docs: your problem is that you're writing for different audiences: backend developers vs end users. For me, the primary docs for end users are the various guides and the dedicated client documentation. The process can and should also cater towards backend developers. If an end user chooses to use our spec as documentation, then he shouldn't be too confused when backend guidelines pop up in the description...

Don't agree here. The process descriptions are primarily targeted towards end users and were always written as such. It's what they see first when looking for a specific process in most of our tooling (the only exception could be the Python client, not sure). Back-end implementation details can be added to the implementation guide we recently added to the repository. If there's anything I should add for this process to the implementation guide, please let me know what could be useful to have there.

@m-mohr m-mohr merged commit 5abad4b into draft Dec 1, 2021
@m-mohr m-mohr deleted the issue-220 branch December 1, 2021 14:19
Successfully merging this pull request may close these issues.

load_result: Align with load_collection
5 participants