load_result: Load by URL and filter by extents and bands #292
Conversation
Adding filtering by temporal / spatial extent and bands sounds good. For clarification, how would load_result via a signed URL work? Do we just download the job result, or could we somehow access the job result metadata as well? I think metadata could be helpful for correctly loading the data.
load_result requests the signed URL of the metadata, so you can use that for filtering as described in the process.

Then that's perfect.
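For illustration, the batch job result metadata that such a signed URL points to is typically a STAC document along these lines (a rough, hedged sketch: the job ID, URLs, extents, and asset name below are made up, and a real response will contain more fields):

```json
{
  "stac_version": "1.0.0",
  "type": "Collection",
  "id": "a3cca2b2",
  "description": "Batch job results (illustrative only)",
  "license": "proprietary",
  "extent": {
    "spatial": { "bbox": [[16.1, 47.2, 16.6, 48.6]] },
    "temporal": { "interval": [["2021-01-01T00:00:00Z", "2021-04-01T00:00:00Z"]] }
  },
  "links": [
    {
      "rel": "canonical",
      "href": "https://example.openeo.org/results/a3cca2b2?sig=abc123",
      "type": "application/json"
    }
  ],
  "assets": {
    "result.tif": {
      "href": "https://example.openeo.org/results/a3cca2b2/result.tif?sig=def456",
      "type": "image/tiff; application=geotiff; profile=cloud-optimized"
    }
  }
}
```

A back-end implementing load_result could read the spatial and temporal extents (and, where available, band information) from this metadata to apply the requested filters before downloading the assets.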
proposals/load_result.json
@@ -1,7 +1,7 @@
 {
     "id": "load_result",
     "summary": "Load batch job results",
-    "description": "Loads batch job results by job id from the server-side user workspace. The job must have been stored by the authenticated user on the back-end currently connected to.",
+    "description": "Loads batch job results and returns them as a processable data cube. A batch job result can be loaded by ID or URL:\n\n* **ID**: A batch job identifier available on the server-side user workspace. The job must have been stored by the authenticated user on the back-end currently connected to.\n* **URL**: The URL to the STAC metadata for a batch job result. This needs to be the signed URL that is provided by some back-ends through the `canonical` link relation in the batch job result metadata. Please note that this is only supported since openEO API version 1.1.0.\n\nIf supported by the underlying metadata and file format, the data that is added to the data cube can be restricted with the parameters `spatial_extent`, `temporal_extent` and `bands`.\n\n**Remarks:**\n\n* The bands (and all dimensions that specify nominal dimension labels) are expected to be ordered as specified in the metadata if the `bands` parameter is set to `null`.\n* If no additional parameter is specified this would imply that the whole data set is expected to be loaded. Due to the large size of many data sets, this is not recommended and may be optimized by back-ends to only load the data that is actually required after evaluating subsequent processes such as filters. This means that the pixel values should be processed only after the data has been limited to the required extent and as a consequence also to a manageable size.",
> batch job identifier available on the server-side user workspace. The job must have been stored by the ...

"id available in workspace" is a bit weird here I think.
The concept "workspace" is used in the API in the "files" sub-API, but not for batch jobs. I don't think we specify that batch job results should be accessible directly through this "files" workspace. Also, the identifier itself is just a string listed in the job listing, so I don't think it's necessary to describe it in terms of workspaces and storage.
Sidenote: I think it should also be mentioned that the job must have completed successfully before results can be loaded.
> This needs to be the signed URL

Why this requirement?
Problem with the signed URL is that it can expire or be invalidated. If you don't want your process graph to break when signed URLs expire, you probably want to use a more persistent URL, e.g. a non-signed but authenticated URL.
> ... note that this is only supported since ...

It's not clear what this "this" refers to: signed URLs, the canonical link, the fact that load_result accepts URLs, ...?
> If no additional parameter is specified this would imply that the whole data set is expected to be loaded. Due to the large size of many data sets, this is not recommended and may be optimized by back-ends to only load the data that is actually required after evaluating subsequent processes such as filters. This means that the pixel values should be processed only after the data has been limited to the required extent and as a consequence also to a manageable size.

There are quite some implementation details here, and I'm not sure it is necessary to include them. We're talking about results of a successful batch job, so the data size is probably already "manageable".
I added "thumbs up" to the comments I've already fixed, see the latest commit for details.
why this requirement? Problem with the signed URL is that it can expire or be invalidated.
When writing the original batch job result spec we had e.g. Google Docs in mind where signed URLs don't expire unless you invalidate them specifically.
If you don't want your process graph to break when signed URLs expire, you probably want to use a more persistent UR, e.g. a non-signed but authenticated URL
The signed URL was meant to be exactly this as there's no good alternative yet or what would a "non-signed but authenticated URL" be for you? We can't really start an OIDC flow inside a process, I think.
There are quite some implementation details here, and I'm not sure if it is necessary to do so. We're talking about results of a successful batch job, so the data size is probably already "manageable"
It's basically the same level of detail we have already in load_collection (mostly copy&paste) and that was requested to be there to give users an understanding of the implications of being "lazy" and requesting everything. So I think it good to have it there. A batch job could still be continental-scale...
fine-tuning suggestion:
> Not specifying any additional parameter implies to load the whole data set, which might be expensive performance-wise. Back-ends may optimize this with lazy loading techniques, but it is generally recommended to limit the loaded data as soon as possible to the relevant spatio-temporal extent.
> expensive performance-wise

It might be expensive in different regards; I'm not sure I'd specifically mention performance here.

> lazy loading techniques

Is this detail important? I thought we were going to be less technical. ;-)

> to the relevant spatio-temporal extent.

This doesn't capture the bands (and in load_collection: properties).
A merge of both?
> Not specifying any additional parameter implies loading the whole data set into a data cube, which might be expensive. Back-ends may optimize this, but it is generally recommended to limit the data cubes to the required size as soon as possible.
A bare "expensive" sounds money-related to me. In the end it will be reflected in the billing, but I guess users should be worried first about performance or computational cost.
"lazy loading" is indeed implementation detail. Fine to omit.
"required size": I wanted to avoid "size", because that is not exposed explicitly, let alone controllable by the user, they control spatio-temporal extents. The fact that the description doesn't cover bands or properties is more something "left as exercise to the reader". Possibile alternative: "data subset".
expensive: I added performance/cost.
size: I've checked the xarray docs as an example and they seem to also call it "size", so I'd expect that a good number of users understand it. I had "subset" in the description for a moment, but felt like "size" is easier to understand (probably very subjective).
It's just that "the required size" is confusing to me as a simple user: I have no idea about the actual size (is it in pixels, tiles, file bytes, memory bytes?) and I don't know what is "required" (is it required for my use case or is it a back-end limitation?). To play the devil's advocate: a naive user might also read the "size" recommendation as "I can load the whole world, if I down-sample it early enough". If you talk about extents, you don't have that interpretation.
proposal:
> ... to limit the data cube to the relevant extent as soon as possible.
FYI: I've left out property filtering intentionally for now, as it increases the complexity and I don't see us filtering on asset properties right now. We could surely add it later once Open-EO/openeo-api#398 has been implemented. With an API in the background, property filtering is much more feasible. At that point you could even do a "load_from_stac_api" process. ;-)
I plan to release v1.2.0 soon. Is this ready to go into this version or should we postpone and just put it into 1.3.0?
I think the discussion is mainly about fine-tuning the docs, so I would not let that prevent including the current state in 1.2.0. The current state looks fine to me. One minor note: I don't think it is clear to a novice user what "back-ends may optimize this" is about.
Does this not also come across to a user that they may get different rates for the same processing on different back-ends, depending on which is optimized the best? In a distributed platform like this it's bound to be the case, but I wonder if it is beneficial to advertise it to the user (at least at the current stage). Rest looks good to me though!
I'm not exactly sure how to solve this, because I think it's still important to say that users should carefully choose the amount of data to load. For now, I have reverted to the original load_collection description, as that's the one that had been agreed on by all openEO.org consortium partners before, and load_collection/load_result are similar in purpose except for the data source, so they can share the same wording. @SerRichard Yes, that could mean that costs differ between back-ends. We are defining the overall processes for openEO here, so there's no fine-tuning for openEO Platform. I don't think this is really a concern for Platform, but if it is, the Platform is free to remove this from the description.
So this would go into draft for 1.2.0? Or really a full release? Do we have a confirmed implementation that supports the additional parameters? Regarding docs: your problem is that you're writing for different audiences, back-end developers vs. end users. Regarding optimization differences: we should be very clear at this stage that users can and should try different back-ends to be sure that they optimize for cost, especially when they think the cost is too high or have some other bad experience.
It's planned to go into draft now for a full release in the next 4 weeks, but of course as an experimental proposal (we already have several processes included without any implementation). I can also hold it back and only release it in 1.3.0 next year if that's important for anyone. Nevertheless, the process is simply a "merge" of a well-established process (load_collection) and the old load_result, so I don't think the description can be overly wrong. Of course, we may figure out that this won't work as of now, but then I think the main mitigation is to add STAC API support instead of removing the parameters again. The primary reason for releasing soon is that once I start the greater set of changes for the vector processes, I'd like to get the existing changes out first and have a clean basis, so that we don't need to wait for the vector stuff to be completed.

I don't agree here. The process descriptions are primarily targeted towards end users and were always written as such. It's what they see first when looking for a specific process in most of our tooling (the only exception could be the Python client, I'm not sure). Back-end implementation details can be added to the implementation guide we recently added to the repository. If there's anything I should add for this process to the implementation guide, please let me know what could be useful to have there.
Implements #220; see #220 (comment) for some details.