load_result: Align with load_collection #220

Closed
m-mohr opened this issue Jan 8, 2021 · 8 comments · Fixed by #292
Labels: platform, question, vector

@m-mohr
Member

m-mohr commented Jan 8, 2021

load_result (and load_uploaded_files) can only load a full result, but not filter extents like in load_collection. Should load_result be aligned (i.e. add spatial, temporal extents and metadata filters?)

@m-mohr m-mohr added the question label Jan 8, 2021
@soxofaan
Member

We already have processes like filter_temporal, filter_bbox and filter_bands to cover that, no?

> load_result can only load a full result,

Abstractly speaking, yes, but a backend can take a lazy-loading approach and restrict the actual loading to the constraints given by subsequent filter_ processes. Lazy loading is of course less straightforward to implement than handling direct load_result arguments.

The same question can be raised about load_uploaded_files, FYI.
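
To make the lazy-loading idea above concrete, here is a minimal sketch (not any back-end's actual implementation) of how the constraints from filter_ nodes that directly follow a load_result node could be collected before reading any data. The job ID, the extents, the helper name collect_load_constraints and the constraint names are assumptions made up for illustration.

```python
from typing import Any, Dict

# Flat openEO-style process graph: load a batch job result, then filter it.
# The job ID and the extents are made-up illustration values.
process_graph: Dict[str, Dict[str, Any]] = {
    "load": {
        "process_id": "load_result",
        "arguments": {"id": "example-job-id"},
    },
    "time": {
        "process_id": "filter_temporal",
        "arguments": {
            "data": {"from_node": "load"},
            "extent": ["2020-01-01", "2020-02-01"],
        },
    },
    "bbox": {
        "process_id": "filter_bbox",
        "arguments": {
            "data": {"from_node": "time"},
            "extent": {"west": 5.0, "south": 51.0, "east": 5.1, "north": 51.1},
        },
        "result": True,
    },
}

# Filter process -> (constraint name, argument holding the constraint).
# The constraint names are hypothetical and simply mirror load_collection.
FILTER_TO_CONSTRAINT = {
    "filter_temporal": ("temporal_extent", "extent"),
    "filter_bbox": ("spatial_extent", "extent"),
    "filter_bands": ("bands", "bands"),
}


def collect_load_constraints(graph: Dict[str, Dict[str, Any]],
                             load_node_id: str) -> Dict[str, Any]:
    """Follow the chain of filter_ nodes consuming the load node and
    gather their arguments as constraints for the actual read."""
    constraints: Dict[str, Any] = {}
    current = load_node_id
    while True:
        consumers = [
            (node_id, node) for node_id, node in graph.items()
            if node["arguments"].get("data") == {"from_node": current}
        ]
        # Stop at a fork, at the end of the graph, or at a non-filter process.
        if len(consumers) != 1:
            break
        node_id, node = consumers[0]
        mapping = FILTER_TO_CONSTRAINT.get(node["process_id"])
        if mapping is None:
            break
        name, argument = mapping
        # Simplification: the first filter of each kind wins; a real
        # implementation would have to intersect repeated filters.
        constraints.setdefault(name, node["arguments"][argument])
        current = node_id
    return constraints


constraints = collect_load_constraints(process_graph, "load")
```

With such constraints at hand, a back-end could read only the matching part of the result, which is what direct load_result parameters would request explicitly.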

@m-mohr
Member Author

m-mohr commented Jan 11, 2021

> We already have processes like filter_temporal, filter_bbox and filter_bands to cover that, no?

Yes, but the argument for adding those parameters to load_collection was that otherwise you have to load all the data first, only to discard much of it afterwards. That is why we added them, and I think the argument still holds, although I'm aware that VITO optimizes filter_ operations more than other back-ends.

> The same question can be raised about load_uploaded_files, FYI.

Indeed, added above.

@m-mohr m-mohr changed the title from "load_result: Align with load_collection" to "load_result/load_uploaded_files: Align with load_collection" Jan 11, 2021
@m-mohr
Member Author

m-mohr commented Apr 9, 2021

Related issue: Open-EO/openeo-api#376

@m-mohr m-mohr added this to the 1.1.0 milestone Apr 12, 2021
@m-mohr m-mohr added the vector label Apr 21, 2021
@m-mohr
Member Author

m-mohr commented Apr 27, 2021

With the recent discussions around #241 and read/get_vector in the Python driver, I'm wondering whether we should aim for something generic that doesn't care about the actual place the data is stored and can load cubes from different sources. I'm not sure we actually need to have specific functions for all of them. It would avoid the issue in #241 a bit at least.

Something like load_collection, but instead of specifying a collection ID, you specify either a URL, a file in the user workspace on the back-end, or a result (which is basically also just a URL for which you check whether the current user can access it).

import(location, ?spatial_extent, ?temporal_extent, ?bands, ?filter) -> raster/vector cube
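
As a rough illustration of the signature above, the following sketch builds a process graph node for the proposed process. The process import is only a proposal in this thread, so the process ID, the argument names taken from the signature, the helper import_node and the example URL are all hypothetical.

```python
from typing import Any, Dict, List, Optional


def import_node(
    location: Any,
    spatial_extent: Optional[Dict[str, float]] = None,
    temporal_extent: Optional[List[Optional[str]]] = None,
    bands: Optional[List[str]] = None,
    filter_: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build a node for the proposed (hypothetical) import process."""
    optional = {
        "spatial_extent": spatial_extent,
        "temporal_extent": temporal_extent,
        "bands": bands,
        "filter": filter_,
    }
    arguments: Dict[str, Any] = {"location": location}
    # Only pass on the optional arguments that were actually given.
    arguments.update({k: v for k, v in optional.items() if v is not None})
    return {"process_id": "import", "arguments": arguments}


# Made-up example: load two bands of a remote file within a bounding box.
node = import_node(
    "https://example.com/data/scene.tif",
    spatial_extent={"west": 16.1, "south": 48.1, "east": 16.6, "north": 48.6},
    bands=["B04", "B08"],
)
```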

@soxofaan
Member

That sounds like an interesting solution.

But then the "collection_id" (or "location" in your snippet) probably has to become a more complex argument: a string that is either a traditional collection_id or some kind of URL, or optionally an object that allows setting additional load options (e.g. filename globs, file type whitelists or blacklists, ...).

The next problem is then probably that you need a new "capabilities endpoint" where a backend can declare which kinds of "locations" it supports.

@m-mohr
Member Author

m-mohr commented Apr 28, 2021

I'm not sure this was clear, but I'd leave load_collection untouched and not allow (internal) collection IDs in import.

The location parameter could be defined with the following schema:

{
	"name": "location",
	"description": "...",
	"schema": [
		{
			"title": "Multiple files on server-side user workspace",
			"type": "array",
			"subtype": "file-paths",
			"items": {
				"type": "string",
				"subtype": "file-path",
				"pattern": "^[^\r\n\\:'\"]+$"
			}
		},
		{
			"title": "Single file on server-side user workspace (we may want to remove this for simplicity)",
			"type": "string",
			"subtype": "file-path",
			"pattern": "^[^\r\n\\:'\"]+$"
		},
		{
			"title": "Remote files (Absolute URL)",
			"type": "string",
			"subtype": "uri",
			"pattern": "^(http|https|s3)://"
		},
		{
			"title": "Batch Job ID",
			"type": "string",
			"subtype": "job-id",
			"description": "A batch job id, either one of the jobs a user has stored or a publicly available job.",
			"pattern": "^[\\w\\-\\.~]+$"
		}
	]
}
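
To see how the string alternatives of this schema behave, here is a small sketch that tests a value against the three string patterns copied from the schema above. The helper matching_subtypes and the sample values are made up for illustration, and the array alternative is left out. The re.ASCII flag is an assumption to approximate the ASCII-only \w of JSON Schema regular expressions.

```python
import re
from typing import List

# String patterns copied from the schema alternatives above.
PATTERNS = {
    "file-path": re.compile("^[^\r\n\\:'\"]+$"),
    "uri": re.compile(r"^(http|https|s3)://"),
    "job-id": re.compile(r"^[\w\-\.~]+$", re.ASCII),
}


def matching_subtypes(location: str) -> List[str]:
    """Return the subtypes whose pattern accepts the given string."""
    return [name for name, pattern in PATTERNS.items()
            if pattern.search(location)]


remote = matching_subtypes("https://example.com/data.tif")
nested = matching_subtypes("folder/file.tif")
ambiguous = matching_subtypes("result.tif")
```

A plain name such as result.tif satisfies both the file-path and the job-id pattern, so the patterns alone cannot tell those alternatives apart.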

Yes, we have the issue that we overload some data types a bit in a number of processes.

If we want to have very specific things like globs, then this idea doesn't work very well and I'd say we need to stick with individual processes.

@m-mohr
Member Author

m-mohr commented Jun 4, 2021

@aljacob @lforesta @sophieherrmann I heard this (load_result) being discussed today as being required by a use-case. I didn't have that on my list as being required for UC3 or 6 in openEO Platform. Could you please clarify?

@m-mohr m-mohr added the platform label Jun 4, 2021
@m-mohr m-mohr modified the milestones: 1.2.0, 1.3.0 Oct 25, 2021
@m-mohr
Member Author

m-mohr commented Oct 26, 2021

I've got a PR up, although it now covers only load_result. Reasoning:

  1. For load_uploaded_files, I now think we don't need additional parameters, as I'd expect that someone uploads only what is needed for a use case and then doesn't need additional filtering in the process. If that rare use case comes up anyway, use the filter functions.
  2. As such, I've simply added spatial_extent, temporal_extent and bands to load_result and now allow loading by URL. How well these additions work depends heavily on the underlying data structure: I'd think you could easily filter a remote COG by spatial_extent, but for bands and temporal_extent you need proper metadata, and for other file formats this may not work at all. Also, loading by URL is not available on all back-ends (it was only introduced in API v1.1.0). I've excluded property filtering for now as it would require an API to be present: "STAC API for batch job results?" (openeo-api#398) or "User-generated Collections" (openeo-api#376).

See PR #292.
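
For illustration, a load_result node using the parameters named above might look roughly like the following sketch. The argument name id for the job ID or URL is an assumption based on the existing process, and the URL, extents and band names are made up; the PR is the authoritative definition.

```python
import json

load_result_node = {
    "process_id": "load_result",
    "arguments": {
        # Assumption: the job ID or results URL goes into "id" (made-up URL).
        "id": "https://example.com/results/job-123",
        "spatial_extent": {"west": 16.1, "south": 48.1,
                           "east": 16.6, "north": 48.6},
        "temporal_extent": ["2021-01-01", "2021-02-01"],
        "bands": ["B04", "B08"],
    },
    "result": True,
}

# Serialize as a single-node process graph, e.g. for sending to a back-end.
process_graph_json = json.dumps({"load1": load_result_node}, indent=2)
```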

@m-mohr m-mohr linked a pull request Oct 26, 2021 that will close this issue
@m-mohr m-mohr modified the milestones: 1.3.0, 1.2.0 Oct 26, 2021
@m-mohr m-mohr changed the title from "load_result/load_uploaded_files: Align with load_collection" to "load_result: Align with load_collection" Nov 16, 2021
m-mohr added a commit that referenced this issue Dec 1, 2021
* Improve load_result #220 and other minor alignments
@m-mohr m-mohr closed this as completed Dec 1, 2021