
Resolve from_parameter correctly #24

Closed

claxn opened this issue Aug 4, 2020 · 6 comments

claxn (Collaborator) commented Aug 4, 2020

The parser currently resolves all from_parameter tags with either the default value of the parent process or the input node ID of the parent process. This makes it easy for back-ends to determine the input and output parameters of a process node without traversing the graph. However, this caused some discrepancies for certain processes, as mentioned in issues #23 and Open-EO/openeo-processes#184.

To summarise, the parser traverses the whole graph and sets up the corresponding node dependencies. These dependencies are set either with 'from_node' (referencing a node on the same level) or 'from_parameter' (referencing data passed in through a callback). An example would be the following process graph:

{
  "process_graph": {
    "1": {
      "process_id": "load_collection",
      "arguments": {
        "id": "test",
        "spatial_extent": {
          "east": 11.41,
          "south": 46.40,
          "north": 46.55,
          "west": 11.25
        },
        "temporal_extent": [
          "2015-11-06T00:00:00.000Z",
          "2016-09-25T00:00:00.000Z"
        ]
      }
    },
    "2": {
      "process_id": "filter_bands",
      "arguments": {
        "data": {
          "from_node": "1"
        },
        "bands": [
          "VH"
        ],
        "common_names": [],
        "wavelengths": []
      }
    },
    "3": {
      "process_id": "reduce_dimension",
      "arguments": {
        "data": {
          "from_node": "2"
        },
        "dimension": "t",
        "reducer": {
          "process_graph": {
            "max1": {
              "process_id": "max",
              "arguments": {
                "data": {
                  "from_parameter": "data"
                }
              },
              "result": true
            }
          }
        }
      },
      "result": true
    }
  }
}

This is translated into the following representation:

Node ID: 1_0 
Node Name: 1 
{'arguments': {'id': 'test',
               'spatial_extent': {'east': 11.41,
                                  'north': 46.55,
                                  'south': 46.4,
                                  'west': 11.25},
               'temporal_extent': ['2015-11-06T00:00:00.000Z',
                                   '2016-09-25T00:00:00.000Z']},
 'process_id': 'load_collection'}

Node ID: 2_1 
Node Name: 2 
{'arguments': {'bands': ['VH'],
               'common_names': [],
               'data': {'from_node': '1_0'},
               'wavelengths': []},
 'process_id': 'filter_bands'}

Node ID: 3_2 
Node Name: 3 
{'arguments': {'data': {'from_node': '2_1'},
               'dimension': 't',
               'reducer': {'from_node': 'max1_3'}},
 'process_id': 'reduce_dimension',
 'result': True}

Node ID: max1_3 
Node Name: max1 
{'arguments': {'data': {'from_node': '2_1'}},
 'process_id': 'max',
 'result': True}

As you can see, 'from_node' is replaced with the unique node ID of the parent process. The same is done for 'from_parameter': 'data': {'from_parameter': 'data'} is replaced with 'data': {'from_node': '2_1'}. It was implemented this way because the parent process (in this case reduce_dimension) simply passes its input data on to the callback function anyway, which makes it easier to evaluate the callback process max directly.
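
To make the old behaviour concrete, here is a minimal, hypothetical sketch in Python (illustrative only, not the actual parser code) of what this resolution step amounts to: any 'from_parameter' reference inside a callback node is rewritten into a 'from_node' reference pointing at the data dependency of the parent process.

def resolve_from_parameter_old(callback_args, parent_data_node_id):
    """Old behaviour: short-circuit callback parameters to the parent's data input.

    callback_args        -- 'arguments' dict of a callback node, e.g. {'data': {'from_parameter': 'data'}}
    parent_data_node_id  -- unique ID of the node feeding the parent process, e.g. '2_1'
    """
    for name, value in callback_args.items():
        if isinstance(value, dict) and "from_parameter" in value:
            callback_args[name] = {"from_node": parent_data_node_id}
    return callback_args

# Example matching the graph above: the 'max1' node ends up depending on node '2_1'
print(resolve_from_parameter_old({"data": {"from_parameter": "data"}}, "2_1"))
# -> {'data': {'from_node': '2_1'}}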

However, for processes such as merge_cubes the relationship becomes more complicated and the back-end needs to decide what to do with the input data. What is your opinion on how the parser should handle this in general? I think the feature above is nice to have, but on the other hand it causes some inconsistencies for binary processes.

m-mohr (Member) commented Aug 4, 2020

Resolving the node max1_3 and setting the data to reference node 2_1 is completely wrong behavior. You are expecting an array of values for a certain dimension (t), but you get the whole data cube. That's like me requesting 5€ from you and you giving me access to your whole bank account so I can grab the 5€ from it myself. So you should only pass the corresponding values, not everything. The implementation of reduce_dimension needs to get the corresponding values from the data cube and pass them to the sub-process graph so that they can be made available as the parameter data.

So node max1_3 must still be:

Node ID: max1_3 
Node Name: max1 
{'arguments': {'data': {'from_parameter': 'data'}},
 'process_id': 'max',
 'result': True}

after parsing.
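
To illustrate what "pass only the corresponding values" means in practice, here is a minimal sketch (hypothetical back-end code with illustrative names, using numpy only for brevity) of a reduce_dimension implementation that slices the cube along the reduced dimension and hands each slice to the callback as the parameter data:

import numpy as np

def run_callback(callback, parameters):
    # stand-in for executing the 'reducer' sub-process graph with named parameters
    return callback(parameters["data"])

def reduce_dimension(cube, dimension_axis, callback):
    # each callback invocation only sees the 1-D array of values along the
    # reduced dimension, never the whole data cube
    return np.apply_along_axis(
        lambda values: run_callback(callback, {"data": values}),
        dimension_axis,
        cube,
    )

# Example: reducing the 't' axis (axis 0) of a (t, y, x) cube with 'max'
cube = np.arange(2 * 3 * 4).reshape(2, 3, 4)
print(reduce_dimension(cube, 0, np.max).shape)  # (3, 4)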

m-mohr removed their assignment Aug 4, 2020
claxn (Collaborator, Author) commented Aug 5, 2020

Yeah, I get your point from the formal side. This implementation was a bit biased towards how you would implement such processes in Python, but it shouldn't be, of course. I will change this behaviour in a new branch.

clausmichele (Member) commented:

Actually, I like how the parser translates the graph, even if formally it is wrong behavior. With the current implementation it was fairly easy for me to deal with the graph in my opendatacube-based back-end. But the merge_cubes parsing has to be addressed somehow.
Replying to @m-mohr: in my opinion it is not that bad to pass the whole datacube/array by reference rather than by copy. I think it totally depends on the back-end implementation and how datacubes are handled; in my case I have many datacubes, one for each partial result.

m-mohr (Member) commented Aug 5, 2020

> Actually, I like how the parser translates the graph, even if formally it is wrong behavior. With the current implementation it was fairly easy for me to deal with the graph in my opendatacube-based back-end. But the merge_cubes parsing has to be addressed somehow.

It should be solved correctly, otherwise you need to find workarounds for every bit that is different. And there are some differences; you'll run into issues eventually. Also, you can still do it similarly, but that should be driven by the process implementations, not by the parser. Only in the process implementation can you act according to the definitions.

> Replying to @m-mohr: in my opinion it is not that bad to pass the whole datacube/array by reference rather than by copy. I think it totally depends on the back-end implementation and how datacubes are handled; in my case I have many datacubes, one for each partial result.

I'm not saying you shouldn't pass by reference in general; I'm just saying that the parser shouldn't mandate any behavior based on (wrong) assumptions. That will eventually fail. As said above, it should be driven by the process implementations. If it's better to pass a whole data cube in a reduce process, but not in merge_cubes, that decision should be made by the process itself. We do the same in GEE and it works quite well (except that merge_cubes has a poor implementation). But that's not a PG parser issue, it's a process implementation issue.
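
To illustrate "that decision should be made by the process itself" for the binary case raised above: in openeo-processes, merge_cubes takes an overlap_resolver callback with the parameters x and y, and it is the merge_cubes implementation that decides which overlapping values get bound to those parameters. A minimal, hypothetical sketch (illustrative names, not an actual back-end API):

import numpy as np

def merge_cubes(cube1, cube2, overlap_resolver=None):
    # take values from cube1 where present, otherwise from cube2
    merged = np.where(np.isnan(cube1), cube2, cube1)
    overlap = ~np.isnan(cube1) & ~np.isnan(cube2)
    if overlap.any():
        if overlap_resolver is None:
            raise ValueError("overlapping values but no overlap_resolver given")
        # the process itself binds the callback parameters 'x' and 'y',
        # value by value; the parser never touches these references
        merged[overlap] = [
            overlap_resolver({"x": x, "y": y})
            for x, y in zip(cube1[overlap], cube2[overlap])
        ]
    return merged

# Example: resolve overlaps with the mean of x and y
a = np.array([[1.0, np.nan], [3.0, np.nan]])
b = np.array([[5.0, 2.0], [np.nan, np.nan]])
print(merge_cubes(a, b, lambda p: (p["x"] + p["y"]) / 2))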

Also, I see the Python and JS parsers as reference implementation for openEO processes. If we want people to adopt it, then it should behave according to the spec, not on any assumptions.

clausmichele (Member) commented:

Was there any development on this? @claxn @lforesta

claxn pushed a commit that referenced this issue Oct 17, 2020
…ency structure of nodes by introducing "callback", "process" and "data" edges.; implemented new topological sorting algorithm
claxn (Collaborator, Author) commented Oct 17, 2020

@m-mohr @clausmichele @lforesta: Following our discussion a few months ago, I have finally implemented the correct (at least I hope so) behaviour. For the example above this yields:

Node ID: 1_0 
Node Name: 1 
{'arguments': {'id': 'test',
               'spatial_extent': {'east': 11.41,
                                  'north': 46.55,
                                  'south': 46.4,
                                  'west': 11.25},
               'temporal_extent': ['2015-11-06T00:00:00.000Z',
                                   '2016-09-25T00:00:00.000Z']},
 'process_id': 'load_collection'}

Node ID: 2_1 
Node Name: 2 
{'arguments': {'bands': ['VH'],
               'common_names': [],
               'data': {'from_node': '1_0'},
               'wavelengths': []},
 'process_id': 'filter_bands'}

Node ID: 3_2 
Node Name: 3 
{'arguments': {'data': {'from_node': '2_1'},
               'dimension': 't',
               'reducer': {'from_node': 'max1_3'}},
 'process_id': 'reduce_dimension',
 'result': True}

Node ID: max1_3 
Node Name: max1 
{'arguments': {'data': {'from_parameter': 'data'}},
 'process_id': 'max',
 'result': True}
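
With 'from_parameter' now left untouched by the parser, resolving it becomes the back-end's job at evaluation time. A minimal, hypothetical sketch of that lookup (illustrative names, not code from this repository):

def resolve_argument(arg, node_results, callback_parameters):
    # 'from_node' is looked up in already-computed node results;
    # 'from_parameter' is only resolved from the parameters supplied by the
    # parent process (e.g. reduce_dimension) when it invokes the callback
    if isinstance(arg, dict) and "from_node" in arg:
        return node_results[arg["from_node"]]
    if isinstance(arg, dict) and "from_parameter" in arg:
        return callback_parameters[arg["from_parameter"]]
    return arg  # literal value

# Example for the 'max1' node: 'data' is resolved from the callback parameters,
# not from node '2_1'
max1_arguments = {"data": {"from_parameter": "data"}}
resolved = {name: resolve_argument(value, {}, {"data": [3, 7, 5]})
            for name, value in max1_arguments.items()}
print(resolved)  # {'data': [3, 7, 5]}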

Please test if everything works for you (see #29) and close this issue if that's the case, thanks!

claxn mentioned this issue Oct 27, 2020
claxn closed this as completed Nov 10, 2020