
Resolve from_parameter correctly #24

Closed

claxn opened this issue Aug 4, 2020 · 6 comments

claxn (Collaborator) commented Aug 4, 2020

The parser currently resolves all from_parameter tags with either the default value of the parent process or the input node ID of the parent process. This makes it easy for back-ends to determine the input and output parameters of a process node without traversing the graph. However, this caused some discrepancies for certain processes, as mentioned in issues #23 and Open-EO/openeo-processes#184.

To summarise, the parser traverses the whole graph and sets up the corresponding node dependencies. These dependencies are set either with 'from_node' (referencing a node on the same level) or 'from_parameter' (referencing data passed in through a callback). An example would be the following process graph:

{
  "process_graph": {
    "1": {
      "process_id": "load_collection",
      "arguments": {
        "id": "test",
        "spatial_extent": {
          "east": 11.41,
          "south": 46.40,
          "north": 46.55,
          "west": 11.25
        },
        "temporal_extent": [
          "2015-11-06T00:00:00.000Z",
          "2016-09-25T00:00:00.000Z"
        ]
      }
    },
    "2": {
      "process_id": "filter_bands",
      "arguments": {
        "data": {
          "from_node": "1"
        },
        "bands": [
          "VH"
        ],
        "common_names": [],
        "wavelengths": []
      }
    },
    "3": {
      "process_id": "reduce_dimension",
      "arguments": {
        "data": {
          "from_node": "2"
        },
        "dimension": "t",
        "reducer": {
          "process_graph": {
            "max1": {
              "process_id": "max",
              "arguments": {
                "data": {
                  "from_parameter": "data"
                }
              },
              "result": true
            }
          }
        }
      },
      "result": true
    }
  }
}

This is translated into the following representation:

Node ID: 1_0 
Node Name: 1 
{'arguments': {'id': 'test',
               'spatial_extent': {'east': 11.41,
                                  'north': 46.55,
                                  'south': 46.4,
                                  'west': 11.25},
               'temporal_extent': ['2015-11-06T00:00:00.000Z',
                                   '2016-09-25T00:00:00.000Z']},
 'process_id': 'load_collection'}

Node ID: 2_1 
Node Name: 2 
{'arguments': {'bands': ['VH'],
               'common_names': [],
               'data': {'from_node': '1_0'},
               'wavelengths': []},
 'process_id': 'filter_bands'}

Node ID: 3_2 
Node Name: 3 
{'arguments': {'data': {'from_node': '2_1'},
               'dimension': 't',
               'reducer': {'from_node': 'max1_3'}},
 'process_id': 'reduce_dimension',
 'result': True}

Node ID: max1_3 
Node Name: max1 
{'arguments': {'data': {'from_node': '2_1'}},
 'process_id': 'max',
 'result': True}

As you can see, 'from_node' is replaced with the unique node ID of the parent process. The same is done for 'from_parameter': 'data': {'from_parameter': 'data'} is replaced with 'data': {'from_node': '2_1'}. It was implemented this way because the parent process (in this case reduce_dimension) simply passes its input data on to the callback function anyway, which makes it easier to evaluate the callback process max directly.
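
To make the old behaviour concrete, here is a minimal, hypothetical sketch in Python (illustrative only, not the actual parser code) of what this resolution step amounts to: any 'from_parameter' reference inside a callback node is rewritten into a 'from_node' reference pointing at the data dependency of the parent process.

def resolve_from_parameter_old(callback_args, parent_data_node_id):
    """Old behaviour: short-circuit callback parameters to the parent's data input.

    callback_args        -- 'arguments' dict of a callback node, e.g. {'data': {'from_parameter': 'data'}}
    parent_data_node_id  -- unique ID of the node feeding the parent process, e.g. '2_1'
    """
    for name, value in callback_args.items():
        if isinstance(value, dict) and "from_parameter" in value:
            callback_args[name] = {"from_node": parent_data_node_id}
    return callback_args

# Example matching the graph above: the 'max1' node ends up depending on node '2_1'
print(resolve_from_parameter_old({"data": {"from_parameter": "data"}}, "2_1"))
# -> {'data': {'from_node': '2_1'}}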

However, for processes such as merge_cubes the relationship becomes more complicated and the back-end needs to decide what to do with the input data. What is your opinion on how the parser should handle this in general? I think the feature above is nice to have, but on the other hand it causes some inconsistencies for binary processes.

m-mohr (Member) commented Aug 4, 2020

Resolving the node max1_3 and setting the data to reference node 2_1 is completely wrong behavior. You are expecting an array of values for a certain dimension (t), but you get the whole data cube. That's like me requesting 5€ from you and you giving me access to your whole bank account so I can grab the 5€ from it myself. So you should only pass the corresponding values, not everything. The implementation of reduce_dimension needs to get the corresponding values from the data cube and pass them to the sub-process graph so that they can be made available as the parameter data.

So node max1_3 must still be:

Node ID: max1_3 
Node Name: max1 
{'arguments': {'data': {'from_parameter': 'data'}},
 'process_id': 'max',
 'result': True}

after parsing.
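
To illustrate what "pass only the corresponding values" means in practice, here is a minimal sketch (hypothetical back-end code with illustrative names, using numpy only for brevity) of a reduce_dimension implementation that slices the cube along the reduced dimension and hands each slice to the callback as the parameter data:

import numpy as np

def run_callback(callback, parameters):
    # stand-in for executing the 'reducer' sub-process graph with named parameters
    return callback(parameters["data"])

def reduce_dimension(cube, dimension_axis, callback):
    # each callback invocation only sees the 1-D array of values along the
    # reduced dimension, never the whole data cube
    return np.apply_along_axis(
        lambda values: run_callback(callback, {"data": values}),
        dimension_axis,
        cube,
    )

# Example: reducing the 't' axis (axis 0) of a (t, y, x) cube with 'max'
cube = np.arange(2 * 3 * 4).reshape(2, 3, 4)
print(reduce_dimension(cube, 0, np.max).shape)  # (3, 4)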

m-mohr removed their assignment Aug 4, 2020
claxn (Collaborator, Author) commented Aug 5, 2020

Yeah, I get your point from the formal side. This implementation was a bit biased towards how you would implement such processes in Python, but it shouldn't be, of course. I will change this behaviour in a new branch.

clausmichele (Member) commented:

Actually, I like how the parser translates the graph, even if formally it is wrong behavior. With the current implementation it was fairly easy for me to deal with the graph in my opendatacube-based back-end. But the merge_cubes parsing has to be addressed somehow.
Replying to @m-mohr: in my opinion it is not that bad to pass the whole datacube/array by reference rather than by copy. I think it totally depends on the back-end implementation and how datacubes are handled; in my case I have many datacubes, one for each partial result.

m-mohr (Member) commented Aug 5, 2020

> Actually, I like how the parser translates the graph, even if formally it is wrong behavior. With the current implementation it was fairly easy for me to deal with the graph in my opendatacube-based back-end. But the merge_cubes parsing has to be addressed somehow.

It should be solved correctly, otherwise you need to find workarounds for every bit that is different. And there are some differences; you'll run into issues eventually. Also, you can still do it similarly, but that should be driven by the process implementations, not by the parser. Only in the process implementation can you act according to the definitions.

> Replying to @m-mohr: in my opinion it is not that bad to pass the whole datacube/array by reference rather than by copy. I think it totally depends on the back-end implementation and how datacubes are handled; in my case I have many datacubes, one for each partial result.

I'm not saying you shouldn't pass by reference in general; I'm just saying that the parser shouldn't mandate any behavior based on (wrong) assumptions. That will eventually fail. As said above, it should be driven by the process implementations. If it's better to pass a whole data cube in a reduce process, but not in merge_cubes, that decision should be made by the process itself. We do the same in GEE and it works quite well (except that merge_cubes has a poor implementation). But that's not a PG parser issue, it's a process implementation issue.
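
To illustrate "that decision should be made by the process itself" for the binary case raised above: in openeo-processes, merge_cubes takes an overlap_resolver callback with the parameters x and y, and it is the merge_cubes implementation that decides which overlapping values get bound to those parameters. A minimal, hypothetical sketch (illustrative names, not an actual back-end API):

import numpy as np

def merge_cubes(cube1, cube2, overlap_resolver=None):
    # take values from cube1 where present, otherwise from cube2
    merged = np.where(np.isnan(cube1), cube2, cube1)
    overlap = ~np.isnan(cube1) & ~np.isnan(cube2)
    if overlap.any():
        if overlap_resolver is None:
            raise ValueError("overlapping values but no overlap_resolver given")
        # the process itself binds the callback parameters 'x' and 'y',
        # value by value; the parser never touches these references
        merged[overlap] = [
            overlap_resolver({"x": x, "y": y})
            for x, y in zip(cube1[overlap], cube2[overlap])
        ]
    return merged

# Example: resolve overlaps with the mean of x and y
a = np.array([[1.0, np.nan], [3.0, np.nan]])
b = np.array([[5.0, 2.0], [np.nan, np.nan]])
print(merge_cubes(a, b, lambda p: (p["x"] + p["y"]) / 2))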

Also, I see the Python and JS parsers as reference implementation for openEO processes. If we want people to adopt it, then it should behave according to the spec, not on any assumptions.

clausmichele (Member) commented:

Was there any development on this? @claxn @lforesta

claxn pushed a commit that referenced this issue Oct 17, 2020
…ency structure of nodes by introducing "callback", "process" and "data" edges.; implemented new topological sorting algorithm
claxn (Collaborator, Author) commented Oct 17, 2020

@m-mohr @clausmichele @lforesta: Following our discussion a few months ago, I have finally implemented the correct (at least I hope so) behaviour. For the example above this yields:

Node ID: 1_0 
Node Name: 1 
{'arguments': {'id': 'test',
               'spatial_extent': {'east': 11.41,
                                  'north': 46.55,
                                  'south': 46.4,
                                  'west': 11.25},
               'temporal_extent': ['2015-11-06T00:00:00.000Z',
                                   '2016-09-25T00:00:00.000Z']},
 'process_id': 'load_collection'}

Node ID: 2_1 
Node Name: 2 
{'arguments': {'bands': ['VH'],
               'common_names': [],
               'data': {'from_node': '1_0'},
               'wavelengths': []},
 'process_id': 'filter_bands'}

Node ID: 3_2 
Node Name: 3 
{'arguments': {'data': {'from_node': '2_1'},
               'dimension': 't',
               'reducer': {'from_node': 'max1_3'}},
 'process_id': 'reduce_dimension',
 'result': True}

Node ID: max1_3 
Node Name: max1 
{'arguments': {'data': {'from_parameter': 'data'}},
 'process_id': 'max',
 'result': True}
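
With 'from_parameter' now left untouched by the parser, resolving it becomes the back-end's job at evaluation time. A minimal, hypothetical sketch of that lookup (illustrative names, not code from this repository):

def resolve_argument(arg, node_results, callback_parameters):
    # 'from_node' is looked up in already-computed node results;
    # 'from_parameter' is only resolved from the parameters supplied by the
    # parent process (e.g. reduce_dimension) when it invokes the callback
    if isinstance(arg, dict) and "from_node" in arg:
        return node_results[arg["from_node"]]
    if isinstance(arg, dict) and "from_parameter" in arg:
        return callback_parameters[arg["from_parameter"]]
    return arg  # literal value

# Example for the 'max1' node: 'data' is resolved from the callback parameters,
# not from node '2_1'
max1_arguments = {"data": {"from_parameter": "data"}}
resolved = {name: resolve_argument(value, {}, {"data": [3, 7, 5]})
            for name, value in max1_arguments.items()}
print(resolved)  # {'data': [3, 7, 5]}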

Please test if everything works for you (see #29) and close this issue if that's the case, thanks!

claxn mentioned this issue Oct 27, 2020
claxn closed this as completed Nov 10, 2020