
[POC] implement scalar metrics #551

Closed
adebardo opened this issue Jul 31, 2024 · 7 comments
Labels
[POC] Conception stuck The issue conception is stopped

Comments

@adebardo

adebardo commented Jul 31, 2024

Context

This ticket aims to propose an initial implementation of scalar metrics in xDEM.

We assume that these metrics can be an attribute of a DEM, just like the slope, and that users can retrieve them as follows: dem_mean = xdem.scalar_metric.mean(dem). If this API is validated, here is our proposal:
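A minimal sketch of the proposed usage (the module and function names are taken from the proposal above; the data file path is illustrative):

import xdem

dem = xdem.DEM("path/to/dem.tif")
dem_mean = xdem.scalar_metric.mean(dem)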

Code

  • Create the file scalar_metrics.py (inspired by terrain.py)

  • List of metrics to implement:
    • "mean"
    • "max"
    • "min"
    • "std"
    • "rmse"
    • "median"
    • "nmad"
    • "sum"
    • "sumSquaredErr"
    • "Percentil90"
  • Implement the get_scalar_metric function with all the options from the above list:

@overload
def get_scalar_metric(
    dem: NDArrayf | MArrayf,
    attribute: str,
) -> np.float32:
    ...


@overload
def get_scalar_metric(
    dem: RasterType,
    attribute: list[str],
) -> list[np.float32]:
    ...


def get_scalar_metric(
    dem: NDArrayf | MArrayf | RasterType,
    attribute: str | list[str],
) -> np.float32 | list[np.float32]:
    """
    :param dem: The DEM to analyze.
    :param attribute: The scalar metric(s) to calculate.

    :raises ValueError: If the inputs are poorly formatted or are invalid.

    :examples:
        >>> dem = np.repeat(np.arange(3), 3).reshape(3, 3)
        >>> dem
        array([[0, 0, 0],
               [1, 1, 1],
               [2, 2, 2]])
        >>> dem_min, dem_max = get_scalar_metric(dem, ["min", "max"])
        >>> dem_min
        0
        >>> dem_max
        2

    :returns: One or multiple values of the requested metric(s).
    """
    # Validate and format the inputs
    if isinstance(attribute, str):
        attribute = [attribute]

    # Initialize the scalar_metrics dictionary, which will be filled with the requested values.
    scalar_metrics: dict[str, np.float32] = {}

    # Get the data array of the DEM, with invalid values set to NaN
    if isinstance(dem, np.ma.masked_array):
        data_arr = get_array_and_mask(dem, check_shape=False)[0]
    elif isinstance(dem, Raster):
        data_arr = get_array_and_mask(dem.data, check_shape=False)[0]
    else:
        data_arr = np.asarray(dem)

    if "mean" in attribute:
        scalar_metrics["mean"] = np.nanmean(data_arr)
    if "max" in attribute:
        scalar_metrics["max"] = np.nanmax(data_arr)
    if ...:  # ... and so on for the remaining metrics
        ...

    # Return the metrics in the order they were requested
    output_metrics = [scalar_metrics[attr] for attr in attribute]
    return output_metrics if len(output_metrics) > 1 else output_metrics[0]
  • Each metric should be implemented as follows:
@overload
def mean(
    dem: NDArrayf | MArrayf,
) -> np.float32:
    ...


@overload
def mean(
    dem: RasterType,
) -> np.float32:
    ...


def mean(
    dem: NDArrayf | MArrayf | RasterType,
) -> np.float32:
    """
    Calculate the mean value of a DEM.

    :param dem: The input DEM to calculate the mean from.

    :raises AssertionError: If the given DEM is not a 2D array.
    :raises ValueError: If invalid argument types or ranges were given.

    :returns: The mean value, with dtype "float32".
    """
    return get_scalar_metric(
        dem,
        attribute="mean",
    )
  • Handling the NMAD metric:
    • It is already implemented; retrieve it from the spatialstats file.
    • It is called in various places; update those call sites as follows: nmad = xdem.scalar_metric.nmad(dem)
      • affine.py
      • base.py
      • workflows.py
      • spatialstats.py
    • The nfact factor is hardcoded in demcompare and parameterized in xdem; if we decide to expose it as a parameter, modify the APIs accordingly, e.g. (see the sketch of the NMAD computation after this snippet):
@overload
def get_scalar_metric(
    dem: RasterType,
    attribute: list[str],
    nfact: float = 1.4826,
) -> np.float32:
    ...
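For reference, a minimal sketch of the NMAD computation itself and of the role of nfact (1.4826 makes the estimator consistent with the standard deviation for normally distributed data); the function body is illustrative, not the existing xdem implementation:

import numpy as np

def nmad(data: np.ndarray, nfact: float = 1.4826) -> np.float32:
    # Median of the absolute deviations from the median, scaled by nfact
    return np.float32(nfact * np.nanmedian(np.abs(data - np.nanmedian(data))))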
  • From the DEM class: each metric should be implemented as follows:
    @copy_doc(scalar_metrics, remove_dem_res_params=True)
    def mean(
        self
    ) -> float:

        return scalar_metrics.mean(
            self
        )
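A sketch of the resulting usage from the DEM class (the data file path is illustrative):

import xdem

dem = xdem.DEM("path/to/dem.tif")
dem_mean = dem.mean()
dem_nmad = dem.nmad()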

Tests

  • Create the file test_scalar_metrics.py
  • Test all possibilities: the unit tests can be retrieved from demcompare: test_scalar_metric.py
  • To test with all array types, move the test_nmad function from test_spatial_stats to test_scalar_metrics.py
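A minimal sketch of what such a test could look like (hypothetical expected values; the module path is assumed from the proposed scalar_metrics.py file):

import numpy as np
import pytest

import xdem


@pytest.mark.parametrize("metric, expected", [("min", 0.0), ("max", 2.0), ("mean", 1.0)])
def test_scalar_metrics(metric: str, expected: float) -> None:
    # Same small DEM as in the docstring example above
    dem = np.repeat(np.arange(3, dtype="float32"), 3).reshape(3, 3)
    assert xdem.scalar_metrics.get_scalar_metric(dem, metric) == expected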

Example

  • Add an example script in the basic folder that displays the scalar metrics of a DEM using the new APIs

Documentation

  • Add a scalar_metrics.md file in the xdem documentation inspired by the terrain.md documentation

Estimation

5 days

@adebardo adebardo added the [POC] Conception To review Tickets needs approval about it conception label Jul 31, 2024
@adebardo adebardo changed the title [POC] implement scalard metrics [POC] implement scalar metrics Jul 31, 2024
@adebardo adebardo added [POC] Conception To review Tickets needs approval about it conception and removed [POC] Conception To review Tickets needs approval about it conception labels Aug 1, 2024
@duboise-cnes
Member

Ok for the need for scalar metrics, the issue is clear.

Will it be possible to choose a metric from a calling pipeline with a configuration file, as in demcompare?
My question is to make sure that the xdem code will have enough architectural flexibility to offer a kind of generic metric structure (function, call, object, ...) that can be manipulated easily from a CLI requesting one metric or another.

For example, here is how scalar metrics are called from the demcompare pipeline: https://demcompare.readthedocs.io/en/stable/userguide/statistics/metrics.html, with the following example of an input JSON configuration to illustrate my question:

{
    "output_dir": "./test_output/",
    "input_ref": {
        "path": "./Gironde.tif",
        "nodata": -9999.0
    },
    "input_sec": {
        "path": "./FinalWaveBathymetry_T30TXR_20200622T105631_D_MSL_invert.TIF",
        "nodata": -32768
    },
    "statistics": {
        "alti-diff": {
            "remove_outliers": true,
            "metrics": ["mean", {"ratio_above_threshold": {"elevation_threshold": [1, 2, 3]}}]
        },
        "ref": {
            "remove_outliers": true,
            "metrics": [
                {
                    "slope-orientation-histogram": {
                        "output_plot_path": "path_to_plot"
                    }
                }
            ]
        }
    }
}

Will this be possible with the development here, or will we need an evolution or wrapper of the existing API to generalize metrics?
I am trying to foresee possible problems with adding a CLI.

@adebardo
Author

adebardo commented Aug 2, 2024

Based on my limited experience with xdem, I believe it's possible. I will create an issue for the metrics in the CLI before developing this one, just to be sure.

@adebardo
Author

adebardo commented Aug 2, 2024

From @adehecq:

"In fact, the other functions are not specifically implemented because they are available with NumPy (e.g., min, max, std, var, etc.). The interface with NumPy (https://geoutils.readthedocs.io/en/stable/core_array_funcs.html) makes these calculations direct and intuitive.

In this sense, xdem seems more advanced because we have access to any existing metric in NumPy or SciPy."
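For illustration, a sketch of that existing array interface (the data file path is illustrative), as documented at the geoutils link above:

import numpy as np
import geoutils as gu

rast = gu.Raster("path/to/dem.tif")

# NumPy functions apply directly to the Raster object through its array
# interface and ignore masked (nodata) values:
rast_mean = np.mean(rast)
rast_std = np.std(rast)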

@adebardo adebardo added [POC] Conception stuck The issue conception is stopped and removed [POC] Conception To review Tickets needs approval about it conception labels Aug 2, 2024
@rhugonnet
Contributor

rhugonnet commented Aug 2, 2024

One more note in addition to @adehecq's that might be relevant here: Xarray has exactly the same NumPy array interface as we have for the Raster object (you can call any np.func(), or derivatives like some SciPy functions, on the object, and it knows where to go look for the array), so the array-interfacing approach will work exactly the same with the planned dem Xarray accessor.

Usually the critical point when deriving those metrics through an array interface is the behaviour with nodata values:

  • For an Xarray object, one needs to force a float32+/NaN array to ensure the nodata values are masked during the array-interfacing operation,
  • For a Raster object, NumPy masked arrays allow support of any data type, but the casting of classic NumPy functions to masked arrays is unreliable for a couple of functions. For instance, np.percentile cannot be cast to np.ma.percentile, which does not exist in the masked-array module, and so it returns NaN instead of masking nodata values, while other functions are cast naturally. There are only a few exceptions that misbehave like this. To remedy this, I've added an override for those functions here: https://github.com/GlacioHack/geoutils/blob/main/geoutils/raster/raster.py#L2329. NumPy is working to make masked arrays fully consistent (including being able to call np.ma.func on any object that has a masked array, which is impossible right now); see for details: Update Raster array interface to work with masked arrays once available in NumPy geoutils#358. Optionally, we could open an issue in NumPy to re-ignite the topic and make those aspects fully consistent.
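To illustrate the percentile caveat, a minimal sketch of the NaN-based fallback on a plain masked array (hypothetical values):

import numpy as np

arr = np.ma.masked_array([1.0, 2.0, 3.0, 4.0], mask=[False, False, False, True])

# np.ma has no percentile function for np.percentile to be cast to, so the
# mask is not honoured; filling masked values with NaN and using the
# nan-aware variant ignores them as intended:
p90 = np.nanpercentile(arr.filled(np.nan), 90)  # 90th percentile of [1, 2, 3]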

@adehecq
Member

adehecq commented Aug 9, 2024

Overall, I agree with adding functionalities to address certain scalar metrics. Just a few remarks/questions:

  • I believe this should sit in geoutils rather than xdem, as it can be generalized to any raster.
  • I'm wondering if having a function geoutils.scalar_metric.mean() for each metric makes sense. It would add confusion relative to the easier alternative of np.mean(my_raster).
  • Your example with the mean shows that it would add at least 20 lines of code per metric (because of the overloads), whereas the same can be obtained in one line for many of these statistics.
  • Because of the two previous points, I'm wondering if we should not just add a method to the Raster class for these statistics, like in pandas, e.g. my_raster.min(), my_raster.nmad(), etc. Or, if we don't want to add too many methods, they could be gathered in a my_raster.scalar_metric.min(), etc. We get rid of the overloads (because it will only work on the Raster type) and most methods would be one line of code; see the sketch after this list.
  • Regarding NMAD: we definitely discussed before that nmad should be moved to a more appropriate location, so that's perfect. The only reason for having the nfact argument is probably to calculate the MAD instead of the NMAD, but I have personally never used that argument and don't know of anyone who has, so we could also remove it (and add a MAD metric if relevant).
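A minimal sketch of that method-based approach (the Raster stand-in and the method bodies are hypothetical, not the actual geoutils API):

import numpy as np


class Raster:  # stand-in for geoutils.Raster
    def __init__(self, data: np.ma.MaskedArray):
        self.data = data

    def min(self) -> np.float32:
        # One-line reduction, no overloads needed: the input is always a Raster
        return np.float32(np.ma.min(self.data))

    def nmad(self) -> np.float32:
        valid = self.data.compressed()  # valid (unmasked) values only
        return np.float32(1.4826 * np.median(np.abs(valid - np.median(valid))))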

@duboise-cnes, I can't really answer your question as I didn't understand the meaning of this line

        "metrics": ["mean", {"ratio_above_threshold": {"elevation_threshold": [1, 2, 3]}}]  

@rhugonnet
Contributor

One more note on the MAD that @adehecq's comment reminded me of:
Adding the MAD to NumPy was already proposed and rejected 5+ years ago (with a not-so-convincing justification, in my opinion, pointing to the "niche" aspect of robust estimators); they redirected towards a statsmodels implementation at the time. Since then (3 years ago), there is a consistent SciPy implementation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.median_abs_deviation.html, which is already a dependency for many other functionalities.

See details in our issue here: #349.
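For reference, a sketch of computing the NMAD with that SciPy function (scale="normal" applies the 1.4826 factor; the data values are illustrative):

import numpy as np
from scipy.stats import median_abs_deviation

data = np.array([1.0, 2.0, 3.0, np.nan])
nmad = median_abs_deviation(data, scale="normal", nan_policy="omit")  # ~1.4826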

@adebardo
Author

Another approach has been discussed; I am closing in favor of #567.
