forked from OpenGeoHub/spatial-sampling-ml
-
Notifications
You must be signed in to change notification settings - Fork 0
/
extending.Rmd
95 lines (74 loc) · 4.6 KB
/
extending.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
# Generating 2nd, 3rd round sampling
```{r, results = "asis", echo = FALSE}
status("drafting")
```
```{r, include=FALSE, message=FALSE, results='hide'}
ls <- c("rgdal", "raster", "plotKML", "spatstat", "ranger", "mlr", "forestError",
"clhs", "xgboost", "glmnet", "matrixStats", "landmap", "LICORS", "h2o", "yardstick",
"hexbin", "parallelMap", "Metrics", "fastSave", "fields", "devtools", "Cubist")
new.packages <- ls[!(ls %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
lapply(ls, require, character.only = TRUE)
#load.pigz("sampling_data.RData")
source("sampling_functions.R")
library(ggplot2)
library(dplyr)
library(tidyr)
```
## Uncertainty guided sampling
A sensible approach to improving quality of predictions is to: (a) estimate initial
ML models, (b) produce a realistic estimate of prediction errors, and (c) revisit
area and collect 2nd round samples that help at smallest possible costs significantly
improve predictions. Logical focus of the 2nd round sampling could be thus on minimizing
the overall prediction errors i.e. revising parts of the study area that shows the
highest prediction errors / prediction problems (Fig. \@ref(fig:eberg-var-eml)).
This is exactly procedure that @stumpf2017uncertainty recommend in their paper and
that they refer to as the _“Uncertainty guided sampling”_.
The 2nd round Uncertainty guided sampling can be implemented either by:
- Producing a strata based on the prediction error map (e.g. extrapolation areas / areas with highest uncertainty), then sampling only within that strata,
- Using the probability of exceeding threshold mapping accuracy to generate extra sampling points,
In both cases 2nd round points can be then added to the original training dataset and the
model for the area can then be refitted (this procedure in data science is also referred to as _“re-analysis”_).
Assuming that our initial model was unbiased, then even few tens of extra points
could lead to significant reduction in RMSPE. In practice, we might have to also
organize the 3rd round sampling until we finally reach some maximum possible accuracy [@hengl2018random].
To generate 2nd round sampling for the Edgeroi dataset we can first estimate probability
of prediction errors exceeding some threshold probability e.g. RMSE=0.2. We first
load point data and prediction error map produced in the previous example using
Ensemble Machine Learning:
```{r}
data(edgeroi)
edgeroi.sp <- edgeroi$sites
coordinates(edgeroi.sp) <- ~ LONGDA94 + LATGDA94
proj4string(edgeroi.sp) <- CRS("+proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs")
edgeroi.sp <- spTransform(edgeroi.sp, CRS("+init=epsg:28355"))
```
The probability of exceeding some threshold, assuming normal distribution of prediction
erros can be derived as:
```{r}
t.RMSE = 0.2
edgeroi.error = readGDAL("./output/edgeroi_soc_rmspe.tif")
edgeroi.error$inv.prob = 1/(1-2*(pnorm(edgeroi.error$band1/t.RMSE)-0.5))
```
next, we can generate a sampling plan using the **spatstat** package and the inverse
probability of the pixels exceeding threshold uncertainty (`f` argument):
```{r, cache=TRUE}
dens.var <- spatstat.geom::as.im(sp::as.image.SpatialGridDataFrame(edgeroi.error["inv.prob"]))
pnts.new <- spatstat.core::rpoint(50, f=dens.var)
```
So the new sampling plan, assuming adding only 50 new points would thus look like this:
```{r eberg-pnts-eml, echo = TRUE, out.width = "100%", fig.cap = "Uncertainty guided 2nd round sampling: locations of initial (+) and 2nd round points (dots)."}
plot.map(df=log1p(raster(edgeroi.error[c("inv.prob")])), points = coordinates(edgeroi.sp),points2 =as(pnts.new, "SpatialPoints") , fill = rev(terrain.colors(10)), title="Uncertainty guided 2nd round sampling")
```
This puts much higher weight at locations where the prediction errors are higher,
however, technically speaking we finish still sampling points across the whole
study area and, of course, including randomization.
```{r scheme-resample, echo=FALSE, fig.cap="General scheme for re-sampling and re-analysis using uncertainty guided sampling principles.", out.width="70%"}
knitr::include_graphics("./img/Fig_sampling_reanalysis.png")
```
By adding 2nd, 3rd round sampling we can imagine that the mapping accuracy / RMSE
would gradually decrease following a decay function (e.g. `RMSE = b0 * x ^ -b1`) until some
maximum possible accuracy is achieved. This way soil surveyor can optimize the
predictions and reduce costs by minimizing number of additional samples i.e. it
could be considered _the shortest path_ to increasing the mapping accuracy without
a need to start sampling from scratch.