# Resampling methods for Machine Learning
```{r, results = "asis", echo = FALSE}
status("drafting")
```
```{r, include=FALSE, message=FALSE, results='hide'}
ls <- c("rgdal", "raster", "plotKML", "spatstat", "ranger", "mlr", "forestError",
"clhs", "xgboost", "glmnet", "matrixStats", "landmap", "LICORS", "h2o", "yardstick",
"hexbin", "parallelMap", "Metrics", "fastSave", "fields", "devtools", "Cubist")
new.packages <- ls[!(ls %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
lapply(ls, require, character.only = TRUE)
#load.pigz("sampling_data.RData")
source("sampling_functions.R")
```
## Resampling and Cross-Validation
In the previous examples we have demonstrated how to prepare sampling designs
for your own study area assuming no previous point data is available. We have also
demonstrated how to run sampling representation diagnostics to detect potential
problems, especially in the feature space. In the next sections we will focus on
how to use different **resampling** methods, i.e.
**cross-validation strategies** [@roberts2017cross], to help reduce problems such as:

- _overfitting_ i.e. producing models that are biased and/or over-optimistic;
- _missing out covariates_ that are important but possibly _shadowed_ by covariates
over-selected due to overfitting;
- _producing poor extrapolation_ i.e. generating artifacts or blunders in predictions;
- _over-/under-estimating mapping accuracy_ i.e. producing biased estimates of model performance.
**Resampling** methods are discussed in detail in @hastie2009elements, @kuhn2013applied and
@roberts2017cross, and are also implemented in many statistical and machine
learning packages such as [caret](https://topepo.github.io/caret/) or [mlr](https://mlr3.mlr-org.com/).
Spatial resampling methods are also discussed in detail in @lovelace2019geocomputation.
For an introduction to Cross-Validation please refer to [this tutorial](https://neptune.ai/blog/cross-validation-in-machine-learning-how-to-do-it-right)
and the **[“Data Analysis and Prediction Algorithms with R”](https://rafalab.github.io/dsbook/cross-validation.html)** especially chapters on cross validation.
For an introduction to **spatial Cross-Validation** refer to the **[“Geocomputation with R”](https://geocompr.robinlovelace.net/spatial-cv.html)** book.
In general, the purpose of Machine Learning for predictive
mapping is to produce **Best Unbiased Predictions** (BUPS) of the target
variable and the associated uncertainty (e.g. prediction errors).
BUPS is commonly implemented through: (1) selecting the Best Unbiased Predictor
[@Venables2002Springer], (2) selecting the optimal subset of covariates and model
parameters (usually by iterations), (3) applying the predictions and providing an
estimate of the prediction uncertainty, i.e. an estimate of the prediction errors /
prediction intervals for a given probability distribution. In this tutorial the
path to BUPS is based on:

- **Ensemble Machine Learning** using a stacking approach with 5-fold Cross-Validation
and a meta-learner, i.e. an independent model correlating the competing base-learners
with the target variable [@polley2010super;@bischl2016mlr];
- estimating prediction errors using **quantile regression Random Forest** [@lu2021unified].
The reasoning for using Ensemble ML for predictive mapping is explained in detail
in [this tutorial](https://opengeohub.github.io/spatial-prediction-eml/), and its
advantages for minimizing extrapolation effects in this [medium post](https://medium.com/nerd-for-tech/extrapolation-is-tough-for-trees-tree-based-learners-combining-learners-of-different-type-makes-659187a6f58d).
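To fix ideas before moving to the spatial variants, below is a minimal sketch of
plain (non-spatial) k-fold Cross-Validation in base R on simulated data (our
illustration, not part of the case study); the spatial strategies discussed next
differ mainly in how rows are assigned to folds:

```{r, eval=FALSE}
## Minimal non-spatial k-fold CV sketch on simulated data (illustration only);
## spatial CV variants change only how rows are assigned to folds:
set.seed(1)
df <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
k <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))
rmse.k <- sapply(1:k, function(i){
  m <- lm(y ~ x1 + x2, data = df[folds != i, ])
  p <- predict(m, newdata = df[folds == i, ])
  sqrt(mean((df$y[folds == i] - p)^2))
})
mean(rmse.k)
```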
## Resampling training points using declustering
In the previous example we have shown that the actual soil survey points for
Ebergotzen are spatially _clustered_: they under-sample forest areas / hill lands.
In statistical terms, these sampling points are geographically unbalanced, i.e. non-IID.
If we plot the original sampling points (N=2780) the high spatial clustering of samples is obvious:
```{r}
library(rgdal)
library(raster)
library(plotKML)
library(landmap)
library(mlr)
library(ggplot2)
library(dplyr)
library(tidyr)
data("eberg_grid25")
gridded(eberg_grid25) <- ~x+y
proj4string(eberg_grid25) <- CRS("+init=epsg:31467")
eberg_grid25 = cbind(eberg_grid25, readRDS("./extdata/eberg_dtm_25m.rds"))
eberg_spc = landmap::spc(eberg_grid25[-c(2,3)])
data(eberg)
eberg.xy <- eberg[,c("X","Y")]
coordinates(eberg.xy) <- ~X+Y
proj4string(eberg.xy) <- CRS("+init=epsg:31467")
ov.xy = sp::over(eberg.xy, eberg_grid25[1])
eberg.xy = eberg.xy[!is.na(ov.xy$DEMTOPx),]
```
```{r eberg-allpnts, echo = TRUE, fig.cap = "All sampling points available for Ebergotzen case study."}
plot.map(df=raster(eberg_spc@predicted[1]), points = as.data.frame(eberg.xy, xy=TRUE) , pcex=2,
title = "All Ebergotzen sampling points")
```
If we ignore this property of the data and directly fit a predictive model for e.g.
top-soil clay content using random forest [@wright2017ranger] we get:
```{r, cache=TRUE}
library(ranger)
rm.eberg = cbind(eberg[!is.na(ov.xy$DEMTOPx),], sp::over(eberg.xy, eberg_spc@predicted))
cly.fm = as.formula(paste0("CLYMHT_A ~ ", paste0("PC", 1:13, collapse = "+")))
sel.cly = complete.cases(rm.eberg[,all.vars(cly.fm)])
rm.cly = rm.eberg[sel.cly,]
rf.cly = ranger::ranger(cly.fm, data=rm.cly)
rf.cly
```
This shows an RMSE of about 7.3% and an R-square of about 0.61. The problem with this
accuracy measure is that this Random Forest model ignores the spatial clustering
of points, hence both the model and the accuracy metric could be over-optimistic [@roberts2017cross;@meyer2018improving]. Because we are typically interested in
how the model performs over the WHOLE area of interest, not only in comparison
to out-of-bag points, we need to apply some adjustments to reduce overfitting and any bias in the BUPS.
Strategy #1 for producing a more objective estimate of the model parameters is to
resample the training points by forcing, as much as possible, an equal sampling intensity
(hence mimicking SRS), then observe the performance of the model under
different resampling strategies. We can implement such spatial resampling using
the `sample.grid` function, which resamples the existing point samples
with the objective of producing a sample more similar to SRS. This type of
subsetting can be run `M` times, and an ensemble model can then be produced in
which each individual model is based on spatially balanced samples. These are
not true SRS samples, but we can refer to them as **pseudo-SRS samples** as
they would probably pass most **Spatial Randomness tests**.
In R we can implement spatial resampling using the following three steps. First,
we generate e.g. 10 random subsets where the sampling intensity of points is
relatively homogeneous:
```{r}
eberg.sp = SpatialPointsDataFrame(eberg.xy, rm.eberg[c("ID","CLYMHT_A")])
sub.lst = lapply(1:10, function(i){landmap::sample.grid(eberg.sp, c(500, 500), n=2)})
```
This randomly subsets each 500-m block to a maximum of 2 points, i.e. trims down
densely sampled areas to produce a relatively balanced spatial sampling intensity. We can
check that the resampled training points now look much more like an SRS sample
(see also the spatial-randomness check after the figure below).
```{r eberg-grid-sample, fig.height=4.5, fig.cap = "Resampling original points using `sample.grid` function, which produces a sample with similar properties such as SRS."}
plot.map(df=sub.lst[[1]]$grid, points = as.data.frame(sub.lst[[1]]$subset)[,c("X","Y")], pcex = 2, maptype="grey",
title="Resampling using `sample.grid`")
```
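As a quick check of the pseudo-SRS claim (a sketch on our part, not part of the
original workflow), the Clark-Evans test from the spatstat package (loaded in the
setup chunk) tests for departure from complete spatial randomness; an R statistic
near 1 indicates a random pattern, R < 1 clustering:

```{r, eval=FALSE}
## Hedged spatial-randomness check of one resampled subset:
## Clark-Evans R ~ 1 suggests a random pattern, R < 1 clustering.
library(spatstat)
xy.s <- sp::coordinates(sub.lst[[1]]$subset)
pp.s <- ppp(xy.s[,1], xy.s[,2],
            window = owin(range(xy.s[,1]), range(xy.s[,2])))
clarkevans.test(pp.s)
```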
Second, we can fit a list of random forest models using the 10 random draws,
mimicking IID sampling:
```{r, cache=TRUE}
rf.cly.lst = lapply(1:length(sub.lst), function(i){
x <- rm.eberg[which(rm.eberg$ID %in% sub.lst[[i]]$subset$ID),];
x <- x[complete.cases(x[,all.vars(cly.fm)]),];
y <- ranger::ranger(cly.fm, data=x, num.trees = 50);
return(y)
}
)
```
Third, we produce an ensemble model that combines all predictions by simple
averaging [@wright2017ranger]. Simple averaging is appropriate here because all
subsets are symmetrical, i.e. they use exactly the same inputs and settings,
hence all fitted models carry equal weight in the ensemble.
The out-of-bag accuracy of ranger now shows a somewhat higher RMSE, which is
probably also more realistic:
```{r}
mean(sapply(rf.cly.lst, function(i){sqrt(i[["prediction.error"]])}))
```
In summary, the actual error is most likely about 20% higher than if we ignore
clustering, and the previous model was likely over-optimistic: it is too
strongly influenced by the clustered point samples, and this should be taken into
account during model fitting [@roberts2017cross;@meyer2021predicting]. If we
visually compare the predictions of the original and the ensemble model we see:
```{r}
cn = rf.cly$forest$independent.variable.names
pred.cly = predict(rf.cly, eberg_spc@predicted@data[,cn])
eberg_grid25$pred.cly.rf = pred.cly$predictions
pred.cly.lst = lapply(rf.cly.lst, function(i){
predict(i, eberg_spc@predicted@data[,cn])$predictions })
eberg_grid25$pred.cly.erf = rowMeans(do.call(cbind, pred.cly.lst), na.rm=TRUE)
```
```{r eberg-pred-erf, echo = TRUE, fig.width=8.5, fig.height=4.5, fig.cap = "Predictions of clay content: (left) ignoring spatial clustering effect on model, (right) after de-clustering of points."}
#zlim.cly = quantile(rm.eberg$CLYMHT_A, c(0.05, 0.95), na.rm=TRUE)
compare.map(df=stack(eberg_grid25[c("pred.cly.rf", "pred.cly.erf")]), title="Clay content predictions: without vs. with de-clustering of points")
```
Overall, the difference is small but there is a clearly visible difference in the predictions.
To end users we would probably recommend the `pred.cly.erf` map (predictions produced
using the de-clustered points), because the other model completely ignores
spatial clustering of points, which could have resulted in biased estimates of the
regression parameters. This solution is fully scalable
and relatively easy to implement; it only requires the user to decide on (1) the
size of the block for subsampling, and (2) the maximum sampling intensity per block. In practice,
both could be determined by iterations, as sketched below.
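A hedged sketch of such an iteration (our illustration; the settings grid is
arbitrary and the loop can be slow) would loop over block sizes and per-block
intensities and record the resulting out-of-bag error:

```{r, eval=FALSE}
## Try several block sizes and per-block sampling intensities and compare
## the resulting out-of-bag RMSE (illustrative settings only):
settings <- expand.grid(size = c(250, 500, 1000), n = 1:3)
rmse.s <- apply(settings, 1, function(s){
  sub <- landmap::sample.grid(eberg.sp, c(s[1], s[1]), n = s[2])
  x <- rm.eberg[which(rm.eberg$ID %in% sub$subset$ID),]
  x <- x[complete.cases(x[,all.vars(cly.fm)]),]
  sqrt(ranger::ranger(cly.fm, data = x, num.trees = 50)$prediction.error)
})
cbind(settings, rmse.s)
```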
## Weighted Machine Learning
Note that the function `sample.grid` by definition draws relatively
isolated points (Fig. \@ref(fig:eberg-fs)) with higher probability. We can implement a similar
principle but this time using the `case.weights` parameter (Strategy #2). Many
Machine Learning algorithms allow inclusion probabilities to be specified. For
example, we can instruct `ranger` to put more emphasis on isolated points / reduce the impact of
clustered points. Weighted estimation of model parameters is common in regression,
and also in spatial statistics (see e.g. the Generalized Least Squares (GLS) function [nlme::gls](https://stat.ethz.ch/R-manual/R-devel/library/nlme/html/gls.html)).
The theoretical basis for GLS / weighted regression is that clustered points
may also be spatially correlated, which would introduce bias into the
estimation of the regression parameters. Ideally, regression residuals should be
uncorrelated and homogeneous, hence a correction is needed to help ensure these properties.
Weighted Machine Learning, where the bias in sampling intensity is incorporated into
the modeling, can be implemented in two steps. First, we derive the occurrence
probability (0–1) using the `spsample.prob` method:
```{r, cache=TRUE}
iprob.all <- landmap::spsample.prob(eberg.sp, eberg_spc@predicted[1:4])
```
```{r eberg-iprob, echo = TRUE, fig.cap = "Occurrence probability for existing point samples for the Ebergotzen case study derived as an average between the kernel density and maxlike occurrence probabilities."}
plot.map(df=raster(iprob.all$prob), points=as.data.frame(iprob.all$observations, xy=TRUE),
fill = rev(terrain.colors(10)), pcex=2, grid = 4, title="maxlike occurrence probabilities")
```
Second, we fit a model using all points, but this time we set `case.weights` to be
inversely proportional to the probability of occurrence (hence points with lower occurrence
probability get higher weights):
```{r, cache=TRUE}
ov.weigths = 1/sp::over(eberg.xy, iprob.all$prob)[]
rf.clyI = ranger::ranger(cly.fm, data=rm.cly, case.weights = ov.weigths[sel.cly,])
```
[Weighted regression](https://www.statology.org/weighted-least-squares-in-r/) is a common technique in statistics, and in this case the weights are used to help reduce the
clustering effect, i.e. points are weighted inversely proportional to the sampling bias.
This method is in principle similar to the [nlme::gls](https://stat.ethz.ch/R-manual/R-devel/library/nlme/html/gls.html) Generalized Least Squares (GLS) procedure,
with the difference that we do not estimate any spatial autocorrelation structure, but
instead incorporate the probability of occurrence to reduce the effect of spatial clustering.
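For intuition, a minimal sketch of weighted vs. ordinary least squares in base R
on simulated data (our illustration, with a hypothetical inverse-sampling-intensity
weight):

```{r, eval=FALSE}
## Weighted least squares sketch: down-weight over-sampled observations.
set.seed(1)
x <- runif(200)
y <- 2 * x + rnorm(200, sd = 0.3)
w <- 1 / runif(200, min = 0.2, max = 1)  # hypothetical inverse sampling intensity
summary(lm(y ~ x))              # ordinary least squares
summary(lm(y ~ x, weights = w)) # weighted least squares
```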
## Resampling using Ensemble ML
Another approach to improve the generation of BUPS from clustered point data is to switch
to Ensemble ML, i.e. use a multitude of ML methods (so-called **base-learners**),
then estimate the final predictions using robust resampling and blocking. Ensemble ML
has been shown to help increase mapping accuracy, and it also helps reduce _over-shooting_
effects due to [extrapolation](https://medium.com/nerd-for-tech/extrapolation-is-tough-for-trees-tree-based-learners-combining-learners-of-different-type-makes-659187a6f58d).
One way to reduce the effects of point clustering in predictive mapping is to use _spatial blocking_,
i.e. to make sure that spatially clustered points are not used both for training and for
internal validation [@roberts2017cross]. First, we need to define a spatial grid that we will use
as a blocking parameter. Here we arbitrarily set the size of the spatial blocks to 500-m;
in practice, the block size can be determined more objectively, e.g. by analyzing
at which distances the clustering disappears (a heuristic sketch follows after the grid construction below):
```{r}
grd <- sp::GridTopology(cellcentre.offset=eberg_grid25@bbox[,1], cellsize=rep(500,2),
cells.dim=c(ceiling(abs(diff(eberg_grid25@bbox[1,])/500))+1,
ceiling(abs(diff(eberg_grid25@bbox[2,])/500))+1))
r.sp <- sp::SpatialGridDataFrame(grd, proj4string = eberg_grid25@proj4string,
data=data.frame(gid=1:(grd@cells.dim[1] * grd@cells.dim[2])))
id <- sp::over(eberg.xy, r.sp)$gid
#summary(as.factor(id))
```
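Regarding the block size: a simple heuristic sketch (our illustration, not part of
the original workflow) is to compare the observed nearest-neighbour distances with
the expectation under complete spatial randomness, 1/(2*sqrt(lambda)) for point
intensity lambda; blocks should be at least as large as the distance at which the
observed pattern stops being denser than random:

```{r, eval=FALSE}
## Heuristic for choosing block size: observed vs. CSR-expected
## nearest-neighbour distances (Clark-Evans expectation 1/(2*sqrt(lambda))).
library(spatstat)
xy.p <- sp::coordinates(eberg.xy)
win <- owin(range(xy.p[,1]), range(xy.p[,2]))
pp.p <- ppp(xy.p[,1], xy.p[,2], window = win)
d.obs <- nndist(pp.p)
d.csr <- 1 / (2 * sqrt(pp.p$n / area.owin(win)))
summary(d.obs); d.csr
```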
```{r eberg-grid, echo = TRUE, fig.height=5, fig.cap = "500 m grid for spatial resampling."}
plot.map(df=r.sp, points = as.data.frame(eberg.xy, xy=TRUE), pcex=2, fill = terrain.colors(10), grid=6,
title="Spatial resampling grid (500 m)")
```
This shows that many blocks have basically no training points, while some blocks
are densely sampled with e.g. 15–20 points. Next, we can compare models fitted
using spatial blocking vs. no special settings. First, we fit an ensemble ML model without blocking:
```{r}
parallelMap::parallelStartSocket(parallel::detectCores())
```
```{r}
library(mlr)
library(glmnet)
library(Cubist)
lrns <- list(mlr::makeLearner("regr.ranger",
num.threads = parallel::detectCores(), num.trees=150, importance="impurity"),
mlr::makeLearner("regr.glm"), mlr::makeLearner("regr.cubist"),
mlr::makeLearner("regr.cvglmnet"))
tsk0 <- mlr::makeRegrTask(data = rm.cly[,all.vars(cly.fm)], target = "CLYMHT_A")
init0.m <- mlr::makeStackedLearner(lrns, method = "stack.cv",
super.learner = "regr.lm",
resampling=mlr::makeResampleDesc(method = "CV"))
eml0 = train(init0.m, tsk0)
summary(eml0$learner.model$super.model$learner.model)
```
This shows that `ranger` (random forest) and `cubist` are the most important
learners, and the overall model performance matches the previously fitted ranger model.
Next, we fit an ensemble ML model with the spatial blocking (500-m):
```{r}
tsk1 <- mlr::makeRegrTask(data = rm.cly[,all.vars(cly.fm)], target = "CLYMHT_A",
blocking = as.factor(id[sel.cly]))
init1.m <- mlr::makeStackedLearner(lrns, method = "stack.cv",
super.learner = "regr.lm",
resampling=mlr::makeResampleDesc(method = "CV", blocking.cv=TRUE))
eml1 = train(init1.m, tsk1)
summary(eml1$learner.model$super.model$learner.model)
```
which shows a difference: the RMSE increases by 10–20% and the `regr.glm` learner is
now also significant. In the previous example, the `regr.glm` model was
possibly _shadowed_ by the fitting power of ranger and Cubist, while under the
stricter CV, simpler models also seem to perform with comparable accuracy.
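To quantify the change, one can compare the residual standard errors of the two
meta-learners directly (a small check on our part):

```{r, eval=FALSE}
## Compare meta-learner residual standard errors without / with blocking:
s0 <- summary(eml0$learner.model$super.model$learner.model)$sigma
s1 <- summary(eml1$learner.model$super.model$learner.model)$sigma
round(c(no.blocking = s0, blocking.500m = s1, rel.change = (s1 - s0)/s0), 3)
```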
We can produce predictions using the Ensemble ML model developed with blocking by running:
```{r}
pred.cly.eml = predict(eml1, newdata=eberg_spc@predicted@data[,eml1$features])
eberg_grid25$pred.cly.eml = pred.cly.eml$data$response
```
```{r eberg-eml, echo = TRUE, fig.width=8.5, fig.height=4.5, fig.cap = "Predictions of clay content: (left) Ensemble ML with spatial blocking, (right) after de-clustering of points."}
compare.map(df=stack(eberg_grid25[c("pred.cly.eml", "pred.cly.erf")]), title="Clay content predictions using EML spatial blocking and de-clustering")
```
Visual comparison with the predictions produced in the previous section shows that
the Ensemble method `pred.cly.eml` predicts somewhat higher clay content in the
extrapolation area, but also smooths out some higher values in the plains.
Again, if we had to choose, we would suggest users take the map on the left, for
two main reasons:

1. It is based on multiple base-learners, not only on Random Forest, and the results
show that both Cubist and glmnet produce results comparable to RF.
2. It is probably more sensible to use the predictions produced by the meta-learner,
especially in the extrapolation space.
## Estimating the Area of Applicability
@meyer2021predicting have developed a method to estimate the so-called ["Area of Applicability"](https://cran.r-project.org/web/packages/CAST/vignettes/AOA-tutorial.html)
(AoA) using a fitted model and feature-space analysis. This method can be used for
post-modeling analysis and helps users identify the true extrapolation
areas, i.e. where the predictions are critically poor. Users can then choose to
e.g. limit predictions to the combinations of pixels that are NOT too risky (extrapolation).
For the RF model fitted above we can derive AoA using:
```{r, eval=FALSE}
library(CAST)
train.df = rm.cly[,all.vars(cly.fm)[-1]]
weight = as.data.frame(mlr::getFeatureImportance(eml1$learner.model$base.models[[1]])$res)
AOA <- CAST::aoa(train=train.df, predictors=eberg_spc@predicted@data[,eml1$features], weight=weight)
```
This method can be computationally intensive, so it is probably not recommended for larger datasets.
Some examples of the Area of Applicability can be found in the [CAST package tutorial](https://cran.r-project.org/web/packages/CAST/vignettes/AOA-tutorial.html).
## Estimating per-pixel mapping accuracy
Using the Ensemble Model we can also estimate the mapping accuracy per pixel, i.e.
by deriving the **Mean Square Prediction Error** (MSPE). The **forestError** package currently
provides a _“Unified Framework for Random Forest Prediction Error Estimation”_ [@lu2021unified]
and is probably the most worked-out procedure for deriving prediction errors
and estimating potential bias. This requires two steps: (1) fit an additional
quantile regression RF model (using the four base learners from the
previous section), and (2) estimate the complete error statistics
per pixel using the [quantForestError](https://rdrr.io/cran/forestError/man/quantForestError.html) function.
Because the derivation of prediction errors per pixel can be computationally intensive
(even an order of magnitude more expensive than producing the predictions), it is important
to use a method that is computationally efficient yet precise enough. In the
landmap package the uncertainty is derived using the base learners instead of
ALL raster layers, which could be hundreds. This approach of using the (few) base learners
instead of the (many) original covariates helps compress the complexity of the model and
significantly speeds up computing. The base learners can be accessed from the
mlr object:
```{r, cache=TRUE}
eml.t = eml1$learner.model$super.model$learner.model$terms
paste(eml.t)
eml.m = eml1$learner.model$super.model$learner.model$model
```
We use the spatially resampled base-learners to fit (an independent) quantile RF:
```{r, cache=TRUE}
eml.qr <- ranger::ranger(eml.t, eml.m, num.trees=85, importance="impurity",
quantreg=TRUE, keep.inbag = TRUE)
#eml.qr
```
Next, we can use the `forestError` package to derive prediction errors by:
```{r, cache=TRUE}
library(forestError)
quantiles = c((1-.682)/2, 1-(1-.682)/2)
n.cores = parallel::detectCores()
out.c <- as.data.frame(mlr::getStackedBaseLearnerPredictions(eml1,
newdata=eberg_spc@predicted@data[,eml1$features]))
pred.q = forestError::quantForestError(eml.qr,
X.train = eml.m[,all.vars(eml.t)[-1]],
X.test = out.c,
Y.train = eml.m[,all.vars(eml.t)[1]],
alpha = (1-(quantiles[2]-quantiles[1])), n.cores=n.cores)
```
We could have also subset e.g. 10% of the input points and kept them ONLY for
estimating the prediction errors with `quantForestError`, which is also
recommended by the authors of the forestError package. In practice, because the
base learners have been fitted using 5-fold Cross-Validation with blocking,
their predictions are already out-of-bag, hence taking out extra OOB samples is
probably not required; but you can also test this with your own data.
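If you do want to test the hold-out variant, a hedged sketch (reusing the objects
defined above) could look like this:

```{r, eval=FALSE}
## Hold out ~10% of rows purely for error estimation (sketch only):
set.seed(1)
idx <- sample(nrow(eml.m), round(0.9 * nrow(eml.m)))
eml.qr10 <- ranger::ranger(eml.t, eml.m[idx,], num.trees = 85,
                           quantreg = TRUE, keep.inbag = TRUE)
err10 <- forestError::quantForestError(eml.qr10,
            X.train = eml.m[idx, all.vars(eml.t)[-1]],
            X.test  = eml.m[-idx, all.vars(eml.t)[-1]],
            Y.train = eml.m[idx, all.vars(eml.t)[1]])
mean(sqrt(err10$estimates$mspe), na.rm = TRUE)
```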
The `quantForestError` function runs a complete uncertainty assessment and
reports the MSPE, bias, and the lower and upper prediction intervals:
```{r}
str(pred.q)
```
In this case, for the lower and upper confidence intervals we use the 1-standard-deviation
probability, i.e. about 2/3 probability, hence the lower quantile is 0.159
and the upper 0.841. The RMSPE should then approximately match half of the difference between
the lower and upper interval, although the exact numbers will differ.
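A quick sanity check of this relation (under a Gaussian assumption, the half-width
of the [0.159, 0.841] interval equals one standard deviation):

```{r, eval=FALSE}
## For a normal variable the half-width of the [0.159, 0.841] interval
## approximates the standard deviation, i.e. the RMSPE for unbiased errors:
z <- rnorm(1e6, mean = 0, sd = 2)
(quantile(z, 0.841) - quantile(z, 0.159)) / 2  # ~ 2
```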
The mean RMSPE for the whole study area (the mean of all pixels) should, as expected,
be somewhat higher than the RMSE we get from model fitting:
```{r}
mean(sqrt(pred.q$estimates$mspe), na.rm=TRUE)
```
This is because we are also extrapolating in a large part of the area.
The map in Fig. \@ref(fig:eberg-var-eml) correctly depicts the extrapolation areas
as having much higher RMSPE (compare with Fig. \@ref(fig:eberg-iprob)):
```{r eberg-var-eml, echo = TRUE, out.width = "90%", fig.cap = "Prediction errors for the clay content based on the forestError package and Ensemble ML."}
eberg_grid25$rmspe.cly.eml = sqrt(pred.q$estimates$mspe)
plot.map(df=eberg_grid25[c("rmspe.cly.eml")],points = as.data.frame(eberg.sp, xy=T), pcex=2, fill=SAGA_pal[[16]], grid=4,
title="Clay content prediction errors using forestError and EML")
```
In the previous examples we have shown that the samples from the Ebergotzen
dataset are spatially clustered and cover only agricultural land. Can we
still use these samples to estimate the mapping accuracy for the whole area?
The answer is yes, but we need to be aware that our estimate might be biased
(usually over-optimistic), and we need to do our best to reduce the over-fitting
effects by implementing some of the strategies above, e.g. assigning different
weights to training points and/or implementing blocking settings.
Assuming that forest soils are possibly very different from agricultural
soils, once we collect new samples in the forest part of the study area we
might discover that the actual mapping accuracy for the whole
study area is significantly lower than the estimate in Fig. \@ref(fig:eberg-var-eml),
which was based on agricultural soil samples only.
```{r example-cv, echo=FALSE, fig.cap="Example of effects of the size of the spatial blocking (buffer) on mapping accuracy.", out.width="70%"}
knitr::include_graphics("./img/Fig_zone_radius_CV.png")
```
Fig. \@ref(fig:example-cv) shows the usual effect of spatial blocking (for varying buffer sizes) on
Cross-Validation accuracy [@pohjankukka2017estimating]. Note that the accuracy
tends to stabilize at some distance, although too-strict blocking can also lead
to over-pessimistic estimates of accuracy and hence bias in predictions [@Wadoux2021EM].
In the Ebergotzen case, the prediction error could be over-pessimistic, although the
block size is relatively small considering the size of the study area.
On the other hand, if the training points are clustered, blocking becomes
important because otherwise the estimate of error and the choice of model parameters
could become over-optimistic [@meyer2018improving;@lovelace2019geocomputation].
When in doubt between over-optimistic and over-pessimistic estimates of
uncertainty, it is of course ideal to avoid both; but if forced to choose, consider that a
somewhat over-pessimistic estimate of accuracy is slightly more _on the safe side_ [@roberts2017cross].
## Testing mapping accuracy using resampling and blocking
We now switch to the Edgeroi dataset [@malone2009mapping], which is originally
based on _designed_ sampling and as such is more interesting for assessing the
effects of various blocking strategies on overall mapping accuracy. We load
the dataset and prepare a regression matrix following @hengl2019predictive:
```{r}
data(edgeroi)
edgeroi.sp <- edgeroi$sites
coordinates(edgeroi.sp) <- ~ LONGDA94 + LATGDA94
proj4string(edgeroi.sp) <- CRS("+proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs")
edgeroi.sp <- spTransform(edgeroi.sp, CRS("+init=epsg:28355"))
length(edgeroi.sp)
h2 <- hor2xyd(edgeroi$horizons)
edgeroi.grids.100m = readRDS("./extdata/edgeroi.grids.100m.rds")
edgeroi.grids = landmap::spc(edgeroi.grids.100m[-1])
```
```{r edgeroi-map, echo = TRUE, fig.cap = "The Edgeroi dataset consisting of 359 soil profiles."}
plot.map(df=raster(edgeroi.grids@predicted[1]), points = coordinates(edgeroi.sp), fill = rev(terrain.colors(10)),
grid=5, title = "Edgeroi dataset: 359 soil profiles")
```
The dataset documentation indicates that, of the 359 profiles in total, 210 soil profiles
were sampled on a systematic, equilateral triangular grid with a spacing of 2.8 km
between sites; the remaining 131 soil profiles are distributed more irregularly
or along transects [@malone2009mapping]. This is hence a **hybrid sampling design**,
but one that in general satisfies IID, so any subsample of these points should give
an unbiased estimate of the mapping accuracy. We first prepare a regression matrix that includes
all covariates and location IDs, and we also add a 500-m spatial blocking grid:
```{r}
grdE <- sp::GridTopology(cellcentre.offset=edgeroi.grids.100m@bbox[,1], cellsize=rep(500,2),
cells.dim=c(ceiling(abs(diff(edgeroi.grids.100m@bbox[1,])/500))+1,
ceiling(abs(diff(edgeroi.grids.100m@bbox[2,])/500))+1))
rE.sp <- sp::SpatialGridDataFrame(grdE, proj4string = edgeroi.grids.100m@proj4string,
data=data.frame(gid=1:(grdE@cells.dim[1] * grdE@cells.dim[2])))
ovF <- sp::over(edgeroi.sp, edgeroi.grids@predicted)
ovF$SOURCEID <- edgeroi.sp$SOURCEID
ovF$gid <- sp::over(edgeroi.sp, rE.sp)$gid
ovF$x = edgeroi.sp@coords[,1]
ovF$y = edgeroi.sp@coords[,2]
rmF <- plyr::join_all(dfs = list(edgeroi$sites, h2, ovF))
```
This produces a regression matrix with unique profile IDs (`SOURCEID`), spatial
block IDs (`gid`), and all target and covariate layers.
We can now fit a model to predict soil organic carbon content (in g/kg) in 3D:
```{r}
rmF$log.ORCDRC = log1p(rmF$ORCDRC)
formulaStringPF <- as.formula(paste0("log.ORCDRC ~ DEPTH + ", paste0("PC", 1:10, collapse = "+")))
rmPF <- rmF[complete.cases(rmF[,all.vars(formulaStringPF)]),]
#str(rmPF[,all.vars(formulaStringPF)])
```
We first fit a model of the distribution of soil organic carbon, ignoring any spatial
clustering, overlap in the 3rd dimension (soil depth) or similar:
```{r, cache=TRUE}
soc.rf = ranger(formulaStringPF, rmPF)
soc.rf
```
We then compare this model with an Ensemble ML model in which whole blocks,
i.e. including whole soil profiles, are taken out of the model training:
```{r, cache=TRUE}
SL.library2 = c("regr.ranger", "regr.glm", "regr.cvglmnet", "regr.xgboost", "regr.ksvm")
lrnsE <- lapply(SL.library2, mlr::makeLearner)
tskE <- mlr::makeRegrTask(data = rmPF[,all.vars(formulaStringPF)], target="log.ORCDRC",
blocking = as.factor(rmPF$gid))
initE.m <- mlr::makeStackedLearner(lrnsE, method = "stack.cv",
super.learner = "regr.lm",
resampling=mlr::makeResampleDesc(method = "CV", blocking.cv=TRUE))
emlE = train(initE.m, tskE)
summary(emlE$learner.model$super.model$learner.model)
```
Notice the large difference in model accuracy, with the R-square dropping from
about 0.82 to 0.65. How can we check which of the two models is more accurate,
i.e. which shows a more realistic mapping accuracy? We can again generate pseudo-grid
samples, e.g. 10 subsets, where in each subset we make sure that the points are at
least 3.5-km apart and selected randomly:
```{r, cache=TRUE}
subE.lst = lapply(1:10, function(i){landmap::sample.grid(edgeroi.sp, c(3.5e3, 3.5e3), n=1)})
```
```{r edgeroi-grid-sample, echo = TRUE, fig.cap = "Resampling original points using `sample.grid` function for the Edgeroi dataset."}
plot.map(df=subE.lst[[1]]$grid, points = subE.lst[[1]]$subset, pcex=2, grid=5, maptype = "grey", title="Resampling Edgeroi dataset using `sample.grid`")
```
So in each random pseudo-grid subset we take out about 100 of the 359 profiles and
keep them for validation only. We can next repeatedly fit models using the two
approaches and derive prediction errors. First, for the simple model ignoring
spatial clustering / soil profile locations:
```{r, cache=TRUE}
rf.soc.lst = lapply(1:length(subE.lst), function(i){
sel <- !rmPF$SOURCEID %in% subE.lst[[i]]$subset$SOURCEID;
y <- ranger::ranger(formulaStringPF, rmPF[sel,]);
out <- data.frame(meas=rmPF[!sel,"log.ORCDRC"], pred=predict(y, rmPF[!sel,])$predictions);
return(out)
}
)
rf.cv = do.call(rbind, rf.soc.lst)
Metrics::rmse(rf.cv$meas, rf.cv$pred)
```
This gives an RMSPE of 0.44, which is higher than the OOB error reported by RF
without blocking. The accuracy plot shows that the [Concordance Correlation Coefficient (CCC)](https://rdrr.io/cran/yardstick/man/ccc.html)
is about 0.79 (corresponding to an R-square of about 0.62, thus significantly
lower than what is reported by ranger):
```{r, message=FALSE, echo=TRUE}
t.b = quantile(log1p(rmPF$ORCDRC), c(0.001, 0.01, 0.999), na.rm=TRUE)
plot_hexbin(varn="SOC_RF", breaks=c(t.b[1], seq(t.b[2], t.b[3], length=25)),
meas=rf.cv$meas, pred=rf.cv$pred, main="SOC [RF]")
```
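The CCC and R-square quoted above can also be recomputed directly from the
repeated-validation pairs, using the yardstick package loaded at the start
(a cross-check on our part):

```{r, eval=FALSE}
## Recompute CCC and R-square from the repeated-validation pairs:
yardstick::ccc_vec(truth = rf.cv$meas, estimate = rf.cv$pred)
cor(rf.cv$meas, rf.cv$pred)^2
```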
```{r ac-soc1, echo=FALSE, fig.cap="Accuracy plot for soil organic carbon fitted using RF.", out.width="70%"}
knitr::include_graphics("./img/plot_CV_SOC_RF.png")
```
We repeat the same process of re-fitting the model using Ensemble ML with spatial
blocking:
```{r, eval=FALSE, cache=TRUE}
eml.soc.lst = lapply(1:length(subE.lst), function(i){
sel <- !rmPF$SOURCEID %in% subE.lst[[i]]$subset$SOURCEID;
x <- mlr::makeRegrTask(data = rmPF[sel,all.vars(formulaStringPF)],
target="log.ORCDRC", blocking = as.factor(rmPF$gid[sel]));
y <- train(initE.m, x)
out <- data.frame(meas=rmPF[!sel,"log.ORCDRC"], pred=predict(y, newdata=rmPF[!sel, y$features])$data$response);
return(out)
}
)
eml.cv = do.call(rbind, eml.soc.lst)
Metrics::rmse(eml.cv$meas, eml.cv$pred)
```
```{r, eval=FALSE, message=FALSE, echo=TRUE}
plot_hexbin(varn="SOC_EML", breaks=c(t.b[1], seq(t.b[2], t.b[3], length=25)),
meas=eml.cv$meas, pred=eml.cv$pred, main="SOC [EML]")
```
```{r ac-soc2, echo=FALSE, fig.cap="Accuracy plot for soil organic carbon fitted using Ensemble Machine Learning with spatial blocking.", out.width="70%"}
knitr::include_graphics("./img/plot_CV_SOC_EML.png")
```
So in summary, independent validation using pseudo-probability samples indicates
that the Ensemble ML produces more accurate predictions (in this case only slightly
better) and that the RMSE estimated by the meta-learner of the Ensemble ML approach is more
realistic (`Residual standard error: 0.45`). This clearly demonstrates that spatial
blocking is important to (a) prevent over-fitting, and (b) produce a more realistic
estimate of the uncertainty / mapping accuracy. Ensemble ML, however, comes at an
order of magnitude higher computing cost.
We can again plot the predictions produced by the two methods next to each other:
```{r}
newdata = edgeroi.grids@predicted@data[,paste0("PC", 1:10)]
newdata$DEPTH = 5
edgeroi.grids.100m$rf_soc_5cm = predict(soc.rf, newdata)$predictions
edgeroi.grids.100m$eml_soc_5cm = predict(emlE, newdata=newdata[,emlE$features])$data$response
```
```{r, pred-soc, echo=FALSE, fig.width=9, fig.height=4, fig.cap="Predictions of soil organic carbon (in log-scale) based on Random Forest (RF) and Ensemble ML (EML)."}
compare.map(df=stack(edgeroi.grids.100m[c("rf_soc_5cm", "eml_soc_5cm")]), points = coordinates(edgeroi.sp), pcex=2,
title="SOC predictions (in log-scale) using EML and RF.")
```
This shows that the Ensemble ML predicts significantly higher SOC in the
hill lands (the right part of the study area), so again there is a significant difference in predictions
between the two models. Even though the Ensemble ML with spatial blocking is only
slightly better in accuracy (RMSE based on the 10-times out-of-bag de-clustered validation points),
these results confirm that it helps produce a more realistic map of the RMSPE.
This matches the results of @roberts2017cross and @meyer2018improving, who suggest that block cross-validation is
nearly universally more appropriate than random cross-validation if the goal is
predicting to new data or predictor space, or selecting causal predictors.
We can also map the prediction errors:
```{r, cache=TRUE}
emlE.t = emlE$learner.model$super.model$learner.model$terms
paste(emlE.t)
emlE.m = emlE$learner.model$super.model$learner.model$model
emlE.qr <- ranger::ranger(emlE.t, emlE.m, num.trees=85, importance="impurity",
quantreg=TRUE, keep.inbag = TRUE)
outE.c <- as.data.frame(mlr::getStackedBaseLearnerPredictions(emlE,
newdata=newdata[,emlE$features]))
predE.q = forestError::quantForestError(emlE.qr,
X.train = emlE.m[,all.vars(emlE.t)[-1]],
X.test = outE.c,
Y.train = emlE.m[,all.vars(emlE.t)[1]],
alpha = (1-(quantiles[2]-quantiles[1])), n.cores=n.cores)
```
This again shows where the main extrapolation problems could be, i.e. where
multiple base learners perform poorly:
```{r edgeroi-var-eml, echo = TRUE, fig.cap = "Prediction errors for the soil organic carbon content based on the forestError package and Ensemble ML."}
## attach the RMSPE estimates to the prediction grid before plotting:
edgeroi.grids.100m$rmspe.soc.eml = sqrt(predE.q$estimates$mspe)
plot.map(df=raster(edgeroi.grids.100m[c("rmspe.soc.eml")]), points = edgeroi.sp, fill=SAGA_pal[[16]], pcex=3,
title = "SOC prediction errors using forestError and EML", grid=5)
```
```{r}
rgdal::writeGDAL(edgeroi.grids.100m[c("rmspe.soc.eml")],
"./output/edgeroi_soc_rmspe.tif",
options=c("COMPRESS=DEFLATE"))
```
The overall mapping accuracy for Edgeroi based on the mean prediction error is thus:
```{r}
mean(sqrt(predE.q$estimates$mspe), na.rm=TRUE)
```
which in general matches what we get through repeated validation using pseudo-SRS
subsampling.
From this experiment we can conclude that the mapping accuracy estimated using
ranger with out-of-bag samples, ignoring the locations of profiles, was
probably over-optimistic, and hence ranger has possibly over-fitted the target variable.
This is in fact a common problem with many 3D predictive soil mapping models:
all samples from the same soil profile share the same covariate values, so Random
Forest predicts these values more easily due to the overlap in covariate data. For a
discussion of why it is important to run internal training and Cross-Validation
using spatial blocking, refer also to @gasch2015spatio and @meyer2018improving.
```{r}
parallelMap::parallelStop()
```