# Conditioned Latin hypercube sampling {#cLHS}
This chapter and Chapter \@ref(SpatialResponseSurface) on response surface sampling are about experimental designs that have been adapted for spatial surveys. Adaptation is necessary because, in contrast to experiments, in observational studies one is not free to choose any possible combination of levels of different factors. When two covariates are strongly positively correlated, it may happen that there are no population units with a relatively large value for one covariate and a relatively small value for the other covariate. By contrast, in experimental research it is possible to select any combination of factor levels.
In a full factorial design\index{Full factorial design}, all combinations of factor levels\index{Factor level} are observed. With $k$ factors and $l$ levels per factor, the total number of observations is $l^k$. With numerous factors and/or numerous levels per factor, observing $l^k$ experimental units becomes unfeasible in practice. Alternative experimental designs have been developed that need fewer observations but still provide detailed information about how the study variable responds to changes in the factor levels. This chapter will describe and illustrate the survey sampling analogue of Latin hypercube sampling\index{Latin hypercube sample}. Response surface sampling follows in the next chapter.
Latin hypercube sampling is used in designing industrial processes, agricultural experiments, and computer experiments, with numerous covariates and/or factors whose effect on the output we want to study [@McKay1979]. A much cheaper alternative to a full factorial design is an experiment with, for each factor, exactly one observation per level. So, in the agricultural experiment described in Chapter \@ref(IntroSamplingforMapping) with the application rates of N and P as factors and four levels for each factor, this would entail four observations only, distributed over a square in such a way that each row and each column contains exactly one observation, see Figure \@ref(fig:LatinSquare). This is referred to as a Latin square. The generalisation of a Latin square to more than two dimensions is a Latin hypercube (LH).
```{r LatinSquare, echo = FALSE, fig.width = 5, out.width = "60%", fig.cap = "Latin square for agricultural experiment with four application rates of N and P."}
df <- data.frame(P = c(3, 1, 2, 4), N = c(1, 2, 3, 4))
ggplot(df) +
geom_point(aes(x = P, y = N), size = 2.5) +
scale_x_continuous(name = "P level", limits = c(1, 4)) +
scale_y_continuous(name = "N level", limits = c(1, 4)) +
coord_equal()
```
@Minasny2006 adapted LH sampling for observational studies; this adaptation is referred to as conditioned Latin hypercube (cLH) sampling\index{Conditioned Latin hypercube sampling}. For each covariate, a series of intervals (marginal strata) is defined. The number of marginal strata per covariate is equal to the sample size, so that the total number of marginal strata equals $p \times n$, with $p$ the number of covariates and $n$ the sample size. The bounds of the marginal strata are chosen such that the numbers of raster cells in these marginal strata are equal. This is achieved by using the quantiles corresponding to evenly spaced cumulative probabilities as stratum bounds. For instance, for five marginal strata we use the quantiles corresponding to the cumulative probabilities 0.2, 0.4, 0.6, and 0.8.
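As a minimal sketch of this construction for a single covariate, the next code chunk computes the bounds of five equal-size marginal strata for a simulated covariate `z` (hypothetical, not part of the data used below). The outer quantiles at probabilities 0 and 1 are simply the minimum and maximum.
```{r}
#sketch: bounds of five equal-size marginal strata for a simulated covariate
set.seed(314)
z <- rnorm(1000)
probs <- seq(from = 0, to = 1, length.out = 5 + 1)
quantile(z, probs = probs)
```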
The minimisation criterion proposed by @Minasny2006 is a weighted sum of three components:
1. O1: the sum over all marginal strata of the absolute deviations of the marginal stratum sample size from the targeted sample size (equal to 1);
2. O2: the sum over all classes of categorical covariates of the absolute deviations of the sample proportion of a given class from the population proportion of that class; and
3. O3: the sum over all entries of the correlation matrix of the absolute deviation of the correlation in the sample from the correlation in the population.
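In symbols, one way to write these components is (the notation is introduced here for clarity and is not taken from @Minasny2006):

$$\mathrm{O1} = \sum_{k=1}^{p}\sum_{j=1}^{n} \left| n_{kj} - 1 \right| \;, \qquad \mathrm{O2} = \sum_{c=1}^{C} \left| \hat{p}_c - p_c \right| \;, \qquad \mathrm{O3} = \sum_{k=1}^{p}\sum_{l=1}^{p} \left| c_{kl} - r_{kl} \right| \;,$$

with $n_{kj}$ the number of sampling units in marginal stratum $j$ of covariate $k$, $\hat{p}_c$ and $p_c$ the sample and population proportion of class $c$ of the categorical covariates, and $c_{kl}$ and $r_{kl}$ the population and sample correlation of covariates $k$ and $l$.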
With cLH sampling, the marginal distributions of the covariates in the sample are close to those in the population. This can be advantageous for mapping methods that do not rely on linear relations, for instance machine learning techniques such as classification and regression trees (CART) and random forests (RF). In addition, component O3 keeps the correlations between the covariates in the sample close to those in the population.
cLH samples can be selected with function `clhs` of package **clhs** [@Roudier2011]. With this package, the criterion is minimised by simulated annealing; see Section \@ref(SSA) for an explanation of this optimisation method. Arguments `iter`, `temp`, `tdecrease`, and `length.cycle` of function `clhs` are control parameters of the simulated annealing algorithm. In the next code chunk, I use default values for these arguments. With argument `weights`, the weights of the components of the minimisation criterion can be set. The default weights are equal to 1.
Argument `cost` is for cost-constrained cLH sampling [@Roudier2012], and argument `eta` can be used to control the sampling intensities of the marginal strata [@Minasny2010]. This argument is of interest if we would like to oversample the marginal strata near the edge of the multivariate distribution.
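As an illustration of argument `weights`, the next sketch (not evaluated; it uses the data and covariates defined below) doubles the weight of the correlation component O3. The names of the list components follow the documentation of `clhs`: `numeric` for O1, `factor` for O2, and `correlation` for O3.
```{r, eval = FALSE}
#sketch: give component O3 twice the weight of the components O1 and O2
res <- clhs(grdAmazonia[, covs], size = 20,
  weights = list(numeric = 1, factor = 1, correlation = 2),
  progress = FALSE, simple = FALSE)
```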
cLH sampling is illustrated with the five covariates of Eastern Amazonia that were used before in covariate space coverage sampling (Chapter \@ref(kmeans)).
```{r, echo = FALSE}
grdAmazonia <- read_rds(file = "results/grdAmazonia_5km.rds")
```
```{r}
library(clhs)
covs <- c("SWIR2", "Terra_PP", "Prec_dm", "Elevation", "Clay")
set.seed(314)
res <- clhs(
grdAmazonia[, covs], size = 20, iter = 50000, temp = 1, tdecrease = 0.95,
length.cycle = 10, progress = FALSE, simple = FALSE)
mysample_CLH <- grdAmazonia[res$index_samples, ]
```
Figure \@ref(fig:cLHS) shows the selected sample in a map of SWIR2. In Figure \@ref(fig:cLHSscat) the sample is plotted in a biplot of Prec_dm against SWIR2. Each black dot in the biplot represents one grid cell in the population. The vertical and horizontal lines in the biplot are at the bounds of the marginal strata of SWIR2 and Prec_dm, respectively. The number of grid cells between two consecutive vertical lines is constant, as well as the number of grid cells between two consecutive horizontal lines, i.e., the marginal strata have equal sizes. The intervals are the narrowest where the density of grid cells in the plot is highest. Ideally, in each column and row, there is exactly one sampling unit (red dot).
```{r cLHS, echo = FALSE, out.width = "100%", fig.cap = "Conditioned Latin hypercube sample from Eastern Amazonia in a map of SWIR2."}
ggplot(data = grdAmazonia) +
geom_raster(mapping = aes(x = x1 / 1000, y = x2 / 1000, fill = SWIR2)) +
geom_point(data = mysample_CLH, mapping = aes(x = x1 / 1000, y = x2 / 1000), colour = "red", size = 2) +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
scale_fill_viridis_c(name = "SWIR2") +
coord_fixed()
```
```{r cLHSscat, echo = FALSE, out.width = "100%", fig.cap = "Conditioned Latin hypercube sample plotted in a biplot of precipitation in the driest month against SWIR2. The vertical and horizontal lines are at the bounds of the marginal strata of the covariates SWIR2 and precipitation driest month, respectively."}
probs <- seq(from = 0, to = 1, length.out = nrow(mysample_CLH) + 1)
bounds <- apply(grdAmazonia[, covs], MARGIN = 2, FUN = function(x) quantile(x, probs = probs))
ggplot(data = grdAmazonia) +
geom_point(mapping = aes(x = SWIR2, y = Prec_dm), colour = "black", size = 1, alpha = 0.5) +
geom_vline(xintercept = bounds[c(-1, -nrow(bounds)), 1], colour = "grey") +
geom_hline(yintercept = bounds[c(-1, -nrow(bounds)), 3], colour = "grey") +
geom_point(data = mysample_CLH, mapping = aes(x = SWIR2, y = Prec_dm), colour = "red", size = 1) +
scale_x_continuous(name = "SWIR2") +
scale_y_continuous(name = "Precipitation driest month")
```
Figure \@ref(fig:StratumSampleSizes) shows the sample sizes for all 100 marginal strata. The next code chunk shows how the marginal stratum sample sizes are computed.
```{r}
probs <- seq(from = 0, to = 1, length.out = nrow(mysample_CLH) + 1)
bounds <- apply(grdAmazonia[, covs], MARGIN = 2,
FUN = function(x) quantile(x, probs = probs))
mysample_CLH <- as.data.frame(mysample_CLH)
#marginal stratum counts; this assumes columns 4 to 8 of mysample_CLH
#hold the five covariates, in the order of covs
counts <- lapply(1:5, function(i)
  hist(mysample_CLH[, i + 3], bounds[, i], plot = FALSE)$counts)
```
```{r StratumSampleSizes, out.width = '100%', echo = FALSE, fig.asp = 0.4, fig.cap = "Sample sizes of marginal strata for the conditioned Latin hypercube sample of size 20 from Eastern Amazonia."}
countslf <- data.frame(counts = unlist(counts))
countslf$covariate <- rep(covs, each = 20)
countslf$stratum <- rep(1:20, times = 5)
ggplot(countslf) +
geom_point(mapping = aes(x = stratum, y = counts), colour = "black", size = 1) +
facet_wrap(~covariate) +
scale_x_continuous(name = "Stratum") +
scale_y_continuous(name = "Sample size", breaks = c(0, 1, 2))
```
For all marginal strata\index{Marginal stratum} with one sampling unit, the contribution to component O1 of the minimisation criterion is 0. For marginal strata with zero or two sampling units, the contribution is 1; for marginal strata with three sampling units, the contribution equals 2; etc. In Figure \@ref(fig:StratumSampleSizes) there are four marginal strata with zero units and four marginal strata with two units. Component O1 therefore equals 8 in this case.
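A quick check, using the marginal stratum sample sizes computed above (not evaluated here):
```{r, eval = FALSE}
#O1 is the sum of the absolute deviations from the target size of 1
sum(abs(unlist(counts) - 1))
```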
Figure \@ref(fig:tracecLHS) shows the trace of the objective function, i.e., the values of the minimisation criterion during the optimisation. The trace plot indicates that 50,000 iterations are sufficient; I do not expect that the criterion can be reduced much further. The final value of the minimisation criterion is extracted with function `tail` using argument `n = 1`.
```{r}
trace <- res$obj
tail(trace, n = 1)
```
```{r tracecLHS, echo = FALSE, fig.asp = 0.6, fig.width = 5, fig.cap = "Trace of minimisation criterion during optimisation of conditioned Latin hypercube sampling from Eastern Amazonia."}
tracedf <- data.frame(trace = trace)
ggplot(tracedf) +
geom_line(mapping = aes(x = seq_len(nrow(tracedf)), y = trace), colour = "black", size = 0.6) +
scale_x_continuous(name = "Iteration") +
scale_y_continuous(name = "Criterion")
```
In the next code chunk, the minimised value of the criterion is computed "by hand". As none of the covariates is categorical, component O2 equals zero, so with the default unit weights the criterion reduces to the sum of O1 and O3.
```{r}
countslf <- data.frame(counts = unlist(counts))
O1 <- sum(abs(countslf$counts - 1))
rho <- cor(grdAmazonia[, covs])
r <- cor(mysample_CLH[, covs])
O3 <- sum(abs(rho - r))
print(O1 + O3)
```
#### Exercises {-}
1. Use the data of Hunter Valley (`grdHunterValley` of package **sswr**) to select a cLH sample of size 50, using elevation_m, slope_deg, cos_aspect, cti, and ndvi as covariates. Plot the selected sample in a map of covariate cti, and plot the selected sample in a biplot of cti against elevation_m. In which part of the biplot are most units selected, and which part is undersampled?
2. Load the simulated data of Figure \@ref(fig:twosamples) (`results/SimulatedSquare.rda`), and select a cLH sample of size 16, using the covariate $x$ and the spatial coordinates as stratification variables. Plot the selected sample in the square with simulated covariate values.
+ What do you think of the geographical spreading of the sampling units? Is it optimal?
+ Compute the number of sampling units in the marginal strata of $s1$, $s2$, and the covariate $x$. First, compute the bounds of these marginal strata. Are all marginal strata of $s1$ and $s2$ sampled? Suppose that all marginal strata of $s1$ and $s2$ are sampled (contain one sampling point), does this guarantee good spatial coverage?
+ Plot the trace of the minimisation criterion, and retrieve the minimised value. Is this minimised value in agreement with the marginal stratum sample sizes?
## Conditioned Latin hypercube infill sampling {#cLHIS}
Package **clhs** can also be used for selecting a conditioned Latin hypercube sample in addition to existing sampling units (legacy sample), as in spatial infill sampling (Section \@ref(SpatialCoverage)). The units of the legacy sample are assigned to argument `must.include`. Argument `size` must be set to the total sample size, i.e., the number of mandatory units (legacy sample units) plus the number of additional infill units.
To illustrate conditioned Latin hypercube *infill* sampling (cLHIS), in the next code chunk I select a simple random sample of ten units from Eastern Amazonia to serve as the legacy sample. Twenty new units are selected by cLHIS. The ten mandatory units (i.e., units which are already sampled and thus must be in the sample set computed by cLHIS) are at the end of the vector with the index of the selected raster cells.
```{r}
set.seed(314)
units <- sample(nrow(grdAmazonia), 10, replace = FALSE)
res <- clhs(grdAmazonia[, covs], size = 30, must.include = units,
tdecrease = 0.95, iter = 50000, progress = FALSE, simple = FALSE)
mysample_CLHI <- grdAmazonia[res$index_samples, ]
mysample_CLHI$free <- as.factor(rep(c(1, 0), c(20, 10)))
```
Figure \@ref(fig:cLHIS) shows the selected cLHI sample in a map of SWIR2. In Figure \@ref(fig:cLHISscat) the sample is plotted in a biplot of Prec_dm against SWIR2. The marginal strata already covered by the legacy sample are mostly avoided by the additional sample.
```{r cLHIS, echo = FALSE, out.width = "100%", fig.cap = "Conditioned Latin hypercube infill sample from Eastern Amazonia in a map of SWIR2. Legacy units have free-value 0; infill units have free-value 1."}
ggplot(data = grdAmazonia) +
geom_raster(mapping = aes(x = x1 / 1000, y = x2 / 1000, fill = SWIR2)) +
geom_point(data = mysample_CLHI, mapping = aes(x = x1 / 1000, y = x2 / 1000, shape = free), colour = "red", size = 2) +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
scale_fill_viridis_c(name = "SWIR2") +
coord_fixed()
```
```{r cLHISscat, echo = FALSE, fig.asp = 0.7, out.width = "100%", fig.cap = "Conditioned Latin hypercube infill sample plotted in a biplot of precipitation in the driest month against SWIR2. Legacy units have free-value 0; infill units have free-value 1."}
#recompute stratum bounds as we have now 10 + 20 units
probs <- seq(from = 0, to = 1, length.out = 31)
bounds <- apply(grdAmazonia[, covs], MARGIN = 2, FUN = function(x) quantile(x, probs = probs))
ggplot(data = grdAmazonia) +
geom_point(mapping = aes(x = SWIR2, y = Prec_dm), colour = "black", size = 1, alpha = 0.5) +
geom_vline(xintercept = bounds[c(-1, -nrow(bounds)), 1], colour = "grey") +
geom_hline(yintercept = bounds[c(-1, -nrow(bounds)), 3], colour = "grey") +
geom_point(data = as.data.frame(mysample_CLHI), mapping = aes(x = SWIR2, y = Prec_dm, colour = free), size = 2) +
scale_colour_discrete() +
scale_x_continuous(name = "SWIR2") +
scale_y_continuous(name = "Precipitation driest month")
```
## Performance of conditioned Latin hypercube sampling in random forest prediction
The performance of cLH sampling is studied in the same experiment as covariate space coverage sampling of the previous chapter. In total, 500 cLH samples of size 25 are selected and an equal number of samples of sizes 50 and 100. Each sample is used to calibrate a RF model for the aboveground biomass (AGB) using five covariates as predictors. The calibrated models are used to predict AGB at the 25,000 validation units selected by simple random sampling without replacement. Simple random (SI) sampling is added as a reference sampling design that ignores the covariates. The prediction errors are used to estimate three map quality indices\index{Map quality index}, the population mean error\index{Population mean error} (ME), the population root mean squared error\index{Population root mean squared error} (RMSE), and the population Nash-Sutcliffe model efficiency coefficient\index{Model efficiency coefficient} (MEC), see Chapter \@ref(Validation).
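Written out (a sketch of the standard definitions, with $e_i = \hat{y}_i - y_i$ the prediction error at validation unit $i$ and $N_v$ the number of validation units):

$$\mathrm{ME} = \frac{1}{N_v}\sum_{i=1}^{N_v} e_i \;, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N_v}\sum_{i=1}^{N_v} e_i^2} \;, \qquad \mathrm{MEC} = 1 - \frac{\sum_{i=1}^{N_v} e_i^2}{\sum_{i=1}^{N_v} \left( y_i - \bar{y} \right)^2} \;,$$

with $y_i$ the observed AGB at validation unit $i$ and $\bar{y}$ the mean of the observed AGB at the validation units.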
```{r repeatedCSCandcLHS, echo = FALSE, eval = FALSE}
library(LICORS)
library(ranger)
library(fields)  #for rdist()
#select subgrid to be used as master sample for selecting calibration samples
grdAmazonia$id <- seq_len(nrow(grdAmazonia))
gridded(grdAmazonia) <- ~x1 + x2
subgrid <- spsample(grdAmazonia, type = "regular", cellsize = 10000, offset = c(0.5, 0.5))
res <- over(subgrid, grdAmazonia)
subgrid <- data.frame(coordinates(subgrid), res)
#use all nodes of the full grid not selected in the 10 km x 10 km subgrid to select a large validation sample by simple random sampling
grdAmazonia <- as_tibble(grdAmazonia)
grdmincalibration <- grdAmazonia[-subgrid$id, ]
nval <- 25000
set.seed(314)
units <- sample(nrow(grdmincalibration), size = nval, replace = FALSE)
myvalidationsample <- grdmincalibration[units, ]
populationmeans <- apply(grdAmazonia[, covs], MARGIN = 2, FUN = mean)
populationsds <- apply(grdAmazonia[, covs], MARGIN = 2, FUN = sd)
r.population <- cor(grdAmazonia[, covs])
n <- 25
probs <- seq(from = 0, to = 1, length.out = n + 1)
bounds <- apply(subgrid[, covs], MARGIN = 2, FUN = function(x) quantile(x, probs = probs))
S <- 500
MSSSD.CSC <- MSSSD.CLH <- MSSSD.SI <- O1O3.CSC <- O1O3.CLH <- O1O3.SI <- time.CSC <- time.CLH <- numeric(length = S)
ME.CSC <- ME.CLH <- ME.SI <- RMSE.CSC <- RMSE.CLH <- RMSE.SI <- numeric(length = S)
for (i in 1:S) {
print(i)
#select CSC sample
time <- system.time(myClusters <- kmeans(scale(subgrid[, covs]), centers = n, iter.max = 10000, nstart = 500)) #n = 25
# time <- system.time(myClusters <- kmeans(scale(subgrid[, covs]), centers = n, iter.max = 10000, nstart = 350)) #n = 50
# time <- system.time(myClusters <- kmeans(scale(subgrid[, covs]), centers = n, iter.max = 10000, nstart = 200)) #n = 100
time.CSC[i] <- time[1]
subgrid$cluster <- myClusters$cluster
res <- rdist(x1 = myClusters$centers,
x2 = scale(subgrid[, covs]))
units <- apply(res, MARGIN = 1, FUN = which.min)
myCSCsample <- subgrid[units, ]
#compute MSSSD
res <- rdist(x1 = scale(myCSCsample[, covs], center = populationmeans, scale = populationsds),
x2 = scale(subgrid[, covs]))
dmin <- apply(res, MARGIN = 2, min)
MSSSD.CSC[i] <- mean(dmin^2)
#compute O1 + O3
counts <- lapply(1:5, function(i)
hist(myCSCsample[, i + 3], bounds[, i], plot = FALSE)$counts
)
countslf <- data.frame(counts = unlist(counts))
O1 <- sum(abs(countslf - 1))
r.sample <- cor(myCSCsample[, covs])
O3 <- sum(abs(r.population - r.sample))
O1O3.CSC[i] <- O1 + O3
#select cLH sample
time <- system.time(clhs.out <- clhs(subgrid[, covs], size = n, tdecrease = 0.95, iter = 10000,
progress = FALSE, simple = FALSE))
time.CLH[i] <- time[1]
index <- clhs.out$index_samples
mysample_CLH <- subgrid[index, ]
#save O1 + O3
O1O3.CLH[i] <- tail(clhs.out$obj, 1)
#compute MSSSD
res <- rdist(x1 = scale(mysample_CLH[, covs], center = populationmeans, scale = populationsds),
x2 = scale(subgrid[, covs]))
dmin <- apply(res, MARGIN = 2, min)
MSSSD.CLH[i] <- mean(dmin^2)
#select SI sample
units <- sample(nrow(subgrid), size = n, replace = FALSE)
mySIsample <- subgrid[units, ]
#compute MSSSD
res <- rdist(x1 = scale(mySIsample[, covs], center = populationmeans, scale = populationsds),
x2 = scale(subgrid[, covs]))
dmin <- apply(res, MARGIN = 2, min)
MSSSD.SI[i] <- mean(dmin^2)
#compute O1 + O3
counts <- lapply(1:5, function(i)
hist(mySIsample[, i + 3], bounds[, i], plot = FALSE)$counts
)
countslf <- data.frame(counts = unlist(counts))
O1 <- sum(abs(countslf - 1))
r.sample <- cor(mySIsample[, covs])
O3 <- sum(abs(r.population - r.sample))
O1O3.SI[i] <- O1 + O3
#fit random forest
#use CSC sample for calibration
forest.sample <- ranger(
AGB ~ SWIR2 + Terra_PP + Prec_dm + Elevation + Clay,
data = myCSCsample,
num.trees = 1000
)
out <- predict(forest.sample, data = myvalidationsample, type = "response")
AGBpred <- out$predictions
error <- AGBpred - myvalidationsample$AGB
ME.CSC[i] <- mean(error)
RMSE.CSC[i] <- sqrt(mean(error^2))
#use CLH sample for calibration
forest.sample <- ranger(
AGB ~ SWIR2 + Terra_PP + Prec_dm + Elevation + Clay,
data = mysample_CLH,
num.trees = 1000
)
out <- predict(forest.sample, data = myvalidationsample, type = "response")
AGBpred <- out$predictions
error <- AGBpred - myvalidationsample$AGB
ME.CLH[i] <- mean(error)
RMSE.CLH[i] <- sqrt(mean(error^2))
#use SI sample for calibration
forest.sample <- ranger(
AGB ~ SWIR2 + Terra_PP + Prec_dm + Elevation + Clay,
data = mySIsample,
num.trees = 1000
)
out <- predict(forest.sample, data = myvalidationsample, type = "response")
AGBpred <- out$predictions
error <- AGBpred - myvalidationsample$AGB
ME.SI[i] <- mean(error)
RMSE.SI[i] <- sqrt(mean(error^2))
}
save(ME.CSC, ME.CLH, ME.SI, RMSE.CSC, RMSE.CLH, RMSE.SI, MSSSD.CSC, MSSSD.CLH, MSSSD.SI, O1O3.CSC, O1O3.CLH, O1O3.SI, time.CSC, time.CLH, file = "results/CSCversusCLH_Amazonia_n25.rda")
```
Figure \@ref(fig:boxplotsval) shows the results as boxplots, each based on 500 estimates. For $n=25$ and $100$, cLH sampling performs best in terms of RMSE and MEC, whereas for $n=50$ CSC sampling performs best. For $n=25$ and $50$, the boxplots of cLH and SI show quite a few outliers with large values of RMSE and, consequently, small values of MEC. For CSC, these map quality indices are more stable. Remarkably, for $n=100$ SI sampling performs about as well as CSC and cLH sampling.
```{r boxplotsval, echo = FALSE, out.width = "100%", fig.asp = 0.7, fig.cap = "Boxplots of ME, RMSE, and MEC of predictions with RF models calibrated on conditioned Latin hypercube (cLH), covariate space coverage (CSC), and simple random (SI) samples from Eastern Amazonia, for sample sizes of 25, 50, and 100 units."}
#select subgrid to be used as master sample for selecting calibration samples
grdAmazonia$id <- seq_len(nrow(grdAmazonia))
gridded(grdAmazonia) <- ~ x1 + x2
subgrid <- spsample(grdAmazonia, type = "regular", cellsize = 10000, offset = c(0.5, 0.5))
res <- over(subgrid, grdAmazonia)
subgrid <- data.frame(coordinates(subgrid), res)
#use all nodes of the full grid not selected in the master sample (the 10 km x 10 km subgrid used for the calibration samples) to select a large validation sample by simple random sampling
grdAmazonia <- as_tibble(grdAmazonia)
grdmincalibration <- grdAmazonia[-subgrid$id, ]
nval <- 25000
set.seed(314)
units <- sample(nrow(grdmincalibration), size = nval, replace = FALSE)
myvalidationsample <- grdmincalibration[units, ]
#ME
load("results/CSCversusCLH_Amazonia_n25.rda")
df <- data.frame(cbind(ME.CLH, ME.CSC, ME.SI))
names(df) <- c("cLH", "CSC", "SI")
d_25 <- df %>% pivot_longer(cols = c("cLH", "CSC", "SI"))
d_25$n <- 25
load("results/CSCversusCLH_Amazonia_n50.rda")
df <- data.frame(cbind(ME.CLH, ME.CSC, ME.SI))
names(df) <- c("cLH", "CSC", "SI")
d_50 <- df %>% pivot_longer(cols = c("cLH", "CSC", "SI"))
d_50$n <- 50
load("results/CSCversusCLH_Amazonia_n100.rda")
df <- data.frame(cbind(ME.CLH, ME.CSC, ME.SI))
names(df) <- c("cLH", "CSC", "SI")
d_100 <- df %>% pivot_longer(cols = c("cLH", "CSC", "SI"))
d_100$n <- 100
d <- data.frame(rbind(d_25, d_50, d_100))
d$n <- as.factor(d$n)
levels(d$n) <- c("25", "50", "100")
names(d)[c(1, 2)] <- c("Design", "ME")
plt1 <- ggplot(d, aes(x = n, y = ME)) +
geom_boxplot(
aes(color = Design),
position = position_dodge(1)) +
scale_colour_viridis_d()
#RMSE
load("results/CSCversusCLH_Amazonia_n25.rda")
df <- data.frame(cbind(RMSE.CLH, RMSE.CSC, RMSE.SI))
names(df) <- c("cLH", "CSC", "SI")
d_25 <- df %>% pivot_longer(cols = c("cLH", "CSC", "SI"))
d_25$n <- 25
load("results/CSCversusCLH_Amazonia_n50.rda")
df <- data.frame(cbind(RMSE.CLH, RMSE.CSC, RMSE.SI))
names(df) <- c("cLH", "CSC", "SI")
d_50 <- df %>% pivot_longer(cols = c("cLH", "CSC", "SI"))
d_50$n <- 50
load("results/CSCversusCLH_Amazonia_n100.rda")
df <- data.frame(cbind(RMSE.CLH, RMSE.CSC, RMSE.SI))
names(df) <- c("cLH", "CSC", "SI")
d_100 <- df %>% pivot_longer(cols = c("cLH", "CSC", "SI"))
d_100$n <- 100
d <- data.frame(rbind(d_25, d_50, d_100))
d$n <- as.factor(d$n)
levels(d$n) <- c("25", "50", "100")
names(d)[c(1, 2)] <- c("Design", "RMSE")
plt2 <- ggplot(d, aes(x = n, y = RMSE)) +
geom_boxplot(
aes(color = Design),
position = position_dodge(1)) +
scale_colour_viridis_d()
#MEC
S2.AGB <- var(myvalidationsample$AGB)
d$MEC <- 1 - d$RMSE^2 / S2.AGB
plt3 <- ggplot(d, aes(x = n, y = MEC)) +
geom_boxplot(
aes(color = Design),
position = position_dodge(1)) +
scale_colour_viridis_d()
grid.arrange(plt1, plt2, plt3, ncol = 2)
```
In Figure \@ref(fig:RelationO1O3RMSE) the RMSE is plotted against the minimised criterion (O1 + O3) for the cLH and the SI samples. For all three sample sizes, there is a weak positive correlation between the minimisation criterion and the RMSE: for $n=25$, 50, and 100 this correlation is 0.369, 0.290, and 0.140, respectively. On average, cLH performs slightly better than SI for $n=25$ (Table \@ref(tab:TableRMSE4cLHandSI)). The gain in accuracy decreases with the sample size; for $n=100$, the two designs perform about equally. Especially for $n=25$ and 50, the distribution of the RMSE with SI has a long right tail. For these small sample sizes, the risk of selecting an SI sample that leads to a poor map with a large RMSE is much greater than with cLH sampling.
```{r RelationO1O3RMSE, echo = FALSE, fig.asp = 0.7, out.width = "100%", fig.cap = "Biplot of the minimisation criterion (O1 + O3) and the RMSE of RF predictions of AGB in Eastern Amazonia for conditioned Latin hypercube (cLH) sampling and simple random (SI) sampling, and three sample sizes."}
load("results/CSCversusCLH_Amazonia_n25.rda")
df.25 <- data.frame(RMSE = c(RMSE.SI, RMSE.CLH), O1O3 = c(O1O3.SI, O1O3.CLH), Design = rep(c("SI", "cLH"), each = 500), Samplesize = 25)
load("results/CSCversusCLH_Amazonia_n50.rda")
df.50 <- data.frame(RMSE = c(RMSE.SI, RMSE.CLH), O1O3 = c(O1O3.SI, O1O3.CLH), Design = rep(c("SI", "cLH"), each = 500), Samplesize = 50)
load("results/CSCversusCLH_Amazonia_n100.rda")
df.100 <- data.frame(RMSE = c(RMSE.SI, RMSE.CLH), O1O3 = c(O1O3.SI, O1O3.CLH), Design = rep(c("SI", "cLH"), each = 500), Samplesize = 100)
df <- rbind(df.25, df.50, df.100)
df$Samplesize <- as.factor(df$Samplesize)
df$Design <- as.factor(df$Design)
mRMSE <- tapply(df$RMSE, INDEX = list(df$Samplesize, df$Design), FUN = mean)
S2RMSE <- tapply(df$RMSE, INDEX = list(df$Samplesize, df$Design), FUN = var)
seRMSE <- sqrt(S2RMSE / 500)
ggplot(df, mapping = aes(x = O1O3, y = RMSE, shape = Samplesize, colour = Design)) +
geom_point(alpha = 0.5) +
scale_shape_manual(values = c(1, 0, 2), name = "Sample size") +
scale_colour_discrete()
```
```{r TableRMSE4cLHandSI, echo = FALSE}
df <- data.frame(c(25, 50, 100), mRMSE[, 1], mRMSE[, 2])
df[, c(2, 3)] <- round(df[, c(2, 3)], 2)
rownames(df) <- NULL
knitr::kable(
df, caption = "Mean RMSE of RF predictions of AGB in Eastern Amazonia of 500 conditioned Latin hypercube (cLH) samples and 500 simple random (SI) samples, and three sample sizes.",
booktabs = TRUE,
col.names = c("Sample size", "cLH", "SI"),
linesep = ""
) %>%
kable_classic()
```
These results are somewhat different from the results of @Wadoux2019 and @Ma2020. In these case studies, cLH sampling appeared to be an inefficient design for selecting a calibration sample that is subsequently used for mapping. @Wadoux2019 compared cLH, CSC, spatial coverage sampling (Section \@ref(SpatialCoverage)), and SI for mapping soil organic carbon in France with a RF model. The latter two sampling designs do not exploit the covariates in selecting the calibration units. Sample sizes were 100, 200, 500, and 1,000. cLH performed worse (larger RMSE) than CSC and not significantly better than SI for all sample sizes.
@Ma2020 compared cLH, CSC, and SI for mapping soil classes with various models, among which a RF model, in a study area in Germany. Sample sizes were 20, 30, 40, 50, 75, and 100 points. They found no relation between the minimisation criterion of cLH and the overall accuracy of the map with predicted soil classes. Models calibrated on CSC samples performed best on average, i.e., the overall accuracy of the maps obtained with these models was highest. cLH was hardly better than SI.
```{r, echo = FALSE}
rm(list = ls())
```