# Stratified simple random sampling {#STSI}
In stratified random sampling\index{Stratified random sampling} the population is divided into subpopulations, for instance, soil mapping units, areas with the same land use or land cover, administrative units, etc. The subareas are mutually exclusive, i.e., they do not overlap, and are jointly exhaustive, i.e., their union equals the entire population (study area). Within each subpopulation, referred to as a stratum\index{Stratum}, a probability sample is selected by some sampling design. If these probability samples are selected by simple random sampling, as described in the previous chapter, the design is stratified *simple* random sampling\index{Stratified random sampling!stratified simple random sampling}, the topic of this chapter. If sampling units are selected by cluster random sampling, then the design is stratified *cluster* random sampling.
Stratified simple random sampling is illustrated with Voorst (Figure \@ref(fig:SampleSTSI)). Tibble `grdVoorst` with simulated data contains variable `stratum`. The strata are combinations of soil class and land use, obtained by overlaying a soil map and a land use map. To select a stratified simple random sample, we set the total sample size $n$. The sampling units must be apportioned to the strata. I chose to apportion the units proportionally to the size (area, number of grid cells) of the strata (for details, see Section \@ref(STSIallocation)). The larger a stratum, the more units are selected from this stratum. The sizes of the strata, i.e., the total number of grid cells, are computed with function `tapply`.
```{r stsi}
N_h <- tapply(grdVoorst$stratum, INDEX = grdVoorst$stratum, FUN = length)
w_h <- N_h / sum(N_h)
n <- 40
print(n_h <- round(n * w_h))
```
The sum of the stratum sample sizes is 41; we want 40, so we reduce the largest stratum sample size by 1.
```{r}
n_h[1] <- n_h[1] - 1
```
The stratified simple random sample is selected with function `strata` of package **sampling** [@Tille2016]. Argument `size` specifies the stratum sample sizes.
```{block2, type = 'rmdcaution'}
The stratum sample sizes must be in the order the strata are encountered in tibble `grdVoorst`, which is determined first with function `unique`.
```
Within the strata, the grid cells are selected by simple random sampling *with replacement* (`method = "srswr"`), so that in principle more than one point can be selected within a grid cell; see Chapter \@ref(SI) for the motivation of this. Function `getdata` extracts the observations of the selected units from the sampling frame, as well as the spatial coordinates and the stratum of these units. The coordinates of the centres of the selected grid cells are jittered by an amount equal to half the side of the grid cells. In the next code chunk, this is done with function `mutate` of package **dplyr** [@dplyr], which is part of the **tidyverse** [@Wickham2019]. We have seen the pipe operator `%>%` of package **magrittr** [@magrittr] before in Subsection \@ref(CDF). If you are not familiar with the **tidyverse**, I recommend reading the excellent book *R for Data Science* [@Wickham2017].
```{r}
library(sampling)
ord <- unique(grdVoorst$stratum)
set.seed(314)
units <- sampling::strata(
grdVoorst, stratanames = "stratum", size = n_h[ord], method = "srswr")
mysample <- getdata(grdVoorst, units) %>%
mutate(s1 = s1 %>% jitter(amount = 25 / 2),
s2 = s2 %>% jitter(amount = 25 / 2))
```
```{block2, type = 'rmdnote'}
The name of the package is added to function `strata` (`sampling::strata`), as `strata` is also a function of another package. Not adding the name of the package may result in an error message.
```
Figure \@ref(fig:SampleSTSI) shows the selected sample.
```{r SampleSTSI, echo = FALSE, out.width = '100%', fig.cap = "Stratified simple random sample of size 40 from Voorst. Strata are combinations of soil class and land use."}
ggplot(data = grdVoorst, mapping = aes(x = s1 / 1000, y = s2 / 1000, fill = stratum)) +
geom_raster() +
geom_point(data = mysample, size = 1.5) +
scale_fill_manual(
name = "",
values = c(BA = "darkgreen", EA = "brown", PA = "orange", RA = "green", XF = "grey")
) +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed() +
theme(panel.background = element_blank(),
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.background = element_blank())
```
## Estimation of population parameters {#EstimatorsSTSI}
With simple random sampling within strata, the estimator of the population mean for simple random sampling (Equation \@ref(eq:HTMeanSI)) is applied at the level of the strata. The estimated stratum means are then averaged, using the relative sizes or areas of the strata as weights:
\begin{equation}
\hat{\bar{z}}= \sum\limits_{h=1}^{H} w_{h}\,\hat{\bar{z}}_{h} \;,
(\#eq:HTMeanSTSI)
\end{equation}
where $H$ is the number of strata, $w_{h}$ is the relative size (area) of stratum $h$ (stratum weight)\index{Stratum weight}: $w_h = N_h/N$, and $\hat{\bar{z}}_{h}$ is the estimated mean of stratum $h$, estimated by the sample mean for stratum $h$:
\begin{equation}
\hat{\bar{z}}_{h}=\frac{1}{n_h}\sum_{k \in \mathcal{S}_h} z_k\;,
(\#eq:HTStratumMeanSI)
\end{equation}
with $\mathcal{S}_h$ the sample selected from stratum $h$.
The same estimator is found when the $\pi$ estimator is worked out for stratified simple random sampling. With stratified simple random sampling without replacement and different sampling fractions for the strata, the inclusion probabilities differ among the strata and equal $\pi_{k} = n_h/N_h$ for all $k$ in stratum $h$, with $n_h$ the sample size of stratum $h$ and $N_h$ the size of stratum $h$. Inserting this in the $\pi$ estimator of the population mean (Equation \@ref(eq:HTMean)) gives
\begin{equation}
\hat{\bar{z}}= \frac{1}{N}\sum\limits_{h=1}^{H}\sum\limits_{k \in \mathcal{S}_h} \frac{z_{k}}{\pi_{k}} = \frac{1}{N}\sum\limits_{h=1}^{H} \frac{N_h}{n_h}\sum\limits_{k \in \mathcal{S}_h} z_{k} = \sum\limits_{h=1}^{H} w_{h}\,\hat{\bar{z}}_{h} \;.
(\#eq:HTMeanSTSI2)
\end{equation}
```{block2, type = 'rmdnote'}
The sampling fractions are usually slightly different, even with proportional allocation (Section \@ref(STSIallocation)), because $n_h/N_h$ cannot be made exactly equal for all strata. Sample sizes necessarily are integers, so $n_h/N_h$ must be rounded to integers.
```
The sampling variance of the estimator of the population mean is estimated by first estimating the sampling variances of the estimated stratum means, followed by computing the weighted average of the estimated sampling variances of the estimated stratum means. Note that we must square the stratum weights:
\begin{equation}
\widehat{V}\!\left(\hat{\bar{z}}\right)=\sum\limits_{h=1}^{H}w_{h}^{2}\,\widehat{V}\!\left(\hat{\bar{z}}_{h}\right)\;,
(\#eq:EstVarMeanSTSI)
\end{equation}
where $\widehat{V}\!\left(\hat{\bar{z}}_{h}\right)$ is the estimated sampling variance of $\hat{\bar{z}}_{h}$:
\begin{equation}
\widehat{V}\!\left(\hat{\bar{z}}_{h}\right)= (1-\frac{n_h}{N_h}) \frac{\widehat{S^2}_h(z)}{n_h}\;,
(\#eq:EstVarstratummean)
\end{equation}
with $\widehat{S^2}_h(z)$ the estimated variance of $z$ within stratum $h$:
\begin{equation}
\widehat{S^2}_h(z)=\frac{1}{n_h-1}\sum\limits_{k \in \mathcal{S}_h}\left(z_{k}-\hat{\bar{z}}_{h}\right)^{2}\;.
(\#eq:EstStratumVar)
\end{equation}
For stratified simple random sampling with replacement of finite populations and for stratified simple random sampling of infinite populations, the finite population corrections (fpcs) $1-(n_h/N_h)$ can be dropped.
```{r}
mz_h <- tapply(mysample$z, INDEX = mysample$stratum, FUN = mean)
mz <- sum(w_h * mz_h)
S2z_h <- tapply(mysample$z, INDEX = mysample$stratum, FUN = var)
v_mz_h <- S2z_h / n_h
se_mz <- sqrt(sum(w_h^2 * v_mz_h))
```
```{r stratummeans, echo = FALSE}
N_h <- as.numeric(N_h)
nms <- names(n_h)
n_h_copy <- n_h
names(n_h_copy) <- NULL
names(mz_h) <- NULL
names(S2z_h) <- NULL
names(v_mz_h) <- NULL
df <- data.frame(nms, Nh = formatC(N_h, 0, format = "f", big.mark = ","), nh = n_h_copy, Mean = round(mz_h, 1), Var = formatC(S2z_h, 1, format = "f", big.mark = ","), se = round(sqrt(v_mz_h), 1))
knitr::kable(
df, caption = "Size (Nh), sample size (nh), estimated mean (Mean), estimated variance (Variance), and estimated standard error of estimator of mean (se) of the five strata in Voorst.",
col.names = c("Stratum", "Nh", "nh", "Mean", "Variance", "se"),
booktabs = TRUE,
linesep = ""
) %>%
kable_classic()
```
Table \@ref(tab:stratummeans) shows per stratum the estimated mean, variance, and sampling variance of the estimated mean of the SOM concentration. We can see large differences in the within-stratum variances\index{Within-stratum variance}. For the stratified sample of Figure \@ref(fig:SampleSTSI), the estimated population mean equals `r formatC(mz, 1, format = "f")` g kg^-1^, and the estimated standard error of this estimator equals `r formatC(se_mz, 1, format = "f")` g kg^-1^.
The population mean can also be estimated directly using the basic $\pi$ estimator (Equation \@ref(eq:HTMean)). The inclusion probabilities are included in `data.frame` `mysample`, obtained with function `getdata` (see code chunk above), as variable `Prob`.
```{r}
head(mysample)
```
The population total is estimated first, and by dividing this estimated total by the total number of population units $N$ an estimate of the population mean is obtained.
```{r}
tz <- sum(mysample$z / mysample$Prob)
print(mz <- tz / sum(N_h))
```
The two estimates of the population mean are not exactly equal. This is due to rounding errors in the inclusion probabilities. This can be shown by computing the sum of the inclusion probabilities over all population units. This sum should be equal to the sample size $n=40$, but as we can see below, this sum is slightly smaller.
```{r}
pi_h <- tapply(mysample$Prob, INDEX = mysample$stratum, FUN = unique)
print(sum(pi_h * N_h))
```
Now suppose we ignore that the sample data come from a stratified sampling design and we use the (unweighted) sample mean as an estimate of the population mean.
```{r}
print(mean(mysample$z))
```
The sample mean differs slightly from the proper estimate of the population mean (86.334). The sample mean is a *biased* estimator, but the bias is small. The bias is only small because the stratum sample sizes are about proportional to the sizes of the strata, so that the inclusion probabilities (sampling intensities) are about equal for all strata: `r pi_h`. The probabilities are not exactly equal because the stratum sample sizes are necessarily rounded to integers and because we reduced the largest sample size by one unit. The bias would have been substantially larger if an equal number of units had been selected from each stratum, leading to much larger differences in the inclusion probabilities among the strata. The sampling intensity in stratum BA, for instance, would then be much smaller than in the other strata, and so would be the inclusion probabilities of the units in this stratum. Stratum BA would then be underrepresented in the sample. This is not a problem as long as we account for the differences in the inclusion probabilities of the units when estimating the population mean. The observations in stratum BA then get the largest weights, equal to the inverse of their inclusion probability. If we do not account for these differences in inclusion probabilities, the estimator of the mean will be seriously biased.
The next code chunk shows how the population mean and its standard error can be estimated with package **survey** [@Lumley2020]. Note that the design weights $N_h/n_h$, i.e., the inverse of the inclusion probabilities, must be passed to function `svydesign` using argument `weight`. These weights are first attached to `data.frame` `mysample` by creating a look-up table `lut`, which is then merged with `data.frame` `mysample` using function `merge`.
```{r}
library(survey)
labels <- sort(unique(mysample$stratum))
lut <- data.frame(stratum = labels, weight = N_h / n_h)
mysample <- merge(x = mysample, y = lut)
design_stsi <- svydesign(
id = ~ 1, strata = ~ stratum, weight = ~ weight, data = mysample)
svymean(~ z, design_stsi)
```
### Population proportion, cumulative distribution function, and quantiles
The proportion of a population satisfying some condition can be estimated by Equations \@ref(eq:HTMeanSTSI) and \@ref(eq:HTStratumMeanSI), substituting for the study variable $z_k$ a 0/1 indicator $y_k$ with value 1 if the condition is satisfied for unit $k$, and 0 otherwise (Subsection \@ref(PopProportion)). In general, with stratified simple random sampling the inclusion probabilities are not exactly equal, so that the estimated population proportion is not equal to the sample proportion.
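Below is a minimal sketch of this estimator for the stratified sample selected above, using a hypothetical threshold of 100 g kg^-1^ for the SOM concentration (the threshold is chosen for illustration only; `mysample` and the stratum weights `w_h` are assumed to be available).
```{r}
#0/1 indicator: does the SOM concentration exceed the (hypothetical) threshold?
y <- as.numeric(mysample$z > 100)
#estimated stratum proportions: stratum means of the indicator
p_h <- tapply(y, INDEX = mysample$stratum, FUN = mean)
#estimated population proportion: weighted average of the stratum proportions
print(p <- sum(w_h * p_h))
```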
These unequal inclusion probabilities must also be accounted for when estimating the cumulative distribution function (CDF) and quantiles (Subsection \@ref(CDF)), as shown in the next code chunk for the CDF.
```{r}
thresholds <- sort(unique(mysample$z))
cumfreq <- numeric(length = length(thresholds))
for (i in seq_len(length(thresholds))) {
ind <- mysample$z <= thresholds[i]
mh_ind <- tapply(ind, INDEX = mysample$stratum, FUN = mean)
cumfreq[i] <- sum(w_h * mh_ind)
}
df <- data.frame(x = thresholds, y = cumfreq)
```
Figure \@ref(fig:EstimatedCDFVoorstSTSI) shows the estimated CDF, estimated from the stratified simple random sample of 40 units from Voorst (Figure \@ref(fig:SampleSTSI)).
(ref:EstimatedCDFVoorstSTSIlabel) Estimated cumulative distribution function of the SOM concentration (g kg^-1^) in Voorst, estimated from the stratified simple random sample of 40 units.
```{r EstimatedCDFVoorstSTSI, echo = FALSE, fig.asp = 0.7, fig.width = 5, fig.cap = "(ref:EstimatedCDFVoorstSTSIlabel)"}
dx <- mean(diff(df$x))
dat <- data.frame(
x = c(df$x[1] - dx, df$x),
xend = c(df$x, df$x[nrow(df)] + dx),
yend = c(df$y, 1)
)
ggplot(dat) +
geom_segment(mapping = aes(x = x, y = yend, xend = xend, yend = yend)) +
scale_x_continuous(name = "SOM") +
scale_y_continuous(name = "F")
```
The estimated proportions, or cumulative frequencies, are used to estimate a quantile. These estimates are easily obtained with function `svyquantile` of package **survey**.
```{r}
svyquantile(~ z, design_stsi, quantile = c(0.5, 0.8))
```
### Why should we stratify? {#WhyStratify}
There can be two reasons for stratifying the population:
1. we are interested in the mean or total per stratum; or
2. we want to increase the precision of the estimated mean or total for the entire population.
Figure \@ref(fig:SamplingDistributionSTSI) shows boxplots of the approximated sampling distributions of the $\pi$ estimator of the mean SOM concentration for stratified simple random sampling and simple random sampling, both of size 40, obtained by repeating the random sampling with each design and estimation 10,000 times.
```{r, eval = FALSE, echo = FALSE}
number_of_samples <- 10000
mz_STSI <- mz_SI <- numeric(length = number_of_samples)
for (i in 1:number_of_samples) {
units <- sampling::strata(grdVoorst, stratanames = "stratum", size = n_h[unique(grdVoorst$stratum)], method = "srswr")
mysample <- getdata(grdVoorst, units)
mz_h <- tapply(mysample$z, INDEX = mysample$stratum, FUN = mean)
mz_STSI[i] <- sum(w_h * mz_h)
units <- sample(nrow(grdVoorst), size = n, replace = TRUE)
mz_SI[i] <- mean(grdVoorst$z[units])
}
save(mz_STSI, mz_SI, file = "results/STSI_Voorst.rda")
```
(ref:SamplingDistributionSTSIlabel) Approximated sampling distribution of the $\pi$ estimator of the mean SOM concentration (g kg^-1^) in Voorst for stratified simple random sampling (STSI) and simple random sampling (SI) of size 40.
```{r SamplingDistributionSTSI, echo = FALSE, fig.asp = .8, fig.width = 5, fig.cap = "(ref:SamplingDistributionSTSIlabel)"}
load(file = "results/STSI_Voorst.rda")
estimates <- data.frame(mz_STSI, mz_SI)
names(estimates) <- c("STSI", "SI")
df <- estimates %>% pivot_longer(cols = c("STSI", "SI"))
df$name <- factor(df$name, levels = c("STSI", "SI"), ordered = TRUE)
ggplot(data = df) +
geom_boxplot(mapping = aes(y = value, x = name)) +
geom_hline(yintercept = mean(grdVoorst$z), colour = "red") +
scale_x_discrete(name = "Sampling design") +
scale_y_continuous(name = "Estimated mean SOM")
```
```{r, echo = FALSE}
S2z_h_pop <- tapply(grdVoorst$z, INDEX = grdVoorst$stratum, FUN = var)
v_mz_h <- S2z_h_pop / n_h
v_mz_STSI <- sum(w_h^2 * v_mz_h)
v_mz_SI <- var(grdVoorst$z) / n
mz_h_pop <- tapply(grdVoorst$z, INDEX = grdVoorst$stratum, FUN = mean)
```
The approximated sampling distributions of the estimators of the population mean with the two designs are not very different. With stratified random sampling, the spread of the estimated means is somewhat smaller. The horizontal red line is the population mean (`r formatC(mean(grdVoorst$z), 1, format = "f")` g kg^-1^). The gain in precision due to the stratification, referred to as the stratification effect\index{Stratification effect}, can be quantified by the ratio of the variance with simple random sampling and the variance with stratified simple random sampling. So, when this variance ratio is larger than 1, stratified simple random sampling is more precise than simple random sampling. For Voorst the stratification effect with proportional allocation (Section \@ref(STSIallocation)) equals `r formatC(v_mz_SI/v_mz_STSI, 3, format = "f")`. This means that with simple random sampling we need `r formatC(v_mz_SI/v_mz_STSI, 3, format = "f")` times as many sampling units as with stratified simple random sampling to estimate the population mean with the same precision.
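To make explicit how this stratification effect is computed in the sampling experiment, here is a minimal sketch, assuming the population data in `grdVoorst`, the stratum weights `w_h`, the stratum sample sizes `n_h`, and the total sample size `n` are available. The true sampling variances of both designs are computed from the population values of $z$.
```{r}
#true within-stratum variances, computed from all population values
S2z_h_pop <- tapply(grdVoorst$z, INDEX = grdVoorst$stratum, FUN = var)
#true sampling variance of the pi estimator with stratified simple random sampling
v_STSI <- sum(w_h^2 * S2z_h_pop / n_h)
#true sampling variance of the pi estimator with simple random sampling
v_SI <- var(grdVoorst$z) / n
#stratification effect
v_SI / v_STSI
```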
The stratification effect can be computed from the population variance $S^2(z)$ (Equation \@ref(eq:PopulationVariance)) and the variances within the strata $S^2_h(z)$. In the sampling experiment, these variances are known without error because we know the $z$-values for all units in the population. In practice, we only know the $z$-values for the sampled units. However, a design-unbiased estimator of the population variance is [@gru06]
\begin{equation}
\widehat{S^{2}}(z)= \widehat{\overline{z^{2}}}-\left(\hat{\bar{z}}\right)^{2}+
\widehat{V}\!\left(\hat{\bar{z}}\right) \;,
(\#eq:EstimatorPopulationVariancefromSTSI)
\end{equation}
where $\widehat{\overline{z^{2}}}$ denotes the estimated population mean of the study variable squared ($z^2$), obtained in the same way as $\hat{\bar{z}}$ (Equation \@ref(eq:HTMeanSTSI)), but using squared values, and $\widehat{V}\!\left(\hat{\bar{z}}\right)$ denotes the estimated variance of the estimator of the population mean (Equation \@ref(eq:EstVarMeanSTSI)).
The estimated population variance is then divided by the sum of the stratum sample sizes to get an estimate of the sampling variance of the estimator of the mean with simple random sampling of an equal number of units:
\begin{equation}
\widehat{V}(\hat{\bar{z}}_{\text{SI}}) = \frac{\widehat{S^2}(z)}{\sum_{h=1}^{H}n_h}\;.
(\#eq:stratificationeffect)
\end{equation}
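A minimal sketch of this design-unbiased estimator and of Equation \@ref(eq:stratificationeffect), assuming the stratified sample `mysample`, the stratum weights `w_h`, and the stratum sample sizes `n_h` are available:
```{r}
#estimated stratum means of z and of z squared
mz_h <- tapply(mysample$z, INDEX = mysample$stratum, FUN = mean)
mz2_h <- tapply(mysample$z^2, INDEX = mysample$stratum, FUN = mean)
#estimated sampling variance of the estimator of the population mean
S2z_h <- tapply(mysample$z, INDEX = mysample$stratum, FUN = var)
v_mz <- sum(w_h^2 * S2z_h / n_h)
#design-unbiased estimate of the population variance
S2z_est <- sum(w_h * mz2_h) - sum(w_h * mz_h)^2 + v_mz
#estimated sampling variance for simple random sampling of the same total size
S2z_est / sum(n_h)
```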
```{r, echo = FALSE}
ord <- unique(grdVoorst$stratum)
set.seed(314)
units <- sampling::strata(grdVoorst, stratanames = "stratum", size = n_h[ord], method = "srswr")
mysample <- getdata(grdVoorst, units)
labels <- sort(unique(mysample$stratum))
lut <- data.frame(stratum = labels, weight = N_h / n_h)
mysample <- merge(x = mysample, y = lut)
```
The population variance can be estimated with function `s2` of package **surveyplanning** [@surveyplanning]. However, this function is an implementation of an alternative, consistent estimator\index{Consistent estimator} of the population variance [@sar92]:
\begin{equation}
\widehat{S^2}(z) = \frac{N-1}{N} \frac{n}{n-1} \frac{1}{N-1} \sum_{k \in \mathcal{S}} \frac{(z_k - \hat{\bar{z}}_{\pi})^2}{\pi_k} \;.
(\#eq:EstimatorPopulationVariance4AnyDesign)
\end{equation}
```{block2, type = 'rmdnote'}
An estimator is consistent if it converges in probability to the true value of the parameter as the sample size tends to infinity [@sar92].
```
```{r}
library(surveyplanning)
S2z <- s2(mysample$z, w = mysample$weight)
```
The design effect\index{Design effect} is defined as the variance of an estimator of the population mean with the sampling design under study divided by the variance of the $\pi$ estimator of the mean with simple random sampling of an equal number of units (Section \@ref(DesignEffect)). So, the design effect of stratified random sampling is the reciprocal of the stratification effect. For the stratified simple random sample of Figure \@ref(fig:SampleSTSI), the design effect can then be estimated as follows. Function `SE` extracts the estimated standard error of the estimator of the mean from the output of function `svymean`. The extracted standard error is then squared to obtain an estimate of the sampling variance of the estimator of the population mean with stratified simple random sampling. Finally, this variance is divided by the variance with simple random sampling of an equal number of units.
```{r}
v_mz_SI <- S2z / n
res <- svymean(~ z, design_stsi)
SE(res)^2 / v_mz_SI
```
The same value is obtained with argument `deff` of function `svymean`.
```{r designeffectSTSI}
design_stsi <- svydesign(
id = ~ 1, strata = ~ stratum, weight = ~ weight, data = mysample)
svymean(~ z, design_stsi, deff = "replace")
```
So, when using package **survey**, estimation of the population variance is not needed to estimate the design effect. I added it here only to make clear how the design effect is computed by the functions in package **survey**. In the following chapters, I will skip the estimation of the population variance.
The design effect as estimated from the stratified sample is smaller than 1, showing that stratified simple random sampling is more efficient than simple random sampling. The reciprocal of the estimated design effect (1.448) is somewhat larger than the stratification effect as computed in the sampling experiment, but it is an estimate of the design effect from one stratified sample only. The estimated population variance varies among stratified samples, and so does the estimated design effect.
Stratified simple random sampling with proportional allocation (Section \@ref(STSIallocation)) is more precise than simple random sampling when the sum of squares of the stratum means is larger than the sum of squares within strata [@loh99]:
\begin{equation}
SSB > SSW\;,
(\#eq:STSImoreprecisewhen)
\end{equation}
with SSB the weighted sum-of-squares between the stratum means:
\begin{equation}
SSB = \sum_{h=1}^H N_h (\bar{z}_h-\bar{z})^2 \;,
(\#eq:SSB)
\end{equation}
and SSW the sum over the strata of the weighted variances within strata (weights equal to $1-N_h/N$):
\begin{equation}
SSW = \sum_{h=1}^H (1-\frac{N_h}{N})S^2_h\;.
(\#eq:SSW)
\end{equation}
In other words, the smaller the differences in the stratum means and the larger the variances within the strata, the smaller the stratification effect will be. Figure \@ref(fig:boxplotsSOMstrata) shows a boxplot of the SOM concentration per stratum (soil-land use combination). The stratum means are equal to `r formatC(mz_h_pop, 1, format = "f")` g kg^-1^. The stratum variances are `r formatC(S2z_h_pop, 1, format = "f")` (g kg^-1^)^2^. The large stratum variances explain the modest gain in precision realised by stratified simple random sampling compared to simple random sampling in this case.
(ref:boxplotsSOMstratalabel) Boxplots of the SOM concentration (g kg^-1^) for the five strata (soil-land use combinations) in Voorst.
```{r boxplotsSOMstrata, out.width = '100%', fig.asp = 0.5, echo = FALSE, fig.cap = "(ref:boxplotsSOMstratalabel)"}
ggplot(data = grdVoorst) +
geom_boxplot(aes(y = z, x = stratum)) +
scale_y_continuous(name = "SOM") +
scale_x_discrete(name = "Stratum")
```
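As a check of the condition $SSB > SSW$ for Voorst, Equations \@ref(eq:SSB) and \@ref(eq:SSW) can be computed directly from the population data; a minimal sketch, assuming `grdVoorst` and the stratum sizes `N_h` are available:
```{r}
#population means and variances of z per stratum
mz_h_pop <- tapply(grdVoorst$z, INDEX = grdVoorst$stratum, FUN = mean)
S2z_h_pop <- tapply(grdVoorst$z, INDEX = grdVoorst$stratum, FUN = var)
#weighted sum-of-squares between the stratum means
SSB <- sum(N_h * (mz_h_pop - mean(grdVoorst$z))^2)
#sum of the weighted within-stratum variances
SSW <- sum((1 - N_h / sum(N_h)) * S2z_h_pop)
SSB > SSW
```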
## Confidence interval estimate {#CISTSI}
The $100(1-\alpha)$\% confidence interval for $\bar{z}$ is given by
\begin{equation}
\hat{\bar{z}} \pm t_{\alpha /2, df}\cdot
\sqrt{\widehat{V}\!\left(\hat{\bar{z}}\right)} \;,
(\#eq:CISTSI)
\end{equation}
where $t_{\alpha /2,df}$ is the $(1-\alpha /2)$ quantile of a *t* distribution with $df$ degrees of freedom. The degrees of freedom $df$ can be approximated by $n-H$, as proposed by @loh99. This is the number of degrees of freedom if the variances within the strata are equal. With unequal variances within strata, $df$ can be approximated by Satterthwaite's method\index{Satterthwaite's method} [@nan04]:
\begin{equation}
df \approx \frac {\left(\sum_{h=1}^H w_h^2
\frac{\widehat{S^2}_h(z)}{n_h}\right)^2} {\sum_{h=1}^H w_h^4
\left(\frac{\widehat{S^2}_h(z)}{n_h}\right)^2 \frac {1}{n_h-1}} \;.
(\#eq:dfSattherwaite)
\end{equation}
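A minimal sketch of Satterthwaite's approximation, assuming the stratified sample `mysample`, the stratum weights `w_h`, and the stratum sample sizes `n_h` are available:
```{r}
#estimated within-stratum variances
S2z_h <- tapply(mysample$z, INDEX = mysample$stratum, FUN = var)
#estimated variances of the estimated stratum means (fpc ignored)
v_mz_h <- S2z_h / n_h
#Satterthwaite's approximation of the degrees of freedom
sum(w_h^2 * v_mz_h)^2 / sum(w_h^4 * v_mz_h^2 / (n_h - 1))
```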
A confidence interval estimate of the population mean can be extracted with method `confint` of package **survey**. It uses $n-H$ degrees of freedom.
```{r}
res <- svymean(~ z, design_stsi)
df_stsi <- degf(design_stsi)
confint(res, df = df_stsi, level = 0.95)
```
## Allocation of sample size to strata {#STSIallocation}
After we have decided on the total sample size $n$, we must decide how to apportion the units to the strata. It is reasonable to allocate more sampling units to large strata and fewer to small strata. The simplest way to achieve this is proportional allocation\index{Allocation!proportional allocation}:
\begin{equation}
n_{h}=n \cdot \frac{N_{h}}{\sum N_{h}}\;,
(\#eq:propallocation)
\end{equation}
with $N_h$ the total number of population units (size) of stratum $h$. With infinite populations $N_h$ is replaced by the area $A_h$. The sample sizes computed with this equation are rounded to the nearest integers.
If we have prior information on the variance of the study variable within the strata, then it makes sense to account for differences in variance. Heterogeneous strata should receive more sampling units than homogeneous strata, leading to Neyman allocation\index{Allocation!Neyman allocation}:
\begin{equation}
n_{h}= n \cdot \frac{N_{h}\,S_{h}(z)}{\sum\limits_{h=1}^{H} N_{h}\,S_{h}(z)} \;,
(\#eq:Neymanallocation)
\end{equation}
with $S_h(z)$ the standard deviation (square root of variance) of the study variable $z$ in stratum $h$.
Finally, the costs of sampling may differ among the strata. It can be relatively expensive to sample nearly inaccessible strata, and we do not want to sample many units there. This leads to optimal allocation\index{Allocation!optimal allocation}:
\begin{equation}
n_{h}= n \cdot \frac{\frac{N_{h}\,S_{h}(z)}{\sqrt{c_{h}}}}{\sum\limits_{h=1}^{H} \frac{N_{h}\,S_{h}(z)}{\sqrt{c_{h}}}} \;,
(\#eq:optallocation)
\end{equation}
with $c_h$ the costs per sampling unit in stratum $h$. Optimal means that given the total costs this allocation type leads to minimum sampling variance, assuming a linear costs model\index{Linear costs model}:
\begin{equation}
C = c_0 + \sum_{h=1}^H n_h c_h \;,
(\#eq:linearcostmodel)
\end{equation}
with $c_0$ overhead costs. So, the more variable a stratum and the lower the costs, the more units will be selected from this stratum.
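As an illustration of Equation \@ref(eq:optallocation), here is a minimal sketch of optimal allocation for Voorst with a total sample size of 40, using *hypothetical* costs per sampling unit `c_h` (made up for illustration only) and within-stratum standard deviations computed from the population data in `grdVoorst`.
```{r}
#within-stratum standard deviations, computed from the population data
Sz_h <- sqrt(tapply(grdVoorst$z, INDEX = grdVoorst$stratum, FUN = var))
#hypothetical costs per sampling unit in the five strata
c_h <- c(BA = 1, EA = 1, PA = 1, RA = 2, XF = 4)
#optimal allocation: sample sizes proportional to N_h * S_h / sqrt(c_h)
nmr <- N_h * Sz_h / sqrt(c_h)
round(40 * nmr / sum(nmr))
```
With equal costs per unit in all strata, optimal allocation reduces to Neyman allocation, which is computed in the next code chunk.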
```{r allocation}
S2z_h <- tapply(X = grdVoorst$z, INDEX = grdVoorst$stratum, FUN = var)
n_h_Neyman <- round(n * N_h * sqrt(S2z_h) / sum(N_h * sqrt(S2z_h)))
```
These sample sizes can also be computed with function `optsize` of package **surveyplanning**.
```{r}
labels <- sort(unique(mysample$stratum))
res <- optsize(labels, n, N_h, S2z_h)
round(res$nh, 0)
```
Table \@ref(tab:tableallocation) shows the proportional and Neyman sample sizes for the five strata of the study area Voorst, for a total sample size of 40. Stratum XF is the second-smallest stratum and therefore receives only seven sampling units with proportional allocation. However, the standard deviation in this stratum is the largest, and as a consequence with Neyman allocation the sample size in this stratum is increased by three units, at the expense of stratum EA, which is relatively homogeneous.
```{r tableallocation, echo = FALSE}
strata <- sort(unique(grdVoorst$stratum))
taballoc <- data.frame(Stratum = strata, Nh = formatC(as.numeric(N_h), 0, format = "f", big.mark = ","), Sh = as.numeric(round(sqrt(S2z_h), 1)), nhprop = as.numeric(n_h), nhNeyman = as.numeric(n_h_Neyman))
knitr::kable(
taballoc, caption = "Proportional and Neyman sample sizes in stratified simple random sampling of Voorst with a total sample size of 40.",
booktabs = TRUE,
linesep = ""
) %>%
kable_classic() %>%
add_footnote("Nh: stratum size; Sh: stratum standard deviation.", notation = "none")
```
```{r sdmean_allocation, echo = FALSE}
n <- 20:50
S2z_h <- tapply(X = grdVoorst$z, INDEX = grdVoorst$stratum, FUN = var)
se_mz_pro <- se_mz_opt <- numeric(length = length(n))
for (i in seq_len(length(n))) {
n_h_pro <- n[i] * w_h
n_h_Neyman <- n[i] * N_h * sqrt(S2z_h) / sum(N_h * sqrt(S2z_h))
se_mz_pro[i] <- sqrt(sum(w_h^2 * S2z_h / n_h_pro))
se_mz_opt[i] <- sqrt(sum(w_h^2 * S2z_h / n_h_Neyman))
}
sdmean_SI <- sqrt(var(grdVoorst$z) / n)
df <- tibble(n = n, SI = sdmean_SI, STSIpro = se_mz_pro, STSIopt = se_mz_opt)
```
Figure \@ref(fig:plotsdmeanallocation) shows the standard error of the $\pi$ estimator of the mean SOM concentration as a function of the total sample size, for simple random sampling and for stratified simple random sampling with proportional and Neyman allocation. A small extra gain in precision can be achieved using Neyman allocation instead of proportional allocation. However, in practice Neyman allocation is often not achievable, because we do not know the standard deviations of the study variable within the strata. If a quantitative covariate $x$ is used for stratification (see Sections \@ref(cumrootf) and \@ref(Ospats)), the standard deviations $S_h(z)$ are approximated by $S_h(x)$, resulting in approximately optimal stratum sample sizes. The gain in precision compared to proportional allocation is then partly or entirely lost.
(ref:plotsdmeanallocationlabel) Standard error of the $\pi$ estimator of the mean SOM concentration (g kg^-1^) as a function of the total sample size, for simple random sampling (SI) and for stratified simple random sampling with proportional (STSI(prop)) and Neyman allocation (STSI(Neyman)) for Voorst.
```{r plotsdmeanallocation, echo = FALSE, out.width = "100%", fig.asp = 0.7, fig.cap = "(ref:plotsdmeanallocationlabel)"}
df_lf <- df %>% pivot_longer(cols = c("SI", "STSIpro", "STSIopt"))
ggplot(df_lf) +
geom_point(mapping = aes(x = n, y = value, shape = name), size = 2.5) +
scale_shape_manual(values = c(1, 0, 2), name = "Design", labels = c("SI", "STSI(Neyman)", "STSI(prop)")) +
scale_x_continuous(name = "Sample size") +
scale_y_continuous(name = "Standard error")
```
Optimal allocation and Neyman allocation assume univariate stratification\index{Univariate stratification}, i.e., the stratified simple random sample is used to estimate the mean of a single study variable. If we have multiple study variables, optimal allocation becomes more complicated. In Bethel allocation\index{Allocation!Bethel allocation}, the total sampling costs, assuming a linear costs model (Equation \@ref(eq:linearcostmodel)), are minimised given a constraint on the precision of the estimated mean for each study variable [@Bethel1989], see Section \@ref(MultivariateStratification). Bethel allocation can be computed with function `bethel` of package **SamplingStrata** [@Barcaroli2020].
#### Exercises {-}
1. Use function `strata` of package **sampling** to select a stratified simple random sample with replacement of size 40 from Voorst, using proportional allocation. Check that the sum of the stratum sample sizes is 40.
+ Estimate the population mean and the standard error of the estimator.
+ Compute the true standard error of the estimator. Hint: compute the population variances of the study variable $z$ per stratum, and divide these by the stratum sample sizes.
+ Compute a 95% confidence interval estimate of the population mean, using function `confint` of package **survey**.
2. Looking at Figure \@ref(fig:boxplotsSOMstrata), which strata do you expect can be merged without losing much precision of the estimated population mean?
3. Use function `fct_collapse` of package **forcats** [@forcats] to merge the strata EA and PA.
+ Compute the true sampling variance of the estimator of the mean for this new stratification, for a total sample size of 40 and proportional allocation.
+ Compare this true sampling variance with the true sampling variance using the original five strata (same sample size, proportional allocation). What is your conclusion about the new stratification?
4. Prove that the sum of the inclusion probabilities over all population units with stratified simple random sampling equals the sample size $n$.
## *Cum-root-f* stratification {#cumrootf}
When we have a quantitative covariate $x$ related to the study variable $z$ and $x$ is known for all units in the population, strata can be constructed with the *cum-root-f* method using this covariate as a stratification variable, see @Dalenius1959 and @coc77. Population units with similar values for the covariate (stratification variable) are grouped into a stratum. Strata are computed as follows:
1. Compute a frequency histogram of the stratification variable using a large number of bins.
2. Compute the square root of the frequencies.
3. Cumulate the square root of the frequencies, i.e., compute $\sqrt{f_1}$, $\sqrt{f_1} + \sqrt{f_2}$, $\sqrt{f_1} + \sqrt{f_2} + \sqrt{f_3}$, etc.
4. Divide the cumulative sum of the last bin by the number of strata, multiply this value by $1,2, \dots, H-1$, with *H* the number of strata, and select the boundaries of the histogram bins closest to these values.
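A minimal sketch of these four steps with base **R** functions, using a hypothetical stratification variable `x` (simulated here for illustration), about 100 histogram bins, and $H = 5$ strata:
```{r}
set.seed(1)
#hypothetical stratification variable
x <- rlnorm(10000, meanlog = 5, sdlog = 0.5)
#step 1: frequency histogram with a large number of bins
h <- hist(x, breaks = 100, plot = FALSE)
#steps 2 and 3: cumulate the square roots of the bin frequencies
csqrtf <- cumsum(sqrt(h$counts))
#step 4: bin boundaries closest to multiples of the total divided by H
H <- 5
targets <- max(csqrtf) / H * (1:(H - 1))
h$breaks[-1][sapply(targets, function(t) which.min(abs(csqrtf - t)))]
```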
In *cum-root-f* stratification\index{\emph{Cum-root-f} stratification}, it is assumed that the covariate values are nearly perfect predictions of the study variable, so that the prediction errors do not affect the stratification. Under this assumption the stratification is optimal with Neyman allocation of sampling units to the strata (Section \@ref(STSIallocation)).
*Cum-root-f* stratification is illustrated with the data of Xuancheng in China. We wish to estimate the mean organic matter concentration in the topsoil (SOM, g kg^-1^) of this area. Various covariates are available that are correlated with SOM, such as elevation, yearly average temperature, slope, and various other terrain attributes. Elevation, named `dem` in the tibble, is used as a single stratification variable, see Figure \@ref(fig:DEMXuancheng). The correlation coefficient of SOM and elevation in a sample of 183 observations is 0.59. The positive correlation can be explained as follows: temperature decreases with elevation, leading to a lower decomposition rate of organic matter in the soil.
(ref:DEMXuanchenglabel) Elevation used as a stratification variable in *cum-root-f* stratification of Xuancheng.
```{r DEMXuancheng, echo = FALSE, out.width = "100%", fig.cap = "(ref:DEMXuanchenglabel)"}
ggplot(data = grdXuancheng) +
geom_raster(mapping = aes(x = s1 / 1000, y = s2 / 1000, fill = dem)) +
scale_fill_viridis_c(name = "Elevation") +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed()
```
The strata can be constructed with function `strata.cumrootf` of package **stratification** [@Baillargeon2011]. Argument `n` of this function is the total sample size, but this value has no effect on the stratification. Argument `Ls` is the number of strata. I arbitrarily chose to construct five strata. Argument `nclass` is the number of bins of the histogram. The output object of function `strata.cumrootf` is a list containing, among other things, a numeric vector with the stratum bounds (`bh`) and a factor with the stratum levels of the grid cells (`stratumID`). Finally, note that the values of the stratification variable must be positive. The minimum elevation is -5 m, so I added the absolute value of this minimum to the elevations.
```{r, echo = FALSE, eval = FALSE}
library(stratification)
dem_new <- grdXuancheng$dem + abs(min(grdXuancheng$dem))
crfstrata <- strata.cumrootf(x = dem_new, n = 100, Ls = 5, nclass = 500)
bh <- crfstrata$bh
grdXuancheng$crfstrata <- crfstrata$stratumID
```
```{r}
library(stratification)
grdXuancheng <- grdXuancheng %>%
mutate(dem_new = dem + abs(min(dem)))
crfstrata <- strata.cumrootf(
x = grdXuancheng$dem_new, n = 100, Ls = 5, nclass = 500)
bh <- crfstrata$bh
grdXuancheng$crfstrata <- crfstrata$stratumID
```
Stratum bounds are threshold values of the stratification variable elevation; these stratum bounds\index{Stratum bound} are equal to `r formatC(bh, 1, format = "f")`. Note that the number of stratum bounds is one less than the number of strata. The resulting stratification is shown in Figure \@ref(fig:optstrata). Note that most strata are not single polygons, but are made up of many smaller polygons. This may be even more so if the stratification variable shows a noisy spatial pattern. This is not a problem at all, as a stratum is just a collection of population units (raster cells) and need not be spatially contiguous.
(ref:optstratalabel) Stratification of Xuancheng obtained with the *cum-root-f* method, using elevation as a stratification variable.
```{r optstrata, echo = FALSE, out.width = "100%", fig.cap = "(ref:optstratalabel)"}
labels <- as.character(round(bh, 0))
first <- paste("<", labels[1])
second <- paste(labels[1], "-", labels[2])
third <- paste(labels[2], "-", labels[3])
fourth <- paste(labels[3], "-", labels[4])
fifth <- paste(">", labels[4])
labs <- c(first, second, third, fourth, fifth)
ggplot(data = grdXuancheng) +
geom_raster(mapping = aes(x = s1 / 1000, y = s2 / 1000, fill = factor(crfstrata))) +
scale_fill_viridis_d(name = "Elevation", labels = labs) +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed()
```
#### Exercises {-}
5. Write an **R** script to compute five *cum-root-f* strata for Eastern Amazonia (`grdAmazonia` in package **sswr**) to estimate the population mean of aboveground biomass (AGB), using log-transformed short-wave infrared radiation (SWIR2) as a stratification variable.
+ Compute the strata with function `strata.cumrootf` of package **stratification**.
+ Select a stratified simple random sample of 100 units. First, compute the stratum sample sizes for proportional allocation.
+ Estimate the population mean of AGB and its sampling variance.
+ Compute the true sampling variance of the estimator of the mean for this sampling design (see Exercise 1 for a hint).
+ Compute the stratification effect (gain in precision). Hint: compute the sampling variance for simple random sampling by computing the population variance of AGB, and divide this by the total sample size.
## Stratification with multiple covariates {#kmeansstratification}
If we have multiple variables that are possibly related to the study variable, we may want to use them all or a subset of them as stratification variables. Using the quantitative variables one-by-one in *cum-root-f* stratification, followed by overlaying the maps with univariate strata, may lead to numerous cross-classification strata.
A simple solution is to construct homogeneous groups, referred to as clusters, of population units (raster cells). The units within a cluster are more similar to each other than to the units in other clusters. Various clustering techniques are available. Here, I use hard k-means.
This is illustrated again with the Xuancheng case study. Five quantitative covariates are used for constructing the strata. Besides elevation, which was used as a single stratification variable in the previous section, now also temperature, slope, topographic wetness index (twi), and profile curvature are used to construct clusters that are used as strata in stratified simple random sampling. To speed up the computations, a subgrid with a spacing of 0.4 km is selected, using function `spsample` of package **sp**, see Chapter \@ref(SY) [@Bivand2013].
```{r}
library(sp)
gridded(grdXuancheng) <- ~ s1 + s2
subgrd <- spsample(
grdXuancheng, type = "regular", cellsize = 400, offset = c(0.5, 0.5))
subgrd <- data.frame(coordinates(subgrd), over(subgrd, grdXuancheng))
```
```{r, echo = FALSE}
grdXuancheng <- as.data.frame(grdXuancheng)
```
Five clusters are computed with k-means, using as clustering variables the five covariates mentioned above. The scales of these covariates differ strongly, and for this reason they must be scaled before being used in clustering. The k-means algorithm is deterministic, i.e., the same initial clustering will end in the same final, optimised clustering. This final clustering can be suboptimal, and therefore it is recommended to repeat the clustering as many times as feasible, with different initial clusterings. Argument `nstart` is the number of initial clusterings. The best clustering, i.e., the one with the smallest within-cluster sum-of-squares, is kept.
```{r}
x <- c("dem", "temperature", "slope", "profile.curvature", "twi")
set.seed(314)
myClusters <- kmeans(
scale(subgrd[, x]), centers = 5, iter.max = 1000, nstart = 100)
subgrd$cluster <- myClusters$cluster
```
Figure \@ref(fig:kmeansstrataXuancheng) shows the five clusters obtained by k-means clustering\index{k-means clustering} of the raster cells. These clusters can be used as strata in random sampling.
(ref:kmeansstrataXuanchenglabel) Five clusters obtained by k-means clustering of the raster cells of Xuancheng, using five scaled covariates in clustering.
```{r kmeansstrataXuancheng, echo = FALSE, out.width = "100%", fig.cap = "(ref:kmeansstrataXuanchenglabel)"}
ggplot(subgrd) +
geom_raster(mapping = aes(x = x1 / 1000, y = x2 / 1000, fill = as.character(cluster))) +
scale_fill_viridis_d(name = "Stratum") +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed()
```
The sizes of the clusters used as strata differ considerably (Table \@ref(tab:tablekmeansstrata)). This table also shows the means of the unscaled covariates used in clustering.
```{r tablekmeansstrata, echo = FALSE}
N_h <- tapply(subgrd$dem, INDEX = list(subgrd$cluster), FUN = length)
mean.dem <- tapply(subgrd$dem, INDEX = list(subgrd$cluster), FUN = mean)
mean.temp <- tapply(subgrd$temperature, INDEX = list(subgrd$cluster), FUN = mean)
mean.slope <- tapply(subgrd$slope, INDEX = list(subgrd$cluster), FUN = mean)
mean.pc <- tapply(subgrd$profile.curvature, INDEX = list(subgrd$cluster), FUN = mean)
mean.twi <- tapply(subgrd$twi, INDEX = list(subgrd$cluster), FUN = mean)
df <- data.frame(Stratum = seq(1:5), Nh = formatC(N_h, 0, format = "f", big.mark = ","), Elevation = round(mean.dem, 0), Temperature = round(mean.temp, 2), Slope = round(mean.slope, 2), Profilecurv = round(mean.pc, 5), Twi = round(mean.twi, 2))
knitr::kable(
df, caption = "Size (Nh) and means of clustering variables of the five strata of Xuancheng obtained with k-means clustering of raster cells.",
booktabs = TRUE,
linesep = ""
) %>%
kable_classic()
```
Categorical variables can be accommodated in clustering using the technique proposed by @Huang1998, implemented in package **clustMixType** [@clustMixType].
In the situation that we already have some data of the study variable, an alternative solution is to calibrate a model for the study variable, for instance a multiple linear regression model, using the covariates as predictors, and to use the predictions of the study variable as a single stratification variable in *cum-root-f* stratification or in optimal spatial stratification, see Section \@ref(Ospats).
## Geographical stratification {#geostrata}
When no covariate is available, we may still decide to apply a *geographical stratification*\index{Geographical stratification}. For instance, a square study area can be divided into 4 $\times$ 4 equal-sized subsquares that are used as strata. When we select one or two points per subsquare, we avoid strong spatial clustering of the sampling points. Geographical stratification improves the *spatial coverage*\index{Spatial coverage}. When the study variable is spatially structured, think for instance of a spatial trend, then geographical stratification will lead to more precisely estimated means (smaller sampling variances).
A simple method for constructing geographical strata is k-means clustering [@bru99]. See Section \@ref(SpatialCoverage) for a simple illustrative example of how geographical strata are computed with k-means clustering. In this approach, the study area is discretised by a large number of grid cells. These grid cells are the objects that are clustered. The clustering variables are simply the spatial coordinates of the centres of the grid cells. This method leads to compact geographical strata\index{Compact geographical stratum}, briefly referred to as geostrata\index{Geostratum|see{Compact geographical stratum}}. Geostrata can be computed with function `kmeans`, as shown in Section \@ref(kmeansstratification). The two clustering variables have the same scale, so they should not be scaled because this would lead to an arbitrary distortion of geographical distances. The geostrata generally will not have the same number of grid cells. Geostrata of equal size can be attractive, as then the sample becomes selfweighting, i.e., the sample mean is an unbiased estimator of the population mean.
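A minimal sketch of this approach, assuming the data frame `subgrd` with the grid-cell coordinates `x1` and `x2` constructed in Section \@ref(kmeansstratification) and 50 geostrata (note that this does not enforce strata of equal size):
```{r, eval = FALSE}
set.seed(314)
#cluster the (unscaled) grid-cell coordinates into 50 compact geostrata
mygeoclusters <- kmeans(
  subgrd[, c("x1", "x2")], centers = 50, iter.max = 1000, nstart = 10)
subgrd$geostratum <- mygeoclusters$cluster
```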
Geostrata of the same size can be computed with function `stratify` of the package **spcosa** (@spcosa, @walvoort2010), with argument `equalArea = TRUE`.
If the total number of grid cells divided by the number of strata is an integer, the stratum sizes are exactly equal; otherwise, the difference is one grid cell. @walvoort2010 describe the k-means algorithms implemented in this package in detail. Argument `object` of function `stratify` specifies a spatial object of the population units. In the **R** code below the subgrid of `grdXuancheng` generated in Section \@ref(kmeansstratification) is converted to a `SpatialPixelsDataFrame` with function `gridded` of the package **sp**. The spatial object can also be of class `SpatialPolygons`. In that case, either argument `nGridCells` or argument `cellSize` must be set, so that the vector map in `object` can be discretised by a finite number of grid cells. Argument `nTry` specifies the number of initial stratifications in k-means clustering, and therefore is comparable with argument `nstart` of function `kmeans`. For more details on spatial stratification using k-means clustering, see Section \@ref(SpatialCoverage). The k-means algorithm used with `equalArea = TRUE` takes much more computing time than the one used with `equalArea = FALSE`.
```{r spcosa, eval = FALSE}
library(spcosa)
library(sp)
set.seed(314)
gridded(subgrd) <- ~ x1 + x2
mygeostrata <- stratify(
object = subgrd, nStrata = 50, nTry = 1, equalArea = TRUE)
```
```{r, eval = FALSE, echo = FALSE}
write_rds(mygeostrata, file = "results/geostrata_Xuancheng.rds")
```
```{r, echo = FALSE}
subgrd <- as(subgrd, "data.frame")
mygeostrata <- read_rds(file = "results/geostrata_Xuancheng.rds")
```
Function `spsample` of package **spcosa** is used to select from each geostratum a simple random sample of two points.
```{r}
set.seed(314)
mysample <- spcosa::spsample(mygeostrata, n = 2)
```
Figure \@ref(fig:GeoStrata) shows fifty compact geostrata of equal size for Xuancheng with the selected sampling points. Note that the sampling points are reasonably well spread throughout the study area^[The compact geostrata and the sample are plotted with package **ggplot2**. A simple alternative is to use method `plot` of **spcosa**: `plot(mygeostrata, mysample)`.].
```{r GeoStrata, echo = FALSE, out.width = "100%", fig.cap = "Compact geostrata of equal size for Xuancheng and stratified simple random sample of two points per stratum."}
ggplot(as(mygeostrata, "data.frame")) +
geom_raster(mapping = aes(x = x1 / 1000, y = x2 / 1000, fill = factor(stratumId))) +
scale_fill_viridis_d(name = "geostratum") +
geom_point(data = as(mysample, "data.frame"), mapping = aes(x = x1 / 1000, y = x2 / 1000), size = 2, colour = "orange") +
coord_fixed() +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
theme(legend.position = "none")
```
Once the observations have been made, the population mean can be estimated with function `estimate`. For Xuancheng I simulated data from a normal distribution, just to illustrate estimation with function `estimate`. Various statistics can be estimated, among which the population mean (spatial mean), the standard error, and the CDF. The CDF is estimated by transforming the data into indicators (Subsection \@ref(CDF)).
```{r}
library(spcosa)
mysample <- spcosa::spsample(mygeostrata, n = 2)
mydata <- data.frame(z = rnorm(100, mean = 10, sd = 2))
mean <- estimate("spatial mean", mygeostrata, mysample, data = mydata)
se <- estimate("standard error", mygeostrata, mysample, data = mydata)
cdf <- estimate("scdf", mygeostrata, mysample, data = mydata)
```
The estimated population mean equals `r formatC(mean, 1, format = "f")` with an estimated standard error of `r formatC(se, 1, format = "f")`.
#### Exercises {-}
6. Why is it attractive to select at least two points per geostratum?
7. The alternative to 50 geostrata and two points per geostratum is 100 geostrata and one point per geostratum. Which sampling strategy will be more precise?
8. The geostrata in Figure \@ref(fig:GeoStrata) have equal size (area), which can be enforced by argument `equalArea = TRUE`. Why are equal sizes attractive? Work out the estimator of the population mean for strata of equal size.
9. Write an **R** script to construct 20 compact geographical strata of equal size for agricultural field Leest. The geopackage file of this field can be read with **sf** function `read_sf(system.file("extdata/leest.gpkg", package = "sswr"))`. Remove the projection attributes with `st_set_crs(NA_crs_)`, and convert the simple feature object to a spatial object with method `as_Spatial`. Select two points per geostratum, using function `spsample` of package **spcosa**. Repeat this with 40 strata of equal size, and randomly select one point per stratum.
+ If only one point per stratum is selected, the sampling variance can be approximated by the collapsed strata\index{Collapsed strata} estimator. In this method, pairs of strata are formed, and the two strata of a pair are joined. In each new stratum we now have two points. With an odd number of strata there will be one group of three strata and three points. The sample is then analysed as if it were a random sample from the new collapsed strata. Suppose we group the strata on the basis of the measurements of the study variable. Do you think this is a proper way of grouping?
+ In case you think this is not a proper way of grouping the strata, how would you group the strata?
+ Is the sampling variance estimator unbiased? If not, is the sampling variance overestimated or underestimated?
10. Laboratory costs for measuring the study variable can be saved by bulking the soil aliquots\index{Soil aliquot} (composite sampling\index{Composite sampling}). There are two options: bulking all soil aliquots from the same stratum (bulking within strata) or bulking by selecting one aliquot from each stratum (bulking across strata). In **spcosa** bulking across strata is implemented. Write an **R** script to construct 20 compact geographical strata for study area Voorst. Use argument `equalArea = TRUE`. Select four points per stratum using argument `type = "composite"`, and convert the resulting object to `SpatialPoints`. Extract the $z$-values in `grdVoorst` at the selected sampling points using function `over`. Add a variable to the resulting data frame indicating the composite (points 1 to 4 are from the first stratum, points 5 to 8 from the second stratum, etc.), and estimate the means for the four composites using function `tapply`. Finally, estimate the population mean and its standard error.
+ Can the sampling variance of the estimator of the mean be estimated for bulking within the strata?
+ The alternative to analysing the concentration of four composite samples obtained by bulking across strata is to analyse all 20 $\times$ 4 aliquots separately. The strata have equal size, so the inclusion probabilities are equal. As a consequence, the sample mean is an unbiased estimator of the population mean. Is the precision of this estimated population mean equal to that of the estimated population mean with composite sampling? If not, is it smaller or larger, and why?
+ If you use argument `equalArea = FALSE` in combination with argument `type = "composite"`, you get an error message. Why does this combination of arguments not work?
## Multiway stratification
In Section \@ref(kmeansstratification) multiple continuous covariates are used to construct clusters of raster cells using k-means. These clusters are then used as strata. This section considers the case where we have multiple categorical and/or continuous variables that we would like to use as stratification variables. Each continuous stratification variable is first used to compute strata, e.g., with the *cum-root-f* method. One option is then to determine the cross-classification of each unit and to use these cross-classifications as strata in random sampling. However, this may lead to numerous strata, maybe even more than the intended sample size. To reduce the total number of strata, we may aggregate cross-classification strata\index{Cross-classification stratum} with similar means of the study variable, based on our prior knowledge.
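As a minimal sketch of how cross-classification strata can be computed, here with two hypothetical categorical stratification variables `soil` and `geology` (made up for illustration):
```{r}
#hypothetical categorical stratification variables for the population units
soil <- factor(c("sand", "clay", "clay", "peat", "sand"))
geology <- factor(c("eolian", "fluvial", "fluvial", "fluvial", "eolian"))
#cross-classification: every observed combination becomes a stratum
crossstrata <- interaction(soil, geology, drop = TRUE)
table(crossstrata)
```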
An alternative to aggregation of cross-classification strata is to use the separate strata, i.e., the strata based on an individual stratification variable, as *marginal* strata in random sampling. How this works is explained in Subsection \@ref(Multiwaystratification).
## Multivariate stratification {#MultivariateStratification}
Another situation is where we have multiple study variables and would like to optimise the stratification and allocation for estimating the population means of all study variables. Optimal stratification for multiple study variables is only relevant if we would like to use different stratification variables for the study variables\index{Multivariate stratification}. In many cases, we do not have reliable prior information about the different study variables justifying the use of multiple stratification variables. We are already happy to have one stratification variable that may serve to increase the precision of the estimated means of all study variables.
However, if we do have multiple stratification variables tailored to the different study variables, the objective is to partition the population into strata such that, for a given allocation, the total sampling costs, assuming a linear costs model (Equation \@ref(eq:linearcostmodel)), are minimised subject to a constraint on the precision of the estimated mean of each study variable.
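In generic notation (not necessarily that of Equation \@ref(eq:linearcostmodel), with $H$ strata, fixed costs $c_0$, costs $c_h$ per sampling unit in stratum $h$, stratum sample sizes $n_h$, and $m$ study variables), this optimisation problem can be summarised as

$$\text{Minimise} \;\; c_0 + \sum_{h=1}^{H} c_h\,n_h \quad \text{subject to} \quad \frac{\sqrt{V\!\left(\hat{\bar{z}}_j\right)}}{\bar{z}_j} \leq cv_j\;, \quad j = 1, \dots, m \;,$$

with $cv_j$ the maximum acceptable coefficient of variation of the estimated mean of study variable $j$.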
Package **SamplingStrata** [@Barcaroli2020] can be used to optimise multivariate strata. @Barcaroli2014 gives details about the objective function and the algorithm used for optimising the strata. Sampling units are allocated to the strata by Bethel allocation\index{Allocation!Bethel allocation} [@Bethel1989]. The required precision is specified in terms of a coefficient of variation\index{Coefficient of variation}, one per study variable.
Multivariate stratification is illustrated with the Meuse data set of package **gstat** [@peb04]. The prior data of heavy metal concentrations of Cd and Zn are used in spatial prediction to create maps of these two study variables.
The maps of the natural logs of the two metal concentrations are created by kriging with an external drift, using the square root of the distance to the Meuse river as a predictor for the mean; see Section \@ref(IntroKED) for how this spatial prediction method works.
```{r, echo = FALSE}
library(sp)
library(gstat)
data("meuse")
coordinates(meuse) <- ~x + y
# sample variogram and fitted model of the residuals of log Cd,
# with the square root of the distance to the river as external drift
lcdr_vgm <- variogram(log(cadmium) ~ sqrt(dist), meuse)
lcdr_fit <- fit.variogram(lcdr_vgm, model = vgm(1, "Exp", 300, 1))
```
```{r, echo = FALSE}
# same for the residuals of log Zn
lznr_vgm <- variogram(log(zinc) ~ sqrt(dist), meuse)
lznr_fit <- fit.variogram(lznr_vgm, model = vgm(1, "Exp", 300, 1))
```
Figure \@ref(fig:PredictedCd) shows maps of the predicted log Cd and log Zn concentrations.
```{r, echo = FALSE, results = 'hide'}
data(meuse.grid)
gridded(meuse.grid) <- ~ x + y
#kriging of Cd
lcd_kriged <- krige(
formula = log(cadmium) ~ sqrt(dist),
locations = meuse,
newdata = meuse.grid,
model = lcdr_fit,
debug.level = 0) %>% as("data.frame")
#kriging of Zn
lzn_kriged <- krige(
log(zinc) ~ sqrt(dist),
meuse, meuse.grid,
model = lznr_fit,
debug.level = 0) %>% as("data.frame")
```
```{r PredictedCd, echo = FALSE, out.width = "100%", fig.cap = "Kriging predictions of natural logs of Cd and Zn concentrations in the study area Meuse, used as stratification variables in bivariate stratification."}
df <- data.frame(x = lcd_kriged$x, y = lcd_kriged$y, lnCd = lcd_kriged$var1.pred, lnZn = lzn_kriged$var1.pred)
plt1 <- ggplot(data = df) +
geom_raster(mapping = aes(x = x / 1000, y = y / 1000, fill = lnCd)) +
scale_fill_viridis_c(name = "lnCd") +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed()
plt2 <- ggplot(data = df) +
geom_raster(mapping = aes(x = x / 1000, y = y / 1000, fill = lnZn)) +
scale_fill_viridis_c(name = "lnZn") +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed()
grid.arrange(plt1, plt2, nrow = 1)
```
The predicted log concentrations of the two heavy metals are used as stratification variables in designing a new sample for design-based estimation of the population means of Cd and Zn. For log Cd, some predicted concentrations are negative (Figure \@ref(fig:PredictedCd)). This leads to an error when running function `optimStrata`. The minimum predicted log Cd concentration is -1.7, so I added 2 to the predictions. A variable indicating the domains of interest is added to the data frame. The value of this variable is 1 for all grid cells, so that a sample is designed for estimating the mean of the entire population. As a first step, function `buildFrameDF` is used to create a data frame that can be handled by function `optimStrata`. Argument `X` specifies the stratification variables, and argument `Y` the study variables. In our case, the stratification variables and the study variables are the same. This is typical for the situation where the stratification variables are obtained by mapping the study variables.
```{r}
library(SamplingStrata)
df <- data.frame(cd = lcd_kriged$var1.pred + 2,
zn = lzn_kriged$var1.pred,
dom = 1,
id = seq_len(nrow(lcd_kriged)))
frame <- buildFrameDF(
df = df, id = "id",
X = c("cd", "zn"), Y = c("cd", "zn"),
domainvalue = "dom")
```
Next, a data frame with the precision requirements for the estimated means is created. The precision requirement is given as a coefficient of variation, i.e., the standard error of the estimated population mean divided by the estimated mean. The study variables specified in `Y` are used to compute the estimated means and the standard errors for a given stratification and allocation.
```{r}
cv <- as.data.frame(list(DOM = "DOM1", CV1 = 0.02, CV2 = 0.02, domainvalue = 1))
```
Finally, the multivariate stratification is optimised by searching for the optimal stratum bounds using a genetic algorithm [@ger99].
```{r, eval = FALSE}
set.seed(314)
res <- optimStrata(
method = "continuous", errors = cv, framesamp = frame, nStrata = 5,
iter = 50, pops = 20, showPlot = FALSE)
```
```{r, echo = FALSE, eval = FALSE}
save(res, file = "results/multivariatestrata_Meuse.rda")
```
```{r, echo = FALSE}
load(file = "results/multivariatestrata_Meuse.rda")
```
A summary of the strata can be obtained with function `summaryStrata`.
```{r}
smrstrata <- summaryStrata(res$framenew, res$aggr_strata, progress = FALSE)
```
```{r, echo = FALSE}
smrstrata <- smrstrata[, c(2, 3, 4, 6, 7, 8, 9)]
smrstrata[, c(4, 5, 6, 7)] <- round(smrstrata[, c(4, 5, 6, 7)], 3)
print(smrstrata)
```
Column `Population` contains the sizes of the strata, i.e., the numbers of grid cells. The total sample size equals `r sum(smrstrata$Allocation)`. The sample size per stratum is computed with Bethel allocation; see Section \@ref(STSIallocation). The last four columns contain the lower and upper bounds of the orthogonal intervals.
Figure \@ref(fig:2dplotbivariatestrata) shows a 2D-plot of the bivariate strata. The strata can be plotted as a series of nested rectangles. All population units in the smallest rectangle belong to stratum 1; all units in the second-smallest rectangle that are not in the smallest rectangle belong to stratum 2, etc. With more than two stratification variables, the strata form a series of nested hyperrectangles or boxes. The strata are obtained as the Cartesian product of orthogonal intervals.
```{r 2dplotbivariatestrata, fig.cap = "2D-plot of optimised bivariate strata of the study area Meuse."}
plt <- plotStrata2d(res$framenew, res$aggr_strata,
domain = 1, vars = c("X1", "X2"), labels = c("Cd", "Zn"))
```
```{block2, type = 'rmdnote'}
It may happen that, after optimisation of the stratum bounds, some of the resulting strata contain no units. If a stratification with fewer strata requires fewer sampling units, so that the sampling costs are lower while the precision requirements are still met, then this stratification is retained as the optimal one.
```
Figure \@ref(fig:OptimisedstrataMeuse) shows a map of the optimised strata.
```{r OptimisedstrataMeuse, echo = FALSE, fig.width = 5, fig.cap = "Map of optimised bivariate strata of the study area Meuse."}
dfnew <- data.frame(coordinates(meuse.grid), res$framenew)
ggplot(dfnew) +
geom_raster(mapping = aes(x = x / 1000, y = y / 1000, fill = factor(STRATO))) +
scale_fill_viridis_d(name = "Stratum") +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
coord_fixed()
```
The expected coefficient of variation can be extracted with function `expected_CV`.
```{r}
expected_CV(res$aggr_strata)
```
The coefficient of variation of Cd is indeed equal to the desired level of 0.02; for Zn it is smaller. So, in this case Cd is the study variable that determines the total sample size of 26 units.
Note that these coefficients of variation are computed from the stratification variables, which are predictions of the study variables. Errors in these predictions are not accounted for. It is well known that kriging is a smoother, so that the variance of the predicted values within a stratum is smaller than the variance of the true values. As a consequence, the coefficients of variation of the predictions underestimate the coefficients of variation of the study variables. See Section \@ref(Ospats) for how prediction errors and spatial correlation of prediction errors can be accounted for in optimal stratification. An additional problem is that I added a value of 2 to the log Cd concentrations. This does not affect the standard error of the estimated mean, but it does affect the estimated mean itself, so that for this reason too the coefficient of variation of the study variable Cd is underestimated.
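A small numeric illustration of this last point, with made-up numbers that are not taken from the Meuse case:

```{r, eval = FALSE}
# adding a constant leaves the standard error unchanged but inflates the mean,
# so the computed coefficient of variation drops
se <- 0.05
m <- 0.8
se / m        # CV on the original scale: 0.0625
se / (m + 2)  # CV after adding 2: about 0.018
```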
```{r, echo = FALSE}
rm(list = ls())
```