---
title: "ETC3550: Applied forecasting for business and economics"
author: "Ch5. Regression models"
date: "OTexts.org/fpp2/"
fontsize: 14pt
output:
  beamer_presentation:
    theme: metropolis
    fig_height: 4.5
    fig_width: 7
    highlight: tango
    includes:
      in_header: header.tex
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, cache = TRUE, warning = FALSE,
  message = FALSE, dev.args = list(bg = grey(0.9), pointsize = 11))
library(fpp2)
```
# The linear model with time series
## Multiple regression and forecasting
\fontsize{13}{15}\sf
\begin{block}{}\vspace*{-0.3cm}
\[
y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \cdots + \beta_kx_{k,t} + \varepsilon_t.
\]
\end{block}
* $y_t$ is the variable we want to predict: the ``response'' variable.
* Each $x_{j,t}$ is numerical and is called a ``predictor''.
They are usually assumed to be known for all past and future times.
* The coefficients $\beta_1,\dots,\beta_k$ measure the effect of each
predictor after taking account of the effects of all other predictors
in the model.
That is, the coefficients measure the **marginal effects**.
* $\varepsilon_t$ is a white noise error term.
## Example: US consumption expenditure
\fontsize{11}{13}\sf
```{r uschangedata, echo=FALSE}
quarters <- rownames(.preformat.ts(uschange))
```
```{r ConsInc, echo=TRUE, cache=TRUE, fig.height=3.5}
autoplot(uschange[,c("Consumption","Income")]) +
ylab("% change") + xlab("Year")
```
## Example: US consumption expenditure
```{r ConsInc2, echo=FALSE, cache=TRUE}
fit.cons <- tslm(Consumption ~ Income, data=uschange)
uschange %>%
as.data.frame %>%
ggplot(aes(x=Income, y=Consumption)) +
ylab("Consumption (quarterly % change)") +
xlab("Income (quarterly % change)") +
geom_point() +
geom_smooth(method="lm", se=FALSE)
```
## Example: US consumption expenditure
\fontsize{9}{9}\sf
```{r, echo=TRUE, cache=TRUE}
tslm(Consumption ~ Income, data=uschange) %>% summary
```
## Example: US consumption expenditure
```{r MultiPredictors, echo=FALSE, cache=TRUE}
autoplot(uschange[,3:5], facets = TRUE, colour=TRUE) +
ylab("") + xlab("Year") +
guides(colour="none")
```
## Example: US consumption expenditure
```{r ScatterMatrix, echo=FALSE, cache=TRUE}
uschange %>%
as.data.frame %>%
GGally::ggpairs()
```
## Example: US consumption expenditure
\fontsize{7.4}{7.4}\sf
```{r usestim, echo=TRUE}
fit.consMR <- tslm(
Consumption ~ Income + Production + Unemployment + Savings,
data=uschange)
summary(fit.consMR)
```
## Example: US consumption expenditure
```{r usfitted1, echo=FALSE, cache=TRUE}
autoplot(uschange[,'Consumption'], series="Data") +
autolayer(fitted(fit.consMR), series="Fitted") +
xlab("Year") + ylab("") +
ggtitle("Percentage change in US consumption expenditure") +
guides(colour=guide_legend(title=" "))
```
## Example: US consumption expenditure
```{r usfitted2, echo=FALSE, cache=TRUE, message=FALSE, warning=FALSE}
cbind(Data=uschange[,"Consumption"], Fitted=fitted(fit.consMR)) %>%
as.data.frame %>%
ggplot(aes(x=Data, y=Fitted)) +
geom_point() +
xlab("Fitted (predicted values)") +
ylab("Data (actual values)") +
ggtitle("Percentage change in US consumption expenditure") +
geom_abline(intercept=0, slope=1)
```
## Example: US consumption expenditure
```{r}
checkresiduals(fit.consMR, test=FALSE)
```
# Residual diagnostics
## Multiple regression and forecasting
For forecasting purposes, we require the following assumptions:
* $\varepsilon_t$ are uncorrelated and have zero mean.
* $\varepsilon_t$ are uncorrelated with each $x_{j,t}$.
\pause
It is **useful** to also have $\varepsilon_t \sim \text{N}(0,\sigma^2)$ when producing prediction intervals or doing statistical tests.
## Residual plots
Useful for spotting outliers and for checking whether the linear model was
appropriate.
* Scatterplot of residuals $\varepsilon_t$ against each predictor $x_{j,t}$.
* Scatterplot of residuals against the fitted values $\hat y_t$.
* Expect to see scatterplots resembling a horizontal band with
no values too far from the band and no patterns such as curvature or
increasing spread.
## Residual patterns
* If a plot of the residuals vs any predictor in the model shows a pattern, then the relationship is nonlinear.
* If a plot of the residuals vs any predictor **not** in the model shows a pattern, then the predictor should be added to the model.
* If a plot of the residuals vs fitted values shows a pattern, then there is heteroscedasticity in the errors. (Could try a transformation.)
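A minimal sketch of the first check (not in the original slides), using the `fit.consMR` model fitted earlier; the choice of `Income` is illustrative, and the plot would be repeated for each predictor:
```r
# Plot residuals against one predictor; a visible pattern suggests
# the relationship with that predictor is nonlinear.
df <- as.data.frame(uschange)
df[, "Residuals"] <- as.numeric(residuals(fit.consMR))
ggplot(df, aes(x = Income, y = Residuals)) + geom_point()
```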
## Breusch-Godfrey test
**OLS regression:**
$$
y_{t}=\beta_{0}+\beta_{1}x_{t,1}+\dots+\beta_{k}x_{t,k}+u_{t}
$$
**Auxiliary regression:**
$$
{\hat {u}}_{t}=\beta_{0}+\beta_{1}x_{t,1}+\dots+\beta_{k}x_{t,k}+\rho_{1}{\hat {u}}_{t-1}+\dots +\rho_{p}{\hat {u}}_{t-p}+\varepsilon_{t}
$$
###
If the $R^{2}$ statistic is calculated for this model, then
$$
(T-p)R^{2}\,\sim \,\chi_{p}^{2},
$$
when there is no serial correlation up to lag $p$, and $T=$ length of series.
* The Breusch-Godfrey test is better suited to regression models than the Ljung-Box test.
## US consumption again
\fontsize{9}{13}\sf
```{r}
checkresiduals(fit.consMR, plot=FALSE)
```
### If the model fails the Breusch-Godfrey test \dots
* The forecasts are not wrong, but they have higher variance than necessary.
* There is information in the residuals that we should exploit.
* This is done with a regression model with ARMA errors.
# Some useful predictors for linear models
## Trend
**Linear trend**
\[
x_t = t
\]
* $t=1,2,\dots,T$
* Strong assumption that trend will continue.
## Dummy variables
\begin{textblock}{6}(0.4,1.5)
If a categorical variable takes only two values (e.g., `Yes'
or `No'), then an equivalent numerical variable can be constructed
taking value 1 if yes and 0 if no. This is called a \textbf{dummy variable}.
\end{textblock}
\placefig{7.7}{1.}{width=5cm}{dummy2}
## Dummy variables
\begin{textblock}{5}(0.4,1.5)
If there are more than two categories, then the variable can
be coded using several dummy variables (one fewer than the total
number of categories).
\end{textblock}
\placefig{5.5}{1.5}{width=7.3cm}{dummy3}
## Beware of the dummy variable trap!
* Using one dummy for each category gives too many dummy variables!
* The regression will then be singular and inestimable.
* Either omit the constant, or omit the dummy for one category.
* The coefficients of the dummies are relative to the omitted category.
## Uses of dummy variables
\fontsize{13}{15}\sf
**Seasonal dummies**
* For quarterly data: use 3 dummies
* For monthly data: use 11 dummies
* For daily data: use 6 dummies
* What to do with weekly data?
\pause
**Outliers**
* If there is an outlier, you can use a dummy variable (taking value 1 for that observation and 0 elsewhere) to remove its effect.
\pause
**Public holidays**
* For daily data: if it is a public holiday, dummy=1, otherwise dummy=0.
## Beer production revisited
```{r, echo=FALSE}
beer <- window(ausbeer, start=1992)
autoplot(beer) + xlab("Year") + ylab("megalitres") +
ggtitle("Australian quarterly beer production")
```
## Beer production revisited
\begin{block}{Regression model}
\centering
$y_t = \beta_0 + \beta_1 t + \beta_2d_{2,t} + \beta_3 d_{3,t} + \beta_4 d_{4,t} + \varepsilon_t$
\end{block}
* $d_{i,t} = 1$ if $t$ is quarter $i$ and 0 otherwise.
## Beer production revisited
\fontsize{8}{8}\sf
```{r, echo=TRUE}
fit.beer <- tslm(beer ~ trend + season)
summary(fit.beer)
```
## Beer production revisited
```{r}
autoplot(beer, series="Data") +
autolayer(fitted(fit.beer), series="Fitted") +
xlab("Year") + ylab("Megalitres") +
ggtitle("Quarterly Beer Production")
```
## Beer production revisited
```{r, echo=FALSE}
cbind(Data=beer, Fitted=fitted(fit.beer)) %>%
as.data.frame %>%
ggplot(aes(x=Data, y=Fitted, colour=as.factor(cycle(beer)))) +
geom_point() +
ylab("Fitted") + xlab("Actual values") +
ggtitle("Quarterly beer production") +
scale_colour_brewer(palette="Dark2", name="Quarter") +
geom_abline(intercept=0, slope=1)
```
## Beer production revisited
```{r}
checkresiduals(fit.beer, test=FALSE)
```
## Beer production revisited
```{r}
fit.beer %>% forecast %>% autoplot
```
## Fourier series
Periodic seasonality can be handled using pairs of Fourier terms:
$$
s_{k}(t) = \sin\left(\frac{2\pi k t}{m}\right)\qquad c_{k}(t) = \cos\left(\frac{2\pi k t}{m}\right)
$$
$$
y_t = a + bt + \sum_{k=1}^K \left[\alpha_k s_k(t) + \beta_k c_k(t)\right] + \varepsilon_t$$
* Every periodic function can be approximated by sums of sine and cosine terms for large enough $K$.
* Choose $K$ by minimizing the AICc.
* This is called "harmonic regression".
```r
fit <- tslm(y ~ trend + fourier(y, K))
```
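For example, a sketch of selecting $K$ for the quarterly beer data by refitting at each value, using the `CV()` function introduced later in this chapter:
```r
# Choose the number of Fourier pairs K by minimizing the AICc.
# For quarterly data (m = 4), K can be at most m/2 = 2.
best.aicc <- Inf
for (K in seq_len(2)) {
  fit <- tslm(beer ~ trend + fourier(beer, K = K))
  aicc <- CV(fit)["AICc"]
  if (aicc < best.aicc) {
    best.aicc <- aicc
    best.K <- K
  }
}
```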
## Harmonic regression: beer production
\fontsize{8}{8}\sf
```{r fourierbeer, echo=TRUE, cache=TRUE}
fourier.beer <- tslm(beer ~ trend + fourier(beer, K=2))
summary(fourier.beer)
```
## Intervention variables
**Spikes**
* Equivalent to a dummy variable for handling an outlier.
\pause
**Steps**
* Variable takes value 0 before the intervention and 1 afterwards.
\pause
**Change of slope**
* Variable takes value 0 before the intervention and values $\{1,2,3,\dots\}$ afterwards.
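A sketch of a step intervention; the series `y` and the intervention date `tau` are hypothetical:
```r
# Step dummy: 0 before the (hypothetical) intervention, 1 afterwards.
tau <- 2005
step <- ts(as.numeric(time(y) >= tau),
           start = start(y), frequency = frequency(y))
fit <- tslm(y ~ trend + step)
```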
## Holidays
**For monthly data**
* Christmas: always in December so part of monthly seasonal effect
* Easter: use a dummy variable $v_t=1$ if any part of Easter is in that month, $v_t=0$ otherwise.
* Ramadan and Chinese New Year can be handled similarly.
## Trading days
With monthly data, if the observations vary depending on how many of each type of day occur in the month, then trading day predictors can be useful.
\begin{align*}
z_1 &= \text{\# Mondays in month;} \\
z_2 &= \text{\# Tuesdays in month;} \\
&\vdots \\
z_7 &= \text{\# Sundays in month.}
\end{align*}
## Distributed lags
Lagged values of a predictor.
Example: $x$ is advertising which has a delayed effect
\begin{align*}
x_{1} &= \text{advertising for previous month;} \\
x_{2} &= \text{advertising for two months previously;} \\
& \vdots \\
x_{m} &= \text{advertising for $m$ months previously.}
\end{align*}
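A sketch with hypothetical monthly series `sales` and `adverts`, using `ts.intersect` to align the lagged copies:
```r
# Build lagged advertising predictors and fit a distributed-lag model.
dlags <- ts.intersect(
  sales    = sales,
  adverts0 = adverts,
  adverts1 = stats::lag(adverts, -1),   # advertising one month earlier
  adverts2 = stats::lag(adverts, -2))   # advertising two months earlier
fit <- tslm(sales ~ adverts0 + adverts1 + adverts2, data = dlags)
```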
## Nonlinear trend
**Piecewise linear trend with bend at $\tau$**
\begin{align*}
x_{1,t} &= t \\
x_{2,t} &= \left\{ \begin{array}{ll}
0 & t <\tau\\
(t-\tau) & t \ge \tau
\end{array}\right.
\end{align*}
\pause\vspace*{1cm}
**Quadratic or higher order trend**
\[
x_{1,t} =t,\quad x_{2,t}=t^2,\quad \dots
\]
\pause\vspace*{-0.5cm}
\centerline{\textcolor{orange}{\textbf{NOT RECOMMENDED!}}}
## Example: Boston marathon winning times
\fontsize{11}{11}\sf
```{r, fig.height=3.5, echo=TRUE}
autoplot(marathon) +
xlab("Year") + ylab("Winning times in minutes")
fit.lin <- tslm(marathon ~ trend)
```
## Example: Boston marathon winning times
```{r marathonLinear, echo=FALSE, message=TRUE, warning=FALSE, cache=TRUE}
autoplot(marathon) +
autolayer(fitted(fit.lin), series = "Linear trend") +
xlab("Year") + ylab("Winning times in minutes") +
guides(colour=guide_legend(title=" ")) +
theme(legend.position = "none")
```
## Example: Boston marathon winning times
\fontsize{11}{11}\sf
```{r, echo=TRUE, fig.height=3.5}
autoplot(residuals(fit.lin)) +
xlab("Year") + ylab("Residuals from a linear trend")
```
## Example: Boston marathon winning times
\fontsize{8}{8}\sf
```{r, echo=TRUE, message=TRUE, warning=FALSE, cache=TRUE}
# Linear trend
fit.lin <- tslm(marathon ~ trend)
fcasts.lin <- forecast(fit.lin, h=10)
# Exponential trend
fit.exp <- tslm(marathon ~ trend, lambda = 0)
fcasts.exp <- forecast(fit.exp, h=10)
# Piecewise linear trend
t.break1 <- 1940
t.break2 <- 1980
t <- time(marathon)
t1 <- ts(pmax(0, t-t.break1), start=1897)
t2 <- ts(pmax(0, t-t.break2), start=1897)
fit.pw <- tslm(marathon ~ t + t1 + t2)
t.new <- t[length(t)] + seq(10)
t1.new <- t1[length(t1)] + seq(10)
t2.new <- t2[length(t2)] + seq(10)
newdata <- cbind(t=t.new, t1=t1.new, t2=t2.new) %>%
as.data.frame
fcasts.pw <- forecast(fit.pw, newdata = newdata)
```
## Example: Boston marathon winning times
```{r, echo=FALSE, message=TRUE, warning=FALSE, cache=TRUE}
autoplot(marathon) +
autolayer(fitted(fit.lin), series = "Linear") +
autolayer(fitted(fit.exp), series="Exponential") +
autolayer(fitted(fit.pw), series = "Piecewise") +
autolayer(fcasts.pw, series="Piecewise") +
autolayer(fcasts.lin$mean, series = "Linear") +
autolayer(fcasts.exp$mean, series="Exponential") +
xlab("Year") + ylab("Winning times in minutes") +
ggtitle("Boston Marathon") +
guides(colour=guide_legend(title=" "))
```
## Example: Boston marathon winning times
```{r residPiecewise, message=FALSE, warning=FALSE, cache=TRUE}
fit.pw$method <- "Piecewise linear regression model"
checkresiduals(fit.pw, test=FALSE)
```
## Interpolating splines
\fullwidth{draftspline}
## Interpolating splines
\fullwidth{spline}
## Interpolating splines
\fullwidth{spline2}
## Interpolating splines
\begin{block}{}
A spline is a continuous function $f(x)$ interpolating all
points ($\kappa_j,y_j$) for $j=1,\dots,K$ and consisting of polynomials between each consecutive pair of `knots' $\kappa_j$ and $\kappa_{j+1}$.
\end{block}
\pause
* Parameters constrained so that $f(x)$ is continuous.
* Further constraints imposed to give continuous derivatives.
## General linear regression splines
* Let $\kappa_1<\kappa_2<\cdots<\kappa_K$ be ``knots'' in interval $(a,b)$.
* Let $x_1=x$, $x_j = (x-\kappa_{j-1})_+$ for $j=2,\dots,K+1$.
* Then the regression is piecewise linear with bends at the knots.
## General cubic splines
* Let $x_1=x$, $x_2 = x^2$, $x_3=x^3$, $x_j =
(x-\kappa_{j-3})_+^3$ for $j=4,\dots,K+3$.
* Then the regression is piecewise cubic, but smooth at the knots.
* Choice of knots can be difficult and arbitrary.
* Automatic knot selection algorithms are very slow.
## Example: Boston marathon winning times
\fontsize{7}{7}\sf
```{r, echo=TRUE, message=TRUE, warning=FALSE, cache=TRUE}
# Spline trend
library(splines)
t <- time(marathon)
fit.splines <- lm(marathon ~ ns(t, df=6))
summary(fit.splines)
fits <- ts(fitted(fit.splines))
tsp(fits) <- tsp(marathon)
```
## Example: Boston marathon winning times
\fontsize{8}{8}\sf
```{r, echo=FALSE, message=FALSE, warning=FALSE, cache=TRUE}
library(splines)
fc <- predict(fit.splines, newdata=data.frame(t=time(fcasts.pw$mean)),
interval='prediction', level=0.95)
fcasts.spline <- fcasts.pw
fcasts.spline$mean <- fc[,"fit"]
se <- (fc[,'upr'] - fc[,'lwr'])/(2*qnorm(0.975))
fcasts.spline$lower[,2] <- fc[,'fit'] - qnorm(.975)*se
fcasts.spline$lower[,1] <- fc[,'fit'] - qnorm(.90)*se
fcasts.spline$upper[,2] <- fc[,'fit'] + qnorm(.975)*se
fcasts.spline$upper[,1] <- fc[,'fit'] + qnorm(.90)*se
autoplot(marathon) +
autolayer(fitted(fit.pw), series = "Piecewise linear") +
autolayer(fcasts.pw, series="Piecewise linear") +
autolayer(fits, series = "Cubic spline") +
autolayer(fcasts.spline, series="Cubic spline") +
xlab("Year") + ylab("Winning times in minutes") +
ggtitle("Boston Marathon") +
guides(colour=guide_legend(title=" "))
```
## splinef
\fontsize{11}{13}\sf
A slightly different type of spline is provided by `splinef`
```{r, echo=TRUE, fig.height=3.3}
fc <- splinef(marathon)
autoplot(fc)
```
## splinef
* Cubic **smoothing** splines (rather than cubic regression splines).
* Still piecewise cubic, but with many more knots (one at each observation).
* Coefficients constrained to prevent the curve becoming too "wiggly".
* Degrees of freedom selected automatically.
* Equivalent to ARIMA(0,2,2) and Holt's method.
# Selecting predictors and forecast evaluation
## Selecting predictors
* When there are many predictors, how should we choose which ones to use?
* We need a way of comparing two competing models.
\pause
**What not to do!**
* Plot $y$ against a particular predictor ($x_j$) and if it shows no noticeable relationship, drop it.
* Do a multiple linear regression on all the predictors and disregard all variables whose $p$ values are greater than 0.05.
* Maximize $R^2$ or minimize MSE.
## Comparing regression models
Computer output for regression will always give the $R^2$ value. This is a
useful summary of the model.
* It is equal to the square of the correlation between $y$ and $\hat y$.
* It is often called the ``coefficient of determination''.
* It can also be calculated as follows:
$$R^2 = \frac{\sum(\hat{y}_t - \bar{y})^2}{\sum(y_t-\bar{y})^2}
$$
* It is the proportion of variance accounted for (explained) by the predictors.
## Comparing regression models
However \dots
* $R^2$ does not allow for ``degrees of freedom''.
* Adding *any* variable tends to increase the value of $R^2$, even if that variable is irrelevant.
\pause
To overcome this problem, we can use \emph{adjusted $R^2$}:
\begin{block}{}
\[
\bar{R}^2 = 1-(1-R^2)\frac{T-1}{T-k-1}
\]
where $k=$ no.\ predictors and $T=$ no.\ observations.
\end{block}
\pause
\begin{alertblock}{Maximizing $\bar{R}^2$ is equivalent to minimizing $\hat\sigma^2$.}
\centerline{$\displaystyle
\hat{\sigma}^2 = \frac{1}{T-k-1}\sum_{t=1}^T e_t^2$, where $e_t$ are the residuals.
}
\end{alertblock}
## Cross-validation
**Cross-validation for regression**
(Assuming future predictors are known)
* Select one observation for test set, and use \emph{remaining} observations in training set. Compute error on test observation.
* Repeat using each possible observation as the test set.
* Compute accuracy measure over all errors.
## Cross-validation
**Traditional evaluation**
```{r traintest0, fig.height=.6, echo=FALSE, cache=TRUE}
train = 1:16
test = 17:20
par(mar=c(0,0,0,0))
plot(0,0,xlim=c(0,22),ylim=c(0,2),xaxt="n",yaxt="n",bty="n",xlab="",ylab="",type="n")
arrows(0,0.5,21,0.5,0.05)
points(train, train*0+0.5, pch=19, col="blue")
points(test, test*0+0.5, pch=19, col="red")
text(22,0.5,"time")
text(10,1,"Training data",col="blue")
text(21,1,"Test data",col="red")
```
\pause
**Leave-one-out cross-validation**
```{r, fig.height=3.6}
kcvplot <- function(k,n)
{
par(mar=c(0,0,0,0))
each <- round(n/k)
ord <- sample(1:n,n)
plot(0,0, xlim=c(0,n+2), ylim=c(1-k/n,1-1/n),
xaxt="n", yaxt="n", bty="n", xlab="", ylab="", type="n")
for(j in 1:k)
{
test <- (1:n)[ord[(j-1)*each + (1:each)]]
train <- (1:n)[-test]
arrows(0, 1-j/n, n+1, 1-j/n, 0.05)
points(train,rep(1-j/n,length(train)),pch=19,col="blue")
points(test, rep(1-j/n,length(test)), pch=19, col="red")
#text(28,1-j/11,"time")
}
}
kcvplot(20,20)
```
## Cross-validation
Leave-one-out cross-validation for regression can be carried out using the following steps.
* Remove observation $t$ from the data set, and fit the model using the remaining data. Then compute the error ($e_t^*=y_t-\hat{y}_t$) for the omitted observation.
* Repeat step 1 for $t=1,\dots,T$.
* Compute the MSE from $\{e_1^*,\dots,e_T^*\}$. We shall call this the CV.
###
The best model is the one with minimum CV.
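For linear models, the leave-one-out errors can be computed without refitting the model $T$ times, using the standard hat-matrix identity $e_t^* = e_t/(1-h_t)$. A minimal sketch:
```r
# Leave-one-out CV for a linear model without refitting:
# e_t / (1 - h_t) is the leave-one-out error for observation t.
fit <- tslm(Consumption ~ Income, data = uschange)
loo.e <- residuals(fit) / (1 - hatvalues(fit))
CV.stat <- mean(loo.e^2)
```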
## Cross-validation
**Five-fold cross-validation**
* 20 observations. 4 test observations per fold
```{r, fig.height=1.}
kcvplot(5,20)
```
\pause
**Ten-fold cross-validation**
* 20 observations. 2 test observations per fold
```{r, fig.height=2.}
kcvplot(10,20)
```
## Cross-validation
\structure{Ten-fold cross-validation}
* Randomly split the data into 10 parts.
* Select one part as the test set, and use the remaining parts as the training set. Compute accuracy measures on the test observations.
* Repeat for each of the 10 parts.
* Average over all measures.
## Akaike's Information Criterion
\begin{block}{}
\[
\text{AIC} = -2\log(L) + 2(k+2)
\]
\end{block}
where $L$ is the likelihood and $k$ is the number of predictors in the model.\pause
* This is a \emph{penalized likelihood} approach.
* \emph{Minimizing} the AIC gives the best model for prediction.
* AIC penalizes terms more heavily than $\bar{R}^2$.
* Minimizing the AIC is asymptotically equivalent to minimizing MSE via leave-one-out cross-validation.
## Corrected AIC
For small values of $T$, the AIC tends to select too many predictors, and so a bias-corrected version of the AIC has been developed.
\begin{block}{}
\[
\text{AIC}_{\text{C}} = \text{AIC} + \frac{2(k+2)(k+3)}{T-k-3}
\]
\end{block}
As with the AIC, the AIC$_{\text{C}}$ should be minimized.
## Bayesian Information Criterion
\begin{block}{}
\[
\text{BIC} = -2\log(L) + (k+2)\log(T)
\]
\end{block}
where $L$ is the likelihood and $k$ is the number of predictors in the model.\pause
* BIC penalizes terms more heavily than AIC.
* Also called SBIC and SC.
* Minimizing BIC is asymptotically equivalent to leave-$v$-out cross-validation when $v = T[1-1/(\log(T)-1)]$.
## Choosing regression variables
**Best subsets regression**
* Fit all possible regression models using one or more of the predictors.
* Choose the best model based on one of the measures of predictive ability (CV, AIC, AICc).
\pause
**Warning!**
* If there are a large number of predictors, this is not feasible.
* For example, 44 predictors lead to about 18 trillion possible models!
## Choosing regression variables
**Backwards stepwise regression**
* Start with a model containing all variables.
* Try subtracting one variable at a time. Keep the model if it
has lower CV or AICc.
* Iterate until no further improvement.
\pause
**Notes**
* Stepwise regression is not guaranteed to lead to the best possible
model.
* Inference on the coefficients of the final model will be wrong.
## Cross-validation
\fontsize{8}{8}\sf
```{r, echo=TRUE}
tslm(Consumption ~ Income + Production + Unemployment + Savings,
data=uschange) %>% CV()
tslm(Consumption ~ Income + Production + Unemployment,
data=uschange) %>% CV()
tslm(Consumption ~ Income + Production + Savings,
data=uschange) %>% CV()
tslm(Consumption ~ Income + Unemployment + Savings,
data=uschange) %>% CV()
tslm(Consumption ~ Production + Unemployment + Savings,
data=uschange) %>% CV()
```
# Forecasting with regression
## Ex-ante versus ex-post forecasts
* *Ex ante forecasts* are made using only information available in advance.
- require forecasts of predictors
* *Ex post forecasts* are made using later information on the predictors.
- useful for studying behaviour of forecasting models.
* Trend, seasonal and calendar variables are all known in advance, so they don't need to be forecast.
## Scenario-based forecasting
* Assumes possible scenarios for the predictor variables.
* Prediction intervals for scenario-based forecasts do not include the uncertainty associated with the future values of the predictor variables.
## Building a predictive regression model {-}
* If getting forecasts of predictors is difficult, you can use lagged predictors instead.
$$y_{t}=\beta_0+\beta_1x_{1,t-h}+\dots+\beta_kx_{k,t-h}+\varepsilon_{t}$$
* A different model for each forecast horizon $h$.
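A sketch with $h=4$ quarters, using the US consumption data from earlier; `ts.intersect` aligns the lagged predictor with the response:
```r
# Regress consumption on income lagged h = 4 quarters.
h <- 4
lagged <- ts.intersect(
  Consumption = uschange[, "Consumption"],
  IncomeLag   = stats::lag(uschange[, "Income"], -h))
fit.lagged <- tslm(Consumption ~ IncomeLag, data = lagged)
```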
## Beer production
\fontsize{10}{10}\sf
```{r beeryetagain, echo=TRUE, fig.height=3.5}
beer2 <- window(ausbeer, start=1992)
fit.beer <- tslm(beer2 ~ trend + season)
fcast <- forecast(fit.beer)
autoplot(fcast) +
ggtitle("Forecasts of beer production using regression") +
xlab("Year") + ylab("megalitres")
```
## US Consumption
\fontsize{10}{10}\sf
```{r usconsumptionf, echo=TRUE}
fit.consBest <- tslm(
Consumption ~ Income + Savings + Unemployment,
data = uschange)
h <- 4
newdata <- data.frame(
Income = c(1, 1, 1, 1),
Savings = c(0.5, 0.5, 0.5, 0.5),
Unemployment = c(0, 0, 0, 0))
fcast.up <- forecast(fit.consBest, newdata = newdata)
newdata <- data.frame(
Income = rep(-1, h),
Savings = rep(-0.5, h),
Unemployment = rep(0, h))
fcast.down <- forecast(fit.consBest, newdata = newdata)
```
## US Consumption
\fontsize{10}{10}\sf
```{r usconsumptionf2, echo=TRUE, fig.height=3.5}
autoplot(uschange[, 1]) +
ylab("% change in US consumption") +
autolayer(fcast.up, PI = TRUE, series = "increase") +
autolayer(fcast.down, PI = TRUE, series = "decrease") +
guides(colour = guide_legend(title = "Scenario"))
```
# Matrix formulation
## Matrix formulation
\begin{block}{}
$$y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \cdots + \beta_kx_{k,t} + \varepsilon_t.$$
\end{block}
\pause
Let $\bm{y} = (y_1,\dots,y_T)'$, $\bm{\varepsilon} = (\varepsilon_1,\dots,\varepsilon_T)'$, $\bm{\beta} = (\beta_0,\beta_1,\dots,\beta_k)'$ and
\[
\bm{X} = \begin{bmatrix}
1 & x_{1,1} & x_{2,1} & \dots & x_{k,1}\\
1 & x_{1,2} & x_{2,2} & \dots & x_{k,2}\\
\vdots & \vdots & \vdots & & \vdots\\
1 & x_{1,T} & x_{2,T} & \dots & x_{k,T}
\end{bmatrix}.
\]\pause
Then
###
$$\bm{y} = \bm{X}\bm{\beta} + \bm{\varepsilon}.$$
## Matrix formulation
**Least squares estimation**
Minimize: $(\bm{y} - \bm{X}\bm{\beta})'(\bm{y} - \bm{X}\bm{\beta})$\pause
Differentiating with respect to $\bm{\beta}$ gives
\begin{block}{}
\[
\hat{\bm{\beta}} = (\bm{X}'\bm{X})^{-1}\bm{X}'\bm{y}
\]
\end{block}
\pause
(The ``normal equation''.)\pause
\[
\hat{\sigma}^2 = \frac{1}{T-k-1}(\bm{y} - \bm{X}\hat{\bm{\beta}})'(\bm{y} - \bm{X}\hat{\bm{\beta}})
\]
\structure{Note:} If you fall for the dummy variable trap, $(\bm{X}'\bm{X})$ is a singular matrix.
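A sketch (not in the original slides) checking the matrix formula against the earlier `fit.consMR`:
```r
# Compute the OLS estimate directly and compare with tslm's coefficients.
X <- model.matrix(fit.consMR)              # design matrix; first column all 1s
y <- as.numeric(uschange[, "Consumption"])
beta.hat <- solve(t(X) %*% X, t(X) %*% y)  # (X'X)^{-1} X'y
cbind(beta.hat, coef(fit.consMR))          # the two columns should agree
```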
## Likelihood
If the errors are iid and normally distributed, then
\[
\bm{y} \sim \text{N}(\bm{X}\bm{\beta},\sigma^2\bm{I}).
\]\pause
So the likelihood is
\[
L = \frac{1}{\sigma^T(2\pi)^{T/2}}\exp\left(-\frac1{2\sigma^2}(\bm{y}-\bm{X}\bm{\beta})'(\bm{y}-\bm{X}\bm{\beta})\right)
\]\pause
which is maximized when $(\bm{y}-\bm{X}\bm{\beta})'(\bm{y}-\bm{X}\bm{\beta})$ is minimized.\pause
\centerline{\textcolor{orange}{So \textbf{MLE = OLS}.}}
## Multiple regression forecasts
\begin{block}{Optimal forecasts}\vspace*{-0.2cm}
\[
\hat{y}^* =
\text{E}(y^* | \bm{y},\bm{X},\bm{x}^*) =
\bm{x}^*\hat{\bm{\beta}} = \bm{x}^*(\bm{X}'\bm{X})^{-1}\bm{X}'\bm{y}
\]
\end{block}
where $\bm{x}^*$ is a row vector containing the values of the predictors for the forecasts (in the same format as $\bm{X}$).
\pause
\begin{block}{Forecast variance}\vspace*{-0.2cm}
\[
\text{Var}(y^* | \bm{X},\bm{x}^*) = \sigma^2 \left[1 + \bm{x}^* (\bm{X}'\bm{X})^{-1} (\bm{x}^*)'\right]
\]
\end{block}
\pause
* This ignores any errors in $\bm{x}^*$.
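A sketch of these formulas, reusing `X` and `beta.hat` from the previous slide's sketch; the `xstar` values are hypothetical, ordered as the columns of `X` (intercept, Income, Production, Unemployment, Savings):
```r
# Forecast mean and variance for one new predictor row x*,
# ignoring any errors in x* itself.
xstar <- c(1, 0.8, 0.5, 0.1, 0.5)
sigma2 <- sum(residuals(fit.consMR)^2) / (nrow(X) - ncol(X))
fc.mean <- drop(xstar %*% beta.hat)
fc.var  <- sigma2 * (1 + drop(xstar %*% solve(t(X) %*% X, xstar)))
```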