Forecasting 3-3 - ARMA Model Structure.Rmd

---
output:
  xaringan::moon_reader:
    css: "my-theme.css"
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
---

layout: true

.hheader[<a href="index.html">`r fontawesome::fa("home", fill = "steelblue")`</a>]

---

```{r setup, include=FALSE, message=FALSE}
options(htmltools.dir.version = FALSE, servr.daemon = TRUE)
library(huxtable)
```

class: center, middle, inverse
# ARIMA Models
## Model Structure

.futnote[Eli Holmes, UW SAFS]

.citation[eeholmes@uw.edu]

---

```{r load_data, echo=FALSE, message=FALSE, warning=FALSE}
load("landings.RData")
landings$log.metric.tons = log(landings$metric.tons)
landings = subset(landings, Year <= 1987)
anchovy = subset(landings, Species=="Anchovy")$log.metric.tons
sardine = subset(landings, Species=="Sardine")$log.metric.tons

library(ggplot2)
library(gridExtra)
library(reshape2)
library(tseries)
library(forecast)
```

## Model Structure

We are now at step A3 and A4 of the Box-Jenkins Method.  Note we did not address seasonality since we are working with yearly data.

A. Model form selection

  1. Evaluate stationarity and seasonality
  2. Selection of the differencing level (d)
  3. **Selection of the AR level (p)**
  4. **Selection of the MA level (q)**
    
B. Parameter estimation

C. Model checking

*Much of this will be automated when we use the forecast package*

---

## Terminology: AR and MA levels

Step A3 is to determine the number of $p$ lags in the AR part of the model:

$$x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + ... + \phi_p x_{t-p} + e_t$$

Step A4  is to determine the number of $q$ lags in the MA part of the model: 

$$e_t = \eta_t + \theta_1 \eta_{t-1} + \theta_2 \eta_{t-2} + ... + \theta_q \eta_{t-q},\quad \eta_t \sim N(0, \sigma)$$

---

## Terminology: model order

For an ARIMA model, the number of AR lags, number of differences, and number of MA lags is called the **model order** or just **order**.

Examples.  Note $e_t \sim N(0,\sigma)$

- order (0,0,0) white noise 

$$x_t = e_t$$

- order (1,0,0) AR-1 process 

$$x_t = \phi x_{t-1} + e_t$$

- order (0,0,1) MA-1 process

$$x_t = e_t + \theta e_{t-1}$$

- order (1,0,1) AR-1 MA-1 process

$$x_t = \phi x_{t-1} + e_t + \theta e_{t-1}$$

---

- order (0,1,0) random walk

$$x_t - x_{t-1} = e_t$$

which is the same as

$$x_t = x_{t-1} + e_t$$

---

## How to choose the AR and MA levels

### Method #1 use the ACF and PACF functions

The ACF plot shows you how the correlation between $x_t$ and $x_{t+p}$ decrease as $p$ increases.  The PACF plot shows you the same but removes the autocorrelation due to lags less that $p$.

---

```{r fig.acf.pacf, fig.height = 6, fig.width = 8, fig.align = "center", echo=FALSE}
par(mfrow=c(3,2))
#####
ys = arima.sim(n=1000, list(ar=.7))
acf(ys, main="")
title("AR-1 (p=1, q=0)")
pacf(ys, main="")

ys = arima.sim(n=1000, list(ma=.7))
acf(ys, main="")
title("MA-1 (p=0, q=1)")
pacf(ys, main="")

ys = arima.sim(n=1000, list(ar=.7, ma=.7))
acf(ys, main="")
title("ARMA (p=1, q=1)")
pacf(ys, main="")
```

---

If your ACF and PACF look like the top panel, it is AR-p.  The first lag where the PACF is below the dashed lines is the $p$ lag for your model.

```{r fig.acf, fig.height = 5, fig.width = 8, fig.align = "center", echo=FALSE}
par(mfrow=c(2,2))
#####
ys = arima.sim(n=1000, list(ar=.7))
acf(ys, main="")
title("AR-1 (p=1, q=0)")
pacf(ys, main="")

ys = arima.sim(n=1000, list(ar=c(.7,.2)))
acf(ys, main="")
title("AR-2 (p=2, q=0)")
pacf(ys, main="")
```

---

If it looks like the middle panel, it is MA-p.  The first lag where the ACF is below the dashed lines is the $q$ lag for your model.

```{r fig.pacf, fig.height = 5, fig.width = 8, fig.align = "center", echo=FALSE}
par(mfrow=c(2,2))
#####
ys = arima.sim(n=1000, list(ma=.7))
acf(ys, main="")
title("MA-1 (p=0, q=1)")
pacf(ys, main="")

ys = arima.sim(n=1000, list(ma=c(.7,.2)))
acf(ys, main="")
title("MA-2 (p=0, q=2)")
pacf(ys, main="")
```

---

If it looks like the bottom panel, it is ARMA and this approach doesn't work.

```{r fig.pacf.acf.model, fig.height = 5, fig.width = 8, fig.align = "center", echo=FALSE}
par(mfrow=c(2,2))
#####
ys = arima.sim(n=1000, list(ar=.7, ma=.7))
acf(ys, main="")
title("ARMA (p=1, q=1)")
pacf(ys, main="")

ys = arima.sim(n=1000, list(ar=c(.7,.2), ma=c(.7,.2)))
acf(ys, main="")
title("ARMA (p=2, q=2)")
pacf(ys, main="")
```

---

## How to choose the AR and MA levels

### Method #2 Use formal model selection

This weighs how well the model fits against how many parameters your model has.  We will use this approach. 

The `auto.arima()` function in the forecast package in R allows you to easily estimate the $p$ and $q$ for your ARMA model.

We will use the first difference of the anchovy data since our stationarity diagnostics indicated that a first difference makes our time series stationary.

```{r auto.arima, eval=FALSE}
require(forecast)
anchovy.diff1 = diff(anchovy)
auto.arima(anchovy.diff1)
```

---

```{r auto.arima2}
require(forecast)
anchovy.diff1 = diff(anchovy)
auto.arima(anchovy.diff1)
```

The output indicates that the 'best' model is a MA-1 with a non-zero mean.  "non-zero mean" means that the mean of our data (`anchovy.diff1`) is not zero.

---

`auto.arima()` will also estimate the amount of differencing needed.

```{r auto.arima3}
auto.arima(anchovy)
```

The output indicates that the 'best' model is a MA-1 with first difference.  "with drift" means that the mean of our data (`anchovy`) is not zero.  This is the same model but the jargon regarding the mean is different.

---

## More fitting examples

Let's try fitting to some simulated data.  We will simulate with `arima.sim()`.  We will specify no differencing.

```{r fitting.example.1}
set.seed(100)
a1 = arima.sim(n=100, model=list(ar=c(.8,.1)))
auto.arima(a1, seasonal=FALSE, max.d=0)
```

The 'best-fit' model is simpler than the model used to simulate the data. 

---

## Let's fit 100 and see how often the 'true' model is chosen

By far the correct type of model is selected, AR-p, but usually a simpler model of AR-1 is chosen over AR-2 (correct) most of the time.

```{r fit.1000}
save.fits = rep(NA,100)
for(i in 1:100){
  a1 = arima.sim(n=100, model=list(ar=c(.8,.1)))
  fit = auto.arima(a1, seasonal=FALSE, max.d=0, max.q=0)
  save.fits[i] = paste0(fit$arma[1], "-", fit$arma[2])
}
table(save.fits)
```

---

## Trace = TRUE

You can see what models that `auto.arima()` tried using `trace=TRUE`.  The models are selected on AICc by default and the AICc value is shown next to the model.

```{r}
auto.arima(anchovy, trace=TRUE)
```

---

## stepwise=FALSE

By default, step-wise selection is used and an approximation is used for the models tried in the model selection step.  For a final model selection, you should turn these off.

```{r}
auto.arima(anchovy, stepwise=FALSE, approximation=FALSE)
```

---
## Summary

- Once you have dealt with stationarity, you need to determine the order of the model: the AR part and the MA part.

- Although you could simply use `auto.arima()`, it is best to run `acf()` and `pacf()` on your data to understand it better.

  - Does it look like a pure AR process?
    
- Also evaluate if there are reasons to assume a particular structure.  

  - Are you using an established model form, from say another paper?
  - Are you fitting to a process that is fundamentally AR only or AR + MA?