# Linear Models {#lm}
## Problem Setup
```{example, label='cap-experiment', name="Bottle Cap Production"}
Consider a randomized experiment designed to study the effects of temperature and pressure on the diameter of a manufactured bottle cap.
```
```{example, label='rental', name="Rental Prices"}
Consider the prediction of rental prices given an apartment's attributes.
```
Both examples require some statistical model, but they are very different.
The first is a _causal inference_ problem: we want to design an intervention, so we need to recover the causal effect of temperature and pressure.
The second is a [prediction](https://en.wikipedia.org/wiki/Prediction) problem, a.k.a. a [forecasting](https://en.wikipedia.org/wiki/Forecasting) problem, in which we don't care about the causal effects, we just want good predictions.
In this chapter we discuss the causal problem in Example \@ref(exm:cap-experiment).
This means that when we assume a model, we assume it is the actual _data generating process_, i.e., we assume the _sampling distribution_ is well specified.
In the econometric literature, these are the [structural equations](https://en.wikipedia.org/wiki/Structural_equation_modeling).
The second type of problems is discussed in the Supervised Learning Chapter \@ref(supervised).
Here are some more examples of the types of problems we are discussing.
```{example, name="Plant Growth"}
Consider the treatment of various plants with various fertilizers to study the fertilizer's effect on growth.
```
```{example, name="Return to Education"}
Consider the study of return to education by analyzing the incomes of individuals with different education years.
```
```{example, name="Drug Effect"}
Consider the study of the effect of a new drug for hemophilia, by analyzing the level of blood coagulation after the administration of various amounts of the new drug.
```
Let's present the linear model.
We assume that a response^[The "response" is also known as the "dependent" variable in the statistical literature, or the "labels" in the machine learning literature.] variable is the sum of effects of some factors^[The "factors" are also known as the "independent variable", or "the design", in the statistical literature, and the "features", or "attributes" in the machine learning literature.].
Denoting the response variable by $y$, the factors by $x=(x_1,\dots,x_p)$, and the effects by $\beta:=(\beta_1,\dots,\beta_p)$, the linear model assumption implies that the expected response is the sum of the factors' effects:
\begin{align}
E[y]=x_1 \beta_1 + \dots + x_p \beta_p = \sum_{j=1}^p x_j \beta_j = x'\beta .
(\#eq:linear-mean)
\end{align}
Clearly, there may be other factors that affect the caps' diameters.
We thus introduce an error term^[The "error term" is also known as the "noise", or the "common causes of variability".], denoted by $\varepsilon$, to capture the effects of all unmodeled factors and measurement error^[You may philosophize if the measurement error is a mere instance of unmodeled factors or not, but this has no real implication for our purposes.].
The implied generative process of a sample of $i=1,\dots,n$ observations is thus
\begin{align}
y_i = x_i'\beta + \varepsilon_i = \sum_j x_{i,j} \beta_j + \varepsilon_i , \quad i=1,\dots,n .
(\#eq:linear-observed)
\end{align}
or in matrix notation
\begin{align}
y = X \beta + \varepsilon .
(\#eq:linear-matrix)
\end{align}
Let's demonstrate Eq.\@ref(eq:linear-observed).
In our bottle-caps example [\@ref(exm:cap-experiment)], we design an experiment where we produce bottle caps at varying temperatures.
Let $x_i$ be the temperature at which bottle-cap $i$ was manufactured.
Let $y_i$ be its measured diameter.
By the linear model assumption, the expected diameter varies linearly with the temperature: $\mathbb{E}[y_i]=\beta_0 + x_i \beta_1$.
This implies that $\beta_1$ is the (expected) change in diameter due to a unit change in temperature.
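To make this generative process concrete, here is a minimal simulation sketch of Eq.\@ref(eq:linear-observed) for the bottle-cap example; the particular values of $\beta_0$, $\beta_1$, and $\sigma$, and the temperature range, are made up for illustration.
```{r}
set.seed(1)
n <- 100
temp <- runif(n, min=150, max=250)                    # manufacturing temperatures (made-up range)
beta0 <- 30; beta1 <- 0.01; sigma <- 0.5              # made-up "true" parameters
diameter <- beta0 + beta1*temp + rnorm(n, sd=sigma)   # Eq. (linear-observed)
head(cbind(temp, diameter))
```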
```{remark}
In [Galton's](https://en.wikipedia.org/wiki/Regression_toward_the_mean) classical regression problem, where we seek the relation between the heights of sons and fathers, we have $p=1$, $y_i$ is the height of the $i$'th father, and $x_i$ the height of the $i$'th son.
This is a prediction problem, more than it is a causal-inference problem.
```
There are many reasons linear models are very popular:
1. Before the computer age, these were pretty much the only models that could actually be computed^[By "computed" we mean what statisticians call "fitted", or "estimated", and computer scientists call "learned".].
The whole Analysis of Variance (ANOVA) literature is an instance of linear models, that relies on sums of squares, which do not require a computer to work with.
1. For purposes of prediction, where the actual data generating process is not of primary importance, they are popular because they simply work.
Why is that?
They are simple so that they do not require a lot of data to be computed.
Put differently, they may be biased, but their variance is small enough to make them more accurate than other models.
1. For non-continuous predictors, __any__ functional relation can be cast as a linear model.
1. For the purpose of _screening_, where we only want to show the existence of an effect, and are less interested in the magnitude of that effect, a linear model is enough.
1. If the true generative relation is not linear, but smooth enough, then the linear function is a good approximation via Taylor's theorem.
There are still two matters we have to attend to:
(i) How to estimate $\beta$?
(ii) How to perform inference?
In the simplest linear models the estimation of $\beta$ is done using the method of least squares. A linear model with least squares estimation is known as Ordinary Least Squares (OLS).
The OLS problem:
\begin{align}
\hat \beta:= \operatorname{argmin}_\beta \left\{ \sum_i (y_i-x_i'\beta)^2 \right\},
(\#eq:ols)
\end{align}
and in matrix notation
\begin{align}
\hat \beta:= \operatorname{argmin}_\beta \left\{ \Vert y-X\beta \Vert^2_2 \right\}.
(\#eq:ols-matrix)
\end{align}
```{remark}
Personally, I prefer the matrix notation because it is suggestive of the geometry of the problem.
The reader is referred to @friedman2001elements, Section 3.2, for more on the geometry of OLS.
```
Different software suites, and even different R packages, solve Eq.\@ref(eq:ols) in different ways, so we skip the details of how exactly it is solved.
These are discussed in Chapters \@ref(algebra) and \@ref(convex).
The last matter we need to attend to is how to do inference on $\hat \beta$.
For that, we will need some assumptions on $\varepsilon$.
A typical set of assumptions is the following:
1. __Independence__: we assume $\varepsilon_i$ are independent of everything else.
Think of them as the measurement error of an instrument: it is independent of the measured value and of previous measurements.
1. __Centered__: we assume that $E[\varepsilon]=0$, meaning there is no systematic error; this is sometimes called the "linearity assumption".
1. __Normality__: we will typically assume that $\varepsilon \sim \mathcal{N}(0,\sigma^2)$, but we will later see that this is not really required.
We emphasize that these assumptions are only needed for inference on $\hat \beta$ and not for the estimation itself, which is done by the purely algorithmic framework of OLS.
Given the above assumptions, we can apply some probability theory and linear algebra to get the distribution of the estimation error:
\begin{align}
\hat \beta - \beta \sim \mathcal{N}(0, (X'X)^{-1} \sigma^2).
(\#eq:ols-distribution)
\end{align}
The reason I am not too strict about the normality assumption above is that Eq.\@ref(eq:ols-distribution) is approximately correct even if $\varepsilon$ is not normal, provided that there are many more observations than factors ($n \gg p$).
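For readers who want to see the algebra in action, here is a minimal sketch of the textbook closed-form solution $\hat\beta=(X'X)^{-1}X'y$ and the standard errors implied by Eq.\@ref(eq:ols-distribution), on simulated data with made-up parameters. This is not how `lm` actually solves the problem; it is only meant to connect the formulas with numbers.
```{r}
set.seed(1)
n <- 100; p <- 2
X <- cbind(1, runif(n))                                   # design matrix with an intercept column
beta <- c(2, -1)                                          # made-up "true" coefficients
y <- X %*% beta + rnorm(n)                                # generate data from the linear model
beta.hat <- solve(crossprod(X), crossprod(X, y))          # (X'X)^{-1} X'y
sigma2.hat <- sum((y - X %*% beta.hat)^2) / (n - p)       # unbiased estimate of sigma^2
se.hat <- sqrt(diag(solve(crossprod(X)) * sigma2.hat))    # standard errors from Eq. (ols-distribution)
cbind(estimate = as.vector(beta.hat), se = se.hat)
```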
## OLS Estimation in R
We are now ready to estimate some linear models with R.
We will use the `whiteside` data from the __MASS__ package, recording the outside temperature and gas consumption, before and after an apartment's insulation.
```{r, cache=TRUE}
library(MASS) # load the package
library(data.table) # for some data manipulations
data(whiteside) # load the data
head(whiteside) # inspect the data
```
We do the OLS estimation on the pre-insulation data with the `lm` function (short for Linear Model), possibly the most important function in R.
```{r}
library(data.table)
whiteside <- data.table(whiteside)
lm.1 <- lm(Gas~Temp, data=whiteside[Insul=='Before']) # OLS estimation
```
Things to note:
- We used the tilde syntax `Gas~Temp`, reading "gas as linear function of temperature".
- The `data` argument tells R where to look for the variables Gas and Temp.
We used `Insul=='Before'` to subset observations before the insulation.
- The result is assigned to the object `lm.1`.
Like any other language, spoken or programmable, there are many ways to say the same thing. Some more elegant than others...
```{r, eval=FALSE}
lm.1 <- lm(Gas~Temp, data=whiteside, subset=Insul=='Before') # via the subset argument
lm.1 <- lm(whiteside[whiteside$Insul=='Before',]$Gas ~ whiteside[whiteside$Insul=='Before',]$Temp)
lm.1 <- whiteside[whiteside$Insul=='Before',] %>% lm(Gas~Temp, data=.) # requires the magrittr pipe
```
The output is an object of class `lm`.
```{r}
class(lm.1)
```
Objects of class `lm` are very complicated.
They store a lot of information which may be used for inference, plotting, etc.
The `str` function, short for "structure", shows us the various elements of the object.
```{r}
str(lm.1)
```
In RStudio it is particularly easy to extract elements: just type `your.object$` and press `tab` after the `$` for auto-completion.
If we only want $\hat \beta$, it can also be extracted with the `coef` function.
```{r}
coef(lm.1)
```
Things to note:
- R automatically adds an `(Intercept)` term.
This means we estimate $Gas=\beta_0 + \beta_1 Temp + \varepsilon$ and not $Gas=\beta_1 Temp + \varepsilon$.
This makes sense because we are interested in the contribution of the temperature to the variability of the gas consumption about its __mean__, and not about zero.
- The effect of temperature, i.e., $\hat \beta_1$, is `r round(coef(lm.1)[[2]],2)`.
The negative sign means that the higher the temperature, the less gas is consumed.
The magnitude of the coefficient means that for a unit increase in the outside temperature, the gas consumption decreases by `r abs(round(coef(lm.1)[[2]],2))` units.
We can use the `predict` function to make predictions, but we emphasize that if the purpose of the model is to make predictions, and not interpret coefficients, better skip to the Supervised Learning Chapter \@ref(supervised).
```{r, results='hold'}
# Gas predictions (b0+b1*temperature) vs. actual Gas measurements, ideal slope should be 1.
plot(predict(lm.1)~whiteside[Insul=='Before',Gas])
# plots identity line (slope 1), lty=Line Type, 2 means dashed line.
abline(0,1, lty=2)
```
The model seems to fit the data nicely.
A common measure of the goodness of fit is the _coefficient of determination_, more commonly known as the $R^2$.
(ref:R2) $R^2$.
```{definition, name='(ref:R2)'}
The coefficient of determination, denoted $R^2$, is defined as
\begin{align}
R^2:= 1-\frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2},
\end{align}
where $\hat y_i$ is the model's prediction, $\hat y_i = x_i' \hat \beta$.
```
It can be easily computed
```{r}
library(magrittr)
R2 <- function(y, y.hat){
  numerator <- (y-y.hat)^2 %>% sum
  denominator <- (y-mean(y))^2 %>% sum
  1-numerator/denominator
}
R2(y=whiteside[Insul=='Before',Gas], y.hat=predict(lm.1))
```
This is a nice result implying that about $94\%$ of the variability in gas consumption can be attributed to changes in the outside temperature.
Obviously, R does provide the means to compute something as basic as $R^2$, but I will let you find it for yourselves.
## Inference
To perform inference on $\hat \beta$, in order to test hypotheses and construct confidence intervals, we need to quantify the uncertainty in the reported $\hat \beta$.
This is exactly what Eq.\@ref(eq:ols-distribution) gives us.
Luckily, we don't need to manipulate multivariate distributions manually, and everything we need is already implemented.
The most important function is `summary` which gives us an overview of the model's fit.
We emphasize that fitting a model with `lm` is an assumption free algorithmic step.
Inference using `summary` is __not__ assumption free, and requires the set of assumptions leading to Eq.\@ref(eq:ols-distribution).
```{r}
summary(lm.1)
```
Things to note:
- The estimated $\hat \beta$ is reported in the `Coefficients` table, which has the point estimates, standard errors, t-statistics, and p-values of a two-sided hypothesis test for each coefficient, $H_{0,j}:\beta_j=0,\; j=1,\dots,p$.
- The $R^2$ is reported at the bottom. The "Adjusted R-squared" is a variation that compensates for the model's complexity.
- The original call to `lm` is saved in the `Call` section.
- Some summary statistics of the residuals ($y_i-\hat y_i$) in the `Residuals` section.
- The "residuals standard error"^[Sometimes known as the Root Mean Squared Error (RMSE).] is $\sqrt{(n-p)^{-1} \sum_i (y_i-\hat y_i)^2}$. The denominator of this expression is the _degrees of freedom_, $n-p$, which can be thought of as the hardness of the problem.
As the name suggests, `summary` is merely a summary; the full `summary(lm.1)` object is monstrous.
Its various elements can be queried using `str(summary(lm.1))`.
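To connect the residual standard error formula above with R's output, here is a quick sanity check (a minimal sketch using standard extractor functions):
```{r}
n.minus.p <- df.residual(lm.1)                       # the degrees of freedom, n-p
rse.manual <- sqrt(sum(resid(lm.1)^2) / n.minus.p)   # the residual standard error formula above
c(manual = rse.manual, from.summary = summary(lm.1)$sigma)
```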
Can we check the assumptions required for inference?
Some.
Let's start with the linearity assumption.
If we were wrong, and the data is not arranged around a straight line, the residuals will have some shape. We thus plot the residuals as a function of the predictor to diagnose this shape.
```{r, results='hold'}
# errors (epsilons) vs. temperature, should oscillate around zero.
plot(residuals(lm.1)~whiteside[Insul=='Before',Temp])
abline(0,0, lty=2)
```
I can't say I see any shape.
Let's fit a __wrong__ model, just to see what "shape" means.
```{r}
lm.1.1 <- lm(Gas~I(Temp^2), data=whiteside[Insul=='Before',])
plot(residuals(lm.1.1)~whiteside[Insul=='Before',Temp]); abline(0,0, lty=2)
```
Things to note:
- We used `I(Temp^2)` to specify the model $Gas=\beta_0 + \beta_1 Temp^2+ \varepsilon$.
- The residuals have a "belly".
Because they are not a shapeless cloud around the zero line, we conclude that we have the wrong model.
To the next assumption.
We assumed $\varepsilon_i$ are independent of everything else.
The residuals, $y_i-\hat y_i$, can be thought of as a sample of $\varepsilon_i$.
When diagnosing the linearity assumption, we already saw their distribution does not vary with the $x$'s, `Temp` in our case.
They may, however, be correlated with themselves: a positive departure from the model may be followed by a series of positive departures, etc.
Diagnosing these _auto-correlations_ is a real art, which is not part of our course.
The last assumption we required is normality.
As previously stated, if $n \gg p$, this assumption can be relaxed.
If $n$ is in the order of $p$, we need to verify this assumption.
My favorite tool for this task is the _qqplot_.
A qqplot compares the quantiles of the sample with the respective quantiles of the assumed distribution.
If quantiles align along a line, the assumed distribution is OK.
If quantiles depart from a line, then the assumed distribution does not fit the sample.
```{r}
qqnorm(resid(lm.1))
```
Things to note:
- The `qqnorm` function plots a qqplot against a normal distribution. For non-normal distributions try `qqplot`.
- `resid(lm.1)` extracts the residuals from the linear model, i.e., the vector of $y_i-x_i'\hat \beta$.
Judging from the figure, the normality assumption is quite plausible.
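To make this visual judgment a bit easier, base R's `qqline` adds a reference line through the first and third quartiles; here is a small sketch.
```{r}
qqnorm(resid(lm.1))
qqline(resid(lm.1), lty=2) # reference line through the first and third quartiles
```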
Let's try the same on a non-normal sample, namely a uniformly distributed sample, to see how that would look.
```{r}
qqnorm(runif(100))
```
### Testing a Hypothesis on a Single Coefficient
The first inferential test we consider is a hypothesis test on a single coefficient.
In our gas example, we may want to test that the temperature has no effect on the gas consumption.
The answer for that is given immediately by `summary(lm.1)`
```{r}
summary.lm1 <- summary(lm.1)
coefs.lm1 <- summary.lm1$coefficients
coefs.lm1
```
We see that the p-value for $H_{0,1}: \beta_1=0$ against a two sided alternative is effectively `r round(coefs.lm1[2,4],2)` (row 2 column 4), so that $\beta_1$ is unlikely to be $0$ (the null hypothesis can be rejected).
### Constructing a Confidence Interval on a Single Coefficient
Since the `summary` function gives us the standard errors of $\hat \beta$, we can immediately compute $\hat \beta_j \pm 2 \sqrt{Var[\hat \beta_j]}$ to get ourselves a (roughly) $95\%$ confidence interval.
In our example the interval is
```{r}
coefs.lm1[2,1] + c(-2,2) * coefs.lm1[2,2]
```
Things to note:
- The function `confint(lm.1)` can calculate it for us, but sometimes it is simpler to write 20 characters of code than to find the function that does it; see the comparison below.
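For comparison, here is the same interval computed with `confint`, which uses exact t-quantiles rather than the rough $\pm 2$ rule, so the numbers will differ slightly.
```{r}
confint(lm.1, level=0.95) # t-based confidence intervals for all coefficients
```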
### Multiple Regression
```{remark}
_Multiple regression_ is not to be confused with _multivariate regression_ discussed in Chapter \@ref(multivariate).
```
The `swiss` dataset records the fertility in each of Switzerland's 47 French-speaking provinces, along with other socio-economic indicators. Let's see if these are statistically related:
```{r}
head(swiss)
lm.5 <- lm(data=swiss, Fertility~Agriculture+Examination+Education+Catholic+Infant.Mortality)
summary(lm.5)
```
Things to note:
- The `~` syntax allows us to specify various predictors separated by the `+` operator.
- The summary of the model now reports the estimated effect, i.e., the regression coefficient, of each of the variables.
Clearly, naming each variable explicitly is a tedious task if there are many. The use of `Fertility~.` in the next example reads: "Fertility as a function of all other variables in the `swiss` data.frame".
```{r}
lm.5 <- lm(data=swiss, Fertility~.)
summary(lm.5)
```
### ANOVA (*)
Our next example^[The example is taken from http://rtutorialseries.blogspot.co.il/2011/02/r-tutorial-series-two-way-anova-with.html] contains a hypothetical sample of $60$ participants who are divided into three stress reduction treatment groups (mental, physical, and medical) and three age groups.
The stress reduction values are represented on a scale that ranges from 1 to 10.
The values represent how effective the treatment programs were at reducing participants' stress levels, with larger values indicating higher effectiveness.
```{r, cache=TRUE}
twoWay <- read.csv('data/dataset_anova_twoWay_comparisons.csv')
head(twoWay)
```
How many observations per group?
```{r}
table(twoWay$Treatment, twoWay$Age)
```
Since we have two factors as predictors, this multiple regression is nothing but a _two-way ANOVA_.
Let's fit the model and inspect it.
```{r}
lm.2 <- lm(StressReduction~.,data=twoWay)
summary(lm.2)
```
Things to note:
- The `StressReduction~.` syntax is read as "Stress reduction as a function of everything else".
- All the (main) effects and the intercept seem to be significant.
- Mid age and medical treatment are missing, hence it is implied that they are the baseline, and this model accounts for the departure from this baseline.
- The data has 2 factors, but the coefficients table has 4 predictors. This is because `lm` noticed that `Treatment` and `Age` are factors. Each level of each factor is thus encoded as a different (dummy) variable.
The numerical values of the factors are meaningless.
Instead, R has constructed a dummy variable for each level of each factor.
The names of the effect are a concatenation of the factor's name, and its level.
You can inspect these dummy variables with the `model.matrix` command.
```{r}
model.matrix(lm.2) %>% lattice::levelplot()
```
If you don't want the default dummy coding, look at `?contrasts`.
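For instance, here is a sketch of how one could inspect the current coding of `Treatment` and refit with sum-to-zero contrasts instead of the default dummy coding (assuming `Treatment` is stored as a factor, which is why we coerce it first).
```{r, eval=FALSE}
twoWay$Treatment <- factor(twoWay$Treatment)  # make sure R treats it as a factor
contrasts(twoWay$Treatment)                   # the default treatment (dummy) coding
# refit with sum-to-zero contrasts for Treatment
lm.2.sum <- lm(StressReduction~., data=twoWay,
               contrasts=list(Treatment="contr.sum"))
coef(lm.2.sum)
```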
If you are more familiar with the ANOVA literature, or you do not want the effect of each level separately but rather the effect of __all__ the levels of a factor jointly, use the `anova` command.
```{r}
anova(lm.2)
```
Things to note:
- The ANOVA table, unlike the `summary` function, tests if __any__ of the levels of a factor has an effect, and not one level at a time.
- The significance of each factor is computed using an F-test.
- The degrees of freedom, which encode the number of levels of a factor, are given in the `Df` column.
- The StressReduction seems to vary for different ages and treatments, since both factors are significant.
If you are much more comfortable with the ANOVA literature, you could have replaced the `lm` command with the `aov` command all along.
```{r}
lm.2.2 <- aov(StressReduction~.,data=twoWay)
class(lm.2.2)
summary(lm.2.2)
```
Things to note:
- The `lm` function has been replaced with an `aov` function.
- The output of `aov` is an `aov` class object, which extends the `lm` class.
- The summary of an `aov` is not like the summary of an `lm` object, but rather, like an ANOVA table.
As in any two-way ANOVA, we may want to ask if different age groups respond differently to different treatments.
In the statistical parlance, this is called an _interaction_, or more precisely, an _interaction of order 2_.
```{r}
lm.3 <- lm(StressReduction~Treatment+Age+Treatment:Age-1,data=twoWay)
```
The syntax `StressReduction~Treatment+Age+Treatment:Age-1` tells R to include the main effects of Treatment and Age, and their interaction. The `-1` removes the intercept.
Here are other ways to specify the same model.
```{r, eval=FALSE}
lm.3 <- lm(StressReduction ~ Treatment * Age - 1,data=twoWay)
lm.3 <- lm(StressReduction~(.)^2 - 1,data=twoWay)
```
The syntax `Treatment * Age` means "main effects with second order interactions".
The syntax `(.)^2` means "everything with second-order interactions". This time we do not wrap the term in `I()`, as in the temperature example, because we want the second-order interactions and not the square of each variable.
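As a quick sanity check, the following sketch verifies that the alternative formulas indeed produce the same fit (the object `lm.3.star` is introduced here only for the comparison).
```{r}
lm.3.star <- lm(StressReduction ~ Treatment * Age - 1, data=twoWay)
all.equal(fitted(lm.3), fitted(lm.3.star)) # TRUE: same model, different syntax
```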
Let's inspect the model
```{r}
summary(lm.3)
```
Things to note:
- There are still $5$ main effects, but also $4$ interactions.
This is because when allowing a different average response for every $Treatment \times Age$ combination, we are effectively estimating $3 \times 3=9$ cell means, even if they are not parametrized as cell means, but rather as main effects and interactions.
- The interactions do not seem to be significant.
- The assumptions required for inference are clearly not met in this example, which is there just to demonstrate R's capabilities.
Asking if all the interactions are significant is asking whether the different age groups respond differently to the different treatments.
Can we answer that based on the various interactions?
We might, but it is possible that no single interaction is significant, while the combination is.
To test for all the interactions together, we can simply check whether the model with interactions is (significantly) better than the model without them, i.e., compare `lm.2` to `lm.3`.
This is done with the `anova` command.
```{r}
anova(lm.2,lm.3, test='F')
```
We see that `lm.3` is __not__ (significantly) better than `lm.2`, so we find no evidence of interactions: different ages seem to have the same response to different treatments.
### Testing a Hypothesis on a Single Contrast (*)
Let's return to the model without interactions, `lm.2`.
```{r}
coef(summary(lm.2))
```
We see that the effect of the various treatments is rather similar.
It is possible that all treatments actually have the same effect.
Comparing the effects of factor levels is called a _contrast_.
Let's test if the medical treatment has, in fact, the same effect as the physical treatment.
```{r}
library(multcomp)
my.contrast <- matrix(c(-1,0,1,0,0), nrow = 1)
lm.4 <- glht(lm.2, linfct=my.contrast)
summary(lm.4)
```
Things to note:
- A contrast is a linear function of the coefficients. In our example $H_0:\beta_1-\beta_3=0$, which justifies the construction of `my.contrast`.
- We used the `glht` function (generalized linear hypothesis test) from the package __multcomp__.
- The contrast is significant, i.e., the effect of a medical treatment, is different than that of a physical treatment.
## Extra Diagnostics
### Diagnosing Heteroskedasticity
Textbook assumptions for inference on $\hat \beta_{OLS}$ include the _homoskedasticity_ assumption, i.e., that $Var[\varepsilon]$ is fixed and independent of everything else.
This comes from viewing $\varepsilon$ as a measurement error.
It may not be the case when viewing $\varepsilon$ as "all other effect not included in the model".
In technical terms, homoskedasticity implies that $Var[\varepsilon]$ is a scaled identity matrix.
Heteroskedasticity means that $Var[\varepsilon]$ is a diagonal matrix with non-constant diagonal entries.
Because a scaled identity matrix implies that the density contours of a multivariate Gaussian are spheres, testing for heteroskedasticity is also known as a _Sphericity Test_.
Can we verify homoskedasticity, and do we need it?
To verify homoskedasticity we only need to look at the residuals of the model: if they seem to have the same variability for all $x$, we are fine.
If $x$ is multivariate, so that we cannot visualise the residuals, $y_i-\hat y_i$, as a function of $x$, then visualising them as a function of the fitted values, $\hat y_i$, is also good, as in the sketch below.
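Here is a minimal sketch of such a residuals-versus-fitted plot, using the `swiss` model `lm.5` fitted earlier:
```{r}
# residuals vs. fitted values; a roughly constant spread suggests homoskedasticity
plot(resid(lm.5) ~ fitted(lm.5)); abline(0, 0, lty=2)
```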
Another way of dealing with heteroskedasticity is by estimating variances for groups of observations separately.
This is the _Cluster Robust Standard Errors_ discussed in \@ref(cr-se).
Can we perform a test to infer homoskedasticity?
In the frequentist hypothesis testing framework we can only reject homoskedasticity, not accept it.
In the [Bayesian hypothesis testing](https://en.wikipedia.org/wiki/Bayesian_inference) framework we can indeed infer homoskedasticity, but one would have to defend one's priors.
For some tests that detect heteroskedasticity see the [olsrr](https://cran.r-project.org/web/packages/olsrr/vignettes/heteroskedasticity.html) package.
For an econometric flavored approach to the problem, see the [plm](https://cran.r-project.org/package=plm) package, and its excellent [vignette](https://cran.r-project.org/web/packages/plm/vignettes/plmPackage.html).
### Diagnosing Multicollinearity
When designing an experiment (e.g., [RCTs](https://en.wikipedia.org/wiki/Randomized_controlled_trial)), we will ensure treatments are "balanced", so that the effect estimates are not correlated.
This is not always possible, especially not in observational studies.
If various variables capture the same effect, the certainty in the effect will "spread" over these variables.
Formally: the standard errors of the effect estimates will increase.
Perhaps more importantly, causal inference with correlated predictors is very hard to interpret, because changes in the outcome may be attributed to any one of the (correlated) predictors.
We will eschew the complicated philosophical implications of causal inference with correlated predictors, and merely refer the reader to the package [olsrr](https://cran.r-project.org/web/packages/olsrr/vignettes/regression_diagnostics.html) for some popular tools to diagnose multicollinearity.
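As a first, informal diagnostic (a sketch, not a formal test), one may simply inspect the pairwise correlations among the predictors of the `swiss` model:
```{r}
# correlations among the predictors; Fertility, the response, is the first column and is excluded
round(cor(swiss[, -1]), 2)
```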
## How Does R Encode Factor Variables?
TODO.
https://cran.r-project.org/web/packages/codingMatrices/vignettes/codingMatrices.pdf
## Bibliographic Notes
Like any other topic in this book, you can consult @venables2013modern for more on linear models.
For the theory of linear models, I like @greene2003econometric.
## Practice Yourself
1. Inspect women's heights and weights with `?women`.
1. What is the change in weight per unit change in height? Use the `lm` function.
1. Is the relation of height on weight significant? Use `summary`.
1. Plot the residuals of the linear model with `plot` and `resid`.
1. Plot the predictions of the model using `abline`.
1. Inspect the normality of residuals using `qqnorm`.
1. Inspect the design matrix using `model.matrix`.
1. Write a function that takes an `lm` class object, and returns the confidence interval on the first coefficient. Apply it on the height and weight data.
1. Use the `anova` function to test the significance of the effect of height.
1. Read about the `mtcars` dataset using `?mtcars`. Inspect the dependency of the fuel consumption (mpg) on the weight (wt) and the 1/4 mile time (qsec).
1. Make a pairs scatter plot with `plot(mtcars[,c("mpg","wt","qsec")])`. Does the relation look linear?
1. Fit a multiple linear regression with `lm`. Call it `model1`.
1. Try to add the transmission (am) as an independent variable. Let R know this is a categorical variable with `factor(am)`. Call it `model2`.
1. Compare the “Adjusted R-squared” measure of the two models (we can’t use the regular R2 to compare two models with a different number of variables).
1. Are the coefficients significant?
1. Inspect the normality of residuals and the linearity assumptions.
1. Now inspect the hypothesis that the effect of weight differs between the transmission types, by adding an interaction to the model: `wt*factor(am)`.
1. According to this model, what is the effect of an additional unit of weight on the fuel consumption under a manual transmission ($-2.973-4.141=-7.11$)?