---
title: "Introduction to Linear Modelling in R"
output:
  learnr::tutorial:
    css: "css/style.css"
    progressive: true
    allow_skip: true
runtime: shiny_prerendered
---
```{r setup, include=FALSE}
# Author: Al Morgan
# Date: Oct 2024
# R Version: 4.4.2
# Learnr Package Documentation: https://rstudio.github.io/learnr/
# Packages
library(learnr)
library(gradethis)
library(dplyr)
library(janitor)
library(stringr)
library(openxlsx)
library(ggplot2)
library(phsstyles)
library(modelr)
library(ggfortify)
library(GGally)
knitr::opts_chunk$set(echo = FALSE)
```
```{r phs-logo, fig.align='right', out.width="40%"}
knitr::include_graphics("images/phs-logo.png")
```
## Introduction
Welcome to an introduction to linear modelling in R. This course is designed as a self-directed introduction to statistical modelling for anyone in Public Health Scotland.
<br>
::: info_box
<h4>Course Info</h4>
<ul>
<li>
This course is built to flow through sections and build on previous knowledge. If you're comfortable with a particular section, you can skip it.
</li>
<li>
Some sections have multiple parts to them. Navigate the course by using the buttons at the bottom of the screen to Continue or go to the Next Topic.
</li>
<li>
The course will also show your progress through sections: a green tick will appear on sections you've completed, and it will remember your place if you decide to close your browser and come back later.
</li>
</ul>
:::
```{r echo=FALSE, fig.align='center', out.width="75%"}
knitr::include_graphics("images/illustrations/presenting_monster.png")
```
## What is modelling?
Modelling is a broad term, encompassing many methods. In particular, statistical modelling consists of using mathematical equations and algorithms to represent, analyse and make predictions from data.
By using data we already have, we can make estimates about results we don't have. For example, if you sell 100 ice creams on a 20°C day and 200 ice creams on a 30°C day, you could use modelling to predict how many ice creams you would sell on a 25°C day (around 150).
This learning material will cover linear models, i.e. modelling relationships that can be expressed as a straight line. Imagine temperature on the x-axis and ice creams sold on the y-axis. A straight line on this plot could give us a prediction of how many ice creams we would sell at any given temperature.
```{r, echo=FALSE, eval=TRUE, warning=FALSE, fig.align='center', fig.alt="A line plot showing temperature (degrees Celsius) on the x-axis and 'number of ice creams sold' on the y-axis. A vertical dashed line at 25 degrees meets a horizontal dashed line at 150 ice creams, indicating the prediction mentioned in the text."}
# create a dataframe with the x and y values
data <- tibble(
temperature_celsius = c(10, 20, 30),
n_ice_creams_sold = c(0, 100, 200)
)
# plot the points and line
data %>%
ggplot(aes(x = temperature_celsius, y = n_ice_creams_sold)) +
geom_line(colour = phs_colours("phs-rust"), linewidth = 2) +
geom_point(size = 3) +
# Vertical dashed line at x = 25 up to the predicted y value
annotate("segment",
x = 25, xend = 25,
y = min(data$n_ice_creams_sold), yend = 150,
linetype = "dashed", linewidth = 1.5, color = phs_colours("phs-blue")) +
# Horizontal dashed line from the y-axis to the predicted point
annotate("segment",
x = min(data$temperature_celsius), xend = 25,
y = 150, yend = 150,
linetype = "dashed", linewidth = 1.5, color = phs_colours("phs-blue")) +
theme_minimal() +
labs(x = "\nTemperature (\u00B0C)",
y = "Number of ice creams sold\n")
```
<br>
We will begin by covering **correlations** and the **algebra of straight lines** before moving on to building actual models, as these topics greatly aid understanding of linear modelling. If you're familiar with these topics, though, feel free to skip ahead.
More advanced modelling techniques (those that we might call **machine learning**) such as decision trees and neural networks are not covered here.
<br>
## Correlation
Let's start with correlation, a term you may be familiar with. Two numeric variables in our data are correlated when, as one increases in value, the other noticeably increases or decreases in value. We can often spot this by glancing at a scatter plot and seeing a 'direction' to the data points.
```{r echo=FALSE, fig.align='center', out.width="100%", fig.alt="Three scatterplots are shown: the first showing a positive correlation, the second a negative correlation, and the third no correlation. Lines-of-best-fit are included."}
knitr::include_graphics("images/intro_modelling/correlations.png")
```
As the image above shows:
- A *positive correlation* is when the y-variable increases as the x-variable increases.
- A *negative correlation* is when the y-variable decreases as the x-variable increases.
- When there is no discernible 'direction' to the data, we describe this as having *no correlation*.
<br>
If we notice that our data is linearly correlated (i.e. trends roughly in a line), we could begin building a model to see if one of our variables is capable of **predicting** the other.
<br>
We can assign a value to the direction and magnitude of the correlation by using the `cor()` function inside a `summarise()` to calculate **Pearson's correlation coefficient (**$r$**)**.
### How do we interpret correlation coefficients?
$r$ is bound between $-1$ and $1$, with a positive value indicating a positive correlation and a negative value indicating a negative correlation.
As for the magnitude (i.e. size) of $r$, J. D. Evans (1996) suggests the following scale:
| magnitude of $r_{xy}$ | strength of correlation |
|-----------------------|-------------------------|
| 0 | none |
| 0.01 - 0.19 | very weak |
| 0.20 - 0.39 | weak |
| 0.40 - 0.59 | moderate |
| 0.60 - 0.79 | strong |
| 0.80 - 0.99 | very strong |
| 1 | perfect |
<br>
Let's calculate the correlation coefficient for the `wt` (weight) and `mpg` (miles-per-gallon) variables in the `mtcars` dataset:
```{r echo=TRUE, eval=TRUE}
mtcars %>%
summarise(cor(mpg, wt))
```
This value of -0.87 tells us that these variables are very strongly negatively correlated. (Note: the order of the variables does not matter when using the `cor()` function.)
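A quick sketch to see this symmetry for yourself:
```{r, echo=TRUE, eval=FALSE}
# cor() is symmetric, so swapping the variable order gives the same value
mtcars %>%
  summarise(r_mpg_wt = cor(mpg, wt),
            r_wt_mpg = cor(wt, mpg))
```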
<br>
A word of warning: you can use `cor()` on any two variables, regardless of what the real trend (if any) may be! The four plots below each have the same correlation coefficient - even though you can see that most of them are not linear trends.
```{r echo=FALSE, fig.align='center', out.width="100%", fig.alt="Anscombe's quartet of scatterplots is shown. All four scatterplots have the same correlation coefficient, but only the first is linear and therefore appropriate to measure using this method."}
knitr::include_graphics("images/intro_modelling/anscombes_quartet.png")
```
The lesson here is: always visualise your data first to see if a correlation coefficient is a suitable metric to use! This same logic applies to whether we should build a linear regression model from our data or not, but we will get onto that.
```{r corr-quiz}
quiz(
question("If we use `cor()` to calculate $r$ for our linear data and get a value of 0.43, how should we interpret this?",
answer("A moderate positive correlation", correct = TRUE),
answer("A strong negative correlation"),
answer("A weak positive correlation"),
answer("No correlation"),
incorrect = "Not quite, have another go!",
allow_retry = TRUE,
random_answer_order = TRUE
)
)
```
<br>
#### Correlation is not causation
This often-used phrase simply means that just because two of our variables may be correlated does not automatically mean that a change in one *causes* a change in the other.
- There may be a 'hidden variable' (also known as a confounding variable) that is causing the observed changes to both of our variables of interest. For example, it may appear as though increased ice cream sales and increased sunglasses sales are linked, but temperature (or, e.g. number of sunny days) is the likely hidden variable affecting both.
- We don't usually know the direction of any effects without rigorous experimentation. For example, does more exercise lead to increased happiness, or do happier people exercise more?
Accidental correlations, or those with hidden/confounding variables, crop up in all kinds of places and can often be quite entertaining. See [Data Science Central - spurious correlation examples](https://www.datasciencecentral.com/spurious-correlations-15-examples/) to see some examples of when correlation almost certainly does not mean causation.
<br>
## Lines
Linear models are very useful tools for explaining and predicting our data. For them to be useful, however, it is important to understand the building blocks of a **line**.
The general equation for a line is:
::: {style="text-align: center;"}
$y$ = $a \times x$ + $b$
:::
We call $a$ the **gradient** (or slope) and $b$ the **intercept**. With these two values, we can plug in any value of $x$ and calculate the corresponding value of $y$. Note: you may be familiar with the equation $y = mx + c$, which denotes the same thing.
We call $y$ a function of $x$. For example, a gradient of 2 and an intercept of 1 would give us the following equation (written as an R function):
```{r, echo=TRUE}
line_function <- function(x){
y <- 2*x + 1
print(y)
}
```
- Therefore, an $x$ of 0 would give a $y$ of 1:
```{r, echo=TRUE}
line_function(0)
```
- An $x$ of 1 would give a $y$ of 3:
```{r, echo=TRUE}
line_function(1)
```
- An $x$ of 2 would give a $y$ of 5:
```{r, echo=TRUE}
line_function(2)
```
We can look at these as (x,y) coordinates to be plotted:
(0, 1), (1, 3), (2, 5)
Plotting them would look like this:
```{r, echo=TRUE, fig.align='center', fig.alt="A plot with the coordinates above plotted and a line joining these."}
# load in the necessary packages
library(dplyr)
library(ggplot2)
# create a dataframe with the x and y values
data <- tibble(
x_values = c(0, 1, 2),
y_values = c(1, 3, 5)
)
# plot the points and line
data %>%
ggplot(aes(x = x_values, y = y_values)) +
geom_point(size = 2) +
geom_line(colour = "red") + # plot x,y values
geom_vline(xintercept = 0) + # add a vertical line at x=0
geom_hline(yintercept = 0) + # add a horizontal line at y=0
theme_minimal()
```
This is the line that the equation $y = 2 \times x + 1$ represents! It's linear because each $x$ value plugged into the equation produces a corresponding $y$ value - joining these $x$-$y$ coordinates up gives us the line.
<br>
### Gradient
The gradient of the equation, referred to above as $a$, can be thought of as the answer to the question "How many units of $y$ do we travel if we travel one $x$ unit to the right?"
- A large positive gradient means a steep, upward-sloping line.
- e.g. $a = 6$ means +6 units upward along $y$ for every +1 unit rightward along $x$
- A large negative gradient means a steep, downward-sloping line.
- e.g. $a = -6$ means 6 units downward along $y$ for every +1 unit rightward along $x$
<br>
```{r gradient-quiz}
quiz(
question("Which one of these gradients would produce a line with a shallow descent?",
answer("$a = -0.5$", correct = TRUE),
answer("$a = -8$"),
answer("$a = 5$"),
answer("$a = 1$"),
incorrect = "Not quite, have another go!",
allow_retry = TRUE,
random_answer_order = TRUE
)
)
```
<br>
### Intercepts
We've covered the gradient, $a$, so now we will look at the intercept, $b$. This value simply tells us where the line will 'cross' the y-axis. In other words, when $x = 0$, what value is $y$?
Our line from before had an intercept of 1, and you can see from the plot above that the line crosses the y-axis at $y = 1$.
```{r intercept-quiz}
quiz(
question("From examining the line plot below, what is the intercept?",
answer("$b = 3$", correct = TRUE),
answer("$b = 0$"),
answer("$b = -5$"),
answer("$b = 4$"),
incorrect = "Not quite, have another go!",
allow_retry = TRUE,
random_answer_order = TRUE
)
)
```
```{r, echo=FALSE, fig.align='center', fig.alt="A simple line plot with 3 coordinates."}
# create a dataframe with the x and y values
data <- tibble(
x_values = c(0, 1, 2),
y_values = c(3, 7, 11)
)
# plot the points and line
data %>%
ggplot(aes(x = x_values, y = y_values)) +
geom_point() +
geom_line(colour = "red") +
geom_vline(xintercept = 0) +
geom_hline(yintercept = 0) +
theme_minimal()
```
<br>
Try playing around with the **gradient** and **intercept** values of the line function below to see the effects on the line that is subsequently plotted.
```{r line-building, exercise=TRUE, warning=FALSE, message=FALSE}
# Build the function - no need to change anything here for the exercise
line_function <- function(x, gradient, intercept){
# linear equation
y <- gradient*x + intercept
# create dataframe of x and y values
data <- tibble(
x, y
)
# plot line
plot <- data %>%
ggplot(aes(x, y)) +
geom_point() +
geom_line(colour = "red") +
geom_vline(xintercept = 0) +
geom_hline(yintercept = 0) +
theme_minimal()
print(plot)
}
# Change values below:
line_function(x = c(0:10), gradient = 1, intercept = 0)
```
<br>
## Linear regression fundamentals
You may be wondering, why all the algebra? Well, linear regression is really all about creating an equation for a line that best 'fits' your data. This equation is our **model** which, if good enough, we can use to make further predictions.
<br>
### Fitting a line to your data
Earlier in this course, we saw that the weight (`wt`) and miles-per-gallon (`mpg`) values in the `mtcars` data set were very strongly negatively correlated.
```{r, echo=FALSE, warning=FALSE, fig.align='center', fig.alt="A scatterplot of the `wt` and `mpg` variables from the `mtcars` dataset."}
mtcars %>%
ggplot(aes(x = wt, y = mpg)) +
geom_point() +
theme_minimal()
```
If you were to draw a line-of-best-fit that was as close to as many data points as possible, it would look something like this:
```{r mt-cars-plotter, exercise=TRUE, exercise.eval=TRUE, warning=FALSE, message=FALSE}
mtcars %>%
ggplot(aes(x = wt, y = mpg)) +
geom_point() +
coord_cartesian(ylim = c(0, NA)) +
geom_smooth(method = "lm", se = FALSE) +
theme_minimal()
```
<br>
By telling `geom_smooth()` that we want the `lm` method (i.e. linear model), we can add this best-fitting line to our plot. (Note: `se` stands for [standard error](https://www.statology.org/standard-error-regression/); setting it to FALSE removes the shaded confidence band around the line, which we don't need here.)
If we know the equation of this line, we can plug in any `wt` value and calculate the corresponding `mpg` value - thereby producing a prediction from our model. The linear regression model-building process is what created this line, and that's what we're now going to learn.
<br>
### Regression basics
Up until now, we've been examining lines with the form $y = a \times x + b$. In the world of linear regression, however, this equation is represented in the form:
::: {style="text-align: center;"}
$y$ = $b_0$ + $b_1x$
:::
Don't panic - this means exactly the same thing, except the intercept $b_0$ is shown first. The gradient $b_1$ is written next to $x$ without spaces to indicate multiplication.
Note: the act of applying a line-of-best-fit to our data points is known as **fitting a model to data**.
<br>
Let's make some sample data that we might want to fit a model to. Our numerical variables will be `height` (cm) and `weight` (kg).
```{r, echo=TRUE, eval=TRUE, fig.align='center', fig.alt="A scatterplot showing height and weight values from the created sample."}
# create variables
weight <- c(82, 65, 85, 76, 90, 92, 68, 83, 94, 74)
height <- c(176, 164, 181, 168, 195, 185, 166, 180, 188, 174)
# place them into a dataframe
measurements_sample <- tibble(
weight,
height
)
# create a scatter plot
measurements_sample %>%
ggplot(aes(x = weight, y = height)) +
geom_point() +
theme_minimal()
```
We've visualised our data, seen that it looks linear and correlated (we could calculate $r$ if we wanted) - perfect for building a linear regression model!
Our model will use an intercept $(b_0)$, gradient $(b_1)$, and predictor ($x$, here: $weight$) to predict values of $height$ ($y$):
::: {style="text-align: center;"}
$height$ = $b_0$ + $b_1weight$
:::
<br>
### Measuring 'best fit'
This sub-section covers the 'behind the scenes' calculations of building a linear regression model. In practice, we will simply use the `lm()` function (and we will get to that in the next sub-section!), but the concepts are very useful to know.
So we have our data frame with weight and height values. Let's create another line function and use it to predict height based on weight.
1. Create a line function
```{r, eval=TRUE, echo=TRUE}
line <- function(x, b0, b1){
return(b0 + b1*x)
}
```
<br>
2. Let's pick an intercept of 95 and a gradient of 1 and use the `line()` function to get **fitted heights** for our data
```{r eval=TRUE, echo=TRUE}
measurements_sample <- measurements_sample %>%
mutate(fit_height = line(weight, b0 = 95, b1 = 1))
measurements_sample
```
(see how we now have actual heights and fitted/predicted heights in our data)
<br>
3. Plot these fitted heights, along with the fitted line, and the sample data we initially created
```{r, message=FALSE, warning=FALSE, eval=TRUE, echo=TRUE, fig.align='center', fig.alt="A scatterplot showing height and weight values from the sample as filled circles, fitted values as open circles, and a red line through these fitted values to represent the linear model."}
measurements_sample %>%
ggplot(aes(x = weight, y = height)) +
geom_point() +
geom_point(aes(y = fit_height), shape = 1) +
geom_abline(slope = 1, intercept = 95, col = "red") +
geom_segment(aes(xend = weight, yend = fit_height), alpha = 0.5) +
theme_minimal()
```
On this graph, the solid circles are the sample data, the open circles are the fitted heights, and the fitted line is shown in red.
We also show the **residuals** as vertical line segments.
A **residual** is the difference between an actual outcome value and a fitted outcome value. These are also sometimes called **errors**, but 'residuals' is better, as the term 'error' gets used everywhere in statistics!
<br>
Task: If we wanted to add a `residual` column to our data, what calculation would we require? Enter the code into the `mutate()` below.
```{r setup-2, include=FALSE}
# need to recreate `sample` here for use in subsequent exercise code block
library(dplyr)
# create variables
weight <- c(82, 65, 85, 76, 90, 92, 68, 83, 94, 74)
height <- c(176, 164, 181, 168, 195, 185, 166, 180, 188, 174)
# place them into a dataframe
measurements_sample <- tibble(
weight,
height
)
# line function
line <- function(x, b0, b1){
return(b0 + b1*x)
}
# add predicted heights
measurements_sample <- measurements_sample %>%
mutate(fit_height = line(weight, b0 = 95, b1 = 1))
```
```{r residual-input, exercise=TRUE, exercise.setup="setup-2"}
measurements_sample <- measurements_sample %>%
mutate() # enter code here
```
```{r residual-input-check}
grade_this({
# Use `.result` to access the user's output
if (!"residual" %in% colnames(.result)) {
fail("The 'residual' column is missing. Did you use `mutate()`?")
}
# Check if the 'residual' column is calculated correctly
if (!isTRUE(all.equal(.result$residual, .result$height - .result$fit_height, check.attributes = FALSE))) {
fail("The 'residual' column is not calculated correctly - try again.")
}
# If all checks pass
pass("Great job! The 'residual' column is correctly calculated.")
})
```
<br>
### Residuals
To calculate the residuals created by our line/model, we subtract the fitted/predicted height values from the actual height values in our data (actual minus fitted is the usual convention; reversing the order just flips the sign).
```{r, echo=TRUE, eval=TRUE}
measurements_sample <- measurements_sample %>%
mutate(residual = height - fit_height)
measurements_sample
```
When we **fit** the linear model to the data, we vary $b_0$ and $b_1$ to make the residuals **as small as possible**. We don't need to worry about doing this manually, as the `lm()` function will handle this - it's more important to understand the concept!
<br>
Let's calculate the sum of the residuals for an overall measure of how our model is fitting:
```{r, echo=TRUE, eval=TRUE}
measurements_sample %>%
summarise(sum_residuals = sum(residual))
```
<br>
```{r residuals-quiz2}
quiz(
question("Scroll up and look at the residuals column - can you spot the problem with taking the sum of these?",
answer("Some values are positive and some are negative.", correct = TRUE),
answer("Some values are much larger than others."),
answer("Some values are odd and some are even."),
incorrect = "Not quite, have another go!",
allow_retry = TRUE,
random_answer_order = TRUE
)
)
```
<br>
### Sum of squares
If we simply summed the residuals, the positive and negative values would **cancel out**. For this reason, we tend to **square the residuals** before we sum them. This ensures no cancellation!
```{r, echo=TRUE, eval=TRUE}
measurements_sample <- measurements_sample %>%
mutate(sq_residual = residual^2)
measurements_sample
```
```{r, echo=TRUE, eval=TRUE}
measurements_sample %>%
summarise(sum_sq_residuals = sum(sq_residual))
```
Our fitting algorithms try to **minimise** this sum of squares, so the method is called **least squares**.
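To make 'least squares' a little more concrete, here is a minimal sketch (not part of the course workflow, and assuming the `measurements_sample` data above) that asks R's general-purpose optimiser, `optim()`, to search for the $b_0$ and $b_1$ giving the smallest sum of squared residuals. The `lm()` function we meet next does this far more efficiently.
```{r, echo=TRUE, eval=FALSE}
# define the quantity we want to minimise: the sum of squared residuals
sum_sq <- function(params, data){
  fitted <- params[1] + params[2] * data$weight
  sum((data$height - fitted)^2)
}
# let optim() search for the intercept and gradient with the smallest sum
best_fit <- optim(par = c(b0 = 95, b1 = 1), fn = sum_sq, data = measurements_sample)
best_fit$par # should land close to the coefficients lm() reports
```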
<br>
## Build a linear regression model
Hopefully you're now familiar with the way in which the fitting algorithm used by R seeks to **minimise the sum of squared residuals**.
At last, here is how we fit a simple linear regression model in R:
```{r, echo=TRUE, eval=TRUE}
model <- lm(formula = height ~ weight, data = measurements_sample)
```
Our **outcome variable**, height, goes *before* the tilde (\~) and our **predictor variable**, weight, goes *after* the tilde.
```{r, echo=TRUE, eval=TRUE}
model
```
Taking a look at the `model` object, we see that it first echoes the fitted formula. It then provides us with the intercept ($b_0$) and gradient ($b_1$) values that it has calculated as giving the **smallest** sum of squared residuals!
With this information, we could write out our model as such:
::: {style="text-align: center;"}
$height$ = $102.1758$ + $0.9336$ × $weight$
:::
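Rather than copying these numbers by hand, we can pull the fitted intercept and gradient straight out of the model object; here is a quick sketch:
```{r, echo=TRUE, eval=FALSE}
# extract the fitted intercept (b0) and gradient (b1) as a named vector
coef(model)
# use them to predict the height of an 80kg individual 'by hand'
coef(model)[["(Intercept)"]] + coef(model)[["weight"]] * 80
```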
<br>
### The `{modelr}` package
Base R provides some fiddly functions for dealing with our model, but the `{modelr}` package offers functions that allow us to deal with our `lm()` output in a `tidyverse` way.
```{r, echo=TRUE, eval=FALSE}
install.packages("modelr")
library(modelr)
```
For example, we can add predicted values and residuals using the `add_predictions()` and `add_residuals()` functions (first we remove our earlier columns, leaving just `height` and `weight`):
```{r, echo=TRUE, eval=TRUE}
measurements_sample <- measurements_sample %>%
select(-c(fit_height, residual, sq_residual)) %>%
add_predictions(model, var = "predicted_height") %>% # adds the fitted values
add_residuals(model) # adds the residuals
measurements_sample
```
We can then use the predicted outcomes to plot the best fit line:
```{r, echo=TRUE, eval=TRUE, fig.align='center', fig.alt="A scatterplot showing weight and height from the vample, with the linear model overlaid."}
measurements_sample %>%
ggplot(aes(x = weight)) +
geom_point(aes(y = height)) +
geom_line(aes(y = predicted_height), col = "red") +
theme_minimal()
```
Note that we could also use `geom_smooth()` or `geom_abline()` to plot this line. However, the method above using the `add_predictions()` values is much more general and can be used when we move on to multiple regression.
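For reference, here is a sketch of the `geom_abline()` alternative, taking the intercept and slope from the fitted model rather than typing them in:
```{r, echo=TRUE, eval=FALSE}
# draw the same best-fit line with geom_abline(), using the fitted coefficients
measurements_sample %>%
  ggplot(aes(x = weight, y = height)) +
  geom_point() +
  geom_abline(intercept = coef(model)[["(Intercept)"]],
              slope = coef(model)[["weight"]],
              colour = "red") +
  theme_minimal()
```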
We can also use `add_predictions()` to get predicted outcome values for unseen data, or to generate outcome values for a sequence of data points. e.g. to get predicted `height` values for a `weight` in the range from 50 to 120kg, we could do:
```{r, echo=TRUE, eval=TRUE}
# create a one-column data frame of weight values
weights_predict <- tibble(
weight = 50:120
)
# use our model to predict corresponding height values
weights_predict %>%
add_predictions(model, var = "predicted_height")
```
<br>
### Interpreting the slope
::: {style="text-align: center;"}
$height$ = $102.1758$ + $0.9336$ × $weight$
:::
The gradient $b_1$ (or we could refer to it as $b_{weight}$ as it affects the $weight$ variable) is often called the **regression coefficient**. This term appears a lot in linear regression and it sounds scarier than it is.
The regression coefficient here links the outcome and predictor variables in the following way:
::: {style="text-align: center;"}
**A 1kg increase in weight increases the predicted height by 0.9336cm**
:::
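As a small sanity check, predictions for two weights 1kg apart should differ by exactly this coefficient - a quick sketch using `add_predictions()`:
```{r, echo=TRUE, eval=FALSE}
# predicted heights for weights one kilogram apart differ by b1 (~0.9336cm)
tibble(weight = c(70, 71)) %>%
  add_predictions(model, var = "predicted_height")
```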
<br>
## Model diagnostics: part 1
So we've built a model that helps us explain our data and predict unseen values. But is it a good model? And what do we mean by a model being 'good'?
In this section, we'll cover how to examine the output from our `lm()` model and see if it's a good fit for our data (and a good prediction tool!).
<br>
### Using `summary()`
The base R function, `summary()`, provides an overview of the inner workings of the model:
```{r, echo=TRUE, eval=TRUE}
summary(model)
```
1. **Call:** the model formula, with outcome variable \~ predictor variable.
2. **Residuals:** the minimum, median, maximum and quartile values of the residuals. We don't need to worry about these too much.
3. **Coefficients:**
- The `Estimate` column shows us the intercept and the coefficient (gradient) for the `weight` variable, i.e. $b_0$ and $b_1$.
- The `Std. Error` and `t value` columns, which we don't need to worry about for our purposes.
- The `Pr(>|t|)` column shows us the $p$-values for the intercept and coefficient. If you're unfamiliar with this term, don't worry - we will cover it shortly. For now: if $p$ is very small, we consider this feature of the model to have a significant effect on the outcome variable.
4. **Signif. codes:** standard $p$ significance thresholds and their associated symbols, e.g. a $p$ < 0.001 is denoted with "***".
5. **Residual standard error:** is a value we can think about as being the 'average error' in the model's estimated response variable values - i.e. by how far on average do our fitted values differ from the actual values? Therefore, small is better. Handily, it is measured in the same units as our outcome variable.
6. **Multiple R-squared:** the proportion of the variation in the outcome variable that can be **explained** by variation in the predictor variable (or variables, for multiple regression). The closer this is to 1, the 'better' the model.
- *Adjusted* R-squared includes a penalty based on the number of predictor variables and the size of our sample, to prevent overestimating the explained variance. Feel free to do some more research on this.
- Just make sure to state clearly which R-squared value you're reporting.
7. **F-statistic:** tests whether there is a meaningful relationship between the outcome variable and predictor variable(s). A high value is better, as is a low $p$ associated with the statistic. This is another value we don't need to worry about too much.
```{r echo=FALSE, out.width="100%", fig.align='center', fig.alt="A screenshot of the `summary()` function output, highlighting the intercept, coefficient, p-value of this coefficient, and the R-squared value."}
knitr::include_graphics("images/intro_modelling/diagnostics.PNG")
```
The highlighted components of the model summary above are the main ones to be aware of for now. We can quickly see our coefficient and whether it is considered statistically significant, as well as r-squared at the bottom.
Remember that R-squared is:
::: {style="text-align: center;"}
*The proportion of the variation in the outcome variable that can be* **explained** *by variation in the predictor variable (or variables plural for multiple regression).*
:::
So, in the case above, we have an r-squared of 0.853. This tells us that 85.3% of the variation in `height` in the sample data can be explained by variation in `weight`.
(Note that 'explained' does not equal 'caused by'!)
The higher this value, the better the line fits the data (and the more useful our model is at predicting).
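If you ever need these numbers programmatically (for example, to report them in an R Markdown document), they can be pulled out of the summary object - a quick sketch:
```{r, echo=TRUE, eval=FALSE}
model_summary <- summary(model)
model_summary$r.squared      # multiple R-squared
model_summary$adj.r.squared  # adjusted R-squared
model_summary$sigma          # residual standard error
model_summary$coefficients   # estimates, standard errors, t values and p-values
```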
<br>
### Coefficients and $p$-values
We looked at the $p$-value related to our `weight` coefficient earlier: 0.000135. As this is so small, i.e. lower than 0.001, R assigned it '\*\*\*'.
Without getting too complex, this $p$-value is the product of a hypothesis test of the coefficient. This tests whether the coefficient is **significantly different** from zero. A low $p$-value indicates a very low probability that the coefficient is zero, and therefore changes in the predictor variable are very likely to affect our outcome variable.
We typically set a $p$-value threshold of \<0.05 or \<0.01 for coefficient significance, but these are fairly arbitrary.
<br>
#### When to trust $p$-values
As much as we may want to trust our $p$-value and run off to tell the world that, based on our sampled data and the model we created, weight is a statistically-significant predictor of height, it's important that we check that the **residuals of our model fulfill certain conditions**. This is what we will cover now.
<br>
## Model diagnostics: part 2
We need to check that our residuals:
- are independent of each other
- are normally distributed
- show no systematic increase or decrease in variation across the range of data
This might sound technical and intimidating, but it's really not that bad. Spending a few minutes learning how to read diagnostic plots will make you a much better model builder and interpreter!
<br>
We will need to install the `{ggfortify}` package so we can use the `autoplot()` function on our model to quickly view all the residuals-based plots we're interested in.
```{r, eval=FALSE, echo=TRUE}
install.packages("ggfortify")
library(ggfortify)
```
```{r, eval=TRUE, echo=TRUE, fig.align='center', fig.alt="The output of using the `autoplot()` function on our model is shown. These are four scatterplots representing different diagnostic elements of the model."}
autoplot(model)
```
This produces **four** diagnostic plots - we're concerned with the first three.
<br>
#### **1. Residuals vs Fitted: are the residuals independent of each other?**
We know both of these terms from earlier: the **fitted** values are the y values predicted by our model, and the **residuals** are the differences between these and the actual y values from our data sample.
```{r, eval=TRUE, echo=FALSE, fig.width=10, fig.align='center', fig.alt="The first of the diagnostic plots: 'Residuals vs Fitted values'."}
a <- autoplot(model)
a[1]
```
This plot tests the **independence** of residuals - we want to see them randomly scattered around 0, as opposed to e.g. growing in size as the fitted values increase. The blue line can help us - as long as it hovers somewhat close to 0, it's fine.
This is acceptable in our case, with so few data points.
#### **2. Normal Q-Q: are the residuals normally distributed?**
This **quantile-quantile** plot tests the **normality** of the residuals. For context: a normally-distributed variable is a numeric variable that, if plotted on a histogram, is distributed symmetrically around its mean in a bell shape - for residuals, that mean should be close to 0.
```{r, eval=TRUE, echo=FALSE, fig.width=10, fig.align='center', fig.alt="The second of the diagnostic plots: 'Normal Q-Q plot'."}
a[2]
```
Here, we just want to see the data points lie close to the dashed line. If so, our residuals are well-described as normally distributed. It's common to see a couple of outliers at the extreme ends.
This looks fine here.
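As an informal companion check (not one of the `autoplot()` panels), we could also plot a histogram of the residuals - using the `resid` column that `add_residuals()` gave us earlier - and look for a roughly bell shape centred on zero:
```{r, echo=TRUE, eval=FALSE}
# histogram of the residuals: roughly symmetric and centred on 0 is reassuring
measurements_sample %>%
  ggplot(aes(x = resid)) +
  geom_histogram(bins = 10) +
  theme_minimal()
```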
#### **3. Scale-Location: do the residuals show no systematic increase or decrease in variation across the range of data?**
This plot tests the **constancy of variation** of the residuals. It shows fitted values on the x-axis, like with plot 1, and the standardised, square-rooted residuals on the y-axis.
```{r, eval=TRUE, echo=FALSE, fig.width=10, fig.align='center', fig.alt="The third of the diagnostic plots: 'Scale-Location plot'."}
a[3]
```
We don't have a great many data points here, but what we're looking for is a 'scattershot' distribution of data points. With more points, we can view the data as a 'band' from left to right, and we don't want to see the band widening or narrowing. The blue line can help us: we want it to stay close to a constant value (or wobble around a constant value if we have little data).
This looks fine in our case.
Feel free to learn about the fourth plot, Residuals vs Leverage, in your own time but we don't need to worry about it here.
<br>
### Summary
What have we done here?
1. We checked our data and saw the two numeric variables, `height` and `weight`, show a linear relationship and are correlated.
2. Based on this, we built a linear regression model using `lm()`.
3. We used `summary()` to examine the $p$-value of the model's weight coefficient and the r-squared value. Both indicated a useful model with predictive power.
4. We used `autoplot()` to check out the diagnostic plots of the model to make sure it met the necessary assumptions.
5. These checks passed, meaning we can have confidence in using our model.
<br>
Task: when creating our model object using `lm()`, how do we structure the function call? Try on the dataset below.
```{r lm-input, exercise=TRUE}
# create a dataframe that you will use to build a model
fake_data <- tibble(
predictor_variable = c(0:10),
outcome_variable = c(20:30)
)
# create the regression model using the `lm()` function
model <- lm(formula = , data = ) # enter code here
```
```{r lm-input-check}
grade_this({
# Check if the user's code produced an `lm` object
fail_if(
!inherits(.result, "lm"),
"Your code did not produce a linear model. Did you use the `lm()` function?"
)
# Verify the formula used in the model
model_formula <- formula(.result)
correct_formula <- outcome_variable ~ predictor_variable
fail_if(
!isTRUE(all.equal(model_formula, correct_formula)),
"The formula in your `lm()` is incorrect."
)
# Verify the data used in the model
fail_if(
!identical(.result$call$data, quote(fake_data)),
"The `data` argument in the `lm()` function is incorrect. Did you use `fake_data`?"
)
# If all checks pass
pass("Great job! Your model is correctly specified.")
})
```
<br>
```{r lm-quiz}
quiz(
question("What does r-squared tell us about our model?",
answer("The proportion of variance in the outcome variable explained by the predictor variable.", correct = TRUE),
answer("The probability that, if we ran this on a new sample, our coefficient would be 0."),
answer("The intercept multiplied by the coefficient."),
incorrect = "Not quite, have another go!",
allow_retry = TRUE,
random_answer_order = TRUE
)
)
```
<br>
### What if the checks fail?
If the assumptions about the residuals required to use a linear regression model don't hold up (i.e. the diagnostic plots look iffy), then there are several alternative models you could use.
These include [decision trees](https://medium.com/@MrBam44/decision-trees-91f61a42c724) and [generalised linear models (GLMs)](https://towardsdatascience.com/generalized-linear-models-9ec4dfe3dc3f), although these aren't covered here.
<br>
## Multiple linear regression (MLR)
You now have the tools and knowledge needed to effectively build a **simple linear regression model**, i.e. a model that has just one predictor variable. However, we will often have several predictors to insert into our models...
This is where **multiple linear regression** comes in.
### Gathering our data
Let's create a data frame to practice multiple regression on. We wouldn't normally invent our own data, but for the purposes of this course it's fine. You don't need to understand all the code below; it just creates fictional data on smoothie sales, plus several variables we will use as predictors.
```{r, echo=TRUE, eval=TRUE}
# Set seed for reproducibility
set.seed(123)
# Number of observations
n <- 50
# Generate fun independent variables
advertisements <- sample(10:50, n, replace = TRUE) # Random number of smoothie ads each month
fitness_trend <- sample(50:100, n, replace = TRUE) # Monthly fitness trend score out of 100
sunshine_hours <- sample(100:300, n, replace = TRUE) # Monthly sunshine hours
# Generate dependent variable (smoothie sales) with a relationship to the independent variables
smoothie_sales_gbp <- 5000 + 15 * advertisements + 10 * fitness_trend + 5 * sunshine_hours + rnorm(n, mean = 0, sd = 500)
# Create a data frame
smoothie_data <- tibble(smoothie_sales_gbp, advertisements, fitness_trend, sunshine_hours)
# View the first few rows of the dataset
head(smoothie_data)
```
So, with each row representing a month, we have our smoothie sales in £, the number of smoothie adverts shown, a score representing people's fitness interest, and the number of hours of sunshine.
In reality, you may have to/want to clean the data, create some new variables based on the ones you have (known as **feature engineering**), and remove variables that are intrinsically linked to other variables so as to avoid what we call **multicollinearity** in our model.
The problem with multicollinearity of our predictor variables is that *"it becomes difficult for the model to estimate the relationship between each [predictor] variable and the [outcome] variable independently because the [predictor] variables tend to change in unison".* ([source](https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/))
We won't cover these topics in any more detail here, though.
<br>
### Using `ggpairs()`
Before we start trying to build a model, let's see if we can spot any correlations. Whereas before we used the `cor()` function and examined a single scatter plot, with numerous variables we can use the `ggpairs()` function from the `{GGally}` package to see this information for all variable combinations in one go.
```{r, eval=FALSE, echo=TRUE}
install.packages("GGally")
library("GGally")
```
Simply insert the data frame object into the function:
```{r, eval=TRUE, echo=TRUE, fig.align='center', fig.alt="The output of the `ggpairs()` function on the data. This is a collection of scatterplots, correlation coefficients and variable distributions."}
ggpairs(smoothie_data)
```
Don't panic. This may look like a horrendous mega-plot but if we take it part by part it is a lot easier to digest. Notice that each column and row is a variable in our data.
- Several panels show the **correlation coefficient** we discussed earlier. This gives us a simple way of assessing the correlation of each combination of variables.
- The scatter plots produced by all combinations of variables are also shown. Each scatter plot in the lower half of the grid corresponds to the correlation value in the mirrored position in the upper half.
- The top-left to bottom-right diagonal consists of each variable paired with itself (e.g. `smoothie_sales_gbp` vs `smoothie_sales_gbp`). In these cases, the distribution of the variable is shown. You can think of these as histograms represented by curves.
As `smoothie_sales_gbp` is our outcome variable of interest, the top row and/or left column is what we should examine most closely here. It seems as though all three predictors are somewhat correlated with `smoothie_sales_gbp` and will make good candidates for our model.
Luckily there is also very little correlation between our predictors so we don't have to worry about multicollinearity.
<br>
### Building our model
Like with our simple linear regression model, we can use `lm()` to build a multiple linear regression model - only this time we add in our multiple predictors, separated by `+`.
Note: all of our potential predictor variables are feasibly related to `smoothie_sales_gbp` here - so we're justified in adding them to our model. In your own modelling, try not to add strange assortments of variables that you'd struggle to explain (even if doing so makes your model more accurate).
```{r, eval=TRUE, echo=TRUE}
smoothie_model <- lm(formula = smoothie_sales_gbp ~ advertisements + fitness_trend + sunshine_hours, data = smoothie_data)
```
Note that the `+` in this formula doesn't mean 'add the two variables together'. Instead, it means e.g. '*add the predictor `fitness_trend` into the model with its own coefficient*'.
We can begin by examining the model with `summary()` (although it doesn't matter if we look at this or the diagnostic plots first).
```{r, eval=TRUE, echo=TRUE}
summary(smoothie_model)
```
Let's dig into this summary output now...
### What does the model look like as an equation?
If we wanted to write out the regression model, we can use the intercept and coefficients (`Estimates`) from this summary to do so. Whereas before we only had $b_0$ and $b_1$, now we also have $b_2$ and $b_3$:
::: {style="text-align: center;"}
$smoothie\_sales\_gbp$ = $b_0$ + $b_1advertisements$ + $b_2fitness\_trend$ + $b_3sunshine\_hours$