-
Notifications
You must be signed in to change notification settings - Fork 23
/
support.qmd
339 lines (261 loc) · 10.8 KB
/
support.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
---
title: "SUPPORT: Study to Understand Prognoses Outcomes Preferences and Risks of Treatments"
author:
- name: Frank Harrell
affiliation: Department of Biostatistics<br>Vanderbilt University School of Medicine<br>MSCI Biostatistics II
date: last-modified
format:
html:
self-contained: true
number-sections: true
number-depth: 3
anchor-sections: true
smooth-scroll: true
theme: journal
toc: true
toc-depth: 3
toc-title: Contents
toc-location: left
code-link: false
code-tools: true
code-fold: show
code-block-bg: "#f1f3f5"
code-block-border-left: "#31BAE9"
reference-location: margin
fig-cap-location: margin
execute:
warning: false
message: false
major: regression modeling; rms
minor: ordinary least squares
---
```{r setup}
require(rms)
require(qreport) # provides missChk
require(ggplot2)
options(prType='html')
```
# Problem
* SUPPORT
* [Description](http://hbiostat.org/data/repo/SupportDesc.html)
* N observations: 1000
* Goal: Assess relationship between sex, mean arterial blood pressure at baseline, and study length of stay (days in hospital from the date of qualification for the study)
# Data Import
```{r import}
getHdata(support)
d <- upData(support, keep=.q(slos, meanbp, sex, age, dzgroup, num.co, edu, income, scoma, hrt, resp, temp, totcst))
print(contents(d), levelType='table') # data dictionary for imported dataset
```
# Descriptive Statistics
```{r desc}
describe(d)
```
Baseline is day 3. `meanbp=0` means cardiac arrest happened on that day. This happened in `r sum(d$meanbp == 0, na.rm=TRUE)` patients.
Try transformations in `slos` to find one that makes the distribution more symmetric.
```{r}
#| fig.height: 2.75
y <- d$slos
w <- rbind(data.frame(trans='identity', y = y),
data.frame(trans='sqrt', y = sqrt(y)),
data.frame(trans='cube root', y = y^(1/3)),
data.frame(trans='log', y = log(y)),
data.frame(trans='reciprocal', y = 1 / y))
ggplot(w, aes(x=y)) + geom_histogram(bins=100) +
facet_wrap(~ trans, scales='free_x') # let x scale vary
```
We'll use the log transformation.
# Setting Knot Locations
* Linear spline: must be specified
* Restricted cubic spline (rcs): can specify or use default quantiles
Let's specify knots manually and use the same knot location for linear and cubic splines. Knots are specified as the vector `knots` as below.
```{r}
knots <- c(40, 60, 70, 80, 100, 110, 140)
```
# Plot Raw Data
Include knot locations as vertical reference lines on the plot.
Note the bimodality of the `meanbp` distribution.
```{r}
if(FALSE) hlab <- function(x) {
x <- as.character(substitute(x))
label(d[[x]], plot=TRUE, default=x)
}
ggplot(d, aes(x=meanbp, y=slos, color=sex)) + geom_point() +
scale_y_continuous(trans="log",
breaks=c(3, seq( 5, 45, by=5), seq(50, 90, by=10),
seq(100, 250, by=25))) +
xlab(hlab(meanbp)) + ylab(hlab(slos)) +
geom_vline(xintercept=knots, alpha=0.3, color='blue')
```
# Linear Spline in `meanbp`
## Fit
```{r}
dd <- datadist(d); options(datadist='dd')
# x=TRUE is used for plotting basis functions
f <- ols(log(slos) ~ lsp(meanbp, knots) + sex, data=d, x=TRUE)
f
latex(f)
```
```{r}
specs(f, long=TRUE) # show detailed specs for predictors
```
## Plot Spline Basis Functions
These are the basis for curve fitting, which puts weights ($\hat{\beta}$) on each of the basis functions and then adds them together. The first basis function (linear everywhere) in a linear or restricted spline function in the original variable. Linear spline basis functions are zero until the particular knot is cross.
```{r}
X <- f$x
X <- X[, - ncol(X)] # remove last column (sex)
matplot(X[, 1], X) # first column against all columns
```
## Tests of Flatness and Linearity
The tests can also be obtained by comparing to simpler models, but the `rms` `anova` function is easier.
```{r}
anova(f)
```
## Plot Predicted Log LOS
Include raw data on the plot.
```{r}
ggplot(Predict(f, meanbp, sex)) +
geom_point(aes(meanbp, log(slos)), d)
```
# Restricted Cubic Spline Model
A restricted cubic spline is continuous in the slope and slope of the slope (acceleration) so is more realistic.
## Fit
```{r}
f <- ols(log(slos) ~ rcs(meanbp, knots) + sex, data=d, x=TRUE)
f
latex(f)
```
```{r}
specs(f, long=TRUE)
```
## Plot Spline Basis Functions
```{r}
X <- f$x
X <- X[, - ncol(X)] # remove last column (sex)
matplot(X[, 1], X) # first column against all columns
```
## Tests of Flatness and Linearity
```{r}
anova(f)
```
## Plot Predicted Log LOS
```{r}
ggplot(Predict(f, meanbp, sex))
```
* Why could the sex difference be misleading?
# Missing Data
We will be predicting total cost. There is one patient with a total cost of zero, which we remove.
```{r}
d <- subset(d, totcst != 0 | is.na(totcst))
```
## Summarize Patterns of Missings
Use the `qreport` package's `missChk` function to summarize patterns.
```{r results='asis'}
missChk(d)
```
We see that about 45% of the observations have at least one missing variable. This will inform the number of multiple imputations.
## Multiple Imputation of `edu, income, totcst`
Use the `Hmisc` package `aregImpute` function to create multiple imputations of missing variables. The method used is predictive mean matching, based on predicting each sometimes-missing variable from all the others and finding donor observations based on how close the predicted value of the target variable for the observation with that variable missing is to the predicted target variable on all the observations for which it was not missing. Use 45 imputations per missing value. Continuous variables are automatically splined when predicting the target variable. 3 burn-in iterations are discarded.
```{r}
set.seed(1) # set random number seed so can reproduce calculations
# Force the discrete variable scoma to be treated as linear to
# avoid problems with excessive ties when trying to compute knots
a <- aregImpute(~ age + sex + slos + dzgroup + num.co + edu +
income + I(scoma) + totcst + meanbp + hrt + resp + temp,
data=d, n.impute=45, pr=FALSE)
a
# Show the first 10 imputed values for each sometimes-missing
# variable, for the first 5 patients needed to be imputed
a$imputed$edu[1:5, 1:10]
a$imputed$income[1:5, 1:10]
round(a$imputed$totcst[1:5, 1:10])
```
## Pooled Fit After Multiple Imputation
The `Hmisc` `fit.mult.impute` function makes 45 completed datasets, fits 45 models, averages their coefficient estimates to get final parameter estimates, and uses Rubin's rule to get standard errors and approximate Wald test statistics assuming approximate normality.
```{r}
f <- fit.mult.impute(log(totcst) ~ rcs(age, 4) + rcs(meanbp, 4) + rcs(hrt, 4) +
sex + dzgroup + num.co + scoma + income, ols, a, data=d)
```
```{r}
f
anova(f)
```
# Variable Clustering
* Can lead to data reduction
* This can allow you to live within the parameter budget
* Go back to the original imported `support` dataset which has more variables and didn't remove an observation
```{r}
v <- varclus(~ age + sex + dzclass + race + meanbp + wblc + hrt +
resp + temp + alb + bili + crea + sod + ph + glucose +
bun + urine + adlsc, data=support)
plot(v)
```
* What are the near redundancies?
# Ordinal Predictor
If an ordinal predictor has few levels, we generally include it into a regression model as a nominal categorical variable. If an ordinal variable has many levels, we generally treat it as a continuous variable and include it into the regression as a restricted cubic spline. However, what are the options for an ordinal variable with a middle number of levels?
* Consider number of comorbidities: `num.co`
* Predict `log(slos)`
* This predicts mean log `slos` which under a normality assumption for the residuals is also the median log `slos`
* Anti-log of predicted value is median `slos`
* No missing data on the only two variables being analyzed (`num.co, slos`)
* Compare a variety of ways of modeling `num.co`; focus on $R^2_\text{adj}$
* Only 1 patient has `num.co=7`
* Model with and without curtailing `num.co` at 6
* Add a categorical version of `num.co` to the dataset
* In R `pmin(x, 6)` ("parallel minimum") takes for every observation the minimum of `x` and 6
* For count variables with not many categories, ties prevent us from rationally choosing knots for splines
* Quadratic relationship often works well enough (`pol(x, 2)`)
* Sometimes zero has a special meaning $\rightarrow$ fit a final model that is quadratic in `num.co` but with a discontinuity at `num.co`=0
```{r}
d <- upData(d, num.co7 = factor(num.co),
num.co6 = factor(pmin(num.co, 6)))
dd <- datadist(d)
```
```{r}
# In the code below (x) means "execute x and print its result"
ols(log(slos) ~ num.co, data=d)
ols(log(slos) ~ pmin(num.co, 6), data=d)
(f3 <- ols(log(slos) ~ pol(num.co, 2), data=d))
ols(log(slos) ~ pol(pmin(num.co, 6), 2), data=d)
(f5 <- ols(log(slos) ~ num.co7, data=d))
ols(log(slos) ~ num.co6, data=d)
# Define a quadratic function with discontuity at zero for `gTrans`
g <- function(x) cbind(zero = x == 0, num.co = x, 'num.co^2' = x ^ 2)
ols(log(slos) ~ gTrans(num.co, g), data=d)
```
* For the 0-6 categorical model look at linearity by seeing how the $\hat{\beta}$ progress
* Other interpretations
* Conclusions
## Compare Estimates
Let's compare estimates of median `slos` as a function of `num.co`:
* stratified sample medians (lower precision but less bias if log normal distribution doesn't fit)
* saturated linear model (`num.co` categorical with all levels)
* quadratic model
```{r}
# Get stratified sample medians
# Note that sample medians are hurt by ties in raw data
s <- summary(slos ~ num.co, fun=median, data=d)
# For more information run ?summary.formula.response (in Hmisc)
print(s, markdown=TRUE)
smeds <- with(d, tapply(slos, num.co, median))
# To use data.table istead:
# require(data.table)
# setDT(d) # convert d to a data.table
# data.table needs all groups to return the same value type
# g <- function(x) as.double(median(x, na.rm=TRUE))
# smeds <- d[, g(slos), keyby=num.co] # keyby: sort into ascending order
m3 <- Predict(f3, num.co, fun=exp)$yhat
m5 <- Predict(f5, num.co7, fun=exp)$yhat
w <- data.frame('Sample<br>Medians' = smeds,
'Quadratic<br>Model' = round(m3, 1),
'Categorical<br>Model' = round(m5, 1),
check.names=FALSE)
# kabl (almost the same as knitr::kable) is in reptools.r
# It nicely formats tables in html using markdown
kabl(w)
```
* The categorical model is saturated (maximally flexible) on the right hand side of the equation. Why do estimates from it differ from the sample medians?
* Do you trust the sample median and categorical model estimates for 7 comorbidities?
# Computing Environment
```{r echo=FALSE,results='asis'}
markupSpecs$html$session()
```