# (PART) Describing and summarising data {-}
# Describing data {#DescribingVars}
<!-- Introductions; easier to separate by format -->
```{r, child = if (knitr::is_html_output()) {'./introductions/11-DescribingVariables-HTML.Rmd'} else {'./introductions/11-DescribingVariables-LaTeX.Rmd'}}
```
## Quantitative and qualitative data
Understanding the *type* of data collected is essential before starting any analysis, because the *type* of data determines how to proceed with summaries and analyses.
Broadly, data may be described as either **quantitative** data (Sect.\ \@ref(QuantData)) or **qualitative** data (Sect.\ \@ref(QualData)).
The *data* are the recorded *values* of the variables, so sometimes we talk about quantitative and qualitative *variables*.
Quantitative variables record quantitative data; qualitative variables record qualitative data.
::: {.example #VariablesData name="Variables and data"}
'Age' is a *variable* because age varies from individual to individual.
The *data* would be values like 13 months, 21 years and 76 years.
:::
::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
*Quantitative research* summarises and analyses data using numerical methods (Sect.\ \@ref(TypesOfResearch)).
*Quantitative research* uses both *quantitative* and *qualitative* variables, because both can be summarised numerically (Chaps.\ \@ref(NumericalQuant) and\ \@ref(NumericalQual) respectively) and analysed numerically.
:::
### Quantitative data: discrete and continuous data {#QuantData}
**Quantitative** data are *mathematically* numerical.
Most data arising from counting or measuring will be quantitative.
Quantitative data are often (but not always) measured with measurement units (such as *kg* or *cm*).
Be careful: Numerical data are not necessarily quantitative.
Only *mathematically numerical data are quantitative*; that is, numbers with numerical *meanings*.
<div style="float:right; width: 222x; border: 1px; padding:10px">
<img src="Pics/iconmonstr-calculator-9-240.png" width="50px"/>
</div>
::: {.definition #QuantitativeData name="Quantitative data"}
*Quantitative data* are *mathematically* numerical: the numbers have numerical meaning, and represent quantities or amounts.
Quantitative data arise from counting or measuring.
:::
<div style="float:right; width: 222x; border: 1px; padding:10px">
<img src="Illustrations/pexels-carlos-cuadros-979959.jpg" width="180px"/>
</div>
::: {.example #QuantitativePostcodes name="Quantitative data"}
The weight of numbats, the thickness of sheet metal, and blood pressure are all measured, and are quantitative variables.
The number of power failures per year, the number of solar panels per home, and the number of tangelos per tree are all counts, and are quantitative variables.
Australian postcodes are four-digit numbers, but are *not* quantitative; the numbers are just labels.
A postcode of 4556 isn't one 'better' or one 'more' than a postcode of 4555.
The values do not have numerical *meanings*.
Indeed, alphabetic postcodes could have been chosen.
For example, the postcode of Caboolture is 4510, but could have been QCAB.
:::
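The distinction can be sketched in R.
The postcode values below are hypothetical (chosen only for illustration), as are the object names `postcode_num` and `postcode`.
Storing the postcodes as a *factor* tells R to treat the values as labels rather than as quantities.

```r
# Hypothetical postcodes for five survey respondents (illustrative values only)
postcode_num <- c(4556, 4555, 4510, 4556, 4510)

# Arithmetic is possible on the numbers, but the result has no meaning:
# there is no such thing as an 'average postcode'
mean(postcode_num)

# Storing the postcodes as a factor treats the values as labels
postcode <- factor(postcode_num)
levels(postcode)   # The distinct labels
table(postcode)    # Counting how often each label appears is a sensible summary
```

Counting the number of respondents in each postcode makes sense; averaging the postcodes does not.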
Quantitative data may be further classified as *discrete* or *continuous*.
*Discrete* quantitative data have possible values that can be counted, at least in theory.
Sometimes, the possible values may have no theoretical upper limit, yet are still considered 'countable'.
*Continuous* quantitative data have values that cannot, at least in theory, be recorded exactly: another value can always be found between any two given values of the variable, if we measure to a greater number of decimal places.
In practice, though, the values need to be rounded to a reasonable number of decimal places.
::: {.definition #DiscreteData name="Discrete data"}
*Discrete* quantitative data have a countable number of possible values between any two given values of the variable.
:::
::: {.example #QuantDiscrete name="Discrete quantitative data"}
These quantitative variables are *discrete*:
* The *number* of heart attacks in the previous year experienced by Croatian women over 40.
Possible values: 0, 1, 2, ...
* The *number* of cracked eggs in a carton of 12.
Possible values: 0, 1, 2, ... 12.
* The *number* of orthotic devices a person has used.
Possible values: 0, 1, 2, ...
* The *number* of turbine cracks after 750 hours use.
Possible values: 0, 1, 2, ...
:::
::: {.definition #ContinuousData name="Continuous data"}
*Continuous* quantitative data have (at least in theory) an infinite number of possible values between any two given values.
:::
Height is continuous: between the heights of 179cm and 180cm, infinitely many heights exist (at least in theory); how many can be distinguished depends on how many decimal places are used to record height.
In practice, however, heights are usually rounded to the nearest centimetre for convenience.
All continuous data are rounded when recorded.
::: {.example #QuantContinuous name="Continuous quantitative data"}
These quantitative variables are *continuous*:
* The *weight* of 6-year-old Fijian children.
Values exist between any two given values of weight, by measuring to more decimal places of a kilogram.
However, weights are usually reported to the nearest kilogram.
* The *energy consumption* of houses in London.
Values exist between any two given values of energy consumption, by measuring to more and more decimal places of a kilowatt-hour (kWh).
Consumption would usually be given to the nearest kWh.
* The *time* spent in front of a computer each day for employees in a given industry.
Values exist between any two given times, by measuring to more decimal places of a second.
The values may be reported to the nearest minute, or the nearest 15 minutes.
:::
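As a rough illustration of rounding continuous data, consider the R sketch below.
The heights are hypothetical values (not from any study), and the names `height` and `between` are used only for this example.

```r
# Hypothetical heights (in cm), measured to three decimal places
height <- c(179.204, 179.871, 179.498, 180.003)

round(height, 1)   # Heights recorded to the nearest millimetre
round(height)      # Heights recorded to the nearest centimetre

# Another value can always be found between any two given values:
# for example, the midpoint of the first two heights
between <- (height[1] + height[2]) / 2
height[1] < between & between < height[2]
```

The recorded values depend on the number of decimal places used, even though the underlying variable is continuous.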
Sometimes, discrete quantitative data with a very large number of possible values may be treated as continuous.
:::{.example #DiscreteAsContinuous name="Treating discrete data as continuous"}
Annual income is discrete, since no income is between \$80,000.00 and \$80,000.01.
However, annual incomes are usually many thousands of dollars and vary at scales far greater than one cent, and so are usually treated as continuous.
:::
### Qualitative data: nominal and ordinal data {#QualData}
**Qualitative** data have distinct labels or categories, and are not mathematically numerical.
Be careful: *numerical* data may be qualitative, provided the numbers don't have numerical *meanings*.
The categories of a qualitative variable are called the *levels* or the *values* of the variable.
<div style="float:right; width: 222x; border: 1px; padding:10px">
<img src="Pics/iconmonstr-construction-35-240.png" width="50px"/>
</div>
::: {.definition #QualitativeData name="Qualitative data"}
*Qualitative data* are not *mathematically* numerical data: they consist of categories or labels.
:::
::: {.definition #Levels name="Levels"}
The *levels* (or the *values*) of a qualitative variable refer to the names of the distinct categories.
:::
::: {.example #DefinitionsClarity name="Clarity in definitions"}
'Age' is a *continuous quantitative* variable, since age could be measured to many decimal places of a second.
Age is usually rounded down to the number of completed years, for convenience.
However, the age of young children may be given as '3 days' or '10 months'.
Sometimes 'Age group' is used (such as Under 20; 20 to under 50; 50 or over) instead of 'Age'.
'Age group' is *qualitative*.
Ensure you are clear about which is used!
:::
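The difference between 'Age' and 'Age group' can be sketched in R using `cut()`, which places each value of a quantitative variable into a category.
The ages below are hypothetical, and the names `age` and `age_group` are used only for illustration.

```r
# Hypothetical ages (in completed years) for six people
age <- c(13, 19, 20, 49, 50, 76)

# Convert the quantitative variable 'Age' into the qualitative variable 'Age group'.
# Using right = FALSE means each interval includes its lower limit:
# [0, 20), [20, 50) and [50, Inf)
age_group <- cut(age,
                 breaks = c(0, 20, 50, Inf),
                 labels = c("Under 20", "20 to under 50", "50 or over"),
                 right = FALSE)

table(age_group)   # The number of people in each level
```

After grouping, the exact ages are no longer available: only the level for each person is recorded.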
::: {.example #QualData name="Qualitative data"}
'Brand of mobile phone' is a qualitative variable.
Many levels are possible (that is, many possible brands), but could be simplified by defining the levels as 'Apple', 'Samsung', 'Google' and 'Other'.
:::
::: {.example #QualData2 name="Qualitative data"}
Australian postcodes are numbers, but are *qualitative* (Example\ \@ref(exm:QuantitativePostcodes)).
:::
::: {.thinkBox .think data-latex="{iconmonstr-light-bulb-2-240.png}"}
Consider these two qualitative variables.
What features of the data collected for the two variables are similar?
What features are different?
1. *Blood type*, with levels: Type A; Type B; Type AB; Type O.
2. *Age group*, with levels: Under 20; 20 to under 50; 50 or over.
:::
Qualitative data can be further classified as *nominal* or *ordinal*.
*Nominal* variables are qualitative variables where the levels *have no natural order*.
*Ordinal* variables are qualitative variables where the levels *do have a natural order*.
In the question above, 'Blood type' is qualitative *nominal*, while 'Age group' is qualitative *ordinal*.
::: {.definition #Nominal name="Nominal qualitative variables"}
A *nominal* qualitative variable is a qualitative variable where the levels *do not* have a natural order.
:::
::: {.definition #Ordinal name="Ordinal qualitative variables"}
An *ordinal* qualitative variable is a qualitative variable where the levels *do* have a natural order.
:::
::: {.example #NominalData name="Nominal data"}
The variable 'How students get to university' is *nominal*; the levels may be: Car (driver or passenger); Bus; Ride bicycle; Walk; Other.
The data will be *nominal* with five levels.
The levels can appear in any order: from largest group to smallest, or in alphabetical order.
Since there is no *natural* order, the order used should be carefully considered: what is the most useful order when summarising the data?
:::
::: {.example #OrdinalData name="Ordinal data"}
A questionnaire question where respondents are asked to select from options like
'Strongly disagree', 'Disagree', 'Neither agree nor disagree', 'Agree' and 'Strongly agree'
will produce *ordinal* data.
For example, the responses to the following question will be *ordinal* with five levels:
> Please indicate the extent to which you agree or disagree with this statement:
> 'Permeable pavements technology will revolutionise green building practices'.
Presenting the levels in the order shown (or the reverse order) makes sense; presenting the levels in alphabetical order, for example, would not make sense.
:::
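In R, one way to record the difference between nominal and ordinal variables is to use an unordered factor for nominal data, and an ordered factor for ordinal data.
The sketch below uses hypothetical responses; the names `blood_type` and `agreement` are used only for illustration.

```r
# Hypothetical responses (illustrative values only)

# Nominal: the levels have no natural order
blood_type <- factor(c("A", "O", "B", "O", "AB", "A"))

# Ordinal: the levels are listed in their natural order
agreement <- factor(c("Agree", "Strongly agree", "Disagree", "Agree"),
                    levels = c("Strongly disagree", "Disagree",
                               "Neither agree nor disagree",
                               "Agree", "Strongly agree"),
                    ordered = TRUE)

is.ordered(blood_type)
is.ordered(agreement)

# Comparing levels only makes sense for ordinal variables:
# which respondents selected 'Agree' or stronger?
agreement >= "Agree"

# The table lists the levels in their natural order, not alphabetical order
table(agreement)
```

Levels that no respondent selected still appear in the table (with a count of zero), since they remain possible values of the variable.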
<div style="float:right; width: 222x; border: 1px; padding:10px">
<img src="Illustrations/pexels-tristan-le-1642883.jpg" width="200px"/>
</div>
::: {.example #TypesVariables name="Types of variables"}
Consider a study to determine if 500g bags of pasta really weigh 500g or more.
One approach is to record the weight of pasta in each bag (a *quantitative* variable), and compare the *average* weight to the target weight of 500g.
Another approach is to record whether each bag of pasta was underweight or not (perhaps using a balance scale).
This would be a *qualitative* variable, with two *levels* (underweight; not underweight).
The *percentage* of bags that are underweight could be reported.
:::
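The two approaches for the pasta study can be sketched in R.
The weights below are hypothetical (invented for illustration, not data from a real study), and the names `weight` and `underweight` are assumptions made for this example.

```r
# Hypothetical weights (in g) of eight 500g bags of pasta
weight <- c(502.1, 499.4, 500.8, 503.0, 498.7, 501.5, 500.2, 500.3)

# Approach 1: a quantitative variable, summarised using the average weight
mean(weight)

# Approach 2: a qualitative variable with two levels,
# summarised using the percentage of bags that are underweight
underweight <- ifelse(weight < 500, "Underweight", "Not underweight")
table(underweight)
100 * mean(underweight == "Underweight")
```

The same bags of pasta produce different types of data, and hence different summaries, depending on what is recorded.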
<iframe src="https://learningapps.org/watch?v=pchrmqw2c22" style="border:0px;width:100%;height:500px" allowfullscreen="true" webkitallowfullscreen="true" mozallowfullscreen="true"></iframe>
::: {.softwareBox .software data-latex="{iconmonstr-laptop-4-240.png}"}
Most statistical software packages, like jamovi and SPSS, require the user to describe the variables.
This enables the software to produce appropriate output and suggest appropriate analyses.
:::
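R works similarly: each variable in a data frame has a *class* that tells R how to treat the variable.
The following is a minimal sketch only; the data frame `study` and all the values are hypothetical.

```r
# A hypothetical data frame with one variable of each type (illustrative values only)
study <- data.frame(
  weight = c(34.5, 16.2, 21.1),                         # Quantitative continuous
  visits = c(0L, 3L, 1L),                               # Quantitative discrete
  origin = factor(c("Coppice", "Natural", "Planted")),  # Qualitative nominal
  income = factor(c("Low", "High", "Middle"),
                  levels = c("Low", "Middle", "High"),
                  ordered = TRUE)                       # Qualitative ordinal
)

# Show how R describes each variable
sapply(study, function(x) class(x)[1])
```

R cannot always determine the type of variable from the values alone (for example, postcodes look like numbers), so the researcher should check how each variable has been described.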
## Summary {#Chap11-Summary}
The *type* of data collected determines the types of summaries and analyses that are needed.
Data and variables can be described as either:
* *quantitative* (either *discrete* or *continuous*) if they are mathematically numerical; or
* *qualitative* (either *nominal* or *ordinal*) if they are not mathematically numerical.
## Quick revision questions {#Chap11-QuickReview}
::: {.webex-check .webex-box}
A study on the bruising of apples [@doosti2016development] explored the relationship between the surface temperature of apples and the depth of bruising.
The researchers purposefully hit apples with three different *forces* (200, 700 and 1200mJ) to inflict bruises.
This was repeated at three different *locations* of the apple (lower; middle; upper).
The researchers then recorded the *depth* of the bruising, and the *surface temperature* (in ^o^C) at each bruise location.
1. How would the variable 'location on the apple' be best described using the language of this chapter?\tightlist
`r if( knitr::is_html_output() ) {longmcq( c(
"Qualitative nominal",
answer = "Qualitative ordinal",
"Quantitative discrete",
"Quantitative continuous"))}`
1. How would the variable 'depth of bruising' be best described using the language of this chapter?
`r if( knitr::is_html_output() ) {longmcq( c(
"Qualitative nominal",
"Qualitative ordinal",
"Quantitative discrete",
answer = "Quantitative continuous"))}`
1. How would the variable 'temperature of the bruise location' be best described using the language of this chapter?
`r if( knitr::is_html_output() ) {longmcq( c(
"Qualitative nominal",
"Qualitative ordinal",
"Quantitative discrete",
answer = "Quantitative continuous"))}`
1. The variable 'force of hit' could be considered a quantitative continuous variable.
However, since only a small number of forces are used, it should probably be considered qualitative ordinal.
How many *levels* would the variable have?
`r if( knitr::is_html_output() ) {longmcq( c(
answer = "Three levels: 200, 700 and 1200 mJ",
"Four levels: four variables are listed in the study",
"An infinite number of levels: the force could be anything"))}`
:::
## Exercises {#DescribeExercises}
Selected answers are available in Sect.\ \@ref(DescribeAnswer).
::: {.exercise #DescribeClassifying1}
True or false: These variables are *quantitative* and *continuous*.
* The knee-flex angle after treatment. \tightlist
`r if( knitr::is_html_output() ) {torf( answer = TRUE ) }`
* Whether or not laser drilling of small holes in concrete is successful.
`r if( knitr::is_html_output() ) {torf( answer = FALSE ) }`
* Length of time between arrival at an emergency department, and admission.
`r if( knitr::is_html_output() ) {torf( answer = TRUE ) }`
* Number of eggs laid by female brush turkeys.
`r if( knitr::is_html_output() ) {torf( answer = FALSE ) }`
* Whether or not a child eats the recommended serving of fruit each day.
`r if( knitr::is_html_output() ) {torf( answer = FALSE ) }`
:::
::: {.exercise #DescribeClassifying2}
True or false: These variables are *qualitative* and *nominal*.
* The age group of respondents to a survey. \tightlist
`r if( knitr::is_html_output() ) {torf( answer = FALSE ) }`
* Whether a cyclist is wearing a helmet or not.
`r if( knitr::is_html_output() ) {torf( answer = TRUE ) }`
* The dosage of a medication applied: 40 mg per day, 60 mg per day, or 80 mg per day.
`r if( knitr::is_html_output() ) {torf( answer = FALSE ) }`
* The brand of fertilizer being applied.
`r if( knitr::is_html_output() ) {torf( answer = TRUE ) }`
* The approximate age of trees.
`r if( knitr::is_html_output() ) {torf( answer = FALSE ) }`
:::
::: {.exercise #DescribeClassifying3}
A study recorded whether or not people (who were not swimming) were wearing head-protection at the beach.
The results were recorded as None; Cap; or Hat.
Which of the following could be used to describe this variable?
`r if (knitr::is_html_output()) {'<!--'}`
Nominal; Qualitative; Continuous; Quantitative; Ordinal.
`r if (knitr::is_html_output()) {'-->'}`
`r if (knitr::is_latex_output()) {'<!--'}`
* Nominal `r torf( answer = TRUE )`
* Qualitative `r torf( answer = TRUE )`
* Continuous `r torf( answer = FALSE )`
* Quantitative `r torf( answer = FALSE )`
* Ordinal `r torf( answer = FALSE )`
`r if (knitr::is_latex_output()) {'-->'}`
:::
::: {.exercise #DescribeClassifyingGraphsLimeTrees}
A study of lime trees (*Tilia cordata*) recorded these variables for 385 lime trees in Russia [@data:ForestBiomass2017; @mypapers:dunnsmyth:glms]: the foliage biomass (in kg); the tree diameter (in cm); the age of the tree (in years); and the origin of the tree (one of Coppice, Natural, or Planted).
Describe the variables in the study using the language of this chapter.
:::
::: {.exercise #DescribeClassifyingVariables1}
Are these variables quantitative (discrete or continuous, and with what units of measurement?), or qualitative (nominal or ordinal, and with what levels?)?
1. Systolic blood pressure.
1. Diet (vegan; vegetarian; neither vegan nor vegetarian).
1. Socioeconomic status (low income; middle income; high income).
1. Number of times a person visited the doctor last year.
:::
::: {.exercise #DescribeClassifyingVariables2}
A study of body mass index and its relationship with use of social media [@data:Alley2017:SocialMedia] recorded these variables (among others) from a group of 1140 participants:
1. Age (under 45; 45 to 64; 65 or over).
1. Gender (male; female).
1. Location (urban; rural).
1. Social media use (none; low; high).
1. BMI (body mass index; the body mass in kg, divided by the square of height in m).
1. Total sitting time, in minutes per day.
For each variable, determine the *type* of variable: quantitative (discrete or continuous, and with what units of measurement?), or qualitative (nominal or ordinal, and with what levels)?
:::
::: {.exercise #DescribeClassifyingOrthoses}
In a study of the influence of using ankle-foot orthoses in children with cerebral palsy [@data:Swinnen2017:orthoses], the data in Table\ \@ref(tab:DescribeAnkleFoot) describe the 15 subjects.
(GMFCS is the
`r if (knitr::is_latex_output()) {
'Gross Motor Function Classification System,'
} else {
'[Gross Motor Function Classification System](https://en.wikipedia.org/wiki/Gross_Motor_Function_Classification_System),'
}`
used to describe the impact of cerebral palsy on motor function, where *lower* levels mean *better* functionality.)
Describe the variables in the study using the language of this chapter.
:::
```{r DescribeAnkleFoot}
# Data describing the 15 subjects in the orthoses study
NumP <- 15
Gender <- rep("M", NumP)
Gender[c(11, 14, 15)] <- "F"   # Subjects 11, 14 and 15 are female
Age <- c(9, 7, 7, 12, 11, 5, 6, 8, 8, 6, 7, 11, 7, 9, 8)
Ht <- c(136, 106, 129, 152, 146, 113, 112, 112, 138, 116, 113, 141, 136, 128, 133)
Wt <- c(34.5, 16.2, 21.1, 40.4, 39.3, 18.1, 16.7, 19.1, 28.6, 19.3, 17.6, 34.9, 34.5, 21.9, 23.0)
GMFCS <- c(1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1)

CPalsy <- data.frame(Gender = Gender,
                     Age = Age,
                     Height = Ht,
                     Weight = Wt,
                     GMFCS = GMFCS)

# PDF output: LaTeX table with a smaller font and a bold header row
if (knitr::is_latex_output()) {
  kable(CPalsy,
        format = "latex",
        longtable = FALSE,
        booktabs = TRUE,
        digits = c(0, 0, 0, 1, 0),
        caption = "Describing the sample in the orthoses dataset",
        col.names = c("Gender",
                      "Age (years)",
                      "Height (cm)",
                      "Weight (kg)",
                      "GMFCS")) %>%
    kable_styling(font_size = 10) %>%
    row_spec(0, bold = TRUE)
}

# HTML output: the same table, without the LaTeX-specific styling
if (knitr::is_html_output()) {
  kable(CPalsy,
        format = "html",
        longtable = FALSE,
        booktabs = TRUE,
        digits = c(0, 0, 0, 1, 0),
        caption = "Describing the sample in the orthoses dataset",
        col.names = c("Gender",
                      "Age (years)",
                      "Height (cm)",
                      "Weight (kg)",
                      "GMFCS"))
}
```
::: {.exercise #DescribeClassifyingNitrogenInSoil}
A study of fertilizer use [@data:Lane2002:GLMsoilscience; @mypapers:dunnsmyth:glms] recorded the soil nitrogen after applying different fertilizer doses.
These variables were recorded:
* the *fertilizer dose*, in kilograms of nitrogen per hectare;
* the *soil nitrogen*, in kilograms of nitrogen per hectare; and
* the *fertilizer source*; one of 'inorganic' or 'organic'.
Describe the variables in the study.
:::
::: {.exercise #DescribeClassifyingKangaroos}
A study [@brunton2019fright] recorded the response of kangaroos to overhead drones (one of 'No vigilance', 'Vigilance', 'Flee $<10$m', or 'Flee $>10$m') and the altitude of the drone (30m, 60m, 100m or 120m).
The mob size and the sex of the kangaroo were also recorded.
Describe the variables in the study.
:::
::: {.exercise #DescribeSelfieDeaths}
A study of people who died while taking selfies [@data:Dokur2018:SelfieDeaths] recorded the location (Table\ \@ref(tab:TableSelfieDeaths)).
Which of the following are the *variables* in the table?
For each that is a variable, describe the variable.
1. The location.
1. The number of people who died at each location.
1. The percentage of people who died at each location.
:::
```{r TableSelfieDeaths}
# Number and percentage of selfie deaths at each type of location
NumSD <- c(48, 22, 17, 12, 7, 4, 1)
PercentageSD <- c(43.2, 19.9, 15.3, 10.8, 6.3, 3.6, 0.9)
Scenes <- c(
  "Nature, associated environments",
  "Train, railway, associated structures",
  "Buildings, associated structures",
  "Road, bridge, associated structures",
  "Dams, associated structures",
  "Fields, farms, associated structures",
  "Others"
)

SelfieDeaths <- data.frame(
  Num = NumSD,
  PC = PercentageSD
)
rownames(SelfieDeaths) <- Scenes   # The locations are used as the row labels

# PDF output: LaTeX table with a smaller font and a bold header row
if (knitr::is_latex_output()) {
  kable(SelfieDeaths,
        format = "latex",
        longtable = FALSE,
        booktabs = TRUE,
        linesep = c("", "\\addlinespace", "", "", "\\addlinespace", "", ""), # Otherwise adds a space after five lines...
        digits = c(0, 1),
        caption = "Locations of people dying while taking selfies",
        col.names = c("Number",
                      "Percentage")) %>%
    kable_styling(font_size = 10) %>%
    row_spec(0, bold = TRUE)
}

# HTML output: the same table, without the LaTeX-specific styling
if (knitr::is_html_output()) {
  kable(SelfieDeaths,
        format = "html",
        longtable = FALSE,
        booktabs = TRUE,
        digits = c(0, 1),
        caption = "Locations of people dying while taking selfies",
        col.names = c("Number",
                      "Percentage"))
}
```
<!-- QUICK REVIEW ANSWERS -->
`r if (knitr::is_html_output()) '<!--'`
::: {.EOCanswerBox .EOCanswer data-latex="{iconmonstr-check-mark-14-240.png}"}
**Answers to in-chapter questions:**
- \textbf{\textit{Quick Revision} questions:}
**1.** Qualitative ordinal.
**2.** Quantitative continuous.
**3.** Quantitative continuous.
**4.** Three levels: 200, 700 and 1200 mJ.
:::
`r if (knitr::is_html_output()) '-->'`