
# (PART) Collecting data {-}


# Procedures for collecting data {#CollectingDataProcedures}


<!-- Introductions; easier to separate by format -->

```{r, child = if (knitr::is_html_output()) {'./introductions/10-Collect-HTML.Rmd'} else {'./introductions/10-Collect-LaTeX.Rmd'}}
```


## Protocols

If the RQ is well-constructed, terms are clearly defined, and the research design is clear and well explained, then the means for collecting the data should be clear.
However, data collection is often time-consuming, tedious and expensive, so collecting the data correctly first time is important.

*Before* collecting the data, a plan should be established and documented that explains exactly how the data will be obtained, which will include *operational definitions* (Sect. \@ref(OperationDefinitions)).
This plan is called a *protocol*.
Unforeseen complications are not unusual, so often a *pilot study* (or a *practice run*) is conducted before the real data collection, to see if the planned procedure is practical and optimal.
The pilot study may suggest changes to the protocol.


::: {.definition #Protocol name="Protocol"}
A *protocol* is a procedure documenting the details of the design and implementation of a study, and of the data collection.
:::


::: {.definition #PilotStudy name="Pilot study"}
A *pilot study* is a small test run of the study protocol used to check that the protocol seems appropriate and practical, and to identify possible problems with the design or protocol.
:::


A pilot study allows the researcher to:

* determine the feasibility of the data collection protocol.
* identify unforeseen challenges.
* obtain data to determine appropriate sample sizes (Sect. \@ref(EstimatingSampleSize)).
* potentially save time and money.


<div style="float:right; width: 222px; border: 1px; padding:10px">
<img src="Pics/iconmonstr-task-1-240.png" width="50px"/>
</div>


After the pilot study, the planned protocol may need to be refined.
The data can be collected once the protocol has been finalised.
Protocols ensure studies can be repeated (Sect. \@ref(ReproducibleResearch)) so others can confirm or compare results, and so others can understand exactly what was done, and how.
Protocols should clearly indicate how design aspects (such as blinding the individuals, random allocation of treatments, etc.) will happen.
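
As a minimal sketch of how a protocol might document random allocation so that it can be repeated, the R code below allocates three treatments at random to twelve participants.
The participant labels, the treatment names and the seed are hypothetical, chosen only for illustration.

```r
# Hypothetical participant IDs and treatment labels (not from any study)
participants <- sprintf("P%02d", 1:12)
treatments   <- c("A", "B", "C")

# Recording the seed in the protocol lets others regenerate the same allocation
set.seed(2024)

# Balanced random allocation: repeat the treatments equally, then shuffle
allocation <- data.frame(
  participant = participants,
  treatment   = sample(rep(treatments, length.out = length(participants)))
)

# Each treatment is allocated to four participants
table(allocation$treatment)
```

Reporting the code and the seed alongside the protocol means the allocation step can be checked and repeated by others.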

The final *protocol*, without pedantic detail, should be reported.
Someone else should be able to read the protocol and approximately repeat the study (this is ethical: Sect. \@ref(Ethics)).
Diagrams can be useful to support explanations.
All studies should have a well-established protocol for describing how the study was done.

A protocol usually has at least three components that describe:

1. How individuals are chosen from the population (i.e., external validity);
2. How information is collected from the individuals (i.e., internal validity); and
3. The analyses, and what software (and version) was used.


:::{.example #ProtocolExample name="Protocol"}
To increase the nutritional value of cookies, researchers tested cookies made using pureed green peas in place of margarine [@data:Romanchik2018:cookies].
The researchers wanted to assess the acceptance of these cookies to college students.

The protocol discussed each of the three components above.
The article described *how the individuals were chosen* (p. 4):

> One hundred and three untrained volunteers were recruited through advertisement across campus from students attending a university in the southeastern United States.

This voluntary sample comprised 80.6% women, a higher percentage of women than in the general population, or the college population.
(Other extraneous variables were also recorded.)

Exclusion criteria were also applied, excluding people (p. 5)

> ...with an allergy or sensitivity to an ingredient used in the preparation of the cookies

The researchers also described *how the data was obtained from the individuals* (p. 5):

> During the testing session, panelists were seated at individual tables. 
> Each cookie was presented one at a time on a disposable white plate. 
> Samples were previously coded and randomized. 
> The presentation order for all samples was 25%, 0%, 50%, 100% and 75% substitution of fat with puree of canned green peas. 
> To maintain standard procedures for sensory analysis [...], panelists cleansed their palates between cookie samples with distilled water (25^$\circ$^C) [...]
> a 9-point hedonic scale in which 9 = like extremely, 5 = neutral, and 1 = dislike extremely, was used to analyze characteristics of color, smell, moistness, flavor, aftertaste, and overall acceptability, for each sample of cookies...

Thus, internal validity was managed using random allocation, blinding the individuals, and washouts.
Details are also given of how the cookies were prepared, and how objective measurements (such as moisture content) were determined.

The *analyses and software used* were also given.
:::


::: {.example #ProtocolEG name="Protocol"}
A study [@data:Wojcik:ForwardFall] examined the forward-leaning angle from which people could recover and not fall, to determine if this angle was different (on average) for younger and older people.
The paper goes into great detail to explain the protocol (almost 1.5 pages, plus a diagram).
:::


::: {.exampleExtra  data-latex=""}
Consider this partial protocol, which shows honesty in describing a protocol:
  
> Fresh cow dung was obtained from free-ranging, grass fed, and antibiotic-free Milking Shorthorn cows (*Bos taurus*) in the Tilden Regional Park in Berkeley, CA. 
> Resting cows were approached with caution and startled by loud shouting, whereupon the cows rapidly stood up, defecated, and moved away from the source of the annoyance. 
> Dung was collected in ZipLoc bags (1 gallon), snap-frozen and stored at $-80$ C. 
>
> --- @hare2008sepsid, p. 10
:::


<!-- One approach to documenting the data collection process is to use the [STAR method](https://www.cell.com/star-methods): -->

<!-- * S (**Structured**):  -->
<!--   The protocol is organised logically, with attention to detail. -->
<!-- * T (**Transparent**):  -->
<!--   All the necessary information is provided: the protocol is transparent, comprehensive and accurate. -->
<!-- * A (**Accessible**):  -->
<!--   The protocol is easy to access, easy to follow, and easy to comprehend. -->
<!-- * R (**Reporting**): -->
<!--   The protocol is reported in whole and in detail, so it could be replicated. -->


## Collecting data using questionnaires

Data may be collected in many ways (laboratory experiments, field observations, etc.).
For both observational and experimental studies, though, collecting data using *questionnaires* is common.
Questionnaires are difficult to do well: question wording is crucial, and surprisingly difficult to get right [@fink1995survey].
Pilot testing of questionnaires is crucial!


:::{.definition #Questionnaire name="Questionnaire"}
A questionnaire is a set of questions for respondents to answer.
:::


::: {.tipBox .tip data-latex="{iconmonstr-info-6-240.png}"}
A *questionnaire* is a set of questions used to obtain information from individuals.
A *survey* is an entire methodology, which includes gathering data using a questionnaire but has other components also.
:::

Questions in a questionnaire may be *open-ended* (respondents can write their own answers) or *closed* (respondents select from a small number of possible answers, as in multiple-choice questions).
Open and closed questions both have advantages and disadvantages.
Answers to open questions more easily lend themselves to qualitative analysis.

This section only briefly examines questionnaires:

* writing questions (Sect. \@ref(AskSurveyQuestions)); and
* comparing online and paper questionnaires (Sect. \@ref(OnlinePaperSurveys)).


::: {.example #OpenClosedQuestions name="Open and closed questions"}
German students were asked a series of questions about microplastics [@raab2021conceptions], including:

1. Name sources of microplastics in the household.
2. In which ecosystems are microplastics in Germany? Tick the answer (multiple ticks are possible).
   *Options*: (a) sea; (b) rivers; (c) lakes; (d) groundwater.
3. Assess the potential danger posed by microplastics.
   *Options*: (a) very dangerous; (b) dangerous; (c) hardly dangerous; (d) not dangerous.

The first question is an *open question*, where respondents could provide their own answers.
The second question is *closed*, where multiple options could be selected.
The third question is *closed*, where only one option could be selected.
:::


### Writing questions {#AskSurveyQuestions}

Some of the issues to keep in mind when framing questionnaire questions are:

* **Avoid leading questions** which may indicate how respondents are expected to answer.
  Question wording is the usual reason for leading questions.
* **Avoid ambiguity**:
  Avoid unfamiliar terms and unclear questions.
* **Avoid asking the uninformed**: avoid asking respondents about issues they don't know about.
  Many people will give a response even if they do not understand, but such responses are worthless.
  (For example, people may give directions to places that do not even exist [@collett1976pointing].)
* **Avoid complex and double-barrelled questions**; these are often hard to understand.
* **Avoid problems with ethics**.
  Avoid questions about people breaking laws, or revealing confidential or private information.
  In special cases and with justification, ethics committees may allow such questions. 
* **Ensure** that questions are clearly and precisely worded.
* **Ensure** that options for multiple-choice questions are *mutually exclusive* (answers fit into only one category) and *exhaustive* (the categories cover *all* possible options).


::: {.example #LeadingQns name="Leading question"}
This question is a *leading question*, because the expected response is obvious:
  
* Because bottles from bottled water create enormous amounts of non-biodegradable landfill and hence pose a threat to sensitive native wildlife, do you support a ban on bottled water in Australia?

This question is *ambiguous*, as it is unclear what "faster *now*" is being compared to:

* Do children run faster now?

This question is unlikely to be answerable, as most people will be *uninformed*:

> Is the use of fibre composites for waterside recreational purposes likely to cause the material to swell, discouraging use of the facilities?

Nonetheless, many people will still give an opinion.
This data will be effectively useless, but the researcher may not realise this.

This question is *double-barrelled*, and would be better asked as two separate questions (one asking about jogging, and one about swimming):

* Do you jog and swim for exercise?

This question is unlikely to be given *ethical approval* or to obtain truthful answers, as respondents are unlikely to admit to breaking rules:

* Do you have a water tank installed illegally, without council permission?

This question is *unclear*, since what "agree" or "disagree" means is not obvious:

* I don't go out of my way to purchase low-fat food unless they are also low in calories but not necessarily salt.  Do you agree or disagree?
:::

::: {.example #QuestionWording name="Question wording"}
Question *wording* can be important.
These two questions would produce different answers:

* Which is easier to *buy*: cigarettes, beer or marijuana?
* Which is easier to *obtain*: cigarettes, beer or marijuana?
:::


::: {.example #LeadingQuestion2 name="Leading question"}
Consider this question:

> Do you like this new orthotic?

This question is *leading*, since *liking* is the only option presented.
Better would be:

> Do you like or dislike this new orthotic?
:::


::: {.example #UnclearQns name="Unclear wording"}
Consider this question:

> I don't go out of my way to purchase low-fat food unless they are also low in calories but not necessarily salt.  Do you agree or disagree?

It is not clear what 'agree' or 'disagree' means in response to this question.
:::


::: {.example #MutuallyExclusiveQns name="Mutually exclusive options"}
In a study to determine the time doctors spent on patients (from @chan2008exploration), doctors were given the options:

* 0--5 minutes;
* 5--10 minutes; or
* more than 10 minutes.

This is a poor question, because a respondent does not know which option to select for an answer of "5 minutes".
The options are not *mutually exclusive*.
:::
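
One way to check whether the options for a closed question are mutually exclusive and exhaustive is to count how many options each possible answer fits; every answer should fit exactly one option.
The R sketch below does this for the options above, assuming each range includes both of its endpoints (the question itself does not say).
The helper `count_matches()` and the revised wording of the options are illustrative only.

```r
# The options as printed, read as closed intervals (an assumption)
options_original <- list(
  "0--5 minutes"         = function(x) x >= 0 & x <= 5,
  "5--10 minutes"        = function(x) x >= 5 & x <= 10,
  "more than 10 minutes" = function(x) x > 10
)

# A possible rewording with no shared endpoints (hypothetical)
options_revised <- list(
  "less than 5 minutes"       = function(x) x >= 0 & x < 5,
  "5 to less than 10 minutes" = function(x) x >= 5 & x < 10,
  "10 minutes or more"        = function(x) x >= 10
)

# Count how many options each candidate answer fits
count_matches <- function(options, answers) {
  sapply(answers, function(a) sum(sapply(options, function(f) f(a))))
}

answers <- c(0, 3, 5, 7.5, 10, 12)  # times in minutes

# An answer of 5 minutes fits two of the original options
count_matches(options_original, answers)

# Every answer fits exactly one of the revised options
count_matches(options_revised, answers)
```

A count of two or more shows options that are not mutually exclusive; a count of zero shows options that are not exhaustive.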


::: {.thinkBox .think data-latex="{iconmonstr-light-bulb-2-240.png}"}
What is the problem with this question?\label{thinkBox:QuestionProblem}

> Would this book that you are currently reading be useful for students and young professionals in the field?
  

`r if (knitr::is_latex_output()) '<!--'`
`r webexercises::hide()`
There are two questions; it is *double-barrelled*.

Asking the two questions separately is better: one about *students*, and one about *young professionals*. 
This separates the two components of the original question.
`r webexercises::unhide()`
`r if (knitr::is_latex_output()) '-->'`
:::

`r if (knitr::is_html_output()){
  'The following (humorous) video shows how questions can be manipulated by those not wanting to be ethical:'
}`

<div style="text-align:center;">
<iframe width="560" height="315" src="https://www.youtube.com/embed/G0ZZJXw4MTA" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture"></iframe>
</div>
<!-- Yes Prime Minister is a British political satire/ comedy that was aired in the 1980s. The original copyright belongs to BBC. Usage of this clip constitutes fair use for the purpose of education. -->


### Online and paper questionnaires {#OnlinePaperSurveys}

Questionnaires may be paper-based or online; both have advantages and disadvantages.

*Paper-based questionnaires* require the information to be manually entered into a computer for later analysis, which is time-consuming, expensive, and prone to data-entry errors.

Paper-based questionnaires can also be costly to prepare, especially if physical mailing and photocopying are necessary.
However, people may be more likely to complete paper-based questionnaires if they are presented with a questionnaire face-to-face and someone waits to collect the completed questionnaire.

*Online questionnaires* make data collection and data entry easier: data are entered directly onto a computer.
This means less manual handling and less chance of data entry errors.
Online questionnaires are also easier to share with a geographically diverse group of people (for example, through email or social media), but only if the relevant contact details are available.

However, online questionnaires may have a lower response rate, as respondents may be reluctant to click on links in emails (especially from unknown sources), may ignore emails, or the emails may be flagged as spam.


## Biases {#Biases}

Surveys are commonly used to obtain information from people, but not without challenges.

* Non-response bias (Sect. \@ref(SelectionBias)):
  Non-response bias can be present in many studies, but is common for questionnaires, as they are often used with voluntary-response samples.
  The people who *do not* respond to the survey may be different from those who *do* respond.
* Response bias (Sect. \@ref(SelectionBias)):
  People do not always answer truthfully;
  sometimes this is unintentional (because of poor questions), and sometimes it is due to embarrassment, boredom or controversial questions.
* Ecological validity (Sect. \@ref(InterpretApplicability)): 
  What people *say* may not correspond with what people *do*.
* Recall bias:
  People may not be able to accurately recall past events, or recall when they happened.


## Summary {#Chap10-Summary}

Having a detailed procedure for collecting the data (the **protocol**) is important.
Using a **pilot study** to trial the protocol can often reveal unexpected changes necessary for a good protocol.

Sometimes, data can be collected using questionnaires, either on **paper** or **online**.
However, creating good questionnaire questions is difficult.


## Quick review questions {#Chap10-QuickReview}


<!-- bromodosis: smelly feet -->
::: {.webex-check .webex-box}
1. What is the problem with this question: \tightlist
   Do you have bromodosis?
`r if( knitr::is_html_output() ) {longmcq( c(
                     "It is double-barrelled",
                     "It is a leading question",
                     answer = "It uses language that may not be understood",
                     "It is ambiguous") )}`

1. What is the problem with this question:
   Do you spend too much time connected to the internet?
`r if( knitr::is_html_output() ) {longmcq( c(
                     "It is double-barrelled",
                     "It is a leading question",
                     "It uses language that may not be understood",
                     answer = "It is ambiguous") )}`
1. What is the problem with this question:
   Do you eat fruits and vegetables?
`r if( knitr::is_html_output() ) {longmcq( c(
                     answer = "It is double-barrelled",
                     "It is a leading question",
                     "It uses language that may not be understood",
                     "It is ambiguous") )}`
1. Which of these is a purpose of producing a well-defined protocol?

   * It allows the researchers to make the study externally valid.\tightlist 
`r if( knitr::is_html_output() ) {torf( answer = FALSE )}`
   * It ensures that others know exactly what was done.
`r if( knitr::is_html_output() ) {torf( answer = TRUE )}`
   * It ensures that the study is repeatable for others.
`r if( knitr::is_html_output() ) {torf( answer = TRUE )}`

1. Are the following survey questions likely to be *leading* questions?

   * Do you, or do you not, believe that permeable pavements are a viable alternative to traditional pavements?\tightlist
`r if( knitr::is_html_output() ) {torf( answer = FALSE )}`
   * Do you support a ban on bottled water?
`r if( knitr::is_html_output() ) {torf( answer = TRUE )}`
   * Do you believe that double-gloving by paramedics reduces the risk of infection, increases the risk of infection, or makes no difference to the risk of infection?
`r if( knitr::is_html_output() ) {torf( answer = FALSE )}`
   * Should Australia ban breakfast cereals with unhealthy sugar levels?
`r if( knitr::is_html_output() ) {torf( answer = TRUE )}`
:::


## Exercises {#CollectionExercises}

Selected answers are available in Sect. \@ref(CollectionAnswer).

::: {.exercise #CollectSurveyQuestions1}
What is the problem with this question?

> What is your age? (Select one option)
>
> - Under 18
> - Over 18
:::


::: {.exercise #CollectSurveyQuestions2}
Which of these questionnaire questions is better, and why?

1. Should concerned cat owners vaccinate their pets?
2. Should domestic cats be required to be vaccinated or not?
3. Do you agree that pet-owners should have their cats vaccinated?
:::


::: {.exercise #SunscreenQuestions}
In a study of sunscreen use [@data:Falk2013:SunProtection], participants were asked questions that included these:

* How often do you sun bathe with the intention to tan during the summer in Sweden?  
  (Possible answers: never, seldom, sometimes, often, always).
* How long do you usually stay in the sun between 11am and 3pm, during a typical day-off in the summer (June--August)?  
  (Possible answers: <30 min, 30 min--1 h, 1--2 h, 2--3 h, >3 h).

Critique these questions.
:::


::: {.exercise #KidsEnvironmentQuestions}
In a study of children's knowledge of their natural environment [@moron2021children], primary school children (from Andalusia, Spain) were asked three questions:

*  Do you usually visit Guadaira Park?

   * No, I don’t like parks.
   * No, I don’t usually visit it.
   * Yes, once per week.
   * Yes, more than once a week.
*  How many times have you visited nature (the beach, countryside, mountains, etc.) in the last month?	

   * Never
   * Once
   * Two to three times
   * More than three times
*  Which is your favorite natural place?	
   
   * Write a story
   * Draw a picture

For the questions:

1. Which are *open* and which are *closed*?
1. Critique the questions.
:::


<!-- QUICK REVIEW ANSWERS -->
`r if (knitr::is_html_output()) '<!--'`
::: {.EOCanswerBox .EOCanswer data-latex="{iconmonstr-check-mark-14-240.png}"}
**Answers to in-chapter questions:**

- Sect. \ref{thinkBox:QuestionProblem}: It is *double-barrelled*.
Ask two questions: one about *students*, and one about *young professionals*. 

- \textbf{\textit{Quick review} questions:}
**1.** *Language*: Most people do not know what 'bromodosis' is, so how can they answer the question truthfully?
**2.** *Ambiguous*: 'Too much', compared to what?
**3.** *Double-barrelled*: Some people may eat fruits but not vegetables, for example.
**4.** The second and third statements are valid purposes of a protocol.
**5.** The second and fourth are likely to be *leading*.
:::
`r if (knitr::is_html_output()) '-->'`