-
Notifications
You must be signed in to change notification settings - Fork 6
/
03-lesson-03-getting-data-into-r-slides.Rmd
239 lines (148 loc) · 8.48 KB
/
03-lesson-03-getting-data-into-r-slides.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
---
title: "Getting Data into R"
author: "Ben Post"
date: "Last Updated: 23 April 2019"
output: ioslides_presentation
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
<!-- rename file with the lesson name replacing template -->
## Learning Objectives
1. Introducing the data pipeline
2. Get data out of your spreadsheet
3. Importing: Point and click in R studio
4. Importing: Scripting your import (and introducing file paths)
5. Doing a quick nose around data frames in R
<!-- * Getting data into R - (v1) Ahmed (v2) Ed (v3) Steve
keep theme of square tables from excel to csv now to data.frame
how to get these in
/how to do this 'automagically'
- [ ] teach comments
- [ ] teach script files
- [ ] foobar
-->
## The data pipeline
<!-- - [ ] TODO(2016-11-15): produce a graphic for this -->
Learning 'coding' is a pain, and for a 'one off' process there will nearly always be some tempting 'point-and-click' alternative.
Greatest benefits from learning to code:
- Reproducibility
- Documenting as you go along
- Saving time
## Traditional workflow
1. Collect data on paper
2. Transcribe paper forms into Excel or similar
3. Import Excel into SPSS or another stats package
4. Perform analyses and make graphs
5. Copy/paste these into your final document (Word, Powerpoint) etc
## Problems
- One early mistake, means repeating all steps by hand
- You can't remember what you did because you haven't looked at this for 5 days, 5 months or 5 years
- Other people can't collaborate or help because all the files are local, and even if they have them you weren't obsessive enough to write notes about each step
## A data pipeline (aka scripting)
Learn just enough code to write down your 'workflow'. Ensure the workflow starts with the raw data, and then any changes in the workflow or the data will automatically propagate.
1. Collect data electronically
2. Write a script that converts your data into figures and tables
3. Annotate and document your work
4. Knit everything into a single report(?!)
## Getting data out of your spreadsheet: CSV files
Arguably one of the oldest, and most ubiquitous is the 'comma separated values' (`.csv`) file. This is easily exportable from Microsoft Excel, Apple Numbers, Open Office, Google Sheets...etc.
It's a simple format. The top line are the column names, each seperated by a comma. The following lines are the observations in those columns, again, seperated by a comma.
It's strength is in it's simplicity. It only has data, no formulas, no tricks and is very well recognised amongst software packages as it is very easily supported. R has excellent support for `.csv`.
## Export CSV From Excel
![](img/getting_data_1.png)
## Exercise: Export an .xlsx file to .csv
1. Find the cleaned `.xlsx` file that was produced in the Excel Hell lecture and export it as a `.csv` file. If all else fails then we have a copy [here](https://figshare.com/s/28e6b022c0d3fe63909e).
2. Save the `.csv` file into a folder called `data/` nested within your project root folder.
Once a sheet has been exported and saved, it can be imported into R.
## Getting CSV data into R
There are many ways to do this:
1. Point-and-click within RStudio, which we introduced in Lesson 1 (R for Newbies).
2. Use the `read_csv()` function.
## Exercise: Import using Point-and-click
As a quick reminder from Lesson 1. Click the *Import Dataset* button, and select the *From Text (readr)...* option.
![](img/RStudio-import-data-button.png)
Then accept the defaults in the dialog box that appears.
## Point-and-click 2
![](img/getting_data_2.png)
## Scripting
Point and click is lovely and easy but _sadly_ not reproducible. A better option is to write down the location of your data. We are going to do this in 3 steps here.
1. Find the _path_ to your file. We covered file paths in [lesson 1](00-lesson-00-intro.html#files-and-directories), and this is just the formal address of your file on your computer. A quick way to do this in `R` is to use the `file.choose()` function.
2. We will take the _path_ you have generated and name it `myfile`.
3. We will use a _function_ called `read_csv()` to import the data into a data frame (which we in turn name `RCT` for convenience).
> Try right clicking a file in *Finder* (on a Mac) or *Windows Explorer* (on a PC). You'll normally see an option for 'info' or 'properties' that will show you the path to your file.
## Code example
```{r eval=FALSE, echo=TRUE}
install.packages("readr") # install only needed the 1st time
library(readr) # load the readr family of
# functions including read_csv
myfile <- file.choose()
myfile
dataframename <- read_csv(myfile)
```
This could have been done in one step, but it would have made things harder to read. Harder to read, means harder to remember, and we are doing our best to avoid that!
## Other functions
> There is a function `read.csv()` provided by R but `read_csv()` is better. The built-in function has a couple of annoying 'habits'.
> If you wish to use it then don't forget to specify: (1) `header = TRUE` which tells the function to expect column names in row 1 instead of data, and (2) `stringsAsFactors = FALSE` which is a necessary but annoying reminder to R that you want it to leave 'strings' alone and not convert them to 'labelled' factors.
## Exercise: Import the .csv into R
a. Import the .csv file from the Excel Hell lesson as a data frame and call it `RCT`.
## Exercise: Answer
a. Import the .csv file from the Excel Hell lesson as a data frame and call it `RCT`.
`RCT <- read_csv("data/file.csv")`
```{r eval=FALSE, echo=TRUE}
# Alternative solution:
myfile <- file.choose()
RCT <- read_csv(myfile)
```
## Everything is a "data frame"
Ok now you've managed to get your data into `R`! What next?
Spreadsheets in R are called 'data frames'.
A data frame has columns, each identified by a name, and rows for observations. This means that however you import your data into R within your pipeline, those data will end up as a data frame. Let's do a quick tour of data frames in R.
## Let's have a look at the data
We successfully brought in the RCT data and named it `RCT`. How do we look at it?
```{r eval=FALSE, echo=TRUE}
# Have a look at the first few rows
head(RCT)
# Have a look at the last few rows
tail(RCT)
# Have a look at the whole spreadsheet
View(RCT)
```
## Navigating within the data frame
The 'rows' contain measurements from different individuals, and the columns contain one type of measurement. Just like in Excel, any item (cell) can be accessed by its coordinates in the table which are always written as `[row, column]` (i.e. which row, then which column).
```{r eval=FALSE, echo=TRUE}
RCT[1,1] # top left cell
RCT[64,20] # bottom right cell
RCT[1, ] # first row (all columns)
RCT[ ,1] # first column (all rows)
```
## Selecting columns
Most of the time the rows are 'observations' and we want to pick out 'characteristics' of those observations (i.e. the columns). Rather than having to remember the column number, we can just ask for a column by name using the `$` operator: e.g. `dataframename$some_column`
```{r eval=FALSE, echo=TRUE}
# First we need to know what column names exist
names(RCT)
RCT$age
RCT$gender
```
## Exercise: Taking a quick look at the data
1. What are the names of the columns? Hint: try `names()`
2. Display in the console the items in the column `ps0h`
3. Can you do the same for all the other columns in `RCT`?
4. Can you find out how many patients were randomised to each arm? Hint: you can use the function `table()`
5. Can you find out how many subjects there are?
## Exercise: Answers
1. What are the names of the columns? `names(RCT)`
2. Display in the console the items in the column `RCT$ps0h`
3. Can you do the same for all the other columns in `RCT`? `RCT$gender`, `RCT$age`, etc.
4. Can you find out how many patients were randomised to each arm? `table(RCT$random)`
5. Can you find out how many subjects there are? `nrow(RCT)`
## Summary
We have learned:
1. Exporting an `.xlsx` file into a `.csv` file.
2. Importing data into R using
- point-and-click
- scripting
3. Navigating a data frame to inspect data.
## Homework
1. Can you import and look at the `outreach.csv` dataset? What is the average heart rate? How many patients were accepted to ICU (the column is called `icu_accept`)? What was the mortality?
2. Have a look at our online course material where we introduce the idea of bringing data in via Google Sheets (http://datascibc.org/Data-Science-for-Docs/03-lesson-03-getting-data-into-r.html)