-
Notifications
You must be signed in to change notification settings - Fork 1
/
organization.Rmd
202 lines (143 loc) · 12.4 KB
/
organization.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
---
title: "Data organization with spreadsheets"
output:
html_document:
toc: yes
toc_float: yes
html_notebook:
toc: yes
toc_float: yes
---
[<<BACK](https://remi-daigle.github.io/2017-CHONe-Data/)
My unofficial mantra:
!["Adam Savage via Pintrest"](https://s-media-cache-ak0.pinimg.com/564x/e9/f2/19/e9f219dce30f13670158832310b0e42c.jpg)
Data is a fundamental part of 'doing science' and spreadsheets are an important, perfectly valid[^1], tool for inputting and organizing your data.
[^1]: You may have heard some people, including myself, express great frustration regarding spreadsheet programs, namely MS Excel. This is not because Excel in itself is problematic to the process of good reproducible science, but rather the problem is that it is often misused! This problem is not unique to Excel, but rather these issues crop up in all spreadsheet programs. I'll refer to Excel throughout the lesson referring to spreadsheet programs in general. More on this [here](file:///C:/Users/Remi-Work/Desktop/2017-CHONe-Data/organization.nb.html#1_what_not_to_do_with_excel).
# What not to do with Excel
This title sounds negative, but I think by first showing you what not to do will help you use Excel (and other spreadsheet software) appropriately
## Be wary of dates
Excel is great for data entry, but as soon as you attempt more than rudimentary statistical analyses or data visualizations, it quickly becomes less than optimal. The options for these are rather limited and the details of how you calculated these things are hidden in cells or figure options. Worse yet, it does things to 'help' you, such as auto-formatting dates, which can be tremendously problematic if it goes unnoticed. For example, [this article](https://www.washingtonpost.com/news/wonk/wp/2016/08/26/an-alarming-number-of-scientific-papers-contain-excel-errors/?postshare=4161472211255740&tid=ss_tw) on how 1 in 5 genetics papers may have and Excel induced error. This kind of thing gives me nightmares!
> **Pro-Tip**
>
>While there are many correct ways to input your dates into Excel, I find that the best[^2] way is to split it information over multiple columns. Either `year`, `month`, and `day`, or even better `year`, and `DayOfYear` (i.e. 1 to 365). See [here](http://www.datacarpentry.org/spreadsheet-ecology-lesson/03-dates-as-data/) for more info on converting date formatting in Excel.
[^2]: By 'best' I mean that it avoids the common Excel problems and doesn't cause any problems in any downstream analysis in R or other.
## Keep your data tidy
(**tldr: tidy data = quality of life improved**)
The 'tidy' concept is something we will revisit in the R section, but the rules of tidy data are[^3]:
1. Each variable you measure should be in one column
2. Each different observation of that variable should be in a different row
3. There should be one table for each “kind” of variable
4. If you have multiple tables, they should include a column in the table that allows them to be linked
[^3]: [Jeff Leek, The Elements of Data Analytic Style, Leanpub, 2015-03-02](https://leanpub.com/datastyle)
> **Pro-Tips**
>
> - In practice, these rules generally translate into keeping your data 'square' (or rectangular), and not having multiple tables within one spreadsheet.
> - `locationID`, or `organismID` make great unique identifiers for 'linking' tables as per rule 4
Following these rules makes your data more human and machine readable. In my opinion the first 2 rules (and the 4th if you have multiple tables) are totally non-negotiable. Rule 3 for me is a little fuzzy since "kind" is a little ambiguous in my mind, and doesn't mesh with what I consider 'best-practice'. My rule of thumb is that if data were collected at the same spatial and temporal scales, they will be going into 1 table. For example, conductivity (salinity), temperature, and depth of the ocean data from a CTD is often recorded at the exact same time and place since all 3 sensors are built into the same instrument. Therefore, I would keep conductivity, temperature, and depth of the ocean in 1 table since I would otherwise be repeating identical time/space data. Conversely, a sample from a plankton net will not have the same temporal resolution as a CTD therefore should be in different tables, but these instruments can sample the same site so `locationID` would make a great column to link between tables.
## NEVER save as .xls or .xlsx (seriously)
(**tldr: re-read the title, it's the most important rule!!!**)
In science, it is important to write everything down (see Adam above), but it is also just as important very important to be able to read what you have written! Excel by default will save your spreadsheet in it's proprietary format, which means that if you don't have Excel you can't read your file[^4]! This is also not just a rule about Excel, but rather any proprietary software. **Always save in an open format rather than a proprietary format.***
[^4]: OK, I know there are other ways of opening an Excel file, but there is still an interoperability problem. Weird, unexpected things can often happen (e.g. again, dates!) when opening Excel files using alternative software (e.g. Open Office). Additionally, the proprietary Excel format changes over time, you may not be able to open old data files in the future.
> **Pro-Tip**
>
> If you have entered your data in an Excel spreadsheet, simply click on 'Save As' and select type 'CSV (Comma delimited)'
I recommend that everyone should save their data as a `.csv` (stands for comma separated values, the name of the format gives you a clue as to how the data is formatted!) instead of `.xlsx` because:
1. It's easy to do
2. You can still use Excel to enter your data
3. It's stored as a simple text file and is nearly universally readable
4. It avoids common Excel pitfalls
These pitfalls are an interesting point. How many times have you used:
- cell color (or bold text, etc) to convey information
- multiple tabs
- comments (the bubble inserts)
All these things are big no-no's since they will likely get scrambled, or not appear at all if you try to open in anything but Excel. Saving as `.csv` is a self reinforcing guarantee that you do not use these things since it will not save them. Multiple tabs should be multiple files, and comments (plus any other information encoded by special formatting) should be recorded in columns reserved for that use.
>**EXERCISE**
>
> - Open new spreadsheet in Excel, write a few rows of data into your spreadsheet and save your file as both a `.csv` and a `.xlsx`.
- Close Excel
- Open both files using Excel
- Open both files using a basic text editor (e.g. Notepad on Windows, or TextEdit on Mac)
Do the files look identical in Excel? In the text editor? Which format seems better to you and why?
## Don't re-invent the wheel
For column headers CHONe recommends you use the [Darwin Core Terms](http://rs.tdwg.org/dwc/terms/)
It contains helpful information about how to use standard column headers. It will help others (and yourself a few months after writing your data) understand your data. It also will streamline the process when you submit your data to be archived with [OBIS](http://www.iobis.org/).
Some common ones you will all likely use are for date/time: `year`, `month`, `day`, `eventTime`, `eventDate`, or for location: `locationID`, `decimalLongitude` and `decimalLatitude`.
For latitude and longitude, it's better to write: `-41.0983423` and `-121.1761111` and NOT: 47° 33' 37.9404'' N and 52° 42' 46.1880'' W because the decimal version will be easier for the computer to read and it is 'according to standard'!
For`eventdate` here are a few examples:
- `1963-03-08T14:07-0600` is 8 Mar 1963 2:07pm in the time zone six hours earlier than UTC
- `2009-02-20T08:40Z` is 20 Feb 2009 8:40am UTC
- `1809-02-12` is 12 Feb 1809
- `1906-06` is Jun 1906
- `1971` is just that year
- `2007-03-01T13:00:00Z/2008-05-11T15:30:00Z` is the interval between 1 Mar 2007 1pm UTC and 11 May 2008 3:30pm UTC
- `2007-11-13/15` is the interval between 13 Nov 2007 and 15 Nov 2007.
But I am personally uneasy about typing all of that into Excel, it may try to reinterpret what you wrote while trying to 'help' you. I would recommend keeping that information separated into the `year`, `month`, `day`, and `eventTime` columns
- `year` : 2008
- `month` : 1 (not "January")
- `day`: 28
- `eventTime`: "14:07-0600" is 2:07pm in the time zone six hours earlier than UTC
- `eventTime`: "08:40:21Z" is 8:40:21am UTC
- `eventTime`: "13:00:00Z/15:30:00Z" is the interval between 1pm UTC and 3:30pm UTC.
That way you are still following the standard, yet keeping Excel's eccentricities at bay and as a bonus, I think it's easier to work with the pre-separated variables in R. You can also generate an `evenDate` column in R by pasting all the rows together. (If the code below seems like gibberish, that's totally OK, we will explain all the steps when we get into R in the later lessons, just remember this code is here!)
```{r,message=FALSE}
require(tidyverse)
```
```{r}
data <- data.frame(
year = c(2008,2003,2009),
month = c(1,2,3),
day = c(28,12,12),
eventTime = c("12:00-0600","01:15-0600","14:07-0600")
)
# you may notice that my year, month, and day columns are numeric and won't have the require leading 0's
# we can fix that with sprintf:
sprintf("%02d",data$month)
# here's a 'one-liner' that will add leading zeros, paste all the columns together, and seperate them by the required '-'
data <- data %>%
mutate(eventDate=paste(sprintf("%04d",year),
sprintf("%02d",month),
sprintf("%02d",day),
eventTime,
sep="-")
)
data
```
Or alternatively, if you're like me and want to keep the time numeric for ease of programming, consider this (but be sure to recreate an `eventTime` that includes time zone information):
```{r}
data <- data.frame(
year = c(2008,2003,2009),
month = c(1,2,3),
day = c(28,12,12),
hour = c(12,1,14),
minute = c(0,15,7)
)
data <- data %>%
mutate(eventTime=paste0(sprintf("%02d",hour),
":",
sprintf("%02d",minute),
"-0600")
)
data
```
## Avoid other common headaches
- Not filling in zeros
- Using problematic null values
- Using problematic field names
- Using special characters in data
Zeros and null values are very different computationally and statistically! A zero is when you measured a 0, so that is useful information, and a null value is when there is no data, no information (e.g. measurement was not taken or was lost). Zeros should always be recorded `0` while null values are a bit trickier; either leaving the cell blank or inserting an `NA` are usually the least problematic options.
Many of the statistical software options are very fussy about spaces, periods, and slashes in field names (column headers). Underscores (`_`) are a good alternative to spaces. Consider writing names in camel case (like this: ExampleFileName) to improve
readability. Remember that abbreviations that make sense at the moment may not be so obvious in 6 months, but don't overdo it with names that are excessively long. Including the units in the field names avoids confusion and enables others to readily interpret your fields. Also, as always, check the standard ([Darwin Core Terms](http://rs.tdwg.org/dwc/terms/)) in case you should be using something specific, these are just general instructions for when yo need to make a new column name.
**Examples**
| Good Name | Good Alternative | Avoid |
|------------------|-------------------|-------------------|
| Max_temp_C | MaxTemp | Maximum Temp (°C) |
| Precipitation_mm | Precipitation | precmm |
| Mean_year_growth | MeanYearGrowth | Mean growth/year |
| sex | sex | M/F |
| weight | weight | w. |
| cell_type | CellType | Cell Type |
| Observation_01 | first_observation | 1st Obs |
# Important links
This section has been a shortened version of the Data Carpentry material on [spreadsheet organization](http://www.datacarpentry.org/spreadsheet-ecology-lesson/)
There is a twitter hashtag called [#otherpeoplesdata](https://twitter.com/hashtag/otherpeoplesdata?lang=en) for people venting about problems while using other people's data, hopefully this workshop will allow you to avoid that hashtag!
[<<BACK](https://remi-daigle.github.io/2017-CHONe-Data/)