# Data and variable manipulation
# Grammar of variables
## Prepare folder and data
### Set the working directory
This can be done in 2 ways:
1. Using code, with `setwd()`
2. Using point and click
To use point and click in RStudio, go to the Files pane, browse to the folder, click the down arrow next to *More*, then click 'Set As Working Directory'.
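For the code route, here is a minimal sketch using base R's `getwd()` and `setwd()`. The folder returned by `tempdir()` is only a stand-in so that the chunk runs anywhere; in practice you would pass the path to your own project folder.

```{r}
old_wd <- getwd()   # remember where we started
setwd(tempdir())    # stand-in folder (assumption); replace with your own project path
getwd()             # confirm the change
setwd(old_wd)       # go back to the original folder
```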
## Read Data
```{r}
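library(foreign)  # provides read.dta() for Stata files
# convert.factors = TRUE turns Stata value labels into R factors
data_qol <- read.dta('qol.dta', convert.factors = TRUE)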
str(data_qol)
```
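To see what `convert.factors` does without needing `qol.dta`, the sketch below writes a small made-up data frame to a temporary Stata file with `write.dta()` and reads it back both ways. The data frame `toy` and its values are invented for illustration.

```{r}
library(foreign)
# Made-up data (assumption), not taken from qol.dta
toy <- data.frame(id = 1:3,
                  sex = factor(c('male', 'female', 'male')))
path <- tempfile(fileext = '.dta')   # file in the session's temporary folder
write.dta(toy, path)                 # factors are written as Stata value labels
with_factors <- read.dta(path, convert.factors = TRUE)
with_codes   <- read.dta(path, convert.factors = FALSE)
str(with_factors$sex)   # read back as a factor
str(with_codes$sex)     # read back as the underlying numeric codes
```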
## Browse data
1. First few rows, with `head()`
2. Last few rows, with `tail()`
```{r}
head(data_qol)
tail(data_qol)
```
## Grammar of variables
### Select columns
Let us create a new data frame with only sex, age and hba1c as the variables.
```{r}
data_qol2<-subset(data_qol, select = c('sex', 'age', 'hba1c'))
str(data_qol2)
```
Alternatively, we can select the same columns by name with bracket indexing:
```{r}
data_qol3<-data_qol[,c('sex','age','hba1c')]
str(data_qol3)
```
### Select rows
```{r}
data_qol4<-subset(data_qol, age > 30)
str(data_qol4)
summary(data_qol4$age)
```
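Alternatively, we can put a logical condition inside the brackets. Unlike `subset()`, this returns a row of `NA`s for every observation where `age` is missing: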
```{r}
data_qol5<-data_qol[data_qol$age>30,]
str(data_qol5)
summary(data_qol5$age)
```
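The two approaches above agree when `age` has no missing values but differ when it does. The sketch below shows the difference on a made-up data frame (`toy` and its values are invented for illustration), along with `which()` as one way to make bracket indexing drop the missing cases.

```{r}
# Made-up data (assumption): the third person has a missing age
toy <- data.frame(id = 1:4, age = c(25, 45, NA, 62))
by_subset  <- subset(toy, age > 30)        # NA in the condition is treated as FALSE
by_bracket <- toy[toy$age > 30, ]          # NA in the condition gives a row of NAs
by_which   <- toy[which(toy$age > 30), ]   # which() keeps only the TRUE positions
nrow(by_subset)
nrow(by_bracket)
nrow(by_which)
```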
### Select rows and columns together
```{r}
data_qol6<-subset(data_qol,age>30 & sex=='male', select = c(id, sex, age, group))
str(data_qol6)
table(data_qol6$sex)
```
### Generate a new variable
```{r}
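# Start by copying age into a new column, age_cat
data_qol$age_cat <- data_qol$age
# View() opens the data viewer and is meant for interactive sessions
View(data_qol)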
```
### Categorize into new variables
#### From a numerical variable
```{r}
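# right = FALSE makes the intervals [min, 40), [40, 60), [60, Inf),
# so they match the labels and the youngest person is not set to NA
data_qol$age_cat <- cut(data_qol$age_cat,
                        breaks = c(min(data_qol$age), 40, 60, Inf),
                        labels = c('min-39', '40-59', '60-above'),
                        right = FALSE)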
min(data_qol$age)
table(data_qol$age_cat)
str(data_qol$age_cat)
```
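How `cut()` treats values that sit exactly on a break depends on its `right` argument, which is easy to check on a few made-up ages (the vector `ages` is invented for illustration). By default the intervals are closed on the right, so the smallest value falls outside the first interval and becomes `NA`, and 40 is grouped with the under-40s. With `right = FALSE` the intervals are closed on the left instead.

```{r}
ages   <- c(25, 39, 40, 59, 60, 75)   # made-up values (assumption)
breaks <- c(min(ages), 40, 60, Inf)
labs   <- c('min-39', '40-59', '60-above')
default_cut <- cut(ages, breaks = breaks, labels = labs)                 # (25,40], (40,60], (60,Inf]
left_closed <- cut(ages, breaks = breaks, labels = labs, right = FALSE)  # [25,40), [40,60), [60,Inf)
data.frame(ages, default_cut, left_closed)
```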
#### From a categorical variable
```{r}
table(data_qol$tx)
str(data_qol$tx)
```
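Next, create a two-level variable, diet only versus diet plus drug, by collapsing the levels of `tx`. This takes a few steps: copy the variable, check its levels, then recode them with `revalue()` from the 'plyr' package.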
```{r}
data_qol$tx2<-data_qol$tx
str(data_qol$tx2)
str(data_qol$tx)
table(data_qol$tx2)
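library(plyr)  # provides revalue()
# Map each old level (left) to its new label (right); the three drug levels collapse into 'med'
data_qol$tx2 <- revalue(data_qol$tx, c('diet only' = 'diet', 'OHA and diet only' = 'med',
                                       'insulin and diet only' = 'med', 'all' = 'med'))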
table(data_qol$tx2)
```
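The same collapsing can be done in base R by assigning a named list to `levels()`, where each name is a new level and each element lists the old levels it absorbs. This is an alternative sketch, not the approach used above. The vector `tx` here is made up for illustration and simply reuses the level names from the `revalue()` call.

```{r}
# Made-up factor (assumption) with the same level names as above
tx <- factor(c('diet only', 'OHA and diet only', 'insulin and diet only',
               'all', 'diet only'))
tx2 <- tx
# New level name = old level(s) to merge into it
levels(tx2) <- list(diet = 'diet only',
                    med  = c('OHA and diet only', 'insulin and diet only', 'all'))
table(tx, tx2)   # cross-tabulate old against new to check the recoding
```

Cross-tabulating the old variable against the new one is a quick way to confirm that every old level ended up where you intended.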
### Dealing with missing data
```{r}
data_qol$tx3<-data_qol$tx
str(data_qol$tx3)
str(data_qol$tx)
table(data_qol$tx3)
```
#### Replace values with 'NA'
```{r}
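# Recode the level 'diet only' as missing
data_qol$tx3 <- revalue(data_qol$tx, c('diet only' = NA))
# useNA = 'ifany' makes table() show the NA count, which is hidden by default
table(data_qol$tx3, useNA = 'ifany')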
str(data_qol$tx3)
```
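A base R way to do the same thing is to assign `NA` through a logical index and then drop the level that is no longer used. The sketch below uses a made-up factor (`tx` and its values are invented for illustration).

```{r}
tx <- factor(c('diet only', 'all', 'diet only', 'all'))   # made-up values (assumption)
tx3 <- tx
tx3[tx3 == 'diet only'] <- NA   # set matching observations to missing
tx3 <- droplevels(tx3)          # remove the now-empty 'diet only' level
table(tx3, useNA = 'ifany')     # show the NA count alongside the remaining level
sum(is.na(tx3))                 # number of missing values
```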
## Additional packages
### Package 'dplyr'
The 'dplyr' package encourages users to think in verbs when manipulating variables (columns) and observations (rows): each function is named after the action it performs.
It has 9 useful functions:
1. filter()
2. arrange()
3. select()
4. distinct()
5. mutate() and transmute()
6. summarise()
7. sample_n() and sample_frac()
The 'dplyr' functions are especially useful when combined with `group_by()`, which makes the verbs that follow, such as `summarise()`, operate on each group separately.
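As a brief illustration of these verbs, here is a sketch on a made-up data frame. The data frame `toy`, its column names and all its values are invented for illustration and are not taken from `qol.dta`.

```{r}
library(dplyr)
# Made-up data (assumption)
toy <- data.frame(id    = 1:6,
                  sex   = c('male', 'female', 'male', 'female', 'male', 'female'),
                  age   = c(25, 45, 62, 38, 51, 70),
                  hba1c = c(6.5, 8.1, 7.2, 9.0, 7.8, 6.9))
# filter() picks rows, select() picks columns, arrange() sorts
older <- toy %>%
  filter(age > 30) %>%
  select(id, sex, age) %>%
  arrange(desc(age))
# group_by() followed by summarise() gives one row per group
by_sex <- toy %>%
  group_by(sex) %>%
  summarise(n = n(), mean_hba1c = mean(hba1c))
older
by_sex
```

If 'plyr' is also loaded, attach 'dplyr' after it so that the 'dplyr' versions of shared names such as `summarise()` and `mutate()` are the ones R finds first.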