-
Notifications
You must be signed in to change notification settings - Fork 0
/
useful_R.qmd
192 lines (116 loc) · 5.98 KB
/
useful_R.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
---
title: "Some useful R"
---
Before we start with the splicing analysis in R I want to introduce some important functions and operators that will help us to process the splicing output.
```{r}
library(knitr)
library(dplyr)
```
## The pipe operator %\>%
Let's start with the %\>% operator. This operator allows the piping of a function's output to another function, where it is used as input.
Assume we have a vector with decimal numbers and want to round them to integers and subsequently calculate the sum. One way to do this is the following.
```{r}
decimals <- c(2.5, 1.2, 3.6, 4.1, 4.6, 7.3, 9.2, 3.1, 5.3)
sum(round(decimals))
```
As you can see the command is quite hard to read and it becomes harder if you apply more than two functions. One nice alternative is the mentioned *%\>% operator*. Note that most of the time it is not nessary to add () behind a function's name.
```{r}
decimals <- c(2.5, 1.2, 3.6, 4.1, 4.6, 7.3, 9.2, 3.1, 5.3)
round(decimals) %>% sum
```
This helps already a lot. But we can make it even more easier to read.
```{r}
decimals <- c(2.5, 1.2, 3.6, 4.1, 4.6, 7.3, 9.2, 3.1, 5.3)
decimals %>% round %>% sum
```
As you can see we can directly pipe the the vector called decimals to *round()* and the output to *sum()*.
## Count the number of occurences with the table function
Often we have a data.frame column with some specific information that we want to summarize. Here the *table()* function is helpful as it counts how often a certain string or value is found in the column.
Assume we have a data.frame that stores the regulation of eight genes (A to H) and we want to know how many genes are upregulated, not regulated or downregulated. We can easily solve this task with *table()*.
```{r}
regulationDataFrame <- data.frame(
gene=c("A", "B", "C", "D","E", "F", "G", "H"),
regulation=c("up", "down", "up", "no", "no", "down", "up", "up")
)
table(regulationDataFrame$regulation)
```
As you can see we now have a nice overview about the regulation. One nice feature of *table()* is that you can provide more than one column as input.
For instance we could also have information about the gene type and want to summarize the regulation across the different gene types. Here is an easy example, where we have 3 gene types, namely protein-coding, miRNA and rRNA.
```{r}
regulationDataFrame$type = c("protein-coding", "rRNA", "protein-coding", "miRNA",
"miRNA", "rRNA", "protein-coding", "protein-coding")
table(regulationDataFrame$type, regulationDataFrame$regulation)
```
We can see that the upregulated genes are encoding for proteins, while the downregulated genes encode rRNAs and the non-regulated ones miRNAs.
## Use kable to make a table or data.frame output look nicer
The *kable()* function from the knitr package makes tables look nicer in the html/pdf reports.
```{r}
table(regulationDataFrame$type, regulationDataFrame$regulation) %>% kable()
```
## for loops
*For* loops are the basic loop function in R. We will use it to loop over lists.
```{r}
for(i in 1:5){
print(i)
}
for(i in 1:5){
y = i + 3
print(paste(i, "+ 3 =", y ))
}
```
## Handling of data.frames
Most of the time we will work with with the results of the splicing analysis in data.frame format. There are several nice functions that help with data frames. Some of them are quickly shown in the following.
### Adding new columns with mutate
The *mutate()* function is an easy way to add new columns to a data.frame.
```{r}
# mutate can write the same word in every row
regulationDataFrame %>% mutate(newColumn = "new")
# duplicate another row
regulationDataFrame %>% mutate(newColumn = gene)
# or modify an existing column.
regulationDataFrame %>%
mutate(newColumn = paste0(regulation, "-regulation"))
```
### Using case_when to mutate conditionally
*mutate()* can also be use conditionally by combining it with the *case_when()* function.
```{r}
regulationDataFrame %>%
mutate(change = case_when( regulation == "no" ~ "no change",
T ~ "change")
)
```
### Picking only certain rows (subset) or columns (select)
We can subset a data.frame to only the certein rows with the *subset()* function. For this we need to pass a logical argument to the subset function.
Quick reminder: A logical statement will output a vector of TRUE or FALSE. Typical logical operators in R are *==* for is equal and *!=* for is not equal. Another nice one is *%in%*, which checks if a value is inside a vector.
```{r}
# logical statements
regulationDataFrame$regulation == "up"
regulationDataFrame$regulation != "up"
regulationDataFrame$regulation %in% c("up", "down")
# subsetting for specific rows
regulationDataFrame %>% subset(regulation == "up")
```
### Extract certain patterns with grepl()
The *grepl()* function also returns a logical vector. It can be used to search for a specific pattern within a vector. This can then also be used together with subset, to subset a data.frame for the rows that contains this pattern in a specific row.
```{r}
grepl(regulationDataFrame$type, pattern = "protein-coding")
grepl(regulationDataFrame$type, pattern = "protein")
subset(regulationDataFrame, grepl(regulationDataFrame$type, pattern = "protein-coding"))
```
### Arranging data.frames
data.frames can be sorted by a speciffic coulmn or also by multiple columns with the *arrange()* function. *arrange()* sorts character columns by alphabetical order and numeric columns from lowest to highest. the function *desc()* can be used to sort from highest to lowest instead.
```{r}
# arranging a data.frame by a character column
regulationDataFrame %>% arrange(type)
# arranging a data.frame by a numeric column
regulationDataFrame$someNumber <- c(1,5,8,2,2,4,3,5)
regulationDataFrame %>% arrange(someNumber)
# arranging by two columns
regulationDataFrame %>% arrange(type, someNumber)
# arranging from highest to lowest
regulationDataFrame %>% arrange(desc(someNumber))
```
### Grouping data.frames
```{r}
regulationDataFrame %>% group_by(regulation) %>% arrange(desc(someNumber), .by_group = T)
```