## Exercise 10: Regular expressions
Create the script "exercise10.R" and save it to the "Rcourse/ExtraModule" directory: you will save all the commands of exercise 10 in that script.
<br>Remember you can comment the code using #.
<details>
<summary>
*Answer*
</summary>
```{r, eval=F}
# Check the current working directory
getwd()
# Move to the course directory
setwd("~/Rcourse/ExtraModule")
```
</details>
**1- Play with grep**
* Create the following data frame
```{r}
df2 <- data.frame(age=c(32, 45, 12, 67, 40, 27),
                  citizenship=c("England", "India", "Spain", "Brasil", "Tunisia", "Poland"),
                  row.names=paste(rep(c("Patient", "Doctor"), c(4, 2)), 1:6, sep=""),
                  stringsAsFactors=FALSE)
```
Using **grep**, create the smaller data frame **df3** that contains only the patients' rows, not the doctors'.
<details>
<summary>
*Answer*
</summary>
```{r}
# Show the row names
rownames(df2)
# Get the positions of the row names that correspond to patients
grep("Patient", rownames(df2))
# Create a data frame that contains only those rows
df3 <- df2[grep("Patient", rownames(df2)), ]
```
</details>
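By default, `grep` returns the positions of the matching elements, which is why its result can be used directly as a row index. The following is a minimal sketch of two related forms, `value=TRUE` and `grepl`. The vector `ids` is a hypothetical stand-in for the row names of df2, and the `^` anchor (not used in the answer above) restricts the match to the start of the string.

```{r}
# Hypothetical vector mimicking the row names of df2
ids <- c("Patient1", "Patient2", "Patient3", "Patient4", "Doctor5", "Doctor6")

# Positions of the matching elements (the default behaviour)
idx <- grep("^Patient", ids)

# The matching elements themselves
hits <- grep("^Patient", ids, value=TRUE)

# One TRUE/FALSE per element, which can also be used as a row index
is_patient <- grepl("^Patient", ids)
```

Either `idx` or `is_patient` could be placed inside `df2[ , ]` to select the same rows.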
* Use **grep** and ***one*** regular expression to retrieve from df2 patients/doctors coming from either **Brasil** or **Spain**. Brainstorm a bit!
<details>
<summary>
*Answer*
</summary>
```{r, eval=F}
df2[grep("a[a-z]*i", df2$citizenship),]
```
</details>
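The pattern `a[a-z]*i` reads as: an "a", then any number of lowercase letters, then an "i". One way to check which part of each string a pattern matches is to combine `regexpr` with `regmatches`. This is only a sketch; `countries`, `pattern`, `matched` and `pieces` are illustrative names that are not part of the exercise.

```{r}
# Same values as the citizenship column of df2
countries <- c("England", "India", "Spain", "Brasil", "Tunisia", "Poland")
pattern <- "a[a-z]*i"

# Which countries contain a match?
matched <- countries[grepl(pattern, countries)]

# Which substring did the pattern match in each of them?
pieces <- regmatches(countries, regexpr(pattern, countries))
```

Printing `matched` and `pieces` side by side shows that no other country has an "i" anywhere after an "a".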
* Use **grep** and ***one*** regular expression to retrieve from df2 patients/doctors coming from either **Brasil** or **England**.
<details>
<summary>
*Answer*
</summary>
```{r, eval=F}
df2[grep("[gi]l", df2$citizenship),]
```
</details>
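The character class `[gi]` matches exactly one character, either "g" or "i", so `[gi]l` is a compact way of writing the alternation `gl|il`. A small sketch comparing the two forms, with `countries` again standing in for `df2$citizenship`:

```{r}
# Same values as the citizenship column of df2
countries <- c("England", "India", "Spain", "Brasil", "Tunisia", "Poland")

# Character class: "g" or "i", followed by "l"
with_class <- grep("[gi]l", countries, value=TRUE)

# Equivalent alternation: "gl" or "il"
with_alternation <- grep("gl|il", countries, value=TRUE)

# Both approaches select the same countries
identical(with_class, with_alternation)
```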
**2- Play with gsub**
Build this vector of file names:
```{r}
vector1 <- c("L2_sample1_GTAGCG.fastq.gz", "L1_sample2_ATTGCC.fastq.gz",
"L1_sample3_TGTTAC.fastq.gz", "L4_sample4_ATGGTA.fastq.gz")
```
Use **gsub** and an appropriate **regular expression** on **vector1** to retrieve only "sample1", "sample2", "sample3" and "sample4".
<details>
<summary>
*Answer*
</summary>
```{r}
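# | is used as OR: remove either the lane prefix or the barcode plus file extension
# \\. matches a literal dot (an unescaped . matches any character)
gsub(pattern="^L[124]_|_[ATGC]{6}\\.fastq\\.gz$",
     replacement="",
     x=vector1)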
```
</details>
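An alternative to deleting the unwanted parts is to capture the wanted part and keep only that. The sketch below assumes every file name follows the layout `L<digits>_sample<digits>_<anything>`. It uses `sub` with a capture group and the backreference `\\1`, and `samples` is an illustrative name.

```{r}
vector1 <- c("L2_sample1_GTAGCG.fastq.gz", "L1_sample2_ATTGCC.fastq.gz",
             "L1_sample3_TGTTAC.fastq.gz", "L4_sample4_ATGGTA.fastq.gz")

# The parentheses capture the sample name; \\1 refers back to the captured text
samples <- sub(pattern="^L[0-9]+_(sample[0-9]+)_.*$",
               replacement="\\1",
               x=vector1)
samples
```

Because the pattern is anchored with `^` and `$`, the whole file name is replaced by the captured group, so a single `sub` call is enough.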