From fec36371a283b9ae0f3c4988e9554b94965ebcf6 Mon Sep 17 00:00:00 2001 From: domonik Date: Tue, 23 Apr 2024 11:52:27 +0200 Subject: [PATCH] fixed timetable and order --- assets/timetable.json | 23 +++--- exercise-sheet-8.Rmd | 174 ++++++++++++++---------------------------- exercise-sheet-9.Rmd | 174 ++++++++++++++++++++++++++++-------------- 3 files changed, 184 insertions(+), 187 deletions(-) diff --git a/assets/timetable.json b/assets/timetable.json index bf183e1..8c9e1c5 100644 --- a/assets/timetable.json +++ b/assets/timetable.json @@ -1,16 +1,13 @@ { "exercise-sheet-1": "2024-04-17T09:00:00", - "exercise-sheet-2": "2024-12-30T09:00:00", - "exercise-sheet-2": "2024-12-30T09:00:00", - "exercise-sheet-3": "2024-12-30T09:00:00", - "exercise-sheet-4": "2024-12-30T09:00:00", - "exercise-sheet-5": "2024-12-30T09:00:00", - "exercise-sheet-6": "2024-12-30T09:00:00", - "exercise-sheet-7": "2024-12-30T09:00:00", - "exercise-sheet-8": "2024-12-30T09:00:00", - "exercise-sheet-9": "2024-12-30T09:00:00", - "exercise-sheet-10": "2024-12-30T09:00:00", - "exercise-sheet-11": "2024-12-30T09:00:00", - "exercise-sheet-12": "2024-12-30T09:00:00" - + "exercise-sheet-2": "2024-04-30T09:00:00", + "exercise-sheet-3": "2024-05-07T09:00:00", + "exercise-sheet-4": "2024-05-14T09:00:00", + "exercise-sheet-5": "2024-05-28T09:00:00", + "exercise-sheet-6": "2024-06-04T09:00:00", + "exercise-sheet-7": "2024-06-11T09:00:00", + "exercise-sheet-8": "2024-06-18T09:00:00", + "exercise-sheet-9": "2024-06-25T09:00:00", + "exercise-sheet-10": "2024-07-02T09:00:00", + "exercise-sheet-11": "2024-07-09T09:00:00" } diff --git a/exercise-sheet-8.Rmd b/exercise-sheet-8.Rmd index 030e589..47c19f3 100644 --- a/exercise-sheet-8.Rmd +++ b/exercise-sheet-8.Rmd @@ -6,22 +6,27 @@ library(officer) ``` --- -title: "Exercise sheet 8: Suffix-Trees" +title: "Exercise sheet 9: Data Driven Life Sciences" --- --------------------------------- # Exercise 1 -You are given the text T=`CAGTAGTAGC`. 
+### 1a) +::: {.question data-latex=""} +Arrange the following terms into their correct order in the Illumina sequencing method and describe each of them briefly: +- bridge amplification -### 1a) +- deblocking -::: {.question data-latex=""} +- library preparation + +- annealing of template strands to flow cell -Draw the corresponding suffix tree! +- fluorescence detection ::: #### {.tabset} @@ -31,125 +36,79 @@ Draw the corresponding suffix tree! ##### Solution ::: {.answer data-latex=""} -```{r, echo=FALSE, out.width="100%", fig.align='center'} -knitr::include_graphics("figures/sheet-8/suffix_tree_1.png") -``` -::: +**1. Library preparation:** -#### {-} +A sequencing *library* gets *prepared* from a sample by fragmenting the original DNA and adding Illumina-specific adapter sequences to both ends of the fragments. The *library* is what gets read during sequencing. +**2. Template strand annealing** -### 1b) -::: {.question data-latex=""} +The single-stranded library fragments are used as *template strands* in the sequencing and are *annealed* to primer sequences, which are bound to the *flow cell* and are complementary to the adapter sequences of the fragments. -Describe the steps of a counting query for $P =$ `TAG`. -::: +**3. Bridge amplification** -#### {.tabset} +After complementary strands have been synthesized and the templates been washed off, the now flow cell-bound fragments are *amplified* in several cycles of so-called *bridge-amplification* to form fragment colonies, or *clusters* on the flow cell to guarantee a detectable fluorescence signal during sequencing. -##### Hide +**4. Fluorescence detection** -##### Solution -::: {.answer data-latex=""} +Illumina-sequencing is a form of *sequencing-by-synthesis* in which the nucleotides incorporated into the growing strand are detected via attached *fluorophores*. 
After the first $3$ steps, the following steps are iterated to sequence the entire read: -* start at root node -* locate outgoing edge that starts with $T$ -* match subsequent characters of the pattern -* in the subtree rooted at TAG count the number of leaves $\Rightarrow 2$ -::: -#### {-} +Modified nucleotides, containing a fluorescent group, are used to extend the strand, their blocking groups are cleaved from their 3`-OH groups. +**5. Deblocking** +*Deblocking* is the removal of the fluorophore (blocking group). It is necessary before a new round of elongation by one nucleotide can begin. -### 1c) -::: {.question data-latex=""} -Describe the steps of a reporting query for $P =$ `AG`. -::: - -#### {.tabset} - -##### Hide - -##### Solution -::: {.answer data-latex=""} - -* start at root node -* locate outgoing edge that start with $A$ -* match subsequent characters of the pattern -* in the subtree rooted at AG report the labels of all leaves $\Rightarrow \{2, 5, 8\}$ +More information about this topic can be found on the [Illumina Webpage](https://www.illumina.com/science/technology/next-generation-sequencing/sequencing-technology.html). ::: #### {-} # Exercise 2 +```{r, echo=FALSE, out.width="75%", fig.align='center'} +knitr::include_graphics("figures/sheet-9/crossword.png") +``` + ### 2a) ::: {.question data-latex=""} -Draw a generalized suffix tree for the sequences $A=$`CCATG` and $B=$ `CATG`. -::: +**Solve the crossword puzzle!** -#### {.tabset} +Horizontal: -##### Hide +- 3. Added to DNA fragments during library preparation. -##### Hint 1 -::: {.answer data-latex=""} +- 8. Illumina way of determining the order of nucleotides in a DNA strand. (3 words) -Concatenate the two sequences using a unique character for splitting. e.g. -`CCATG#CATG$`. +- 9. ChIP-Seq can be used for sequencing DNA regions that are bound by these. -Dont forget to include suffix links! -::: -##### Formulae -::: {.answer data-latex=""} +- 11. The alphabet of life. -$sl(v) = w$ +- 12. 
Formed by bridge-amplification on Illumina flow-cells. -$\overline{v} = cb$ +- 13. Flowcell surface filled with these 2 different DNA molecules. -$\overline{w} = b$ +- 15. Measure to asses the quality of the identification of nucleobases generated by automated DNA sequencing. (3 words) -$c: character, b: string$ +Vertical: -remember: $\overline{v}$ denotes the concatenation of all path labels from the root to v. -::: -##### Solution -::: {.answer data-latex=""} +- 1. Dideoxynucleosidetriphosphates (abbrev.) -```{r, echo=FALSE, out.width="100%", fig.align='center'} -knitr::include_graphics("figures/sheet-8/suffix_tree_2.png") -``` -::: -#### {-} +- 2. Process of determining positions of reads on the reference genome. -### 2b) -::: {.question data-latex=""} +- 4. Gene expression can be measured using this. (abbrev. hyph.) -Find the Maximal Unique Matches of the sequences $A=$`CCATG` and $B=$`CATG` using -the tree from A). -::: - -#### {.tabset} +- 5. The process of making many copies of a piece of DNA. -##### Hide +- 6. Found in pairs in DNA. -##### Solution -::: {.answer data-latex=""} +- 7. Chemical group attached to nucleotides to monitor incorporation into DNA. -`CATG` is the only MUM as $\overline{v} =$ `CATG` has no suffix links pointing to -it -::: -#### {-} +- 10. File format used to store sequence information. +- 14. Breakthrough sequencing method (abbrev.) -# Exercise 3 - -### 3a) -::: {.question data-latex=""} - -Draw a generalized suffix tree for the sequence $A=$`ACGCACGCG`. ::: #### {.tabset} @@ -158,55 +117,40 @@ Draw a generalized suffix tree for the sequence $A=$`ACGCACGCG`. 
##### Solution ::: {.answer data-latex=""} - -```{r, echo=FALSE, out.width="100%", fig.align='center'} -knitr::include_graphics("figures/sheet-8/suffix_tree_3.png") +```{r, echo=FALSE, out.width="75%", fig.align='center'} +knitr::include_graphics("figures/sheet-9/crossword_solved.png") ``` ::: - #### {-} +# Exercise 3 + +#### {.tabset} -### 3b) +### 3a) ::: {.question data-latex=""} +You want to determine how many reads $N$ are needed to achieve a coverage depth $C$ of 20X when sequencing reads for *Escherichia coli*. -Find all maximal pairs of length at least 2. +The length of the reads $L$ is 30nt and the *E. coli* genome $G$ is approximately 4.6 million bases long. ::: #### {.tabset} ##### Hide -##### Solution +##### Formula ::: {.answer data-latex=""} - -`ACGC`: $(1,5,4)$ - -`CG`: $(2,8,2), (6,8,2)$ +$$ +N = \frac{C\times G}{L} +$$ ::: -#### {-} - - -### 3c) -::: {.question data-latex=""} - -Why is `C`: $(2, 8, 1)$ not a maximal pair? - -::: - -#### {.tabset} - -##### Hide ##### Solution ::: {.answer data-latex=""} - -It is not right maximal. -This can be seen since `CG`: $(2, 8, 2)$ already includes the indices 2 and 8 with -a longer match. - +$$ +N = \frac{20\times 4600000}{30} \approx 3066667 \text{ reads} +$$ ::: -#### {-} diff --git a/exercise-sheet-9.Rmd b/exercise-sheet-9.Rmd index 47c19f3..030e589 100644 --- a/exercise-sheet-9.Rmd +++ b/exercise-sheet-9.Rmd @@ -6,27 +6,22 @@ library(officer) ``` --- -title: "Exercise sheet 9: Data Driven Life Sciences" +title: "Exercise sheet 8: Suffix-Trees" --- --------------------------------- # Exercise 1 +You are given the text T=`CAGTAGTAGC`. -### 1a) -::: {.question data-latex=""} -Arrange the following terms into their correct order in the Illumina sequencing method and describe each of them briefly: - -- bridge amplification -- deblocking -- library preparation +### 1a) -- annealing of template strands to flow cell +::: {.question data-latex=""} -- fluorescence detection +Draw the corresponding suffix tree! 
::: #### {.tabset} @@ -36,79 +31,125 @@ Arrange the following terms into their correct order in the Illumina sequencing ##### Solution ::: {.answer data-latex=""} -**1. Library preparation:** +```{r, echo=FALSE, out.width="100%", fig.align='center'} +knitr::include_graphics("figures/sheet-8/suffix_tree_1.png") +``` +::: -A sequencing *library* gets *prepared* from a sample by fragmenting the original DNA and adding Illumina-specific adapter sequences to both ends of the fragments. The *library* is what gets read during sequencing. +#### {-} -**2. Template strand annealing** -The single-stranded library fragments are used as *template strands* in the sequencing and are *annealed* to primer sequences, which are bound to the *flow cell* and are complementary to the adapter sequences of the fragments. +### 1b) +::: {.question data-latex=""} -**3. Bridge amplification** +Describe the steps of a counting query for $P =$ `TAG`. +::: -After complementary strands have been synthesized and the templates been washed off, the now flow cell-bound fragments are *amplified* in several cycles of so-called *bridge-amplification* to form fragment colonies, or *clusters* on the flow cell to guarantee a detectable fluorescence signal during sequencing. +#### {.tabset} -**4. Fluorescence detection** +##### Hide -Illumina-sequencing is a form of *sequencing-by-synthesis* in which the nucleotides incorporated into the growing strand are detected via attached *fluorophores*. After the first $3$ steps, the following steps are iterated to sequence the entire read: +##### Solution +::: {.answer data-latex=""} -Modified nucleotides, containing a fluorescent group, are used to extend the strand, their blocking groups are cleaved from their 3`-OH groups. +* start at root node +* locate outgoing edge that starts with $T$ +* match subsequent characters of the pattern +* in the subtree rooted at TAG count the number of leaves $\Rightarrow 2$ +::: +#### {-} -**5. 
Deblocking** -*Deblocking* is the removal of the fluorophore (blocking group). It is necessary before a new round of elongation by one nucleotide can begin. +### 1c) +::: {.question data-latex=""} -More information about this topic can be found on the [Illumina Webpage](https://www.illumina.com/science/technology/next-generation-sequencing/sequencing-technology.html). +Describe the steps of a reporting query for $P =$ `AG`. +::: + +#### {.tabset} + +##### Hide + +##### Solution +::: {.answer data-latex=""} + +* start at root node +* locate outgoing edge that start with $A$ +* match subsequent characters of the pattern +* in the subtree rooted at AG report the labels of all leaves $\Rightarrow \{2, 5, 8\}$ ::: #### {-} # Exercise 2 -```{r, echo=FALSE, out.width="75%", fig.align='center'} -knitr::include_graphics("figures/sheet-9/crossword.png") -``` - ### 2a) ::: {.question data-latex=""} -**Solve the crossword puzzle!** +Draw a generalized suffix tree for the sequences $A=$`CCATG` and $B=$ `CATG`. +::: -Horizontal: +#### {.tabset} -- 3. Added to DNA fragments during library preparation. +##### Hide -- 8. Illumina way of determining the order of nucleotides in a DNA strand. (3 words) +##### Hint 1 +::: {.answer data-latex=""} -- 9. ChIP-Seq can be used for sequencing DNA regions that are bound by these. +Concatenate the two sequences using a unique character for splitting. e.g. +`CCATG#CATG$`. -- 11. The alphabet of life. +Dont forget to include suffix links! +::: +##### Formulae +::: {.answer data-latex=""} -- 12. Formed by bridge-amplification on Illumina flow-cells. +$sl(v) = w$ -- 13. Flowcell surface filled with these 2 different DNA molecules. +$\overline{v} = cb$ -- 15. Measure to asses the quality of the identification of nucleobases generated by automated DNA sequencing. (3 words) +$\overline{w} = b$ +$c: character, b: string$ -Vertical: -- 1. Dideoxynucleosidetriphosphates (abbrev.) 
+remember: $\overline{v}$ denotes the concatenation of all path labels from the root to v. +::: +##### Solution +::: {.answer data-latex=""} -- 2. Process of determining positions of reads on the reference genome. +```{r, echo=FALSE, out.width="100%", fig.align='center'} +knitr::include_graphics("figures/sheet-8/suffix_tree_2.png") +``` +::: +#### {-} -- 4. Gene expression can be measured using this. (abbrev. hyph.) +### 2b) +::: {.question data-latex=""} -- 5. The process of making many copies of a piece of DNA. +Find the Maximal Unique Matches of the sequences $A=$`CCATG` and $B=$`CATG` using +the tree from A). +::: + +#### {.tabset} -- 6. Found in pairs in DNA. +##### Hide -- 7. Chemical group attached to nucleotides to monitor incorporation into DNA. +##### Solution +::: {.answer data-latex=""} -- 10. File format used to store sequence information. +`CATG` is the only MUM as $\overline{v} =$ `CATG` has no suffix links pointing to +it +::: +#### {-} -- 14. Breakthrough sequencing method (abbrev.) +# Exercise 3 + +### 3a) +::: {.question data-latex=""} + +Draw a generalized suffix tree for the sequence $A=$`ACGCACGCG`. ::: #### {.tabset} @@ -117,40 +158,55 @@ Vertical: ##### Solution ::: {.answer data-latex=""} -```{r, echo=FALSE, out.width="75%", fig.align='center'} -knitr::include_graphics("figures/sheet-9/crossword_solved.png") + +```{r, echo=FALSE, out.width="100%", fig.align='center'} +knitr::include_graphics("figures/sheet-8/suffix_tree_3.png") ``` ::: -#### {-} -# Exercise 3 +#### {-} -#### {.tabset} -### 3a) +### 3b) ::: {.question data-latex=""} -You want to determine how many reads $N$ are needed to achieve a coverage depth $C$ of 20X when sequencing reads for *Escherichia coli*. -The length of the reads $L$ is 30nt and the *E. coli* genome $G$ is approximately 4.6 million bases long. +Find all maximal pairs of length at least 2. 
::: #### {.tabset} ##### Hide -##### Formula +##### Solution ::: {.answer data-latex=""} -$$ -N = \frac{C\times G}{L} -$$ + +`ACGC`: $(1,5,4)$ + +`CG`: $(2,8,2), (6,8,2)$ ::: +#### {-} + + +### 3c) +::: {.question data-latex=""} + +Why is `C`: $(2, 8, 1)$ not a maximal pair? + +::: + +#### {.tabset} + +##### Hide ##### Solution ::: {.answer data-latex=""} -$$ -N = \frac{20\times 4600000}{30} \approx 3066667 \text{ reads} -$$ + +It is not right maximal. +This can be seen since `CG`: $(2, 8, 2)$ already includes the indices 2 and 8 with +a longer match. + ::: +#### {-}
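The maximal-pair answers for exercise 3 of the suffix-tree sheet (`ACGCACGCG`) can be cross-checked without building a suffix tree. The following is a minimal brute-force sketch, not the method taught in the sheet. The function name `maximal_pairs` and the `min_len` parameter are hypothetical, and it assumes the convention implied by the sheet's solution: positions are 1-based, and a text boundary counts as a mismatch on that side.

```python
def maximal_pairs(text, min_len=2):
    """Return sorted (i, j, length) triples (1-based) for all maximal pairs.

    Brute-force check in place of a suffix tree: a pair of occurrences is
    kept only if it can be extended neither to the left nor to the right.
    Assumption: running into the start or end of the text counts as a mismatch.
    """
    n = len(text)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            # Left-maximal: the preceding characters differ,
            # or the first occurrence starts at the beginning of the text.
            if i > 0 and text[i - 1] == text[j - 1]:
                continue
            # Extend to the right while both copies agree; stopping here
            # makes the match right-maximal by construction.
            length = 0
            while j + length < n and text[i + length] == text[j + length]:
                length += 1
            if length >= min_len:
                pairs.append((i + 1, j + 1, length))
    return sorted(pairs)


if __name__ == "__main__":
    for i, j, length in maximal_pairs("ACGCACGCG"):
        print(i, j, length, "ACGCACGCG"[i - 1:i - 1 + length])
```

Because each start pair `(i, j)` is extended as far as possible, the pair at positions 2 and 8 is only ever reported with length 2 (`CG`), never with length 1 (`C`), which is the point made in 3c).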