-
Notifications
You must be signed in to change notification settings - Fork 3
/
rna-seq.tex
251 lines (214 loc) · 13.6 KB
/
rna-seq.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
\section{\abbr{rna} sequencing quantifies protein-coding gene expression}
\label{sec:rna}
\subsection{The transcriptome reflects the state of the cell}
As we have seen, the central part of the cellular machinery is abstracted by the
central dogma of molecular biology, with the \dna at one end encoding the
hereditary identity of the cell, and the proteins at the other end as the
effectors of this information.
It is therefore proteins that determine how a cell behaves, and changes in
proteins abundance ultimately determine changes in cellular function. The
entirety of protein abundance in the cell at a given instance is called the
\define{proteome}. Unfortunately, quantifying the proteome in an unbiased
fashion is hard and expensive \citep{Graumann:2008}. Instead, modern biology
often uses the abundance of \mrna molecules, coding for individual proteins (the
\define[\defineword{transcriptome} the entirety of the \rna molecules present in
the cell at a given time]{transcriptome}), as an accurate proxy of the proteome.
The appropriateness of this approach has been verified in numerous studies
\citep{Nagaraj:2011,Nookaew:2012}.
\subsection{Microarrays}
Until recently, the abundance of \mrna has been mostly determined using specific
probes, which hybridise to complementary target \mrna sequences. Probes are
tagged --- for instance with a \define[\defineword{fluorophore} small chemical
that emits light after excitation by a specific wavelength]{fluorophore} ---
such that the presence of a probe can be detected visually. In order to
determine the abundance of many different \mrna[s] simultaneously, large arrays
of different probes can be generated and queried in parallel. Due to their
miniaturisation, these arrays are known as \define{expression microarrays}
\citep{Schena:1995}.
While microarrays still play an important role in transcriptome analysis, they
have recently been superseded by another technology due, mainly, to the
following disadvantages \citep{Casneuf:2007,Marioni:2008}:
\begin{enumerate*}
\item Each probe is sequence specific, and only recognises its target \mrna.
As a consequence, quantification is inherently biased and requires that
each target is known beforehand. Microarrays thus cannot discover new
target transcripts, and desiging a new microarray is technically
challenging and expensive.
\item Cross-hybridisation causes low-level, non-specific binding of
transcripts to non-targeted probes, skewing their reported expression
strength. Since this effect is probe dependent, this means that
while identical transcripts’ relative abundances can be compared across
arrays, abundance of different transcripts \emph{on the same array}
cannot be compared.
\item Several interesting biological questions which cannot be answered by
microarray analysis at all, or only with difficulties, become tractable
using new methods; these include the analysis of isoforms from
alternative splicing \citep{Katz:2010} and of allele-specific expression
\citep{Pickrell:2010}.
\end{enumerate*}
\subsection{\abbr{rnaseq}}
In 2008, focus shifted from microarrays to whole-transcriptome shotgun
sequencing for \rna quantification
\citep{Nagalakshmi:2008,Mortazavi:2008,Marioni:2008}. In contrast to other \rna
quantification approaches, whole-transcriptome shotgun sequencing (now typically
known as \rnaseq) is entirely unbiased in that it does not rely on a
pre-selected set of transcripts to assay. The approach has been shown to yield
high-quality results, is highly replicable, has very low noise, and is sensitive
to transcripts present at low concentration.
\subsubsection{\abbr{rnaseq} sample preparation}
\rnaseq analysis starts with the enrichment of relevant transcripts from a
sample’s total \rna pool. This is important since the \rna fraction that we are
usually interested in (\mrna constitutes \numrange{3}{5} per cent of the total
\rna \citep{Alberts:2002}) is dwarfed in abundance by the fraction of \rrna
(more than \num{70} per cent of the cell’s \rna is \rrna in eukaryotes; in
prokaryotes, that number is even higher \citep{Sittman:1999}), and would thus
dilute the signal. In practice, this enrichment can be done in one of two ways:
\begin{enumerate*}
\item Polyadenylated \rna can be targeted due to its affinity to oligo(dT)
primers, short strings of thymine; this will efficiently capture \mrna,
which is post-transcriptionally \threep tagged with poly(A)
tails \citep{Mortazavi:2008}.
\item The \rna can be \rrna depleted using a RiboMinus protocol, which
specifically targets \rrna molecules for removal \citep{Cui:2010}.
\end{enumerate*}
Thus, if one wants to profile non-\mrna molecules, \rrna depletion rather than
poly(A) selection is the method of choice. However, neither method is
\emph{sufficient} to reliably select only a particular type of \rna. For one
thing, neither method is \num{100} per cent specific; in addition, \mrna[s] are
not the only type of \rna that is polyadenylated: a large fraction of \ncrna is
also polyadenylated and exported from the nucleus into the cytoplasm
\citep{Cheng:2005}.
The enriched \rna is subsequently fragmented to a uniform length of about
\SI{200}{nt}, and reverse transcribed into \cdna. Ultimately, the result is a
\cdna library of fragments which are then sequenced on a high-throughput
sequencing machine (\cref{fig:rna-seq-workflow}).
\textfig{rna-seq-workflow}{body}{0.75\textwidth}
{Typical \abbr{rnaseq} workflow.}
{Starting from a total \rna sample (\num{1}), the \rna of interest is
enriched either via poly(A) selection or \rrna depletion (\num{2}). The
enriched fraction is fragmented and size-selected (\num{3}). The fragments
are reverse transcribed into \cdna (\num{4}) and sequenced. The resulting
sequencing reads are aligned back to a reference (\num{5}) and counted on
features of interest (adapted from \citet{Mortazavi:2008}).}
\subsubsection{Computational processing of the sequencing information}
Sequencing the \rnaseq samples yields short-read libraries, typically several
tens of millions of reads in size, with reads of uniform length from \SI{25}{bp}
up to (currently) about \SI{125}{bp}. These are then mapped to a reference ---
either the whole genome or the transcriptome --- or assembled \emph{de novo}
(usually in absence of a suitable reference). This assigns reads to genomic
locations, which can be queried and matched to features (usually genes or
exons) \citep{Cox:2007,Kim:2013,Anders:2014}.
It is important to distinguish whether the \cdna fragments were sequenced from
both ends or from one end only. In the first case, instead of a single read per
fragment we end up with paired-end reads, which are separated by an
approximately known distance (the fragment length minus the lengths of the
sequence reads). Mapping and assembling paired-end reads creates further
algorithmic challenges but increases the amount of information contained in a
read (pair), which increases the amount of unambiguously mappable data
(\cref{fig:paired-end-reads}) \citep{Langmead:2012}.
\textfig{paired-end-reads}{body}{0.75\textwidth}
{Transcript fragment with paired-end reads.}
{The transcript is sequenced from both ends, resulting in a read pair whose
distance to each other can be inferred from the known approximate fragment
length. As long as one of the reads is “anchored” by being uniquely mapped,
the paired read can be used to improve the mapping confidence in a repeat
region.}
Another distinction is made for fragments which are tagged and selected in such
a way as to be strand-specific. This allows identification of the strand from
which fragments originate, and aids in the unambiguous reassembly of the
sequenced reads, as well as the identification of antisense information within
(intronic regions in) a gene body, which would otherwise confound the expression
estimate (\cref{fig:strand-specific-rna-seq}) \citep{Yassour:2010}.
\textfig{strand-specific-rna-seq}{spill}{\textwidth}
{Effect of strand-specific mapping on feature identification.}
{\rnaseq coverage data from \species{scer} without strand specificity (top)
and from two libraries prepared with forward- and reverse-strand specific
protocols, as well as the feature units thus found. This example shows that
strand-specific mapping can lead to the discovery of new reverse-strand
features, which may skew the expression quantification of forward-strand
features (figure modified from \citet{Yassour:2010}, licensed under CC-BY).}
\subsection{Expression normalisation}
Since the ultimate goal of \rnaseq is to quantify transcript abundance, the next
step is to quantify the number of reads mapping to genomic features. In the
simplest case, one can count the number of reads overlapping with a feature’s
genomic range. This is what e.g.\ \name{HTSeq} \citep{Anders:2014} does.
However, read counts obtained in this manner vary with the length of the
mappable region --- longer features originate more sequenced fragments, and
hence more reads, with equal coverage, compared to shorter features --- and with
the total size of the sequenced library.
\cite{Mortazavi:2008} therefore introduced a relative measure of transcript
abundance, the \rpkm, defined as
\begin{equation}
x^*_i = \frac{x_i}{\tilde l_i \cdot 10^{-3} \cdot n \cdot 10^{-6}}
\text{\ ,}
\end{equation}
where \(x^*_i\) is the \rpkm of transcript \(i\), \(x_i\) is the raw number of
reads annotated with transcript \(i\), \(\tilde l_i\) is the effective length of
the transcript (i.e.\ its length minus the fragment length plus \num{1}) and
\(n\) is the library size (in number of reads). The constant multiplier
\num{1e9} merely serves to make otherwise very small values more manageable.
With the advent of paired-end sequencing the \rpkm has been supplanted by the
\fpkm, simply replacing the number of reads in the equation with the number of
fragments (i.e.\ the read pairs rather than single reads in the case of
paired-end data).
A related measure is \tpm, which additionally normalise by the total transcript
abundance \citep{Li:2010}. In other words,
\begin{equation}
x^*_i = \frac{x_i}{\tilde l_i} \cdot \left(\sum_j{\frac{x_j}{\tilde
l_j}}\right)^{-1} \cdot 10^6 \text{\ .}
\end{equation}
\tpm succinctly answers the question, “given one million transcript in my
sample, how often will I see transcript \(i\)?”
It is important to note that both approaches give a sample dependent abundance,
and thus neither of these units makes measured transcript abundance comparable
across different experiments: Consider two biological conditions which are
identical except for a transcript \(X\), which is highly abundant in condition
\(A\), and not present in condition \(B\). If we sequence the same amount of
\mrna from both samples, relatively fewer fragments of the non-\(X\) transcript
will exists for condition \(A\) relative to condition \(B\), even though
their absolute abundance in the original sample was the same
\citep{Robinson:2010a}.
To compare transcript abundance across biological samples, a different approach
has to be taken.
\subsection{Differential expression}
One of the main uses of \rnaseq data is to compare and contrast gene expression
between different conditions, such as between tissues, between different
species, between healthy and tumour tissues, \etc. By doing so, we hope to
establish which genes characterise differences and may thus be \emph{causal} of
the phenotypic change.
To find statistically significant differences in gene expressions between two
samples, we want to test the null hypothesis that the gene count in both samples
comes from the same distribution with the same mean (\(H_0: \mu_{iA} =
\mu_{iB}\) for a gene \(i\) between two conditions \(A\) and \(B\)).
Reads are assumed to be sampled independently from a population, thus read
counts on a given feature can be approximated by a Poisson distribution
\citep{Mortazavi:2008, Marioni:2008}. However, actual expression data has been
shown to be over-dispersed compared to this model \citep{Robinson:2007}. In
order to accurately model this greater dispersion, a negative binomial
distribution can be used instead.
\begin{equation}
X_{ij} \sim \operatorname{NB}(\mu_{ij}, \sigma^2_{ij}) \text{\ ,}
\end{equation}
for each gene \(i\) in library \(j\), with mean \(\mu_{ij}\) and variance
\(\sigma^2_{ij}\).
To account for different sampling depth across libraries, it is furthermore
assumed that most genes do not drastically change expression between biological
samples. But the few that are highly expressed in some but not all samples have
a large influence on the total count. Thus, rather than normalising by the
ratios between total library sizes, each library \(j = 1 \dots m\) can be
normalised by a summary of the ratios between read counts for all \(n\) genes or
a subset of the genes across libraries \citep{Robinson:2010a,Anders:2010}.
Generalised to more than two libraries, gene expression ratios can be calculated
as the ratio of each gene’s read count to the geometric mean across samples:
\begin{equation}
s_j = \operatorname{median}_{i = 1}^n
\frac{x_{ij}}{\sqrt[\uproot{2}m]{\prod_{\nu = 1}^m x_{i\nu}}}
\text{\ .}
\end{equation}
The parameters of the negative binomial distribution can, in principle, be
estimated from the data. However, this is complicated by the typically very
small number of samples. Different solutions for this problem exist.
\name{edgeR} \citep{Robinson:2010} fixes one parameter per sample and only
estimates the other for each gene, while \name{DESeq} by \citet{Anders:2010}
pools data across genes of similar expression strength, and performs local
regression to find the dispersion.