forked from henrythemes/jekyll-bootstrap-theme
-
Notifications
You must be signed in to change notification settings - Fork 33
/
lab.html
1378 lines (1276 loc) · 93.8 KB
/
lab.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="pandoc" />
<title>NGS Intro | RNA-Seq Lab</title>
<script src="lab_files/jquery-1.11.3/jquery.min.js"></script>
<meta name="viewport" content="width=device-width, initial-scale=1" />
<link href="lab_files/bootstrap-3.3.5/css/flatly.min.css" rel="stylesheet" />
<script src="lab_files/bootstrap-3.3.5/js/bootstrap.min.js"></script>
<script src="lab_files/bootstrap-3.3.5/shim/html5shiv.min.js"></script>
<script src="lab_files/bootstrap-3.3.5/shim/respond.min.js"></script>
<script src="lab_files/jqueryui-1.11.4/jquery-ui.min.js"></script>
<link href="lab_files/tocify-1.9.1/jquery.tocify.css" rel="stylesheet" />
<script src="lab_files/tocify-1.9.1/jquery.tocify.js"></script>
<script src="lab_files/navigation-1.1/tabsets.js"></script>
<link href="lab_files/highlightjs-9.12.0/textmate.css" rel="stylesheet" />
<script src="lab_files/highlightjs-9.12.0/highlight.js"></script>
<link href="lab_files/pagedtable-1.1/css/pagedtable.css" rel="stylesheet" />
<script src="lab_files/pagedtable-1.1/js/pagedtable.js"></script>
<link href="lab_files/font-awesome-5.1.0/css/all.css" rel="stylesheet" />
<link href="lab_files/font-awesome-5.1.0/css/v4-shims.css" rel="stylesheet" />
<link id="font-awesome-1-attachment" rel="attachment" href="lab_files/font-awesome-5.1.0/fonts/fontawesome-webfont.ttf"/>
<script src="lab_files/kePrint-0.0.1/kePrint.js"></script>
<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
pre:not([class]) {
background-color: white;
}
</style>
<script type="text/javascript">
if (window.hljs) {
hljs.configure({languages: []});
hljs.initHighlightingOnLoad();
if (document.readyState && document.readyState === "complete") {
window.setTimeout(function() { hljs.initHighlighting(); }, 0);
}
}
</script>
<style type="text/css">
h1 {
font-size: 34px;
}
h1.title {
font-size: 38px;
}
h2 {
font-size: 30px;
}
h3 {
font-size: 24px;
}
h4 {
font-size: 18px;
}
h5 {
font-size: 16px;
}
h6 {
font-size: 12px;
}
.table th:not([align]) {
text-align: left;
}
</style>
<link rel="stylesheet" href="assets/lab.css" type="text/css" />
</head>
<body>
<style type = "text/css">
.main-container {
max-width: 940px;
margin-left: auto;
margin-right: auto;
}
code {
color: inherit;
background-color: rgba(0, 0, 0, 0.04);
}
img {
max-width:100%;
height: auto;
}
.tabbed-pane {
padding-top: 12px;
}
.html-widget {
margin-bottom: 20px;
}
button.code-folding-btn:focus {
outline: none;
}
summary {
display: list-item;
}
</style>
<div class="container-fluid main-container">
<!-- tabsets -->
<style type="text/css">
.tabset-dropdown > .nav-tabs {
display: inline-table;
max-height: 500px;
min-height: 44px;
overflow-y: auto;
background: white;
border: 1px solid #ddd;
border-radius: 4px;
}
.tabset-dropdown > .nav-tabs > li.active:before {
content: "";
font-family: 'Glyphicons Halflings';
display: inline-block;
padding: 10px;
border-right: 1px solid #ddd;
}
.tabset-dropdown > .nav-tabs.nav-tabs-open > li.active:before {
content: "";
border: none;
}
.tabset-dropdown > .nav-tabs.nav-tabs-open:before {
content: "";
font-family: 'Glyphicons Halflings';
display: inline-block;
padding: 10px;
border-right: 1px solid #ddd;
}
.tabset-dropdown > .nav-tabs > li.active {
display: block;
}
.tabset-dropdown > .nav-tabs > li > a,
.tabset-dropdown > .nav-tabs > li > a:focus,
.tabset-dropdown > .nav-tabs > li > a:hover {
border: none;
display: inline-block;
border-radius: 4px;
}
.tabset-dropdown > .nav-tabs.nav-tabs-open > li {
display: block;
float: none;
}
.tabset-dropdown > .nav-tabs > li {
display: none;
}
</style>
<script>
$(document).ready(function () {
window.buildTabsets("TOC");
});
$(document).ready(function () {
$('.tabset-dropdown > .nav-tabs > li').click(function () {
$(this).parent().toggleClass('nav-tabs-open')
});
});
</script>
<!-- code folding -->
<script>
$(document).ready(function () {
// move toc-ignore selectors from section div to header
$('div.section.toc-ignore')
.removeClass('toc-ignore')
.children('h1,h2,h3,h4,h5').addClass('toc-ignore');
// establish options
var options = {
selectors: "h1,h2,h3,h4",
theme: "bootstrap3",
context: '.toc-content',
hashGenerator: function (text) {
return text.replace(/[.\\/?&!#<>]/g, '').replace(/\s/g, '_').toLowerCase();
},
ignoreSelector: ".toc-ignore",
scrollTo: 0
};
options.showAndHide = true;
options.smoothScroll = true;
// tocify
var toc = $("#TOC").tocify(options).data("toc-tocify");
});
</script>
<style type="text/css">
#TOC {
margin: 25px 0px 20px 0px;
}
@media (max-width: 768px) {
#TOC {
position: relative;
width: 100%;
}
}
.toc-content {
padding-left: 30px;
padding-right: 40px;
}
div.main-container {
max-width: 1200px;
}
div.tocify {
width: 20%;
max-width: 260px;
max-height: 85%;
}
@media (min-width: 768px) and (max-width: 991px) {
div.tocify {
width: 25%;
}
}
@media (max-width: 767px) {
div.tocify {
width: 100%;
max-width: none;
}
}
.tocify ul, .tocify li {
line-height: 20px;
}
.tocify-subheader .tocify-item {
font-size: 0.90em;
padding-left: 25px;
text-indent: 0;
}
.tocify .list-group-item {
border-radius: 0px;
}
</style>
<!-- setup 3col/9col grid for toc_float and main content -->
<div class="row-fluid">
<div class="col-xs-12 col-sm-4 col-md-3">
<div id="TOC" class="tocify">
</div>
</div>
<div class="toc-content col-xs-12 col-sm-8 col-md-9">
<div class="fluid-row" id="header">
<h1 class="title toc-ignore">NGS Intro | RNA-Seq Lab</h1>
</div>
<p><link href="https://fonts.googleapis.com/css?family=Lato:400,700|Roboto:400,700" rel="stylesheet"></p>
<p><img src="assets/logo.svg" alt="logo" style="height:50px; position:absolute; top:0; right:0; padding-right:40px; margin-top:22px"></p>
<h4 class="toc-ignore author">
<b>NBIS</b> | 05-Feb-2019
</h4>
<p><br></p>
<hr />
<p>RNA-seq has become a powerful approach to study the continually changing cellular transcriptome. Here, one of the most common questions is to identify genes that are differentially expressed between two conditions, e.g. controls and treatment. The <strong>main</strong> exercise in this tutorial will take you through a basic bioinformatic analysis pipeline to answer just that, it will show you how to find differentially expressed (DE) genes.</p>
<div class="abstract">
<p><strong>Main exercise</strong></p>
<ul>
<li>01 Check the quality of the raw reads with <strong>FastQC</strong></li>
<li>02 Map the reads to the reference genome using <strong>Star</strong></li>
<li>03 Assess the post-alignment quality using <strong>QualiMap</strong></li>
<li>04 Count the reads overlapping with genes using <strong>featureCounts</strong></li>
<li>05 Find DE genes using <strong>edgeR</strong> in R</li>
</ul>
</div>
<p>RNA-seq experiment does not necessarily end with a list of DE genes. If you have time after completing the <strong>main</strong> exercise, try one (or more) of the <strong>bonus</strong> exercises. The <strong>bonus</strong> exercises can be run independently of each other, so choose the one that matches your interest. Bonus sections are listed below.</p>
<div class="abstract">
<p><strong>Bonus exercises</strong></p>
<ul>
<li>01 Functional annotation of DE genes using <strong>GO/Reactome/Kegg</strong> databases</li>
<li>02 Visualisation of RNA-seq BAM files using <strong>IGV</strong> genome browser</li>
<li>03 RNA-Seq figures and plots using <strong>R</strong></li>
<li>04 De-novo transcriptome assembly using <strong>Trinity</strong></li>
</ul>
</div>
<div class="instruction">
<p>Expected run times (in minutes, when running all samples) for some of the steps as shown below when using 8 cores with 64 GB RAM.</p>
<table class="table table-striped table-hover table-responsive" style="width: auto !important; ">
<thead>
<tr>
<th style="text-align:right;">
Step
</th>
<th style="text-align:right;">
Time_Min
</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right;">
FastQC
</td>
<td style="text-align:right;">
11:00
</td>
</tr>
<tr>
<td style="text-align:right;">
STAR Mapping
</td>
<td style="text-align:right;">
32:00
</td>
</tr>
<tr>
<td style="text-align:right;">
QualiMap
</td>
<td style="text-align:right;">
42:00
</td>
</tr>
<tr>
<td style="text-align:right;">
MultiQC
</td>
<td style="text-align:right;">
08:00
</td>
</tr>
<tr>
<td style="text-align:right;">
FeatureCounts
</td>
<td style="text-align:right;">
03:00
</td>
</tr>
<tr>
<td style="text-align:right;">
Trinity
</td>
<td style="text-align:right;">
47:00
</td>
</tr>
</tbody>
</table>
<p>It is not recommended to run every step on all samples are it is not possible to complete in the available time. Pre-computed files for all steps are made available. Instructions to copy them are shown at the end of each section.</p>
<p>You are welcome to try your own solutions to the problems, before checking the solution. Click the <code style="background-color:#0093BD;color:white;">+</code> button to see the suggested solution. There is more than one way to complete a task. Discuss with person next to you and ask us when in doubt.</p>
<p><strong>Markers:</strong> <i class="fas fa-lightbulb"></i> Tip <i class="fas fa-comments"></i> Discuss <i class="fas fa-clipboard-list"></i> Task</p>
</div>
<p><br></p>
<div id="data-description" class="section level1">
<h1><span class="header-section-number">1</span> Data description</h1>
<p>The data used in this exercise is from the paper: <strong>Poitelon, Yannick, <em>et al</em>. “YAP and TAZ control peripheral myelination and the expression of laminin receptors in Schwann cells.” <a href="https://www.nature.com/articles/nn.4316">Nature neuroscience 19.7 (2016): 879</a></strong>. In this study, YAP and TAZ genes were knocked-down in Schwann cells to study myelination, using the sciatic nerve in mice as a model.</p>
<p>Myelination is essential for nervous system function. Schwann cells interact with neurons and the basal lamina to myelinate axons using receptors, signals and transcription factors. Hippo pathway is a conserved pathway involved in cell contact inhibition, and it acts to promote cell proliferation and inhibits apoptosis. The pathway integrates mechanical signals (cell polarity, mechanotransduction, membrane tension) and gene expression response. In addition to its role in organ size control, the Hippo pathway has been implicated in tumorigenesis, for example its deregulation occurs in a broad range of human carcinomas. Transcription co-activators YAP and TAZ are two major downstream effectors of the Hippo pathway, and have redundant roles in transcriptional activation.</p>
<p>The material for RNA-seq was collected from 2 conditions (<strong>Wt</strong> and <strong>KO</strong>), each with 3 biological replicates.</p>
<table class="table table-striped table-hover table-responsive" style="width: auto !important; ">
<thead>
<tr>
<th style="text-align:right;">
Accession
</th>
<th style="text-align:right;">
Condition
</th>
<th style="text-align:right;">
Replicate
</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right;">
SRR3222409
</td>
<td style="text-align:right;">
<span style=" font-weight: bold; color: white;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #fb8072;">KO</span>
</td>
<td style="text-align:right;">
<span style=" font-weight: bold; color: white;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #80b1d3;">1</span>
</td>
</tr>
<tr>
<td style="text-align:right;">
SRR3222410
</td>
<td style="text-align:right;">
<span style=" font-weight: bold; color: white;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #fb8072;">KO</span>
</td>
<td style="text-align:right;">
<span style=" font-weight: bold; color: white;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #fdb462;">2</span>
</td>
</tr>
<tr>
<td style="text-align:right;">
SRR3222411
</td>
<td style="text-align:right;">
<span style=" font-weight: bold; color: white;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #fb8072;">KO</span>
</td>
<td style="text-align:right;">
<span style=" font-weight: bold; color: white;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #bebada;">3</span>
</td>
</tr>
<tr>
<td style="text-align:right;">
SRR3222412
</td>
<td style="text-align:right;">
<span style=" font-weight: bold; color: white;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #8dd3c7;">Wt</span>
</td>
<td style="text-align:right;">
<span style=" font-weight: bold; color: white;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #80b1d3;">1</span>
</td>
</tr>
<tr>
<td style="text-align:right;">
SRR3222413
</td>
<td style="text-align:right;">
<span style=" font-weight: bold; color: white;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #8dd3c7;">Wt</span>
</td>
<td style="text-align:right;">
<span style=" font-weight: bold; color: white;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #fdb462;">2</span>
</td>
</tr>
<tr>
<td style="text-align:right;">
SRR3222414
</td>
<td style="text-align:right;">
<span style=" font-weight: bold; color: white;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #8dd3c7;">Wt</span>
</td>
<td style="text-align:right;">
<span style=" font-weight: bold; color: white;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: #bebada;">3</span>
</td>
</tr>
</tbody>
</table>
<div class="instruction">
<p>For the purpose of this tutorial, to shorten the time needed to run various bioinformatics steps, we have downsampled the original files. We randomly sampled, without replacement, 25% reads from each sample, using <code>fastq-sample</code> from the toolset <a href="https://homes.cs.washington.edu/~dcjones/fastq-tools/">fastq-tools</a>.</p>
</div>
<p><br></p>
</div>
<div id="main-exercise" class="section level1">
<h1><span class="header-section-number">2</span> Main exercise</h1>
<p>The main exercise covers Differential Gene Expression (DGE) workflow from raw reads to a list of differentially expressed genes.</p>
<div id="preparation" class="section level2">
<h2><span class="header-section-number">2.1</span> Preparation</h2>
<p><i class="fas fa-lightbulb"></i> For Linux and Mac users, Log in to Uppmax in a way so that the generated graphics are exported via the network to your screen. Login in to Uppmax with X-forwarding enabled. This will allow any graphical interface that you start on your compute node to be exported to your computer.</p>
<p>Linux users are recommended to use this:</p>
<pre><code>ssh -X [email protected]</code></pre>
<p>And Mac user are recommended to use this:</p>
<pre><code>ssh -Y [email protected]</code></pre>
<p>Windows users on MobaXterm do not need to worry about this option.</p>
<div id="book-a-node" class="section level3">
<h3><span class="header-section-number">2.1.1</span> Book a node</h3>
<p>For the RNA-Seq part of the course (Thu/Fri), we will work on the Snowy cluster. A standard compute node on cluster Snowy has 128 GB of RAM and 16 cores. Therefore, each core gives you 8 GB of RAM. We will use 8 cores per person for this session which gives you about 64 GB RAM. The code below is valid to run at the start of the day. If you are running it in the middle of a day, you need to decrease the time (<code>-t</code>). Do not run this twice and also make sure you are not running computations on a login node.</p>
<p>To run jobs on the snowy cluster therefore, we need to add <code>-M snowy</code>.</p>
<p>Book resources for RNA-Seq day 1.</p>
<pre><code>salloc -A snic2019-8-3 -t 08:00:00 -p core -n 8 --reservation=snic2019-8-3_7 -M snowy</code></pre>
<p>Book resources for RNA-Seq day 2.</p>
<pre><code>salloc -A snic2019-8-3 -t 08:00:00 -p core -n 8 --reservation=snic2019-8-3_8 -M snowy</code></pre>
</div>
<div id="set-up-directory" class="section level3">
<h3><span class="header-section-number">2.1.2</span> Set-up directory</h3>
<p>Setting up the directory structure is an important step as it helps to keep our raw data, intermediate data and results in an organised manner. All work must be carried out at this location <code>/proj/snic2019-8-3/nobackup/[user]/</code> where <code>[user]</code> is your user name. All RNA-Seq related activities must be carried out in a sub-directory named <code>rnaseq</code>.</p>
<p><i class="fas fa-clipboard-list"></i> Set up the below directory structure in your project directory.</p>
<pre><code>[user]/
rnaseq/
+-- 1_raw/
+-- 2_fastqc/
+-- 3_mapping/
+-- 4_qualimap/
+-- 5_dge/
+-- 6_multiqc/
+-- reference/
| +-- mouse/
| +-- mouse_chr11/
+-- scripts/</code></pre>
<a class="btn btn-sm btn-collapse btn-collapse-normal" role="button" data-toggle="collapse" href="#2019020511030518" aria-expanded="false" aria-controls="2019020511030518"><b>+</b></a>
<div id="2019020511030518" class="collapse">
<pre class="r" style="margin-top:5px;">
mkdir rnaseq
cd rnaseq
mkdir 1_raw 2_fastqc 3_mapping 4_qualimap 5_dge 6_multiqc reference scripts
cd reference
mkdir mouse
mkdir mouse_chr11
cd ..</pre>
</div>
<p>The <code>1_raw</code> directory will hold the raw fastq files (soft-links). <code>2_fastqc</code> will hold FastQC outputs. <code>3_mapping</code> will hold the STAR mapping output files. <code>4_qualimap</code> will hold the QualiMap output files. <code>5_dge</code> will hold the counts from featureCounts and all differential gene expression related files. <code>6_multiqc</code> will hold MultiQC outputs. <code>reference</code> directory will hold the reference genome, annotations and STAR indices.</p>
<div class="instruction">
<p><i class="fas fa-lightbulb"></i> It might be a good idea to open an additional terminal window. One to navigate through directories and another for scripting in the <code>scripts</code> directory.</p>
</div>
</div>
<div id="create-symbolic-links" class="section level3">
<h3><span class="header-section-number">2.1.3</span> Create symbolic links</h3>
<p>We have the raw fastq files in this remote directory: <code>/sw/share/compstore/courses/ngsintro/rnaseq/main/1_raw/</code>. We are going to create symbolic links (soft-links) for these files from our <code>1_raw</code> directory to the remote directory. We do this because they are large files and simply copying them would use up a lot of storage space. Soft-linking files and folders allows us to work with those files as if they were actually there. Use <code>pwd</code> to check if you are standing in the correct directory. You should be standing here:</p>
<pre><code>/proj/snic2019-8-3/nobackup/[user]/rnaseq/1_raw</code></pre>
<p>Run below to create softlinks.</p>
<pre><code>ln -s /sw/share/compstore/courses/ngsintro/rnaseq/main/1_raw/*.fastq.gz .</code></pre>
<p>Check if your files have linked correctly. You should be able to see as below.</p>
<pre><code>[user@rackham2 1_raw]$ ls -l
SRR3222409_1.fastq.gz -> /sw/share/compstore/courses/ngsintro/rnaseq/main/1_raw/SRR3222409_1.fastq.gz
SRR3222409_2.fastq.gz -> /sw/share/compstore/courses/ngsintro/rnaseq/main/1_raw/SRR3222409_2.fastq.gz
SRR3222410_1.fastq.gz -> /sw/share/compstore/courses/ngsintro/rnaseq/main/1_raw/SRR3222410_1.fastq.gz
SRR3222410_2.fastq.gz -> /sw/share/compstore/courses/ngsintro/rnaseq/main/1_raw/SRR3222410_2.fastq.gz
SRR3222411_1.fastq.gz -> /sw/share/compstore/courses/ngsintro/rnaseq/main/1_raw/SRR3222411_1.fastq.gz
SRR3222411_2.fastq.gz -> /sw/share/compstore/courses/ngsintro/rnaseq/main/1_raw/SRR3222411_2.fastq.gz
SRR3222412_1.fastq.gz -> /sw/share/compstore/courses/ngsintro/rnaseq/main/1_raw/SRR3222412_1.fastq.gz
SRR3222412_2.fastq.gz -> /sw/share/compstore/courses/ngsintro/rnaseq/main/1_raw/SRR3222412_2.fastq.gz
SRR3222413_1.fastq.gz -> /sw/share/compstore/courses/ngsintro/rnaseq/main/1_raw/SRR3222413_1.fastq.gz
SRR3222413_2.fastq.gz -> /sw/share/compstore/courses/ngsintro/rnaseq/main/1_raw/SRR3222413_2.fastq.gz
SRR3222414_1.fastq.gz -> /sw/share/compstore/courses/ngsintro/rnaseq/main/1_raw/SRR3222414_1.fastq.gz
SRR3222414_2.fastq.gz -> /sw/share/compstore/courses/ngsintro/rnaseq/main/1_raw/SRR3222414_2.fastq.gz</code></pre>
<p><br></p>
</div>
</div>
<div id="fastqc-quality-check" class="section level2">
<h2><span class="header-section-number">2.2</span> FastQC: Quality check</h2>
<p>After receiving raw reads from a high throughput sequencing centre it is essential to check their quality. <a href="http://www.bioinformatics.babraham.ac.uk/projects/fastqc/">FastQC</a> provides a simple way to do some quality control check on raw sequence data. It provides a modular set of analyses which you can use to get a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.</p>
<p><i class="fas fa-clipboard-list"></i> Change into the <code>2_fastqc</code> directory. Use <code>pwd</code> to check if you are standing in the correct directory. You should be standing here:</p>
<pre><code>/proj/snic2019-8-3/nobackup/[user]/rnaseq/2_fastqc</code></pre>
<p>Load Uppmax modules <code>bioinfo-tools</code> and FastQC <code>FastQC/0.11.5</code>.</p>
<pre><code>module load bioinfo-tools
module load FastQC/0.11.5</code></pre>
<p>Once the module is loaded, FastQC program is available through the command <code>fastqc</code>. Use <code>fastqc --help</code> to see the various parameters available to the program. We will use <code>-t 8</code>, to specify number of threads, <code>-o</code> to specify the output directory path and finally, the name of the input fastq file to analyse. The syntax will look like below.</p>
<pre><code>fastqc -t 8 -o . ../1_raw/filename.fastq.gz</code></pre>
<p>Based on the above command, we will write a bash loop to process all fastq files in the directory. Writing multi-line commands through the terminal can be a pain. Therefore, we will run larger scripts from a bash script file. Move to your <code>scripts</code> directory and create a new file named <code>fastqc.sh</code>.</p>
<p>You should be standing here to run this:</p>
<pre><code>/proj/snic2019-8-3/nobackup/[user]/rnaseq/scripts</code></pre>
<a class="btn btn-sm btn-collapse btn-collapse-normal" role="button" data-toggle="collapse" href="#2019020511036906" aria-expanded="false" aria-controls="2019020511036906"><b>+</b></a>
<div id="2019020511036906" class="collapse">
<pre class="r" style="margin-top:5px;">touch fastqc.sh</pre>
</div>
<p>Use <code>nano</code>,<code>vim</code> or <code>gedit</code> to edit <code>fastqc.sh</code>.</p>
<a class="btn btn-sm btn-collapse btn-collapse-normal" role="button" data-toggle="collapse" href="#2019020511031313" aria-expanded="false" aria-controls="2019020511031313"><b>+</b></a>
<div id="2019020511031313" class="collapse">
<pre class="r" style="margin-top:5px;">#!/bin/bash
for i in ../1_raw/*.fastq.gz
do
echo "Running $i ..."
fastqc -t 8 -o . "$i"
done</pre>
</div>
<p>While standing in the <code>2_fastqc</code> directory, run the file <code>fastqc.sh</code>. Use <code>pwd</code> to check if you are standing in the correct directory.</p>
<p>You should be standing here to run this:</p>
<pre><code>/proj/snic2019-8-3/nobackup/[user]/rnaseq/2_fastqc</code></pre>
<a class="btn btn-sm btn-collapse btn-collapse-normal" role="button" data-toggle="collapse" href="#2019020511035537" aria-expanded="false" aria-controls="2019020511035537"><b>+</b></a>
<div id="2019020511035537" class="collapse">
<pre class="r" style="margin-top:5px;">bash ../scripts/fastqc.sh</pre>
</div>
<p>After the fastqc run, there should be a <code>.zip</code> file and a <code>.html</code> file for every fastq file. The <code>.html</code> file is the report that you need. Open the <code>.html</code> in the browser and view it. You only need to do this for one file now. We will do a comparison with all samples when using the MultiQC tool.</p>
<pre><code>firefox file.html &</code></pre>
<p><i class="fas fa-lightbulb"></i> Adding <code>&</code> at the end sends that process to the background, so that the console is free to accept new commands.</p>
<div class="optional">
<p><strong>Optional</strong></p>
<p>Download the <code>.html</code> file to your computer and view it.</p>
<p><i class="fas fa-lightbulb"></i> All users can use an SFTP browser like <a href="https://filezilla-project.org/">Filezilla</a> or <a href="https://cyberduck.io/">Cyberduck</a> for a GUI interface. Windows users can also use the MobaXterm SFTP file browser to drag and drop. Linux and Mac users can use SFTP or SCP by running the below command in a <strong>LOCAL</strong> terminal and <strong>NOT</strong> on Uppmax.</p>
<pre><code>scp [email protected]:/proj/snic2019-8-3/nobackup/[user]/2_fastqc/SRR3222409_1_fastqc.html ./</code></pre>
</div>
<p><i class="fas fa-clipboard-list"></i> Go back to the FastQC website and compare your report with the sample report for <a href="http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html">Good Illumina data</a> and <a href="http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html">Bad Illumina data</a>.</p>
<p><i class="fas fa-comments"></i> Discuss based on your reports, whether your data is of good enough quality and/or what steps are needed to fix it.</p>
<p><br></p>
</div>
<div id="star-mapping" class="section level2">
<h2><span class="header-section-number">2.3</span> STAR: Mapping</h2>
<p>After verifying that the quality of the raw sequencing reads is acceptable, we will map the reads to the reference genome. There are many mappers/aligners available, so it may be good to choose one that is adequate for your type of data. Here, we will use a software called STAR (Spliced Transcripts Alignment to a Reference) as it is good for generic purposes, fast, easy to use and has been shown to outperform many of the other tools when aligning 2x76bp paired-end data. Before we begin mapping, we need to obtain genome reference sequence (<code>.fasta</code> file) and a corresponding annotation file (<code>.gtf</code>) and build a STAR index. Due to time constraints, we will practice index building only on chromosome 11. But, then we will use the pre-prepared full-genome index to run the actual mapping.</p>
<div id="get-reference" class="section level3">
<h3><span class="header-section-number">2.3.1</span> Get reference</h3>
<p>It is best if the reference genome (<code>.fasta</code>) and annotation (<code>.gtf</code>) files come from the same source to avoid potential naming conventions problems. It is also good to check in the manual of the aligner you use for hints on what type of files are needed to do the mapping.</p>
<p><i class="fas fa-comments"></i> What is the idea behind building STAR index? What files are needed to build one? Where do we take them from? Could one use a STAR index that was generated before? Browse through <a href="https://www.ensembl.org/index.html">Ensembl</a> and try to find the files needed. Note that we are working with Mouse (<em>Mus musculus</em>).</p>
<p><i class="fas fa-clipboard-list"></i> Move into the <code>reference</code> directory and download the Chr 11 genome (<code>.fasta</code>) file and the genome-wide annotation file (<code>.gtf</code>) from Ensembl.</p>
<p>You should be standing here to run this:</p>
<pre><code>/proj/snic2019-8-3/nobackup/[user]/rnaseq/reference</code></pre>
<a class="btn btn-sm btn-collapse btn-collapse-normal" role="button" data-toggle="collapse" href="#2019020511032921" aria-expanded="false" aria-controls="2019020511032921"><b>+</b></a>
<div id="2019020511032921" class="collapse">
<pre class="r" style="margin-top:5px;">wget ftp://ftp.ensembl.org/pub/release-93/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.chromosome.11.fa.gz
wget ftp://ftp.ensembl.org/pub/release-93/gtf/mus_musculus/Mus_musculus.GRCm38.93.gtf.gz</pre>
</div>
<p>Decompress the files for use.</p>
<a class="btn btn-sm btn-collapse btn-collapse-normal" role="button" data-toggle="collapse" href="#2019020511030947" aria-expanded="false" aria-controls="2019020511030947"><b>+</b></a>
<div id="2019020511030947" class="collapse">
<pre class="r" style="margin-top:5px;">gunzip Mus_musculus.GRCm38.dna.chromosome.11.fa.gz
gunzip Mus_musculus.GRCm38.93.gtf.gz</pre>
</div>
<p>You should now have the files as below.</p>
<pre><code>[user@rackham2 reference]$ ll
drwxrwsr-x 2 user g201XXXX 4.0K Sep 4 19:33 mouse
drwxrwsr-x 2 user g201XXXX 4.0K Sep 4 19:32 mouse_chr11
-rw-rw-r-- 1 user g201XXXX 742M Sep 4 19:31 Mus_musculus.GRCm38.93.gtf
-rw-rw-r-- 1 user g201XXXX 119M Sep 4 19:31 Mus_musculus.GRCm38.dna.chromosome.11.fa</code></pre>
</div>
<div id="build-index" class="section level3">
<h3><span class="header-section-number">2.3.2</span> Build index</h3>
<p>Move into the <code>reference</code> directory if not already there. Load module STAR version 2.5.2b. Remember to load <code>bioinfo-tools</code> if you haven’t done so already.</p>
<pre><code>module load bioinfo-tools
module load star/2.5.2b</code></pre>
<p><i class="fas fa-lightbulb"></i> To search for other available versions of STAR, use <code>module spider star</code>.</p>
<p>Create a new bash script in your <code>scripts</code> directory named <code>star_index.sh</code> and add the following lines:</p>
<pre><code>#!/bin/bash
# load module
module load bioinfo-tools
module load star/2.5.2b
star \
--runMode genomeGenerate \
--runThreadN 8 \
--genomeDir ./mouse_chr11 \
--genomeFastaFiles ./Mus_musculus.GRCm38.dna.chromosome.11.fa \
--sjdbGTFfile ./Mus_musculus.GRCm38.93.gtf</code></pre>
<p>The above script means that STAR should run in <code>genomeGenerate</code> mode to build an index. It should use 8 threads for computation. The output files must be directed to the indicated directory. The paths to the <code>.fasta</code> file and the annotation file (<code>.gtf</code>) is also shown.</p>
<p>Run the script from the <code>reference</code> directory. Use <code>pwd</code> to check if you are standing in the correct directory.</p>
<pre><code>bash ../scripts/star_index.sh</code></pre>
<p>Once the indexing is complete, move into the <code>mouse_chr11</code> directory and make sure you have all the files.</p>
<pre><code>[user@rackham2 mouse_chr11]$ ll
-rw-rw-r-- 1 user g201XXXX 10 Sep 4 19:31 chrLength.txt
-rw-rw-r-- 1 user g201XXXX 13 Sep 4 19:31 chrNameLength.txt
-rw-rw-r-- 1 user g201XXXX 3 Sep 4 19:31 chrName.txt
-rw-rw-r-- 1 user g201XXXX 12 Sep 4 19:31 chrStart.txt
-rw-rw-r-- 1 user g201XXXX 1.7M Sep 4 19:33 exonGeTrInfo.tab
-rw-rw-r-- 1 user g201XXXX 805K Sep 4 19:33 exonInfo.tab
-rw-rw-r-- 1 user g201XXXX 56K Sep 4 19:33 geneInfo.tab
-rw-rw-r-- 1 user g201XXXX 121M Sep 4 19:33 Genome
-rw-rw-r-- 1 user g201XXXX 553 Sep 4 19:31 genomeParameters.txt
-rw-rw-r-- 1 user g201XXXX 967M Sep 4 19:33 SA
-rw-rw-r-- 1 user g201XXXX 1.5G Sep 4 19:33 SAindex
-rw-rw-r-- 1 user g201XXXX 522K Sep 4 19:33 sjdbInfo.txt
-rw-rw-r-- 1 user g201XXXX 463K Sep 4 19:33 sjdbList.fromGTF.out.tab
-rw-rw-r-- 1 user g201XXXX 463K Sep 4 19:33 sjdbList.out.tab
-rw-rw-r-- 1 user g201XXXX 480K Sep 4 19:33 transcriptInfo.tab</code></pre>
<p>This index for chr11 was created just to familiarise with the steps. We will use the index built on the whole genome for downstream exercises. The index for the whole genome was prepared for us before class in the very same way as for the chromosome 11 in steps above. It just requires more time (ca. 4h) to run. The index is found here: <code>/sw/share/compstore/courses/ngsintro/rnaseq/reference/mouse/</code>.</p>
<p>Soft-link all the files inside <code>/sw/share/compstore/courses/ngsintro/rnaseq/reference/mouse/</code> to the directory named <code>mouse</code> which is inside your <code>rnaseq/reference/</code>.</p>
<p>You should be standing here to run this:</p>
<pre><code>/proj/snic2019-8-3/nobackup/[user]/rnaseq/reference</code></pre>
<a class="btn btn-sm btn-collapse btn-collapse-normal" role="button" data-toggle="collapse" href="#2019020511039517" aria-expanded="false" aria-controls="2019020511039517"><b>+</b></a>
<div id="2019020511039517" class="collapse">
<pre class="r" style="margin-top:5px;">cd mouse
ln -s /sw/share/compstore/courses/ngsintro/rnaseq/reference/mouse/* .</pre>
</div>
</div>
<div id="map-reads" class="section level3">
<h3><span class="header-section-number">2.3.3</span> Map reads</h3>
<p>Now that we have the index ready, we are ready to map reads. Move to the directory <code>3_mapping</code>. Use <code>pwd</code> to check if you are standing in the correct directory.</p>
<p>You should be standing here to run this:</p>
<pre><code>/proj/snic2019-8-3/nobackup/[user]/rnaseq/3_mapping</code></pre>
<p>We will create softlinks to the fastq files from here to make things easier.</p>
<a class="btn btn-sm btn-collapse btn-collapse-normal" role="button" data-toggle="collapse" href="#2019020511034450" aria-expanded="false" aria-controls="2019020511034450"><b>+</b></a>
<div id="2019020511034450" class="collapse">
<pre class="r" style="margin-top:5px;">cd 3_mapping
ln -s ../1_raw/* .</pre>
</div>
<p>These are the parameters that we want to specify for the STAR mapping run:</p>
<ul>
<li>Run mode is now <code>alignReads</code></li>
<li>Specify the full genome index path</li>
<li>Specify the number of threads</li>
<li>We must indicate the input is gzipped and must be uncompressed</li>
<li>Indicate read1 and read2 since we have paired-end reads</li>
<li>Specify the annotation (.gtf) file</li>
<li>Specify an output file name</li>
<li>Specify that the output must be BAM and the reads must be sorted by coordinate</li>
</ul>
<p>Our mapping script will look like this:</p>
<a class="btn btn-sm btn-collapse btn-collapse-normal" role="button" data-toggle="collapse" href="#2019020511030406" aria-expanded="false" aria-controls="2019020511030406"><b>+</b></a>
<div id="2019020511030406" class="collapse">
<pre class="r" style="margin-top:5px;">star \
--runMode alignReads \
--genomeDir "../reference/mouse" \
--runThreadN 8 \
--readFilesCommand zcat \
--readFilesIn sample_1.fastq.gz sample_2.fastq.gz \
--sjdbGTFfile "../reference/Mus_musculus.GRCm38.93.gtf" \
--outFileNamePrefix "sample1" \
--outSAMtype BAM SortedByCoordinate</pre>
</div>
<p>But, we will generalise the above script to be used as a bash script to read any two input files and to automatically create the output filename.</p>
<p><i class="fas fa-clipboard-list"></i> Now create a new bash script file named <code>star_align.sh</code> in your <code>scripts</code> directory and add the script below to it.</p>
<a class="btn btn-sm btn-collapse btn-collapse-normal" role="button" data-toggle="collapse" href="#2019020511037008" aria-expanded="false" aria-controls="2019020511037008"><b>+</b></a>
<div id="2019020511037008" class="collapse">
<pre class="r" style="margin-top:5px;">#!/bin/bash
module load bioinfo-tools
module load star/2.5.2b
# create output file name
prefix="${1/_*/}"
star \
--runMode alignReads \
--genomeDir "../reference/mouse" \
--runThreadN 8 \
--readFilesCommand zcat \
--readFilesIn $1 $2 \
--sjdbGTFfile "../reference/Mus_musculus.GRCm38.93.gtf" \
--outFileNamePrefix "$prefix" \
--outSAMtype BAM SortedByCoordinate</pre>
</div>
<p>In the above script, the two input fastq files as passed in as parameters <code>$1</code> and <code>$2</code>. The output filename is automatically created using this line <code>prefix='${1/_*/}'</code> from input filename of <code>$1</code>. For example, a file named <code>sample_1.fastq.gz</code> will have the <code>_1.fastq.gz</code> removed and the prefix will be just <code>sample</code>. This approach will work only if your filenames are labelled suitably.</p>
<p>Now we can run the bash script like below while standing in the <code>3_mapping</code> directory.</p>
<pre><code>bash ../scripts/star_align.sh sample_1.fastq.gz sample_2.fastq.gz</code></pre>
<p>Now, do the same for the other samples as well if you have time. Otherwise just run for one sample and results for the other samples can be copied (See end of this section).</p>
<div class="optional">
<p><strong>Optional</strong></p>
<p>Try to create a new bash loop script (<code>star_align_batch.sh</code>) to iterate over all fastq files in the directory and run the mapping using the <code>star_align.sh</code> script. Note that there is a bit of a tricky issue here. You need to use two fastq files (<code>_1</code> and <code>_2</code>) per run rather than one file.</p>
<a class="btn btn-sm btn-collapse btn-collapse-optional" role="button" data-toggle="collapse" href="#2019020511039921" aria-expanded="false" aria-controls="2019020511039921"><b>+</b></a>
<div id="2019020511039921" class="collapse">
<pre class="r" style="margin-top:5px;">## find only files for read 1 and extract the sample name
lines=$(find *_1.fastq.gz | sed "s/_1.fastq.gz//")
for i in ${lines}
do
## use the sample name and add suffix (_1.fastq.gz or _2.fastq.gz)
echo "Mapping ${i}_1.fastq.gz and ${i}_2.fastq.gz ..."
bash ../scripts/star_align.sh "${i}_1.fastq.gz ${i}_2.fastq.gz"
done</pre>
</div>
<p>Run the <code>star_align_batch.sh</code> script in the <code>3_mapping</code> directory.</p>
<p><code>bash ../scripts/star_align_batch.sh</code></p>
</div>
<p>At the end of the mapping jobs, you should have the following list of output files for every sample:</p>
<pre><code>[user@rackham2 3_mapping]$ ls -l
-rw-rw-r-- 1 user g201XXXX 628M Sep 6 00:54 SRR3222409Aligned.sortedByCoord.out.bam
-rw-rw-r-- 1 user g201XXXX 1.9K Sep 6 00:54 SRR3222409Log.final.out
-rw-rw-r-- 1 user g201XXXX 21K Sep 6 00:54 SRR3222409Log.out
-rw-rw-r-- 1 user g201XXXX 482 Sep 6 00:54 SRR3222409Log.progress.out
-rw-rw-r-- 1 user g201XXXX 3.6M Sep 6 00:54 SRR3222409SJ.out.tab
drwx--S--- 2 user g201XXXX 4.0K Sep 6 00:50 SRR3222409_STARgenome</code></pre>
<p>The <code>.bam</code> file contains the alignment of all reads to the reference genome in binary format. BAM files are not human readable directly. To view a BAM file in text format, you can use <code>samtools view</code> functionality.</p>
<pre><code>module load samtools/1.6
samtools view SRR3222409Aligned.sortedByCoord.out.bam | head
SRR3222409.8816556 163 1 3199842 255 101M = 3199859 116 TTTTAAAGTTTTACAAGAAAAAAAATCAGATAACCGAGGAAAATTATTCATTATGAAGTACTACTTTCCACTTCATTTCATCACAAATTGTAACTTACTTA DDBDDIIIHIIHHHIHIHHIIIIIDHHIIIIIIIIIIIIIIHIIIIHIIIEHHIIIHIIIIGIIIIIIIIIIIIIIHIIHEHIIIIIIHIIIIIHIIIIII NH:i:1 HI:i:1 AS:i:198 nM:i:0
SRR3222409.8816556 83 1 3199859 255 99M = 3199842 -116AAAAAAAATCAGATAACCGAGGAAAATTATTCATTATGAAGTACTACTTTCCACTTCATTTCATCACAAATTGTAACTTACTTAACTGACCAAAAAAAC IIIIIHHIHHIIIIHHEEHIIIHIIHHHIHIIIIIIIHIHHIIIIIIHIIIIIIIIHHHHHIIIIIHIHHIIIHIHHFHHIIHIIIIHCIIIIHDDD@D NH:i:1 HI:i:1 AS:i:198 nM:i:0
SRR3222409.2149741 163 1 3199933 255 101M = 3200069 237 AACTTACTTAACTGACCAAAAAAACTATGGTACTGCAGTATAGCAAATACTCCACACACTGTGCTTTGAGCTAGAGCACTTGGAGTCACTGCCCAGGGCAG ABDDDHHIIIIIIIIIIIIIIIHHIIIIIIIIIIIIIIIIIIIIIIII<<FHIHGHIIIIGIHEHIIIIIGIIIIIIIIIIIIIIHIIIIIHIIIIHIIIH NH:i:1 HI:i:1 AS:i:200 nM:i:0</code></pre>
<p><i class="fas fa-comments"></i> Can you identify what some of these columns are?</p>
<p>The <code>Log.final.out</code> file gives a summary of the mapping run. This file is used by MultiQC later to collect mapping statistics.</p>
<p><i class="fas fa-clipboard-list"></i> Inspect one of the mapping log files to identify the number of uniquely mapped reads and multi-mapped reads.</p>
<a class="btn btn-sm btn-collapse btn-collapse-normal" role="button" data-toggle="collapse" href="#2019020511038454" aria-expanded="false" aria-controls="2019020511038454"><b>+</b></a>
<div id="2019020511038454" class="collapse">
<pre class="r" style="margin-top:5px;">cat SRR3222409Log.final.out
Started job on | Sep 08 14:03:46
Started mapping on | Sep 08 14:07:01
Finished on | Sep 08 14:09:05
Mapping speed, Million of reads per hour | 154.78
Number of input reads | 5331353
Average input read length | 201
UNIQUE READS:
Uniquely mapped reads number | 4532497
Uniquely mapped reads % | 85.02%
Average mapped length | 199.72
Number of splices: Total | 2628072
Number of splices: Annotated (sjdb) | 2608823
Number of splices: GT/AG | 2604679
Number of splices: GC/AG | 15762
Number of splices: AT/AC | 2422
Number of splices: Non-canonical | 5209
Mismatch rate per base, % | 0.18%
Deletion rate per base | 0.02%
Deletion average length | 1.49
Insertion rate per base | 0.01%
Insertion average length | 1.37
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 493795
% of reads mapped to multiple loci | 9.26%
Number of reads mapped to too many loci | 8241
% of reads mapped to too many loci | 0.15%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 5.51%
% of reads unmapped: other | 0.06%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
</pre>
</div>
<p>The BAM file names can be simplified by renaming them. This command renames all BAM files.</p>
<pre><code>rename "Aligned.sortedByCoord.out" "" *.bam</code></pre>
<p>Next, we need to index these BAM files. Indexing creates <code>.bam.bai</code> files which are required by many downstream programs to quickly and efficiently locate reads anywhere in the BAM file.</p>
<p><i class="fas fa-clipboard-list"></i> Index all BAM files.</p>
<pre><code>module load samtools/1.8
for i in *.bam
do
echo "Indexing $i ..."
samtools index $i
done</code></pre>
<p>Finally, we should have <code>.bai</code> index files for all BAM files.</p>
<pre><code>[user@rackham2 3_mapping]$ ls -l
-rw-rw-r-- 1 user g201XXXX 628M Sep 6 00:54 SRR3222409.bam
-rw-rw-r-- 1 user g201XXXX 1.8M Sep 6 01:22 SRR3222409.bam.bai</code></pre>
<p><i class="fas fa-lightbulb"></i> If you are running short of time or unable to run the mapping, you can copy over results for all samples that have been prepared for you before class. They are available at this location: <code>/sw/share/compstore/courses/ngsintro/rnaseq/main/3_mapping/</code>.</p>
<pre><code>cp -r /sw/share/compstore/courses/ngsintro/rnaseq/main/3_mapping/* /proj/snic2019-8-3/nobackup/[user]/rnaseq/3_mapping/</code></pre>
<p><br></p>
</div>
</div>
<div id="qualimap-post-alignment-qc" class="section level2">
<h2><span class="header-section-number">2.4</span> QualiMap: Post-alignment QC</h2>
<p>Some important quality aspects, such as saturation of sequencing depth, read distribution between different genomic features or coverage uniformity along transcripts, can be measured only after mapping reads to the reference genome. One of the tools to perform this post-alignment quality control is QualiMap. QualiMap examines sequencing alignment data in SAM/BAM files according to the features of the mapped reads and provides an overall view of the data that helps to the detect biases in the sequencing and/or mapping of the data and eases decision-making for further analysis.</p>
<p><i class="fas fa-clipboard-list"></i> Read through <a href="http://qualimap.bioinfo.cipf.es/doc_html/intro.html">QualiMap</a> documentation and see if you can figure it out how to run it to assess post-alignment quality on the RNA-seq mapped samples. Here is the RNA-Seq specific tool <a href="http://qualimap.bioinfo.cipf.es/doc_html/analysis.html#rnaseqqc">explanation</a>. The tool is already installed on Uppmax as a module.</p>
<p><i class="fas fa-clipboard-list"></i> Load the QualiMap module version 2.2.1 and create a bash script named <code>qualimap.sh</code> in your <code>scripts</code> directory.</p>
<p>Add the following script to it.</p>
<a class="btn btn-sm btn-collapse btn-collapse-normal" role="button" data-toggle="collapse" href="#2019020511035855" aria-expanded="false" aria-controls="2019020511035855"><b>+</b></a>
<div id="2019020511035855" class="collapse">
<pre class="r" style="margin-top:5px;">#!/bin/bash
# load modules
module load bioinfo-tools
module load QualiMap/2.2.1
prefix="${1/.bam/}"
export DISPLAY=""
qualimap rnaseq -pe \
-bam $1 \
-gtf "../reference/Mus_musculus.GRCm38.93.gtf" \
-outdir "$prefix" \
-outfile "$prefix" \
-outformat "HTML" \
--java-mem-size=50G >& "${prefix}-qualimap.log"</pre>
</div>
<p>The line <code>prefix="${1/.bam/}"</code> is used to remove <code>.bam</code> from the input filename and create a prefix which will be used to label output. The <code>export DISPLAY=""</code> is used to run JAVA application in headless mode or else throws an error about X11 display. The last part <code>>& "${prefix}-qualimap.log"</code> saves the <strong>standard-out</strong> as a log file.</p>
<p><i class="fas fa-clipboard-list"></i> create a new bash loop script named <code>qualimap_batch.sh</code> with a bash loop to run the qualimap script over all BAM files. The loop should look like below. Alternatively, you can also simply run the script below directly on the command line.</p>
<a class="btn btn-sm btn-collapse btn-collapse-normal" role="button" data-toggle="collapse" href="#2019020511031764" aria-expanded="false" aria-controls="2019020511031764"><b>+</b></a>
<div id="2019020511031764" class="collapse">
<pre class="r" style="margin-top:5px;">for i in ../3_mapping/*.bam
do
echo "Running QualiMap on $i ..."
bash ../scripts/qualimap.sh $i
done</pre>
</div>
<p>Run the loop script <code>qualimap_batch.sh</code> in the directory <code>4_qualimap</code>.</p>
<p><code>bash ../scripts/qualimap_batch.sh</code></p>
<p>Qualimap should have created a directory for every BAM file. Inside every directory, you should see:</p>
<pre><code>[user@rackham2 4_qualimap]$ ls -l
drwxrwxr-x 2 user g201XXXX 4.0K Sep 14 17:24 css
drwxrwxr-x 2 user g201XXXX 4.0K Sep 14 17:24 images_qualimapReport
-rw-rw-r-- 1 user g201XXXX 11K Sep 14 17:24 qualimapReport.html
drwxrwxr-x 2 user g201XXXX 4.0K Sep 14 17:24 raw_data_qualimapReport
-rw-rw-r-- 1 user g201XXXX 1.2K Sep 14 17:24 rnaseq_qc_results.txt</code></pre>
<p><i class="fas fa-clipboard-list"></i> Inspect the HTML output file and try to make sense of it.</p>
<a class="btn btn-sm btn-collapse btn-collapse-normal" role="button" data-toggle="collapse" href="#2019020511039673" aria-expanded="false" aria-controls="2019020511039673"><b>+</b></a>
<div id="2019020511039673" class="collapse">
<pre class="r" style="margin-top:5px;">firefox qualimapReport.html &</pre>
</div>
<p><i class="fas fa-lightbulb"></i> If you are running out of time or were unable to run QualiMap, you can also copy pre-run QualiMap output from this location: <code>/sw/share/compstore/courses/ngsintro/rnaseq/main/4_qualimap/</code>.</p>
<pre><code>cp -r /sw/share/compstore/courses/ngsintro/rnaseq/main/4_qualimap/* /proj/snic2019-8-3/nobackup/[user]/rnaseq/4_qualimap/</code></pre>
<p><i class="fas fa-comments"></i> Check the QualiMap report for one sample and discuss if the sample is of good quality. You only need to do this for one file now. We will do a comparison with all samples when using the MultiQC tool.</p>
<p><br></p>
</div>
<div id="featurecounts-counting-reads" class="section level2">
<h2><span class="header-section-number">2.5</span> featureCounts: Counting reads</h2>
<p>After ensuring mapping quality, we can move on to enumerating reads mapping to genomic features of interest. Here we will use <strong>featureCounts</strong>, an ultrafast and accurate read summarization program, that can count mapped reads for genomic features such as genes, exons, promoter, gene bodies, genomic bins and chromosomal locations.</p>
<p><i class="fas fa-clipboard-list"></i> Read featureCounts <a href="http://bioinf.wehi.edu.au/subread-package/SubreadUsersGuide.pdf">documentation</a> and see if you can figure it out how to use paired-end reads using an unstranded library to count fragments overlapping with exonic regions and summarise over genes.</p>
<p><i class="fas fa-clipboard-list"></i> Load the subread module version 1.5.2 on Uppmax. Create a bash script named <code>featurecounts.sh</code> in the directory <code>scripts</code>.</p>
<p>We could run featureCounts on each BAM file, produce a text output for each sample and combine the output. But the easier way is to provide a list of all BAM files and featureCounts will combine counts for all samples into one text file.</p>
<p>Below is the script that we will use:</p>
<a class="btn btn-sm btn-collapse btn-collapse-normal" role="button" data-toggle="collapse" href="#2019020511038474" aria-expanded="false" aria-controls="2019020511038474"><b>+</b></a>
<div id="2019020511038474" class="collapse">
<pre class="r" style="margin-top:5px;">#!/bin/bash
# load modules
module load bioinfo-tools
module load subread/1.5.2
featureCounts \
-a "../reference/Mus_musculus.GRCm38.93.gtf" \
-o "counts.txt" \
-F "GTF" \
-t "exon" \
-g "gene_id" \
-p \
-s 0 \
-T 8 \
../3_mapping/*.bam</pre>
</div>
<p>In the above script, we indicate the path of the annotation file (<code>-a "../reference/Mus_musculus.GRCm38.93.gtf"</code>), specify the output file name (<code>-o "counts.txt"</code>), specify that that annotation file is in GTF format (<code>-F "GTF"</code>), specify that reads are to be counted over exonic features (<code>-t "exon"</code>) and summarised to the gene level (<code>-g "gene_id"</code>). We also specify that the reads are paired-end (<code>-p</code>), the library is unstranded (<code>-s 0</code>) and the number of threads to use (<code>-T 8</code>).</p>
<p>Run the featurecounts bash script in the directory <code>5_dge</code>. Use <code>pwd</code> to check if you are standing in the correct directory.</p>
<p>You should be standing here to run this:</p>
<pre><code>/proj/snic2019-8-3/nobackup/[user]/rnaseq/5_dge</code></pre>
<a class="btn btn-sm btn-collapse btn-collapse-normal" role="button" data-toggle="collapse" href="#2019020511032638" aria-expanded="false" aria-controls="2019020511032638"><b>+</b></a>
<div id="2019020511032638" class="collapse">
<pre class="r" style="margin-top:5px;">bash ../scripts/featurecounts.sh</pre>
</div>
<p>You should have two files:</p>
<pre><code>[user@rackham2 5_dge]$ ls -l
-rw-rw-r-- 1 user g201XXXX 2.8M Sep 15 11:05 counts.txt
-rw-rw-r-- 1 user g201XXXX 658 Sep 15 11:05 counts.txt.summary</code></pre>
<p><i class="fas fa-comments"></i> Inspect the files and try to make sense of them.</p>
<p><br></p>
</div>
<div id="multiqc-combined-qc-report" class="section level2">
<h2><span class="header-section-number">2.6</span> MultiQC: Combined QC report</h2>
<p>We will use the tool <strong>MultiQC</strong> to crawl through the output, log files etc from FastQC, STAR, QualiMap and featureCounts to create a combined QC report.</p>
<p>Run MultiQC as shown below in the <code>6_multiqc</code> directory. You should be standing here to run this:</p>
<pre><code>/proj/snic2019-8-3/nobackup/[user]/rnaseq/6_multiqc</code></pre>
<pre><code>module load bioinfo-tools
module load MultiQC/1.6
multiqc --interactive ../</code></pre>
<p>You should have two files:</p>
<pre><code>[user@rackham2 6_multiqc]$ ls -l
drwxrwsr-x 2 user g201XXXX 4.0K Sep 6 22:33 multiqc_data
-rw-rw-r-- 1 user g201XXXX 1.3M Sep 6 22:33 multiqc_report.html</code></pre>
<p><i class="fas fa-comments"></i> Open the MultiQC HTML report using <code>firefox</code> and/or transfer to your computer and inspect the report.</p>
<a class="btn btn-sm btn-collapse btn-collapse-normal" role="button" data-toggle="collapse" href="#2019020511033042" aria-expanded="false" aria-controls="2019020511033042"><b>+</b></a>
<div id="2019020511033042" class="collapse">
<pre class="r" style="margin-top:5px;">firefox multiqc_report.html &</pre>
</div>
<p><br></p>
</div>
<div id="differential-gene-expression" class="section level2">
<h2><span class="header-section-number">2.7</span> Differential gene expression</h2>
<p>The easiest way to perform differential expression is to use one of the statistical packages, within R environment, that were specifically designed for analyses of read counts arising from RNA-seq, SAGE and similar technologies. Here, we will one of such packages called <strong>edgeR</strong>. Learning R is beyond the scope of this course so we prepared basic ready to run R scripts to find DE genes between conditions <strong>KO</strong> and <strong>Wt</strong>.</p>
<p>Move to the <code>5_dge</code> directory and load R modules for use.</p>
<pre><code>module load R/3.4.3
module load R_packages/3.4.3</code></pre>
<p>Use <code>pwd</code> to check if you are standing in the correct directory. Copy the following files to the <code>5_dge</code> directory.</p>
<ul>
<li><code>/sw/share/compstore/courses/ngsintro/rnaseq/main/5_dge/annotations.txt</code></li>
<li><code>/sw/share/compstore/courses/ngsintro/rnaseq/main/5_dge/dge.R</code></li>
</ul>
<p>Make sure you have the <code>counts.txt</code> file from featureCounts. If not, you can copy this file too.</p>
<ul>
<li><code>/sw/share/compstore/courses/ngsintro/rnaseq/main/5_dge/counts.txt</code></li>
</ul>
<pre><code>cp /sw/share/compstore/courses/ngsintro/rnaseq/main/5_dge/annotations.txt .
cp /sw/share/compstore/courses/ngsintro/rnaseq/main/5_dge/dge.R .
cp /sw/share/compstore/courses/ngsintro/rnaseq/main/5_dge/counts.txt .</code></pre>
<p>Now, run the R script in <code>5_dge</code> directory.</p>
<pre><code>Rscript dge.R</code></pre>
<p>This should have produced the following output files:</p>
<pre><code>[user@rackham2 5_dge]$ ls -l
-rw-rw-r-- 1 user g201XXXX 8.9M Nov 29 16:31 dge_data.RData
-rw-rw-r-- 1 user g201XXXX 2.6M Nov 29 16:31 dge_results.txt</code></pre>
<p><i class="fas fa-clipboard-list"></i> Copy the results text file (<code>dge_results.txt</code>) to your computer and inspect the results. What are the columns? How many differentially expressed genes are present at an FDR cutoff of 0.05? How many genes are upregulated and how many are down-regulated? How does this change if we set a fold-change cut-off of 1?</p>
<p><i class="fas fa-lightbulb"></i> Open in a spreadsheet editor like Microsoft Excel or LibreOffice Calc.</p>
<p>If you do not have the results or were unable to run the DGE step, you can copy these two here which will be required for functional annotation (optional).</p>
<pre><code>cp /sw/share/compstore/courses/ngsintro/rnaseq/main/5_dge/dge_results.txt .
cp /sw/share/compstore/courses/ngsintro/rnaseq/main/5_dge/dge_data.Rdata .</code></pre>
<p><br></p>
</div>
</div>
<div id="bonus-exercises" class="section level1">
<h1><span class="header-section-number">3</span> Bonus exercises</h1>
<div class="instruction">
<p>These exercises are completely optional and to be run only if you have time and if it interests you.</p>
<p><strong>Markers:</strong> <i class="fas fa-desktop"></i> Run locally <i class="fas fa-cloud"></i> Run on Uppmax</p>
</div>
<div id="functional-annotation" class="section level2">
<h2><span class="header-section-number">3.1</span> Functional annotation</h2>