-
Notifications
You must be signed in to change notification settings - Fork 0
/
amplicon_workflow.html
3397 lines (3300 loc) · 302 KB
/
amplicon_workflow.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="generator" content="pandoc" />
<meta http-equiv="X-UA-Compatible" content="IE=EDGE" />
<meta name="author" content="Author: Linton Freund ([email protected])" />
<title>Amplicon Analysis Workflow</title>
<script src="site_libs/header-attrs-2.20/header-attrs.js"></script>
<script src="site_libs/jquery-3.6.0/jquery-3.6.0.min.js"></script>
<meta name="viewport" content="width=device-width, initial-scale=1" />
<link href="site_libs/bootstrap-3.3.5/css/cosmo.min.css" rel="stylesheet" />
<script src="site_libs/bootstrap-3.3.5/js/bootstrap.min.js"></script>
<script src="site_libs/bootstrap-3.3.5/shim/html5shiv.min.js"></script>
<script src="site_libs/bootstrap-3.3.5/shim/respond.min.js"></script>
<style>h1 {font-size: 34px;}
h1.title {font-size: 38px;}
h2 {font-size: 30px;}
h3 {font-size: 24px;}
h4 {font-size: 18px;}
h5 {font-size: 16px;}
h6 {font-size: 12px;}
code {color: inherit; background-color: rgba(0, 0, 0, 0.04);}
pre:not([class]) { background-color: white }</style>
<script src="site_libs/jqueryui-1.11.4/jquery-ui.min.js"></script>
<link href="site_libs/tocify-1.9.1/jquery.tocify.css" rel="stylesheet" />
<script src="site_libs/tocify-1.9.1/jquery.tocify.js"></script>
<script src="site_libs/navigation-1.1/tabsets.js"></script>
<script src="site_libs/navigation-1.1/codefolding.js"></script>
<link href="site_libs/font-awesome-5.1.0/css/all.css" rel="stylesheet" />
<link href="site_libs/font-awesome-5.1.0/css/v4-shims.css" rel="stylesheet" />
<style type="text/css">
code{white-space: pre-wrap;}
span.smallcaps{font-variant: small-caps;}
span.underline{text-decoration: underline;}
div.column{display: inline-block; vertical-align: top; width: 50%;}
div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
ul.task-list{list-style: none;}
</style>
<style type="text/css">
code {
white-space: pre;
}
.sourceCode {
overflow: visible;
}
</style>
<style type="text/css" data-origin="pandoc">
pre > code.sourceCode { white-space: pre; position: relative; }
pre > code.sourceCode > span { display: inline-block; line-height: 1.25; }
pre > code.sourceCode > span:empty { height: 1.2em; }
.sourceCode { overflow: visible; }
code.sourceCode > span { color: inherit; text-decoration: inherit; }
div.sourceCode { margin: 1em 0; }
pre.sourceCode { margin: 0; }
@media screen {
div.sourceCode { overflow: auto; }
}
@media print {
pre > code.sourceCode { white-space: pre-wrap; }
pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; }
}
pre.numberSource code
{ counter-reset: source-line 0; }
pre.numberSource code > span
{ position: relative; left: -4em; counter-increment: source-line; }
pre.numberSource code > span > a:first-child::before
{ content: counter(source-line);
position: relative; left: -1em; text-align: right; vertical-align: baseline;
border: none; display: inline-block;
-webkit-touch-callout: none; -webkit-user-select: none;
-khtml-user-select: none; -moz-user-select: none;
-ms-user-select: none; user-select: none;
padding: 0 4px; width: 4em;
color: #aaaaaa;
}
pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; }
div.sourceCode
{ }
@media screen {
pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; }
}
code span.al { color: #ff0000; font-weight: bold; } /* Alert */
code span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
code span.at { color: #7d9029; } /* Attribute */
code span.bn { color: #40a070; } /* BaseN */
code span.bu { color: #008000; } /* BuiltIn */
code span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
code span.ch { color: #4070a0; } /* Char */
code span.cn { color: #880000; } /* Constant */
code span.co { color: #60a0b0; font-style: italic; } /* Comment */
code span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
code span.do { color: #ba2121; font-style: italic; } /* Documentation */
code span.dt { color: #902000; } /* DataType */
code span.dv { color: #40a070; } /* DecVal */
code span.er { color: #ff0000; font-weight: bold; } /* Error */
code span.ex { } /* Extension */
code span.fl { color: #40a070; } /* Float */
code span.fu { color: #06287e; } /* Function */
code span.im { color: #008000; font-weight: bold; } /* Import */
code span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
code span.kw { color: #007020; font-weight: bold; } /* Keyword */
code span.op { color: #666666; } /* Operator */
code span.ot { color: #007020; } /* Other */
code span.pp { color: #bc7a00; } /* Preprocessor */
code span.sc { color: #4070a0; } /* SpecialChar */
code span.ss { color: #bb6688; } /* SpecialString */
code span.st { color: #4070a0; } /* String */
code span.va { color: #19177c; } /* Variable */
code span.vs { color: #4070a0; } /* VerbatimString */
code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
.sourceCode .row {
width: 100%;
}
.sourceCode {
overflow-x: auto;
}
.code-folding-btn {
margin-right: -30px;
}
</style>
<script>
// apply pandoc div.sourceCode style to pre.sourceCode instead
(function() {
var sheets = document.styleSheets;
for (var i = 0; i < sheets.length; i++) {
if (sheets[i].ownerNode.dataset["origin"] !== "pandoc") continue;
try { var rules = sheets[i].cssRules; } catch (e) { continue; }
var j = 0;
while (j < rules.length) {
var rule = rules[j];
// check if there is a div.sourceCode rule
if (rule.type !== rule.STYLE_RULE || rule.selectorText !== "div.sourceCode") {
j++;
continue;
}
var style = rule.style.cssText;
// check if color or background-color is set
if (rule.style.color === '' && rule.style.backgroundColor === '') {
j++;
continue;
}
// replace div.sourceCode by a pre.sourceCode rule
sheets[i].deleteRule(j);
sheets[i].insertRule('pre.sourceCode{' + style + '}', j);
}
}
})();
</script>
<style type="text/css">
/* for pandoc --citeproc since 2.11 */
div.csl-bib-body { }
div.csl-entry {
clear: both;
}
.hanging div.csl-entry {
margin-left:2em;
text-indent:-2em;
}
div.csl-left-margin {
min-width:2em;
float:left;
}
div.csl-right-inline {
margin-left:2em;
padding-left:1em;
}
div.csl-indent {
margin-left: 2em;
}
</style>
<style type = "text/css">
.main-container {
max-width: 940px;
margin-left: auto;
margin-right: auto;
}
img {
max-width:100%;
}
.tabbed-pane {
padding-top: 12px;
}
.html-widget {
margin-bottom: 20px;
}
button.code-folding-btn:focus {
outline: none;
}
summary {
display: list-item;
}
details > summary > p:only-child {
display: inline;
}
pre code {
padding: 0;
}
</style>
<style type="text/css">
.dropdown-submenu {
position: relative;
}
.dropdown-submenu>.dropdown-menu {
top: 0;
left: 100%;
margin-top: -6px;
margin-left: -1px;
border-radius: 0 6px 6px 6px;
}
.dropdown-submenu:hover>.dropdown-menu {
display: block;
}
.dropdown-submenu>a:after {
display: block;
content: " ";
float: right;
width: 0;
height: 0;
border-color: transparent;
border-style: solid;
border-width: 5px 0 5px 5px;
border-left-color: #cccccc;
margin-top: 5px;
margin-right: -10px;
}
.dropdown-submenu:hover>a:after {
border-left-color: #adb5bd;
}
.dropdown-submenu.pull-left {
float: none;
}
.dropdown-submenu.pull-left>.dropdown-menu {
left: -100%;
margin-left: 10px;
border-radius: 6px 0 6px 6px;
}
</style>
<script type="text/javascript">
// manage active state of menu based on current page
$(document).ready(function () {
// active menu anchor
href = window.location.pathname
href = href.substr(href.lastIndexOf('/') + 1)
if (href === "")
href = "index.html";
var menuAnchor = $('a[href="' + href + '"]');
// mark the anchor link active (and if it's in a dropdown, also mark that active)
var dropdown = menuAnchor.closest('li.dropdown');
if (window.bootstrap) { // Bootstrap 4+
menuAnchor.addClass('active');
dropdown.find('> .dropdown-toggle').addClass('active');
} else { // Bootstrap 3
menuAnchor.parent().addClass('active');
dropdown.addClass('active');
}
// Navbar adjustments
var navHeight = $(".navbar").first().height() + 15;
var style = document.createElement('style');
var pt = "padding-top: " + navHeight + "px; ";
var mt = "margin-top: -" + navHeight + "px; ";
var css = "";
// offset scroll position for anchor links (for fixed navbar)
for (var i = 1; i <= 6; i++) {
css += ".section h" + i + "{ " + pt + mt + "}\n";
}
style.innerHTML = "body {" + pt + "padding-bottom: 40px; }\n" + css;
document.head.appendChild(style);
});
</script>
<!-- tabsets -->
<style type="text/css">
.tabset-dropdown > .nav-tabs {
display: inline-table;
max-height: 500px;
min-height: 44px;
overflow-y: auto;
border: 1px solid #ddd;
border-radius: 4px;
}
.tabset-dropdown > .nav-tabs > li.active:before, .tabset-dropdown > .nav-tabs.nav-tabs-open:before {
content: "\e259";
font-family: 'Glyphicons Halflings';
display: inline-block;
padding: 10px;
border-right: 1px solid #ddd;
}
.tabset-dropdown > .nav-tabs.nav-tabs-open > li.active:before {
content: "\e258";
font-family: 'Glyphicons Halflings';
border: none;
}
.tabset-dropdown > .nav-tabs > li.active {
display: block;
}
.tabset-dropdown > .nav-tabs > li > a,
.tabset-dropdown > .nav-tabs > li > a:focus,
.tabset-dropdown > .nav-tabs > li > a:hover {
border: none;
display: inline-block;
border-radius: 4px;
background-color: transparent;
}
.tabset-dropdown > .nav-tabs.nav-tabs-open > li {
display: block;
float: none;
}
.tabset-dropdown > .nav-tabs > li {
display: none;
}
</style>
<!-- code folding -->
<style type="text/css">
.code-folding-btn { margin-bottom: 4px; }
</style>
<style type="text/css">
#TOC {
margin: 25px 0px 20px 0px;
}
@media (max-width: 768px) {
#TOC {
position: relative;
width: 100%;
}
}
@media print {
.toc-content {
/* see https://github.com/w3c/csswg-drafts/issues/4434 */
float: right;
}
}
.toc-content {
padding-left: 30px;
padding-right: 40px;
}
div.main-container {
max-width: 1200px;
}
div.tocify {
width: 20%;
max-width: 260px;
max-height: 85%;
}
@media (min-width: 768px) and (max-width: 991px) {
div.tocify {
width: 25%;
}
}
@media (max-width: 767px) {
div.tocify {
width: 100%;
max-width: none;
}
}
.tocify ul, .tocify li {
line-height: 20px;
}
.tocify-subheader .tocify-item {
font-size: 0.90em;
}
.tocify .list-group-item {
border-radius: 0px;
}
</style>
</head>
<body>
<div class="container-fluid main-container">
<!-- setup 3col/9col grid for toc_float and main content -->
<div class="row">
<div class="col-xs-12 col-sm-4 col-md-3">
<div id="TOC" class="tocify">
</div>
</div>
<div class="toc-content col-xs-12 col-sm-8 col-md-9">
<div class="navbar navbar-default navbar-fixed-top" role="navigation">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-bs-toggle="collapse" data-target="#navbar" data-bs-target="#navbar">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="index.html">The Resources</a>
</div>
<div id="navbar" class="navbar-collapse collapse">
<ul class="nav navbar-nav">
<li>
<a href="index.html">Home</a>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" data-bs-toggle="dropdown" aria-expanded="false">
Workflows
<span class="caret"></span>
</a>
<ul class="dropdown-menu" role="menu">
<li>
<a href="amplicon_workflow.html">
<span class="fa fa-solid fa-dna"></span>
Amplicon Sequencing Worfklow
</a>
</li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" data-bs-toggle="dropdown" aria-expanded="false">
Tutorials
<span class="caret"></span>
</a>
<ul class="dropdown-menu" role="menu">
<li>
<a href="Basic_Bash_Tutorial.html">
<span class="fa fa-solid fa-terminal"></span>
Bash Basics Tutorial
</a>
</li>
</ul>
</li>
</ul>
<ul class="nav navbar-nav navbar-right">
</ul>
</div><!--/.nav-collapse -->
</div><!--/.container -->
</div><!--/.navbar -->
<div id="header">
<div class="btn-group pull-right float-right">
<button type="button" class="btn btn-default btn-xs btn-secondary btn-sm dropdown-toggle" data-toggle="dropdown" data-bs-toggle="dropdown" aria-haspopup="true" aria-expanded="false"><span>Code</span> <span class="caret"></span></button>
<ul class="dropdown-menu dropdown-menu-right" style="min-width: 50px;">
<li><a id="rmd-show-all-code" href="#">Show All Code</a></li>
<li><a id="rmd-hide-all-code" href="#">Hide All Code</a></li>
</ul>
</div>
<h1 class="title toc-ignore">Amplicon Analysis Workflow</h1>
<h4 class="author">Author: Linton Freund (<a
href="mailto:[email protected]" class="email">[email protected]</a>)</h4>
<h4 class="date">Last update: 02 October, 2023</h4>
</div>
<style type="text/css">
body{
font-size: 13pt;
}
</style>
<!--
Render from R:
rmarkdown::render("Amplicon_Workflow.Rmd", clean=TRUE, output_format="html_document")
R
Note: if you render as a PDF then try to render a website, change the date format above (see here https://stackoverflow.com/questions/39415294/avoid-yaml-header-change-when-switching-knitr-output-format)
Rendering from the command-line. To render to PDF format, use the argument setting: output_format="pdf_document".
$ Rscript -e "rmarkdown::render('Amplicon_Workflow.Rmd', output_format='html_document', clean=TRUE)"
Add logo:
htmltools::img(src = knitr::image_uri("mylogo.png"),
alt = 'logo',
style = 'position:absolute; top:0; center:0; padding:10px;')
-->
<div id="background" class="section level1" number="1">
<h1><span class="header-section-number">1</span> Background</h1>
<p>This is a tutorial on how to process your 16S ITS1/ITS2 amplicon
sequences and identify the taxonomic identification of the ASVs (i.e.,
amplicon sequence variants, also known as zOTUs for zero OTUs or ESVs
for exact sequence variants) in your sequence data.</p>
<p>To create this tutorial, I have assembled scripts I’ve used to
analyze 16S amplicon sequence data provided by Dr. Emma Aronson’s lab.
The data I am working with to create this workflow comes from a project
that examined soil microbial community composition in Mount Saint
Helens. The target region was the V4 region within the 16S gene, and
sequencing was performed with an Illumina MiSeq (2x300).</p>
<p>This tutorial would not have been possible without <a
href="https://callahanlab.cvm.ncsu.edu/">Dr. Benjamin Callahan’s</a>
<code>DADA2</code> <a
href="https://github.com/benjjneb/dada2">program</a> <span
class="citation">(Callahan et al. 2016)</span> and <a
href="https://benjjneb.github.io/dada2/tutorial.html">tutorials</a>.
Additionally, I would like to especially thank <a
href="https://astrobiomike.github.io/research/">Dr. Mike Lee</a> for his
guidance, his patience, and his Happy Belly Bioinformatics tutorial
called <a
href="https://astrobiomike.github.io/amplicon/dada2_workflow_ex#removing-likely-contaminants"><strong>Amplicon
Analysis</strong></a> tutorial <span class="citation">(Lee
2019)</span>.</p>
<p><strong>You can find all of the scripts used in this workflow in this
<a
href="https://github.com/hlfreund/Amplicon_Sequencing_Worfklow">GitHub
repository</a> or you can download it from SourceForge or Zenodo, which
is linked in the <a href="#about-me">About Me</a> section.</strong></p>
<p><strong>If you do use this workflow, please cite this using the DOI,
which is included in the <a href="#about-me">About Me</a>
section</strong></p>
<div id="considerations-before-you-begin" class="section level2"
number="1.1">
<h2><span class="header-section-number">1.1</span> Considerations before
you begin</h2>
<p>I was able to analyze these sequences on a High Performance Computing
cluster (HPCC) that uses a Slurm scheduler. The minimum amount of total
memory I used (not per CPU, but overall) for each step in this workflow
(i.e., each step as a separate ‘job’ via Slurm) was 400GB. Having enough
memory is essential to running most of these programs, so please keep
this in mind before embarking on this workflow!</p>
<p>These steps are also time consuming depending on your memory
constraints, so do not be concerned if this process takes a while. If
you plan to run through multiple steps in the workflow in one sitting,
then I suggest loading <strong>tmux</strong> before you run your
scripts. Here is a handy <a href="https://tmuxcheatsheet.com/">tmux
cheat sheet</a> that I refer to often. For more information on what tmux
is and how to utilize it, check this <a
href="https://thoughtbot.com/blog/a-tmux-crash-course">tmux crash
course</a> by Josh Clayton.</p>
<p>I also suggest exploring a program called <strong>neovim</strong>
(aka nvim) that allows you to use Vim (a text editor) to edit R code and
run the code simultaneously. Though nvim is not necessary to run through
this workflow, I find that it makes my life a bit easier when running
through the <code>DADA2</code> portion of the workflow. I will get more
into the usage of nvim once we get to the <code>DADA2</code> step(s),
but for more information please view the <a
href="https://github.com/jalvesaq/Nvim-R">Neovim Github</a> as well as
its <a
href="https://github.com/jamespeapen/Nvim-R/wiki#overview">documentation</a>.
You can also find a helpful nvim tutorial <a
href="https://girke.bioinformatics.ucr.edu/GEN242/tutorials/linux/linux/#nvim-r-tmux-essentials">here</a>
created by Dr. Thomas Girke from UC Riverside.</p>
<p>Additionally, you will need to change your path to each of these
programs depending on where they are stored in your computer or HPCC. If
you are running these steps locally (which, if you are, then you have
one badass computer!), then you can skip the module loading lines in
each step – loading modules is specifically for running these scripts on
a HPCC that uses a Slurm Workload Manager.</p>
</div>
<div id="submitting-scripts-as-jobs-with-slurm" class="section level2"
number="1.2">
<h2><span class="header-section-number">1.2</span> Submitting Scripts as
Jobs with Slurm</h2>
<p>If you are unsure as to how to set up the script for submitting on
your HPCC, check the code chunk below. This is the information I use
when submitting a job to our Slurm system. Again, this is specifically
for a system that uses the Slurm scheduler. For more information on what
the arguments mean and how to set up your job submissions, please refer
to this handy <a
href="https://slurm.schedmd.com/pdfs/summary.pdf">cheatsheet</a> made by
Slurm.</p>
<p><strong>NOTE</strong>: If you are running these scripts on an HPCC,
please load the module you need before running, or add
<code>load module name_of_module</code> to your script before you call
on the program you want to use.</p>
<div class="sourceCode" id="cb1"><pre
class="sourceCode bash"><code class="sourceCode bash"><span id="cb1-1"><a href="#cb1-1" tabindex="-1"></a><span class="co">#!/bin/bash -l</span></span>
<span id="cb1-2"><a href="#cb1-2" tabindex="-1"></a></span>
<span id="cb1-3"><a href="#cb1-3" tabindex="-1"></a><span class="co">#SBATCH --nodes=1</span></span>
<span id="cb1-4"><a href="#cb1-4" tabindex="-1"></a><span class="co">#SBATCH --ntasks=1</span></span>
<span id="cb1-5"><a href="#cb1-5" tabindex="-1"></a><span class="co">#SBATCH --cpus-per-task=4 # must match the # of threads if program allows threading (-t #)</span></span>
<span id="cb1-6"><a href="#cb1-6" tabindex="-1"></a><span class="co">##SBATCH --mem-per-cpu=500G # memory per cpu - * if threading, do not let this line run (use ##). Cannot ask for too much memory per cpu!</span></span>
<span id="cb1-7"><a href="#cb1-7" tabindex="-1"></a><span class="co">#SBATCH --mem=500GB # overall memory - if you're threading, keep this line</span></span>
<span id="cb1-8"><a href="#cb1-8" tabindex="-1"></a><span class="co">#SBATCH --time=1-00:00:00 # time requested; this example is 1 day, 0 hrs</span></span>
<span id="cb1-9"><a href="#cb1-9" tabindex="-1"></a><span class="co">#SBATCH --output=name_of_log_file_6.27.21.stdout # name of your log file</span></span>
<span id="cb1-10"><a href="#cb1-10" tabindex="-1"></a><span class="co">#SBATCH [email protected] # your email address </span></span>
<span id="cb1-11"><a href="#cb1-11" tabindex="-1"></a><span class="co">#SBATCH --mail-type=ALL # will send you email updates when job starts and ends (and if it runs successfully or not)</span></span>
<span id="cb1-12"><a href="#cb1-12" tabindex="-1"></a><span class="co">#SBATCH --job-name="Name of Job 1/1/21" # name of your job for Slurm</span></span>
<span id="cb1-13"><a href="#cb1-13" tabindex="-1"></a><span class="co">#SBATCH -p node_name_here # partition node name</span></span></code></pre></div>
<p>When I don’t know exactly what a program’s output will look like, I
will run the program via an interactive job on the HPCC. I also suggest
running programs interactively if the program requires multiple lines of
code to run, and you want to make sure each step has the correct input
(whether it be a file, an object, or the output of a previous step in
the code). For some more information on interactive jobs in a Slurm
system, check out this <a
href="https://yunmingzhang.wordpress.com/2015/06/29/how-to-use-srun-to-get-an-interactive-node/">blog
post</a> by Yunming Zhang. This is how I set up an interactive job on
the HPCC (that uses Slurm).</p>
<div class="sourceCode" id="cb2"><pre
class="sourceCode bash"><code class="sourceCode bash"><span id="cb2-1"><a href="#cb2-1" tabindex="-1"></a><span class="ex">srun</span> <span class="at">-p</span> node_name_here <span class="at">--mem</span><span class="op">=</span>500gb <span class="at">--time</span><span class="op">=</span>1-00:00:00 <span class="at">--pty</span> bash <span class="at">-l</span></span>
<span id="cb2-2"><a href="#cb2-2" tabindex="-1"></a><span class="co"># -p = partition</span></span>
<span id="cb2-3"><a href="#cb2-3" tabindex="-1"></a><span class="co"># --mem = overall memory being requested, not memory per CPU</span></span>
<span id="cb2-4"><a href="#cb2-4" tabindex="-1"></a><span class="co"># --time = overall time requested; 1 day, 0 hrs</span></span>
<span id="cb2-5"><a href="#cb2-5" tabindex="-1"></a><span class="co"># -–pty = gives you a pseudo terminal that runs bash</span></span>
<span id="cb2-6"><a href="#cb2-6" tabindex="-1"></a><span class="co"># bash -l = setting bash as language</span></span></code></pre></div>
</div>
<div id="a-bash-scripting-tip-for-before-we-start"
class="section level2" number="1.3">
<h2><span class="header-section-number">1.3</span> A bash scripting tip
for before we start</h2>
<p>I wanted to share a bit of code that you will see being implemented
in every script throughout the tutorial. This little bit of code will
help you pull out the sample names from your files, allowing you easily
run through your files while also keeping track of which samples those
files belong to.</p>
<div class="sourceCode" id="cb3"><pre
class="sourceCode bash"><code class="sourceCode bash"><span id="cb3-1"><a href="#cb3-1" tabindex="-1"></a><span class="cf">for</span> FILE <span class="kw">in</span> path/<span class="pp">*</span>.fastq<span class="kw">;</span></span>
<span id="cb3-2"><a href="#cb3-2" tabindex="-1"></a><span class="cf">do</span></span>
<span id="cb3-3"><a href="#cb3-3" tabindex="-1"></a> <span class="va">f</span><span class="op">=</span><span class="va">$(</span><span class="fu">basename</span> <span class="va">$FILE)</span></span>
<span id="cb3-4"><a href="#cb3-4" tabindex="-1"></a> <span class="va">SAMPLE</span><span class="op">=</span><span class="va">${f</span><span class="op">%</span>.fastq<span class="pp">*</span><span class="va">}</span></span>
<span id="cb3-5"><a href="#cb3-5" tabindex="-1"></a> <span class="bu">echo</span> <span class="va">$SAMPLE</span></span>
<span id="cb3-6"><a href="#cb3-6" tabindex="-1"></a><span class="cf">done</span></span></code></pre></div>
<p>Here I am using using a for loop to loop through each fastq files in
a specific directory. In each iteration of the loop, an <code>$f</code>
variable is created, which uses the <code>basename</code> function to
get the file name of the <code>$FILE</code> variable. Then
<code>$SAMPLE</code> is created by using <code>%</code> to remove the
<code>.fastq</code> extension and everything that follows, keeping only
the file name (minus the extension) and calling that
<code>$SAMPLE</code>. Then we can use the <code>$SAMPLE</code> variable
to substitute the file names, which come in handy for running these
scripts over multiple samples at one time. This concept should become
clearer as we move through the workflow. If you’d like more information
on string substitution (i.e., using <code>%</code> to remove parts of a
string), please see this helpful <a
href="https://tldp.org/LDP/abs/html/string-manipulation.html">link</a>.</p>
</div>
</div>
<div id="sample-pre-processing" class="section level1" number="2">
<h1><span class="header-section-number">2</span> Sample
pre-processing</h1>
<div id="demultiplex-your-samples" class="section level2" number="2.1">
<h2><span class="header-section-number">2.1</span> Demultiplex your
samples</h2>
<p>When preparing sequencing libraries, we typically multiplex our
samples. This means that during library preparation, we’ve attached
barcodes to our sequences that help us trace the sample that these
sequences came from. This allows us to pool multiple libraries together
in one sequencing run. After sequencing, the sequences are
<em>demultiplexed</em>, meaning the individual sequences are separated
out by sample into individual FASTQ files.</p>
<p>Typically your samples will be returned to you already demultiplexed.
However, if your samples are still pooled into one large FASTQ file, do
not panic! You can follow the <a
href="https://astrobiomike.github.io/amplicon/demultiplexing"><strong>demultiplexing
tutorial</strong></a> by <a
href="https://astrobiomike.github.io/research/">Dr. Mike Lee</a> which
utilizes the <a
href="https://github.com/najoshi/sabre"><code>sabre</code> tool</a>. Or,
you can use <code>bcl2fastq2</code> by Illumina (more information <a
href="https://support.illumina.com/content/dam/illumina-support/documents/documentation/software_documentation/bcl2fastq/bcl2fastq2-v2-20-software-guide-15051736-03.pdf">here</a>).</p>
</div>
<div id="sequence-quality-and-where-to-trim" class="section level2"
number="2.2">
<h2><span class="header-section-number">2.2</span> Sequence Quality and
Where to Trim</h2>
<div id="check-the-quality-of-your-sequences-with-fastqc"
class="section level3" number="2.2.1">
<h3><span class="header-section-number">2.2.1</span> Check the quality
of your sequences with <code>FastQC</code></h3>
<p>It’s always a good idea to check the quality of your sequences before
you start your analysis, regardless of the type of sequences they are
(metagenomes, RNA-seq data, etc). <code>FastQC</code> <span
class="citation">(Andrews, n.d.)</span> provides a comprehensive report
on the quality of your sequences and is helpful for the following:
identifying primers or adapters still attached to your sequences;
determining the quality of your reverse reads; etc. You can also use the
FastQC reports to determine if you should attempt to merge your forward
and reverse reads, or just proceed with only the forward reads.</p>
<div class="sourceCode" id="cb4"><pre
class="sourceCode bash"><code class="sourceCode bash"><span id="cb4-1"><a href="#cb4-1" tabindex="-1"></a><span class="co"># my 16S and ITS2 sequences are in separate directories, which I why I loop through them separately below</span></span>
<span id="cb4-2"><a href="#cb4-2" tabindex="-1"></a></span>
<span id="cb4-3"><a href="#cb4-3" tabindex="-1"></a><span class="co"># create a directory to store your FastQC results in</span></span>
<span id="cb4-4"><a href="#cb4-4" tabindex="-1"></a><span class="cf">if</span> <span class="kw">[[</span> <span class="ot">!</span> <span class="ot">-d</span> ./FastQC_Results <span class="kw">]];</span> <span class="cf">then</span></span>
<span id="cb4-5"><a href="#cb4-5" tabindex="-1"></a> <span class="fu">mkdir</span> FastQC_Results</span>
<span id="cb4-6"><a href="#cb4-6" tabindex="-1"></a><span class="cf">fi</span></span>
<span id="cb4-7"><a href="#cb4-7" tabindex="-1"></a></span>
<span id="cb4-8"><a href="#cb4-8" tabindex="-1"></a><span class="co"># create directory within results directory for 16S FastQC Results</span></span>
<span id="cb4-9"><a href="#cb4-9" tabindex="-1"></a><span class="cf">if</span> <span class="kw">[[</span> <span class="ot">!</span> <span class="ot">-d</span> ./FastQC_Results/16S_FastQC <span class="kw">]];</span> <span class="cf">then</span></span>
<span id="cb4-10"><a href="#cb4-10" tabindex="-1"></a> <span class="fu">mkdir</span> FastQC_Results/16S_FastQC</span>
<span id="cb4-11"><a href="#cb4-11" tabindex="-1"></a><span class="cf">fi</span></span>
<span id="cb4-12"><a href="#cb4-12" tabindex="-1"></a></span>
<span id="cb4-13"><a href="#cb4-13" tabindex="-1"></a><span class="co"># create directory within results directory for ITS2 (or ITS1) FastQC Results</span></span>
<span id="cb4-14"><a href="#cb4-14" tabindex="-1"></a><span class="cf">if</span> <span class="kw">[[</span> <span class="ot">!</span> <span class="ot">-d</span> ./FastQC_Results/ITS2_FastQC <span class="kw">]];</span> <span class="cf">then</span></span>
<span id="cb4-15"><a href="#cb4-15" tabindex="-1"></a> <span class="fu">mkdir</span> FastQC_Results/ITS2_FastQC</span>
<span id="cb4-16"><a href="#cb4-16" tabindex="-1"></a><span class="cf">fi</span></span>
<span id="cb4-17"><a href="#cb4-17" tabindex="-1"></a></span>
<span id="cb4-18"><a href="#cb4-18" tabindex="-1"></a><span class="co"># loop through each 16S fastq.gz file and run through FastQC</span></span>
<span id="cb4-19"><a href="#cb4-19" tabindex="-1"></a><span class="cf">for</span> FILE <span class="kw">in</span> 16S_Seqs/<span class="pp">*</span>.fastq.gz<span class="kw">;</span></span>
<span id="cb4-20"><a href="#cb4-20" tabindex="-1"></a><span class="cf">do</span></span>
<span id="cb4-21"><a href="#cb4-21" tabindex="-1"></a> <span class="co"># extract out just the sample name from the file name</span></span>
<span id="cb4-22"><a href="#cb4-22" tabindex="-1"></a> <span class="va">f</span><span class="op">=</span><span class="va">$(</span><span class="fu">basename</span> <span class="va">$FILE)</span></span>
<span id="cb4-23"><a href="#cb4-23" tabindex="-1"></a> <span class="va">SAMPLE</span><span class="op">=</span><span class="va">${f</span><span class="op">%</span>.fastq<span class="pp">*</span><span class="va">}</span> <span class="co">#string manipulation to drop .fastq and everything that comes after</span></span>
<span id="cb4-24"><a href="#cb4-24" tabindex="-1"></a> </span>
<span id="cb4-25"><a href="#cb4-25" tabindex="-1"></a> <span class="ex">fastqc</span> <span class="va">$FILE</span> <span class="at">--outdir</span><span class="op">=</span>./FastQC_Results/16S_FastQC</span>
<span id="cb4-26"><a href="#cb4-26" tabindex="-1"></a> </span>
<span id="cb4-27"><a href="#cb4-27" tabindex="-1"></a><span class="cf">done</span></span>
<span id="cb4-28"><a href="#cb4-28" tabindex="-1"></a></span>
<span id="cb4-29"><a href="#cb4-29" tabindex="-1"></a><span class="co"># loop through each ITS2 fastq.gz file and run through FastQC</span></span>
<span id="cb4-30"><a href="#cb4-30" tabindex="-1"></a><span class="cf">for</span> FILE <span class="kw">in</span> ITS2_Seqs/<span class="pp">*</span>.fastq.gz<span class="kw">;</span></span>
<span id="cb4-31"><a href="#cb4-31" tabindex="-1"></a><span class="cf">do</span></span>
<span id="cb4-32"><a href="#cb4-32" tabindex="-1"></a> <span class="va">f</span><span class="op">=</span><span class="va">$(</span><span class="fu">basename</span> <span class="va">$FILE)</span></span>
<span id="cb4-33"><a href="#cb4-33" tabindex="-1"></a> <span class="va">SAMPLE</span><span class="op">=</span><span class="va">${f</span><span class="op">%</span>.fastq<span class="pp">*</span><span class="va">}</span></span>
<span id="cb4-34"><a href="#cb4-34" tabindex="-1"></a> <span class="ex">fastqc</span> <span class="va">$FILE</span> <span class="at">--outdir</span><span class="op">=</span>./FastQC_Results/ITS2_FastQC</span>
<span id="cb4-35"><a href="#cb4-35" tabindex="-1"></a> </span>
<span id="cb4-36"><a href="#cb4-36" tabindex="-1"></a><span class="cf">done</span></span></code></pre></div>
<p>FastQC will return a report assessing the per base and per sequence
quality of your sequences, as well as the GC and N (i.e., unidentified
base) content across your sequences, the distribution of your sequence
lengths, and whether or not adapters are still attached to your
sequences. The second tab of the report details the per base sequence
quality across all of your sequences. The per base quality score
(<strong>Q score</strong>), also known as a <strong>Phred
score</strong>, is the estimated probability that the base call is
wrong. The following equation is used for calculating the Q score: <span
class="math display">\[
Q = -10log_{10}E
\]</span> Here, E is the estimated probability of the base call being
wrong. The higher the Q score, the smaller the probability of a base
call error. A quality score of 30 (Q30) means that the probability of an
incorrect base call is 1 in 1000, and that the base call accuracy (1 -
probability of incorrect base call) is 99.9%. For more information on
quality scores, please see this info from <a
href="https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/quality-scores.html">Illumina</a>.</p>
<p>Below is an example of the “per base sequence quality” portion of the
report. This portion of the report helps me to determine where I should
trim my sequences as I move forward with the analysis. This part of the
report can also give you a sense on whether there was an error in your
sequencing run. For example, if the average quality score (i.e., the
blue line in the report) across all of the bases dips below 30 for half
of the sequence length in all of my samples, that could indicate that
there was an error with the sequencing run itself.</p>
<center>
<img src="amplicon_workflow/fastqc_base_seq_qual_plot.png" />
</center>
<div align="center">
Figure 1: Per Base Quality Scores from FastQC Report
</div>
<p></br></p>
<p>Another useful piece of the FastQC report is the adapter content tab,
which is the very last tab in the report. This portion of the report
tells us the percentage of reads that have adapter sequences at specific
base positions along the reads. The following snapshot from a FastQC
report shows that the Small RNA 3’ adapter sequence is found in ~2% of
the sequences starting at around the 160th base. We can use this
information to then decide exactly which adapter sequences to cut from
our samples in the trimming step.</p>
<center>
<img src="amplicon_workflow/fastqc_adapter_content.png" />
</center>
<div align="center">
Figure 2: Frequency of Adapter Sequences from FastQC Report
</div>
<p></br></p>
<p>For more on how to interpret FastQC reports, please check out this
helpful <a
href="https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/">FastQC
tutorial</a> from Michigan State University.</p>
</div>
<div id="expected-error-filtering-of-sequences-with-eestats"
class="section level3" number="2.2.2">
<h3><span class="header-section-number">2.2.2</span> Expected Error
Filtering of Sequences with <code>eestats</code></h3>
<p>The <code>eestats2</code> program <span class="citation">(Edgar and
Flyvbjerg 2015)</span> creates a report detailing the percentage of
reads that will pass through an expected error filter when the reads are
at different lengths. Specifically the program will determine how many
reads at each specific length (i.e., 50 bp, 100 bp, 150 bp, etc.) have
good enough quality to surpass the three expected error thresholds:
0.5%, 1%, and 2%.</p>
<p>Before you run the <code>eestats</code> program, be sure to
<em>gunzip</em> (aka decompress) your fastq.gz files! You can do that by
running the following command:
<code>gunzip /path/to/*.fastq.gz</code>.</p>
<div class="sourceCode" id="cb5"><pre
class="sourceCode bash"><code class="sourceCode bash"><span id="cb5-1"><a href="#cb5-1" tabindex="-1"></a><span class="co"># Create directory to store eestats results</span></span>
<span id="cb5-2"><a href="#cb5-2" tabindex="-1"></a><span class="cf">if</span> <span class="kw">[[</span> <span class="ot">!</span> <span class="ot">-d</span> ./EEstats_Results <span class="kw">]];</span> <span class="cf">then</span></span>
<span id="cb5-3"><a href="#cb5-3" tabindex="-1"></a> <span class="fu">mkdir</span> EEstats_Results</span>
<span id="cb5-4"><a href="#cb5-4" tabindex="-1"></a><span class="cf">fi</span></span>
<span id="cb5-5"><a href="#cb5-5" tabindex="-1"></a></span>
<span id="cb5-6"><a href="#cb5-6" tabindex="-1"></a><span class="co"># Create specific directory within eestats results for 16S eestats results</span></span>
<span id="cb5-7"><a href="#cb5-7" tabindex="-1"></a><span class="cf">if</span> <span class="kw">[[</span> <span class="ot">!</span> <span class="ot">-d</span> ./EEstats_Results/16S_EEstats <span class="kw">]];</span> <span class="cf">then</span></span>
<span id="cb5-8"><a href="#cb5-8" tabindex="-1"></a> <span class="fu">mkdir</span> EEstats_Results/16S_EEstats</span>
<span id="cb5-9"><a href="#cb5-9" tabindex="-1"></a><span class="cf">fi</span></span>
<span id="cb5-10"><a href="#cb5-10" tabindex="-1"></a></span>
<span id="cb5-11"><a href="#cb5-11" tabindex="-1"></a><span class="co"># Create specific directory within eestats results for ITS2 (or ITS1) eestats results</span></span>
<span id="cb5-12"><a href="#cb5-12" tabindex="-1"></a><span class="cf">if</span> <span class="kw">[[</span> <span class="ot">!</span> <span class="ot">-d</span> ./EEstats_Results/ITS2_EEstats <span class="kw">]];</span> <span class="cf">then</span></span>
<span id="cb5-13"><a href="#cb5-13" tabindex="-1"></a> <span class="fu">mkdir</span> EEstats_Results/ITS2_EEstats</span>
<span id="cb5-14"><a href="#cb5-14" tabindex="-1"></a><span class="cf">fi</span></span>
<span id="cb5-15"><a href="#cb5-15" tabindex="-1"></a></span>
<span id="cb5-16"><a href="#cb5-16" tabindex="-1"></a><span class="co"># Run eestats2 in loop with 16S fastq files</span></span>
<span id="cb5-17"><a href="#cb5-17" tabindex="-1"></a><span class="cf">for</span> FILE <span class="kw">in</span> 16S_Seqs/<span class="pp">*</span>.fastq<span class="kw">;</span></span>
<span id="cb5-18"><a href="#cb5-18" tabindex="-1"></a><span class="cf">do</span></span>
<span id="cb5-19"><a href="#cb5-19" tabindex="-1"></a> <span class="va">f</span><span class="op">=</span><span class="va">$(</span><span class="fu">basename</span> <span class="va">$FILE)</span></span>
<span id="cb5-20"><a href="#cb5-20" tabindex="-1"></a> <span class="va">SAMPLE</span><span class="op">=</span><span class="va">${f</span><span class="op">%</span>.fastq<span class="pp">*</span><span class="va">}</span></span>
<span id="cb5-21"><a href="#cb5-21" tabindex="-1"></a> </span>
<span id="cb5-22"><a href="#cb5-22" tabindex="-1"></a> <span class="ex">usearch</span> <span class="at">-fastq_eestats2</span> <span class="va">$FILE</span> <span class="at">-output</span> <span class="va">${SAMPLE}</span>_eestats2.txt</span>
<span id="cb5-23"><a href="#cb5-23" tabindex="-1"></a> <span class="co"># move results to EEstats_Results directory</span></span>
<span id="cb5-24"><a href="#cb5-24" tabindex="-1"></a> <span class="fu">mv</span> <span class="va">${SAMPLE}</span>_eestats2.txt EEstats_Results/16S_EEstats</span>
<span id="cb5-25"><a href="#cb5-25" tabindex="-1"></a><span class="cf">done</span></span>
<span id="cb5-26"><a href="#cb5-26" tabindex="-1"></a></span>
<span id="cb5-27"><a href="#cb5-27" tabindex="-1"></a><span class="co"># Run eestats2 in loop with ITS2 fastq files</span></span>
<span id="cb5-28"><a href="#cb5-28" tabindex="-1"></a><span class="cf">for</span> FILE <span class="kw">in</span> ITS2_Seqs/<span class="pp">*</span>.fastq<span class="kw">;</span></span>
<span id="cb5-29"><a href="#cb5-29" tabindex="-1"></a><span class="cf">do</span></span>
<span id="cb5-30"><a href="#cb5-30" tabindex="-1"></a> <span class="va">f</span><span class="op">=</span><span class="va">$(</span><span class="fu">basename</span> <span class="va">$FILE)</span></span>
<span id="cb5-31"><a href="#cb5-31" tabindex="-1"></a> <span class="va">SAMPLE</span><span class="op">=</span><span class="va">${f</span><span class="op">%</span>.fastq<span class="pp">*</span><span class="va">}</span></span>
<span id="cb5-32"><a href="#cb5-32" tabindex="-1"></a> </span>
<span id="cb5-33"><a href="#cb5-33" tabindex="-1"></a> <span class="ex">usearch</span> <span class="at">-fastq_eestats2</span> <span class="va">$FILE</span> <span class="at">-output</span> <span class="va">${SAMPLE}</span>_eestats2.txt</span>
<span id="cb5-34"><a href="#cb5-34" tabindex="-1"></a> <span class="co"># move results to EEstats_Results directory</span></span>
<span id="cb5-35"><a href="#cb5-35" tabindex="-1"></a> <span class="fu">mv</span> <span class="va">${SAMPLE}</span>_eestats2.txt EEstats_Results/ITS2_EEstats</span>
<span id="cb5-36"><a href="#cb5-36" tabindex="-1"></a> </span>
<span id="cb5-37"><a href="#cb5-37" tabindex="-1"></a><span class="cf">done</span></span></code></pre></div>
The following image shows a sample eestats2 report. For this specific
sample, the ideal trimming length of the sequences would be around 250
basepairs long. This is because when considering the expected error
threshold of 1%, more than 80.8% of the sequence pass this threshold.
Though the 300 bp length also allows high rentention of reads, we know
that the per base quality of this sample drops as we approach the 300 bp
position. Thus, it seems like trimming these sequences to 250 bps would
be ideal moving forward.
<center>
<img src="amplicon_workflow/eestats_example_report.png" />
</center>
<div align="center">
Figure 3: eestats Report Example
</div>
<p></br> For more information on the <code>eestats2</code> programs by
<a href="https://www.drive5.com/usearch/">USEARCH</a>, please read the
documentation <a
href="https://www.drive5.com/usearch/manual/cmd_fastq_eestats2.html">here</a>.</p>
</div>
</div>
<div id="decontaminate-trim-sequences-and-cut-adapters-primers-etc"
class="section level2" number="2.3">
<h2><span class="header-section-number">2.3</span> Decontaminate &
Trim sequences and cut adapters, primers, etc</h2>
<p>There are plenty of programs out there that can be used for trimming,
and the following three are the most popular for amplicon analyses: <a
href="https://cutadapt.readthedocs.io/en/stable/"><code>cutadapt</code></a>,
<a
href="http://www.usadellab.org/cms/?page=trimmomatic"><code>trimmomatic</code></a>,
and <a
href="https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/"><code>bbduk</code></a>
<span class="citation">(Bushnell, n.d.)</span>. All of these programs
are reputable, but I personally like the <code>bbduk</code>, and will
use this tool for trimming and adapter removal.</p>
<p>Before I trim my sequences, I refer to the FastQC reports to find out
exactly which adapters I should remove from my sequences. For example,
when looking at the adapter content portion of the FastQC report above,
I can see that the Nextera Transposase Sequence is still present in that
particular sample. Thankfully Illumina shares their adapter sequences on
their <a
href="https://support-docs.illumina.com/SHARE/AdapterSeq/Content/SHARE/AdapterSeq/AdapterSequencesIntro.htm">website</a>,
allowing us to easily find common adapters in sequences, like the <a
href="https://support-docs.illumina.com/SHARE/AdapterSeq/Content/SHARE/AdapterSeq/Nextera/SequencesNXTILMPrepAndPCR.htm">Nextera
Transposase Sequence</a> for example.</p>
<p>I also know that with the sequences I am analyzing, the PCR primers
are still attached (FastQC may identify these primers in your report’s
Overrepresented Sequences tab, but not necessarily the origin of these
sequences). I can either remove these primer sequences using the actual
sequence using the (<code>literal=</code>) flag, or I can trim from the
right (<code>ftr=</code>) and/or the left (<code>ftl=</code>) of the
sequences if I know exactly how long the primer sequences were.</p>
<p>It is recommended to check the overrepresented sequences from the
FastQC report to see if there are contaminating sequences present in
your data. I suggest taking the most frequent overrepresented sequence
and running it through <a
href="https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&BLAST_SPEC=GeoBlast&PAGE_TYPE=BlastSearch">BLASTn</a>
if the source of this overrepresented sequence says “No Hit” (meaning
that FastQC cannot attribute this sequence to its list of adapter
sequences). If the sequence comes up as a contaminant (i.e., a different
gene than the amplicon you’re looking at) or adapter/primer of some
kind, you can add this to the <code>literal=</code> flag in
<code>bbduk</code> to remove the contaminant.</p>
<p>In addition to removing adapter and primer sequences using the the
<code>literal=</code> flag, I also include a reference file provided by
<code>bbduk</code> (referenced in the <code>ref=</code> flag) that
contains all of the Illumina TruSeq adapters. The sequences in the
reference file, in addition to the given adapters and primers, will be
removed from the sequences. Additionally, bbduk decontaminates sequences
by matching kmers (aka reads of a specific length k) to reference
genomes. If the kmers match the reference genome, then the kmer is kept.
The longer the kmer, the higher the specificity - but there is a limit
to this, seeing as the likelihood that long kmers are shared across
multiple reads is unlikely. If your kmer length is too short, you could
be keeping sequences that are adapters, primers, etc by accident (this
is why using the literal sequence flag and the adapter reference file is
helpful in <code>bbduk</code>).</p>
<p>Below the shell script is a description of all of the flags used by
<code>bbduk</code> and exactly what they mean. For more information on
the <code>bbduk</code> flags, please see the <code>bbduk</code> <a
href="https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/">documentation</a>.</p>
<div class="sourceCode" id="cb6"><pre
class="sourceCode bash"><code class="sourceCode bash"><span id="cb6-1"><a href="#cb6-1" tabindex="-1"></a><span class="va">path</span><span class="op">=</span>/path/to/sequences/here <span class="co"># replace with the path to your files</span></span>
<span id="cb6-2"><a href="#cb6-2" tabindex="-1"></a><span class="co"># my sequence files are in $path/16S_Seqs/ -- see for loop below</span></span>
<span id="cb6-3"><a href="#cb6-3" tabindex="-1"></a></span>
<span id="cb6-4"><a href="#cb6-4" tabindex="-1"></a><span class="cf">if</span> <span class="kw">[[</span> <span class="ot">!</span> <span class="ot">-d</span> ./Trimmed_Seqs <span class="kw">]];</span> <span class="cf">then</span> <span class="co"># creating directory to store trimmed sequences in</span></span>
<span id="cb6-5"><a href="#cb6-5" tabindex="-1"></a> <span class="fu">mkdir</span> Trimmed_Seqs</span>
<span id="cb6-6"><a href="#cb6-6" tabindex="-1"></a><span class="cf">fi</span></span>
<span id="cb6-7"><a href="#cb6-7" tabindex="-1"></a></span>
<span id="cb6-8"><a href="#cb6-8" tabindex="-1"></a><span class="cf">if</span> <span class="kw">[[</span> <span class="ot">!</span> <span class="ot">-d</span> ./Trimmed_Seqs/16S_Trimmed <span class="kw">]];</span> <span class="cf">then</span> <span class="co"># creating directory for specifically trimmed 16S sequences</span></span>
<span id="cb6-9"><a href="#cb6-9" tabindex="-1"></a> <span class="fu">mkdir</span> Trimmed_Seqs/16S_Trimmed</span>
<span id="cb6-10"><a href="#cb6-10" tabindex="-1"></a><span class="cf">fi</span></span>
<span id="cb6-11"><a href="#cb6-11" tabindex="-1"></a></span>
<span id="cb6-12"><a href="#cb6-12" tabindex="-1"></a><span class="cf">for</span> i <span class="kw">in</span> <span class="va">${path}</span>/16S_Seqs/<span class="pp">*</span>_R1.fastq<span class="kw">;</span></span>
<span id="cb6-13"><a href="#cb6-13" tabindex="-1"></a><span class="cf">do</span></span>
<span id="cb6-14"><a href="#cb6-14" tabindex="-1"></a> <span class="va">f</span><span class="op">=</span><span class="va">$(</span><span class="fu">basename</span> <span class="va">$i)</span></span>
<span id="cb6-15"><a href="#cb6-15" tabindex="-1"></a> <span class="va">SAMPLE</span><span class="op">=</span><span class="va">${f</span><span class="op">%</span>_R<span class="pp">*</span><span class="va">}</span></span>
<span id="cb6-16"><a href="#cb6-16" tabindex="-1"></a> </span>
<span id="cb6-17"><a href="#cb6-17" tabindex="-1"></a> <span class="ex">bbduk.sh</span> <span class="at">-Xmx10g</span> in1=<span class="va">${path}</span>/16S_Seqs/<span class="va">${SAMPLE}</span>_R1.fastq in2=<span class="va">${path}</span>/16S_Seqs/<span class="va">${SAMPLE}</span>_R2.fastq out1=<span class="va">${path}</span>/Trimmed_Seqs/16S_Trimmed/<span class="va">${SAMPLE}</span>_R1_clean.fastq out2=<span class="va">${path}</span>/Trimmed_Seqs/16S_Trimmed/<span class="va">${SAMPLE}</span>_R2_clean.fastq literal=TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG,GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG ref=/bigdata/aronsonlab/shared/bbmap_resources/adapters.fa rcomp=t ktrim=r k=23 maq=10 minlength=200 mink=13 hdist=1 tpe tbo</span>
<span id="cb6-18"><a href="#cb6-18" tabindex="-1"></a> </span>
<span id="cb6-19"><a href="#cb6-19" tabindex="-1"></a><span class="cf">done</span></span>
<span id="cb6-20"><a href="#cb6-20" tabindex="-1"></a></span>
<span id="cb6-21"><a href="#cb6-21" tabindex="-1"></a><span class="co"># ref ---> file provided by bbduk that holds collection of Illumina TruSeq adapters</span></span>
<span id="cb6-22"><a href="#cb6-22" tabindex="-1"></a><span class="co"># literal=(sequence here) ---> literal adapter sequences to remove; "N" represents any base -- in this case, they are indexes within the adapters</span></span>
<span id="cb6-23"><a href="#cb6-23" tabindex="-1"></a><span class="co"># rcomp=t ---> Rcomp looks for kmers and their reverse-complements, rather than just forward kmer, if set to true</span></span>
<span id="cb6-24"><a href="#cb6-24" tabindex="-1"></a><span class="co"># ktrim=r ---> “ktrim=r” is for right-trimming (3′ adapters)</span></span>
<span id="cb6-25"><a href="#cb6-25" tabindex="-1"></a><span class="co"># k=23 ---> look for kmer that is 23 bp long</span></span>
<span id="cb6-26"><a href="#cb6-26" tabindex="-1"></a><span class="co"># mink=11 ---> in addition to kmers of x length, look for shorter kmers with lengths 23 to 11 (in this case)</span></span>
<span id="cb6-27"><a href="#cb6-27" tabindex="-1"></a><span class="co"># maq=10 ---> This will discard reads with average quality below 10</span></span>
<span id="cb6-28"><a href="#cb6-28" tabindex="-1"></a><span class="co"># hdist=1 ---> hamming distance of 1</span></span>
<span id="cb6-29"><a href="#cb6-29" tabindex="-1"></a><span class="co"># mlf=50 ---> (minlengthfraction=50) would discard reads under 50% of their original length after trimming</span></span>
<span id="cb6-30"><a href="#cb6-30" tabindex="-1"></a><span class="co"># trimq=10 ---> quality-trim to Q10 using the Phred algorithm, which is more accurate than naive trimming.</span></span>
<span id="cb6-31"><a href="#cb6-31" tabindex="-1"></a><span class="co"># qtrim=r ---> means it will quality trim the right side only</span></span>
<span id="cb6-32"><a href="#cb6-32" tabindex="-1"></a><span class="co"># tpe ---> which specifies to trim both reads to the same length</span></span>
<span id="cb6-33"><a href="#cb6-33" tabindex="-1"></a><span class="co"># tbo ---> which specifies to also trim adapters based on pair overlap detection using BBMerge (which does not require known adapter sequences)</span></span>
<span id="cb6-34"><a href="#cb6-34" tabindex="-1"></a><span class="co"># mm ----> Maskmiddle ignores the middle base of a kmer, can turn off with mm=f</span></span></code></pre></div>
<p>To be extra cautious and ensure that the trimming step was
successful, I will run the trimmed sequences through FastQC and compare
the reports. If the per base and per sequence qualities have improved
and/or the adapters are absent, then I will move forward with the
workflow. However, if I am still not happy with the quality of the
trimmed reads, I will then run the <em>trimmed</em> reads through
<code>bbduk</code> in hopes of removing persistent, unwanted sequences.
I will also check the overrepresented sequences and their frequencies
again, and run the most frequent overepresented sequence(s) in
BLASTn.</p>
</div>
</div>
<div id="asv-assignment-with-dada2" class="section level1" number="3">
<h1><span class="header-section-number">3</span> ASV Assignment with
<code>DADA2</code></h1>
<p>All of the steps in this portion of the workflow (excluding the tmux
and nvim code chunks) have been adapted from Dr. Callahan’s
<code>DADA2</code> <a