add clumpify-based dedup #970

Open · tomkinsc wants to merge 56 commits into master from ct-add-clumpify

Commits (56)
a5d58b5
add bbmap.BBMapTool().dedup_clumpify()
tomkinsc Jul 2, 2019
595764e
pass JVMmemory; add read_utils.rmdup_clumpify_bam; dedup_bam WDL task
tomkinsc Jul 2, 2019
09901d3
switch from mvicuna to clumpify-based dedup in taxon_filter.py deplete
tomkinsc Jul 2, 2019
df208ea
replace unicode apostrophe
tomkinsc Jul 2, 2019
98ac4fc
reduce clumpify max_mismatches 5->3
tomkinsc Jul 2, 2019
784877a
dump dx-toolkit version and update URL to reflect new source
tomkinsc Jul 2, 2019
232f9cd
dedup prior to metagenomics classification in WDL workflows
tomkinsc Jul 2, 2019
c01bb5b
add missing import
tomkinsc Jul 3, 2019
e25ef52
rename read_utils.wdl -> downsample.wdl, dedup.wdl
tomkinsc Jul 3, 2019
6ba96d4
rename dedup_bam wdl workflow to "dedup"
tomkinsc Jul 3, 2019
e8a4081
increase dx instance size for dedup and memory spec
tomkinsc Jul 3, 2019
8280063
correct argparse parser attachment for rmdup_clumpify_bam
tomkinsc Jul 3, 2019
b86b1c9
wrap WDL variable in dedup command block for var interpolation
tomkinsc Jul 3, 2019
8afe18f
avoid collision
tomkinsc Jul 4, 2019
7d2f45a
Merge branch 'master' into ct-add-clumpify
tomkinsc Jul 10, 2019
f48038e
Merge branch 'master' into ct-add-clumpify
tomkinsc Jul 10, 2019
c78f246
Merge branch 'master' into ct-add-clumpify
tomkinsc Jul 11, 2019
72fb4cd
add sambamba since bbtools looks for it?
tomkinsc Jul 11, 2019
d97f773
Merge branch 'master' into ct-add-clumpify
tomkinsc Jul 13, 2019
a685a8a
remove sambamba
tomkinsc Jul 13, 2019
a2ce0f1
specify containment=t for bbmap clumpify
tomkinsc Aug 1, 2019
6f26717
Merge branch 'master' into ct-add-clumpify
tomkinsc Aug 26, 2019
b73950e
Merge branch 'master' into ct-add-clumpify
tomkinsc Sep 11, 2019
a3010ea
Merge branch 'master' into ct-add-clumpify
tomkinsc Sep 11, 2019
8199b12
enforce containment=False; more tolerant bbmap unit test
tomkinsc Sep 11, 2019
6bb3f6b
update miniconda ssl certs
tomkinsc Sep 12, 2019
97eff11
increase debug info emitted by build-conda.sh
tomkinsc Sep 20, 2019
f6f9b85
Merge branch 'master' into ct-add-clumpify
tomkinsc Sep 20, 2019
1674c44
Merge branch 'master' into ct-add-clumpify
tomkinsc Oct 3, 2019
4ca4693
Merge branch 'master' into ct-add-clumpify
tomkinsc Nov 7, 2019
8f8aaae
bump bbmap to 38.71; set containment=True for clumpify
tomkinsc Nov 7, 2019
218a12b
Merge branch 'ct-add-clumpify' of ssh://github.com/broadinstitute/vir…
tomkinsc Nov 7, 2019
fa5e01e
update stage number
tomkinsc Nov 7, 2019
a0735c7
set bbmap jvmMemDefault='2g'; 1g for clumpify test
tomkinsc Nov 8, 2019
663deba
no longer skip demux_metag from validation/compilation
tomkinsc Nov 20, 2019
5a7ed3b
demux_plus/demux_metag: merge linear parts of scatters, run spike-in …
tomkinsc Nov 20, 2019
2be4a85
add DNAnexus defaults for demux_metag, set inputs in demux_metag
tomkinsc Nov 20, 2019
1d691b2
rmdup_clumpify_bam: preserve sortorder value of input bam
tomkinsc Nov 20, 2019
1ba7415
bump bbmap version 38.71 -> 38.73
tomkinsc Nov 20, 2019
bb589a1
fix bug in conda command quiet calling
tomkinsc Nov 20, 2019
472703b
maintain RG info in clumpify dedup; move processing to bbmap.py
tomkinsc Nov 20, 2019
ca726d0
demux_plus/demux_metag: update dx defaults and pass explicitly in wor…
tomkinsc Nov 20, 2019
3f9f188
remove redundant defaults from dx wdl test inputs
tomkinsc Nov 20, 2019
995cf0d
move krakenuniq back outside scatter
tomkinsc Nov 20, 2019
d54eff3
respecify kaiju deps
tomkinsc Nov 20, 2019
21a6ac4
WDL dedup_bam: report read count before & after dedup
tomkinsc Nov 20, 2019
13f5172
switch to clumpify for downsample dedup
tomkinsc Nov 21, 2019
c1d18be
change to clumpify for pre-depletion dedup
tomkinsc Nov 21, 2019
7c45da6
--JVMmemory=1g for TestDepleteHuman
tomkinsc Nov 21, 2019
d91eca5
remove rmdup from depletion call
tomkinsc Nov 21, 2019
12f73cb
expand arguments exposed for clumpify dedup
tomkinsc Nov 21, 2019
16a2b50
update expected depletion output now that we're not running dedup on it
tomkinsc Nov 21, 2019
f1f9a40
scatter/gather clumpify dedup across libraries
tomkinsc Dec 2, 2019
49bffcb
Merge branch 'master' into ct-add-clumpify
tomkinsc Dec 3, 2019
362d0f3
pass through single-end IDs for bbmap dedup
tomkinsc Mar 27, 2020
f816f6b
Merge branch 'master' into ct-add-clumpify
tomkinsc Mar 27, 2020
2 changes: 1 addition & 1 deletion metagenomics.py
@@ -962,7 +962,7 @@ def fasta_library_accessions(library):
for seqr in SeqIO.parse(filepath, 'fasta'):
name = seqr.name
# Search for accession
mo = re.search('([A-Z]+_?\d+\.\d+)', name)
mo = re.search(r'([A-Z]+_?\d+\.\d+)', name)
if mo:
accession = mo.group(1)
library_accessions.add(accession)
27 changes: 27 additions & 0 deletions pipes/WDL/dx-defaults-demux_metag.json
@@ -0,0 +1,27 @@
{
"demux_metag.spikein_db":
"dx://file-FZY2v7Q0xf5VBy5FFY3z5fz7",

"demux_metag.bwaDbs": [
"dx://file-F9k7Bx00Z3ybJjvY3ZVj7Z9P"
],
"demux_metag.blastDbs": [
"dx://file-F8B3B6Q09y3bZg3j1FqK7bJ9",
"dx://file-F8BjgXj09y3gkfZGPPQZbZkK",
"dx://file-F8B3Pp809y3jBpXq7xjxbq94",
"dx://file-F8B3B6809y3kK1JP5X8Pg361"
],

"demux_metag.trim_clip_db":
"dx://file-BXF0vYQ0QyBF509G9J12g927",

"demux_metag.kraken.krakenuniq_db_tar_lz4":
"dx://file-FVYQqP006zFF064QBGf022X1",
"demux_metag.krona_taxonomy_db_tgz":
"dx://file-F4z0fgj07FZ8jg8yP7yz0Qzb",

"demux_metag.kaiju_db_lz4":
"dx://file-FVYQyvQ06zF55bFGBGYJ2XxX",
"demux_metag.ncbi_taxonomy_db_tgz":
"dx://file-F8KgJK009y3Qgy3FF1791Vgq"
}
13 changes: 12 additions & 1 deletion pipes/WDL/workflows/classify_kaiju.wdl
@@ -1,5 +1,16 @@
import "tasks_metagenomics.wdl" as metagenomics
import "tasks_read_utils.wdl" as reads

workflow classify_kaiju {
call metagenomics.kaiju
Array[File] unclassified_bams
scatter(reads_bam in unclassified_bams) {
call reads.dedup_bam as dedup {
input:
in_bam = reads_bam
}
}
call metagenomics.kaiju {
input:
reads_unmapped_bam = dedup.dedup_bam
}
}
13 changes: 12 additions & 1 deletion pipes/WDL/workflows/classify_krakenuniq.wdl
@@ -1,8 +1,19 @@
import "tasks_metagenomics.wdl" as metagenomics
import "tasks_reports.wdl" as reports
import "tasks_read_utils.wdl" as reads

workflow classify_krakenuniq {
call metagenomics.krakenuniq
Array[File] unclassified_bams
scatter(reads_bam in unclassified_bams) {
call reads.dedup_bam as dedup {
input:
in_bam = reads_bam
}
}
call metagenomics.krakenuniq {
input:
reads_unmapped_bam = dedup.dedup_bam
}

call reports.aggregate_metagenomics_reports as metag_summary_report {
input:
5 changes: 5 additions & 0 deletions pipes/WDL/workflows/dedup.wdl
@@ -0,0 +1,5 @@
import "tasks_read_utils.wdl" as reads

workflow dedup {
call reads.dedup_bam
}
64 changes: 51 additions & 13 deletions pipes/WDL/workflows/demux_metag.wdl
@@ -1,44 +1,82 @@
#DX_SKIP_WORKFLOW

import "tasks_demux.wdl" as demux
import "tasks_metagenomics.wdl" as metagenomics
import "tasks_taxon_filter.wdl" as taxon_filter
Reviewer comment (member): If you think this is ready and want to try it out, shouldn't you remove #DX_SKIP_WORKFLOW?

import "tasks_assembly.wdl" as assembly
import "tasks_reports.wdl" as reports
import "tasks_read_utils.wdl" as reads

workflow demux_metag {
call demux.illumina_demux as illumina_demux

File spikein_db
File trim_clip_db
Array[File]? bmtaggerDbs # .tar.gz, .tgz, .tar.bz2, .tar.lz4, .fasta, or .fasta.gz
Array[File]? blastDbs # .tar.gz, .tgz, .tar.bz2, .tar.lz4, .fasta, or .fasta.gz
Array[File]? bwaDbs
File krona_taxonomy_db_tgz
File kaiju_db_lz4
File ncbi_taxonomy_db_tgz

scatter(raw_reads in illumina_demux.raw_reads_unaligned_bams) {
# de-duplicate raw reads
call reads.dedup_bam as dedup {
input:
in_bam = raw_reads
}

# count spike-ins in the sample
# NB: the spike-in report is created from raw reads
# that have NOT been de-duplicated
call reports.spikein_report as spikein {
input:
reads_bam = raw_reads
reads_bam = raw_reads,
spikein_db = spikein_db
}

# deplete human/host genomic reads
call taxon_filter.deplete_taxa as deplete {
input:
raw_reads_unmapped_bam = raw_reads
raw_reads_unmapped_bam = dedup.dedup_bam,
bmtaggerDbs = bmtaggerDbs,
blastDbs = blastDbs,
bwaDbs = bwaDbs
}

# create de novo contigs from depleted reads via SPAdes
call assembly.assemble as spades {
input:
assembler = "spades",
reads_unmapped_bam = deplete.cleaned_bam
reads_unmapped_bam = deplete.cleaned_bam,
trim_clip_db = trim_clip_db,
always_succeed = true
}

# classify de-duplicated reads to taxa via kaiju
call metagenomics.kaiju as kaiju {
input:
reads_unmapped_bam = dedup.dedup_bam,
krona_taxonomy_db_tgz = krona_taxonomy_db_tgz,
kaiju_db_lz4 = kaiju_db_lz4,
ncbi_taxonomy_db_tgz = ncbi_taxonomy_db_tgz
}
}

# classify de-duplicated reads to taxa via krakenuniq
call metagenomics.krakenuniq as kraken {
input:
reads_unmapped_bam = illumina_demux.raw_reads_unaligned_bams,
}
call reports.aggregate_metagenomics_reports as metag_summary_report {
input:
kraken_summary_reports = kraken.krakenuniq_summary_reports
reads_unmapped_bam = dedup.dedup_bam,
krona_taxonomy_db_tgz = krona_taxonomy_db_tgz
}

# summarize spike-in reports from all samples
call reports.spikein_summary as spike_summary {
input:
spikein_count_txt = spikein.report
}
call metagenomics.kaiju as kaiju {
input:
reads_unmapped_bam = illumina_demux.raw_reads_unaligned_bams,
# summarize kraken reports from all samples
call reports.aggregate_metagenomics_reports as metag_summary_report {
input:
kraken_summary_reports = kraken.krakenuniq_summary_reports
}
# TODO: summarize kaiju reports from all samples
}
23 changes: 21 additions & 2 deletions pipes/WDL/workflows/demux_plus.wdl
@@ -3,6 +3,7 @@ import "tasks_metagenomics.wdl" as metagenomics
import "tasks_taxon_filter.wdl" as taxon_filter
import "tasks_assembly.wdl" as assembly
import "tasks_reports.wdl" as reports
import "tasks_read_utils.wdl" as reads

workflow demux_plus {

@@ -13,38 +14,56 @@ workflow demux_plus {
Array[File]? bmtaggerDbs # .tar.gz, .tgz, .tar.bz2, .tar.lz4, .fasta, or .fasta.gz
Array[File]? blastDbs # .tar.gz, .tgz, .tar.bz2, .tar.lz4, .fasta, or .fasta.gz
Array[File]? bwaDbs

scatter(raw_reads in illumina_demux.raw_reads_unaligned_bams) {
# de-duplicate raw reads
call reads.dedup_bam as dedup {
input:
in_bam = raw_reads
}

# count spike-ins in the sample
# NB: the spike-in report is created from raw reads
# that have NOT been de-duplicated
call reports.spikein_report as spikein {
input:
reads_bam = raw_reads,
spikein_db = spikein_db
}

# deplete human/host genomic reads
call taxon_filter.deplete_taxa as deplete {
input:
raw_reads_unmapped_bam = raw_reads,
raw_reads_unmapped_bam = dedup.dedup_bam,
bmtaggerDbs = bmtaggerDbs,
blastDbs = blastDbs,
bwaDbs = bwaDbs
}

# create de novo contigs from depleted reads via SPAdes
call assembly.assemble as spades {
input:
assembler = "spades",
reads_unmapped_bam = deplete.cleaned_bam,
trim_clip_db = trim_clip_db,
always_succeed = true
}

}

# classify de-duplicated reads to taxa via krakenuniq
call metagenomics.krakenuniq as krakenuniq {
input:
reads_unmapped_bam = illumina_demux.raw_reads_unaligned_bams
reads_unmapped_bam = dedup.dedup_bam
}

# summarize spike-in reports from all samples
call reports.spikein_summary as spike_summary {
input:
spikein_count_txt = spikein.report
}

# summarize kraken reports from all samples
call reports.aggregate_metagenomics_reports as metag_summary_report {
input:
kraken_summary_reports = krakenuniq.krakenuniq_summary_reports
File renamed without changes.
33 changes: 33 additions & 0 deletions pipes/WDL/workflows/tasks/tasks_read_utils.wdl
@@ -37,4 +37,37 @@ task downsample_bams {
}
}

task dedup_bam {
File in_bam
Int? max_mismatches=3

String sample_name = basename(in_bam, ".bam")

command {
read_utils.py rmdup_clumpify_bam \
${in_bam} \
${sample_name}.dedup.bam \
${'--maxMismatches=' + max_mismatches} \
--JVMmemory "8g"

samtools view -c ${in_bam} | tee dedup_read_count_pre
samtools view -c ${sample_name}.dedup.bam | tee dedup_read_count_post

reports.py fastqc ${sample_name}.dedup.bam ${sample_name}.deduped_fastqc.html
}

output {
File dedup_bam = "${sample_name}.dedup.bam"
File dedup_only_reads_fastqc = "${sample_name}.deduped_fastqc.html"
Int dedup_read_count_pre = read_int("dedup_read_count_pre")
Int dedup_read_count_post = read_int("dedup_read_count_post")
String viralngs_version = "viral-ngs_version_unknown"
}
runtime {
docker: "quay.io/broadinstitute/viral-ngs"
memory: "52 GB"
cpu: 8
dx_instance_type: "mem1_ssd1_x32"
preemptible: 0
}
}
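
The dedup_bam task above wraps the new read_utils.py rmdup_clumpify_bam command and reports read counts with samtools. For quick local testing outside of a WDL runner, a minimal Python sketch of the same steps is shown below; it assumes read_utils.py and samtools are on PATH (as in the viral-ngs container), and the dedup_and_count helper is a name invented here for illustration.

import subprocess

def dedup_and_count(in_bam, out_bam, max_mismatches=3, jvm_mem="8g"):
    # de-duplicate with the clumpify-based command added in this PR
    subprocess.check_call([
        "read_utils.py", "rmdup_clumpify_bam",
        in_bam, out_bam,
        "--maxMismatches={}".format(max_mismatches),
        "--JVMmemory", jvm_mem,
    ])
    # count reads before and after, mirroring the task's samtools view -c calls
    n_pre = int(subprocess.check_output(["samtools", "view", "-c", in_bam]))
    n_post = int(subprocess.check_output(["samtools", "view", "-c", out_bam]))
    return n_pre, n_post
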
1 change: 0 additions & 1 deletion pipes/WDL/workflows/tasks/tasks_taxon_filter.wdl
@@ -46,7 +46,6 @@ task deplete_taxa {
tmpfile.raw.bam \
tmpfile.bwa.bam \
tmpfile.bmtagger_depleted.bam \
tmpfile.rmdup.bam \
${bam_basename}.cleaned.bam \
$DBS_BMTAGGER $DBS_BLAST $DBS_BWA \
${'--chunkSize=' + query_chunk_size} \
3 changes: 1 addition & 2 deletions pipes/rules/hs_deplete.rules
@@ -31,7 +31,6 @@ rule depletion:
output:
config["tmp_dir"] +'/'+config["subdirs"]["depletion"]+'/{sample}.bwa_depleted.bam',
config["tmp_dir"] +'/'+config["subdirs"]["depletion"]+'/{sample}.bmtagger_depleted.bam',
config["tmp_dir"] +'/'+config["subdirs"]["depletion"]+'/{sample}.rmdup.bam',
config["data_dir"]+'/'+config["subdirs"]["depletion"]+'/{sample}.cleaned.bam'
resources:
mem_mb = 15*1000,
@@ -135,4 +134,4 @@ rule merge_one_per_sample:
shell("{config[bin_dir]}/read_utils.py merge_bams {input} {output} --picardOptions SORT_ORDER=queryname")
else:
shell("{config[bin_dir]}/read_utils.py merge_bams {input} {params.tmpf_bam} --picardOptions SORT_ORDER=queryname")
shell("{config[bin_dir]}/read_utils.py rmdup_mvicuna_bam {params.tmpf_bam} {output} --JVMmemory {mem_mb}m")
shell("{config[bin_dir]}/read_utils.py rmdup_clumpify_bam {params.tmpf_bam} {output} --JVMmemory {mem_mb}m")
69 changes: 67 additions & 2 deletions read_utils.py
@@ -26,6 +26,7 @@
import util.file
import util.misc
from util.file import mkstempfname
import tools.bbmap
import tools.bwa
import tools.cdhit
import tools.picard
@@ -458,7 +459,7 @@ def dedup_bams(data_pairs, threads=None, JVMmemory=None):
JVMmemory = JVMmemory if JVMmemory else tools.picard.DownsampleSamTool.jvmMemDefault
jvm_worker_memory = str(max(1,int(JVMmemory.rstrip("g"))/workers))+'g'
with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
future_to_file = {executor.submit(rmdup_mvicuna_bam, *fp, JVMmemory=jvm_worker_memory): fp[0] for fp in data_pairs}
future_to_file = {executor.submit(rmdup_clumpify_bam, *fp, JVMmemory=jvm_worker_memory): fp[0] for fp in data_pairs}
for future in concurrent.futures.as_completed(future_to_file):
f = future_to_file[future]
try:
@@ -918,9 +919,73 @@ def parser_rmdup_cdhit_bam(parser=argparse.ArgumentParser()):
util.cmd.common_args(parser, (('loglevel', None), ('version', None), ('tmp_dir', None)))
util.cmd.attach_main(parser, rmdup_cdhit_bam, split_args=True)
return parser
__commands__.append(('rmdup_cdhit_bam', parser_rmdup_cdhit_bam))


__commands__.append(('rmdup_cdhit_bam', parser_rmdup_cdhit_bam))
def rmdup_clumpify_bam(in_bam, out_bam, max_mismatches=3, optical_only=False, dupedist=40, spanx=False, spany=False, adjacent=False, no_containment=False, threads=None, JVMmemory=None):
''' Remove duplicate reads from BAM file using bbmap's clumpify tool.
'''
bbmap = tools.bbmap.BBMapTool()
bbmap.dedup_clumpify(in_bam, out_bam, subs=max_mismatches, optical=optical_only, dupedist=dupedist, spanx=spanx, spany=spany, adjacent=adjacent, containment=not no_containment, threads=threads, JVMmemory=JVMmemory)
return 0

def parser_rmdup_clumpify_bam(parser=argparse.ArgumentParser()):
parser.add_argument('in_bam', help='Input reads, BAM format.')
parser.add_argument('out_bam', help='Output reads, BAM format.')
parser.add_argument(
'--maxMismatches',
dest="max_mismatches",
type=int,
default=3,
help='The max number of base mismatches to allow when identifying duplicate reads. (default: %(default)s)'
)
parser.add_argument('--no_containment', dest='no_containment', default=False, action='store_true',
help='Disable containment matching, where a shorter read fully contained within a longer one is treated as a duplicate.'
)
optical_group = parser.add_argument_group('optical','arguments related to removal of optical duplicates')
optical_group.add_argument('--optical_only', dest='optical_only', default=False, action='store_true',
help='If true, only remove *optical* duplicates. Optical duplicate '
'removal is limited by xy-position, and is intended for single-end sequencing.'
)
optical_group.add_argument('--spany', dest='spany', default=False, action='store_true',
help='Allow reads to be considered optical duplicates if they '
'are on different tiles, but are within dupedist in the '
'y-axis. Should only be enabled when looking for '
'tile-edge duplicates (as in NextSeq).'
)
optical_group.add_argument('--spanx', dest='spanx', default=False, action='store_true',
help='Like spany, but for the x-axis. Not necessary for NextSeq.'
)
optical_group.add_argument('--adjacent', dest='adjacent', default=False, action='store_true',
help='Limit tile-spanning to adjacent tiles (those with consecutive numbers).'
)
optical_group.add_argument(
'--dupedist',
dest="dupedist",
type=int,
default=40,
help='Max distance to consider for optical duplicates (default: %(default)s). '
'Larger distance removes more duplicates but is '
'more likely to remove PCR rather than optical '
'duplicates. This is platform-specific; '
'recommendations: '
'NextSeq --dupedist=40 (and --spany), '
'HiSeq 1T --dupedist=40, '
'HiSeq 2500 --dupedist=40, '
'HiSeq 3k/4k --dupedist=2500, '
'Novaseq --dupedist=12000.'
)
parser.add_argument(
'--JVMmemory',
default=tools.picard.FilterSamReadsTool.jvmMemDefault,
help='JVM virtual memory size (default: %(default)s)',
dest='JVMmemory'
)
util.cmd.common_args(parser, (('threads', None), ('loglevel', None), ('version', None), ('tmp_dir', None)))
util.cmd.attach_main(parser, rmdup_clumpify_bam, split_args=True)
return parser
__commands__.append(('rmdup_clumpify_bam', parser_rmdup_clumpify_bam))


def _merge_fastqs_and_mvicuna(lb, files):
readList = mkstempfname('.keep_reads.txt')
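
The tools/bbmap.py wrapper (BBMapTool.dedup_clumpify) that rmdup_clumpify_bam calls into is not part of the diff shown here. Judging from the keyword arguments it forwards, the underlying clumpify.sh invocation is presumably along the lines of the sketch below; the flag names are standard BBTools Clumpify options, but the exact command the wrapper builds (including any BAM-to-FASTQ conversion and read-group restoration, per the "maintain RG info" commit) is an assumption.

import subprocess

def clumpify_dedup_sketch(in_fastq, out_fastq, subs=3, optical=False,
                          dupedist=40, spanx=False, spany=False,
                          adjacent=False, containment=True, jvm_mem="2g"):
    # Build a clumpify.sh command roughly matching the kwargs that
    # rmdup_clumpify_bam() passes to BBMapTool.dedup_clumpify().
    def flag(val):
        return "t" if val else "f"
    cmd = [
        "clumpify.sh",
        "-Xmx" + jvm_mem,
        "in=" + in_fastq,
        "out=" + out_fastq,
        "dedupe=t",                      # remove duplicates, not just clump reads
        "subs={}".format(subs),          # allowed mismatches between duplicate reads
        "dupedist={}".format(dupedist),  # max xy-distance for optical duplicates
        "optical=" + flag(optical),      # restrict to optical duplicates only
        "spanx=" + flag(spanx),
        "spany=" + flag(spany),
        "adjacent=" + flag(adjacent),
        "containment=" + flag(containment),  # treat contained reads as duplicates
    ]
    subprocess.check_call(cmd)
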