A tool to analyze haplotype-specific chromosome-scale somatic copy number aberrations and aneuploidy using long reads (Oxford Nanopore, PacBio). Wakhan takes long-read alignment and phased heterozygous variants as input, and first uses extends the phased blocks, taking advantage of the CNA differences between the haplotypes. Wakhan then generates inetractive haplotype-specific coverage plots.
git clone https://github.com/KolmogorovLab/Wakhan.git
cd Wakhan/
conda env create -f environment.yml -n Wakhan
conda activate Wakhan
cd src/
python main.py --threads <4> --reference <ref.fa> --target-bam <data.tumor.bam> --normal-phased-vcf <data.normal_phased.vcf.gz> --genome-name <cellline/dataset name> --out-dir-plots <genome_abc_output> --breakpoints <severus-sv-VCF>
python main.py --threads <4> --reference <ref.fa> --target-bam <data.tumor_haplotagged.bam> --tumor-vcf <data.tumor_phased.vcf.gz> --genome-name <cellline/dataset name> --out-dir-plots <genome_abc_output> --breakpoints <severus-sv-VCF>
Wakhan accepts Severus or any other structural variant caller VCF as breakpoints with param --breakpoints
inputs to detect copy number changes and this option is highly recommended.
However, if --breakpoints
option is not used, --change-point-detection-for-cna
should be used instead to use change point detection algorithm ruptures alternatively.
User can input both --ploidy-range
[default: 1.5-5.5 -> [min-max]] and --purity-range
[default: 0.5-1.0 -> [min-max] to inform copy number model about normal contamination in tumor to estimate copy number states correctly.
By default, Wakhan uses COSMIC cancer genes to display corresponding copy number states in <genome_name>_<ploidy>_<purity>_<confidence>_genes_genome.html
file.
User can also input path through param --user-input-genes
to custom input genes/subset of genes bed file these genes will be used in plots instead of default COSMIC cancer genes.
grch38 reference genes will be use as default, user can input alternate (i.e, chm13) --reference-name
to change to T2T-CHM13 instead.
Wakhan produces reads coverage coverage.csv
(bin-size based reads coverage) and phasesets reads coverage coverage_ps.csv
data, phase-corrected coverage phase_corrected_coverage.csv
(as well as tumor BAM pileup pileup_SNPs.csv
in case Tumor/normal mode) and stores in directory coverage_data
inside --out-dir-plots
location.
If this data has already been generated in a previous Wakhan run, user can rerun the Wakhan with additionally passing --quick-start
and --quick-start-coverage-path <path to coverage_data directory -> e.g., /home/rezkuh/data/1437/coverage_data>
in addition to required params in above example runs.
This will save runtime significantly by not invoking coverage and pileup methods.
Few cell lines arbitrary phase-switch correction and copy number estimation output with coverage profile is included in the examples directory.
-
--reference
Reference file path -
--target-bam
path to target bam files (must be indexed) -
--out-dir-plots
path to output coverage plots -
--genome-name
genome cellline/sample name to be displayed on plots -
--normal-phased-vcf
normal phased VCF file to generate het SNPs frequncies pileup for tumor BAM (if tumor-only mode, use phased--tumor-vcf
instead) -
--tumor-vcf
phased VCF is required in tumor-only mode
-
--breakpoints
For segmentation to use in CN estimation, structural variations/breakpoints VCF file is required -
--breakpoints-min-length
To adjust breakpoints min length to be included in copy number analysis [default: 10000] -
--cpd-internal-segments
For change point detection algo on internal segments after breakpoint/cpd algo for more precise segmentation. -
--copynumbers-subclonal-enable
Enabling subclonal/fractional copy number states in plots -
--loh-enable
Enabling LOH regions display in CN plots -
--phaseblock-flipping-disable
disabling phaseblock flipping if traget tumor BAM doesn't need phase-correction (default: enabled) -
--phaseblocks-enable
enabling phaseblocks display in coverage plots -
--contigs
List of contigs (chromosomes, default: chr1-22) to be included in the plots [e.g., chr1-22,X,Y] -
--cut-threshold
Maximum cut threshold for coverage (readdepth) plots [default: 100] -
--centromere
Path to centromere annotations BED file [default: annotations/grch38.cen_coord.curated.bed] -
--cancer-genes
Path to Cancer Genes in TSV format to display in CNA plots [default: annotations/CancerGenes.tsv] -
--pdf-enable
Enabling PDF output for plots
Wakhan can also be used in case phasing is not good in input tumor or analysis is being performed without considering phasing:
--without-phasing
enable it if CNA analysis is being performed without phasing in conjunction with--phaseblock-flipping-disable
with all other required parameters as mentioned in example command
Here is a sample copy number/breakpoints output plot without phasing.
Based on best confidence scores, tumor purity and ploidy values are calculated and copy number analysis is performed.
Each subfolder in output directory represents best <ploidy
><purity
><confidence
> values.
<genome-name>_genome_copynumber.html
Genome-wide copy number plots with coverage information on same axis<genome-name>_copynumber_breakpoints.html
Genome-wide copy number plots with coverage information on opposite axis, additionally breakpoints and genes annotations<genome-name>_copynumber_breakpoints_subclonal.html
Genome-wide subclonal/fractional copy number plots with coverage information on opposite axis, additionally breakpoints and genes annotations (--copynumbers-subclonal-enable
)bed_output
It contains copy numbers segments in bed formatvariation_plots
Copy number chromosomes-scale plots with segmentation, coverage and LOH
Following are coverage and SNPs/LOH plots and bed directories in output folder, independent of CNA analysis
snps_loh_plots
SNPs and SNPs ratios plots with LOH representation in chromosomes-scale and genome-wide<genome-name>_genome_loh.html
Genome-wide LOH plotbed_output
It contains LOH segments in bed formatcoverage_plots
Haplotype specific coverage plots for chromosomes with option for unphased coveragephasing_output
Phase-switch error correction plots and phase corrected VCF file (*rephased.vcf.gz)
This tool requires haplotagged tumor BAM and phased VCF in case tumor-only mode and normal phased VCF in case tumor-normal mode. This can be done through any phasing tools like Margin, Whatshap and Longphase. Following commands could be helpful for phasing VCFs and haplotagging BAMs.
# ClairS phase and haplotag both normal and tumor samples
singularity run clairs_latest.sif /opt/bin/run_clairs --threads 56 --phase_tumor True --use_whatshap_for_final_output_haplotagging --use_whatshap_for_final_output_phasing --tumor_bam_fn normal.bam --normal_bam_fn tumor.bam --ref ref.fasta --output_dir clairS --platform ont_r10
or
# Phase normal sample
pepper_margin_deepvariant call_variant -b normal.bam -f ref.fasta -o pepper/output -t 56 --ont_r9_guppy5_sup -p pepper --phased_output
# Haplotag tumor sample with normal phased VCF (phased.vcf.gz) output from previous step
whatshap haplotag --ignore-read-groups phased.vcf.gz tumor.bam --reference ref.fasta -o tumor_whatshap_haplotagged.bam
# Phase and haplotag tumor sample
singularity run clair3_latest.sif /opt/bin/run_clair3.sh --use_whatshap_for_final_output_haplotagging --use_whatshap_for_final_output_phasing --bam_fn=tumor.bam --ref_fn=ref.fasta --threads=56 --platform=ont --model_path=r941_prom_sup_g5014 --output=clair3 --enable_phasing
or
# Phase and haplotag tumor sample
pepper_margin_deepvariant call_variant -b tumor.bam -f ref.fasta -o pepper/output -t 56 --ont_r9_guppy5_sup -p pepper --phased_output