This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.
The pipeline is built using Nextflow and processes whole-exome sequencing (WES) or whole-genome sequencing (WGS) data, following the steps presented in the main README file.
Briefly, the workflow runs several quality controls on the raw and aligned data in order to validate both frozen and FFPE samples.
Then, several tools can be run to detect germline single nucleotide variants (SNVs) with haplotypecaller, somatic SNVs with mutect2, structural variants (SVs) with MANTA, or copy number variants (CNVs) with ASCAT and FACETS. Finally, complementary analyses based on variant calling can also be performed. Annotation can be done with SnpEff and SnpSift. Tumor Mutational Burden is assessed with the pyTMB tool and Microsatellite Instability is assessed with MSIsensor-pro.
The directories listed below will be created in the output directory after the pipeline has finished.
FastQC gives general quality metrics about your reads. It provides information about the quality score distribution across your reads, the per base sequence content (%T/A/G/C). You get information about adapter contamination and other overrepresented sequences.
For further reading and documentation see the FastQC help.
NB: The FastQC plots displayed in the MultiQC report show the input reads. In theory, they should already be trimmed for adapter sequences and potentially for low-quality regions. For details about read trimming, see the raw_qc pipeline.
Output directory: fastqc
[SAMPLE]_fastqc.html
- FastQC report, containing quality metrics for your untrimmed raw fastq files
zips/[SAMPLE]_fastqc.zip
- zip file containing the FastQC report, tab-delimited data file and plot images
Raw reads are aligned on the reference genome, by default with BWA-mem. The mapping statistics (Total Reads, Aligned Reads, High-Quality Alignment, Low-Quality Alignment) are also presented in the main summary table.
Note that if multiple sequencing lanes from the same sample (same sampleID, sampleName) are specified, the bam files are merged just after BWA-mem.
Output directory: preprocessing/bams/bwa/
[SAMPLE].bam and [SAMPLE].bam.bai
- Aligned reads with BAM index
The mapping statistics are presented in the MultiQC report as follows.
In general, we expect more than 80% of the reads to be aligned. Samples with less than 50% of mapped reads should be investigated further, for example by checking adapter content, contamination, etc.
NB: By default, these mapping files are not saved. Use --saveAlignedIntermediates to save them.
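The two thresholds quoted above (more than 80% expected, less than 50% to be investigated) can be expressed as a small check. This is only an illustrative sketch: the function name and the "borderline" label for values between the two thresholds are assumptions, not part of the pipeline.

```python
def mapping_status(total_reads: int, aligned_reads: int) -> str:
    """Classify a sample from its fraction of aligned reads.

    Thresholds follow the text: >80% is expected, <50% needs investigation.
    The intermediate "borderline" label is a hypothetical addition.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    fraction = aligned_reads / total_reads
    if fraction > 0.80:
        return "ok"
    if fraction < 0.50:
        return "investigate"
    return "borderline"


if __name__ == "__main__":
    # Counts as they could be read from the 'Total Reads' / 'Aligned Reads' columns.
    print(mapping_status(1_000_000, 930_000))
    print(mapping_status(1_000_000, 420_000))
```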
Sambamba is used to mark the duplicates.
The results are presented in the General Metrics table.
Output directory: preprocessing/bams/markDuplicates
[SAMPLE].md.bam
- Aligned reads marked for duplicates
stats/[SAMPLE].md.flagstats
- Number of alignments for each FLAG type
NB: By default, these mapping files are not saved. Use --saveAlignedIntermediates to save them.
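To reuse the flagstat numbers outside of MultiQC, the file can be parsed with a few lines of Python. The sketch below assumes the usual samtools-style flagstat layout ("N + M label"), which is an assumption about the content of the stats files; the function name is hypothetical.

```python
import re

# One flagstat line looks like: "150 + 0 duplicates"
# (QC-passed count, QC-failed count, label).
_FLAGSTAT_LINE = re.compile(r"^(\d+) \+ (\d+) (.+)$")


def parse_flagstat(text: str) -> dict:
    """Return {label: QC-passed count} for each recognised flagstat line."""
    counts = {}
    for line in text.splitlines():
        match = _FLAGSTAT_LINE.match(line.strip())
        if match:
            # Drop the trailing parenthesised details, e.g. "(95.00% : N/A)".
            label = match.group(3).split(" (")[0]
            counts[label] = int(match.group(1))
    return counts


if __name__ == "__main__":
    example = (
        "2000 + 0 in total (QC-passed reads + QC-failed reads)\n"
        "150 + 0 duplicates\n"
        "1900 + 0 mapped (95.00% : N/A)\n"
    )
    stats = parse_flagstat(example)
    print(stats["duplicates"] / stats["in total"])  # fraction of duplicates
```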
In the context of WES analysis, the aligned reads are intersected with their targets, defined with the --targetBed parameter.
The percentage of reads on target is presented in the General Metrics table.
Output directory: preprocessing/bams/onTarget
[SAMPLE].onTarget.bam
- Aligned reads restricted to the genomic targets
stats/[SAMPLE].onTarget.flagstat
- Number of alignments for each FLAG type
NB: By default, these mapping files are not saved. Use --saveAlignedIntermediates to save them.
Aligned reads are then filtered in order to remove reads that are not informative for the downstream analysis.
The aligned reads can be filtered as follows:
- mapq: discard reads aligned with a mapping quality lower than --mapQual
- --keepDups false: discard reads flagged as duplicates
- --keepSingleton false: discard reads for which the paired mate is not aligned
- --keepMultiHits: discard reads aligned several times on the genome
By default, the filters are set to remove low mapq, duplicated, singleton and multi-hit reads.
The fraction of reads remaining after filtering is also presented in the General Metrics table.
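The logic of these four filters can be sketched on a single read as follows. This is not the pipeline's implementation: the function, its argument names and the default mapping quality of 20 are hypothetical, and the multi-hit status is passed in as a boolean because the text does not say how it is detected. Only the SAM flag bits (0x1, 0x8, 0x400) come from the SAM specification.

```python
# SAM flag bits defined by the SAM specification.
FLAG_PAIRED = 0x1
FLAG_MATE_UNMAPPED = 0x8
FLAG_DUPLICATE = 0x400


def keep_read(flag, mapq, is_multi_hit, min_mapq=20,
              keep_dups=False, keep_singleton=False, keep_multi_hits=False):
    """Return True if the read passes all the enabled filters.

    min_mapq=20 is an arbitrary illustrative value, not the pipeline default.
    """
    if mapq < min_mapq:
        return False  # low mapping quality
    if not keep_dups and flag & FLAG_DUPLICATE:
        return False  # flagged as duplicate
    if not keep_singleton and flag & FLAG_PAIRED and flag & FLAG_MATE_UNMAPPED:
        return False  # paired read whose mate is not aligned
    if not keep_multi_hits and is_multi_hit:
        return False  # aligned several times on the genome
    return True


if __name__ == "__main__":
    # flag 1091 = paired + proper pair + first in pair + duplicate
    print(keep_read(1091, 60, False))
    print(keep_read(1091, 60, False, keep_dups=True))
```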
Output directory: preprocessing/bams/filtering/
[SAMPLE].filtered.bam and [SAMPLE].filtered.bam.bai
- Aligned and filtered reads with BAM index
[SAMPLE].filtered.idxstats
- Alignment summary statistics
[SAMPLE].filtered.flagstats
- Number of alignments for each FLAG type
From the filtered and aligned reads files, the pipeline then runs several quality control steps presented below.
The Preseq package is aimed at predicting and estimating the complexity of a genomic sequencing library, equivalent to predicting and estimating the number of redundant reads from a given sequencing depth and how many will be expected from additional sequencing using an initial sequencing experiment. The estimates can then be used to examine the utility of further sequencing, optimize the sequencing depth, or to screen multiple libraries to avoid low complexity samples. The dashed line shows a perfectly complex library where total reads = unique reads. Note that these are predictive numbers only, not absolute. The MultiQC plot can sometimes give extreme sequencing depth on the X axis - click and drag from the left side of the plot to zoom in on more realistic numbers.
Output directory: preprocessing/metrics/preseq
The fragment length is calculated from paired-end reads as the distance between the two mates with picard. The mean value is presented in the General Metrics table and the distribution is presented by MultiQC as follows:
Output directory: preprocessing/metrics/fragSize
[SAMPLE]_insert_size_metrics.txt
- Fragment size values reported by picard
[SAMPLE]_insert_size_hist.pdf
- Graphical representation
The mean sequencing depth and the percentage of the genome (or targets) covered at a given threshold (X) are calculated with mosdepth.
The coverage values at 30X, 50X (hidden column) and 100X are available in the General Metrics table.
In addition, the same analysis is repeated for exonic regions only. In the context of WES analysis, only the exonic regions overlapping with the targets are used.
The results are presented in the 'Genes Coverage' section of the MultiQC report.
Output directory: preprocessing/metrics/depth
- *global* files are the mosdepth outputs for standard coverage on the genome
- *regions* files are the mosdepth outputs for the gene coverage
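To make the two reported metrics concrete, the sketch below computes the mean depth and the percentage of positions covered at 30X, 50X and 100X from a list of per-base depths. It only illustrates the definition of the metrics; it does not read mosdepth files, and the function name is hypothetical.

```python
def coverage_summary(depths, thresholds=(30, 50, 100)):
    """Return (mean depth, {threshold: % of positions with depth >= threshold})."""
    if not depths:
        raise ValueError("depths must not be empty")
    n_positions = len(depths)
    mean_depth = sum(depths) / n_positions
    pct_covered = {
        t: 100.0 * sum(1 for d in depths if d >= t) / n_positions
        for t in thresholds
    }
    return mean_depth, pct_covered


if __name__ == "__main__":
    # Toy example with four positions.
    mean_depth, pct = coverage_summary([10, 30, 60, 100])
    print(mean_depth, pct)
```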
The picard collectWgsMetrics tool is run to collect some additional statistics on read mapping.
Among them, the fraction of bases covered by both R1 and R2 mates is available in the General Metrics table.
In the case of FFPE samples, for which the fragment size is usually smaller, this metric can help to adjust the sequencing length. In addition, overlapping read pairs can sometimes be an issue for downstream analysis, and trimming (or merging) the reads can be an interesting option.
In order to check the association between pairs of normal/tumor samples, a list of common SNPs (--polym) is used to cluster all the samples.
The results are displayed as a dendrogram.
Output directory: Identito
[SAMPLE].matrix.tsv
- results of the SNP calling for the list of SNPs
clustering_plot_identito.csv
- Identito dendrogram
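The idea behind this identity check can be illustrated with a much simpler nearest-neighbour comparison than the dendrogram produced by the pipeline. The sketch below assumes each sample is summarised as a list of genotype strings over the same SNPs (None for a missing call); this data layout, the sample names and both function names are hypothetical and do not describe the content of the matrix file.

```python
def genotype_distance(a, b):
    """Fraction of discordant genotypes among SNPs called in both samples."""
    compared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not compared:
        raise ValueError("no SNP genotyped in both samples")
    return sum(1 for x, y in compared if x != y) / len(compared)


def closest_sample(name, genotypes):
    """Return the other sample whose genotypes are the most similar to `name`."""
    others = (s for s in genotypes if s != name)
    return min(others, key=lambda s: genotype_distance(genotypes[name], genotypes[s]))


if __name__ == "__main__":
    genotypes = {
        "T1": ["0/1", "1/1", "0/0", "0/1"],
        "N1": ["0/1", "1/1", "0/0", "0/1"],
        "N2": ["0/0", "0/1", "1/1", "0/1"],
    }
    # A tumor sample is expected to be closest to its matched normal.
    print(closest_sample("T1", genotypes))
```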
The current workflow follows the GATK best practices with base recalibration.
This step is usually recommended to detect systematic errors in the data, but can be skipped with the option --skipBQSR.
The recalibrated files are used as inputs for all germline and somatic SNV calling.
Output directory: preprocessing/bams/bqsr
[SAMPLE].recal.bam and [SAMPLE].recal.bam.bai
- Aligned data after base recalibration with BAM index
Germline variants are then called using haplotypecaller following the GATK best practices (HaplotypeCaller, GenotypeGVCFs).
The number of detected variants is presented as a table in the MultiQC report.
Output directory: HaplotypeCaller
[SAMPLE]_HaplotypeCaller.vcf.gz and [SAMPLE]_HaplotypeCaller.vcf.gz.tbi
- vcf file with the variants detected by HaplotypeCaller with Tabix index
Somatic mutation calling requires pairs of normal/tumor samples defined in the design file.
The mutect2 tool is used to call somatic variants following the GATK best practices (Mutect2, MergeMutectStats, GetPileupSummaries, GatherPileupSummaries, CalculateContamination, LearnReadOrientationModel, FilterMutectCalls).
Output directory: Mutect2
[TUMORSAMPLE]_vs_[NORMALSAMPLE]_Mutect2_unfiltered.vcf.gz and [TUMORSAMPLE]_vs_[NORMALSAMPLE]_Mutect2_unfiltered.vcf.gz.tbi
- Mutect2 somatic variants before filtering with Tabix index
[TUMORSAMPLE]_vs_[NORMALSAMPLE]_Mutect2_filtered.vcf.gz and [TUMORSAMPLE]_vs_[NORMALSAMPLE]_Mutect2_filtered.vcf.gz.tbi
- Mutect2 somatic variants after filtering with Tabix index
[TUMORSAMPLE]_vs_[NORMALSAMPLE]_Mutect2_filtered_pass_norm.vcf.gz and [TUMORSAMPLE]_vs_[NORMALSAMPLE]_Mutect2_filtered_pass_norm.vcf.gz.tbi
- Mutect2 somatic variants after filtering for PASS variants only and normalization with bcftools norm, and with Tabix index
For each filtered vcf file, the current workflow calculates the number of transitions (A>G,T>C,C>T,G>A), transversions (A>C,T>G,C>A,G>T,A>T,T>A,C>G,G>C) and short insertions/deletions (indels). The results are available as a table and presented in MultiQC.
Output directory: HaplotypeCaller/[SAMPLE]/tstv
[SAMPLE]_filtered.vcf.Mutect2.table.tsv
- Number of base substitutions and indels of each type
Output directory: Mutect2/[SAMPLE]/tstv
[TUMORSAMPLE]_vs_[NORMALSAMPLE]_filtered.vcf.Mutect2.table.tsv
- Number of base substitutions and indels of each type
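The transition/transversion classification listed above can be sketched as follows. This is an illustration of the definitions only, not the script used by the pipeline; the function names and the "other" category (for example for multi-nucleotide substitutions) are assumptions.

```python
from collections import Counter

# The four transitions listed in the text; every other single-base
# substitution is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def classify_variant(ref: str, alt: str) -> str:
    """Classify one REF/ALT pair as transition, transversion, indel or other."""
    ref, alt = ref.upper(), alt.upper()
    if len(ref) != len(alt):
        return "indel"
    if len(ref) == 1 and ref != alt and ref in "ACGT" and alt in "ACGT":
        return "transition" if (ref, alt) in TRANSITIONS else "transversion"
    return "other"


def count_variant_types(variants) -> Counter:
    """Count the variant classes over an iterable of (ref, alt) pairs."""
    return Counter(classify_variant(ref, alt) for ref, alt in variants)


if __name__ == "__main__":
    calls = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T"), ("AT", "A"), ("C", "CTT")]
    counts = count_variant_types(calls)
    print(counts, counts["transition"] / counts["transversion"])
```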
Each filtered VCF file is then annotated using snpeff and SnpSift.
All annotated vcf files are saved and the following summary metrics are displayed in MultiQC.
Output directory: snpEff/[SAMPLE]
[SAMPLE]_snpeff.ann.vcf.gz
- annotated variants with SnpEff
[SAMPLE]_dbNSFP.vcf.gz
- annotated variants with SnpSift and dbNSFP database
[SAMPLE]_GnomAD.vcf.gz
- annotated variants with SnpSift and gnomAD database
[SAMPLE]_CancerHotspots.vcf.gz
- annotated variants with SnpSift and CancerHotspots database
[SAMPLE]_ICGC.vcf.gz
- annotated variants with SnpSift and ICGC database
[SAMPLE]_COSMIC.vcf.gz
- annotated variants with SnpSift and COSMIC database
CNV calling can be run using ASCAT and FACETS.
Both tools require pairs of tumor/normal samples.
ASCAT and FACETS are two software tools that perform allele-specific copy number analysis of tumor samples. They infer tumor purity (normal contamination) and ploidy, and calculate allele-specific copy number profiles. Both tools provide several images and tables as output.
Output directory: Facets
[TUMORSAMPLE]_vs_[NORMALSAMPLE]_cnv_amp_del.txt
- list of amplification and deletion of potential interest.
[TUMORSAMPLE]_vs_[NORMALSAMPLE]_cnv.pdf
- CNVs plot
[TUMORSAMPLE]_vs_[NORMALSAMPLE]_cnv_ploidy_cellularity.txt
- table with cellularity and ploidy for this sample.
Output directory: ASCAT
[TUMORSAMPLE].BAF and [NORMALSAMPLE].BAF
- file with B-allele frequencies generated by AlleleCount
[TUMORSAMPLE].LogR and [NORMALSAMPLE].LogR
- file with total copy number on a logarithmic scale generated by AlleleCount
[TUMORSAMPLE].ASCATprofile.png
- Image with information about ASCAT profile
[TUMORSAMPLE].ASPCF.png
- Image with information about ASPCF
[TUMORSAMPLE].rawprofile.png
- Image with information about raw profile
[TUMORSAMPLE].sunrise.png
- Image with information about sunrise
[TUMORSAMPLE].tumour.png
- Image with information about tumor
[TUMORSAMPLE].cnvs.txt
- file with information about CNVs
[TUMORSAMPLE].LogR.PCFed.txt
- file with information about LogR
[TUMORSAMPLE].BAF.PCFed.txt
- file with information about BAF
[TUMORSAMPLE].purityploidy.txt
- file with information about purity ploidy
The text file [TUMORSAMPLE].cnvs.txt contains predictions about the copy number state for all the segments.
The output is a tab-delimited text file with the following columns:
- chr: chromosome number
- startpos: start position of the segment
- endpos: end position of the segment
- nMajor: number of copies of one of the alleles (for example the chromosome inherited from the father)
- nMinor: number of copies of the other allele (for example the chromosome inherited from the mother)
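As an example of how this table can be used, the sketch below reads the five columns and derives the total copy number (nMajor + nMinor) and a loss-of-heterozygosity flag (nMinor equal to 0) for each segment. It assumes that the file starts with a header line using exactly the column names listed above; the function name and the derived fields are hypothetical.

```python
import csv
import io


def read_segments(handle):
    """Read a tab-delimited segment table and add total copy number and LOH."""
    segments = []
    for row in csv.DictReader(handle, delimiter="\t"):
        n_major = int(row["nMajor"])
        n_minor = int(row["nMinor"])
        segments.append({
            "chr": row["chr"],
            "startpos": int(row["startpos"]),
            "endpos": int(row["endpos"]),
            "nMajor": n_major,
            "nMinor": n_minor,
            "total_cn": n_major + n_minor,            # total copy number
            "loh": n_minor == 0 and n_major > 0,      # one allele lost
        })
    return segments


if __name__ == "__main__":
    # In practice, pass an open file handle on [TUMORSAMPLE].cnvs.txt instead.
    example = io.StringIO(
        "chr\tstartpos\tendpos\tnMajor\tnMinor\n"
        "1\t1000\t5000\t2\t0\n"
        "2\t100\t900\t1\t1\n"
    )
    for segment in read_segments(example):
        print(segment)
```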
Structural variants and indels are called using MANTA with matched control.
It is optimized for analysis of germline variation in small sets of individuals and somatic variation in tumor/normal sample pairs.
For all samples:
Output directory: Manta
Manta_[SAMPLE].candidateSmallIndels.vcf.gz and Manta_[SAMPLE].candidateSmallIndels.vcf.gz.tbi
- VCF with Tabix index
Manta_[SAMPLE].candidateSV.vcf.gz and Manta_[SAMPLE].candidateSV.vcf.gz.tbi
- VCF with Tabix index
For a Normal sample only:
Manta_[NORMALSAMPLE].diploidSV.vcf.gz and Manta_[NORMALSAMPLE].diploidSV.vcf.gz.tbi
- VCF with Tabix index
For a Tumor sample only:
Manta_[TUMORSAMPLE].tumorSV.vcf.gz and Manta_[TUMORSAMPLE].tumorSV.vcf.gz.tbi
- VCF with Tabix index
For a Tumor/Normal pair:
Output directory: Manta
Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].candidateSmallIndels.vcf.gz and Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].candidateSmallIndels.vcf.gz.tbi
- VCF with Tabix index
Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].candidateSV.vcf.gz and Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].candidateSV.vcf.gz.tbi
- VCF with Tabix index
Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].diploidSV.vcf.gz and Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].diploidSV.vcf.gz.tbi
- VCF with Tabix index
Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].somaticSV.vcf.gz and Manta_[TUMORSAMPLE]_vs_[NORMALSAMPLE].somaticSV.vcf.gz.tbi
- VCF with Tabix index
Microsatellite instability is assessed with MSIsensor-pro.
Output directory: MSI/[SAMPLE]
[SAMPLE]
- The final report contains all detected microsatellites, the unstable (somatic) microsatellites and the MSI score.
[SAMPLE]_dis
[SAMPLE]_germline
[SAMPLE]_somatic
Tumor Mutational Burden is assessed with pyTMB.
Output directory: MSI/[SAMPLE]
[SAMPLE]_tmb.txt
- results file
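As a reminder of what the reported value represents, a tumor mutational burden is commonly expressed as a number of retained variants per megabase of analysed genome. The sketch below shows this arithmetic only; which variants are retained and how the effective genome size is computed depend on the pyTMB configuration, which is not described here, and the function name is hypothetical.

```python
def tumor_mutational_burden(n_variants: int, effective_genome_size_bp: int) -> float:
    """Return the number of variants per megabase of effective genome size."""
    if effective_genome_size_bp <= 0:
        raise ValueError("effective_genome_size_bp must be positive")
    return n_variants / (effective_genome_size_bp / 1e6)


if __name__ == "__main__":
    # Toy numbers: 330 retained variants over a 33 Mb effective target size.
    print(tumor_mutational_burden(330, 33_000_000))
```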
This final step transforms the fully annotated vcf into a tabulated file to ease the analysis of the results.
Output directory: tableReport/[SAMPLE]
[SAMPLE].tsv
- results file
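The principle of this conversion can be sketched as follows: each VCF record becomes one row, and selected INFO keys become columns. The columns actually written by the pipeline are not described in this document, so the column choice, the function name and the example INFO keys are assumptions made for illustration.

```python
def vcf_to_rows(lines, info_keys):
    """Turn VCF text lines into a list of rows (first row is the header)."""
    rows = [["CHROM", "POS", "REF", "ALT", "FILTER"] + list(info_keys)]
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip meta-information and header lines
        chrom, pos, _id, ref, alt, _qual, flt, info = line.split("\t")[:8]
        info_map = {}
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            # INFO flags have no '=' and are reported as "True".
            info_map[key] = value if sep else "True"
        rows.append([chrom, pos, ref, alt, flt] + [info_map.get(k, ".") for k in info_keys])
    return rows


if __name__ == "__main__":
    vcf_lines = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr1\t100\t.\tA\tG\t50\tPASS\tDP=30;ANN=missense_variant;DB",
    ]
    for row in vcf_to_rows(vcf_lines, ["DP", "ANN", "AF"]):
        print("\t".join(row))
```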
MultiQC is a visualisation tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
The pipeline has special steps which allow the software versions used to be reported in the MultiQC output for future traceability.
Output directory: results/multiqc
Project_multiqc_report.html
- MultiQC report - a standalone HTML file that can be viewed in your web browser
Project_multiqc_data/
- Directory containing parsed statistics from the different tools used in the pipeline
For more information about how to use MultiQC reports, see http://multiqc.info. See the file 'test/multiqc_report.html' for an example on the test dataset.