NOTA BENE!!! The first time you run snakemake with this repository, it may take many hours, because it will download both the full software environment needed to run PCGR mutation impact reports and all the large public resource files needed for the same (by automatically running workflow/scripts/download_resources.py). If you interrupt the downloading and unpacking of these files, you will need to rerun the download script manually.

Configuring Tamor to Analyze Your Cancer Cases

Cases are organized into logical units: Projects (a.k.a. cohorts), that have Subjects (a.k.a. patients) that have Samples (e.g. biopsy or blood).

Cases from the same cohort will be outputted into the same output folder, for organizational purposes.

A Subject must have at least one normal/germline sample.

A Subject can have one or more tumor samples (e.g. primary and refractory). Each tumor must have a DNA sample, and optionally an RNA sample.

The default config files are preconfigured for didactic purposes with a public leukaemia genome+transcriptome case from the NCBI Short Read Archive. This case is part of the cohort PR-TEST-CLL, with the patient labelled as PR-TEST-CLL-SAMN08512283, and there are three sets of input FASTQ files downloaded/generated by running workflow/scripts/download_testdata.py. The three samples are PR-TEST-CLL-SAMN08512283-SRR6702602-T (tumor DNA), PR-TEST-CLL-SAMN08512283-SRR6702602-N (pseudonormal DNA generated by the script since no actual normal is available), and PR-TEST-CLL-SAMN08512283-SRR6702601-T (tumor RNA). Such long systematic names are not necessary, but in practice we have found them very useful once you start accumulating larger cohorts.

Sequencing instrument run IDs and sample IDs are typically rather opaque and automatically assigned by the sequencing lab. These are not part of the config files, nor reported out by Tamor, but rather linked to designated Subjects and Samples via the Illumina Samplesheets.

config/config.yaml

config/config.yaml is the file that you can customize for your site-specific settings. By default the config is set up to read input files from the resources folder, and write result files under the results folder. By default the genome index and annotation files, as well as the PCGR data bundle, are expected in resources. This is where workflow/scripts/download_resources.py puts those files.

Tamor's default config has the input lists of paired tumor-normal samples (with minimal metadata, described below) in files called config/dna_samples.tsv and config/rna_samples.tsv. These TSVs are the main config files that you will need to edit to run your own samples through the workflow.

config/dna_samples.tsv

Has 10 columns to be specified:

subjectID<tab>
tumorSampleID<tab>
TrueOrFalseTumorHasPCRDuplicates<tab>
germlineSampleID<tab>
TrueOrFalseGermlineHasPCRDuplicates<tab>
TrueOrFalse_germline_contains_some_tumor<tab>
PCGRTissueSiteNumber<tab>
OncoTreeCode<tab>
TCGACode<tab>
ProjectID

The subjectID, tumorSampleID and germlineSampleID are subject to the following rules:

  • They must CONTAIN NO UNDERSCORES.
  • The subjectID must be between 6 and 35 characters long (due to a PCGR naming limitation).
  • tumorSampleID and germlineSampleID must be the exact Sample_Name values you used in your Illumina sequencing sample spreadsheets (see samplesheet section below for details).
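
The naming rules above can be checked before a run with a few lines of Python. This is only a sketch, not part of Tamor: the function name check_ids is hypothetical, and it assumes the 6 to 35 character limit is inclusive at both ends. It does not verify that the sample IDs match the Sample_Name values in your samplesheets.

```python
def check_ids(subject_id, tumor_sample_id, germline_sample_id):
    """Return a list of problems found; an empty list means the IDs pass.

    Hypothetical helper, not part of Tamor. Assumes the subjectID length
    limit (6 to 35 characters) is inclusive at both ends.
    """
    problems = []
    for label, value in (
        ("subjectID", subject_id),
        ("tumorSampleID", tumor_sample_id),
        ("germlineSampleID", germline_sample_id),
    ):
        if "_" in value:
            problems.append(f"{label} contains an underscore: {value}")
    if not 6 <= len(subject_id) <= 35:
        problems.append(
            f"subjectID must be 6 to 35 characters, got {len(subject_id)}"
        )
    return problems


# The IDs of the bundled test case pass all checks.
print(check_ids(
    "PR-TEST-CLL-SAMN08512283",
    "PR-TEST-CLL-SAMN08512283-SRR6702602-T",
    "PR-TEST-CLL-SAMN08512283-SRR6702602-N",
))
```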

The third and fifth columns tell Dragen whether to treat read pairs that map to the same start and end in the reference genome as PCR duplicates (in tumor and germline respectively). If you used a PCR-free library prep, set this to False, otherwise set it to True.

The tenth column is a unique project ID to which the subject belongs. For example if you have two cohorts of lung and breast cancer, assigning individuals to two projects would be logical. All project output files go into their own output folders, even if they were sequenced together on the same Illumina sequencing runs.

The sixth column of the paired input sample TSV file is usually False, unless your germline sample is from a leukaemia or perhaps a poor quality histology section from a tumor, in which case use True. This instructs Dragen to still report variants as somatic in the tumor analysis output even when they are present at low frequency in the germline sample (see default of 0.05 under tumor_in_normal_tolerance_proportion in config.yaml).

For the eighth column OncoTree codes for cancer types can be found here: https://oncotree.mskcc.org/

For the ninth column The Cancer Genome Atlas codes can be found here: https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations

For the seventh column, the list of tissue site numbers for the version of PCGR included here is:

                        0  = Any
                        1  = Adrenal Gland
                        2  = Ampulla of Vater
                        3  = Biliary Tract
                        4  = Bladder/Urinary Tract
                        5  = Bone
                        6  = Breast
                        7  = Cervix
                        8  = CNS/Brain
                        9  = Colon/Rectum
                        10 = Esophagus/Stomach
                        11 = Eye
                        12 = Head and Neck
                        13 = Kidney
                        14 = Liver
                        15 = Lung
                        16 = Lymphoid
                        17 = Myeloid
                        18 = Ovary/Fallopian Tube
                        19 = Pancreas
                        20 = Peripheral Nervous System
                        21 = Peritoneum
                        22 = Pleura
                        23 = Prostate
                        24 = Skin
                        25 = Soft Tissue
                        26 = Testis
                        27 = Thymus
                        28 = Thyroid
                        29 = Uterus
                        30 = Vulva/Vagina
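
To make the ten-column layout concrete, here is a minimal sketch of how one row of config/dna_samples.tsv could be parsed and sanity-checked. It is not Tamor's own parsing code: the DnaSample type and the parse_dna_row and parse_bool functions are hypothetical, and accepting True/False case-insensitively is an assumption. The example row uses the test case IDs described earlier, but its flag values and cancer-type codes are placeholders chosen for illustration, not copied from the default config.

```python
from typing import NamedTuple


class DnaSample(NamedTuple):
    """One row of dna_samples.tsv (hypothetical type, column order as listed above)."""
    subject_id: str
    tumor_sample_id: str
    tumor_has_pcr_duplicates: bool
    germline_sample_id: str
    germline_has_pcr_duplicates: bool
    germline_contains_some_tumor: bool
    pcgr_tissue_site: int
    oncotree_code: str
    tcga_code: str
    project_id: str


def parse_bool(text):
    """Parse True/False; case-insensitive matching is an assumption."""
    value = text.strip().lower()
    if value not in ("true", "false"):
        raise ValueError(f"expected True or False, got {text!r}")
    return value == "true"


def parse_dna_row(line):
    """Split one tab-separated row and check the basic constraints."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, got {len(fields)}")
    site = int(fields[6])
    if not 0 <= site <= 30:
        raise ValueError(f"PCGR tissue site must be 0 to 30, got {site}")
    return DnaSample(
        fields[0], fields[1], parse_bool(fields[2]),
        fields[3], parse_bool(fields[4]), parse_bool(fields[5]),
        site, fields[7], fields[8], fields[9],
    )


# Illustrative row: flags and cancer-type codes are placeholders.
example_row = "\t".join([
    "PR-TEST-CLL-SAMN08512283",
    "PR-TEST-CLL-SAMN08512283-SRR6702602-T", "False",
    "PR-TEST-CLL-SAMN08512283-SRR6702602-N", "False",
    "True", "16", "CLLSLL", "DLBC", "PR-TEST-CLL",
])
sample = parse_dna_row(example_row)
print(sample.project_id, sample.pcgr_tissue_site)
```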

config/rna_samples.tsv

Has 6 columns to be specified:

subjectID<tab>
tumorRNASampleID<tab>
matchedTumorDNASampleID<tab>
ProjectID<tab>
ImmuneDeconvCancerType<tab>
CohortNameForExpressionAnalysis

If you have both normal and tumor RNA samples available, it is critical to list the tumor RNA sample first.
The first RNA sample listed in the file is the one that will be included on the PCGR report for matchedTumorDNASampleID, and typically you want to report out regarding the tumor RNA.
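
The "first listed wins" rule can be illustrated with a short sketch. This is not Tamor's implementation: the function name first_rna_per_tumor and all sample IDs in the example are hypothetical, and the sketch assumes the file has no header line.

```python
def first_rna_per_tumor(rna_tsv_text):
    """Map each matchedTumorDNASampleID to the first RNA sample listed for it.

    Hypothetical helper; assumes tab-separated rows with no header line,
    in the six-column order described above.
    """
    chosen = {}
    for line in rna_tsv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        rna_sample_id, matched_dna_id = fields[1], fields[2]
        # setdefault keeps the first RNA sample seen and ignores later ones.
        chosen.setdefault(matched_dna_id, rna_sample_id)
    return chosen


# Hypothetical subject with a tumor RNA sample listed before a normal RNA sample.
example = (
    "SUBJ-0001\tSUBJ-0001-R1-T\tSUBJ-0001-D1-T\tPROJ-A\tluad\tlung-cohort\n"
    "SUBJ-0001\tSUBJ-0001-R2-N\tSUBJ-0001-D1-T\tPROJ-A\tluad\tlung-cohort\n"
)
print(first_rna_per_tumor(example))
```

Swapping the two rows in the example would make the normal RNA sample the one selected, which is why the ordering matters.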

The last column, CohortNameForExpressionAnalysis, is used for Djerba cohort reporting, e.g. to identify genes that are Z-score and percentile rank outliers in this sample compared to others being processed at the same time and nominally of the same cancer/tissue type as defined by the user.

The fifth column, ImmuneDeconvCancerType, is one of the following (pick what seems closest if no exact match is available):

                        acc  = Adrenocortical carcinoma
                        blca = Bladder Urothelial Carcinoma
                        lgg  = Brain Lower Grade Glioma
                        brca = Breast invasive carcinoma
                        cesc = Cervical squamous cell carcinoma and endocervical adenocarcinoma
                        chol = Cholangiocarcinoma
                        coad = Colon adenocarcinoma
                        esca = Esophageal carcinoma
                        gbm  = Glioblastoma multiforme
                        hnsc = Head and Neck squamous cell carcinoma
                        kich = Kidney Chromophobe
                        kirc = Kidney renal clear cell carcinoma
                        kirp = Kidney renal papillary cell carcinoma
                        lihc = Liver hepatocellular carcinoma
                        luad = Lung adenocarcinoma
                        lusc = Lung squamous cell carcinoma
                        dlbc = Lymphoid Neoplasm Diffuse Large B-cell Lymphoma
                        meso = Mesothelioma
                        ov   = Ovarian serous cystadenocarcinoma
                        paad = Pancreatic adenocarcinoma
                        pcpg = Pheochromocytoma and Paraganglioma
                        prad = Prostate adenocarcinoma
                        read = Rectum adenocarcinoma
                        sarc = Sarcoma
                        skcm = Skin Cutaneous Melanoma
                        stad = Stomach adenocarcinoma
                        tgct = Testicular Germ Cell Tumors
                        thym = Thymoma
                        thca = Thyroid carcinoma
                        ucec = Uterine Corpus Endometrial Carcinoma
                        uvm  = Uveal Melanoma
                        ucs  = Uterine Carcinosarcoma

Samplesheets

These sample sheets are the only other metadata to which Tamor has access. Place all the Illumina experiment sample sheets for your project into resources/spreadsheets by default (see the samplesheets_dir setting in config/config.yaml). They must be called runID.csv, where runID is typically the Illumina folder name in the format YYMMDD_machineID_SideFlowCellID.

Note that a sample can actually be sequenced across multiple runs (e.g. a primary run and some top-up sequencing due to an unexpectedly low read count on the first run). Tamor will aggregate the sequence data across the runs to generate a single report. The same sample name can have the same sample ID or different sample IDs across runs; they will be aggregated regardless. This allows, for example, a single tumor sample to be prepared using two different sequencing library preps.

If you are providing the FASTQs directly as input to Tamor, they must also be in the resources/analysis/primary/sequencerName/runID directory, with a corresponding Illumina Experiment Manager samplesheet resources/spreadsheets/runID.csv. Why? Tamor reads the sample sheet to find the correspondence between Sample_Name and Sample ID for each sequencing library. In addition, analysis for DNA samples differs from that for RNA samples, so the sample sheet must also contain a Sample_Project column. Sample projects with names that contain "RNA" will be processed as RNA; all others are assumed to be DNA. The Sample_Project is not used for any purpose other than distinguishing RNA from DNA, and does not need to be the same as the ProjectIDs listed in the config folder files.
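
The RNA/DNA rule described above can be sketched as follows. This is an illustration only, not Tamor's code: the function name classify_samples and the sample sheet contents are hypothetical, and the sketch assumes the [Data] section is the last section of the sheet and that the "RNA" match is case-sensitive.

```python
import csv
import io


def classify_samples(samplesheet_text):
    """Map each Sample_Name to 'RNA' or 'DNA' based on Sample_Project.

    Hypothetical helper. Assumes [Data] is the last section of the sheet
    and that the match on "RNA" is case-sensitive.
    """
    lines = samplesheet_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith("[Data]"):
            start = i + 1  # the CSV table begins on the next line
            break
    else:
        raise ValueError("no [Data] section found")
    reader = csv.DictReader(io.StringIO("\n".join(lines[start:])))
    kinds = {}
    for row in reader:
        name = row.get("Sample_Name")
        if not name:
            continue  # skip rows without a sample name
        project = row.get("Sample_Project") or ""
        kinds[name] = "RNA" if "RNA" in project else "DNA"
    return kinds


# Hypothetical minimal sample sheet.
sheet = """[Header]
IEMFileVersion,4
[Data]
Sample_ID,Sample_Name,Sample_Project
L001,SUBJ-0001-D1-T,Cohort-DNA
L002,SUBJ-0001-R1-T,Cohort-RNA
"""
print(classify_samples(sheet))
```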

The samplesheet is also used to determine if Unique Molecular Indices were used to generate the sequencing libraries, which requires different handling in Dragen during genotyping downstream.

If you provide FASTQ files directly, they must be timestamped later than the corresponding Illumina Experiment Manager spreadsheet. Otherwise Snakemake will assume you have made a consequential change to the spreadsheet and will try to automatically regenerate all FASTQs for that run, potentially from non-existent BCLs.
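
A quick way to spot this problem before launching Snakemake is to compare modification times yourself. The sketch below is not part of Tamor: the function name fastqs_not_newer_than_sheet is hypothetical, and it assumes the FASTQ files sit directly in one directory and end in .fastq.gz.

```python
import os
from pathlib import Path


def fastqs_not_newer_than_sheet(samplesheet, fastq_dir):
    """List FASTQ files whose modification time is not later than the sheet's.

    Hypothetical helper; assumes files are named *.fastq.gz and are not
    nested in subdirectories of fastq_dir.
    """
    sheet_mtime = os.path.getmtime(samplesheet)
    return sorted(
        str(path)
        for path in Path(fastq_dir).glob("*.fastq.gz")
        if path.stat().st_mtime <= sheet_mtime
    )
```

Any file this returns would need its timestamp refreshed (for example with touch, or os.utime in Python) before Snakemake will treat it as up to date.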

If you are starting with BCLs, the full Illumina experiment output folders (which contain the requisite Data/Intensities/Basecalls subfolder) are expected in resources/bcls/runID (see the bcl_dir setting in config.yaml). Tamor will perform BCL to FASTQ conversion, with the FASTQ output written into results/analysis/primary/sequencer/runID (see the analysis_dir setting in config.yaml; the default sequencer is HiSeq, per the test data mentioned earlier).