CNV Array Pipeline ReadMe

MikeWLloyd edited this page Aug 16, 2024 · 1 revision

CNV Array Analysis Workflow

CNV Pipeline (--workflow cnv)

• Step 1: Parses the input CSV to verify the sampleID, idat_red, and idat_green fields, and checks that the gender values are valid.
• Step 2: The IAAP CLI converts the IDAT files to GTC format.
• Step 3: Takes in the GTC files and processes them through a series of BCFtools commands to produce a sorted, normalized, and indexed VCF file.
• Step 4: Extracts BAF and LRR values from a BCF file, formats the values, and writes them to separate files, 'bcftools_convert.BAF' and 'bcftools_convert.LRR'.
• Step 5: Uses the ASCAT package to analyze the BAF and LRR data and identify CNVs.
• Step 6: Annotates the CNV segments with gene information and produces visualizations.

CNV Flowchart

flowchart TB
    p0((Sample))
    p1[IAAP_CLI]
    p2[BCFTOOLS_GTC2VCF]
    p3[BCFTOOLS_QUERY_ASCAT]
    p4[ASCAT]
    p5[ASCAT_ANNOTATION]
    o1([VCF with BAF/LRR]):::output
    o2([BAF File]):::output
    o3([LRR File]):::output
    o4([Raw CNV segments]):::output
    o5([Sample ploidy]):::output
    o6([Additional ASCAT Output]):::output
    o7([Genes Annotated with CNV]):::output
    o8([Annotated CNV Segments]):::output

    p0 -->|IDAT Files\nRed/Green| p1

    subgraph " "
    p1 --> p2
    p2 --> o1
    p2 --> p3
    p3 --> o2
    p3 --> o3
    o2 --> p4
    o3 --> p4
    p4 --> o4
    p4 --> o5
    p4 --> o6
    o4 --> p5
    o5 --> p5
    p5 --> o7
    p5 --> o8
    end

classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000

Parameters for the Workflow

  • --pubdir

    • Default: /<PATH>
    • Comment: The directory in which the saved outputs will be stored.
  • -w

    • Default: /<PATH>
    • Comment: The directory used for all intermediary files and Nextflow processes. This directory can become quite large, so it should be a location on /fastscratch or another directory with ample storage.
  • --bpm

    • Default: /<PATH>
    • Comment: The path to the BPM file.
  • --egt

    • Default: /<PATH>
    • Comment: The path to the EGT file.
  • --gtc_csv

    • Default: /<PATH>
    • Comment: The path to the GTC CSV file.
  • --gtc_output

    • Default: /<PATH>
    • Comment: The directory of GTC files output from the previous step.
  • --ref_fa

    • Default: /<PATH>
    • Comment: The path to the reference FASTA file.
  • --BAF

    • Default: /<PATH>
    • Comment: The BAF file output from the BCFTOOLS_QUERY_ASCAT module.
  • --LRR

    • Default: /<PATH>
    • Comment: The LRR file output from the BCFTOOLS_QUERY_ASCAT module.
  • --segments_raw

    • Default: *segments_raw.txt
    • Comment: The raw segments file.
  • --ploidy

    • Default: *ploidy.txt
    • Comment: The ploidy file.
  • --chromosome_arm

    • Default: /<PATH>
    • Comment: The path to the chromosome arm file.
  • --cnv_gene_annotation

    • Default: /<PATH>
    • Comment: The path to the CNV gene annotation file.
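
The parameters above can be assembled into a launch command. The following is a minimal sketch, not the pipeline's documented invocation: the entry script name `main.nf` and all paths are hypothetical placeholders, and only the flags listed on this page (`--workflow cnv`, `-w`, `--pubdir`, `--bpm`, `--egt`, `--ref_fa`) are used.

```python
import shlex


def build_cnv_command(pipeline_script, work_dir, params):
    """Assemble an argv list for a CNV workflow run.

    pipeline_script is a hypothetical path to the pipeline entry script;
    this page does not name it.
    """
    command = ["nextflow", "run", pipeline_script, "-w", work_dir, "--workflow", "cnv"]
    # Pipeline parameters use a double dash, as in the list above.
    for name, value in params.items():
        command += [f"--{name}", str(value)]
    return command


# All paths below are placeholders, not real defaults.
command = build_cnv_command(
    "main.nf",
    "/path/to/work",
    {
        "pubdir": "/path/to/output",
        "bpm": "/path/to/manifest.bpm",
        "egt": "/path/to/cluster.egt",
        "ref_fa": "/path/to/reference.fa",
    },
)
print(shlex.join(command))
```

Building the command as a list and quoting it with `shlex.join` avoids problems with spaces in paths. The remaining parameters can be added to the dictionary in the same way.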

Pipeline Default Outputs

NOTE: * is a wildcard, a placeholder for the sample name or ID that is filled in when the pipeline is run.

| Naming Convention | Description |
| --- | --- |
| *_convert.vcf | VCF file containing the B allele frequency and log R ratio for each SNP in the array |
| *_convert_info.tsv | TSV file containing additional information extracted from the IDAT files, including metadata and auxiliary information |
| *_convert.BAF | B allele frequency (BAF) file: the fraction of the signal intensity that supports the B allele at each variant site |
| *_convert.LRR | Log R ratio (LRR) file: the log ratio of the observed to the expected signal intensity at each variant site |
| *_sample.QC.txt | Quality control metrics for each sample |
| *.png | PNG image files generated by the ASCAT process |
| *_ASCAT_objects.Rdata | R objects from the ASCAT analysis, containing ASCAT data and quality control metrics |
| *.segments_raw.extend.txt | Raw segmented data, including the start and end positions of each segment and the number of probes it contains |
| *.ploidy.txt | Estimated sample ploidy, as calculated by ASCAT |
| *.ensgene_cnvbreak.txt | Ensembl gene information annotated with CNV breakpoint information |
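
One way to use these naming conventions is to check a finished run for missing outputs. The sketch below is illustrative only: it assumes the file names have already been collected from the output directory, and the sample file names in the example are hypothetical, since the exact prefix that replaces * depends on the run.

```python
from fnmatch import fnmatchcase

# Patterns copied from the default outputs listed above.
DEFAULT_OUTPUT_PATTERNS = [
    "*_convert.vcf",
    "*_convert_info.tsv",
    "*_convert.BAF",
    "*_convert.LRR",
    "*_sample.QC.txt",
    "*.png",
    "*_ASCAT_objects.Rdata",
    "*.segments_raw.extend.txt",
    "*.ploidy.txt",
    "*.ensgene_cnvbreak.txt",
]


def missing_outputs(file_names, patterns=DEFAULT_OUTPUT_PATTERNS):
    """Return the patterns that no file name in file_names matches."""
    return [
        pattern
        for pattern in patterns
        if not any(fnmatchcase(name, pattern) for name in file_names)
    ]


# Hypothetical file names for a partially completed run.
found = ["Sample_42_convert.vcf", "Sample_42_convert.BAF", "Sample_42_convert.LRR"]
for pattern in missing_outputs(found):
    print("no file matching", pattern)
```

`fnmatchcase` is used rather than `fnmatch` so that matching is case-sensitive on every platform, which matters for extensions such as `.BAF` and `.Rdata`.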

Pipeline Optional Outputs

| Naming Convention | Description |
| --- | --- |
| *.gtc | Genotype call files generated by the IAAP_CLI process |
| iaap_cli.log | Log file capturing the execution details of the IAAP_CLI process |

CSV Input Sample Sheet

The required input header is: sampleID,gender,idat_red,idat_green.

  • The sampleID column is a unique identifier for each individual sample.
  • The gender column contains gender information for the sample. Accepted values are 'XX', 'XY' or '' (unknown).
  • The idat_red and idat_green columns must contain absolute paths to the red and green IDAT files output from an Illumina array.

An example of the CSV file:

sampleID,gender,idat_red,idat_green
Sample_42,XY,206967180008_R01C01_Red.idat,206967180008_R01C01_Grn.idat
Sample_101,XY,206967180008_R02C02_Red.idat,206967180008_R02C02_Grn.idat
Sample_10191,,206967180180_R02C02_Red.idat,206967180180_R02C02_Grn.idat
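
The checks described for the sample sheet can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline's own validation code: the function name `validate_sample_sheet` is hypothetical, and the duplicate-ID check is inferred from the requirement that sampleID be unique. It does not check that the IDAT paths exist or are absolute.

```python
import csv
import io

REQUIRED_COLUMNS = ["sampleID", "gender", "idat_red", "idat_green"]
VALID_GENDERS = {"XX", "XY", ""}  # an empty string means unknown


def validate_sample_sheet(handle):
    """Return a list of error strings; an empty list means the sheet passed."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]

    errors = []
    seen_ids = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        sample_id = (row["sampleID"] or "").strip()
        if not sample_id:
            errors.append(f"line {line_no}: empty sampleID")
        elif sample_id in seen_ids:
            errors.append(f"line {line_no}: duplicate sampleID {sample_id!r}")
        seen_ids.add(sample_id)

        gender = (row["gender"] or "").strip()
        if gender not in VALID_GENDERS:
            errors.append(f"line {line_no}: invalid gender {gender!r}")

        for column in ("idat_red", "idat_green"):
            if not (row[column] or "").strip():
                errors.append(f"line {line_no}: empty {column}")
    return errors


example = """sampleID,gender,idat_red,idat_green
Sample_42,XY,206967180008_R01C01_Red.idat,206967180008_R01C01_Grn.idat
Sample_10191,,206967180180_R02C02_Red.idat,206967180180_R02C02_Grn.idat
"""
print(validate_sample_sheet(io.StringIO(example)))
```

Returning a list of messages rather than raising on the first problem lets every issue in a sheet be reported in one pass.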