VCF2DIPLOID_DIR
: vcf2diploid (available from http://alleleseq.gersteinlab.org/tools.html)
PL
: path to AlleleSeq2
LIFTOVER
: UCSC liftOver tool
BEDTOOLS_intersectBed
: Bedtools intersectBed
SAMTOOLS
: Samtools
STAR
: STAR aligner
N_THREADS
: number or threads (for STAR genomeGenerate)
REFGENOME_VERSION
: reference genome version, 'GRCh37' or 'GRCh38'
REFGENOME
: path to the reference genome .fasta file
FILE_PATH_VCF
: path to VCF
VCF_SAMPLE_ID
: sample name in VCF
FILE_PATH_BAM
: path to WGS bam
OUTPUT_DIR
: output folder
make -f makePersonalGenome.mk \
N_THREADS=8 \
VCF_SAMPLE_ID=sge_Aug_encodev2_2_local \
REFGENOME_VERSION=GRCh38 \
OUTPUT_DIR=pgenome_ENC-003 \
FILE_PATH_BAM=ENCFF200XCY.bam \
FILE_PATH_VCF=enc003.spliced.scrubbed.vcf
scipy
numpy
pandas
VGAM
ggplot2
PL
: path to AlleleSeq2
SAMTOOLS
: samtools
PICARD
: Broad picard tools
STAR
: STAR aligner
FASTQC
: FastQC quality control tool
CUTADAPT
: Cutadapt to remove adapter sequences (ATAC-seq samples)
READS_R1
: path to input .fastq file (R1)
READS_R2
: path to input .fastq file (R2, if PE sequencing)
PGENOME_DIR
: path to personal genome folder
VCF_SAMPLE_ID
: sample name in VCF
ALIGNMENT_MODE
: 'ASE' for RNA-seq, 'ASB' for ChIP-seq' and 'ASCA' for ATAC-seq
RM_DUPLICATE_READS
: 'on' to remove duplicate reads with picard tools
STAR_readFilesCommand
: --readFilesIn option in STAR
REFGENOME_VERSION
: reference genome version, 'GRCh37' or 'GRCh38'
Annotation_diploid
: path to diploid GENCODE gene annotation (for 'ASE')
FDR_CUTOFF
: FDR threshold
Cntthresh_tot
: threshold for the total number of reads mapped to hetSNV
Cntthresh_min
: threshold for the minimal number of reads mapped to each allele
make -f PIPELINE.mk \
PGENOME_DIR=pgenome_ENC-003 \
REFGENOME_VERSION=GRCh38 \
NTHR=8 \
READS_R1=ENCFF337ZBN.fastq.gz \
READS_R2=ENCFF481IQE.fastq.gz \
STAR_readFilesCommand=zcat \
ALIGNMENT_MODE=ASE \
RM_DUPLICATE_READS=on \
FDR_CUTOFF=0.10 \
VCF_SAMPLE_ID=sge_Aug_encodev2_2_local
The main output file containing AS hetSNVs is
ENCFF337ZBN_ENCFF481IQE_interestingHets.FDR-0.10.betabinom.chrs1-22.6-tot_0-min_cnt.tsv
PL
: path to AlleleSeq2
PGENOME_DIR
: path to personal genome folder
VCF_SAMPLE_ID
: sample name in VCF
INPUT_UNIQ_READS_PILEUP_FILES
: list of .mpileup files with uniquely mapped reads to aggregate generated in (1)
INPUT_MMAP_READS_PILEUP_FILES
: list of .mpileup files with multi-mapping reads to aggregate generated in (1)
PREFIX
: prefix for output file names
Cntthresh_tot
: threshold for the total number of reads mapped to hetSNV
Cntthresh_min
: threshold for the minimal number of reads mapped to each allele
FDR_CUTOFF
: FDR threshold
make -f PIPELINE_aggregated_counts.mk \
PGENOME_DIR=pgenome_ENC-003 \
INPUT_UNIQ_READS_PILEUP_FILES="../ENCSR238ZZD_ENCFF719MSG_1_ENCFF120MML_2_1_1/ENCFF719MSG_ENCFF120MML_hap1_uniqreads.mpileup ../ENCSR238ZZD_ENCFF719MSG_1_ENCFF120MML_2_1_1/ENCFF719MSG_ENCFF120MML_hap2_uniqreads.mpileup ../ENCSR238ZZD_ENCFF337ZBN_1_ENCFF481IQE_2_1_1/ENCFF337ZBN_ENCFF481IQE_hap1_uniqreads.mpileup ../ENCSR238ZZD_ENCFF337ZBN_1_ENCFF481IQE_2_1_1/ENCFF337ZBN_ENCFF481IQE_hap2_uniqreads.mpileup" \
INPUT_MMAP_READS_PILEUP_FILES="../ENCSR238ZZD_ENCFF719MSG_1_ENCFF120MML_2_1_1/ENCFF719MSG_ENCFF120MML_hap1_mmapreads.mpileup ../ENCSR238ZZD_ENCFF719MSG_1_ENCFF120MML_2_1_1/ENCFF719MSG_ENCFF120MML_hap2_mmapreads.mpileup ../ENCSR238ZZD_ENCFF337ZBN_1_ENCFF481IQE_2_1_1/ENCFF337ZBN_ENCFF481IQE_hap1_mmapreads.mpileup ../ENCSR238ZZD_ENCFF337ZBN_1_ENCFF481IQE_2_1_1/ENCFF337ZBN_ENCFF481IQE_hap2_mmapreads.mpileup" \
PREFIX=ENCSR238ZZD \
FDR_CUTOFF=0.10 \
VCF_SAMPLE_ID=sge_Aug_encodev2_2_local
The main output file is ENCSR238ZZD_interestingHets.FDR-0.10.betabinom.chrs1-22.6-tot_0-min_cnt.tsv
Makefile options (can be specified in PIPELINE_aggregate_over_genomic_regions.mk
or as command-line arguments)
PL
: path to AlleleSeq2
BEDTOOLS_intersectBed
: Bedtools intersectBed
SAMTOOLS
: Samtools
PREFIX
: prefix for output file names
Cntthresh_tot
: threshold for the total number of reads mapped to hetSNV
Cntthresh_min
: threshold for the minimal number of reads mapped to each allele
FDR_CUTOFF
: FDR threshold
COUNTS_FILE
: filtered read count file generated in (1) or (2).
REGIONS_FILE
: path to a .bed file with genomic elments (e.g. gene, cCRE) coordinates
UNIQ_ALN_FILES
: .bam(s) with uniquely mapping reads generated in (1)
MMAP_ALN_FILES
: .bam(s) with multimapping reads generated in (1)
make -f PIPELINE_aggregate_over_genomic_regions.mk \
PREFIX=ENCSR238ZZD \
REGIONS_FILE="gencode.v24_all_genes.bed" \
COUNTS_FILE=../ENCSR238ZZD_combined/ENCSR238ZZD_filtered_counts.tsv \
UNIQ_ALN_FILES='../ENCSR238ZZD_ENCFF719MSG_1_ENCFF120MML_2_1_1/ENCFF719MSG_ENCFF120MML_ASE-params_crdsorted_uniqreads_over_hetSNVs.bam ../ENCSR238ZZD_ENCFF337ZBN_1_ENCFF481IQE_2_1_1/ENCFF337ZBN_ENCFF481IQE_ASE-params_crdsorted_uniqreads_over_hetSNVs.bam' \
MMAP_ALN_FILES='../ENCSR238ZZD_ENCFF719MSG_1_ENCFF120MML_2_1_1/ENCFF719MSG_ENCFF120MML_ASE-params_crdsorted_mmapreads_over_hetSNVs.bam ../ENCSR238ZZD_ENCFF337ZBN_1_ENCFF481IQE_2_1_1/ENCFF337ZBN_ENCFF481IQE_ASE-params_crdsorted_mmapreads_over_hetSNVs.bam' \
FDR_CUTOFF=0.10
The main output file listing AS genomic elements is ENCSR238ZZD_interesting_regions.FDR-0.10.betabinom.chrs1-22.6-tot_0-min_cnt.tsv
(4) Additional scripts to calculate hap-specific read coverage are provided under the directory calc_read_coverage
The catalog file is hetSNVs_default_AS.tsv.
(6) Additional scripts to perform high-powered AS calling on samples with low read count are provided under the directory high_power
This requires files generated from (1), calling AS hetSNVs from a single sample. Where indicated, use the same values for options as in (1)
Two methods are included: the one-sided betabinomial test and the relaxed-FDR method. By default, both are calculated. To run only one, use "make onesided" or "make fdr", respectively.
pandas
VGAM
PL
: path to AlleleSeq2 (same as in (1))
READS_R1
: path to input .fastq file (R1) (same as in (1))
READS_R2
: path to input .fastq file (R2, if PE sequencing) (same as in (1))
FDR_CUTOFF
: FDR threshold (same as in (1))
Cntthresh_tot
: threshold for the total number of reads mapped to hetSNV (same as in (1))
Cntthresh_min
: threshold for the minimal number of reads mapped to each allele (same as in (1))
AS_PRIOR
: path to tab-separated file of hetSNVs that are likely to be AS. Required columns are: "chr", "ref_coord", and "preferred_allele" (1 if ref allele is favored, 0 otherwise)
RELAXED_FDR_CUTOFF
: for relaxed-FDR method, the less stringent FDR threshold
FDR_GRANULARITY
: maximum p-value at which to check FDR with fine granularity (0.01 is usually sufficient for FDR = 0.1, 0.03 is sufficient for FDR = 0.2)
make -f PIPELINE_high_power.mk \
READS_R1=ENCFF000FAH.fastq.gz \
FDR_CUTOFF=0.10 \
FDR_RELAXED=0.2 \
FDR_GRANULARITY=0.03 \
AS_PRIOR=AS_prior.tsv
The main output files are ENCFF000FAH_interestingHets.FDR-0.10.betabinom.chrs1-22.6-tot_0-min_cnt_relaxed_fdr.tsv and ENCFF000FAH_interestingHets.FDR-0.10.betabinom.chrs1-22.6-tot_0-min_cnt_onesided.tsv