A Snakemake pipeline developed to quantify isoforms expression levels in large RNA-seq datasets and find poly-A sites

ctglab/isoworm

What is IsoWorm

IsoWorm is a Snakemake pipeline developed to quantify isoform expression levels in large RNA-seq datasets (paired-end short reads). The pipeline consists of a series of interconnected modules, each performing one stage of the analysis. It starts from a txt file containing SRA IDs; the RNA-seq library type, or a BAM file, and the reference files (FASTA and GTF) are specified in the Snakemake config file. The custom module of IsoWorm can be used to analyse specific isoforms (BRAF in our case study), using custom GTF files to quantify isoform-specific genomic regions. Quantification is performed with StringTie, and all plots are generated in R. Conversely, the Salmon module of IsoWorm quantifies all isoforms annotated in the Ensembl database (our reference); an R script generates pie charts of gene isoform expression. IsoWorm also implements a module for single-end reads that identifies polyA sites with custom R scripts, starting from QuantSeq 3' REV sequencing data.

Getting Started

Input

The input files and parameters are specified in config_final.yml; the settings for the R plots and scripts are specified in the config file for R:

top level directories

  • workflow_type: "" - options: "polyA_module", "salmon_module", "custom_module", "custom_and_salmon_modules"
  • sourcedir: - your output directory
  • refdir: - directory with your GTF, FASTA and all other reference files
  • sampledir: - directory with your txt sample files
  • envsdir: - directory with your envs files
  • workflow: - directory with your workflow (.smk) files
  • samples: - your txt file containing the SRA samples
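As a minimal sketch, a samples file could be created and sanity-checked as below. The one-ID-per-line layout, the file name samples_example.txt and the accessions themselves are assumptions made for illustration; they are not taken from the repository.

```shell
# Hypothetical accessions: replace them with the SRA run IDs of your dataset.
# One ID per line is an assumed layout, not confirmed by the repository.
printf '%s\n' SRR0000001 SRR0000002 SRR0000003 > samples_example.txt

# Sanity check: count the IDs written and the IDs that appear more than once.
n_ids=$(wc -l < samples_example.txt | tr -d ' ')
n_dups=$(sort samples_example.txt | uniq -d | wc -l | tr -d ' ')
echo "samples: $n_ids, duplicated IDs: $n_dups"
```

The resulting file would then be the one referenced by the samples: entry.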

reference files, genome indices and data

  • stargenomedir, GRCh38.primary_assembly.genome: - directory for STAR genome
  • fasta: GRCh38.primary_assembly.genome: - genome fasta reference file for STAR
  • fasta_salmon: GRCh38.primary_assembly.genome: - transcript fasta reference for salmon
  • gtf: GRCh38.primary_assembly.genome: - gtf file for all transcripts
  • gtf_personal: GRCh38.primary_assembly.genome: - GTF file customized for your transcript of interest
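To make the keys above concrete, the following sketch writes a config skeleton to a separate file so that an existing config is not overwritten. The flat key layout, the file name config_example.yml and every path and value are placeholders assumed for illustration; the config file shipped with the repository may be organised differently.

```shell
# Write a hypothetical config skeleton. Every path and value is a placeholder,
# and the flat key layout is an assumption, not the repository's actual file.
cat > config_example.yml <<'EOF'
workflow_type: "salmon_module"  # or polyA_module, custom_module, custom_and_salmon_modules
sourcedir: /path/to/output
refdir: /path/to/references
sampledir: /path/to/samples
envsdir: /path/to/envs
workflow: /path/to/workflow
samples: samples.txt
stargenomedir: /path/to/references/star_index
fasta: genome.fa
fasta_salmon: transcripts.fa
gtf: annotation.gtf
gtf_personal: custom_isoforms.gtf
EOF

# Show the keys that were written, one per line.
cut -d: -f1 config_example.yml
```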

Output

polyA modules

  • SAindex - star index
  • {sample_name}_SE_small_Aligned.sortedByCoord.out.bam - sliced BAM of your gene of interest (BRAF in our case study), single-end
  • polyA_filtered_3UTR204.csv - peaks for poly A in BRAF-204 UTRs
  • polyA_filtered_3UTR220.csv - peaks for poly A in BRAF-220 UTRs

salmon modules

  • salmon_index - salmon index
  • quant.sf - all transcripts quantified by Salmon
  • ratio_salmon.pdf - box plots of the ratio between our two isoforms of interest
  • pie_charts.pdf - pie charts of the expression values of all our isoforms of interest
  • total_salmon.pdf - total expression levels of our gene of interest

custom modules

  • SAindex - star index
  • {file}_small_Aligned.sortedByCoord.out.bam - sliced BAM of your gene of interest (BRAF in our case study)
  • ratio_BRAF.pdf - box plots of the ratio between our two isoforms of interest

Dependencies

  • miniconda - install it according to the instructions.
  • snakemake - install it using conda.
  • The rest of the dependencies are automatically installed using the conda feature of snakemake.

Installation

Clone the repository:

git clone https://github.com/ctglab/isoworm

Usage

Edit config.yml to set the input datasets and parameters, edit config.R to set the input datasets and parameters for R, and edit script.sh to set the directory where you want to download your FASTQ files. Then run:

snakemake -s snakefile_final.smk --use-conda --rerun-incomplete --cores 2 -k
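Before launching the full run, it can help to confirm that the edited files exist and to ask Snakemake for a dry run. The sketch below assumes all four files sit in the current directory, which may not match your checkout. missing_inputs is a hypothetical helper, not part of IsoWorm. The -n flag is Snakemake's dry-run option: it lists the planned jobs without executing them.

```shell
# Hypothetical helper (not part of IsoWorm): print each path that does not
# exist and return non-zero if at least one is missing.
missing_inputs() {
    rc=0
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "missing: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Dry run only when every expected file is present (file locations assumed).
if missing_inputs config.yml config.R script.sh snakefile_final.smk; then
    snakemake -s snakefile_final.smk --use-conda -n
fi
```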
