-
Notifications
You must be signed in to change notification settings - Fork 0
/
Manual
170 lines (157 loc) · 18.8 KB
/
Manual
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
Supplemental Manual:
The pipeline incorporates open source tools into one easy to use virtual machine pipeline. The user has the option to run the script to run the pipeline in limited user input options. The user is also free to run the code themselves from the command prompt. There are also additional tools added to the virtualbox but that are not used in the default pipeline. The user is free to use these for the optional steps indicated below. The program and code used in the pipeline are described in detail below including the default pipeline steps and optional additions.
Step 1: The user can download the virtualbox image and load it into their virtual machine on their local machine following oracle virtualbox set up instructions. This is ready to use and you don’t need to do the optional setup instructions.
Optional Step 1: The user can download the set_up_instructions and follow to create their own virtualbox following the instructions provided.
Optional Step 1: You can use the docker image set up instead of virtual box see the docker instructions at the end of this manual.
Step 2: Set up external, internal hard drive, or shared folder path.
A. If you choose the external drive make sure you set up the virtual box to recognize your drive. Do this by going to the settings and selecting USB. Make sure you have the right 2.0 or 3.0 option selected and click add device. Then select the appropriate device from the menu. Then apply changes and restart the virtualbox. The final path should be /media/user/”whatever you named you external drive”
B. If you choose the internal drive use these instructions to add an internal hard drive space. Click settings and then select Storage. Highlight Controller: SATA and then at the bottom there is a blue drive button with a green plus select this one. Two options will appear select the add hard disk option. Click create new disk and in the new window select VHD (virtual hard disk) then click Next. Select Dynamically allocated up to appropriate size for your project and machine we suggest no smaller then 500GB. The new drive should appear in the list. Now you can open the virtual machine and you will have to format the new disk before you can use it. Go to search computer and type in disks. Click on the disks icon that appears. A new window will pop up and on the left will be a list of disks. Select your new hard drive and then click on the circle icon in the upper right corner which is the settings menu. Select format from this menu. In the new window select quick overwrite and click format. Once this is done select the + icon near the middle of the window. In the new pop up window don’t change any options just add a name in the name box. Once this is done close click the triangle “play” button to mount the drive and close the windows. Your new drive should appear below computer in the folder menu. Supply this path to the first prompt in the pipeline. The final path should be /media/user/”whatever you named it”
C. The last option is to create a shared folder on your host system. Create your folder on your host system and make sure you give it share permissions. Go into the setting in the virtualbox and click on shared folders. Click on the folder icon with the + sign on top of it. This will create a popup window and you should select you folder path by clicking the drop down option and click on other. This will create a pop up window and you can select your new shared folder. After you select the folder the pop up will close and you should make sure the auto-mount box is checked and make permanent box if you want the folder to stay shared for more then one session. Then click ok. Your new folder should show up on the list then click ok. Now start the virtual box and you should see you new folder under devices in the folder menu. Then use this path for the pipeline it should be /media/sf_”name of folder”.
Step 3: Create your experimental index file. Go to /home/user/Pipelines/index/PHENO_DATA.csv and fill this in with your correct experimental information then save and close the file and run the RNAseq pipeline script provided in the virtualbox.
Optional Step 3: Follow the following script prompts and manually run the installed tools using either provided scripts or your own code.
Step 4: User prompts will direct the user to enter the path to an external storage with at least 500G, SRA numbers, which references they would prefer to use from the options, whether the datasets are single or paired, which type of design will be used i.e. how many conditions your experimental design has, and lastly whether or not to run variant calling part of the pipeline. See code below for specifics
Optional Step 4: The user can by pass the SRA download by simply placing there experimental files into the correct path /home/user/ncbi/public/sra and then entering those file names instead of SRA numbers when prompted.
Code:
echo "Please enter the path to external hard drive with at least 1 terabyte of space if you want to run all 18 files"
##this path can be to any hard drive including an external drive which can be added through USB drive.
read path
mkdir $path
cd $path
echo "Please enter up to 18 sra file numbers seperated by a space with no punctuation for example SRR0000000 SRR0000000 SRR0000000."
read varname1 varname2 varname3 varname4 varname5 varname6 varname7 varname8 varname9 varname10 varname11 varname12 varname13 varname14 varname15 varname16 varname17 varname18
echo "Please enter HISAT2 index file choice from one of the following options: GRCh37, GRCh37_snp, GRCh37_tran, GRCh37_snp_tran, GRCh38, GRCh38_snp, GRCh38_tran, GRCh38_snp_tran"
read variable
echo "Please enter library layout type: single or paired"
read library
echo "Choose one of the following DESeq3 design set-ups: 1.) if your PHENO table contains just one condition1 column enter design. 2.) if you PHENO table contains two conditions both condition1 and condition2 columns are present in the table then enter design2."
read design
echo "Would you like to do variant calling?"
read variant
Step 5: This is the step where the pipeline will download all necessary reference files for the pipeline. The user input will be the instructions for which set of references and indexes to download. See the code below for more details.
Optional Step 5: If the user prefers to use their own or need something other then human reference files these reference can be downloads separately however we do not recommend using anything other then ensembl indexes and references because the downstream tools are designed to take ensembl annotated data and the other references and indexes need extra organizing and filtering steps not provided in this pipeline. Once you have downloads and followed the following code to properly organize these files they should be moved to the working directory selected in the first prompt.
Code:
echo "Now Downloading sra files and moving them to $path"
prefetch $varname1 $varname2 $varname3 $varname4 $varname5 $varname6 $varname7 $varname8 $varname9 $varname10 $varname11 $varname12 $varname13 $varname14 $varname15 $varname16 $varname17 $varname18
for file in /home/user/ncbi/public/sra/*sra
do
name="$(basename "$file" .sra)"
echo "Moving $name"
mkdir -p "$name"
mv "$file" "$name"
done
if [ "$variable" == "GRCh37_snp_tran" ]; then
echo "downloading, extracting, and moving index files"
wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/grch37_snp_tran.tar.gz
echo "Uncompressing and moving file"
tar xzvf grch37_snp_tran.tar.gz
mv $path/grch37*/genome_snp_tran.{1..8}.ht2 $path/genome.{1..8}.ht2
wget ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.75.cdna.all.fa.gz
echo "Uncompressing and moving file"
gunzip Homo_sapiens.GRCh37.75.cdna.all.fa.gz
mv $path/*.fa $path/ref.fa
wget ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
echo "Uncompressing and moving file"
gunzip Homo_sapiens.GRCh37.75.gtf.gz
mv $path/*.gtf $path/ref.gtf
wget ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
mv $path/*.primary_assembly.fa $path/ref1.fa
perl -e 'use File::Temp qw/tempdir/; use IO::File; $d=tempdir; $fh; map{if(m/^\>(\S+)\s/){$fh=IO::File->new(">$d/$1.fa");} print $fh $_;}`cat ref1.fa`; foreach $c(1..22,X,Y,MT){print `cat $d/$c.fa`}; print `cat $d/GL*`' > ref2.fa
java -jar /home/user/Pipelines/HISAT2_pipeline/picard.jar CreateSequenceDictionary REFERENCE=$path/ref2.fa OUTPUT=$path/ref2.dict
samtools faidx $path/ref2.fa
wget https://sourceforge.net/projects/snpeff/files/databases/v4_3/snpEff_v4_3_GRCh37.75.zip
unzip snpEff_v4_3_GRCh37.75.zip
wget ftp://[email protected]/bundle/b37/dbsnp_138.b37.vcf.gz
gunzip dbsnp_138.b37.vcf.gz
mv dbsnp_138.b37.vcf dbsnp.vcf
fi
Step 6: Conversion of SRA files to fastq files and alignment using HISAT2. There is a user defined option at the beginning asked for paired or single end reads. At this time the default pipeline is set to paired end reads and only paired end reads can be used for variant calling for the default pipeline. Single end reads do not have the accuracy needed during alignment and re-mapping during variant calling to be accurate enough for RNA editing detection.
Optional Step 6A: If you already have fastq files these can manually be moved into the working directory in a folder with that samples name containing all fastq files associated with that sample name.
Optional Step 6B: If you prefer to use STAR as an aligner this is tool is included as well as Salmon but these are not recommended in this pipeline. STAR requires at least 30G of RAM to use human reference genomes which most computers do not have access to and Salmon does not provide an accurate reliable alignment especially for single end reads for further downstream processing (supplemental figure X).
Code:
if [ "$library" == "single" ]; then
for fn in $varname{1..18};
do
samp=`basename ${fn}`
echo "Running fastq-dump to convert .sra to .fastq ${samp}"
fastq-dump ${fn}/${samp}.sra \
-O $path/${samp}
done
for fn in $varname{1..18};
do
samp=`basename ${fn}`
echo "Running HISAT2 alignment for single layout and with previously downloaded reference sequences for ${samp}"
hisat2 -q -x $path/genome -p3 --dta-cufflinks \
-U ${fn}/${samp}.fastq \
-t --summary-file $path/${samp}/${samp}.txt -S $path/${samp}/${samp}.sam
done
fi
if [ "$library" == "paired" ]; then
for fn in $varname{1..18};
do
samp=`basename ${fn}`
echo "Running fastq dump to convert .sra file to 2 mate .fastq files for ${samp}"
fastq-dump ${fn}/${samp}.sra \
-I --split-files \
-O $path/${samp}
done
for fn in $varname{1..18};
do
samp=`basename ${fn}`
echo "Running HISAT2 alignment for paired layout with preiously downloaded reference sequences for ${samp}"
hisat2 -q -x $path/genome -p3 --dta-cufflinks \
-1 ${fn}/${samp}_1.fastq \
-2 ${fn}/${samp}_2.fastq \
-t --summary-file $path/${samp}/${samp}.txt -S $path/${samp}/${samp}.sam
Done
fi
Step 7: The next steps are conversion from sam files to bam files using picard java tool. These bam files are then used for the transcriptome assembly and counting step. Stringtie is the tool used for transcriptome assembly because it simulataneous counts and assembles the transcript creating a more accurate count of individual transcripts due to a more accurate assembly of different isoforms.
Optional Step 7A: The user can choose to convert sam files to bam files using samtools which is installed in the virtualbox but not utilized in the pipeline.
Optional Step 7B: The user can choose to use another assembly tool such as cuffdiff which is included in the virtualbox but not used in the current pipeline due to recommended use of stringtie with HISAT2 alignment tool.
Code:
for fn in $varname{1..18};
do
samp=`basename ${fn}`
echo "Starting sorting with java picard tool to convert .sam to .bam for ${samp}"
java -jar /home/user/Pipelines/HISAT2_pipeline/picard.jar SortSam INPUT=$path/${samp}/${samp}.sam OUTPUT=$path/${samp}/${samp}.bam SORT_ORDER=coordinate
done
echo "Strating stringtie to create count matrics files to be used in downstream differential expression analysis for ${samp}"
stringtie ${samp}/${samp}.bam \
-p3 -G $path/ref.gtf -A $path/Counts/${samp}.tab -l -B -b $path/ballgown_in/${samp} -e -o $path/ballgown/${samp}.gtf
Done
Step 8: At this point DEXseq and DESeq2 input count matrixes are created from stringtie output files using python scripts included in the R packages for those programs. Although the DEXseq files are not currently used in this pipeline they are created for independent use by the user in an option step. The DESeq2 files are created to use in downstream analysis for differential gene expression. Another important step in this part is to construct and move files to the correct directories for the python script to work properly.
Optional Step 8A: The user can use the created DEXseq files to run an exon count usage for isoform discovery however this step in currently not included in the pipeline but there are plans to include it in future updates.
Optional Step 8B: If the user chooses to manually run tools make sure the .gtf files and in this folder structure working_directory/ballgown/sample01/sample01.gtf and so on for all the files. If they are not in this folder structure then the python scripts will not work.
Code:
echo "Starting python script to find exon counts for DEXSeq2"
for i in 0{1..9} {10..18}
do
python /home/user/Pipelines/HISAT2_pipeline/bin/dexseq_prepare_annotation.py $path/ballgown/${samp}.gtf $path/DEXSeq/${samp}.gff
done
echo "Starting python script to find exon counts for DEXSeq2"
for i in 0{1..9} {10..18}
do
python /home/user/Pipelines/HISAT2_pipeline/bin/dexseq_count.py $path/ballgown/${samp}.gff $path/${samp}/${samp}.sam $path/DEXSeq/${samp}
done
echo "Starting python script to find gene and transcript counts"
python /home/user/Pipelines/HISAT2_pipeline/bin/prepDE.py -g $path/Results/gene_count_matrix.csv -t $path/Results/transcript_count_matrix.csv
Step 9: The first of two provided Rscripts will run for gene level differential expression analysis (GLDE.R). This Rscript is designed to use DESeq2 to analysis differential expression at the gene level. There are also filtering steps and “shrinkage” steps to help eliminate false fold change detection in low read count samples as well as control for extreme outliers in counts. This Rscript code is provided below and will provide the principal component analysis of the experiment followed by the creation of all figures for publication at the gene level.
Tools used: DESeq2 for complete R packages list see end of code below.
Input: gene_count_matrix.csv (created from the python scripts from step 7)
Output Files: PCA plots, MA plots, log gene count plots, table containing all differentially expressed genes with fold changes and statistical analysis columns, top 60 differentially expression genes hierarchal clustering heat map, bar graphs and line plots of ADAR and VEGF gene counts, tables of ADAR and VEGF gene counts with differential expression statistical analysis columns, heatmaps of genes involved in pathways provided and tables of genes involved in those pathways including counts and statistical analysis columns.
Rscript used: GLDE.R
Indexes or References used: pathways included in the virtualbox these include: neural development, brain development, pns development, cns development, VEGFA expression pathway, myeloid cell differentiation up and down regulation pathways as well as myeloid cell homeostasis pathway.
Optional Step 9A: If the user prefers to manually enter the code and use the same code provided it must be done in the order it is provided or certain parts might not work correctly.
Optional Step 9B: If the user would like different genes then ADAR or VEGF the code can easily be manually changes in the Rscripts by replacing all ADAR with the gene name the user chooses and then running the pipeline or R scrips can be run manually entering code after replacing all ADAR with the gene of interest.
Optional Step 9C: The user can provide there own pathway by creating an ensembl gene list of genes involved in that pathway as a comma deliminated text file. Simply move that .csv file into the working directory before strating the pipeline. Then insert the name of that file into the optional code included at the end of the GLDE.R file. Then run the pipeline from the beginning or manually enter the R code into the command prompt.
Code: see GLDE.R script included in the Pipelines directory on the virtualbox for code details and descriptions
Step 10: Is the same as step 8 but with transcript level counts. DESeq2 is used to analysis differential expression of transcripts to discover transcriptome diversity. Again ADAR and VEGF isoforms are shown in the pipeline.
Optional Step 10A: Again if the user manually enters the code into the command prompt instead of using script the steps must be done in the correct order or downstream figures cannot be generated.
Optional Step 10B: If the user wishes to explore other genes transcriptome diversity besides ADAR or VEGF simply replace ADAR with the gene of interest in the code and run as usual from the beginning. Or if is preferred run the pipeline and then manually enter the R script with the replaced ADAR gene with the user gene of interest.
Code: See TLDE.R script included in the Pipelines directory on the virtualbox for code details and descriptions
Step 11: This step runs R package topGO to do gene enrichment analysis on the top and down regulated genes over 2 fold expression as calculated by DESeq2 in the previous steps. We use a classic algorithm and a fisher statistic to calculate root nodes for the pathway maps as the resulting output.
Optional Step 11A: these tests can be changes in the code to any of the other statistical tests preferred this was simiply picked as the tutorial example as they are good tests to choose in our case and then the pipeline can be run similar
Code: see topGO.R script included in the Pipelines directory on the virtualbox for code details and descriptions
Step 12: Variant Calling pipeline for RNA editing detection. This is provided as a separate script as well as an option in the pipeline. The variant calling tools used require substantially more memory and more time then the differential expression analysis so it is included as an option and doesn’t have to be run with every experiment.
The variant calling pipeline was taken from GATK best practices and recommendations made for RNA editing. The parameters and options were suggestions made to decrease false positive discovery rates as well as increase accuracy for RNA editing.
Code: see VariantCalling.sh bash script included in the Pipelines directory on the virtualbox for the code details and descriptions.