I am shamelessly forking the RNAseq pipeline repo managed by Bastian Schiffthaler
The dockerhub page for the original pipeleine is here!
I`ll be running tests on modifications and adjustments to the pipeleine updating tools versions to the latest/or stable ones, so work is not guaranteed.
Below I leave some of the original commentaries by Bastian Shiffthaller. Later I might modify the README instructions based on the adjusments I might introduce into the pipeline.
As of 2024-11-26 added rootless user, need to specify during build.
Build the docker with:
docker build -t <my_image_name> base
To build image under the current user. Or you may check your user using id
command in terminal and changing the dockerfile specs.
docker build --no-cache \
--build-arg USER_ID=1001 \
--build-arg GROUP_ID=1001 \
--build-arg USER=mogilenko_lab \
--build-arg GROUP=mogilenko_lab \
-t rnaseq-pipeline . > >(tee -a log.txt) 2> >(tee -a log.txt >&2)
In case of errors, it`s usefull to build an image from scratch without using the cache from interim builds logging all the stderr and stdout:
docker build --no-cache -t <my_image_name> . > >(tee -a log.txt) 2> >(tee -a log.txt >&2)
Ready-to-work docker for next generation sequence analysis including binaries:
- Sequence data QC (FastQC, MultiQC) [15]
- Trimming (Trimmomatic) [1]
- Genome mapping (STAR [3] , BWA [4] , salmon[16])
- Feature Summarisation (HTSeq) [5]
- File manipulation and exploration (samtools,htslib,bcftools) [8],[9]
- Alignment visualisation (JBrowse) [10]
- Peak calling (MACS2) [11]
- Sequence data analysis (Useq) [12]
- Binding site determination (SISSRs) [13]
One tools functionality has so far been added in comparison to the original repo:
A couple of tools have been silenced from the original repo. These are:
- rRNA filtering (SortMeRNA) [2]
- kallisto [14]
I`ve added a couple of scripts, that made my life a little bit easier. Complete list of scripts and description is here. Some of the main scripts are the ones that set the core of the pipeline:
getPreAlignmentQC.sh
runSTARalign.sh
getPostAlignmentQC.sh
htseqCheckStrand.sh
runFeatureCounts.sh
What the scripts do is pretty self-explanatory. For complete list and detailed description look here
R functionality has been completely turned off and silenced for now. Since the original repo the docker base bioconductor/release_base2 [6] has been deprecated. The current installation is based on Ubuntu 20.04, which is somewhat heavy (The resulting docker image is around 7Gb).
The source (+Dockerfile) which was used to build this container, is here on GitHub!
The tutorial below is for the time being is a duplicate of the Bastian Schiffthaler`s](https://github.com/bschiffthaler) RNAseq pipeline repo. I`ll be populating it later with my own understanding on how I like to use the container.
Below is the general logic of the pipeline.
Note! The pipeline is just a set of separate scripts currently. You need to tailor the scripts, the paths to your use case. It`s not a press-one-button story for the moment.
Even if meant for training purposes, this Docker should allow a scientist to carry out an RNA-Seq based differential expression study as described in this protocol.
For reasons of brevity, I will not be going into great detail (see the protocol), but I will give a short overview of common use cases in differential expression analysis.
- Basics
- Mounting host directories inside the docker container
- Technical quality control with FastQC
- rRNA sorting with SortMeRNA
- Quality trimming with Trimmomatic
- Mapping to the reference genome with STAR
- Creating a STAR genome
- Mapping my reads
- Visualisation of the alignments in JBrowse
- Feature summarisation using HTSeq
- Downstream analysis using RStudio
- References
I like to mess with my tools interactively, using Docker container as a virtual environment. For that purpose I prefer to run a Docker off of my built Docker image, and then work from inside the Docker container`s interactive bash shell. The downside is that once you leave the shell the container stops, and need to be restarted again to work with it. If you left tools working inside the container, they`ll abort. Spo here`s my standard approach.
First, I create a tmux session. I name the session with project I`m working on
tmux new -s <session_name>
Then I create a docker container based off of the Docker image, I built earlier. I mount all the necessary folders, that I`ll need to have for my work
docker run --name <docker_container_name> \
-v /path/to/Docker/base/scripts:/scripts \
-v /path/to/data/folder/with/fasta_reads:/data/Mecr \
-v /path/to/reference_genome:/reference_genome \
-it <docker_image_name> bash
Once you run the command you end up in the interactive bash
shell under the Docker container environment. Keep in mind that all the paths will be relative to the Docker virtual environment.
After that I perform all my processing from the Docker container shell. Oftentimes, I start the tools and leave them work under the Docker environment.
If a particular tools is running and doing it`s job, I leave the tmux session by pressing 'Ctrl & B' + 'D', consequtively. This detaches you from the tmux session, while the Docker container will keep on running. Tmux session will prevent Docker container from stopping or exiting, even after the tools under the Docker container will finish their job.
To re-attach to the tmux session with the Docker container shell environment you just need to type
tmux attach -t <tmux_session>
You will end up in the Docker container shell.
If you need to kill the session, which essentially will stop the container as well (unless open in parallel in some other tmux session), you just type 2 times
exit
Which get`s you out of the container (which essentially will stop the container as well, unless open in parallel in some other tmux session).
If you ever need to reconnect to the Docker container shell once again (while it is running), you just type
docker exec -it <docker_container_name> bash
If the docker container has been stopped, for whatever the reason, you can just start it once again with
docker start -ai <stopped_docker_container_name>
To check, which containers are currently running
docker ps
To check all container, including the stopped ones
docker ps -a
To delete a docker container
docker rm <docker_container_name OR identifier>
To delete a docker image
docker rmi <docker image_name OR identifier>
In order to analyse your data from within the docker, you need to mount the directory containing it in the docker container. In this example, I have a FASTQ file called my_sample.fq.gz in my current working directory and would like to analyse it with software from my docker. I therefore mount my current working directory in the /data
folder within the container. The --rm
flag tells docker to clean up the process after it's done.:
docker run --rm -v $(pwd):/data bschiffthaler/ngs zcat /data/my_sample.fq.gz | head
The argument -v $(pwd):/data
tells docker to mount the current working directory (from the pwd
command) in the folder called /data
within the container. Then, the zcat
command is executed, which decompresses the file and streams the contents to the terminal. We pipe (using |
) the output to the head
command so that we only read the first few lines. Of course, this example is almost entirely non practical as you could do all of this without having the docker image.
As a small prologue to this section I would like to differentiate between technical and biological QA. As an example, technical QA would show sequencing errors, PCR over-amplification and similar issues, while a sample swap during the RNA extraction would not show up during this part. For the lack of a better term, we call methods to determine the latter type of issues "biological QA". This part is about technical QA.
After you receive your sequencing data, and after every step that modifies it (trimming, rRNA sorting) QA should be done. Here, we will be using Simon Andrews' FastQC:
#Here I assume that I have a FASTQ file called 202_subset_1.fq.gz in my current working directory
docker run --rm -v $(pwd):/data bschiffthaler/ngs fastqc /data/202_subset_1.fq.gz
This will output the FastQC report as html in the directory. For interpretation of that report, please refer to the manual or the aforementioned protocol.
Sorting out rRNA contamination with sortmerna
can be carried out in a similar manner. Be mindful, whether rRNA sorting is indeed necessary. The rRNA databases which come with sortmerna
are in /usr/share/rRNA_databases
and have been pre-indexed during the build of the docker. The indexes are in /usr/share/rRNA_databases/index
and are always prefixed with the same name (minus the ".fasta" ending) as the source FASTA file. The correct syntax of the command can be taken from sortmerna -h
and the sortmerna manual.
To list the available databases:
docker run --rm bschiffthaler/ngs find /usr/share/rRNA_databases -name "*.fasta"
For convenience, I included an environment variable $SORTMERNA_DB, which can be passed to sortmerna
's --ref
argument:
docker run --rm bschiffthaler/ngs bash -c 'sortmerna --ref $SORTMERNA_DB <...>'
Wrapping the command in bash -c
here is crucial, since otherwise the environment variable would not be passed correctly.
If I would like to sort my FASTQ file which I got from sequencing:
docker run -i -t --rm -v $(pwd):/data bschiffthaler/ngs sortmerna --ref \
/usr/share/rRNA_databases/silva-bac-16s-id90.fasta,/usr/share/rRNA_databases/index/silva-bac-16s-id90 \
--reads /data/202_subset_1.fq \
--other /data/202_subset_1_sorted \
--aligned /data/202_subset_1_rRNA_hits --fastx
This outputs two files:
- 202_subset_1_sorted.fq - which contains all reads which show NO hits to an rRNA
- 202_subset_1_rRNA_hits.fq - which contains all reads which DO show hits to an rRNA
In case quality trimming (i.e.: dropping low quality reads or adapter sequences) is desired, this docker comes with Trimmomatic. I will again leave the specifics to the Trimmomatic manual and refer to the protocol for discussion about trimming. In the example case, I decide to trim my already rRNA-sorted data with Trimmomatic. To do so, I run:
docker run -i -t --rm -v $(pwd):/data bschiffthaler/ngs trimmomatic SE -phred64 \
/data/202_subset_1_sorted.fq /data/202_subset_1_sorted_trimmed.fq LEADING:10 \
TRAILING:10 SLIDINGWINDOW:5:20 \
ILLUMINACLIP:/usr/share/Trimmomatic-0.33/adapters/TruSeq2-SE.fa:2:30:10
This leaves me with a quality trimmed output file called "202_subset_1_sorted_trimmed.fq", which I will now map to the genome in the next step.
After downloading an appropriate reference genome assembly for my organism, I generate a STAR genome which STAR will use during the mapping step. NOTE: This step might require workstation with a somewhat large amount of RAM (see STAR manual). If you do not have that available there are several parameters available which will reduce the memory requirement (at the cost of mapping speed). Those are described in the appendix of the STAR manual. If available, it is highly advised to feed the annotation in the form of a GTF file into STAR's genomeGenerate
. If the annotation is provided as GFF, the cufflinks
software suite comes with a tool called gffread
that is capable of converting the formats:
docker run -i -t --rm -v $(pwd):/data bschiffthaler/ngs STAR --runMode \
genomeGenerate --genomeDir /data/Potri_star/ --genomeFastaFiles \
/data/Ptrichocarpa_v3.0_210.fa --sjdbGTFfile /data/Ptrichocarpa_v3.0_210_gene_exons.gtf \
--sjdbOverhang 99
The folder "Potri_star" (which I created beforehand) now holds the STAR genome.
As always, refer to the manual for exact details on STAR's command line options. To summarise quickly, I direct STAR to my previously created genome directory and my sequencing read files, I set my maximum desired intron length to 11,000 (as this is the longest intron in the annotation) and finally, I specify that I would like the output file in the BAM format and sorted by coordinate.
docker run -i -t --rm -v $(pwd):/data bschiffthaler/ngs STAR --genomeDir \
/data/Potri_star --readFilesIn /data/202_subset_1_sorted_trimmed.fq \
--alignIntronMax 11000 --outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix /data/202_subset_1_sorted_trimmed_STAR
Unlike the previous commands, we need to keep the docker container alive and interactive, as JBrowse requires several steps to be functional. We will further need to map port 80 from the docker to the host in order to view alignments in apache. The option -p 80:80
maps the port from the container to the host. Once we start apache, you can then open a web browser and navigate to your IP (on Linux hosts you can simply type "localhost" in the browser's address bar, on OSX and Windows hosts, you can find your docker's IP address if you run boot2docker ip
from the boot2docker command. Look here for more info.) The other options -ti
tell docker that we want an interactive session, and that we want a (pseudo-)terminal to enter commands.
docker run -p 80:80 -v $(pwd):/data -ti bschiffthaler/ngs bash -l
First, we start by adding our reference(s) to JBrowse. This is done with perl scripts which are delivered with JBrowse. The reference sequence in FASTA format and the reference annotation in GFF3 format are in my current working directory, as are the alignment results in BAM format.
Let's start by adding the FASTA and GFF3 files as tracks in JBrowse:
cd /var/www/html/JBrowse-1.11.6
bin/prepare-refseqs.pl --fasta /data/Ptrichocarpa_v3.0_210.fa
bin/flatfile-to-json.pl --gff /data/Ptrichocarpa_210_gene_exons.gff3 --trackType CanvasFeatures --trackLabel P.trichocarpa_v3.0_gene_exons
This has now added the reference sequence and annotation as tracks in JBrowse. Now let's add our alignments. I added a script to this docker, which automatically adds a read alignment viewer track as well as a SNP and coverage histogram track to the file /var/www/html/JBrowse-1.11.6/data/tracks.conf
. My alignment files are in a sub-folder within my current working directory. Before you run the script to add tracks, please make sure that your alignment file is BAM formatted, sorted and indexed. If you ran STAR
as described above, you should already have a sorted BAM file. If you haven't you can sort the file:
samtools sort /data/<my_BAM_file.bam> -o /data/<my_sorted_BAM_file.bam>
Then index the file
samtools index /data/<my_sorted_BAM_file.bam>
After that is done, we can add the tracks to JBrowse
add_JBrowse_tracks.sh /data
Finally, we can start the web server apache:
service apache2 start
Now, if we open a browser on the host and navigate to the aforementioned IP address (or localhost), we should see a link to JBrowse as well as the training user's home folder. Click on the JBrowse link, which will start JBrowse and browse your alignments!
I added a screenshot here of what you should see!
The final step before the actual differential expression testing is the feature summaristaion using HTSeq, specifically htseq-count
. In this case we need to run the executable from within bash -c
like this: bash -c 'htseq-count <arguments>'
. The reason is that htseq-count does not write to a file, but writes to stdout, which means that we need to redirect stdout to a file. We could do docker run <...> htseq-count <...> > my_counts.txt
, but only the htseq-count command would be run within docker and the redirection would be handled by the host, which has the unfortunate consequence that stderr output would be included in the file. This means that my_counts.txt would also contain diagnostic messages like "1,000,000 SAM lines processed". We work around this by running. Here is the htseq-count documentation to read about the options:
docker run -i -t --rm -v $(pwd):/data bschiffthaler/ngs bash -c 'htseq-count -f bam -r \
pos -s no -t exon -i Parent /data/202_subset_1_sorted_trimmed_STARAligned.sortedByCoord.out.bam \
/data/Ptrichocarpa_v3.0_210_synthetic-gene-models.gff3 > \
/data/202_subset_1_sorted_trimmed_STAR_HTSeq.txt'
Finally, the data is ready to be analysed by your differential expression testing tool of choice (e.g.: edgeR, DESeq2, limma, ...). We can therefore start RStudio like this:
docker run -p 8787:8787 --rm -v $(pwd):/data bschiffthaler/ngs
This spawns an RStudio server, which you can reach by navigating to the host's IP on port 8787 in your web browser i.e.: if my IP is 192.168.1.10 (you can check this by running ifconfig
on UNIX based systems and ipconfig
on Windows) I would open my broweser and type:
192.168.1.10:8787
in the address bar. You can then login with the credentials rstudio as both username and password. On windows and Mac OS, docker runs inside a virtual machine, you therefore need to provide the virtual machine's IP (which you see when you run boot2docker
or by checking the $DOCKER_HOST environment variable). From within RStudio, you could run:
setwd("/data")
dir()
You should see your generated data, which you can now further process inside R. CAVEAT: Save any results etc. only into the folder you have mounted inside the container (in this example "/data") as docker is ephemeral (when run with the '--rm' option) and will not keep any files.
For the actual analysis, I would recommend the documentation of DESeq2, edgeR or limma. An R transcript of how the example data was analysed is available here.
[1] Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics (Oxford, England), 1–7. doi:10.1093/bioinformatics/btu170
[2] Kopylova, E., Noé, L., & Touzet, H. (2012). SortMeRNA: Fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics, 28, 3211–3217. doi:10.1093/bioinformatics/bts611
[3] Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., … Gingeras, T. R. (2013). STAR: Ultrafast universal RNA-seq aligner. Bioinformatics, 29, 15–21. doi:10.1093/bioinformatics/bts635
[4] Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754–1760. doi:10.1093/bioinformatics/btp324
[5] Anders, S., Pyl, P. T., & Huber, W. (2014). HTSeq A Python framework to work with high-throughput sequencing data. bioRxiv. doi:10.1101/002824
[6] Gentleman, R. C., Carey, V. J., Bates, D. M., Bolstad, B., Dettling, M., Dudoit, S., … Zhang, J. (2004). Bioconductor: open software development for computational biology and bioinformatics. Genome Biology, 5, R80. doi:10.1186/gb-2004-5-10-r80
[7] R Core Team. (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, URL http://www.R–project.org/.
[8] Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., … Durbin, R. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25, 2078–2079. doi:10.1093/bioinformatics/btp352
[9] Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27, 2987–2993. doi:10.1093/bioinformatics/btr509
[10] Skinner, M. E., Uzilov, A. V., Stein, L. D., Mungall, C. J., & Holmes, I. H. (2009). JBrowse: A next-generation genome browser. Genome Research, 19, 1630–1638. doi:10.1101/gr.094607.109
[11] Zhang, Y., Liu, T., Meyer, C. A., Eeckhoute, J., Johnson, D. S., Bernstein, B. E., … Liu, X. S. (2008). Model-based analysis of ChIP-Seq (MACS). Genome Biology, 9(9), R137. doi:10.1186/gb-2008-9-9-r137
[12] Nix, D. A., Courdy, S. J., & Boucher, K. M. (2008). Empirical methods for controlling false positives and estimating confidence in ChIP-Seq peaks. BMC Bioinformatics, 9, 523. doi:10.1186/1471-2105-9-523
[13] Jothi, R., Cuddapah, S., Barski, A., Cui, K., & Zhao, K. (2008). Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids Research, 36(16), 5221–5231. doi:10.1093/nar/gkn488
[14] Bray, N. L., Pimentel, H., Melsted, P., & Pachter, L. (2015). Near-optimal RNA-Seq quantification. aRxiv. http://doi.org/arXiv:1505.02710
[15] Ewels, P., Magnusson, M., Lundin, S., & K??ller, M. (2016). MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), 3047–3048. http://doi.org/10.1093/bioinformatics/btw354
[16] Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods.