Learning objectives:
- Install software (fastqc, multiqc, trimmomatic) via conda
- download data
- visualize read quality
- quality filter and trim reads
If you don't have one running currently, start up a Jetstream m1.medium or larger instance (as detailed here), and then ssh
login through your terminal (as shown here).
You should now be logged into your Jetstream computer! You should see something like this:
(base) dibada@js-171-9:~$
Change to your home directory:
cd ~/
Let's make sure our conda channels are loaded, just in case.
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
and install FastQC, MultiQC, and trimmomatic:
conda install -y -c bioconda fastqc multiqc trimmomatic
Make a "data" directory:
cd ~/
mkdir data
cd data
and download some data from the Schurch et al, 2016 yeast RNAseq study. We will download three wildtype yeast datasets, and three mutant datasets (SNF2 knockout; SNF2 is a global transcription regulator).
curl -L https://osf.io/5daup/download -o ERR458493.fastq.gz
curl -L https://osf.io/8rvh5/download -o ERR458494.fastq.gz
curl -L https://osf.io/2wvn3/download -o ERR458495.fastq.gz
curl -L https://osf.io/xju4a/download -o ERR458500.fastq.gz
curl -L https://osf.io/nmqe6/download -o ERR458501.fastq.gz
curl -L https://osf.io/qfsze/download -o ERR458502.fastq.gz
curl -L https://osf.io/9c5jz/download -o err_md5sum.txt
We also downloaded a file holding some md5sum values for our sequence files ("err_md5sum.txt"). An md5sum hash is like a unique fingerprint for a file. If we compute the md5sum for the files we downloaded, and they don't match what is specified in the "err_md5sum.txt" file that came with them, then we would know if any of the files were corrupted or cut short while we were downloading them.
If we look at what's in that file:
head err_md5sum.txt
We can see a value for each of our sequence files:
2b8c708cce1fd88e7ddecd51e5ae2154 ERR458493.fastq.gz
36072a519edad4fdc0aeaa67e9afc73b ERR458494.fastq.gz
7a06e938a99d527f95bafee77c498549 ERR458495.fastq.gz
107aad97e33ef1370cb03e2b4bab9a52 ERR458500.fastq.gz
fe39ff194822b023c488884dbf99a236 ERR458501.fastq.gz
db614de9ed03a035d3d82e5fe2c9c5dc ERR458502.fastq.gz
Those are the md5sums they are supposed to have if they transferred properly. We can have the computer calculate md5sums for the files we have now and check that they match up like this:
md5sum -c err_md5sum.txt
And if all is well, it will print out "OK" for each of the files:
ERR458493.fastq.gz: OK
ERR458494.fastq.gz: OK
ERR458495.fastq.gz: OK
ERR458500.fastq.gz: OK
ERR458501.fastq.gz: OK
ERR458502.fastq.gz: OK
Often the sequencing facility will provide an md5sum file with your data, or you may see them with large datasets in public repositories.
If we wanted to create our own md5sum file to keep with our data, we could do so like this for one file:
md5sum ERR458493.fastq.gz > my_md5sum.txt
Or we could use the *
wildcard to help specify all of our .fastq.gz files here (overwriting the previous file):
md5sum *.fastq.gz > my_md5sum.txt
Back to our data, if we type:
ls -l *.fastq.gz
We should see something like:
-rw-rw-r-- 1 dibada dibada 59532325 Jul 4 03:23 ERR458493.fastq.gz
-rw-rw-r-- 1 dibada dibada 58566854 Jul 4 03:24 ERR458494.fastq.gz
-rw-rw-r-- 1 dibada dibada 58114810 Jul 4 03:24 ERR458495.fastq.gz
-rw-rw-r-- 1 dibada dibada 102201086 Jul 4 03:24 ERR458500.fastq.gz
-rw-rw-r-- 1 dibada dibada 101222099 Jul 4 03:24 ERR458501.fastq.gz
-rw-rw-r-- 1 dibada dibada 100585843 Jul 4 03:24 ERR458502.fastq.gz
These are six data files from the yeast study.
One problem with these files is that they are writeable – which means we can edit or delete them. By default, UNIX makes things writeable by the file owner. This poses an issue with creating typos or errors in raw data. Let's fix that before we go on any further:
chmod a-w *
Here we are using the chmod
command (for change mode, which handles permissions of files/directories), and providing the argument a-w
, which means for all users (a
), take away (-
) the permission to write (w
).
If we take a look at their permissions now:
ls -l
We can see that the 'w' in the original permission string
(-rw-rw-r--
) has been removed from each file and now it should look like
-r--r--r--
.
We'll talk about what these files are below.
First, make a new working directory:
mkdir ~/quality
cd ~/quality
Now, we're going to make a "link" to our quality-trimmed data in our current working directory:
ln -fs ~/data/* .
and you will see that they are now linked in the current directory when you do an
ls
. These links save us from having to specify the full path (address) to their location on the computer, without us needing to actually move or copy the files. But note that changing these files here still changes the original files!
These are FASTQ files -- let's take a look at them:
gunzip -c ERR458493.fastq.gz | head
FASTQ files are sequencing files that contain reads and quality information about each base-call within each read. A score of 0 is the lowest quality score, and a score of 40 is the highest. Instead of typing the scores as numbers, they are each encoded as a character.
Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
| | | | |
Quality score: 0........10........20........30........40
A score of 10 indicates a 1 in 10 chance that the base is inaccurate. A score of 20 is a 1 in 100 chance that the base is inaccurate. 30 is 1 in 1,000. And 40 in 1 in 10,000.
Links:
We're going to use FastQC summarize the data. We already installed 'fastqc' above, with the conda command.
Now, run FastQC on two files:
fastqc ERR458493.fastq.gz
fastqc ERR458500.fastq.gz
Now let's use ls
:
ls *fastqc.zip
to list the files, and you should see:
ERR458493_fastqc.zip
ERR458500_fastqc.zip
Inside each of those zip files are directories holding reports from the fastqc program. But these are also nicely summarized in HTML files:
ls *.html
Let's transfer the HTML files to our local computers so we can open and view
them in our browsers. We'll use the command scp
to do this.
scp
stands for secure copy. We can use this command to transfer files between
two computers. We need to run this command on our local computers (i.e. from
your laptop).
The scp
command looks like this:
scp <file I want to move> <where I want to move it>
First we will make a new directory on our computer to store the HTML files we’re
transfering. Let’s put it on our desktop for now. Make sure the terminal program
you used this morning to ssh
onto your instance is open. If you're in it now,
you can open a new tab you can use the pull down menu at the top of your
screen or the cmd + t
or maybe ctrl + t
keyboard shortcut). Type:
mkdir ~/Desktop/fastqc_html
Now we can transfer our HTML files to our local computer using scp
, which will look something like this:
scp username@ipaddress:~/quality/*.html ~/Desktop/fastqc_html/
The first part of the command is the address for your remote
computer. Make sure you replace everything after username@ with your IP address (the one you used to log in with the ssh
command).
The second part starts with a : and then gives the absolute path of the files
you want to transfer from your remote computer. Don’t forget the :
. We use a
wildcard (*.html) to indicate that we want all of the HTML files.
The third part of the command gives the absolute path of the location you want
to put the files. This is on your local computer and is the directory we just
created ~/Desktop/fastqc_html
.
You will be prompted for your password. Enter the password we set for your instance during the morning session.
You should see a status output like this:
ERR458493_fastqc.html 100% 642KB 370.1KB/s 00:01
ERR458500_fastqc.html 100% 641KB 592.2KB/s 00:01
Now we can go to our new directory and open the HTML files.
You can also download and look at these copies of them:
These files contain a lot of information! Depending on your application, some of these modules are more helpful than others. For example, for RNA-seq data, failing duplication levels aren't necessarily a problem -- we expect to have duplicated sequences given that some transcripts are very abundant.
Questions:
- What should you pay attention to in the FastQC report?
- Is one sample better than another?
Links:
There are some important things to know about FastQC - the main one is that it only calculates certain statistics (like duplicated sequences) for subsets of the data (e.g. duplicate sequences are only analyzed within the first 100,000 sequences in each file). You can read more about each module at their docs site.
Now we're going to do some trimming! Let's switch back to our instance terminal, as we will be running these commands on our remote computers. We'll be using Trimmomatic, which (as with fastqc) we've already installed via conda.
The first thing we'll need is a file holding the adapters we need to trim off. These are artificial and added as part of the sequencing process. The trimmomatic program comes with standard adapter files, so here we are going to copy them over to our current working directory.
cp /opt/miniconda/pkgs/trimmomatic-*/share/trimmomatic-*/adapters/TruSeq2-SE.fa .
(you can look at the contents of this file with cat TruSeq2-SE.fa
)
Now, let's run Trimmomatic on a couple samples:
trimmomatic SE ERR458493.fastq.gz \
ERR458493.qc.fq.gz \
ILLUMINACLIP:TruSeq2-SE.fa:2:0:15 \
LEADING:2 TRAILING:2 \
SLIDINGWINDOW:4:2 \
MINLEN:25
trimmomatic SE ERR458500.fastq.gz \
ERR458500.qc.fq.gz \
ILLUMINACLIP:TruSeq2-SE.fa:2:0:15 \
LEADING:2 TRAILING:2 \
SLIDINGWINDOW:4:2 \
MINLEN:25
CODE BREAKDOWN
Note that trimmomatic is actually a java program that we call from the command line, so it's syntax is a little different than what we've been using. The arguments are followed by a colon, and then what you are specifying to them.
SE
- this is specified because we are working with single-ended data, rather than paired-end data where there are forward and reverse readsERR458493.fastq.gz
- the first positional argument we are providing is the input fastq file (note it is okay to provide compressed .gz files to trimmomatic)ERR458493.qc.fq.gz
- the second positional argument specifies the name of the output files the program will generateILLUMINACLIP:TruSeq2-SE.fa:2:0:15
- here we are specifying the argument we're addressing "ILLUMINACLIP", first specifying the file holding the adapter sequences, then 3 numbers: "2" which states how many mismatches between the adapter sequence and what's found are allowed; "0" which is only relevant when there are forward and reverse reads; and "15" which specifies how accurate the match must beLEADING:2 TRAILING:2
- both of these are stating the minimum quality score at the start or end of the read, if lower, it will be trimmed offSLIDINGWINDOW:4:2
- this looks at 4 basepairs at a time, and if the average quality of that window of 4 drops below 2, the read is trimmed thereMINLEN:25
- after all the above trimming steps, if the read is shorter than 25 bps, it will be discarded
You should see output that looks like this:
...
Input Reads: 1093957 Surviving: 1092715 (99.89%) Dropped: 1242 (0.11%)
TrimmomaticSE: Completed successfully
We can also run the same process for all 6 samples more efficiently using a for
loop, as follows. We're going to use a
new command called basename
in the for loop, which we introduce below.
basename
is a function in UNIX that is helpful for removing a uniform part of a name from a list of files.
In this case, we will use basename to remove the .fastq.gz
extension from the files that we’ve been working with.
basename ERR458493.fastq.gz .fastq.gz
We see that this returns just the SRR accession, and no longer has the .fastq file extension on it.
ERR458493
If we try the same thing but use .fasta
as the file extension instead, nothing happens.
This is because basename only works when it exactly matches a string in the file.
basename ERR458493.fastq.gz .fasta
ERR458493.fastq.gz
Basename is really powerful when used in a for loop. It allows to access just the file prefix, which you can use to name things. Let's try this.
Inside our for loop, we create a new name variable. We call the basename function inside the parenthesis,
then give our variable name from the for loop, in this case ${filename}
, and finally state that .fastq.gz
should be removed from the file name. It’s important to note that we’re not changing the actual files, we’re
creating a new variable called name. The line > echo $name will print to the terminal the variable name each
time the for loop runs. Because we are iterating over two files, we expect to see two lines of output.
for filename in *.fastq.gz
do
name=$(basename ${filename} .fastq.gz)
echo ${name}
done
A note on variables: We can either refer a variable as $filename
or ${filename}
. They mean the same thing,
but using {}
explicitly tells bash when we are done refering to the variable. This makes it safer to refer
to variables, especially when we're combining it with other text. This is different than ()
, which tells bash
that we want to execute something.
Now we will use basename
inside of our for loop to run trimmomatic on all 6 of our files.
for filename in *.fastq.gz
do
# first, make the base by removing fastq.gz
base=$(basename $filename .fastq.gz)
echo $base
trimmomatic SE ${base}.fastq.gz \
${base}.qc.fq.gz \
ILLUMINACLIP:TruSeq2-SE.fa:2:0:15 \
LEADING:2 TRAILING:2 \
SLIDINGWINDOW:4:2 \
MINLEN:25
done
The for
loop above will go through each for the filenames that end with fastq.gz
and
run Trimmomatic for it.
Questions:
- How do you figure out what the parameters mean?
- How do you figure out what parameters to use?
- What adapters do you use?
- What version of Trimmomatic are we using here? (And FastQC?)
- Do you think parameters are different for RNAseq and genomic data sets?
- What's with these annoyingly long and complicated filenames?
For a discussion of optimal trimming strategies, see MacManes, 2014 -- it's about RNAseq but similar arguments should apply to metagenome assembly.
Links:
Let's take a look at the output files:
gunzip -c ERR458493.qc.fq.gz | head
It's hard to get a feel for how different these files are by just looking at them, so let's run FastQC again on the trimmed files:
fastqc ERR458493.qc.fq.gz
fastqc ERR458500.qc.fq.gz
And now view my copies of these files:
Alternatively, if there's time, you can use scp
to view the files from your
local computer.
MultiQC is a great tool that aggregates multiple fastqc results into a single report for easy comparison.
Run Mulitqc on both the untrimmed and trimmed files
multiqc .
And now you should see output that looks like this:
[INFO ] multiqc : This is MultiQC v1.0
[INFO ] multiqc : Template : default
[INFO ] multiqc : Searching '.'
Searching 15 files.. [####################################] 100%
[INFO ] fastqc : Found 4 reports
[INFO ] multiqc : Compressing plot data
[INFO ] multiqc : Report : multiqc_report.html
[INFO ] multiqc : Data : multiqc_data
[INFO ] multiqc : MultiQC complete
You can view the output html file multiqc_report.html by going to RStudio, selecting the file, and saying "view in Web browser."
Questions:
- is the quality trimmed data "better" than before?
- Does it matter that you still have adapters!?