Skip to content

Latest commit

 

History

History
508 lines (363 loc) · 17.8 KB

quality-and-trimming.md

File metadata and controls

508 lines (363 loc) · 17.8 KB

Short read quality and trimming

Learning objectives:

  • Install software (fastqc, multiqc, trimmomatic) via conda
  • download data
  • visualize read quality
  • quality filter and trim reads

If you don't have one running currently, start up a Jetstream m1.medium or larger instance (as detailed here), and then ssh login through your terminal (as shown here).


You should now be logged into your Jetstream computer! You should see something like this:

(base) dibada@js-171-9:~$ 

Getting started

Change to your home directory:

cd ~/

Let's make sure our conda channels are loaded, just in case.

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

and install FastQC, MultiQC, and trimmomatic:

conda install -y -c bioconda fastqc multiqc trimmomatic

Data source

Make a "data" directory:

cd ~/
mkdir data
cd data

and download some data from the Schurch et al, 2016 yeast RNAseq study. We will download three wildtype yeast datasets, and three mutant datasets (SNF2 knockout; SNF2 is a global transcription regulator).

curl -L https://osf.io/5daup/download -o ERR458493.fastq.gz
curl -L https://osf.io/8rvh5/download -o ERR458494.fastq.gz
curl -L https://osf.io/2wvn3/download -o ERR458495.fastq.gz
curl -L https://osf.io/xju4a/download -o ERR458500.fastq.gz
curl -L https://osf.io/nmqe6/download -o ERR458501.fastq.gz
curl -L https://osf.io/qfsze/download -o ERR458502.fastq.gz
curl -L https://osf.io/9c5jz/download -o err_md5sum.txt

We also downloaded a file holding some md5sum values for our sequence files ("err_md5sum.txt"). An md5sum hash is like a unique fingerprint for a file. If we compute the md5sum for the files we downloaded, and they don't match what is specified in the "err_md5sum.txt" file that came with them, then we would know if any of the files were corrupted or cut short while we were downloading them.

If we look at what's in that file:

head err_md5sum.txt

We can see a value for each of our sequence files:

2b8c708cce1fd88e7ddecd51e5ae2154  ERR458493.fastq.gz
36072a519edad4fdc0aeaa67e9afc73b  ERR458494.fastq.gz
7a06e938a99d527f95bafee77c498549  ERR458495.fastq.gz
107aad97e33ef1370cb03e2b4bab9a52  ERR458500.fastq.gz
fe39ff194822b023c488884dbf99a236  ERR458501.fastq.gz
db614de9ed03a035d3d82e5fe2c9c5dc  ERR458502.fastq.gz

Those are the md5sums they are supposed to have if they transferred properly. We can have the computer calculate md5sums for the files we have now and check that they match up like this:

md5sum -c err_md5sum.txt

And if all is well, it will print out "OK" for each of the files:

ERR458493.fastq.gz: OK
ERR458494.fastq.gz: OK
ERR458495.fastq.gz: OK
ERR458500.fastq.gz: OK
ERR458501.fastq.gz: OK
ERR458502.fastq.gz: OK

Often the sequencing facility will provide an md5sum file with your data, or you may see them with large datasets in public repositories.

If we wanted to create our own md5sum file to keep with our data, we could do so like this for one file:

md5sum ERR458493.fastq.gz > my_md5sum.txt

Or we could use the * wildcard to help specify all of our .fastq.gz files here (overwriting the previous file):

md5sum *.fastq.gz > my_md5sum.txt

Back to our data, if we type:

ls -l *.fastq.gz

We should see something like:

-rw-rw-r-- 1 dibada dibada  59532325 Jul  4 03:23 ERR458493.fastq.gz
-rw-rw-r-- 1 dibada dibada  58566854 Jul  4 03:24 ERR458494.fastq.gz
-rw-rw-r-- 1 dibada dibada  58114810 Jul  4 03:24 ERR458495.fastq.gz
-rw-rw-r-- 1 dibada dibada 102201086 Jul  4 03:24 ERR458500.fastq.gz
-rw-rw-r-- 1 dibada dibada 101222099 Jul  4 03:24 ERR458501.fastq.gz
-rw-rw-r-- 1 dibada dibada 100585843 Jul  4 03:24 ERR458502.fastq.gz

These are six data files from the yeast study.

One problem with these files is that they are writeable – which means we can edit or delete them. By default, UNIX makes things writeable by the file owner. This poses an issue with creating typos or errors in raw data. Let's fix that before we go on any further:

chmod a-w *

Here we are using the chmod command (for change mode, which handles permissions of files/directories), and providing the argument a-w, which means for all users (a), take away (-) the permission to write (w).

If we take a look at their permissions now:

ls -l

We can see that the 'w' in the original permission string (-rw-rw-r--) has been removed from each file and now it should look like -r--r--r--.

We'll talk about what these files are below.

1. Linking data to our working location

First, make a new working directory:

mkdir ~/quality
cd ~/quality

Now, we're going to make a "link" to our quality-trimmed data in our current working directory:

ln -fs ~/data/* .

and you will see that they are now linked in the current directory when you do an ls. These links save us from having to specify the full path (address) to their location on the computer, without us needing to actually move or copy the files. But note that changing these files here still changes the original files!

These are FASTQ files -- let's take a look at them:

gunzip -c ERR458493.fastq.gz | head

FASTQ files are sequencing files that contain reads and quality information about each base-call within each read. A score of 0 is the lowest quality score, and a score of 40 is the highest. Instead of typing the scores as numbers, they are each encoded as a character.

Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
                  |         |         |         |         |
Quality score:    0........10........20........30........40  

A score of 10 indicates a 1 in 10 chance that the base is inaccurate. A score of 20 is a 1 in 100 chance that the base is inaccurate. 30 is 1 in 1,000. And 40 in 1 in 10,000.

Links:

2. FastQC

We're going to use FastQC summarize the data. We already installed 'fastqc' above, with the conda command.

Now, run FastQC on two files:

fastqc ERR458493.fastq.gz
fastqc ERR458500.fastq.gz

Now let's use ls:

ls *fastqc.zip

to list the files, and you should see:

ERR458493_fastqc.zip
ERR458500_fastqc.zip

Inside each of those zip files are directories holding reports from the fastqc program. But these are also nicely summarized in HTML files:

ls *.html

Let's transfer the HTML files to our local computers so we can open and view them in our browsers. We'll use the command scp to do this.

scp stands for secure copy. We can use this command to transfer files between two computers. We need to run this command on our local computers (i.e. from your laptop).

The scp command looks like this:

scp <file I want to move> <where I want to move it>

First we will make a new directory on our computer to store the HTML files we’re transfering. Let’s put it on our desktop for now. Make sure the terminal program you used this morning to ssh onto your instance is open. If you're in it now, you can open a new tab you can use the pull down menu at the top of your screen or the cmd + t or maybe ctrl + t keyboard shortcut). Type:

mkdir ~/Desktop/fastqc_html 

Now we can transfer our HTML files to our local computer using scp, which will look something like this:

scp username@ipaddress:~/quality/*.html ~/Desktop/fastqc_html/

The first part of the command is the address for your remote computer. Make sure you replace everything after username@ with your IP address (the one you used to log in with the ssh command).

The second part starts with a : and then gives the absolute path of the files you want to transfer from your remote computer. Don’t forget the :. We use a wildcard (*.html) to indicate that we want all of the HTML files.

The third part of the command gives the absolute path of the location you want to put the files. This is on your local computer and is the directory we just created ~/Desktop/fastqc_html.

You will be prompted for your password. Enter the password we set for your instance during the morning session.

You should see a status output like this:

ERR458493_fastqc.html                      100%  642KB 370.1KB/s   00:01    
ERR458500_fastqc.html                      100%  641KB 592.2KB/s   00:01  

Now we can go to our new directory and open the HTML files.

You can also download and look at these copies of them:

These files contain a lot of information! Depending on your application, some of these modules are more helpful than others. For example, for RNA-seq data, failing duplication levels aren't necessarily a problem -- we expect to have duplicated sequences given that some transcripts are very abundant.

Questions:

  • What should you pay attention to in the FastQC report?
  • Is one sample better than another?

Links:

There are some important things to know about FastQC - the main one is that it only calculates certain statistics (like duplicated sequences) for subsets of the data (e.g. duplicate sequences are only analyzed within the first 100,000 sequences in each file). You can read more about each module at their docs site.

3. Trimmomatic

Now we're going to do some trimming! Let's switch back to our instance terminal, as we will be running these commands on our remote computers. We'll be using Trimmomatic, which (as with fastqc) we've already installed via conda.

The first thing we'll need is a file holding the adapters we need to trim off. These are artificial and added as part of the sequencing process. The trimmomatic program comes with standard adapter files, so here we are going to copy them over to our current working directory.

cp /opt/miniconda/pkgs/trimmomatic-*/share/trimmomatic-*/adapters/TruSeq2-SE.fa .

(you can look at the contents of this file with cat TruSeq2-SE.fa)

Now, let's run Trimmomatic on a couple samples:

trimmomatic SE ERR458493.fastq.gz \
        ERR458493.qc.fq.gz \
        ILLUMINACLIP:TruSeq2-SE.fa:2:0:15 \
        LEADING:2 TRAILING:2 \
        SLIDINGWINDOW:4:2 \
        MINLEN:25
        
trimmomatic SE ERR458500.fastq.gz \
        ERR458500.qc.fq.gz \
        ILLUMINACLIP:TruSeq2-SE.fa:2:0:15 \
        LEADING:2 TRAILING:2 \
        SLIDINGWINDOW:4:2 \
        MINLEN:25
        

CODE BREAKDOWN

Note that trimmomatic is actually a java program that we call from the command line, so it's syntax is a little different than what we've been using. The arguments are followed by a colon, and then what you are specifying to them.

  • SE - this is specified because we are working with single-ended data, rather than paired-end data where there are forward and reverse reads
  • ERR458493.fastq.gz - the first positional argument we are providing is the input fastq file (note it is okay to provide compressed .gz files to trimmomatic)
  • ERR458493.qc.fq.gz - the second positional argument specifies the name of the output files the program will generate
  • ILLUMINACLIP:TruSeq2-SE.fa:2:0:15 - here we are specifying the argument we're addressing "ILLUMINACLIP", first specifying the file holding the adapter sequences, then 3 numbers: "2" which states how many mismatches between the adapter sequence and what's found are allowed; "0" which is only relevant when there are forward and reverse reads; and "15" which specifies how accurate the match must be
  • LEADING:2 TRAILING:2 - both of these are stating the minimum quality score at the start or end of the read, if lower, it will be trimmed off
  • SLIDINGWINDOW:4:2 - this looks at 4 basepairs at a time, and if the average quality of that window of 4 drops below 2, the read is trimmed there
  • MINLEN:25 - after all the above trimming steps, if the read is shorter than 25 bps, it will be discarded

You should see output that looks like this:

...
Input Reads: 1093957 Surviving: 1092715 (99.89%) Dropped: 1242 (0.11%)
TrimmomaticSE: Completed successfully

We can also run the same process for all 6 samples more efficiently using a for loop, as follows. We're going to use a new command called basename in the for loop, which we introduce below.

Using basename in for loops

basename is a function in UNIX that is helpful for removing a uniform part of a name from a list of files. In this case, we will use basename to remove the .fastq.gz extension from the files that we’ve been working with.

basename ERR458493.fastq.gz .fastq.gz

We see that this returns just the SRR accession, and no longer has the .fastq file extension on it.

ERR458493

If we try the same thing but use .fasta as the file extension instead, nothing happens. This is because basename only works when it exactly matches a string in the file.

basename ERR458493.fastq.gz .fasta
ERR458493.fastq.gz

Basename is really powerful when used in a for loop. It allows to access just the file prefix, which you can use to name things. Let's try this.

Inside our for loop, we create a new name variable. We call the basename function inside the parenthesis, then give our variable name from the for loop, in this case ${filename}, and finally state that .fastq.gz should be removed from the file name. It’s important to note that we’re not changing the actual files, we’re creating a new variable called name. The line > echo $name will print to the terminal the variable name each time the for loop runs. Because we are iterating over two files, we expect to see two lines of output.

for filename in *.fastq.gz
do
  name=$(basename ${filename} .fastq.gz)
  echo ${name}
done

A note on variables: We can either refer a variable as $filename or ${filename}. They mean the same thing, but using {} explicitly tells bash when we are done refering to the variable. This makes it safer to refer to variables, especially when we're combining it with other text. This is different than (), which tells bash that we want to execute something.

Trimming files using basename and for loops

Now we will use basename inside of our for loop to run trimmomatic on all 6 of our files.

for filename in *.fastq.gz
do
        # first, make the base by removing fastq.gz
        base=$(basename $filename .fastq.gz)
        echo $base

        trimmomatic SE ${base}.fastq.gz \
                ${base}.qc.fq.gz \
                ILLUMINACLIP:TruSeq2-SE.fa:2:0:15 \
                LEADING:2 TRAILING:2 \
                SLIDINGWINDOW:4:2 \
                MINLEN:25
done

The for loop above will go through each for the filenames that end with fastq.gz and run Trimmomatic for it.

Questions:

  • How do you figure out what the parameters mean?
  • How do you figure out what parameters to use?
  • What adapters do you use?
  • What version of Trimmomatic are we using here? (And FastQC?)
  • Do you think parameters are different for RNAseq and genomic data sets?
  • What's with these annoyingly long and complicated filenames?

For a discussion of optimal trimming strategies, see MacManes, 2014 -- it's about RNAseq but similar arguments should apply to metagenome assembly.

Links:

4. FastQC again

Let's take a look at the output files:

gunzip -c ERR458493.qc.fq.gz | head                                                        

It's hard to get a feel for how different these files are by just looking at them, so let's run FastQC again on the trimmed files:

fastqc ERR458493.qc.fq.gz
fastqc ERR458500.qc.fq.gz

And now view my copies of these files:

Alternatively, if there's time, you can use scp to view the files from your local computer.

5. MultiQc

MultiQC is a great tool that aggregates multiple fastqc results into a single report for easy comparison.

Run Mulitqc on both the untrimmed and trimmed files

multiqc .

And now you should see output that looks like this:

[INFO   ]         multiqc : This is MultiQC v1.0
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching '.'
Searching 15 files..  [####################################]  100%
[INFO   ]          fastqc : Found 4 reports
[INFO   ]         multiqc : Compressing plot data
[INFO   ]         multiqc : Report      : multiqc_report.html
[INFO   ]         multiqc : Data        : multiqc_data
[INFO   ]         multiqc : MultiQC complete

You can view the output html file multiqc_report.html by going to RStudio, selecting the file, and saying "view in Web browser."

Questions:

  • is the quality trimmed data "better" than before?
  • Does it matter that you still have adapters!?