Update 4: aligment tools
FabianAndradeLozano committed Sep 2, 2024
1 parent e0ab682 commit 5b7b490
Showing 6 changed files with 73 additions and 20 deletions.
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2022 Biocore@CRG
Copyright (c) 2024 Biocore@CRG

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
File renamed without changes.
File renamed without changes.
37 changes: 22 additions & 15 deletions docs/3- Quality Control and Preprocessing.rst
@@ -29,23 +29,24 @@ the extensions supported are:

The HTML report generated for each file is divided into the following modules:

#1. **Basic Statistics**: displays information about the file, the number and length of the sequences, and the overall %GC.
#2. **Per base sequence quality**: shows how the quality score (y axis) varies along the sequence reads (x axis). For each position a box-and-whisker plot is displayed; the red line represents the median and the blue line the mean. The quality score commonly tends to decrease toward the end of the reads, because the polymerase tends to make more mistakes as the read progresses.
1. **Basic Statistics**: displays information about the file, the number and length of the sequences, and the overall %GC.
2. **Per base sequence quality**: shows how the quality score (y axis) varies along the sequence reads (x axis). For each position a box-and-whisker plot is displayed; the red line represents the median and the blue line the mean. The quality score commonly tends to decrease toward the end of the reads, because the polymerase tends to make more mistakes as the read progresses.
If the median for any base is less than 25, a warning is raised.
#3. **Per tile sequence quality**: shows the quality score distribution for each tile in the flowcell.
#4. **Per sequence quality score**: shows the distribution of the quality scores for all the reads in the file. If a large subset of reads has a poor average quality, this could indicate a systematic problem.
#5. **Per base sequence content**: proportion of each of the four nucleotides at each base position. A strong bias in the nucleotide composition could indicate a problem in the library preparation.
#6. **Per sequence GC content**: GC content distribution across all the reads in the file, compared to a modelled normal distribution of GC content.
3. **Per tile sequence quality**: shows the quality score distribution for each tile in the flowcell.
4. **Per sequence quality score**: shows the distribution of the quality scores for all the reads in the file. If a large subset of reads has a poor average quality, this could indicate a systematic problem.
5. **Per base sequence content**: proportion of each of the four nucleotides at each base position. A strong bias in the nucleotide composition could indicate a problem in the library preparation.
6. **Per sequence GC content**: GC content distribution across all the reads in the file, compared to a modelled normal distribution of GC content.

.. danger::
If the GC content is not close to the normal distribution, this could indicate contamination or a problem in the library preparation.
Also, the GC content varies between organisms, so it is important to know the GC content of the organism of interest (and to avoid relying on the comparison with the reference curve).
.. danger::
If the GC content is not close to the normal distribution, this could indicate contamination or a problem in the library preparation.
Also, the GC content varies between organisms, so it is important to know the GC content of the organism of interest (and to avoid relying on the comparison with the reference curve).

#7. **Per Base N content**: if the sequencer is unable to determine the base at a position, it is represented as an 'N'. This section shows the distribution of Ns in the reads.
#8. **Sequence Length Distribution**: distribution of fragment sizes; for a fixed read length (number of cycles), a single peak at one size is observed.
#9. **Duplicate Sequences**: shows the level of duplicated sequences in the file. A high level of duplication could indicate an enrichment bias (e.g. PCR amplification). A low level of duplication may indicate a very high level of coverage of the target sequence.
#10. **Overrepresented sequences**: shows whether a single sequence is heavily overrepresented in the file. This could indicate contamination or a problem in the library preparation.
#11. **Adapter content**: shows the presence of adapter sequences in the reads. If adapters are present, the reads should be trimmed before further analysis.

7. **Per Base N content**: if the sequencer is unable to determine the base at a position, it is represented as an 'N'. This section shows the distribution of Ns in the reads.
8. **Sequence Length Distribution**: distribution of fragment sizes; for a fixed read length (number of cycles), a single peak at one size is observed.
9. **Duplicate Sequences**: shows the level of duplicated sequences in the file. A high level of duplication could indicate an enrichment bias (e.g. PCR amplification). A low level of duplication may indicate a very high level of coverage of the target sequence.
10. **Overrepresented sequences**: shows whether a single sequence is heavily overrepresented in the file. This could indicate contamination or a problem in the library preparation.
11. **Adapter content**: shows the presence of adapter sequences in the reads. If adapters are present, the reads should be trimmed before further analysis.
.. seealso::
.. _FASTQC: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/
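
Several of the modules listed above reduce to simple per-read or per-position statistics. The following is a minimal sketch, not FastQC's actual implementation: it assumes Phred+33 encoded quality strings, the function names and the toy reads are hypothetical, and the threshold of 25 is the warning level quoted in the list.

```python
from statistics import median

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def phred_scores(quality: str, offset: int = PHRED_OFFSET) -> list[int]:
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def gc_percent(seq: str) -> float:
    """Percentage of G/C among the called (non-N) bases of one read."""
    called = [base for base in seq.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return 100.0 * sum(base in "GC" for base in called) / len(called)


def per_base_median(qualities: list[str]) -> list[float]:
    """Median Phred score at each read position across all reads."""
    longest = max(len(q) for q in qualities)
    medians = []
    for pos in range(longest):
        column = [phred_scores(q)[pos] for q in qualities if pos < len(q)]
        medians.append(median(column))
    return medians


# Toy data (hypothetical): three reads as (sequence, quality) pairs.
reads = [("ACGTN", "IIII#"), ("GGCCA", "IIIA5"), ("ATATA", "III5#")]

medians = per_base_median([qual for _, qual in reads])
low_quality_positions = [i for i, m in enumerate(medians) if m < 25]
n_counts = [seq.upper().count("N") for seq, _ in reads]
```

Here ``low_quality_positions`` lists the read positions whose median quality falls below the warning threshold, and ``n_counts`` gives the number of uncalled bases per read.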
Expand Down Expand Up @@ -107,4 +108,10 @@ Fastp performs in all one the following corrections:
- Poly-G tails are recognised and removed (a sequencing artifact at the end of the read produced by some instruments, such as the Illumina NovaSeq, which use two colors to detect the four bases)

After preprocessing our reads, it is important to check the quality again. Fastp generates both HTML and JSON reports to assess the quality of our reads.
The JSON reports can be aggregated with MultiQC.
The JSON reports can be aggregated with MultiQC.
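
The preprocessing and report aggregation described above could be scripted roughly as follows. This is a minimal sketch: the file names and helper functions are hypothetical, and the commands are only built, not executed. It assumes ``fastp`` and ``multiqc`` are installed, uses fastp's ``-i``, ``-o``, ``-j`` and ``-h`` options, and keeps ``fastp`` in the JSON file name on the assumption that MultiQC locates the reports by name.

```python
import shlex


def fastp_command(fastq_in: str, fastq_out: str, report_prefix: str) -> list[str]:
    """Build a single-end fastp call that writes JSON and HTML reports."""
    return [
        "fastp",
        "-i", fastq_in,                       # raw reads
        "-o", fastq_out,                      # cleaned reads
        "-j", f"{report_prefix}.fastp.json",  # JSON report (input for MultiQC)
        "-h", f"{report_prefix}.fastp.html",  # HTML report
    ]


def multiqc_command(report_dir: str) -> list[str]:
    """Build a MultiQC call that scans a directory for reports."""
    return ["multiqc", report_dir]


# Hypothetical sample name and paths, for illustration only.
cmd = fastp_command("sample1.fastq.gz", "sample1.clean.fastq.gz", "reports/sample1")
print(shlex.join(cmd))
print(shlex.join(multiqc_command("reports")))
```

Each list can be passed to ``subprocess.run(cmd, check=True)`` once the paths point to real files.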

Example of fastp report.




53 changes: 49 additions & 4 deletions docs/4- Quality of the mapping.rst
@@ -10,10 +10,18 @@ Introduction to Mapping and tools
Once our reads are clean and of good quality, most analyses require aligning these reads against a reference genome.
Depending on the origin of our sequencing data (WGS, WES, RNA-seq, ChIP-seq, ...) and the downstream analysis, several aligners are available to suit the needs of our analysis.

- BWA-MEM
- bowtie
**Basic alignment**: based on the Smith-Waterman algorithm; requires building an index of the reference genome, which is used as a dictionary to query the reads.
- BWA-MEM: performs local alignment by default.
- bowtie2: performs global (end-to-end) alignment by default; it is faster than BWA but less sensitive.

**RNA-seq splice-aware aligners**
- STAR
-
- TopHat
- HISAT2

**Pseudo-Aligner - Quasi-mapping**: very fast; maps to the transcriptome and performs quantification.
- Salmon
- Kallisto

Before aligning the reads, a reference genome in FASTA format is needed; typical sources are UCSC, Ensembl or GENCODE. The reference genome is indexed to create a dictionary-like database of its sequences, which speeds up querying the reads against these regions while minimizing the memory footprint.
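
The idea of an index used as a dictionary can be illustrated with a toy k-mer lookup table. This is only a sketch of the concept under simplifying assumptions: real aligners rely on compressed structures (for example, the FM-index used by BWA and Bowtie2) instead of a plain hash table, and the function names and sequences below are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return dict(index)


def exact_hits(read: str, reference: str, index: dict[str, list[int]], k: int) -> list[int]:
    """Look up the first k-mer of the read (the seed), then verify the full read."""
    candidates = index.get(read[:k], [])
    return [pos for pos in candidates if reference[pos:pos + len(read)] == read]


reference = "ACGTACGTTAGC"   # toy reference sequence
index = build_kmer_index(reference, k=4)

print(index["ACGT"])                               # seed occurs twice in the reference
print(exact_hits("ACGTT", reference, index, k=4))  # only one occurrence extends to the full read
```

The index is built once per reference and reused for every read, which is why the indexing step is done before alignment.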

@@ -22,6 +30,7 @@ SAM format




BAM QC
===========================

@@ -42,9 +51,45 @@ The confidence of the alignment is higher when the MAPQ value is higher.

The main tools to assess the quality of the mapping are:

- **SAMStat**: a CLI tool that reports statistics for SAM/BAM files on unmapped, poorly mapped and accurately mapped reads.
.. note::
The MAPQ value is algorithm-specific, so MAPQ values are not comparable between different aligners.
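
As a reference point, the SAM specification defines MAPQ as -10 log10 of the probability that the mapping position is wrong, rounded to the nearest integer, with 255 meaning that the value is not available. The conversion can be sketched as follows; the function names are hypothetical, and since each aligner estimates the probability in its own way, the numbers are only meaningful within a single aligner.

```python
import math
from typing import Optional

MAPQ_UNAVAILABLE = 255  # SAM convention: mapping quality not available


def mapq_to_error_probability(mapq: int) -> Optional[float]:
    """Probability that the reported mapping position is wrong."""
    if mapq == MAPQ_UNAVAILABLE:
        return None
    return 10 ** (-mapq / 10)


def error_probability_to_mapq(p_wrong: float) -> int:
    """Phred-scaled mapping quality for a given error probability."""
    return round(-10 * math.log10(p_wrong))


for mapq in (0, 10, 20, 30):
    print(mapq, mapq_to_error_probability(mapq))
```

Under this definition a MAPQ of 20 corresponds to a 1 in 100 chance that the read is placed incorrectly, and a MAPQ of 30 to 1 in 1000.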

**SAMStat**
------------

It is a CLI tool that reports statistics for SAM/BAM files on unmapped, poorly mapped and accurately mapped reads.
.. seealso::
.. _SAMStat: https://github.com/TimoLassmann/samstat


BAM format. Note that the BAM file has to be sorted by chromosomal coordinates. Sorting can be performed with samtools sort.

**Qualimap**
------------

Qualimap is a platform-independent application written in Java and R that provides both a GUI and a command-line interface to facilitate the quality control of alignment sequencing data.
It can be used to assess the quality of the alignment of reads to a reference genome, the coverage of the genome, the distribution of reads across the genome and helps to detect biases.
Qualimap generates a series of plots and tables that can be used to evaluate the quality of the alignment and identify potential problems with the data.
It can be used to assess the quality of alignments generated by a variety of alignment tools, including BWA, Bowtie, and STAR.

It requires a sorted BAM file as input; the supported data types are WGS, WES, RNA-seq and ChIP-seq.

.. note::
.. _Samtools: https://www.htslib.org/doc/samtools-sort.html
Sorting a BAM file by chromosomal coordinates can be performed with samtools sort (see Samtools_).

.. seealso::
.. _Qualimap: http://qualimap.bioinfo.cipf.es/

Depending on the origin of our data, there are different modes for quality assessment.

- BamQC: evaluates the mapping quality of the reads. If an annotation file is provided (typically a BED file of the regions covered by the library preparation kit), it can also provide information about the coverage of the genes.

- Rna-seq: specific for whole-transcriptome sequencing data. It can be used as a complementary tool to BamQC.

- Multi-sample BamQC: for working with multiple samples (e.g. sequencing data in cancer) that belong to a specific group.
This allows detecting whether all the samples in a specific group pass the quality control.

- Counts QC: when working with RNA-seq data, this mode allows assessing differential expression between two or more experimental conditions.
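
A coordinate-sorted BAM file followed by a BamQC run could be prepared along these lines. This is a minimal sketch that only builds the command lines: the file names and helper functions are hypothetical, and it assumes ``samtools`` and ``qualimap`` are installed, using ``samtools sort -o`` and the ``-bam``, ``-outdir`` and ``-gff`` options of ``qualimap bamqc``.

```python
from typing import Optional


def samtools_sort_command(bam_in: str, bam_out: str) -> list[str]:
    """Sort a BAM file by chromosomal coordinates (the samtools sort default)."""
    return ["samtools", "sort", "-o", bam_out, bam_in]


def qualimap_bamqc_command(sorted_bam: str, out_dir: str,
                           regions: Optional[str] = None) -> list[str]:
    """Build a Qualimap BamQC call, optionally restricted to annotated regions."""
    cmd = ["qualimap", "bamqc", "-bam", sorted_bam, "-outdir", out_dir]
    if regions is not None:
        cmd += ["-gff", regions]  # feature file, e.g. the BED file of the capture kit
    return cmd


# Hypothetical file names, for illustration only.
sort_cmd = samtools_sort_command("sample1.bam", "sample1.sorted.bam")
qc_cmd = qualimap_bamqc_command("sample1.sorted.bam", "qc/sample1", regions="targets.bed")
print(" ".join(sort_cmd))
print(" ".join(qc_cmd))
```

The two lists can be run in order with ``subprocess.run(..., check=True)``, so that Qualimap only starts if the sorting step succeeded.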

**Picard Tools - RNAseqMetrics**
1 change: 1 addition & 0 deletions docs/index.rst
@@ -16,3 +16,4 @@ Contents:
1- Library_preparation
2- Sequencing_technologies
3- Quality Control and Preprocessing
4- Quality of the mapping
