From 5b7b490aeffa061dce84ee045841baddf491f96b Mon Sep 17 00:00:00 2001
From: Fabian Andrade
Date: Mon, 2 Sep 2024 12:53:21 +0200
Subject: [PATCH] Update 4: aligment tools

---
 LICENSE | 2 +-
 ...aration.rst => 1- Library preparation.rst} | 0
 ...ies.rst => 2- Sequencing technologies.rst} | 0
 docs/3- Quality Control and Preprocessing.rst | 37 +++++++------
 docs/4- Quality of the mapping.rst | 53 +++++++++++++++++--
 docs/index.rst | 1 +
 6 files changed, 73 insertions(+), 20 deletions(-)
 rename docs/{1- Library_preparation.rst => 1- Library preparation.rst} (100%)
 rename docs/{2- Sequencing_technologies.rst => 2- Sequencing technologies.rst} (100%)

diff --git a/LICENSE b/LICENSE
index c218e7e..45db2f6 100644
--- a/LICENSE
+++ b/LICENSE
@@ -1,6 +1,6 @@
 MIT License
-Copyright (c) 2022 Biocore@CRG
+Copyright (c) 2024 Biocore@CRG
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
diff --git a/docs/1- Library_preparation.rst b/docs/1- Library preparation.rst
similarity index 100%
rename from docs/1- Library_preparation.rst
rename to docs/1- Library preparation.rst
diff --git a/docs/2- Sequencing_technologies.rst b/docs/2- Sequencing technologies.rst
similarity index 100%
rename from docs/2- Sequencing_technologies.rst
rename to docs/2- Sequencing technologies.rst
diff --git a/docs/3- Quality Control and Preprocessing.rst b/docs/3- Quality Control and Preprocessing.rst
index 99ee4cc..09c0fc0 100644
--- a/docs/3- Quality Control and Preprocessing.rst
+++ b/docs/3- Quality Control and Preprocessing.rst
@@ -29,23 +29,24 @@ the extensions supported are:
 The html report generated for each file its divided in the following modules:
- #1. **Basic Statistics**: display the information related with the file, number and leght of the sequences, and overall %GC.
- #2. **Per base sequence quality**: shows how the quality score (y axis) varys throughout the sequence reads (x axis).
For each position a BoxWhisker is displayed, the red line represents the median and the blue the mean. Commonly the quality score tend to decrease at the end of the reads, because the polymerase tends to make more mistakes as the read progresses.
+ 1. **Basic Statistics**: displays the information related to the file, the number and length of the sequences, and the overall %GC.
+ 2. **Per base sequence quality**: shows how the quality score (y axis) varies throughout the sequence reads (x axis). For each position a BoxWhisker plot is displayed; the red line represents the median and the blue line the mean. The quality score commonly tends to decrease towards the end of the reads, because the polymerase tends to make more mistakes as the read progresses.
 is the median os any base is less than 25 a warning will arise.
- #3. **Per tile sequence quality**: shows the quality score distribution for each tile in the flowcell.
- #4. **Per sequence quality score**: shows the distribution of the quality scores for all the reads in the file. If a huge amount of reads subset have a poor average quality this could indicate a systematic problem.
- #5. **Per base sequence content**: proportion of each base position for the four nucleotides. A strong bias in the nucleotide composition could indicate a problem in the library preparation.
- #6. **Per sequence GC content**: GC content distribution for all the reads in the file, and compared to a modelled normal distribution of human GC content.
+ 3. **Per tile sequence quality**: shows the quality score distribution for each tile in the flowcell.
+ 4. **Per sequence quality score**: shows the distribution of the quality scores for all the reads in the file. If a large subset of reads has a poor average quality, this could indicate a systematic problem.
+ 5. **Per base sequence content**: proportion of each of the four nucleotides at each base position. A strong bias in the nucleotide composition could indicate a problem in the library preparation.
+ 6. 
**Per sequence GC content**: GC content distribution for all the reads in the file, compared to a modelled normal distribution of human GC content.
-.. danger::
- If the GC content is not close to the normal distribution, this could indicate a contamination or a problem in the library preparation.
- Also, depending on the organism the GC content could vary, so it is important to know the GC content of the organism of interest (so avoid comparison with reference curve).
+ .. danger::
+ If the GC content is not close to the normal distribution, this could indicate contamination or a problem in the library preparation.
+ Also, the GC content varies between organisms, so it is important to know the GC content of the organism of interest (and to avoid comparison with the reference curve).
- #7. **Per Base N content**: If the sequencer is unable to determine the base in a position, it will be represented as an 'N'. This section shows the distribution of Ns in the reads.
- #8. **Sequence Lenght Distribution**: distribution of fragment sizes, for delimited size lenght (number of cycles) a peak only at one size is observed.
- #9. **Duplicate Sequences**: shows the number of duplicated sequences in the file. a high level of duplication could indicate a enrichment bias (i.e. PCR amplification). Low level of duplication may indicate a very high level of coverage of the target sequence.
- #10. **Overrepresented sequences**: show in a single sequence is very overrepresented in the file. This could indicate a contamination or a problem in the library preparation.
- #11. **Adapter content **: shows the presence of adapter sequences in the reads. If there is presence of adapters, the reads should be trimmed before further analysis.
+
+ 7. **Per Base N content**: If the sequencer is unable to determine the base in a position, it will be represented as an 'N'. This section shows the distribution of Ns in the reads.
+ 8. 
**Sequence Length Distribution**: distribution of fragment sizes; for a fixed read length (number of cycles), a single peak at one size is observed.
+ 9. **Duplicate Sequences**: shows the number of duplicated sequences in the file. A high level of duplication could indicate an enrichment bias (e.g. PCR amplification). A low level of duplication may indicate a very high level of coverage of the target sequence.
+ 10. **Overrepresented sequences**: shows whether a single sequence is highly overrepresented in the file. This could indicate contamination or a problem in the library preparation.
+ 11. **Adapter content**: shows the presence of adapter sequences in the reads. If adapters are present, the reads should be trimmed before further analysis.
 .. seealso::
 .. _FASTQC: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/
@@ -107,4 +108,10 @@ Fastp performs in all one the following corrections:
 - Poly-G tails are recognised and removed (Sequencing error in the end of the read produced by some artifacts, such as Illumina and Novaseq, for the use of two colors to detect the four bases)
 After preprocessing our reads, its important to check again the Quality. Fastp generates both htm and json report for asses the quality of our reads.
-The json reports could be aggregated with MULTIQC.
\ No newline at end of file
+The json reports could be aggregated with MULTIQC.
+
+Example of fastp report.
+
+
+
+
diff --git a/docs/4- Quality of the mapping.rst b/docs/4- Quality of the mapping.rst
index 780f747..8b0060d 100644
--- a/docs/4- Quality of the mapping.rst
+++ b/docs/4- Quality of the mapping.rst
@@ -10,10 +10,18 @@ Introduction to Mapping and tools
 Once our reads are clean and with good Quality, most of the analysis requires the aligment of this reads respect a reference genome. Depending on the origin of our sequencing data (WGS, WES, RNA-seq, Chip-seq, ...) 
and the downstream analysis, several alingers are available to adjust to the necessities of our analysis.
- - BWA-MEM
- - bowtie
+**Basic alignment**: based on the Smith-Waterman algorithm; requires an index of the reference genome, which is used as a dictionary to query the reads.
+ - BWA-MEM: performs local alignment by default.
+ - bowtie2: performs global alignment by default; it is faster than BWA but less sensitive.
+
+**RNA-seq splice-aware aligner**
 - STAR
- -
+ - TopHat
+ - HISAT2
+
+**Pseudo-Aligner - Quasi-mapping**: very fast; maps to the transcriptome and performs quantification.
+ - Salmon
+ - Kallisto
 Previous aligment of the reads, a reference genome in fasta format is needed, Typical sources to look up are UCSC, Ensembl or Gencode. An indexing of the reference genome is perfomed to create a dictionary database of the redundant sequences of the genome and facilitate and accelerate the query of the reads respect this regions, thus, minimizing the the memory footprint.
@@ -22,6 +30,7 @@ SAM format
+
 BAM QC
 ===========================
@@ -42,9 +51,45 @@ The confidence of the alignment is higher when the MAPQ value is higher.
 Main Tools to asses the quality of the mapping are:
-- **SAMStat**: Is a CLI tool that offers Statistics of SAM/BAM files of unmapped, poorly and accuretly mapped raads.
+.. note::
+ The MAPQ value is algorithm-specific, so MAPQ values are not comparable between different aligners.
+
+**SAMStat**
+------------
+
+A CLI tool that reports statistics on unmapped, poorly mapped and accurately mapped reads in SAM/BAM files.
 .. seealso::
 .. _SAMStat: https://github.com/TimoLassmann/samstat
 BAM format. Note, that the BAM file has to be sorted by chromosomal coordinates. Sorting can be performed with samtools sort.
+
+**Qualimap**
+------------
+
+Qualimap is a platform-independent application written in Java and R that provides both a GUI and a command-line interface to facilitate the quality control of alignment sequencing data.
+It can be used to assess the quality of the alignment of reads to a reference genome, the coverage of the genome and the distribution of reads across the genome, and it helps to detect biases.
+Qualimap generates a series of plots and tables that can be used to evaluate the quality of the alignment and identify potential problems with the data.
+It can be used to assess the quality of alignments generated by a variety of alignment tools, including BWA, Bowtie, and STAR.
+
+It requires a sorted BAM file as input; the supported data types are WGS, WES, RNA-seq and ChIP-seq.
+
+.. note::
+ .. _Samtools: https://www.htslib.org/doc/samtools-sort.html
+ Sorting a BAM file by chromosomal coordinates can be performed with samtools sort (Samtools_).
+
+.. seealso::
+ .. _Qualimap: http://qualimap.bioinfo.cipf.es/
+
+Depending on the origin of our data, different modes are available for quality assessment.
+
+ - BamQC: evaluates the mapping quality of the reads. If an annotation file is provided (typically a BED file of the regions covered by the library preparation kit), it can also provide information about the coverage of the genes.
+
+ - RNA-seq: specific to whole-transcriptome sequencing data. Can be used as a complementary tool with BamQC.
+
+ - Multi-sample BamQC: for working with multiple samples (e.g. cancer sequencing data) that belong to a specific group.
+ This makes it possible to detect whether all the samples in a specific group pass the quality control.
+
+ - Counts QC: when working with RNA-seq data, this mode allows assessing the differential expression between two or more experimental conditions.
+
+**Picard Tools - RNAseqMetrics**
diff --git a/docs/index.rst b/docs/index.rst
index 837d861..d9b976c 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -16,3 +16,4 @@ Contents:
 1- Library_preparation
 2- Sequencing_technologies
 3- Quality Control and Preprocessing
+ 4- Quality of the mapping