index and 3: reports images added
FabianAndradeLozano committed Sep 2, 2024
1 parent 5b7b490 commit 8d3ef45
Showing 13 changed files with 77 additions and 10 deletions.
83 changes: 75 additions & 8 deletions docs/3- Quality Control and Preprocessing.rst
Expand Up @@ -30,24 +30,75 @@ the extensions supported are:
The HTML report generated for each file is divided into the following modules:

1. **Basic Statistics**: displays information about the file, the number and length of the sequences, and the overall %GC.

2. **Per base sequence quality**: shows how the quality score (y axis) varies along the sequence reads (x axis).
   For each position a box-and-whisker plot is displayed; the red line represents the median and the blue line the mean.
   The quality score commonly tends to decrease towards the end of the reads, because the polymerase tends to make more mistakes as the read progresses.
   If the median for any base is less than 25, a warning is raised.

.. image:: images/FASTQC_report_images/Per_base_seq_quality.png
   :width: 400
   :align: center
   :alt: *Per Base Sequence Quality FASTQC module*

3. **Per tile sequence quality**: shows the quality score distribution for each tile in the flowcell.

.. image:: images/FASTQC_report_images/Per_tile_seq_quality.png
   :width: 400
   :align: center
   :alt: *Per Tile Sequence Quality FASTQC module*

4. **Per sequence quality score**: shows the distribution of the average quality scores for all the reads in the file.
   If a large subset of reads has a poor average quality, this could indicate a systematic problem.

.. image:: images/FASTQC_report_images/Per_seq_quality_scores.png
   :width: 400
   :align: center
   :alt: *Per Sequence Quality FASTQC module*

5. **Per base sequence content**: shows the proportion of each of the four nucleotides at each base position.
   A strong bias in the nucleotide composition could indicate a problem in the library preparation.

.. image:: images/FASTQC_report_images/Per_base_seq_content.png
   :width: 400
   :align: center
   :alt: *Per Base Sequence Content FASTQC module*

6. **Per sequence GC content**: shows the GC content distribution across all the reads in the file, compared with a modelled normal distribution of GC content.

.. image:: images/FASTQC_report_images/Per_seq_GC_content.png
   :width: 400
   :align: center
   :alt: *Per Sequence GC Content FASTQC module*

.. danger::
   If the GC content is not close to the normal distribution, this could indicate contamination or a problem in the library preparation.
   GC content also varies between organisms, so it is important to know the GC content of the organism of interest rather than relying only on the comparison with the reference curve.

7. **Per Base N content**: if the sequencer is unable to determine the base at a position, it is reported as an 'N'. This section shows the distribution of Ns along the reads.
8. **Sequence Length Distribution**: shows the distribution of read lengths; when the read length is fixed (by the number of cycles), a single peak at one size is observed.

.. image:: images/FASTQC_report_images/Per_base_N_content.png
   :width: 400
   :align: center
   :alt: *Per Base N Content FASTQC module*

9. **Duplicate Sequences**: shows the level of duplicated sequences in the file. A high level of duplication could indicate an enrichment bias (e.g. PCR over-amplification). A low level of duplication may indicate a very high level of coverage of the target sequence.

.. image:: images/FASTQC_report_images/Seq_duplication_levels.png
   :width: 400
   :align: center
   :alt: *Duplicate Sequences FASTQC module*

10. **Overrepresented sequences**: shows whether any single sequence is strongly overrepresented in the file. This could indicate contamination or a problem in the library preparation.

11. **Adapter content**: shows the presence of adapter sequences in the reads. If adapters are present, the reads should be trimmed before further analysis.

.. image:: images/FASTQC_report_images/Adapter_content.png
   :width: 400
   :align: center
   :alt: *Adapter Content FASTQC module*
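
The per base sequence quality module described above can be illustrated with a small sketch. This is not FastQC's own code: it assumes Phred+33 encoded quality strings, the function names are hypothetical, and only the median rule mentioned above (median below 25) is checked.

```python
from statistics import mean, median


def phred_scores(quality_line, offset=33):
    """Decode one FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def per_base_quality(quality_lines):
    """Return (position, median, mean) tuples, with 1-based read positions."""
    columns = {}
    for line in quality_lines:
        for pos, score in enumerate(phred_scores(line), start=1):
            columns.setdefault(pos, []).append(score)
    return [(pos, median(s), mean(s)) for pos, s in sorted(columns.items())]


def positions_below_median(summary, threshold=25):
    """Positions whose median quality falls below the warning threshold."""
    return [pos for pos, med, _ in summary if med < threshold]


# Three toy quality strings (hypothetical data); quality drops at the last base.
reads = ["IIII5", "IIII+", "IIH5#"]
summary = per_base_quality(reads)
flagged = positions_below_median(summary)
print(flagged)
```

With these toy reads only the last position has a median below 25, which mirrors the typical drop in quality at the end of the reads.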


.. seealso::
   .. _FASTQC: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/
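
The warning about per sequence GC content can be made concrete by comparing the observed GC content with the value expected for the organism of interest. The sketch below is a simplification and not how FastQC builds its reference curve: Ns are ignored, the function names are hypothetical, and the expected value of 41.0 is only an illustrative placeholder that should be replaced with the known GC content of the organism.

```python
def gc_percent(sequence):
    """GC percentage of one read, ignoring uncalled bases (N)."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def mean_gc(reads):
    """Mean GC percentage over a collection of reads."""
    values = [gc_percent(read) for read in reads]
    return sum(values) / len(values)


def gc_shift(reads, expected_gc):
    """Observed mean GC minus the organism-specific expected GC."""
    return mean_gc(reads) - expected_gc


# Toy reads (hypothetical data) compared against a placeholder expected GC.
toy_reads = ["ATGC", "GGCC"]
shift = gc_shift(toy_reads, expected_gc=41.0)
print(mean_gc(toy_reads), shift)
```

A large shift in either direction is a reason to look more closely at contamination or library preparation, as noted above.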

Expand All @@ -69,15 +120,31 @@ In human sequencing data the standard reference genomes to check are:
- Yeast
- Arabidopsis
- E. coli

Other sources of contaminants can also be checked:

- PhiX: a control used by Illumina to check the quality of the sequencing run (for example, whether the library is under- or overloaded).
- rRNA: in RNA-seq, a good control that rRNA was depleted during library preparation and has not been amplified.
- Mitochondrial: in single-nucleus RNA-seq, a good control of the nuclear isolation during the DNA extraction.
- Lambda
- Vectors: to check for vectors used during library preparation.
- Adapters

Example of a FASTQ-Screen report:

- Mapping result tables with the percentage of reads that map to each reference genome.

.. image:: images/FASTQ-Screen/Mapping_results_tables.png
   :width: 400
   :align: center
   :alt: *FASTQ-Screen mapping results table*

- Mapping result table values shown as a plot.

.. image:: images/FASTQ-Screen/Mapping_results_plots.png
   :width: 400
   :align: center
   :alt: *FASTQ-Screen mapping results plot*
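
The percentages in the mapping result table can be understood with a short sketch. It does not parse a real FASTQ-Screen output file: the input is a hypothetical dictionary of read counts per reference genome, the function names are invented for illustration, and the 5 % threshold is an arbitrary assumption rather than a recommended cut-off.

```python
def percent_mapped(reads_processed, unmapped):
    """Percentage of the screened reads that map to one reference genome."""
    return 100.0 * (reads_processed - unmapped) / reads_processed


def screen_summary(counts):
    """Map each genome name to its percentage of mapped reads."""
    return {
        genome: round(percent_mapped(processed, unmapped), 2)
        for genome, (processed, unmapped) in counts.items()
    }


def unexpected_genomes(summary, expected, threshold=5.0):
    """Genomes other than the expected one with more than `threshold` % mapped."""
    return sorted(
        genome
        for genome, pct in summary.items()
        if genome != expected and pct > threshold
    )


# Hypothetical counts: genome -> (reads processed, reads left unmapped).
counts = {
    "Human": (100000, 4000),
    "PhiX": (100000, 99900),
    "Ecoli": (100000, 88000),
}
summary = screen_summary(counts)
print(summary, unexpected_genomes(summary, expected="Human"))
```

In this toy example most reads map to the expected human reference, while the fraction mapping to E. coli would deserve a closer look.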

When working with several samples, the individual reports can be aggregated into a single report using MultiQC (https://multiqc.info/).
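
A minimal sketch of how the aggregation could be scripted is shown below. It assumes that the ``multiqc`` command is installed and on the ``PATH``; the directory names ``qc_reports`` and ``multiqc_out`` and the helper function names are hypothetical.

```python
import shutil
import subprocess


def multiqc_command(report_dir, out_dir):
    """Build the MultiQC call that scans report_dir and writes into out_dir."""
    return ["multiqc", str(report_dir), "--outdir", str(out_dir)]


def run_multiqc(report_dir, out_dir):
    """Run MultiQC, failing early if the executable is not available."""
    if shutil.which("multiqc") is None:
        raise RuntimeError("multiqc was not found on PATH")
    subprocess.run(multiqc_command(report_dir, out_dir), check=True)


# Hypothetical directories; only the command is built here, nothing is executed.
command = multiqc_command("qc_reports", "multiqc_out")
print(" ".join(command))
```

Calling ``run_multiqc("qc_reports", "multiqc_out")`` would then scan the FASTQC and FASTQ-Screen outputs found in the first directory and write the combined report to the second.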

4 changes: 2 additions & 2 deletions docs/index.rst
Expand Up @@ -13,7 +13,7 @@ Contents:
:maxdepth: 1

about
1- Library_preparation
2- Sequencing_technologies
1- Library preparation
2- Sequencing technologies
3- Quality Control and Preprocessing
4- Quality of the mapping
