Quality control of BBS data #16

Open
mdavy86 opened this issue Dec 16, 2014 · 3 comments

mdavy86 commented Dec 16, 2014

This is a placeholder for discussing what we are doing for quality control of BBS data.

Plant and Food Research

We have some Perl scripts, knitr R Markdown scripts, and a Shiny application that look at quality control aspects of GBS restriction sites in BAM alignments.

The Shiny application does exploratory analysis in real time, summarizing 96 wells * 2 BAM files for ~1.5 million restriction sites/tags. It checks the sampled yield distributions against the known population of restriction sites for the samples, and investigates coverage depth and fragment distribution before SNP discovery is considered.

The Perl script sanity-checks restriction fragments (probably unnecessary) and summarises sites in the following form:

$ perl gbsSites.pl
NAME
    gbsSites.pl - BAM to location terminal ends

DESCRIPTION
    Process a bam file for GBS restriction sites

SYNOPSIS
     gbsSites.pl [options]

    Where options and [defaults] are:

     -bam <BAM file>    Path to a bam file. Multiple options allowed      []

     -enzyme <Enzyme name> Which restriction enzyme? BamHI, ApeKI etc     [BamHI]

     -format < narrow|wide > Options: 'wide' or 'narrow' formats          [wide]

     -out <output file> Filename for tab delimited report                 [report.txt]

## Example output
Sample  Chromosome      cutSite Count   fwdCount        revCompCount
[BAMFile]   1       8312    1       0       1
[BAMFile]   1       17201   340     340     0
[BAMFile]   1       33026   2       0       2
[BAMFile]   1       35031   1       1       0
[BAMFile]   1       50458   54      0       54
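
A report in this form can be checked downstream with a few lines of code. The following is a minimal Python sketch and is not part of gbsSites.pl: it assumes the wide-format columns shown in the example output, splits fields on whitespace (so sample names must not contain spaces), and uses hypothetical helper names (`parse_report`, `inconsistent_sites`, `top_site_share`). It verifies that Count equals fwdCount + revCompCount for every site and reports how much of the total read count sits at the single deepest site.

```python
from collections import namedtuple

Site = namedtuple("Site", "sample chromosome cut_site count fwd rev")

# The example rows from the report above, used here as test input.
EXAMPLE = """\
Sample  Chromosome      cutSite Count   fwdCount        revCompCount
[BAMFile]   1       8312    1       0       1
[BAMFile]   1       17201   340     340     0
[BAMFile]   1       33026   2       0       2
[BAMFile]   1       35031   1       1       0
[BAMFile]   1       50458   54      0       54
"""


def parse_report(text):
    """Parse a wide-format report (assumed layout) into Site records."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    sites = []
    for ln in lines[1:]:  # skip the header row
        sample, chrom, cut, count, fwd, rev = ln.split()
        sites.append(Site(sample, chrom, int(cut), int(count), int(fwd), int(rev)))
    return sites


def inconsistent_sites(sites):
    """Return sites whose total does not equal the sum of both strands."""
    return [s for s in sites if s.count != s.fwd + s.rev]


def top_site_share(sites):
    """Fraction of all counted reads that fall on the deepest site."""
    total = sum(s.count for s in sites)
    return max(s.count for s in sites) / total if total else 0.0


sites = parse_report(EXAMPLE)
print(len(sites), "sites;", len(inconsistent_sites(sites)), "inconsistent")
print("share of reads at deepest site: %.3f" % top_site_share(sites))
```

A high share at one site is one way to flag fragments that absorb most of the reads before moving on to SNP discovery.
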
@rbrauning

To enable biologists and lab staff to contribute to QC efforts, I've put together questions of interest to ask of a GBS run. Technical details are left out to draw non-bifos in.

  1. Fastq
    • Did we get per lane what’s promised in terms of output?
    • What does the sequence quality look like?
    • How pure is the data (adapters, other species)? What are contaminants?
  2. Barcodes
    • How many reads have recognizable barcodes?
    • What are the reads without barcodes?
    • Are all barcodes represented equally?
    • Are negative controls blank?
  3. Mapping
    • How many reads can get mapped to a reference?
    • What does the mapping quality look like?
    • How much of the genome gets covered by reads?
    • What does the coverage depth distribution look like?
    • What does the theoretical fragment size distribution look like? Contrast to observed fragment size distribution.
    • How many reads do we see per fragment? Are there fragments that absorb most of the reads?
    • Do the reads map within 100bp of the fragment ends?
    • What do the start and end sequences of fragments look like in theory, and what is observed?
  4. SNPs
    • How many SNPs do we see per sample?
    • Do GBS SNP calls agree with SNP chip data / WGS data?
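
For the theoretical fragment size distribution asked about under Mapping, an in silico digest of the reference is one possible starting point. The sketch below is illustrative only: it assumes a single-enzyme digest of a linear sequence, ignores size selection and methylation, and the names `ENZYMES`, `cut_positions`, `fragment_lengths` and `size_histogram` are hypothetical. The recognition sites used are BamHI (G^GATCC) and ApeKI (G^CWGC, W = A or T).

```python
import re
from collections import Counter

# Recognition pattern and cut offset (bases from the start of the site).
# BamHI cuts G^GATCC; ApeKI cuts G^CWGC where W is A or T.
ENZYMES = {
    "BamHI": ("GGATCC", 1),
    "ApeKI": ("GC[AT]GC", 1),
}


def cut_positions(seq, enzyme):
    """Return 0-based cut coordinates for one enzyme on one sequence."""
    site, offset = ENZYMES[enzyme]
    # A lookahead keeps overlapping sites, e.g. GCAGCAGC holds two ApeKI sites.
    pattern = re.compile("(?=%s)" % site)
    return [m.start() + offset for m in pattern.finditer(seq.upper())]


def fragment_lengths(seq, enzyme):
    """Lengths of internal fragments, i.e. those with a cut site at both ends."""
    cuts = cut_positions(seq, enzyme)
    return [b - a for a, b in zip(cuts, cuts[1:])]


def size_histogram(lengths, bin_width=50):
    """Count fragments per size bin, keyed by the lower edge of the bin."""
    return Counter((n // bin_width) * bin_width for n in lengths)


# Toy sequence with three BamHI sites (not real data).
toy = "AAAGGATCC" + "T" * 20 + "GGATCC" + "A" * 5 + "GGATCCTT"
lengths = fragment_lengths(toy, "BamHI")
print(lengths)
print(dict(size_histogram(lengths, bin_width=10)))
```

The resulting histogram could then be contrasted with the observed fragment sizes from the alignments.
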

mdavy86 commented Jan 27, 2015

That's good; many of the questions cover more detail than the last meeting minutes did.

We have some code investigating post-alignment QC: fragment distributions modelled as an exponential decay (where applicable), size selection bias relative to the population of known tag sites, depth distribution, and REML mixed model analysis of 96 technical samples for 6 genotypes.
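
As an illustration of the exponential decay idea, the sketch below fits counts ≈ a · exp(−k · length) by ordinary least squares on the log of the counts. This is an assumption about one simple way to do it, not the code referred to above; a log-linear fit weights bins differently from a nonlinear fit. The function name `fit_exponential_decay` is hypothetical and the input data are synthetic.

```python
import math


def fit_exponential_decay(lengths, counts):
    """Fit counts ~ a * exp(-k * length) via least squares on log(counts).

    Bins with a zero count are skipped because their log is undefined.
    Returns the pair (a, k).
    """
    pts = [(x, math.log(c)) for x, c in zip(lengths, counts) if c > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic fragment-size bins (bp) and counts generated from a known decay.
bins = [100, 150, 200, 250, 300, 350, 400]
counts = [1000 * math.exp(-0.01 * x) for x in bins]

a, k = fit_exponential_decay(bins, counts)
print("a = %.1f, k = %.4f per bp" % (a, k))
```

On real data, departures from the fitted curve at particular size ranges would be one way to look at size selection bias.
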

lranjard commented Feb 3, 2015

Link to fastq_screen, a utility that subsamples reads in FASTQ files to check for contamination against a configurable set of Bowtie2 genome indexes:
http://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/
