Skip to content

Preparing remove contamination reference data

martinghunt edited this page Oct 30, 2019 · 3 revisions

Background

The clockwork pipeline to remove contamination uses BWA to map to genomes, to decide if each read pair should be removed or not. This page describes how to use your genomes of choice for the pipeline. If you are processing Mtb data, then we recommend that you use the built-in script do make these files (please read how to set up reference genomes).

Methods

The method used to determine if a read pair is contamination is as follows. We need reference genomes, which are split into two subsets: 1) genomes that are contamination; 2) genomes that we are interested in that are not contamination. For example, for Mtb data we could use the human genome for contamination and the reference Mtb genome as our genome of interest. BWA MEM is used to map the reads. If a read or its mate (or both reads) map to a genome of interest, the read pair is kept. If neither of the reads map to a genome of interest, and one or both reads map to contamination, then the read pair is removed. If both reads are unmapped, then the read pair is kept. In particular, if a read maps to a contamination genome, but its mate maps to a genome of interest then the read pair is kept (this should be an unlikely scenario!).

The reference sequences must be put into groups. For example, all the human genome chromosomes could be put into a group called "human". Or you may have a collection of virus genomes, which could be put into a group called "virus". The pipeline outputs counts of reads that matched to each group, not sequence.

Input files

To use your own data, you will need to make two files:

  • A FASTA file with all the genome(s) (can be gzipped).
  • A tab-delimited file with information on whether each sequence in the FASTA file is from a genome to be kept, or to be removed. It should have no header line, and have three columns:
    1. The group name
    2. A 1 or 0. 0=not contamination. 1=contamination.
    3. Sequence name (which must also be in the FASTA file)

Important: there must be a one-to-one correspondence between the sequence names in the FASTA file and column 3 of the TSV file.

Here is a toy example. FASTA file:

>contam1
AAAAAAGCTAGCTAGCTCAT
>contam2
TCATGCATGACTGACG
>contam3
CTATCAGTCGATCGATACG
>Mtb
CTAGTGCATCAGTAGTGATGTG
>interesting_to_count
TCAGCATGACTGACTGCATGCATGACGT

TSV file:

contam_group1   1   contam1
contam_group1   1   contam2
contam_group2   1   contam3
reference   0   Mtb
other   0   interesting_to_count

In that example, we have two groups of contamination genomes, the reference genome, and another genome of interest. The interesting_to_count sequence is not the reference, but we also do not count it as contamination. This means that when the remove contamination pipeline is run, it will report how many reads matched interesting_to_count, but not remove them.

Using the input files

We need to process those input files by running clockwork reference_prepare, which will generate all the files needed to run the remove contamination pipeline (eg running BWA index etc). This needs to be run once only, before processing any samples.

Assume that we have made the FASTA and tab-delimited files, as described above, and they are called references.fasta and reference_info.tsv.

With a database

If you are using a database to track your samples, run this to use your data with the remove contamination pipeline:

clockwork reference_prepare \
  --db_config_file db.ini \
  --pipeline_references_root Pipeline_refs \
  --contam_tsv reference_info.tsv \
  --name my_name \
  references.fasta

Where db.ini should be the name of your database config file, and Pipeline_refs the pipeline references directory. These should have already been made when setting up the clockwork pipelines.

That will make a new reference called "my_name", and add a row to the Reference table in the database. The ID that it is given is the one that should be used with the --ref_id option when running the remove contamination pipeline. For example, you may see something like this in MySQL:

select * from Reference;
+--------------+----------------------+
| reference_id | name                 |
+--------------+----------------------+
|            1 | foo                  |
|            2 | my_name              |
+--------------+----------------------+

and you would need to use --ref_id 2.

Without a database

Run the command

clockwork reference_prepare \
  --contam_tsv reference_info.tsv  \
  --outdir OUT \
  references.fasta

That will create a new directory called OUT. It contains all the files needed to run the remove contamination script on one sample without a database, using the nextflow script remove_contam.nf. The relevant options to that nextflow script are --ref_fasta OUT/ref.fa and --ref_metadata_tsv OUT/remove_contam_metadata.tsv.