-
Notifications
You must be signed in to change notification settings - Fork 22
Preparing remove contamination reference data
The clockwork pipeline to remove contamination uses BWA to map to genomes, to decide if each read pair should be removed or not. This page describes how to use your genomes of choice for the pipeline. If you are processing Mtb data, then we recommend that you use the built-in script do make these files (please read how to set up reference genomes).
The method used to determine if a read pair is contamination is as follows. We need reference genomes, which are split into two subsets: 1) genomes that are contamination; 2) genomes that we are interested in that are not contamination. For example, for Mtb data we could use the human genome for contamination and the reference Mtb genome as our genome of interest. BWA MEM is used to map the reads. If a read or its mate (or both reads) map to a genome of interest, the read pair is kept. If neither of the reads map to a genome of interest, and one or both reads map to contamination, then the read pair is removed. If both reads are unmapped, then the read pair is kept. In particular, if a read maps to a contamination genome, but its mate maps to a genome of interest then the read pair is kept (this should be an unlikely scenario!).
The reference sequences must be put into groups. For example, all the human genome chromosomes could be put into a group called "human". Or you may have a collection of virus genomes, which could be put into a group called "virus". The pipeline outputs counts of reads that matched to each group, not sequence.
To use your own data, you will need to make two files:
- A FASTA file with all the genome(s) (can be gzipped).
- A tab-delimited file with information on whether each sequence in
the FASTA file is from a genome to be kept, or to be removed.
It should have no header line, and have three columns:
- The group name
- A 1 or 0. 0=not contamination. 1=contamination.
- Sequence name (which must also be in the FASTA file)
Important: there must be a one-to-one correspondence between the sequence names in the FASTA file and column 3 of the TSV file.
Here is a toy example. FASTA file:
>contam1
AAAAAAGCTAGCTAGCTCAT
>contam2
TCATGCATGACTGACG
>contam3
CTATCAGTCGATCGATACG
>Mtb
CTAGTGCATCAGTAGTGATGTG
>interesting_to_count
TCAGCATGACTGACTGCATGCATGACGT
TSV file:
contam_group1 1 contam1
contam_group1 1 contam2
contam_group2 1 contam3
reference 0 Mtb
other 0 interesting_to_count
In that example, we have two groups of contamination genomes,
the reference genome, and another genome of interest. The
interesting_to_count
sequence is not the reference, but
we also do not count it as contamination. This means that when
the remove contamination pipeline is run, it will report how many
reads matched interesting_to_count
, but not remove them.
We need to process those input files by running
clockwork reference_prepare
, which will generate
all the files needed to run the remove contamination pipeline
(eg running BWA index etc). This needs to be run once only,
before processing any samples.
Assume that we have made the FASTA and tab-delimited files, as
described above, and they are called references.fasta
and
reference_info.tsv
.
If you are using a database to track your samples, run this to use your data with the remove contamination pipeline:
clockwork reference_prepare \
--db_config_file db.ini \
--pipeline_references_root Pipeline_refs \
--contam_tsv reference_info.tsv \
--name my_name \
references.fasta
Where db.ini
should be the name of your database config file,
and Pipeline_refs
the pipeline references directory.
These should have already been
made when setting up the clockwork pipelines.
That will make a new reference called "my_name", and
add a row to the Reference table in the database.
The ID that it is given is the one that should be used
with the --ref_id
option when running the remove
contamination pipeline. For example, you may
see something like this in MySQL:
select * from Reference;
+--------------+----------------------+
| reference_id | name |
+--------------+----------------------+
| 1 | foo |
| 2 | my_name |
+--------------+----------------------+
and you would need to use --ref_id 2
.
Run the command
clockwork reference_prepare \
--contam_tsv reference_info.tsv \
--outdir OUT \
references.fasta
That will create a new directory called OUT
. It contains
all the files needed to run the remove contamination script
on one sample without a database, using the nextflow script
remove_contam.nf
. The relevant options to that
nextflow script are --ref_fasta OUT/ref.fa
and
--ref_metadata_tsv OUT/remove_contam_metadata.tsv
.