This repository contains tools to separate sequences from different sources by composition, as described here: https://cobiontid.github.io
In many cases, samples of target organisms collected in the wild contain sequences from additional organisms. Identifying the source of a given sequence can be challenging if there are few reference datasets available from sufficiently closely related species. However, differences in sequence composition can nevertheless be used to separate different components of a sample.
Learning two-dimensional embeddings of sequence composition (in this case tetranucleotide counts) with a Variational Autoencoder (VAE) provides a framework to visually explore long-read datasets and detect contaminants or organisms interacting with the target. Sequence characteristics, such as estimated coding density and approximate read coverage, provide additional clues about the contents of the sample. For example, even without taxonomic labels, a microbe could be distinguished from an insect based on its higher density of coding sequences.
A preprint describing the approach in detail is available here: https://www.biorxiv.org/content/10.1101/2024.05.30.596622v1. In addition to the VAE-based workflow for reads, the repository includes some tools to assess sequence assemblies. The documentation in this repository is currently still under construction.
Tallies k-mers in a read set (tetranucleotides by default), reduces the counts to two dimensions and visualises read clusters. Annotates read plots with additional sequence features, such as estimated coding density, approximate k-mer coverage and sequence k-mer diversity.
Decomposed read tetranucleotides from Erannis defoliaria indicate the presence of bacteria in the sample (top). In this static plot, the reads are coloured by estimated coding density. The resulting data can also be explored interactively.
As with the reads, tallies and reduces tetranucleotide composition to two dimensions and plots with annotations. In addition to estimated coding density and k-mer diversity, FastK provides a measure of repetitiveness, and coverage for primary Hifiasm assemblies can be extracted and used to annotate the plots. A selection tool allows sequences of interest to be selected and downloaded with their annotations. Where Hi-C data are available, a SALSA or YaHS pair file may also be provided to annotate plots with scaffold connectivity information. Take a look at an interactive version of the plot here.
Read k-mer counts are reduced to two dimensions following the method of Kingma and Welling (2013). Outputs a two-dimensional representation of the read set and a basic plot.
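To make the Kingma and Welling approach a little more concrete, the sketch below shows two generic ingredients of any VAE: the reparameterisation step that samples a latent point from the encoder's mean and log-variance, and the KL term that pulls the latent distribution towards a standard normal. It is a plain-Python illustration, not this repository's implementation; the function names are hypothetical, and the network architecture and reconstruction loss used here are not specified in this README.

```python
import math
import random


def reparameterise(mu, log_var, rng=random):
    """Sample z = mu + sigma * eps elementwise, with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, diag(exp(log_var))) and N(0, I)."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


def negative_elbo(reconstruction_error, mu, log_var):
    """Training objective: reconstruction error plus the KL regulariser."""
    return reconstruction_error + kl_to_standard_normal(mu, log_var)


# A two-dimensional latent code, matching the 2D embeddings described above.
z = reparameterise([0.3, -1.2], [0.0, 0.0], rng=random.Random(0))
```

After training, the encoder mean for each read would typically serve as its two-dimensional coordinate in a plot.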
Generate colour-coded plots of 2D representations learned by the VAE.
Interactively filter and query annotated 2D representations of read data.
Workflow and utilities to generate an interactive HTML file of decomposed tetranucleotide plots with binned annotations.
Counts the number of occurrences of each k-mer of size k for each record in a FASTA file of nucleotide sequences (canonicalised or non-canonicalised). Implemented in Rust; runs approximately ten times faster than the equivalent code in Python.
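For readers who want to see what per-record k-mer counting involves, here is a minimal Python sketch. It is not the Rust tool from this repository: the function names are hypothetical, and skipping windows that contain characters other than A, C, G or T is an assumption made for the example. Canonicalisation is taken to mean collapsing each k-mer with its reverse complement and keeping the lexicographically smaller of the two.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of an upper-case ACGT string."""
    return kmer.translate(COMPLEMENT)[::-1]


def count_kmers(sequence, k=4, canonical=True):
    """Tally the k-mers of one sequence (assumption: non-ACGT windows are skipped)."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) - set("ACGT"):
            continue
        if canonical:
            # Collapse a k-mer and its reverse complement into one key.
            kmer = min(kmer, reverse_complement(kmer))
        counts[kmer] += 1
    return counts


def read_fasta(lines):
    """Yield (record_id, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = (line[1:].split() or [""])[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Toy input for illustration only.
toy_fasta = [">read1 example", "ACGTACGT", ">read2", "AAAAN", "AAAA"]
per_record = {name: count_kmers(seq) for name, seq in read_fasta(toy_fasta)}
```

With canonical counting of tetranucleotides there are 136 possible keys rather than 256, which shrinks the input to the dimensionality reduction step.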
Counts the number of distinct k-mers of size k for each record in a FASTA file of nucleotide sequences, and divides by sequence length. Implemented in Rust.
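The diversity measure described above can be sketched in a few lines of Python. This is an illustration rather than the Rust implementation: the function name is hypothetical, and k-mers are not canonicalised here, which is an assumption.

```python
def kmer_diversity(sequence, k=4):
    """Number of distinct k-mers in a sequence divided by the sequence length."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    distinct = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
    return len(distinct) / len(sequence)


low = kmer_diversity("AAAAAAAA")   # a homopolymer contains a single distinct 4-mer
high = kmer_diversity("ACGTACGT")  # four distinct 4-mers in eight bases
```

Low values point to low-complexity or repetitive sequence, while values closer to 1 mean that most positions start a k-mer not seen elsewhere in the record.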
Estimates the coding density using the sum of lengths of putative coding sequences divided by sequence length. The cobiont pipelines previously used a modified version of the old hexamer code. The relevant functionality is now available in an updated version of hexamer from https://github.com/richarddurbin/hexamer (to extract the estimated density, pipe stdout to awk '{ print $3/$2 }').
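The ratio itself is simple, as the following Python sketch shows. It does not reproduce hexamer or parse its output: the function name and the representation of putative coding sequences as half-open (start, end) intervals are hypothetical, chosen only to illustrate the definition given above.

```python
def coding_density(cds_intervals, sequence_length):
    """Summed length of putative coding sequences divided by sequence length.

    cds_intervals: iterable of (start, end) pairs, half-open (an assumption
    made for this example).
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    coding_length = sum(end - start for start, end in cds_intervals)
    return coding_length / sequence_length


# Toy values for illustration: two putative CDS on a 1,000 bp sequence.
density = coding_density([(0, 300), (500, 800)], 1000)
```

Gene-dense bacterial sequences would be expected to score higher on this measure than sequences from most eukaryotic genomes, which is what makes it a useful plot annotation.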
Calculates the median number of times each k-mer of size k (in this case k = 31) occurs across the whole set of sequences. Provides an approximation of coverage for reads (provided they are not highly repetitive), or repetitiveness for assembled contigs or scaffolds.
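One way to picture this statistic is the Python sketch below, which first tallies every k-mer across the whole dataset and then, for each record, takes the median of those dataset-wide counts over the record's own k-mers. Reporting one median per record is an interpretation of the description above, the function names are hypothetical, and an in-memory counter like this would only be practical for small inputs.

```python
from collections import Counter
from statistics import median


def iter_kmers(sequence, k):
    """Yield every k-mer of an upper-cased sequence."""
    sequence = sequence.upper()
    return (sequence[i:i + k] for i in range(len(sequence) - k + 1))


def median_kmer_counts(sequences, k=31):
    """Map each record id to the median dataset-wide count of its k-mers.

    sequences: dict of record id -> nucleotide string.
    """
    totals = Counter()
    for seq in sequences.values():
        totals.update(iter_kmers(seq, k))
    medians = {}
    for name, seq in sequences.items():
        counts = [totals[kmer] for kmer in iter_kmers(seq, k)]
        # Records shorter than k have no k-mers; report 0 for them.
        medians[name] = median(counts) if counts else 0
    return medians


# Toy example with k = 3 so the counts are easy to follow by hand.
toy = {"a": "ACGTA", "b": "ACGTT", "c": "GGGGG"}
result = median_kmer_counts(toy, k=3)
```

Using the median rather than the mean keeps a handful of highly repeated k-mers within a read from inflating its coverage estimate.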
If you use any of the code in this repository, please cite: Weber CC, 2024. Disentangling cobionts and contamination in long-read genomic data using sequence composition. G3 Genes | Genomes | Genetics, https://doi.org/10.1093/g3journal/jkae187