Version 0.7.0
A flexible and modular software suite for domain-based gene neighborhood and protein search, extraction, and clustering.
- Overview and key concepts
- Program descriptions
- Requirements
- Installation
- Updating
- Examples
- Using Domainator as a python library
- Credits
Please use our discussions page or create an issue
Domainator provides several dozen discrete, flexible programs that can be composed into a broad range of workflows via command line or python scripting. Domainator also treats HMM-profiles as first-class citizens, supporting subsetting of .hmm files and comparison of HMM-profiles, including the construction of profile similarity networks and trees.
Domainator uses the GenBank file format as a carrier of both sequence and annotation data. Independence from a fixed set of sequence sources and the co-location of sequences and all their annotation data in a single file increases data portability and decreases complexity for end-users. Domainator can add functional annotations to sequences by comparison either to HMM-profiles, protein sequences, or both at the same time. For example, in a single call to domainate, a set of genome or metagenome contigs can be annotated with hits to Pfam HMM-profiles and hits to REBASE Gold Standard protein sequences at the same time.
The individual programs that make up Domainator can be roughly classified into six categories corresponding to their typical roles in workflows. The first steps in most workflows typically involve passing sequence data through one or more editors.
Editors are programs whose output format is the same as their input format. Each individual editor performs a simple task, but they can be combined in arbitrarily long chains to accomplish complex transformations. Examples are domain_search, which outputs a subset of the input sequences, based on the presence of a hit to a reference sequence or profile; domainate, which outputs all the input sequence but adds domain annotations based on hits to reference sequences; deduplicate_genbank, which performs similarity clustering using CD-HIT or usearch and outputs only the cluster representatives input sequences; and select_by_cds, which extracts genome neighborhoods around domains of interest.
In Domainator programs, perhaps differently from other software, the file being edited is supplied via the -i
argument and criteria for editing is supplied via other arguments, such as the -r
argument for reference sequences or hmm profiles. For example searching for hits to a query in the UniProt database may be accomplished via domain_search.py -i uniprot.fasta -r query.hmm -o hits.gb
, similarly annotating a set of contigs with Pfam annotations may be accomplished with domainate.py -i unannotated.gb -r pfam-A.hmm -o annotated.gb
.
Summary report programs summarize data into graphs and statistics, for example, the number of sequences in a file and the count of each kind of domain. Record-wise report programs produce tab-separated files, for example where each row corresponds to a genome contig, a protein, or a domain, and values are data such as length, taxonomy ID, domain content, etc.
Record-wise reports are useful for exporting data to programs, such as Excel, which can’t read GenBank or hmm files, and they also find use as intermediary files between some programs in Domainator.
Comparison programs generate pairwise score or distance matrices between contigs or HMM-profiles. Compare_contigs uses the Jaccard and adjacency indexes to compare proteins or gene neighborhoods based on their domain content, whereas seq_dist uses local alignment scores, for example via phmmer , Diamond , hmmsearch , or the Viterbi profile-comparison algorithm. Comparison programs output data matrices.
Plotting programs convert data into formats appropriate for graphical visualization, for example converting score matrices and tabular metadata into trees or similarity networks which can be viewed in Cytoscape or other external visualization tools, depending on the data type.
Finally, there are few programs that defy categorization, these programs perform functions such as downloading data from NCBI or UniProt, converting files between formats, or generating profile-profile alignments.
For detailed documentation for each program, call the program with the -h
option.
Programs that can accept streaming input will read genbank formatted data from stdin
when not supplied with an explicit input. Programs that need to pass over the input multiple times can't support streaming input.
Programs that can produce streaming output will write genbank formatted data to stdout
when not supplied with an explicit output.
You can use shell pipes to move data from a program that supports streaming ouput into a program that supports streaming input (but note that it is more efficient to use domainator as a python library, see below).
Program | Description | input | output | streaming input | streaming output | notes |
---|---|---|---|---|---|---|
build_projection.py | generates 2d or 3d umap, tsne, or pca projections | a dense or sparse data matrix, could be a sequence similarity matrix, or any other kind of data matrix. optionally, a table of node metadata. | typically a Cyoscape xgmml file or a png image file. Other outputs are possible. | no | no | |
build_ssn.py | generates a Cytoscape file with a sequence similarity network (SSN) | a sparse matrix of network edges. optionally, a table of node metadata | typically a Cyoscape xgmml file, other SSN related files are possible | no | no | |
build_tree.py | Converts a distance matrix into a UPGMA tree | a distance matrix | Newick or Cytoscape file | no | no | Can attach annotations to cytoscape files, based on supplied metadata. |
color_genbank.py | creates color qualifiers on CDS and/or domainator features based on domainator annotations | domainator-annotated genbank files. optionally a tab-separated map of domain names to hex colors. | a genbank file | yes | yes | To auto color most domains but give explicit colors to some, run twice. First without a tsv map, then with a tsv map. |
color_table_to_legend.py | Generates an SVG legend based on a color table | tab separated file of annotation names and color codes | an svg file | no | no | |
compare_contigs.py | calculates similarities between contig, based on metrics such as domain adjacency index and domain jaccard index. | domainator-annotated genbank files | Genbank file where contigs are sorted according to hierarchical clustering of similarity. sparse or dense similarity matrices. | no | no | If you want to pipe genbank output you can explicitly supply -o - . Other kinds of output planned for the future, such as phylogenetic trees of contigs. |
deduplicate_genbank.py | runs a clustering algorithm, such as cdhit or usearch on sequences from a genbank file to reduce redundancy. | genbank nucleotide or peptide files | a genbank file | no | yes | |
domain_search.py | Search a small number of hmms against a database | genbank (protein genbank, or nucleotide genbank with CDS annotations) or protein fasta files to search through. hmmer3 hmm (or peptide fasta) files to search for in the sequences. | a genbank file | no | yes | |
domainate.py | annotates contig or protein sequences with protein domains | genbank (protein genbank, or nucleotide genbank with CDS annotations) or protein fasta files to annotate. hmmer3 hmm (or peptide fasta) files to search for in the sequences. | a genbank file with new annotations | no | yes | Consider domain_search.py if searching for small numbers of hmms against large databases. |
domainator_db_download.py | Download databases from the internet in formats appropriate for domainator | None | genbank or fasta | N/A | use -o /dev/stdout |
|
enum_report.py | from a genbank file generates a tab separated file of properties and metadata related to the input, for example the domain content of each contig or cds. | genbank files | a tab separated or html metadata table | yes | yes | |
extract_domains.py | Takes a domainator-genbank file, looks for the specified domains, and extracts them from the contigs. | a (nucleotide or peptide) genbank file. | a genbank or fasta file of the same sequence-type (nucleotide or peptide) as the input | yes | yes | If you want to extract peptide domains from a nucleotide genbank, pass the output of this program into extract_peptides (or vice-versa) |
extract_peptides.py | extracts translated peptide sequences from nucleotide genbank sequences, preserves the domainator annotations (other annotations are lost) | nucleotide genbank file | peptide genbank file | yes | yes | |
extract_unannotated.py | Takes a domainator-genbank file, extracts regions not covered by any domain annotations | a peptide genbank file. | a peptide genbank or fasta file | yes | yes | |
filter_domains.py | applies more stringent filtering to domainator-annotated genbank files. | domainator-annotated genbank files | a genbank file | yes | yes | If you want to annotate the same genbank file with multiple evalue or max_overlap cutoffs, it is often faster to run domainator.py first with loose cutoffs, then run filter_domains.py with more stringent cutoffs. To clear all annotations, just run filter_domains.py with an evalue of -1 |
genbank_to_fasta.py | converts genbank files to fasta files | one or more genbank files | a fasta file | yes | yes | |
hmmer_build.py | runs hmmbuild. Unlike the hmmbuild from the HMMER3 package, hmmer_build.py allows specifying accession and description from the command line | an MSA | a .hmm file | no | no | |
hmmer_compare.py | runs local profile alignment, calculating pairwise hmm scores and position-wise alignments | hmmer3 profiles | scores and (optionally) alignments | yes | no | hmmer_compare, hmmer_search, and seq_dist can all be used to compare profiles, with different outputs. |
hmmer_report.py | Writes a tabular report from .hmm files, with a line for each profile | one or more .hmm files | a tab separated or html metadata table | yes | yes | |
hmmer_search.py | Selects hmm profiles based on similarity scores to a reference set of hmm profiles | hmmer3 profiles | hmmer3 profiles | yes | yes | hmmer_compare, hmmer_search, and seq_dist can all be used to compare profiles, with different outputs. |
hmmer_select.py | Extracts a subset of profiles from .hmm files | one or more .hmm files | an hmm files | yes | yes | |
matrix_report.py | generates a histogram and summary statistics from a data matrix | a dense or sparse data matrix | a report with graphics and statistics | no | no | |
partition_seqfile.py | Scans through a genbank or fasta file and finds file byte offsets to partition the file into subsets. | a genbank or fasta file | tab separted text offset, recs |
no | yes | This is useful for parallelizing searches, in combination with the --offset and --recs_to_read parameters of domainate.py and domain_search.py |
plot_contigs.py | Creates graphical representations of contigs with their annotations. | a genbank or fasta file | an html file | yes | no | |
select_by_cds.py | Extract contigs or contig regions based on CDS names or domain content within individual CDSs | domainator-annotated nucleotide genbank files | a genbank file | yes | yes | Supports boolean logic for domain selector statements. If --pad option is set, then output streaming is not in real-time and memory usage will be higher, because all records need to be read before padding sizes can be calculated. |
select_by_contig.py | Extract a subset of contigs based on domain content of the entire contig | domainator-annotated genbank files | a genbank file | yes | yes | Supports boolean logic for domain selector statements. |
seq_dist.py | calculates similarity scores between protein sequences | two peptide sequence files (genbank or fasta), a sequence file and an hmm file, or two hmm files. | a sparse or dense matrix of similarity scores | no | no | Typically the input and reference will be the same sequence file, creating a pairwise similarity matrix. It can also use an hmm file as a reference to make a table of profile scores for each input peptide. |
summary_report.py | Calculates statistics such as number of contigs, average contig length, and frequency of each domain | a domainator-annotated genbank file | an html file of the report, with figures, statistics etc | yes | no | 'focus domains' can be specified, which will make the output include information about the co-occurrence and distance of those domains with other domains. Can also write a report to the console. |
transform_matrix.py | Converts matrices in score format to matrices in various normalized score formats | a dense or sparse data matrix of scores | a dense or sparse data matrix of normalized scores | no | no | |
trim_contigs.py | trim the ends of contigs | nucleotide or protein genbank or fasta files | a genbank file | yes | yes | Supports boolean logic for domain selector statements. |
Domainator is developed on Linux, and has been tested on Mac. It probably won't work on Windows, except through WSL.
NOTE: For Installation with ESM-2 3B 3Di and foldseek integration, see here
If you don't already have a conda distribution installed, install one, such as miniconda.
or
Go to the Domainator github page and download and extract the zip file for the latest release (see the right side of the github page), or download it via git clone
.
Extract the zip file. In a terminal, navigate to the newly extracted folder and install the package with the command:
conda env create -f conda_env.yml
This will create a new conda environment called "domainator".
conda activate domainator
Navigate to the directory containing this README file. It should also contain a file called apptainer.def
. In that directory execute the command:
apptainer build domainator.sif domainator.def
Or (if you have singularity not apptainer)
singularity build domainator.sif domainator.def
Then to execute domainator commands, prefix any command with
apptainer exec domainator.sif [command here]
# for example:
apptainer exec domainator.sif domainate.py -i test/data/pDONR201_multi_genemark.gb -r test/data/pdonr_hmms.hmm -o example_out.gb
Download and extract the new version, navigate to the folder and install with:
conda env create --yes -f conda_env.yml
This will delete the old domainator
conda environment and replace it with the updated version.
Activate the domainator conda environment
conda activate domainator
To run domainator on a single file
# download and extract Pfam
wget https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
# run Domainator
domainate.py --cpu 4 --max_overlap 0.6 -i your_genbank.gb -r Pfam-A.hmm -o domainator_output.gb
summary_report.py -i domainator_output.gb --html domain_summary.html
enum_report.py -i domainator_output.gb --by contig --name --taxname superkingdom genus species self --length --domains --html enum_report.html -o enum_report.tsv
plot_contigs.py -i domainator_output.gb --html contigs_plot.html
For more examples see: Basic Examples, and Advanced examples.
Most executables in Domainator are based on a core function that acts as a editors on BioPython SeqRecord objects. So you can chain them within python scripts, keeping sequences as SeqRecords instead of writing and reading Genbank files. Chaining the functions will be similar to piping the output of one program to another, but will be faster because superfluous conversions from Genbank to SeqRecord and back again will be avoided. A caveat is that you lose some input and output validation (See also Interoperability with BioPython).
The general way to access these functions is from domainator.[script_name] import [script_name]
.
Domainator is not directly intercompatible with BioPython. It uses SeqRecord
objects derived from BioPython SeqRecords
, but with some changes and extensions. A modified and extended subset of BioPython is included with Domainator as domainator.Bio
. For reading and writing domainator-compatible sequence files we strongly recommend using the domainator.utils.parse_seqfiles
and domainator.utils.write_genbank
functions, respectively.
Johnson, Sean R., Peter Weigele, Alexey Fomenkov, Andrew Ge, Anna Vincze, James B. Eaglesham, Richard J. Roberts, and Zhiyi Sun. “Domainator, a Flexible Software Suite for Domain-Based Annotation and Neighborhood Analysis, Identifies Proteins Involved in Antiviral Systems.” 2024, Nucleic Acids Research, gkae1175. https://doi.org/10.1093/nar/gkae1175.
- Sean Johnson
- Peter Weigele
- Zhiyi Sun
- Andrew Ge
- Yu-Cheng Lin