PanDelos: a dictionary-based method for pan-genome content discovery
The PanDelos software is organized in two Python modules and a Java library that are piped together by a bash script.
The bash script, pandelos.sh, provides the access point to the PanDelos pipeline.
bash pandelos.sh <input.faa> <output_prefix>
The complete set of (gene) sequences, <input.faa>, belonging to any of the studied genomes, must be provided as a text file.
For each sequence, two lines are reported in the file. The first is an identification line composed of three parts separated by a tab character: the genome identifier, the gene identifier and the gene product.
After the identification line, the complete gene sequence in FASTA amino acid format is reported on a single line. No blank lines are allowed between the identification line and the sequence line, nor between genes.
A valid file is given by the following example listing 4 genes from 2 genomes:
NC_000913 b0001@NC_000913:1 thr operon leader peptide
MKRISTTITTTITITTGNGAG
NC_000913 b0024@NC_000913:1 uncharacterized protein
MCRHSLRSDGAGFYQLAGCEYSFSAIKIAAGGQFLPVICAMAMKSHFFLISVLNRRLTLTAVQGILGRFSLF
NC_002655 Z_RS03160@NC_002655:1 hok/gef family protein
MLTKYALVAVIVLCLTVPGFTLLVGDSLCEFTVKERNIEFRAVLAYEPKK
NC_002655 Z_RS03165@NC_002655:1 protein HokE
MLTKYALVAVIVLCLTVLGFTLLVGDSLCEFTVKERNIEFKAVLAYEPKK
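The layout rules above can be checked before running the pipeline. The following is a minimal sketch, not part of PanDelos: the function name check_pandelos_input is hypothetical, and it only tests the structure described in this section (two lines per gene, three tab-separated parts in the identification line, no blank lines).

```python
def check_pandelos_input(text):
    """Return a list of (line_number, message) problems; empty if none are found.

    Hypothetical helper, not shipped with PanDelos. It assumes odd-numbered
    lines are identification lines and even-numbered lines are sequences.
    """
    problems = []
    lines = text.splitlines()
    if len(lines) % 2 != 0:
        problems.append((len(lines), "odd number of lines; each gene needs two"))
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((number, "blank line"))
        elif number % 2 == 1 and len(line.split("\t")) != 3:
            problems.append((number, "identification line needs 3 tab-separated parts"))
    return problems
```

A file could be checked with check_pandelos_input(open("input.faa").read()); an empty list means no structural problem was detected. Note that a blank line shifts the odd/even pairing, so it is best to fix reported problems from the top of the file down.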
IMPORTANT: make sure the gene identifiers are unique within the input file. Commonly used file formats for sharing genome annotations do not require the locus tags of different copies of the same gene to be unique.
We suggest using the following format to build unique gene identifiers:
gene_identifier@genome_identifier:unique_integer
The fields gene_identifier and genome_identifier are the same as those reported in the input file, while unique_integer is used to distinguish multiple copies of the same gene (same gene identifier) within the same genome. The integer starts from 1 and is incremented according to the order in which genes are written in the input file.
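The numbering scheme can be sketched as follows. This is an illustrative helper, not code from PanDelos; the name make_unique_ids and its input shape (a list of genome/gene identifier pairs in file order) are assumptions.

```python
from collections import defaultdict


def make_unique_ids(genes):
    """Build gene_identifier@genome_identifier:unique_integer strings.

    genes: iterable of (genome_identifier, gene_identifier) pairs, given in
    the order in which the genes appear in the input file.
    """
    counters = defaultdict(int)  # copies seen so far per (genome, gene) pair
    unique = []
    for genome_id, gene_id in genes:
        counters[(genome_id, gene_id)] += 1
        unique.append(f"{gene_id}@{genome_id}:{counters[(genome_id, gene_id)]}")
    return unique
```

The counter is kept per genome and gene pair, so the same gene identifier restarts from 1 in each genome.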
The examples provided in the examples folder generate 4 different dataset files, having the .faa extension, which can be consulted.
The execution of PanDelos produces an output file named [out_prefix].clus, which reports the gene families retrieved by the software.
Each row of the output file represents a specific gene family retrieved by PanDelos.
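For downstream analysis, the output can be loaded one family per row. The sketch below assumes that the gene identifiers within a row are separated by whitespace, which this document does not state; check a real .clus file before relying on it. The function name parse_clus is hypothetical.

```python
def parse_clus(text):
    """Return a list of gene families, each a list of gene identifiers.

    Assumption (not confirmed by the README): identifiers on a row are
    whitespace-separated. Empty rows are skipped.
    """
    families = []
    for row in text.splitlines():
        members = row.split()
        if members:
            families.append(members)
    return families
```

With this representation, len(family) gives the number of genes in a family, and the genome of each member can be read from the part of the identifier after the @ sign when the suggested identifier format is used.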
PanDelos can run on any operating system where Bash, Python 3 (or higher) and Java SE Runtime 8 (or higher) are installed.
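As a quick pre-flight check, the sketch below reports which required commands cannot be found on the PATH. The helper name missing_tools and the executable names bash, python3 and java are assumptions for illustration; on some systems the interpreter is installed as python, and the check does not verify versions.

```python
import shutil


def missing_tools(tools=("bash", "python3", "java")):
    """Return the names in `tools` that shutil.which cannot locate on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing commands:", ", ".join(missing))
    else:
        print("All required commands were found.")
```

The same helper can be called as missing_tools(("javac", "jar")) before recompiling the Java library.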
The current repository contains a pre-compiled version of the internal Java library needed by PanDelos. The library has been pre-compiled with Java 8; however, it can be compiled on the user's system by running the script compile.sh inside the ig folder. In case of success, the script replaces ig.jar with a new version.
Alternatively, you may run the following instructions from the ig folder:
javac -classpath ext/commons-io-2.6.jar -sourcepath ./ infoasys/cli/pangenes/Pangenes.java
jar cvf ig.jar infoasys/
The compilation requires that the commands javac and jar are available on your system PATH.
The script run_examples.sh inside the examples folder will run the benchmarks used in the scientific article of PanDelos.
cd examples
bash run_examples.sh
The benchmarks cover 4 datasets from which the pan-genome content has been extracted:
- 7 isolates of the Typhi serotype of Salmonella enterica
- 14 Xanthomonas campestris
- 10 Escherichia coli
- 64 Mycoplasma
The identifiers of the isolates are stored in the .list.txt files.
Genomes are downloaded from the NCBI repositories by means of the download.sh script in the form of GenBank gbk files.
The efetch tool, available via the Entrez E-Utilities toolkit, must be installed in order to download the required genome files from the NCBI database.
The script gbk2ig.py merges the gbk files and produces a single benchmark file that is used as input for the PanDelos pipeline.
The script quality.py calculates statistics about the extracted pan-genome content and prints them.
The script gbk2ig.py (located inside the examples folder) can also be used to analyse a set of genomes provided in GBK format. The script takes as input the folder where the files are stored (every file with extension .gbk is taken into account) and the name of an output file. The resultant file can be used as input for the PanDelos pipeline.
PanDelos is distributed under the MIT license. This means that it is free for both academic and commercial use. Note however that some third-party components in PanDelos require that you reference certain works in scientific publications. You are free to link or use PanDelos inside the source code of your own program. If you do so, please reference (cite) PanDelos and this website. We appreciate bug fixes and would be happy to collaborate on improvements.
If you have used any of the PanDelos project software, please cite the following paper:
Bonnici, V., Giugno, R., Manca, V.
PanDelos: a dictionary-based method for pan-genome content discovery
BMC bioinformatics 19.15 (2018): 437.