GALA is a Gap-free Long-read Assembler. GALA builds a multi-layer graph from different preliminary assemblies, long reads, and potentially other sources of information, such as Hi-C assemblies. During this process, it identifies mis-assembled contigs and trims them. The corrected data are then partitioned into multiple scaffolding groups, each representing a single chromosome. Each scaffolding group is assembled independently with existing assembly tools, and a simplified version of an overlap-graph-based merging algorithm is used to merge multiple contigs if necessary.
GALA has three modules, each of which can be used separately.
GALA assembled a human genome from HiFi reads. A Canu draft for CHM13 and the current human reference genome GRCh38.p13 were used as input, so GALA essentially created a reference-guided de novo assembly. The GALA assembly comprises 37 continuous contigs, including 8 telomere-to-telomere gap-free pseudomolecular sequences, 4 near-complete chromosomes each with a small unanchored telomeric fragment, 3 with only gapped centromeric regions, and the long arm of acrocentric chromosomes.
GALA can be run directly from the gala folder
git clone https://github.com/ganlab/gala.git
cd gala
Or you can run install to add it to your PATH.
Assembling a genome with the GALA pipeline involves preliminary steps and three main steps.
Use different software to construct preliminary assemblies from long reads (e.g. Canu, Flye, MECAT, Miniasm, and Wtdbg2).
- Raw reads and corrected reads if available.
- The user needs to prepare a draft_names_paths.txt file for the preliminary assemblies. Here is an example:
draft_01=path/to/draft_fasta_file
draft_02=path/to/draft_fasta_file
draft_03=path/to/draft_fasta_file
draft_n=path/to/draft_fasta_file
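Before launching a run, it can help to check that the file follows the name=path layout shown above. The following is a minimal Python sketch, not part of GALA: the function name `parse_draft_names` and the example FASTA file names are hypothetical, and the rule that blank lines are ignored is an assumption.

```python
from pathlib import Path


def parse_draft_names(text):
    """Parse `name=path` lines into a {draft_name: Path} mapping.

    Assumption: one entry per line, blank lines ignored, names unique.
    """
    drafts = {}
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        name, sep, path = line.partition("=")
        name, path = name.strip(), path.strip()
        if not sep or not name or not path:
            raise ValueError(f"line {line_no}: expected name=path, got {raw!r}")
        if name in drafts:
            raise ValueError(f"line {line_no}: duplicate draft name {name!r}")
        drafts[name] = Path(path)
    return drafts


# Hypothetical file content, mirroring the example layout.
example = """\
draft_01=path/to/canu.fasta
draft_02=path/to/flye.fasta
"""
drafts = parse_draft_names(example)
for name, path in drafts.items():
    # Report whether each draft file is actually present on disk.
    print(name, path, "exists" if path.is_file() else "missing")
```

Running a check like this first catches a mistyped path before any alignment time is spent.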
To run GALA with a single command, use:
gala draft_names_paths.txt fa/fq reads_file platform
In single-command mode, GALA uses Canu for the chromosome-by-chromosome assembly by default.
To use another assembler or several assemblers, pass one or more of the three supported choices (canu, flye, miniasm) to the -a argument, separated by single spaces.
The sequencing_platform argument must be given as one of the following:
-pacbio-raw
-pacbio-corrected
-nanopore-raw
-nanopore-corrected
usage: gala -h [options] <draft_names & paths> <fa/fq> <reads> <platform>
GALA Gap-free Long-read Assembler
positional arguments:
draft_names Draft names and paths [required]
input_file input type (fq/fa) [required]
reads raw/corrected reads [required]
sequencing_platform -pacbio-raw -pacbio-corrected -nanopore-raw -nanopore-corrected [required]
optional arguments:
-h, --help show this help message and exit
-a [ASSEMBLER [ASSEMBLER ...]] Chr-by_Chr assembler (canu flye miniasm) [default canu]
-b Alignment block length [default 5000]
-p Alignment identity percentage [default 70%]
-c Shortest contig length [default 5000]
-q Mapping quality [default 20]
-f Output files name [default gathering]
-o output files path [default current directory]
-v, --version show program's version number and exit
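When GALA is driven from a script, the argument list can be assembled and validated before it is executed. The sketch below is an illustration only: `build_gala_command` is a hypothetical helper, and placing -a after the positional arguments is an assumption made here so that the multi-valued option cannot absorb them. Only the flags and choices listed in the help text above are used.

```python
import shlex

# Choices taken from the help text above.
PLATFORMS = ("-pacbio-raw", "-pacbio-corrected", "-nanopore-raw", "-nanopore-corrected")
ASSEMBLERS = ("canu", "flye", "miniasm")


def build_gala_command(drafts_file, read_format, reads, platform, assemblers=()):
    """Return the argv list for a single-command GALA run (hypothetical helper)."""
    if read_format not in ("fa", "fq"):
        raise ValueError(f"read format must be fa or fq, got {read_format!r}")
    if platform not in PLATFORMS:
        raise ValueError(f"unknown platform {platform!r}")
    unknown = [a for a in assemblers if a not in ASSEMBLERS]
    if unknown:
        raise ValueError(f"unsupported assembler(s): {unknown}")
    cmd = ["gala", drafts_file, read_format, reads, platform]
    if assemblers:
        # Assumption: options are accepted after the positional arguments.
        cmd += ["-a", *assemblers]
    return cmd


cmd = build_gala_command(
    "draft_names_paths.txt", "fq", "reads.fq", "-pacbio-raw",
    assemblers=["flye", "miniasm"],
)
print(shlex.join(cmd))  # pass `cmd` to subprocess.run(cmd, check=True) to execute
```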
- Use the comp module to generate a draft_comparison file:
  comp draft_names_paths.txt
- Run the draft_comparison file to produce the draft comparison paf files:
  sh draft_compare.sh
- Use the mdm module to identify mis-assembled contigs:
  mdm comparison_folder <number of assembly drafts>
- Use the newgenome module to produce misassembly-free drafts:
  newgenome draft_names_paths.txt cut_folder
- Use the comp module to generate a draft_comparison file for the misassembly-free drafts:
  comp new_draft_names_paths.txt
- Run the draft_comparison file to produce the new draft comparison paf files:
  sh draft_compare.sh
- Run the ccm module to produce the contig scaffolding groups:
  ccm comparison_folder <number of assembly drafts>
- Note: You can also use the reformat module to generate reformatted paf files and use them to confirm the scaffolding groups.
- Map the raw long reads, and the self-corrected reads if available, to each misassembly-free draft:
  bwa index <misassembly-free draft>
  bwa mem -x pacbio/ont2d <misassembly-free draft> <long-reads>
- Use the following commands to separate the read names mapped to each contig:
  samtools view -H bam_file |grep "SQ"|cut -f 2|cut -d : -f 2 > contig_names
  seprator contig_names mapping.bam
  sh bam_seprator.sh
  for i in bams/*; do samtools view $i | cut -f 1 > $i.read_names;done;
- Use the cat command to concatenate the read name files that belong to the same scaffolding group. For example:
  cat contig_1.bam.read_names contig_3.bam.read_names contig_7.bam.read_names > scaffold_1.read_names
- Use the readsep module to separate the reads that belong to each scaffold:
  for i in scaffold_*.read_names; do readsep <raw/corrected reads> $i -f <input reads file type fa/fq>; done
- Apply the chromosome-by-chromosome assembly approach to obtain the gap-free chromosome-scale assembly: assemble each read set from scaffold_*.read.fq with different assembly software (e.g. Canu, Flye, MECAT, Miniasm, and Wtdbg2). We recommend trying different assembly tools, especially Flye, MECAT/NECAT, and Miniasm.
- Finally, map the SGAM outcomes against one of the preliminary draft assemblies to confirm that all the contigs in the scaffolding group are assembled to the right chromosome/scaffold.
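The read-name concatenation step in the list above can also be scripted when there are many scaffolding groups. The following Python sketch is one possible approach and not a GALA module: `merge_read_names` is a hypothetical helper. Unlike plain cat, it drops duplicate read names, which is an added design choice for reads that map to more than one contig in the group.

```python
import tempfile
from pathlib import Path


def merge_read_names(contig_files, out_file):
    """Write the union of read names from contig_files to out_file.

    Returns the number of distinct read names written.
    """
    seen = set()
    with open(out_file, "w") as out:
        for path in contig_files:
            with open(path) as handle:
                for line in handle:
                    name = line.strip()
                    if name and name not in seen:
                        seen.add(name)
                        out.write(name + "\n")
    return len(seen)


# Small demonstration with made-up read names in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "contig_1.bam.read_names").write_text("read_a\nread_b\n")
    (tmp / "contig_3.bam.read_names").write_text("read_b\nread_c\n")
    out = tmp / "scaffold_1.read_names"
    count = merge_read_names(
        [tmp / "contig_1.bam.read_names", tmp / "contig_3.bam.read_names"], out
    )
    merged = out.read_text().split()
print(count, merged)
```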
The comp module generates a genome comparison file for users who want to compare multiple genomes against each other.
usage: comp -h [options] <draft_names & paths>
Generate genome comparison files, part of GALA Gap-free Long-read Assembler
positional arguments:
drafts Draft names and paths [required]
optional arguments:
-h, --help show this help message and exit
-o output files path [default current directory]
-v, --version show program's version number and exit
The Mis-assembly Detector Module (mdm) detects misassembled contigs. The algorithm relies on contradictory information in the alignments.
The mis-assembly detection module should be applicable for error correction regardless of the specific algorithm used for assembly, and it can differentiate between misassembly and structural variation.
usage: mdm -h [options] path/to/mapping_files number of drafts
MDM Mis-assembly Detector Module, part of GALA Gap-free Long-read Assembler
positional arguments:
mapping_files mapping paf file [required]
drafts Number of drafts [required]
optional arguments:
-h, --help show this help message and exit
-b Alignment block length [default 5000]
-p Alignment identity percentage [default 70%]
-c Shortest contig length [default 5000]
-q Mapping quality [default 20]
-f Output files name [default gathering]
-o output files path [default current directory]
-v, --version show program's version number and exit
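To see what the four thresholds mean for a single alignment, the sketch below applies them to PAF records in Python. It is an illustration under stated assumptions, not mdm's implementation: identity is taken as matching bases divided by alignment block length (PAF columns 10 and 11), the contig-length threshold is applied to both query and target, all comparisons are inclusive, and the names `PafRecord`, `parse_paf_line` and `passes_filters` are hypothetical.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    """The 12 mandatory PAF columns."""
    query: str
    query_len: int
    query_start: int
    query_end: int
    strand: str
    target: str
    target_len: int
    target_start: int
    target_end: int
    matches: int
    block_len: int
    mapq: int


def parse_paf_line(line):
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("a PAF line needs at least 12 tab-separated columns")
    return PafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                     f[5], int(f[6]), int(f[7]), int(f[8]),
                     int(f[9]), int(f[10]), int(f[11]))


def passes_filters(rec, min_block=5000, min_identity=70.0,
                   min_contig=5000, min_mapq=20):
    """Assumed semantics of -b, -p, -c and -q; defaults copied from the help text."""
    identity = 100.0 * rec.matches / rec.block_len if rec.block_len else 0.0
    return (rec.block_len >= min_block
            and identity >= min_identity
            and min(rec.query_len, rec.target_len) >= min_contig
            and rec.mapq >= min_mapq)


# Made-up alignments: a long confident one and a short one.
long_hit = parse_paf_line("ctg1\t100000\t0\t50000\t+\tctgA\t120000\t1000\t51000\t48000\t50000\t60")
short_hit = parse_paf_line("ctg2\t8000\t0\t3000\t+\tctgA\t120000\t0\t3000\t2900\t3000\t60")
print(passes_filters(long_hit), passes_filters(short_hit))
```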
The newgenome module trims the misassembled contigs and produces a misassembly-free genome. This module is used only with multiple samples.
usage: newgenome -h [options] <draft_names & paths> <path to cut files>
Produce mis-assembly free genomes, part of GALA Gap-free Long-read Assembler
positional arguments:
draft Draft names and paths [required]
cut_files path_to_cut_files" [required]
optional arguments:
-h, --help show this help message and exit
-f Output files name [default new_genome]
-o output files path [default current directory]
-v, --version show program's version number and exit
The Contig Clustering Module (ccm) identifies the scaffolding groups and the contig overlap information in multiple preliminary assemblies.
ccm could have extended applications in generating a consensus assembly from multiple sequences. It is also useful in reference-guided scaffolding to determine chromosome scaffolding groups.
usage: ccm -h [options] <path/to/mapping_files> <number of drafts>
CCM Contig Clustering Module, part of GALA Gap-free Long-read Assembler
positional arguments:
mapping_files mapping paf file [required]
drafts Number of drafts [required]
optional arguments:
-h, --help show this help message and exit
-b Alignment block length [default 5000]
-p Alignment identity percentage [default 70%]
-c Shortest contig length [default 5000]
-q Mapping quality [default 20]
-f Output files name [default scaffolds]
-o output files path [default current directory]
-v, --version show program's version number and exit
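The idea of a scaffolding group can be illustrated with a small connected-components computation: contigs from different drafts that are linked by accepted alignments end up in the same group. The Python sketch below uses a union-find structure for this. It is a simplified stand-in chosen for illustration, not the algorithm ccm implements, and the `draft:contig` labels and the function name `cluster_contigs` are hypothetical.

```python
def cluster_contigs(links):
    """Group contig labels into connected components.

    links: iterable of (contig_a, contig_b) pairs, one per accepted alignment.
    Returns a sorted list of sorted groups.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two components

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())


# Made-up alignment links between contigs of three drafts.
links = [
    ("d1:ctg1", "d2:ctg4"),
    ("d2:ctg4", "d3:ctg2"),
    ("d1:ctg7", "d2:ctg9"),
]
groups = cluster_contigs(links)
for number, group in enumerate(groups, start=1):
    print(f"group_{number}", group)
```

A plain connected-components pass merges groups whenever a single spurious alignment bridges two chromosomes, which is one reason mis-assembly detection and alignment filtering come before clustering in the pipeline.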
The reformat module filters the alignment data in paf mapping files and merges overlapping and contiguous alignment intervals into a single mapping interval, so each contig in the query draft has one alignment interval with the subject draft.
usage: reformat -h [options] <path/to/mapping_files> <number of drafts>
Re-formatting mapping files module, part of GALA Gap-free Long-read Assembler
positional arguments:
mapping_files mapping paf file [required]
drafts Number of drafts [required]
optional arguments:
-h, --help show this help message and exit
-b Alignment block length [default 5000]
-p Alignment identity percentage [default 70%]
-c Shortest contig length [default 5000]
-q Mapping quality [default 20]
-f Output files name [default reformated]
-o output files path [default current directory]
-v, --version show program's version number and exit
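Merging overlapping and adjacent intervals is a standard sort-and-sweep operation, sketched here in Python. This is a generic illustration rather than the reformat module's code: `merge_intervals` and its `max_gap` parameter are hypothetical, and the sketch keeps separate intervals when a real gap remains between them.

```python
def merge_intervals(intervals, max_gap=0):
    """Merge (start, end) intervals that overlap or lie within max_gap of each other."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + max_gap:
            # Extend the current interval instead of starting a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


# Made-up alignment intervals of one query contig on one subject contig.
blocks = [(0, 5000), (4000, 9000), (9000, 12000), (20000, 26000)]
print(merge_intervals(blocks))
print(merge_intervals(blocks, max_gap=10000))
```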
The seprator module separates the contig alignments into individual bam files and writes the read names mapped to each contig to an individual file.
usage: seprator -h [options] <contig_names> <bam_file>
Separate each contig correlated read names, part of GALA Gap-free Long-read Assembler
positional arguments:
contig_names contig_names [required]
bam_file mapping bam file [required]
optional arguments:
-h, --help show this help message and exit
-o output files path [default current directory]
-f Output files name [default bam_seprator]
-b output folder name [default bams]
-v, --version show program's version number and exit
Use the following command to produce the contig_names file:
samtools view -H <bam_file> |grep 'SQ'|cut -f 2|cut -d : -f 2 > contig_names
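For reference, the same contig list can be derived from the SAM header text in Python. This sketch is an alternative under stated assumptions, not part of GALA: `contig_names_from_header` is a hypothetical helper that reads the SN tag of each @SQ line. Unlike the cut-based pipeline, it keeps contig names that contain a colon and ignores non-@SQ lines that merely contain the letters SQ.

```python
def contig_names_from_header(header_lines):
    """Return the SN value of every @SQ line, in header order."""
    names = []
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue
        for field in line.rstrip("\n").split("\t")[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
                break
    return names


# Made-up header, as printed by `samtools view -H`.
header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:tig00000001\tLN:1500000",
    "@SQ\tSN:tig00000002\tLN:820000",
    "@PG\tID:bwa\tPN:bwa",
]
contigs = contig_names_from_header(header)
print("\n".join(contigs))
```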
The readsep module separates a set of reads from a sequencing dataset according to the read name in the definition line.
usage: readsep -h [options] <reads> <read_titles>
Extract reads from fasta or fastq, part of GALA Gap-free Long-read Assembler
positional arguments:
reads raw/corrected reads [required]
read_titles read names [required]
optional arguments:
-h, --help show this help message and exit
-f input file format (fa/fq)
-v, --version show program's version number and exit
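The selection of reads by name can be pictured with a few lines of Python. The sketch below is not readsep's implementation: it assumes four-line FASTQ records, takes the read name to be the first whitespace-delimited token after the leading @, and `extract_fastq_reads` is a hypothetical name.

```python
import io


def extract_fastq_reads(fastq, wanted):
    """Yield 4-line FASTQ records whose read name is in `wanted`.

    fastq: a text file object; wanted: a set of read names.
    """
    while True:
        header = fastq.readline()
        if not header:
            break  # end of file
        seq, plus, qual = fastq.readline(), fastq.readline(), fastq.readline()
        parts = header[1:].split()
        name = parts[0] if parts else ""
        if name in wanted:
            yield header + seq + plus + qual


# Made-up reads; in practice `wanted` would come from a scaffold_*.read_names file.
reads = io.StringIO(
    "@read_a extra\nACGT\n+\nIIII\n"
    "@read_b\nTTGA\n+\nIIII\n"
    "@read_c\nGGCC\n+\nIIII\n"
)
wanted = {"read_a", "read_c"}
selected = list(extract_fastq_reads(reads, wanted))
print("".join(selected), end="")
```

Holding the wanted names in a set keeps each lookup constant-time, so the reads file is streamed once regardless of how many names are requested.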
GALA is distributed under the MIT license. See the LICENSE file for details.