Skip to content

richarddurbin/gaffer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

gaffer - a GFA toolkit

The main product here is gaffer, with currently a single source file gaffer.c. Its initial task is to read in a HiFiAsm unitig file, carry out various QC and cleanup operations, read and write assemblies into ONEcode files, and convert from an overlap to a blunt-ended graph.

The project also contains relevant versions of various Durbin package utility programs:

  • ONEview to view ONEcode files, and convert them between binary and ascii. By convention .1seq files are binary, .seq files are ascii.
  • ONEstat to give stats for 1code files.
  • composition to give information about the data in DNA sequence files: fasta[.gz], fastq[.gz], 1seq (if compiled with -DONEIO, as by default in this project), SAM/BAM/CRAM (if compiled with -DBAMIO, see below), and a custom binary sequence file format (now deprecated with 1seq preferred).
  • seqconvert to convert between DNA file types. This also will homopolymer compress (-H "hoco"), and if writing to a 1seq file while doing this will store the offsets in the original file of each new sequence base position, allowing to unhoco (-U) back to the original sequence. Because ONEcode files compress all lists, this

For each program, running it without any arguments gives usage information.

Building

  git clone https://github.com/richarddurbin/gaffer.git
  make

If you want to be able to read SAM/BAM/CRAM files then you need to install htslib in a parallel directory and use Makefile.bam:

  cd ..
  git clone https://github.com/samtools/htslib.git
  cd htslib
  autoreconf -i  # Build the configure script and install files it uses
  ./configure    # Optional but recommended, for choosing extra functionality
  make
  make install
  cd ../gaffer
  make clean
  make -f Makefile.bam

Libraries

There are various useful libraries, with header files:

  • ONElib.[hc] supports ONEcode file reading and writing, with implementation in the single file ONElib.c with no dependendencies. See https://github.com/VGP/vgp-tools/tree/master/Core for further information.
  • seqio.[hc] supports reading, writing DNA files with a few other basic operations. Implementation in seqio.c, with dependencies on utils.[hc], libz, ONElib and htslib depending on compile operations.
  • utils.[hc] some very low level type definitions (e.g. I8 to I64 and U8 to U64), die(), warn(), new(), new0(), and a timing package. No dependencies beyond normal C run time library. NB there is a handy fzopen() which will silently open .gz files as standard files, but this depends on funopen() which is not available on all systems. If this does not compile/link then you will need to undefine WITH_ZLIB to link.
  • array.[ch], dict.[ch], hash.[ch] respectively implement advanced language style extendable arrays, dictionaries (hashes of strings) and general hashes of basic types (up to 64-bit).

Synopsis

An example usage pattern is given below. Note that as of 17/8/2022 the bluntification step has bugs, so the result is not correct.

> gaffer
Usage: gaffer <commands>
   -read  <stem>   : read ONE files
   -write <stem>   : write ONE files
   -schema         : print ONE schema
   -readGfa <gfa file> : only reads S and L lines for now
   -dna <sequence file matching S line names>
   -removeBadLinks : (for now) remove imperfect overlaps
   -blunt          : makes new non-overlapping graph
   -depth <gfa file with A lines>
   -extend         : adds match blocks for shared incoming edges - needs seq
       
> gaffer -readGfa TEST/daStaPalu1.r_utg.gfa.gz -write daStaPalu1-r
WARNING: ** self link from/to utg010600c 21198 overlap 0 - ignoring
WARNING: ** self link from/to utg010600c 21199 overlap 0 - ignoring
WARNING: ** self link from/to utg017205c 34408 overlap 0 - ignoring
WARNING: ** self link from/to utg017205c 34409 overlap 0 - ignoring
WARNING: ** self link from/to utg018266c 36530 overlap 0 - ignoring
WARNING: ** self link from/to utg018266c 36531 overlap 0 - ignoring
read 48356 S lines and 131656 L lines from GFA file TEST/daStaPalu1.r_utg.gfa.gz
removed 62412 duplicate links, leaving 69244
readGfa: user   122.342262      system  1.100602        max_RSS 1079476224      memory  5966385532
wrote 24178 objects to daStaPalu1-r.1seg    // 1code file of the segments
wrote 24178 objects to daStaPalu1-r.1sgs    // 1code seq file of the DNA sequences of the segments
wrote 69244 objects to daStaPalu1-r.1lnk    // 1code file of the links
write: user     2.277581        system  1.384311        max_RSS 2242887680      memory  5966385549
total: user     124.629004      system  3.074676        max_RSS 3326738432      memory  5966385549

> gaffer -read daStaPalu1-r -removeBadLinks -blunt -write daStaPalu1-blunt
read file daStaPalu1-r.1lnk with 24178 segs
read file daStaPalu1-r.1lnk with 69244 links
read file daStaPalu1-r.1sgs with 24178 sequences
read: user      9.287462        system  0.607038        max_RSS 1794195456      memory  8941142529
10414 imperfect overlaps removed, 58830 remain
removeBadLinks: user    1.296138        system  0.227495        max_RSS 0       memory  8941142529
243080 initial new links before compression
made blunt gfa with 231179 seqs, 160533 links and 48356 walks
all walks check out
blunt: user     34.994954       system  1.648977        max_RSS 2273067008      memory  9007107609
wrote 115590 objects to daStaPalu1-blunt.1seg
wrote 160533 objects to daStaPalu1-blunt.1lnk
wrote 48356 objects to daStaPalu1-blunt.1wlk
write: user     0.113076        system  0.462926        max_RSS 0       memory  9007108270
total: user     45.695350       system  2.947524        max_RSS 4067262464      memory  9007108270

> seqconvert -o daStaPalu1-r.fa.gz daStaPalu1-r.1sgs    // -fa -z are implied by the output filename
reading from file type onecode  with 24178 sequences totLen 2979016643
written 24178 sequences to file type fasta, total length 2979016643, max length 6381111
user    484.026538      system  2.124296        max_RSS 54951936        memory  46323317    // writing gzip files is very slow

> seqconvert -1 -H -o daStaPalu1-r.hoco.1seq daStaPalu1-r.1sgs 
reading from file type onecode  with 24178 sequences totLen 2979016643
written 24178 sequences to file type onecode, total length 2067571198, max length 4464555
user    47.692720       system  2.497245        max_RSS 2053259264      memory  24086839758

> seqconvert -fa -z -U -o daStaPalu1-r.unhoco.fa.gz daStaPalu1-r.hoco.1seq
reading from file type onecode  with 24178 sequences totLen 2067571198
written 24178 sequences to file type fasta, total length 2979016643, max length 6381111
user    559.758163      system  2.878647        max_RSS 107282432       memory  108422711

> ls -l TEST/daStaPalu1.r_utg.gfa.gz daStaPalu* 
-rw-r--r--  1 rd  staff  847123532 Mar 23 13:34 TEST/daStaPalu1.r_utg.gfa.gz
-rw-r--r--  1 rd  staff    1707152 Aug 17 12:35 daStaPalu1-blunt.1lnk
-rw-r--r--  1 rd  staff     460153 Aug 17 12:35 daStaPalu1-blunt.1seg
-rw-r--r--  1 rd  staff    1188974 Aug 17 12:35 daStaPalu1-blunt.1wlk
-rw-r--r--  1 rd  staff     903589 Aug 17 12:28 daStaPalu1-r.1lnk
-rw-r--r--  1 rd  staff     365580 Aug 17 12:28 daStaPalu1-r.1seg
-rw-r--r--  1 rd  staff  745015260 Aug 17 12:28 daStaPalu1-r.1sgs
-rw-r--r--  1 rd  staff  815204880 Aug 17 13:17 daStaPalu1-r.fa.gz
-rw-r--r--  1 rd  staff  890157073 Aug 17 14:22 daStaPalu1-r.hoco.1seq
-rw-r--r--  1 rd  staff  815204880 Aug 17 14:57 daStaPalu1-r.unhoco.fa.gz

> ONEview -h daStaPalu1-r.1lnk | head  // -h does not show the header, just the body
L 0 > 1767 <
O 12648
L 0 < 18185 <
O 10209
L 0 < 20414 <
O 4869
L 1 > 841 >
O 9344
L 1 > 2020 >
O 7852

> ONEview -H daStaPalu1-r.1sgs   // -H shows only the header, with the schema and object statistics
1 3 seq 1 1
2 3 sgs
.
~ O S 1 3 DNA             sequence: the DNA string
~ D I 1 6 STRING          id - sequence identifier; unnecessary for segments
~ D P 1 8 INT_LIST        kmer ploidy estimates - 0 for error, 1,2... for ploidy, R for repeat
~ D Q 1 6 STRING          base quality - Q values (ascii string = q+33)
.
# S 24178         // count of S lines
@ S 6381111       // maximum length of list (in this case DNA) in an S line
+ S 2979016643    // total lengths of all lists in S lines
.

> ./composition -b -l -t daStaPalu1-r.1sgs 
onecode file, 24178 sequences >= 0, 2979016643 total, 123211.87 average, 1884 min, 6381111 max
bases
  a 948668488 31.8 %
  c 540474258 18.1 %
  g 540752467 18.2 %
  t 949121430 31.9 %
approximate N50 932093
length distribution (quadratic bins)        // quadratic bins naturally scale for Poisson counts
  15951         4878
  63806         15381
  143565        834
  255227        576
  398792        553
  574260        481
  781632        365
  1020908       330
  1292086       271
  1595169       148
  1930154       140
  2297043       86
  2695835       46
  3126531       40
  3589130       16
  4083632       13
  4610038       9
  5168347       4
  5758560       5
  6380676       2
user    9.034954        system  0.165228        max_RSS 39763968        memory  30384856

> composition -t daStaPalu1-r.fa.gz
fasta file, 24178 sequences >= 0, 2979016643 total, 123211.87 average, 1884 min, 6381111 max
user    19.503632       system  0.221412        max_RSS 16875520        memory  16780520      // note that this twice as slow when doing less

Note that not only are the 1seq files smaller than compressed fasta, they also read much faster because parsing is trivial (composition reads in each sequence, even though the basic counts are available in the 1seq header.

About

a GFA toolkit

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published