Read and write GFF3 data performantly. This module aims to be a complete implementation of the GFF3 specification.
- streaming parsing and streaming formatting
- proper escaping and unescaping of attribute and column values
- supports features with multiple locations and features with multiple parents
- reconstructs feature hierarchies of both Parent and Derives_from relationships
- parses FASTA sections
- does no validation except for Parent and Derives_from relationships
- only compatible with GFF3
$ npm install --save @gmod/gff
const gff = require('@gmod/gff').default
// or in ES6 (recommended)
import gff from '@gmod/gff'
// parse a file from a file name
// parses only features and sequences by default,
// set options to parse directives and/or comments
gff.parseFile('path/to/my/file.gff3', { parseAll: true })
.on('data', data => {
if (data.directive) {
console.log('got a directive',data)
}
else if (data.comment) {
console.log('got a comment',data)
}
else if (data.sequence) {
console.log('got a sequence from a FASTA section')
}
else {
console.log('got a feature',data)
}
})
// parse a stream of GFF3 text
const fs = require('fs')
fs.createReadStream('path/to/my/file.gff3')
.pipe(gff.parseStream())
.on('data', data => {
console.log('got item',data)
return data
})
.on('end', () => {
console.log('done parsing!')
})
// parse a string of gff3 synchronously
let stringOfGFF3 = fs
.readFileSync('my_annotations.gff3')
.toString()
let arrayOfThings = gff.parseStringSync(stringOfGFF3)
// format an array of items to a string
let formattedGFF3 = gff.formatSync(arrayOfThings)
// format a stream of things to a stream of text.
// inserts sync marks automatically.
myStreamOfGFF3Objects
.pipe(gff.formatStream())
.pipe(fs.createWriteStream('my_new.gff3'))
// format a stream of things and write it to
// a gff3 file. inserts sync marks and a
// '##gff-version 3' header if one is not
// already present
gff.formatFile(myStreamOfGFF3Objects, 'path/to/destination.gff3')
In GFF3, features can have more than one location. We parse features
as arrays of all the lines that share that feature's ID.
Values that are . in the GFF3 are null in the output.
A simple feature that's located in just one place:
[
{
"seq_id": "ctg123",
"source": null,
"type": "gene",
"start": 1000,
"end": 9000,
"score": null,
"strand": "+",
"phase": null,
"attributes": {
"ID": [
"gene00001"
],
"Name": [
"EDEN"
]
},
"child_features": [],
"derived_features": []
}
]
A CDS called cds00001 located in two places:
[
{
"seq_id": "ctg123",
"source": null,
"type": "CDS",
"start": 1201,
"end": 1500,
"score": null,
"strand": "+",
"phase": "0",
"attributes": {
"ID": [
"cds00001"
],
"Parent": [
"mRNA00001"
]
},
"child_features": [],
"derived_features": []
},
{
"seq_id": "ctg123",
"source": null,
"type": "CDS",
"start": 3000,
"end": 3902,
"score": null,
"strand": "+",
"phase": "0",
"attributes": {
"ID": [
"cds00001"
],
"Parent": [
"mRNA00001"
]
},
"child_features": [],
"derived_features": []
}
]
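Because a feature arrives as an array of location lines, code that consumes it usually loops over that array. The following is a minimal sketch using a trimmed copy of the cds00001 example above; the helper name totalLength is hypothetical and is not part of this module.

```javascript
// Trimmed copy of the two-location cds00001 feature shown above.
const cds = [
  { seq_id: 'ctg123', type: 'CDS', start: 1201, end: 1500, attributes: { ID: ['cds00001'] } },
  { seq_id: 'ctg123', type: 'CDS', start: 3000, end: 3902, attributes: { ID: ['cds00001'] } },
]

// Hypothetical helper: sum the lengths of every location of one feature.
// GFF3 coordinates are 1-based and inclusive, hence the "+ 1".
function totalLength(feature) {
  return feature.reduce((sum, loc) => sum + (loc.end - loc.start + 1), 0)
}

console.log(totalLength(cds)) // 300 + 903 = 1203
```

Treating every feature as an array, even one with a single location, lets the same loop handle both cases.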
parseDirective("##gff-version 3\n")
// returns
{
"directive": "gff-version",
"value": "3"
}
parseDirective('##sequence-region ctg123 1 1497228\n')
// returns
{
"directive": "sequence-region",
"value": "ctg123 1 1497228",
"seq_id": "ctg123",
"start": "1",
"end": "1497228"
}
parseComment('# hi this is a comment\n')
// returns
{
"comment": "hi this is a comment"
}
These come from any embedded ##FASTA section in the GFF3 file.
{
"id": "ctgA",
"description": "test contig",
"sequence": "ACTGACTAGCTAGCATCAGCGTCGTAGCTATTATATTACGGTAGCCA"
}
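The item shapes shown above (feature arrays, directives, comments, and sequences) can be told apart with a few property checks, much like the usage example does. This is only a sketch; itemKind is a hypothetical helper, not a function exported by this module.

```javascript
// Hypothetical helper: classify one parsed item by its shape.
function itemKind(item) {
  if (Array.isArray(item)) return 'feature' // features are arrays of location lines
  if ('directive' in item) return 'directive'
  if ('comment' in item) return 'comment'
  if ('sequence' in item) return 'sequence'
  return 'unknown'
}

console.log(itemKind({ directive: 'gff-version', value: '3' }))
console.log(itemKind({ comment: 'hi this is a comment' }))
console.log(itemKind({ id: 'ctgA', description: 'test contig', sequence: 'ACTG' }))
console.log(itemKind([{ seq_id: 'ctg123', type: 'gene' }]))
```

Using the in operator rather than a truthiness check keeps an empty comment from being misclassified.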
parseStream
Parse a stream of text data into a stream of feature, directive, and comment objects.
Parameters
- options Object optional options object (optional, default {})
- options.encoding string text encoding of the input GFF3. default 'utf8'
- options.parseAll boolean default false. if true, will parse all items. overrides other flags
- options.parseFeatures boolean default true
- options.parseDirectives boolean default false
- options.parseComments boolean default false
- options.parseSequences boolean default true
- options.bufferSize Number maximum number of GFF3 lines to buffer. defaults to 1000
Returns ReadableStream stream (in objectMode) of parsed items
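One way to read the flag defaults listed above, with parseAll overriding the per-type flags, is sketched below. It mirrors the documented behaviour only; resolveParseOptions is a hypothetical name and this is not the module's actual code.

```javascript
// Hypothetical helper: apply the documented defaults, then let parseAll win.
function resolveParseOptions(options = {}) {
  const resolved = {
    encoding: 'utf8',
    parseFeatures: true,
    parseDirectives: false,
    parseComments: false,
    parseSequences: true,
    bufferSize: 1000,
    ...options, // caller-supplied values replace the defaults
  }
  if (options.parseAll) {
    // parseAll overrides the individual flags
    resolved.parseFeatures = true
    resolved.parseDirectives = true
    resolved.parseComments = true
    resolved.parseSequences = true
  }
  return resolved
}

console.log(resolveParseOptions({ parseAll: true, parseComments: false }))
```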
parseFile
Read and parse a GFF3 file from the filesystem.
Parameters
- filename string the filename of the file to parse
- options Object optional options object
- options.encoding string the file's string encoding, defaults to 'utf8'
- options.parseAll boolean default false. if true, will parse all items. overrides other flags
- options.parseFeatures boolean default true
- options.parseDirectives boolean default false
- options.parseComments boolean default false
- options.parseSequences boolean default true
- options.bufferSize Number maximum number of GFF3 lines to buffer. defaults to 1000
Returns ReadableStream stream (in objectMode) of parsed items
parseStringSync
Synchronously parse a string containing GFF3 and return an array of the parsed items.
Parameters
Returns Array array of parsed features, directives, and/or comments
formatSync
Format an array of GFF3 items (features, directives, comments) into a string of GFF3. Does not insert synchronization (###) marks.
Parameters
items
Returns String the formatted GFF3
formatStream
Format a stream of items (of the type produced by this script) into a stream of GFF3 text.
Inserts synchronization (###) marks automatically.
Parameters
- options Object
formatFile
Format a stream of items (of the type produced by this script) into a GFF3 file and write it to the filesystem.
Inserts synchronization (###) marks and a ##gff-version directive automatically (if one is not already present).
Parameters
- stream ReadableStream the stream to write to the file
- filename String the file path to write to
- options Object (optional, default {})
Returns Promise promise for the written filename
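The "if one is not already present" rule for the ##gff-version directive can be pictured as below. This is a hedged sketch over an in-memory array rather than a stream; withVersionDirective is a hypothetical helper, and the module's real check may differ (for example, it may only look at the first item).

```javascript
// Hypothetical helper: prepend a gff-version directive unless one already exists.
function withVersionDirective(items) {
  const hasVersion = items.some(
    item => !Array.isArray(item) && item.directive === 'gff-version',
  )
  return hasVersion
    ? items
    : [{ directive: 'gff-version', value: '3' }, ...items]
}

const onlyComment = [{ comment: 'hi this is a comment' }]
console.log(withVersionDirective(onlyComment))
```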
There is also a util module that contains super-low-level functions for dealing with lines and parts of lines.
// non-ES6
const util = require('@gmod/gff').default.util
// or, with ES6
import gff from '@gmod/gff'
const util = gff.util
const gff3Lines = util.formatItem({
seq_id: 'ctgA',
...
})
- unescape
- escape
- parseAttributes
- parseFeature
- parseDirective
- formatAttributes
- formatFeature
- formatDirective
- formatComment
- formatSequence
- formatItem
unescape
Unescape a string value used in a GFF3 attribute.
Parameters
- s String
Returns String
escape
Escape a value for use in a GFF3 attribute value.
Parameters
- s String
Returns String
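GFF3 uses URL-style percent-escaping for characters that have special meaning in the attributes column. The sketch below shows the idea under the assumption that control characters plus %, ;, =, & and , are the ones to escape; escapeValue and unescapeValue are hypothetical names, and the exact character set handled by this module's escape and unescape may differ.

```javascript
// Hypothetical sketch: percent-escape control characters and % ; = & ,
function escapeValue(s) {
  return String(s).replace(/[\x00-\x1f\x7f%;=&,]/g, ch =>
    '%' + ch.charCodeAt(0).toString(16).toUpperCase().padStart(2, '0'),
  )
}

// Hypothetical sketch: turn every %XX sequence back into its character.
function unescapeValue(s) {
  return String(s).replace(/%([0-9A-Fa-f]{2})/g, (_, hex) =>
    String.fromCharCode(parseInt(hex, 16)),
  )
}

console.log(escapeValue('a;b=c,d')) // a%3Bb%3Dc%2Cd
console.log(unescapeValue('a%3Bb%3Dc%2Cd')) // a;b=c,d
```

Escaping % itself is what makes the round trip safe: a literal "%3B" in the input becomes "%253B" and decodes back unchanged.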
parseAttributes
Parse the 9th column (attributes) of a GFF3 feature line.
Parameters
- attrString String
Returns Object
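To make the shape of the returned object concrete, here is a rough sketch of how a 9th-column string maps to the attributes objects shown in the feature examples, where every value is an array. parseAttributesSketch is a hypothetical stand-in written for illustration, not this module's parseAttributes.

```javascript
// Hypothetical sketch: "tag=v1,v2;tag2=v3" -> { tag: ['v1', 'v2'], tag2: ['v3'] }
function parseAttributesSketch(attrString) {
  const attrs = {}
  if (!attrString || attrString === '.') return attrs
  const unescapeValue = s =>
    s.replace(/%([0-9A-Fa-f]{2})/g, (_, hex) => String.fromCharCode(parseInt(hex, 16)))
  for (const pair of attrString.replace(/\r?\n$/, '').split(';')) {
    const eq = pair.indexOf('=')
    if (eq === -1) continue // skip empty or malformed pairs
    const tag = pair.slice(0, eq).trim()
    // split on commas BEFORE unescaping, so an escaped %2C stays inside one value
    const values = pair.slice(eq + 1).split(',').map(v => unescapeValue(v.trim()))
    if (!attrs[tag]) attrs[tag] = []
    attrs[tag].push(...values)
  }
  return attrs
}

console.log(parseAttributesSketch('ID=cds00001;Parent=mRNA00001,mRNA00002;Note=a%3Bb'))
```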
parseFeature
Parse a GFF3 feature line.
Parameters
- line String
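As an illustration of how one tab-separated line relates to the feature objects shown earlier, the sketch below splits the nine columns, maps . to null, and converts start and end to numbers. It is a simplified assumption-laden stand-in (parseFeatureSketch is a hypothetical name) that leaves the attributes column as a raw string and does no unescaping.

```javascript
// Hypothetical sketch: split one GFF3 feature line into named columns.
function parseFeatureSketch(line) {
  const cols = line.replace(/\r?\n$/, '').split('\t')
  // "." (or a missing column) means "no value" and becomes null
  const get = i => (cols[i] === undefined || cols[i] === '.' ? null : cols[i])
  const num = i => (get(i) === null ? null : Number(get(i)))
  return {
    seq_id: get(0),
    source: get(1),
    type: get(2),
    start: num(3),
    end: num(4),
    score: num(5),
    strand: get(6),
    phase: get(7), // kept as a string, as in the examples above
    attributes: get(8), // raw 9th column; not parsed in this sketch
  }
}

console.log(parseFeatureSketch('ctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001;Name=EDEN\n'))
```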
parseDirective
Parse a GFF3 directive line.
Parameters
- line String
Returns Object the information in the directive
formatAttributes
Format an attributes object into a string suitable for the 9th column of GFF3.
Parameters
- attrs Object
formatFeature
Format a feature object or array of feature objects into one or more lines of GFF3.
Parameters
- featureOrFeatures
formatDirective
Format a directive into a line of GFF3.
Parameters
- directive Object
Returns String
formatComment
Format a comment into a GFF3 comment. Yes, I know this is just adding a # and a newline.
Parameters
- comment Object
Returns String
formatSequence
Format a sequence object as FASTA.
Parameters
- seq Object
Returns String formatted single FASTA sequence
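For a sense of what a single FASTA record looks like for the sequence objects shown earlier, here is a small sketch. formatSequenceSketch is a hypothetical name, and details such as line wrapping of long sequences are assumptions that may not match this module's output.

```javascript
// Hypothetical sketch: one sequence object -> one FASTA record.
function formatSequenceSketch(seq) {
  // the description is optional; when present it follows the id after a space
  const header = seq.description ? `${seq.id} ${seq.description}` : seq.id
  return `>${header}\n${seq.sequence}\n`
}

console.log(
  formatSequenceSketch({ id: 'ctgA', description: 'test contig', sequence: 'ACTGACTAGC' }),
)
```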
formatItem
Format a directive, comment, or feature, or array of such items, into one or more lines of GFF3.
Parameters
MIT © Robert Buels