Read CRAM files (indexed or unindexed) in pure JavaScript. Works in Node.js or in the browser.
- Reads CRAM 3.x and 2.x
- Does not read CRAM 1.x
- Can use .crai indexes out of the box, for efficient sequence fetching, but also has an index API that would allow use with other index types
- Does not implement bzip2 or lzma codecs (yet), as these are rarely used in-the-wild; if this is important to your use case, please file an issue
$ npm install --save @gmod/cram
# or
$ yarn add @gmod/cram
const { IndexedCramFile, CramFile, CraiIndex } = require('@gmod/cram')
// Use the indexedfasta library for seqFetch if using a local file (see below)
const { IndexedFasta, BgzipIndexedFasta } = require('@gmod/indexedfasta')
const t = new IndexedFasta({
  path: '/filesystem/yourfile.fa',
  faiPath: '/filesystem/yourfile.fa.fai',
})
// open local files
const indexedFile = new IndexedCramFile({
  cramPath: '/filesystem/yourfile.cram',
  index: new CraiIndex({
    path: '/filesystem/yourfile.cram.crai',
  }),
  seqFetch: async (seqId, start, end) => {
    // note:
    // * seqFetch should return a promise for a string, in this instance retrieved from IndexedFasta
    // * we use start-1 because cram-js uses 1-based but IndexedFasta uses 0-based coordinates
    // * the seqId is a numeric identifier
    return t.getSequence(seqId, start - 1, end)
  },
  checkSequenceMD5: false,
})
// Example of fetching records from an indexed CRAM file.
// NOTE: only numeric IDs for the reference sequence are accepted.
// For indexedfasta, the numeric ID is the order in which the sequence names appear in the header.
// Wrap in an async function and then run it
const run = async () => {
  const records = await indexedFile.getRecordsForRange(0, 10000, 20000)
  records.forEach(record => {
    console.log(`got a record named ${record.readName}`)
    record.readFeatures.forEach(({ code, pos, refPos, ref, sub }) => {
      // process the "read features". these can be used similarly to
      // CIGAR/MD strings in SAM. see CRAM specs for more details.
      if (code === 'X') {
        console.log(
          `${record.readName} shows a base substitution of ${ref}->${sub} at ${refPos}`,
        )
      }
    })
  })
}
run()
// You can also pass `cramUrl` (for the IndexedCramFile class) and `url` (for the CraiIndex) params to open remote URLs.
// Alternatively, `cramFilehandle` (for the IndexedCramFile class) and `filehandle` (for the CraiIndex) can be used; see https://github.com/gmod/generic-filehandle for examples.
- CramRecord - format of CRAM records returned by this API
- ReadFeatures - format of read features on records
- IndexedCramFile - indexed access into a CRAM file
- CramFile - .cram API
- CraiIndex - .crai index API
- Error Classes - special error classes thrown by this API
These are the record objects returned by this API. Much of the data is stored in them as simple object entries, but there are some accessor methods for conveniently getting the values of each of the flags in the flags and cramFlags fields.
- flags (number): the SAM bit-flags field, see the SAM spec for interpretation. Some of the is* methods below interpret this field.
- cramFlags (number): the CRAM-specific bit-flags field, see the CRAM spec for interpretation. Some of the is* methods below interpret this field.
- sequenceId (number): the ID number of the record's reference sequence
- readLength (number): length of the read in bases
- alignmentStart (number): start coordinate of the alignment on the reference in 1-based closed coordinates
- readGroupId (number): ID number of the read group, or -1 if none
- readName (string): name of the read
- templateSize (number): for paired sequencing, the total size of the template
- readFeatures (array[ReadFeature]): array of read features showing insertions, deletions, mismatches, etc. See ReadFeatures for their format.
- lengthOnRef (number): span of the alignment along the reference sequence
- mappingQuality (number): SAM mapping quality
- qualityScores (array[number]): array of numeric quality scores
- uniqueId (number): unique ID number of the record within the file
- mate (object)
  - flags (number): CRAM mapping flags for the mate. See CRAM spec for interpretation. Some of the is* methods below interpret this field.
  - sequenceId (number): reference sequence ID for the mate mapping
  - alignmentStart (number): start coordinate of the mate mapping. 1-based coordinates.
Returns boolean true if the read is paired, regardless of whether both segments are mapped
Returns boolean true if the read is paired, and both segments are mapped
Returns boolean true if the read itself is unmapped; mutually exclusive with isProperlyPaired
Returns boolean true if the read itself is unmapped; mutually exclusive with isProperlyPaired
Returns boolean true if the read is mapped to the reverse strand
Returns boolean true if the mate is mapped to the reverse strand
Returns boolean true if this is read number 1 in a pair
Returns boolean true if this is read number 2 in a pair
Returns boolean true if this is a secondary alignment
Returns boolean true if this read has failed QC checks
Returns boolean true if the read is an optical or PCR duplicate
Returns boolean true if this is a supplementary alignment
Returns boolean true if the read is detached
Returns boolean true if the read has a mate in this same CRAM segment
Returns boolean true if the read contains qual scores
Returns boolean true if the read has no sequence bases
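To illustrate what flag accessors like these compute, here is a minimal standalone sketch that decodes the SAM bit-flags field using the bit values from the SAM specification. The helper `decodeSamFlags` and the key names in `SAM_FLAG_BITS` are hypothetical and are not part of @gmod/cram; the CRAM-specific cramFlags bits are not covered here.

```javascript
// SAM flag bits as defined in the SAM specification.
// The key names are illustrative and are not the library's method names.
const SAM_FLAG_BITS = {
  paired: 0x1,
  properlyPaired: 0x2,
  segmentUnmapped: 0x4,
  mateUnmapped: 0x8,
  reverseComplemented: 0x10,
  mateReverseComplemented: 0x20,
  read1: 0x40,
  read2: 0x80,
  secondary: 0x100,
  failedQc: 0x200,
  duplicate: 0x400,
  supplementary: 0x800,
}

// Hypothetical helper (not part of @gmod/cram): turns the numeric
// flags field into an object with one boolean per SAM flag bit.
function decodeSamFlags(flags) {
  const decoded = {}
  for (const [name, bit] of Object.entries(SAM_FLAG_BITS)) {
    decoded[name] = (flags & bit) !== 0
  }
  return decoded
}

// 99 = 0x1 + 0x2 + 0x20 + 0x40
const exampleFlags = decodeSamFlags(99)
console.log(exampleFlags.paired, exampleFlags.read1, exampleFlags.segmentUnmapped)
```

In practice the accessor methods on the record should be preferred; a helper like this is only useful when working with a bare flags number.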
Get the original sequence of this read.
Returns String sequence basepairs
Annotates this feature with the given reference sequence basepair information. This will add a sub and a ref item to base substitution read features given the actual substituted and reference base pairs, and will make the getReadSequence() method work.
Parameters
- refRegion (object)
- compressionScheme (CramContainerCompressionScheme)
Returns undefined nothing
The feature objects appearing in the readFeatures member of CramRecord objects that show insertions, deletions, substitutions, etc.
- code (character): One of "bqBXIDiQNSPH". See page 15 of the CRAM v3 spec for their meanings.
- data (any): the data associated with the feature. The format of this varies depending on the feature code.
- pos (number): location relative to the read (1-based)
- refPos (number): location relative to the reference (1-based)
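As a small illustration of working with these objects, the sketch below tallies read features by their code. `countReadFeatures` is a hypothetical helper, not part of @gmod/cram, and `sampleFeatures` is hand-written stand-in data whose values are made up for the example rather than taken from a real CRAM file.

```javascript
// Hypothetical helper (not part of @gmod/cram): counts how many
// read features of each code appear in a readFeatures array.
function countReadFeatures(readFeatures) {
  const counts = {}
  for (const { code } of readFeatures) {
    counts[code] = (counts[code] || 0) + 1
  }
  return counts
}

// Hand-written stand-in for record.readFeatures; values are illustrative only.
const sampleFeatures = [
  { code: 'X', pos: 5, refPos: 10004, ref: 'A', sub: 'G' },
  { code: 'D', pos: 9, refPos: 10008, data: 2 },
  { code: 'X', pos: 12, refPos: 10013, ref: 'C', sub: 'T' },
]

console.log(countReadFeatures(sampleFeatures))
```

With a real record, the same function could be called as `countReadFeatures(record.readFeatures)`.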
The pairing of an index and a CramFile. Supports efficient fetching of records for sections of reference sequences.
Parameters
- args (object)
- args.cram (CramFile)
- args.index: Index-like object that supports getEntriesForRange(seqId,start,end) -> Promise[Array[index entries]]
- args.cacheSize (number?): optional maximum number of CRAM records to cache. Default 20,000.
- args.fetchSizeLimit (number?): optional maximum number of bytes to fetch in a single getRecordsForRange call. Default 3 MiB.
- args.checkSequenceMD5 (boolean?): default true. If false, disables verifying the MD5 checksum of the reference sequence underlying a slice. In some applications, this check can cause an inconvenient amount (many megabases) of sequences to be fetched.
Parameters
- seq (number): numeric ID of the reference sequence
- start (number): start of the range of interest. 1-based closed coordinates.
- end (number): end of the range of interest. 1-based closed coordinates.
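Because the range is given in 1-based closed coordinates, callers holding 0-based half-open intervals (as used by BED files and by JavaScript string slicing) need to convert first. A minimal sketch of that arithmetic follows; both helper names are hypothetical and not part of @gmod/cram.

```javascript
// Hypothetical helpers (not part of @gmod/cram).

// 0-based half-open [start0, end0) -> 1-based closed [start, end]
function toOneBasedClosed(start0, end0) {
  return { start: start0 + 1, end: end0 }
}

// 1-based closed [start1, end1] -> 0-based half-open [start, end)
function toZeroBasedHalfOpen(start1, end1) {
  return { start: start1 - 1, end: end1 }
}

// A 0-based half-open interval 9999..20000 covers the same bases
// as the 1-based closed range 10000..20000.
const queryRange = toOneBasedClosed(9999, 20000)
console.log(queryRange.start, queryRange.end)
```

Only the start coordinate shifts; the end coordinate is numerically the same in both conventions.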
Parameters
- seqId (number)
Returns Promise true if the CRAM file contains data for the given reference sequence numerical ID
Parameters
- args (object)
- args.filehandle (object?): a filehandle that implements the stat() and read() methods of the Node filehandle API https://nodejs.org/api/fs.html#fs_class_filehandle
- args.path (object?): path to the cram file
- args.url (object?): url for the cram file. Also supports file:// urls for local files.
- args.seqFetch (function?): a function with signature (seqId, startCoordinate, endCoordinate) that returns a promise for a string of sequence bases
- args.cacheSize (number?): optional maximum number of CRAM records to cache. Default 20,000.
- args.checkSequenceMD5 (boolean?): default true. If false, disables verifying the MD5 checksum of the reference sequence underlying a slice. In some applications, this check can cause an inconvenient amount (many megabases) of sequences to be fetched.
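For small references or tests, a seqFetch function with the signature described above could be backed by sequences held in memory rather than a FASTA file. The sketch below assumes seqFetch receives 1-based closed coordinates, as the usage example's comments indicate; `sliceReference` and `makeInMemorySeqFetch` are hypothetical names, not part of @gmod/cram.

```javascript
// Hypothetical helper: returns the bases in the 1-based closed range
// [start, end] by mapping it onto JavaScript's 0-based half-open slice.
function sliceReference(sequence, start, end) {
  return sequence.slice(start - 1, end)
}

// Hypothetical factory: builds a seqFetch-style function from an array
// (or object) of sequences keyed by numeric reference sequence ID.
function makeInMemorySeqFetch(sequencesById) {
  return async (seqId, start, end) => {
    const sequence = sequencesById[seqId]
    if (sequence === undefined) {
      throw new Error(`no reference sequence with numeric ID ${seqId}`)
    }
    return sliceReference(sequence, start, end)
  }
}

// Sequence 0 is a made-up 10-base reference.
const inMemorySeqFetch = makeInMemorySeqFetch(['ACGTACGTAC'])
inMemorySeqFetch(0, 2, 5).then(bases => console.log(bases))
```

The resulting function could then be passed as the seqFetch argument in place of the IndexedFasta-based one.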
Returns number the number of containers in the file
Represents a .crai index.
Parameters
Parameters
- seqId (number)
Returns Promise true if the index contains entries for the given reference sequence ID, false otherwise
Fetch index entries for the given range
Parameters
Returns Promise promise for an array of objects of the form {start, span, containerStart, sliceStart, sliceBytes}
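Since IndexedCramFile only requires an index-like object that supports getEntriesForRange(seqId, start, end), other index types can be adapted to it. The following is a minimal sketch of such an object backed by an in-memory table. `InMemoryIndex` and `filterOverlapping` are hypothetical names, the entry values are made up, and the overlap test assumes each entry covers the 1-based closed range from start to start + span - 1, which should be checked against the index format being adapted.

```javascript
// Hypothetical helper: keeps the entries whose covered range overlaps
// the 1-based closed query range [start, end].
// Assumption: an entry covers [entry.start, entry.start + entry.span - 1].
function filterOverlapping(entries, start, end) {
  return entries.filter(
    entry => entry.start <= end && entry.start + entry.span - 1 >= start,
  )
}

// Hypothetical index-like class (not part of @gmod/cram) exposing the
// getEntriesForRange method that IndexedCramFile expects of an index.
class InMemoryIndex {
  constructor(entriesBySeqId) {
    this.entriesBySeqId = entriesBySeqId
  }

  async getEntriesForRange(seqId, start, end) {
    return filterOverlapping(this.entriesBySeqId[seqId] || [], start, end)
  }
}

// Made-up entries for reference sequence 0.
const sampleEntries = [
  { start: 1, span: 1000, containerStart: 100, sliceStart: 20, sliceBytes: 5000 },
  { start: 1001, span: 1000, containerStart: 5200, sliceStart: 20, sliceBytes: 4800 },
]
const sampleIndex = new InMemoryIndex({ 0: sampleEntries })
sampleIndex
  .getEntriesForRange(0, 900, 950)
  .then(entries => console.log(entries.length))
```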
@gmod/cram/errors contains some special error classes thrown by cram-js. A list of the error classes is below.
- CramUnimplementedError
- CramMalformedError
- CramBufferOverrunError
- CramSizeLimitError
- CramArgumentError
CramUnimplementedError
Extends Error
Error caused by encountering a part of the CRAM spec that has not yet been implemented
CramMalformedError
Extends CramError
An error caused by malformed data.
CramBufferOverrunError
Extends CramMalformedError
An error caused by attempting to read beyond the end of the defined data.
CramSizeLimitError
Extends CramError
An error caused by data being too big, exceeding a size limit.
CramArgumentError
Extends CramError
An invalid argument was supplied to a cram-js method or object.
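Because some of these classes extend others, instanceof checks should go from the most specific class to the least specific. The sketch below shows that ordering using stand-in classes that mirror the hierarchy described above, so it runs without the package installed. In real code the classes would come from @gmod/cram/errors; `describeCramError` is a hypothetical helper.

```javascript
// Stand-in classes mirroring the documented hierarchy. These are NOT the
// library's classes; real code would import them from @gmod/cram/errors.
class CramError extends Error {}
class CramUnimplementedError extends Error {}
class CramMalformedError extends CramError {}
class CramBufferOverrunError extends CramMalformedError {}
class CramSizeLimitError extends CramError {}
class CramArgumentError extends CramError {}

// Hypothetical helper: maps an error to a short description, checking
// subclasses before the classes they extend.
function describeCramError(err) {
  if (err instanceof CramBufferOverrunError) return 'read past end of data'
  if (err instanceof CramMalformedError) return 'malformed data'
  if (err instanceof CramSizeLimitError) return 'size limit exceeded'
  if (err instanceof CramArgumentError) return 'invalid argument'
  if (err instanceof CramUnimplementedError) return 'unimplemented CRAM feature'
  return 'other error'
}

try {
  throw new CramBufferOverrunError('attempted to read beyond end of block')
} catch (err) {
  console.log(describeCramError(err))
}
```

A CramSizeLimitError is a natural case to handle separately, for example by asking the user to request a smaller range.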
This package was written with funding from the NHGRI as part of the JBrowse project. If you use it in an academic project that you publish, please cite the most recent JBrowse paper, which will be linked from jbrowse.org.
MIT © Robert Buels