This repository has been archived by the owner on Aug 26, 2023. It is now read-only.

roadmap

Richard Smith edited this page Feb 4, 2014 · 9 revisions

IO

sequence formats

  • FASTA
  • FASTQ
  • GenBank
  • EMBL

Many more formats are listed at http://www.bioperl.org/wiki/HOWTO:SeqIO, but a lot of them are antiquated. I think we should prioritise by popularity: the sooner BioJulia is useful, the better for the community.
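To give a feel for the scope of the highest-priority format, here is a minimal sketch of a streaming FASTA reader. It is written in Python purely for illustration and is not a proposed BioJulia API; the function name `read_fasta` is hypothetical, and the sketch assumes plain-text input with `>` header lines and ignores any text before the first header.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open text handle.

    Sequence lines belonging to one record are joined together;
    blank lines are skipped.
    """
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line and name is not None:
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if name is not None:
        yield name, "".join(chunks)


example = io.StringIO(">seq1 first record\nACGT\nTTGA\n\n>seq2\nGGCC\n")
records = list(read_fasta(example))
```

Because the reader is a generator, records are produced one at a time and a large file never has to be held in memory in full, which is likely to matter for FASTQ as well.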

annotation formats

  • GFF & GTF (these are messy in most languages; it would be great if we could handle all the quirks cleanly)
  • BED
  • VCF
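One concrete example of the GFF/GTF quirks mentioned above is the ninth (attributes) column: GFF3 uses `key=value` pairs with URL-style escaping and comma-separated multiple values, while GTF uses `key "value";` pairs. The sketch below, in Python for illustration only, handles both styles in one hypothetical function, `parse_attributes`. It assumes that no GTF value contains a literal `;`, which real files sometimes violate.

```python
from urllib.parse import unquote


def parse_attributes(field):
    """Parse a GFF3 or GTF attributes column into a dict of value lists."""
    attrs = {}
    for part in field.split(";"):
        part = part.strip()
        if not part:
            continue
        key, sep, value = part.partition("=")
        if sep and " " not in key:
            # GFF3 style: key=value1,value2 with percent-encoded characters.
            values = [unquote(v) for v in value.split(",")]
        else:
            # GTF style: key "value" (quotes are optional for numbers).
            key, _, value = part.partition(" ")
            values = [value.strip().strip('"')]
        attrs.setdefault(key, []).extend(values)
    return attrs


gff3 = parse_attributes("ID=gene1;Name=abc%3Bdef;Parent=t1,t2")
gtf = parse_attributes('gene_id "g1"; transcript_id "t1";')
```

Returning a list for every key keeps the result shape uniform, since GFF3 attributes such as `Parent` may legitimately carry several values.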

alignment formats

  • BLAST (tabular/long form)
  • MultiFASTA aligned
  • CLUSTAL
  • BAM/SAM
  • Phylip
  • PFAM

tree formats

  • Newick (can be ported from Phylogenetics.jl)
  • Nexus
  • PhyloXML

Also database connectors, e.g. for BioSQL.

datastructures

We'll want to have representations of:

  • DNA, RNA and amino acid sequences
  • ranges and features of sequences (where the sequence may or may not be present)
  • alignments - pairwise and multiple
  • graph-derivative structures like phylogenetic trees, genetic networks and biochemical pathways
  • probabilistic models of sequences (e.g. motifs - perhaps this isn't a high priority)
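As a rough illustration of the first two items, the sketch below models a validated DNA sequence and a feature whose underlying sequence is optional. It is written in Python only to make the idea concrete; the type names, field names and the 1-based inclusive coordinate convention are all assumptions, not a design decision for BioJulia.

```python
from dataclasses import dataclass
from typing import Optional

DNA_ALPHABET = frozenset("ACGTN")


@dataclass(frozen=True)
class DNASequence:
    residues: str

    def __post_init__(self):
        # Reject symbols outside the (simplified) DNA alphabet.
        bad = set(self.residues) - DNA_ALPHABET
        if bad:
            raise ValueError(f"invalid DNA symbols: {sorted(bad)}")


@dataclass(frozen=True)
class Feature:
    seqname: str
    start: int  # 1-based, inclusive (assumed convention)
    end: int    # 1-based, inclusive
    strand: str = "+"
    kind: str = "region"
    sequence: Optional[DNASequence] = None  # may be absent

    def __post_init__(self):
        if self.start < 1 or self.end < self.start:
            raise ValueError("require 1 <= start <= end")

    def __len__(self):
        return self.end - self.start + 1

    def overlaps(self, other):
        """True if both features share a seqname and at least one position."""
        return (self.seqname == other.seqname
                and self.start <= other.end
                and other.start <= self.end)

    def extract(self):
        """Return the covered subsequence; needs the sequence to be present."""
        if self.sequence is None:
            raise ValueError("feature has no attached sequence")
        return DNASequence(self.sequence.residues[self.start - 1:self.end])


gene = Feature("chr1", 3, 6, kind="gene", sequence=DNASequence("AACGTTA"))
exon = Feature("chr1", 6, 9, kind="exon")  # coordinates only, no sequence
```

Range operations such as `overlaps` work on coordinates alone, so they remain usable when only an annotation file has been loaded; only `extract` requires the sequence itself.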

tool wrappers

  • BLAST
  • Blat
  • bowtie / bowtie2
  • bwa
  • HMMER
  • Primer3
  • Phylogenetic tools (clustal, mafft, PAML, phylip)
  • samtools (unless we can do something faster in our own SAM/BAM implementation)
  • signalP/targetP
  • assemblers: velvet/oases, trinity, soapdenovo

service APIs

  • BioMart
  • Ensembl
  • EMBL
  • NCBI
  • SRA

datasets

  • genome sequences
  • genome annotations
  • gene ontologies