Merge pull request #394 from abondrn/ioToBio-genbank
multimap implemented
Koeng101 authored Nov 5, 2023
2 parents 956d26e + 7e3c812 commit 35a5492
Showing 29 changed files with 545 additions and 122 deletions.
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -10,10 +10,19 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added
- Alternative start codons can now be used in the `synthesis/codon` DNA -> protein translation package (#305)
- Added a parser and writer for the `pileup` sequence alignment format (#329)
- Created copy methods for Feature and Location to address concerns raised by [(#342)](https://github.com/TimothyStiles/poly/issues/342)
- Created new methods to convert polyjson -> genbank.
- Created new `Feature.StoreSequence` method to enable [(#388)](https://github.com/TimothyStiles/poly/issues/388)

### Changed
- **Breaking**: Genbank parser uses new custom multimap for `Feature.Attributes`, which allows for duplicate keys. This changes the type of `Feature.Attributes` from `map[string]string` to `MultiMap[string, string]`, an alias for `map[string][]string` defined in `multimap.go`. [(#383)](https://github.com/TimothyStiles/poly/issues/383)
- Improves error reporting for genbank parse errors via a new `ParseError` struct.
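
To make the breaking change concrete, here is a minimal sketch of a slice-valued map holding the two `/db_xref` qualifiers that appear on a single CDS feature in the test data. The `MultiMap` definition and the `Put` helper below are assumptions written for illustration; the actual names and methods in `multimap.go` may differ.

```go
package main

import "fmt"

// MultiMap is a hypothetical stand-in for the type described above:
// a map whose values are slices, so one key can hold several values.
type MultiMap[K comparable, V any] map[K][]V

// Put appends value under key and keeps any values already stored there.
// Illustrative helper only; not necessarily the name used in multimap.go.
func Put[K comparable, V any](m MultiMap[K, V], key K, value V) {
	m[key] = append(m[key], value)
}

func main() {
	attrs := MultiMap[string, string]{}

	// A plain map[string]string would keep only the last db_xref.
	Put(attrs, "db_xref", "GeneID:854630")
	Put(attrs, "db_xref", "SGD:S000001439")
	Put(attrs, "codon_start", "1")

	fmt.Println(len(attrs["db_xref"]))   // both values are kept
	fmt.Println(attrs["codon_start"])    // a slice, so it prints with brackets
	fmt.Println(attrs["codon_start"][0]) // index to get the bare string
}
```

Callers that previously read a single string from the map now receive a slice, so a lookup needs an index (or a loop) to get at individual values.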

### Fixed
- `fastq` parser no longer becomes de-aligned when reading (#325)
- `fastq` now handles optionals correctly (#323)
- Adds functional test and fix for [(#313)](https://github.com/TimothyStiles/poly/issues/313).
- In addition to expanding the set of genbank files which can be validly parsed, the parser is more vocal when it encounters unusual syntax in the "feature" section. This "fail fast" approach is better as there were cases where inputs triggered a codepath which would neither return a valid Genbank object nor an error, and should help with debugging.

## [0.26.0] - 2023-07-22
Oops, we weren't keeping a changelog before this tag!
12 changes: 12 additions & 0 deletions CONTRIBUTING.md
@@ -93,6 +93,18 @@ In order to simplify the development experience, and environment setup, the poly

Whether you're a beginner with Go or an experienced developer, you should see the suggestions pop up automatically when you go to the *Plugins* tab in VSCode. Using these plugins can help accelerate the development experience and also allow you to work more collaboratively with other poly developers.

## Local Checks

Poly runs numerous CI/CD checks via GitHub Actions before a PR can be merged. Your PR must pass all of these checks to be mergeable.

A quick way to check that your PR will pass is to run:

```sh
gofmt -s -w . && go test ./...
```

Additionally, you may want to [install](https://golangci-lint.run/usage/install/#local-installation) and run the linter.

# How to report a bug

### Security disclosures
2 changes: 1 addition & 1 deletion bio/example_test.go
@@ -253,7 +253,7 @@ ORIGIN
records, _ := parser.Parse()

fmt.Println(records[0].Features[2].Attributes["translation"])
- // Output: MTMITPSLHACRSTLEDPRVPSSNSLAVVLQRRDWENPGVTQLNRLAAHPPFASWRNSEEARTDRPSQQLRSLNGEWRLMRYFLLTHLCGISHRIWCTLSTICSDAA
+ // Output: [MTMITPSLHACRSTLEDPRVPSSNSLAVVLQRRDWENPGVTQLNRLAAHPPFASWRNSEEARTDRPSQQLRSLNGEWRLMRYFLLTHLCGISHRIWCTLSTICSDAA]
}

func ExampleNewSlow5Parser() {
6 changes: 3 additions & 3 deletions bio/fasta/fasta.go
@@ -26,15 +26,15 @@ Fasta Parser begins here
Many thanks to Jordan Campbell (https://github.com/0x106) for building the first
parser for Poly and thanks to Tim Stiles (https://github.com/TimothyStiles)
for helping complete that PR. This work expands on the previous work by allowing
- for concurrent parsing and giving Poly a specific parser subpackage,
+ for concurrent parsing and giving Poly a specific parser subpackage,
as well as a few bug fixes.
Fasta is a very simple file format for working with DNA, RNA, or protein sequences.
It was first released in 1985 and is still widely used in bioinformatics.
- https://en.wikipedia.org/wiki/_format
+ https://en.wikipedia.org/wiki/FASTA_format
- One interesting use of the concurrent parser is working with the Uniprot
+ One interesting use of the concurrent parser is working with the Uniprot
fasta dump files, which are far too large to fit into RAM. This parser is able
to easily handle those files by doing computation actively while the data dump
is getting parsed.
147 changes: 147 additions & 0 deletions bio/genbank/data/NC_001141.2_redux.gb
@@ -0,0 +1,147 @@
LOCUS NC_001141 439888 bp DNA linear CON 15-SEP-2023
DEFINITION Saccharomyces cerevisiae S288C chromosome IX, complete sequence.
ACCESSION NC_001141
VERSION NC_001141.2
DBLINK BioProject: PRJNA128
Assembly: GCF_000146045.2
KEYWORDS RefSeq.
SOURCE Saccharomyces cerevisiae S288C
ORGANISM Saccharomyces cerevisiae S288C
Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina;
Saccharomycetes; Saccharomycetales; Saccharomycetaceae;
Saccharomyces.
REFERENCE 1 (bases 1 to 439888)
AUTHORS Engel,S.R., Wong,E.D., Nash,R.S., Aleksander,S., Alexander,M.,
Douglass,E., Karra,K., Miyasato,S.R., Simison,M., Skrzypek,M.S.,
Weng,S. and Cherry,J.M.
TITLE New data and collaborations at the Saccharomyces Genome Database:
updated reference genome, alleles, and the Alliance of Genome
Resources
JOURNAL Genetics 220 (4) (2022)
PUBMED 34897464
REFERENCE 2 (bases 1 to 439888)
AUTHORS Churcher,C., Bowman,S., Badcock,K., Bankier,A., Brown,D.,
Chillingworth,T., Connor,R., Devlin,K., Gentles,S., Hamlin,N.,
Harris,D., Horsnell,T., Hunt,S., Jagels,K., Jones,M., Lye,G.,
Moule,S., Odell,C., Pearson,D., Rajandream,M., Rice,P., Rowley,N.,
Skelton,J., Smith,V., Barrell,B. et al.
TITLE The nucleotide sequence of Saccharomyces cerevisiae chromosome IX
JOURNAL Nature 387 (6632 SUPPL), 84-87 (1997)
PUBMED 9169870
REFERENCE 3 (bases 1 to 439888)
AUTHORS Goffeau,A., Barrell,B.G., Bussey,H., Davis,R.W., Dujon,B.,
Feldmann,H., Galibert,F., Hoheisel,J.D., Jacq,C., Johnston,M.,
Louis,E.J., Mewes,H.W., Murakami,Y., Philippsen,P., Tettelin,H. and
Oliver,S.G.
TITLE Life with 6000 genes
JOURNAL Science 274 (5287), 546 (1996)
PUBMED 8849441
REFERENCE 4 (bases 1 to 439888)
CONSRTM NCBI Genome Project
TITLE Direct Submission
JOURNAL Submitted (14-SEP-2023) National Center for Biotechnology
Information, NIH, Bethesda, MD 20894, USA
REFERENCE 5 (bases 1 to 439888)
CONSRTM Saccharomyces Genome Database
TITLE Direct Submission
JOURNAL Submitted (04-MAY-2012) Department of Genetics, Stanford
University, Stanford, CA 94305-5120, USA
REMARK Protein update by submitter
REFERENCE 6 (bases 1 to 439888)
CONSRTM Saccharomyces Genome Database
TITLE Direct Submission
JOURNAL Submitted (31-MAR-2011) Department of Genetics, Stanford
University, Stanford, CA 94305-5120, USA
REMARK Sequence update by submitter
REFERENCE 7 (bases 1 to 439888)
CONSRTM Saccharomyces Genome Database
TITLE Direct Submission
JOURNAL Submitted (14-DEC-2009) Department of Genetics, Stanford
University, Stanford, CA 94305-5120, USA
COMMENT REVIEWED REFSEQ: This record has been curated by SGD. The reference
sequence is identical to BK006942.

On Apr 26, 2011 this sequence version replaced NC_001141.1.

##Genome-Annotation-Data-START##
Annotation Provider :: SGD
Annotation Status :: Full Annotation
Annotation Version :: R64-4-1
URL :: http://www.yeastgenome.org/
##Genome-Annotation-Data-END##
COMPLETENESS: full length.
FEATURES Location/Qualifiers
source 1..439888
/organism="Saccharomyces cerevisiae S288C"
/mol_type="genomic DNA"
/strain="S288C"
/db_xref="taxon:559292"
/chromosome="IX"
telomere complement(1..7784)
/note="TEL09L; Telomeric region on the left arm of
Chromosome IX; composed of an X element core sequence, X
element combinatorial repeats, a long Y' element, and a
short terminal stretch of telomeric repeats"
/db_xref="SGD:S000028896"
gene complement(<483..>6147)
/locus_tag="YIL177C"
/db_xref="GeneID:854630"
mRNA complement(join(<483..4598,4987..>6147))
/locus_tag="YIL177C"
/product="Y' element ATP-dependent helicase"
/transcript_id="NM_001179522.1"
/db_xref="GeneID:854630"
CDS complement(join(483..4598,4987..6147))
/locus_tag="YIL177C"
/EC_number="3.6.4.12"
/note="Putative Y' element ATP-dependent helicase"
/codon_start=1
/product="Y' element ATP-dependent helicase"
/protein_id="NP_012092.1"
/db_xref="GeneID:854630"
/db_xref="SGD:S000001439"
/translation="MKVSDRRKFEKANFDEFESALNNKNDLVHCPSITLFESIPTEVR
SFYEDEKSGLIKVVKFRTGAMDRKRSFEKVVISVMVGKNVKKFLTFVEDEPDFQGGPI
PSKYLIPKKINLMVYTLFQVHTLKFNRKDYDTLSLFYLNRGYYNELSFRVLERCHEIA
SARPNDSSTMRTFTDFVSGAPIVRSLQKSTIRKYGYNLAPYMFLLLHVDELSIFSAYQ
ASLPGEKKVDTERLKRDLCPRKPIEIKYFSQICNDMMNKKDRLGDILHIILRACALNF
GAGPRGGAGDEEDRSITNEEPIIPSVDEHGLKVCKLRSPNTPRRLRKTLDAVKALLVS
SCACTARDLDIFDDNNGVAMWKWIKILYHEVAQETTLKDSYRITLVPSSDGISLLAFA
GPQRNVYVDDTTRRIQLYTDYNKNGSSEPRLKTLDGLTSDYVFYFVTVLRQMQICALG
NSYDAFNHDPWMDVVGFEDPNQVTNRDISRIVLYSYMFLNTAKGCLVEYATFRQYMRE
LPKNAPQKLNFREMRQGLIALGRHCVGSRFETDLYESATSELMANHSVQTGRNIYGVD
SFSLTSVSGTTATLLQERASERWIQWLGLESDYHCSFSSTRNAEDVVAGEAASSNHHQ
KISRVTRKRPREPKSTNDILVAGQKLFGSSFEFRDLHQLRLCYEIYMADTPSVAVQAP
PGYGKTELFHLPLIALASKGDVEYVSFLFVPYTVLLANCMIRLGRCGCLNVAPVRNFI
EEGYDGVTDLYVGIYDDLASTNFTDRIAAWENIVECTFRTNNVKLGYLIVDEFHNFET
EVYRQSQFGGITNLDFDAFEKAIFLSGTAPEAVADAALQRIGLTGLAKKSMDINELKR
SEDLSRGLSSYPTRMFNLIKEKSEVPLGHVHKIRKKVESQPEEALKLLLALFESEPES
KAIVVASTTNEVEELACSWRKYFRVVWIHGKLGAAEKVSRTKEFVTDGSMQVLIGTKL
VTEGIDIKQLMMVIMLDNRLNIIELIQGVGRLRDGGLCYLLSRKNSWAARNRKGELPP
IKEGCITEQVREFYGLESKKGKKGQHVGCCGSRTDLSADTVELIERMDRLAEKQATAS
MSIVALPSSFQESNSSDRYRKYCSSDEDSNTCIHGSANASTNASTNAITTASTNVRTN
ATTNASTNATTNASTNASTNATTNASTNATTNSSTNATTTASTNVRTSATTTASINVR
TSATTTESTNSSTNATTTESTNSSTNATTTESTNSNTSATTTASINVRTSATTTESTN
SSTSATTTASINVRTSATTTKSINSSTNATTTESTNSNTNATTTESTNSSTNATTTES
TNSSTNATTTESTNSNTSAATTESTNSNTSATTTESTNASAKEDANKDGNAEDNRFHP
VTDINKESYKRKGSQMVLLERKKLKAQFPNTSENMNVLQFLGFRSDEIKHLFLYGIDI
YFCPEGVFTQYGLCKGCQKMFELCVCWAGQKVSYRRIAWEALAVERMLRNDEEYKEYL
EDIEPYHGDPVGYLKYFSVKRREIYSQIQRNYAWYLAITRRRETISVLDSTRGKQGSQ
VFRMSGRQIKELYFKVWSNLRESKTEVLQYFLNWDEKKCQEEWEAKDDTVVVEALEKG
GVFQRLRSMTSAGLQGPQYVKLQFSRHHRQLRSRYELSLGMHLRDQIALGVTPSKVPH
WTAFLSMLIGLFYNKTFRQKLEYLLEQISEVWLLPHWLDLANVEVLAADDTRVPLYML
MVAVHKELDSDDVPDGRFDILLCRDSSREVGE"
rep_origin 7470..8793
/note="ARS902; Putative replication origin; identified in
multiple array studies, not yet confirmed by plasmid-based
assay"
/db_xref="SGD:S000130156"
mRNA join(<155222,155311..>155765)
/gene="COX5B"
/locus_tag="YIL111W"
/product="cytochrome c oxidase subunit Vb"
/transcript_id="NM_001179459.1"
/db_xref="GeneID:854695"
CONTIG join(BK006942.2:1..439888)
//

File renamed without changes.