io->bio #339

Koeng101 · 2023-08-24T02:20:06Z

This is a fairly large PR to standardize how we parse and write files. It uses generics to implement higher-level functions on top of simplified interfaces that are shared across all our readers/writers: genbank, gff, fasta, fastq, slow5, sam, uniprot, rebase, and pileup.

Koeng101 · 2023-08-24T03:15:30Z

This is a work-in-progress, btw

carreter · 2023-08-29T10:09:34Z

Why the io to bio rename?

Koeng101 · 2023-08-29T14:51:02Z

Why the io to bio rename?

So that the name doesn't conflict with standard library io

Koeng101 · 2023-09-01T23:17:42Z

FASTA parser test coverage is now at 98.5%. The remaining gap is the Scanner non-EOF error path, which I couldn't figure out how to test after a brief attempt. I removed a ton of code, and everything should be fairly simple now.

abondrn

Overall looks great, I only had stylistic comments. Due to the size of the PR and it being a Pareto improvement (common interface, strictly greater testing, no loss in efficiency), I'm leaning towards approving ASAP. Since all of the key contributors have had their say, for the remaining change requests simply determine whether each is (1) a quick fix, (2) issue-worthy, or (3) won't do. No additional features please.

abondrn · 2023-10-31T04:06:34Z

seqhash/seqhash_test.go

@@ -66,7 +67,10 @@ func TestHash(t *testing.T) {
 }

 func TestLeastRotation(t *testing.T) {
-	sequence, _ := genbank.Read("../data/puc19.gbk")
+	file, _ := os.Open("../data/puc19.gbk")
+	defer file.Close()


This isn't strictly necessary. Go will close any resources that leave the lexical scope, unless you plan to store this value somewhere on the heap. Yay GC!

I thought that was how it worked, but code I've seen still usually has the defer Close! Do you know why?

Started thread in Discord
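On the question in this thread, as far as I know: an *os.File is not closed when it leaves lexical scope. The runtime may close the descriptor once the value is garbage collected, but when that happens is not guaranteed, which is one reason `defer file.Close()` stays idiomatic. A minimal sketch, where `countLines`, `writeTemp`, and the file contents are made up for illustration and are not part of poly:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

// countLines is a hypothetical helper. It opens a file, releases the
// descriptor deterministically with defer, and counts the lines.
func countLines(path string) (int, error) {
	file, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	// Close runs when countLines returns, on every return path.
	defer file.Close()

	lines := 0
	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		lines++
	}
	return lines, scanner.Err()
}

// writeTemp creates a throwaway file so the example is self-contained.
func writeTemp(content string) (string, error) {
	tmp, err := os.CreateTemp("", "example-*.txt")
	if err != nil {
		return "", err
	}
	defer tmp.Close()
	_, err = tmp.WriteString(content)
	return tmp.Name(), err
}

func main() {
	path, err := writeTemp("LOCUS\nORIGIN\n//\n")
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.Remove(path)

	n, err := countLines(path)
	fmt.Println(n, err)
}
```

In a test, the same pattern applies: check the error from os.Open instead of discarding it, then defer the Close.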

abondrn · 2023-10-31T04:09:09Z

io/io.go

Not sure if this is a quirk of how GitHub displays diffs, but I prefer using git mv, as this preserves file history. Blames are an important investigation tool and I would hate to lose that information, but if you already did that then disregard.

Gotcha, will do in the future. io/io.go has basically entirely died, so I'm fine with that here.

Usually git is able to figure out if I moved a file if I just immediately commit again

abondrn · 2023-10-31T04:13:25Z

go.mod

@@ -15,8 +15,11 @@ require (

 require (
 	github.com/davecgh/go-spew v1.1.1 // indirect
+	github.com/intel-go/cpuid v0.0.0-20181003105527-1a4a6f06a1c6 // indirect


These seem interesting, what are these new dependencies for? Not opposed as they seem low-level performance-centric

Answered my own question after looking at slow5. What other parsers or data structures would benefit from this? [no action required]

I think it's just used in StreamVByte, but I don't think any other file format actually would benefit from this kind of compression very much. It is very nice in slow5 though, because of the integer data streaming out of the nanopore.
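To illustrate why a stream of integer samples compresses well, here is a rough sketch using delta encoding plus the standard library's variable-length integers. This is not StreamVByte and not how slow5 is implemented in poly. `encodeDeltas`, `decodeDeltas`, the int16 sample type, and the sample values are all assumptions made up for the example.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeDeltas stores each sample as the zigzag-varint difference from the
// previous sample. Neighbouring samples are usually close together, so most
// deltas fit in a single byte.
func encodeDeltas(samples []int16) []byte {
	out := make([]byte, 0, len(samples))
	buf := make([]byte, binary.MaxVarintLen64)
	prev := int64(0)
	for _, s := range samples {
		n := binary.PutVarint(buf, int64(s)-prev)
		out = append(out, buf[:n]...)
		prev = int64(s)
	}
	return out
}

// decodeDeltas reverses encodeDeltas. It returns nil on malformed input.
func decodeDeltas(data []byte) []int16 {
	var samples []int16
	prev := int64(0)
	for len(data) > 0 {
		delta, n := binary.Varint(data)
		if n <= 0 {
			return nil
		}
		prev += delta
		samples = append(samples, int16(prev))
		data = data[n:]
	}
	return samples
}

func main() {
	// Invented values that wobble around a baseline, like a raw signal might.
	signal := []int16{430, 472, 463, 467, 454, 465, 463, 450}
	packed := encodeDeltas(signal)
	fmt.Println(len(signal)*2, "raw bytes ->", len(packed), "packed bytes")
	fmt.Println(decodeDeltas(packed))
}
```

Formats whose payload is mostly text (sequences, annotations) do not have this property, which fits the point above that other parsers would gain little.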

Koeng101 · 2023-10-31T15:30:07Z

TODO:

Internalize parse parameters
Add back fasta format in docs
Remove that extra space

Koeng101 · 2023-11-10T07:16:37Z

TODO:

* Internalize parse parameters

* Add back fasta format in docs

* Remove that extra space

Parse parameters are actually there for a reason. They are reset fairly often, so it is easier to keep them outside the parser itself. We could have a separate method for fully resetting them, but I think it is fairly readable as is.
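A rough sketch of the idea, not poly's actual code: keeping per-record scratch state in its own struct means one assignment resets all of it. Every name here (`record`, `parseParameters`, `parser`, `flush`, `parseAll`) is hypothetical, and the FASTA-like handling is deliberately simplified.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"io"
	"strings"
)

type record struct{ Name, Sequence string }

// parseParameters is per-record scratch state, kept in its own struct so a
// single assignment resets every field.
type parseParameters struct {
	name     string
	sequence []byte
	started  bool
}

type parser struct {
	scanner *bufio.Scanner
	params  parseParameters
}

func newParser(r io.Reader) *parser { return &parser{scanner: bufio.NewScanner(r)} }

// flush returns the record accumulated so far, if any, and resets params.
func (p *parser) flush() (record, bool) {
	if !p.params.started {
		return record{}, false
	}
	done := record{Name: p.params.name, Sequence: string(p.params.sequence)}
	p.params = parseParameters{} // the whole reset
	return done, true
}

// Next returns one record per call and io.EOF when the input is exhausted.
func (p *parser) Next() (record, error) {
	for p.scanner.Scan() {
		line := strings.TrimSpace(p.scanner.Text())
		switch {
		case line == "":
			continue
		case strings.HasPrefix(line, ">"):
			done, ok := p.flush()
			p.params.name = line[1:]
			p.params.started = true
			if ok {
				return done, nil
			}
		case !p.params.started:
			return record{}, errors.New("sequence line before any header")
		default:
			p.params.sequence = append(p.params.sequence, line...)
		}
	}
	if err := p.scanner.Err(); err != nil {
		return record{}, err
	}
	if done, ok := p.flush(); ok {
		return done, nil
	}
	return record{}, io.EOF
}

// parseAll drains a parser over the given text.
func parseAll(text string) ([]record, error) {
	p := newParser(strings.NewReader(text))
	var records []record
	for {
		rec, err := p.Next()
		if errors.Is(err, io.EOF) {
			return records, nil
		}
		if err != nil {
			return nil, err
		}
		records = append(records, rec)
	}
}

func main() {
	records, err := parseAll(">a\nAC\nGT\n>b\nTT\n")
	fmt.Println(records, err)
}
```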

Koeng101 · 2023-11-10T07:21:52Z

Alright, looking over this it looks good to get a review from @TimothyStiles

TimothyStiles · 2023-11-21T20:45:20Z

@Koeng101

I like the idea of utilizing generics but am not too familiar with their usage and syntax. Can you add a little "how-to" guide for writing parsers and writers in this new setup?

Koeng101 · 2023-11-25T01:35:37Z

@Koeng101

I like the idea of utilizing generics but am not too familiar with their usage and syntax. Can you add a little "how-to" guide for writing parsers and writers in this new setup?

Hmmm, neither of those really has much to do with generics (it's mostly just interfaces). You only need generics once you write the top-level convenience function, which is essentially copy-pasted for each parser type because Go generics are limited. What you really have to care about when writing a parser is the interfaces it must implement. To be clearer about the generics:

// NewFastaParser initiates a new FASTA parser from an io.Reader.
func NewFastaParser(r io.Reader) (*Parser[*fasta.Record, *fasta.Header], error) {
	return NewFastaParserWithMaxLineLength(r, DefaultMaxLengths[Fasta])
}

// NewFastaParserWithMaxLineLength initiates a new FASTA parser from an
// io.Reader and a user-given maxLineLength.
func NewFastaParserWithMaxLineLength(r io.Reader, maxLineLength int) (*Parser[*fasta.Record, *fasta.Header], error) {
	return &Parser[*fasta.Record, *fasta.Header]{parserInterface: fasta.NewParser(r, maxLineLength)}, nil
}

// NewFastqParser initiates a new FASTQ parser from an io.Reader.
func NewFastqParser(r io.Reader) (*Parser[*fastq.Read, *fastq.Header], error) {
	return NewFastqParserWithMaxLineLength(r, DefaultMaxLengths[Fastq])
}

// NewFastqParserWithMaxLineLength initiates a new FASTQ parser from an
// io.Reader and a user-given maxLineLength.
func NewFastqParserWithMaxLineLength(r io.Reader, maxLineLength int) (*Parser[*fastq.Read, *fastq.Header], error) {
	return &Parser[*fastq.Read, *fastq.Header]{parserInterface: fastq.NewParser(r, maxLineLength)}, nil
}

Notice that the Fastq constructors are literally the same as the Fasta ones, except that the Data and Header type arguments now point to fastq types rather than fasta types. The only reason this has to be copy-pasted for each format, instead of passing the format into one function as a variable, is that Go needs the type arguments fixed at compile time; they can't be chosen from a runtime value.

writers + parsers

"How to" make a parser is almost entirely contained within these few lines (as part of the type system):

type parserInterface[Data io.WriterTo, Header io.WriterTo] interface {
	Header() (Header, error)
	Next() (Data, error)
}

type Parser[Data io.WriterTo, Header io.WriterTo] struct {
	parserInterface parserInterface[Data, Header]
}

Let's dive into the Go type system to understand it better. A Parser is a generic struct that wraps any value implementing parserInterface. What does that mean? Any type with the methods Next() and Header() satisfies parserInterface. Both Next() and Header() just return an io.WriterTo and an error: a standard library interface plus the built-in error interface.

To sum it up, a Parser wraps any type that returns an io.WriterTo and an error when you call Header() or Next(). It is extremely simple.

For example, the following parser would satisfy the parserInterface that bio.Parser wraps:

// Minimal header and data types. Both satisfy io.WriterTo.
type SimpleHeader struct{}
type SimpleData struct{}

func (h SimpleHeader) WriteTo(w io.Writer) (n int64, err error) { return 0, nil }
func (d SimpleData) WriteTo(w io.Writer) (n int64, err error)   { return 0, nil }

// SimpleParser exposes the two methods parserInterface requires.
type SimpleParser struct{}

func (p SimpleParser) Header() (SimpleHeader, error) { return SimpleHeader{}, nil }
func (p SimpleParser) Next() (SimpleData, error)     { return SimpleData{}, nil }

In this example, we have all the fundamentals needed:

A Header
A Data
A WriteTo method (satisfying io.WriterTo) implemented on the Header
A WriteTo method (satisfying io.WriterTo) implemented on the Data
A Parser
A Header() implemented on the parser
A Next() implemented on the parser

The clever bit is realizing that pretty much every biological data format has something along these lines, and that Next() supports stream-reading. As a side effect of implementing the io.WriterTo interface, you also get writing built into the type definitions in a standard-library-supported way.

In this way, the interfaces here essentially just force parsers to expose four methods (WriteTo x2, Next, Header). With those in place on the underlying data structures, you have a completely standardized way of working with each parser.
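The streaming read plus built-in writing described above can be sketched end to end. This is a toy, not poly's code: `Line`, `NoHeader`, `LineParser`, and `roundTrip` are invented for the example, and I am assuming that Parser forwards Next() to the wrapped parser and that parsers signal the end of input with io.EOF.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

type parserInterface[Data io.WriterTo, Header io.WriterTo] interface {
	Header() (Header, error)
	Next() (Data, error)
}

type Parser[Data io.WriterTo, Header io.WriterTo] struct {
	parserInterface parserInterface[Data, Header]
}

// Next forwards to the wrapped parser (assumed to mirror bio.Parser).
func (p *Parser[D, H]) Next() (D, error) { return p.parserInterface.Next() }

// Line is a hypothetical record type: one line of text.
type Line struct{ Text string }

func (l Line) WriteTo(w io.Writer) (int64, error) {
	n, err := io.WriteString(w, l.Text+"\n")
	return int64(n), err
}

// NoHeader is a hypothetical header for a format that has none.
type NoHeader struct{}

func (NoHeader) WriteTo(io.Writer) (int64, error) { return 0, nil }

// LineParser yields one Line per call to Next and io.EOF when exhausted.
type LineParser struct{ scanner *bufio.Scanner }

func (p *LineParser) Header() (NoHeader, error) { return NoHeader{}, nil }

func (p *LineParser) Next() (Line, error) {
	if p.scanner.Scan() {
		return Line{Text: p.scanner.Text()}, nil
	}
	if err := p.scanner.Err(); err != nil {
		return Line{}, err
	}
	return Line{}, io.EOF
}

// roundTrip streams records out of the parser and writes each one back
// through its WriteTo method.
func roundTrip(input string) (string, error) {
	parser := &Parser[Line, NoHeader]{
		parserInterface: &LineParser{scanner: bufio.NewScanner(strings.NewReader(input))},
	}
	var out strings.Builder
	for {
		rec, err := parser.Next()
		if err == io.EOF {
			return out.String(), nil
		}
		if err != nil {
			return "", err
		}
		if _, err := rec.WriteTo(&out); err != nil {
			return "", err
		}
	}
}

func main() {
	out, err := roundTrip("ATGC\nGGCC\n")
	fmt.Printf("%q %v\n", out, err)
}
```

Because rec has the concrete type Line, the loop could also read rec.Text directly, with no type assertion.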

wrapping it up

Let's say you want to implement SimpleParser as a top level parser. Here is how you would do that:

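// The type arguments must match what SimpleParser's methods return
// (SimpleData and SimpleHeader, not pointers to them).
func NewSimpleParser() (*Parser[SimpleData, SimpleHeader], error) {
	return &Parser[SimpleData, SimpleHeader]{parserInterface: SimpleParser{}}, nil
}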

All this does is:

Declares that we are returning a Parser instantiated with the two types SimpleData and SimpleHeader
Supplies the parserInterface needed for a Parser (SimpleParser qualifies because it has those handy Next() and Header() methods)
Returns the instantiated generic Parser, wrapping that underlying SimpleParser
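Here is a compilable version of the steps above, as a sketch. One detail to watch: the type arguments given to Parser have to match what the wrapped parser's Next() and Header() return exactly. SimpleParser returns SimpleData and SimpleHeader values, so the sketch instantiates Parser with those rather than with pointer types. The compile-time assertion line is my addition.

```go
package main

import (
	"fmt"
	"io"
)

type parserInterface[Data io.WriterTo, Header io.WriterTo] interface {
	Header() (Header, error)
	Next() (Data, error)
}

type Parser[Data io.WriterTo, Header io.WriterTo] struct {
	parserInterface parserInterface[Data, Header]
}

type SimpleHeader struct{}
type SimpleData struct{}

func (h SimpleHeader) WriteTo(w io.Writer) (n int64, err error) { return 0, nil }
func (d SimpleData) WriteTo(w io.Writer) (n int64, err error)   { return 0, nil }

type SimpleParser struct{}

func (p SimpleParser) Header() (SimpleHeader, error) { return SimpleHeader{}, nil }
func (p SimpleParser) Next() (SimpleData, error)     { return SimpleData{}, nil }

// Compile-time check that SimpleParser satisfies the interface for these
// exact type arguments. Swapping in *SimpleData here would fail to compile,
// because Next returns SimpleData, not *SimpleData.
var _ parserInterface[SimpleData, SimpleHeader] = SimpleParser{}

// NewSimpleParser wraps a SimpleParser in the generic Parser.
func NewSimpleParser() (*Parser[SimpleData, SimpleHeader], error) {
	return &Parser[SimpleData, SimpleHeader]{parserInterface: SimpleParser{}}, nil
}

func main() {
	parser, err := NewSimpleParser()
	if err != nil {
		fmt.Println("constructor failed:", err)
		return
	}
	data, err := parser.parserInterface.Next()
	fmt.Printf("%T %v\n", data, err)
}
```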

why we couldn't do this before generics

The hard bit is that I wanted to expose the underlying data types when you do things with a Parser. To pull from some above examples:

parser, _ := bio.NewFastaParser(file)
fastaRecord, _ := parser.Next()
// you should be able to do the below!
fmt.Println(fastaRecord.Sequence)
fmt.Println(fastaRecord.Read)

Without generics, the wrapper could only promise that Data is an io.WriterTo, so we couldn't say it was fine for Data to contain other things, like Sequence or Name, beyond the required WriteTo. Essentially, all the generics are doing is letting Data keep its concrete type, so callers can reach those fields without going through type conversion magic.
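To make the difference concrete, here is a small side-by-side sketch. `Record`, `untypedParser`, and `typedParser` are hypothetical stand-ins, not poly types. The untyped version is roughly what an interface-only wrapper could offer, and the typed version shows what the type parameter buys.

```go
package main

import (
	"fmt"
	"io"
)

// Record is a hypothetical format-specific type.
type Record struct {
	Name     string
	Sequence string
}

func (r *Record) WriteTo(w io.Writer) (int64, error) {
	n, err := fmt.Fprintf(w, ">%s\n%s\n", r.Name, r.Sequence)
	return int64(n), err
}

// untypedParser can only promise an io.WriterTo from Next.
type untypedParser struct{ records []*Record }

func (p *untypedParser) Next() (io.WriterTo, error) {
	if len(p.records) == 0 {
		return nil, io.EOF
	}
	r := p.records[0]
	p.records = p.records[1:]
	return r, nil
}

// typedParser keeps the concrete record type via a type parameter.
type typedParser[Data io.WriterTo] struct{ records []Data }

func (p *typedParser[Data]) Next() (Data, error) {
	var zero Data
	if len(p.records) == 0 {
		return zero, io.EOF
	}
	r := p.records[0]
	p.records = p.records[1:]
	return r, nil
}

func main() {
	rec := &Record{Name: "seq1", Sequence: "ATGC"}

	old := &untypedParser{records: []*Record{rec}}
	w, _ := old.Next()
	// w.Sequence would not compile here; a type assertion is required.
	if r, ok := w.(*Record); ok {
		fmt.Println(r.Sequence)
	}

	typed := &typedParser[*Record]{records: []*Record{rec}}
	r, _ := typed.Next()
	fmt.Println(r.Sequence) // direct field access, checked at compile time
}
```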

paleale · 2023-11-25T18:28:25Z

Hello! Can I ask you to see and apply my changes for #352 ?
I don't know a proper way for such changes to be suggested, so I've prepared a patch:
0001-remove-not-working-error-handling-from-GetSequence-f.patch
Changes in CONTRIBUTION.md were discussed in #352 also... Thank you!

Koeng101 · 2023-11-28T06:02:30Z

Hello! Can I ask you to see and apply my changes for #352 ? I don't know a proper way for such changes to be suggested, so I've prepared a patch: 0001-remove-not-working-error-handling-from-GetSequence-f.patch Changes in CONTRIBUTION.md were discussed in #352 also... Thank you!

Sure! Could you add a pull request with your changes into this branch? That would be the proper way + our tooling for reviewing changes would work well with that. Normally, you just make the changes on your own fork, and then do a pull request into this branch.

isaacguerreir · 2023-11-29T13:25:39Z

It would be interesting now that we're using generics to have a factory to decide which parser to use based on the file content, so the developer doesn't need to choose what parser to use.

Koeng101 · 2023-11-29T16:14:14Z

It would be interesting now that we're using generics to have a factory to decide which parser to use based on the file content, so the developer doesn't need to choose what parser to use.

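Due to how generics work in Go, I'm not sure that you actually can do this. A function's return type has to be fixed at compile time, so if the output signature is Parser[Data, Header], a single factory can't return Parser[*fasta.Record, *fasta.Header] for one file and Parser[*fastq.Read, *fastq.Header] for another.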

isaacguerreir · 2023-11-29T18:16:17Z

My mistake. Reading your explanation, I thought it would be possible to return generic types chosen at runtime, but after a closer look at the documentation, Go generics can't be used like that. It would only be possible if the function returned a common structure, which the current implementation doesn't have.

Koeng101 · 2023-11-29T18:22:21Z

seems undoable to use Golang generics like that. It would be possible if the function returns a common structure, which

Yes, we don't have a common structure between all of the different parsers, and I think this is a good thing. Makes each data type very specific and concise.
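For anyone curious what a content-sniffing factory would look like anyway, here is a sketch under heavy assumptions. `FastaRecord`, `FastqRead`, `newParserFor`, and the one-type-parameter `Parser` are all invented and much simpler than the real types. The point is that the factory has to return `any`, and the caller then needs a type switch to get the concrete parser back, which defeats most of the convenience.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// Hypothetical record types for two formats.
type FastaRecord struct{ Sequence string }
type FastqRead struct{ Sequence, Quality string }

func (r *FastaRecord) WriteTo(w io.Writer) (int64, error) {
	n, err := fmt.Fprintf(w, "%s\n", r.Sequence)
	return int64(n), err
}

func (r *FastqRead) WriteTo(w io.Writer) (int64, error) {
	n, err := fmt.Fprintf(w, "%s\n+\n%s\n", r.Sequence, r.Quality)
	return int64(n), err
}

// Parser is a simplified generic parser over pre-parsed records.
type Parser[Data io.WriterTo] struct{ records []Data }

func (p *Parser[Data]) Next() (Data, error) {
	var zero Data
	if len(p.records) == 0 {
		return zero, io.EOF
	}
	r := p.records[0]
	p.records = p.records[1:]
	return r, nil
}

// newParserFor sniffs the first character and picks a parser. Its return
// type must be one fixed type, so the best it can do is any.
func newParserFor(content string) (any, error) {
	lines := strings.Split(strings.TrimSpace(content), "\n")
	switch {
	case strings.HasPrefix(content, ">") && len(lines) >= 2:
		return &Parser[*FastaRecord]{records: []*FastaRecord{{Sequence: lines[1]}}}, nil
	case strings.HasPrefix(content, "@") && len(lines) >= 4:
		return &Parser[*FastqRead]{records: []*FastqRead{{Sequence: lines[1], Quality: lines[3]}}}, nil
	}
	return nil, errors.New("unrecognized format")
}

func main() {
	p, err := newParserFor(">seq1\nATGC\n")
	if err != nil {
		fmt.Println(err)
		return
	}
	// The caller still has to know every possible format up front.
	switch parser := p.(type) {
	case *Parser[*FastaRecord]:
		rec, _ := parser.Next()
		fmt.Println("fasta:", rec.Sequence)
	case *Parser[*FastqRead]:
		read, _ := parser.Next()
		fmt.Println("fastq:", read.Sequence, read.Quality)
	}
}
```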

Keoni Gandall added 2 commits August 23, 2023 19:17

Moved io to bio

b87947b

fixed io imports

5c887a3

TimothyStiles reviewed Aug 29, 2023

bio/fasta/example_test.go (outdated)

TimothyStiles reviewed Aug 30, 2023

bio/bio.go (outdated)

TimothyStiles reviewed Aug 30, 2023

bio/bio.go (outdated)

TimothyStiles reviewed Aug 30, 2023

bio/example_test.go (outdated)

TimothyStiles reviewed Aug 30, 2023

bio/fasta/fasta.go (outdated)

Koeng101 added the draft label Aug 31, 2023

Add more generic definitions to bio

d8f4b38

carreter reviewed Sep 1, 2023

Koeng101 and others added 4 commits September 1, 2023 15:59

Update bio/fastq/fastq.go

4fb41ff

Co-authored-by: Willow Carretero Chavez <[email protected]>

update fasta

2452282

Merge branch 'ioToBio' of github.com:TimothyStiles/poly into ioToBio

6dda2b9

add fasta updates and parser

16fbcbd

carreter reviewed Sep 1, 2023

bio/bio.go (outdated)

carreter reviewed Sep 1, 2023

bio/bio.go (outdated)

carreter reviewed Sep 1, 2023

bio/fasta/fasta.go (outdated)

Keoni Gandall added 9 commits September 1, 2023 19:34

made readability improvements

382a014

changed ParseWithHeader

0bbd05e

removed int64 in reads

eb68f81

add more example tests

344220c

gotta update this for this tests!

03f8b68

integrate slow5

6199c43

have examples covering most of changes

65f0539

removed interfaces

8ff6da4

updated with NewXXXParser

00732a4

abondrn added 6 commits October 30, 2023 13:59

Add methods to convert polyjson -> genbank

9ce9f4f

Removed generic collections library in favor of hand-rolled multimap, with the added benefit of better cmp interop

89a2ba4

Propogate handrolled multimap to test files

b88d7b8

Responded to more comments

b4c3a37

Reduced new example genbank file

8b82d7b

Resolved lint errors, added test StoreFeatureSequences and fixed uncovered bug

f523651

abondrn reviewed Oct 31, 2023

Added multimap.go file doc

1270ec8

abondrn and others added 8 commits October 31, 2023 12:36

Responded to more comments

9c322f6

First merge attempt

f124fae

Fixed deref issue

98b6984

Merged updated branch

fc2ca75

Fixed tests, moved genbank files

25e0f61

Fixed fasta docs

60abf6d

Added changelog

7e3c812

Merge pull request #394 from abondrn/ioToBio-genbank

35a5492

multimap implemented

added to changelog

433df00

TimothyStiles mentioned this pull request Nov 29, 2023

Genbank import and export from JSON should include feature sequences #388

Closed

Koeng101 closed this Dec 7, 2023
