-
-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
io->bio #339
Conversation
This is a work-in-progress, btw |
Why the |
So that the name doesn't conflict with standard library io |
Co-authored-by: Willow Carretero Chavez <[email protected]>
|
… with the added benefit of better cmp interop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks great, I only had stylistic comments. Due to the size of the PR and it being a pareto improvement (common interface, strictly greater testing, no loss in efficiency) I'm leaning towards approving ASAP. Since all of the key contributors have had their say, for the remaining change requests simply determine whether it is (1) a quick fix (2) issue worthy (3) won't do. No additional features please.
@@ -66,7 +67,10 @@ func TestHash(t *testing.T) { | |||
} | |||
|
|||
func TestLeastRotation(t *testing.T) { | |||
sequence, _ := genbank.Read("../data/puc19.gbk") | |||
file, _ := os.Open("../data/puc19.gbk") | |||
defer file.Close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't strictly necessary. Go will close any resources which leaves the lexical scope, unless you plan to store this value somewhere on the heap. Yay GC!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that was how it worked, but code I've seen still usually has the defer Close! Do you know why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Started thread in Discord
io/io.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this is a quirk with how Github displays diff, but I prefer using git mv as this preserves file history. Blames are an important investigation tool and I would hate to lose that information, but if you already did that then disregard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, will do in the future. io/io.go has basically entirely died, so I'm fine with that here.
Usually git is able to figure out if I moved a file if I just immediately commit again
@@ -15,8 +15,11 @@ require ( | |||
|
|||
require ( | |||
github.com/davecgh/go-spew v1.1.1 // indirect | |||
github.com/intel-go/cpuid v0.0.0-20181003105527-1a4a6f06a1c6 // indirect |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These seem interesting, what are these new dependencies for? Not opposed as they seem low-level performance-centric
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Answered my question after looking at slow5, what other parsers or data structures would benefit from this [no action required]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's just used in StreamVByte
, but I don't think any other file format actually would benefit from this kind of compression very much. It is very nice in slow5 though, because of the integer data streaming out of the nanopore.
TODO:
|
Parse parameters are actually there for reason. it's reset fairly often, so it is easier to have it outside the parser itself. We could have a separate method for fully resetting it, but I think it is fairly readable as is. |
Alright, looking over this it looks good to get a review from @TimothyStiles |
I like the idea of utilizing generics but am not too familiar with their usage and syntax. Can you add a little "how-to" guide for writing parsers and writer in this new setup? |
Hmmm, for both of those it doesn't really have much to do with generics (mostly just interfaces) - you only need generics once you make the top level convenience-function, which is essentially copy-pasted over each parser type, because Go generics are limited. Really what you have to be concerned about when writing parsers is the interfaces they must implement. To be clearer with the generics: // NewFastaParser initiates a new FASTA parser from an io.Reader.
func NewFastaParser(r io.Reader) (*Parser[*fasta.Record, *fasta.Header], error) {
return NewFastaParserWithMaxLineLength(r, DefaultMaxLengths[Fasta])
}
// NewFastaParserWithMaxLineLength initiates a new FASTA parser from an
// io.Reader and a user-given maxLineLength.
func NewFastaParserWithMaxLineLength(r io.Reader, maxLineLength int) (*Parser[*fasta.Record, *fasta.Header], error) {
return &Parser[*fasta.Record, *fasta.Header]{parserInterface: fasta.NewParser(r, maxLineLength)}, nil
}
// NewFastqParser initiates a new FASTQ parser from an io.Reader.
func NewFastqParser(r io.Reader) (*Parser[*fastq.Read, *fastq.Header], error) {
return NewFastqParserWithMaxLineLength(r, DefaultMaxLengths[Fastq])
}
// NewFastqParserWithMaxLineLength initiates a new FASTQ parser from an
// io.Reader and a user-given maxLineLength.
func NewFastqParserWithMaxLineLength(r io.Reader, maxLineLength int) (*Parser[*fastq.Read, *fastq.Header], error) {
return &Parser[*fastq.Read, *fastq.Header]{parserInterface: fastq.NewParser(r, maxLineLength)}, nil
} What you can notice from both of these is that the Fastq is literally the same as the Fasta, except the writers + parsers"How to" make a parser is almost entirely contained within these few lines (as part of the type system): type parserInterface[Data io.WriterTo, Header io.WriterTo] interface {
Header() (Header, error)
Next() (Data, error)
}
type Parser[Data io.WriterTo, Header io.WriterTo] struct {
parserInterface parserInterface[Data, Header]
} Let's dive into the Golang type system to understand it better. A Parser is a generic struct, which means it is any struct that implements a To sum it up, a For example, the following parser would satisfy the type SimpleHeader struct {}
type SimpleData struct {}
func (h SimpleHeader) WriteTo(w io.Writer) (n int64, err error) { return 0, nil}
func (d SimpleData) WriteTo(w io.Writer) (n int64, err error) { return 0, nil}
type SimpleParser struct{}
func (p SimpleParser) Header() (SimpleHeader, error) { return SimpleHeader{}, nil}
func (p SimpleParser) Next() (SimpleData, error) { return SimpleData{}, nil } In this example, we have all the fundamentals needed:
The clever bit is realizing that pretty much every single biological data structure has something along these lines, and that In this way, the interfaces here are essentially just forcing parsers to expose 4 functions ( wrapping it upLet's say you want to implement func NewSimpleParser() (*Parser[*SimpleData, *SimpleHeader], error) {
return &Parser[*SimpleData, *SimpleHeader]{parserInterface: SimpleParser{}}, nil
} All this does is:
why we couldn't do this before genericsThe hard bit is that I wanted to expose the underlying data types when you do things with a Parser. To pull from some above examples: parser, _ := bio.NewFastaParser(file)
fastaRecord, _ := parser.Next()
// you should be able to do the below!
fmt.Println(fastaRecord.Sequence)
fmt.Println(fastaRecord.Read) Without a generic interface, we couldn't say that it was aight for |
Hello! Can I ask you to see and apply my changes for #352 ? |
Sure! Could you add a pull request with your changes into this branch? That would be the proper way + our tooling for reviewing changes would work well with that. Normally, you just make the changes on your own fork, and then do a pull request into this branch. |
It would be interesting now that we're using generics to have a factory to decide which parser to use based on the file content, so the developer doesn't need to choose what parser to use. |
Due to how generics work in Go, I'm not sure that you actually can do this. Basically, if the output signature is |
my mistake, for some reason reading your explanation I thought it would be possible to pass generic types as outputs, but taking a better look at the documentation it seems undoable to use Golang generics like that. It would be possible if the function returns a common structure, which is not possible with the current implementation. |
Yes, we don't have a common structure between all of the different parsers, and I think this is a good thing. Makes each data type very specific and concise. |
This is a fairly large PR to standardize our parsing and writing of files, using generics to implement higher level functions on simplified interfaces that are standardized across all our readers/writers: genbank, gff, fasta, fastq, slow5, sam, uniprot, rebase, pileup