io->bio #339

Closed
wants to merge 67 commits into from

Koeng101
Contributor

This is a fairly large PR to standardize our parsing and writing of files, using generics to implement higher level functions on simplified interfaces that are standardized across all our readers/writers: genbank, gff, fasta, fastq, slow5, sam, uniprot, rebase, pileup

@Koeng101
Contributor Author

This is a work-in-progress, btw

@carreter
Collaborator

Why the io to bio rename?

@Koeng101
Contributor Author

Why the io to bio rename?

So that the name doesn't conflict with standard library io

@Koeng101 Koeng101 added the draft label Aug 31, 2023
@Koeng101
Contributor Author

Koeng101 commented Sep 1, 2023

fasta parser test coverage is now at 98.5%. The only untested path is the Scanner non-EOF error, which I couldn't figure out how to test after a brief attempt. I removed a ton of code, and everything should be fairly simple now.

Contributor

@abondrn abondrn left a comment

Overall looks great, I only had stylistic comments. Due to the size of the PR and it being a Pareto improvement (common interface, strictly greater testing, no loss in efficiency), I'm leaning towards approving ASAP. Since all of the key contributors have had their say, for the remaining change requests simply determine whether each is (1) a quick fix, (2) issue worthy, or (3) won't do. No additional features please.

@@ -66,7 +67,10 @@ func TestHash(t *testing.T) {
 }

 func TestLeastRotation(t *testing.T) {
-	sequence, _ := genbank.Read("../data/puc19.gbk")
+	file, _ := os.Open("../data/puc19.gbk")
+	defer file.Close()
Contributor

This isn't strictly necessary. Go will close any resources which leaves the lexical scope, unless you plan to store this value somewhere on the heap. Yay GC!

Contributor Author

I thought that was how it worked, but code I've seen still usually has the defer Close! Do you know why?

Contributor

Started thread in Discord

io/io.go Outdated
Contributor

Not sure if this is a quirk of how GitHub displays the diff, but I prefer using git mv as this preserves file history. Blames are an important investigation tool and I would hate to lose that information, but if you already did that then disregard.

Contributor Author

Gotcha, will do in the future. io/io.go has basically entirely died, so I'm fine with that here.

Usually git is able to figure out if I moved a file if I just immediately commit again

@@ -15,8 +15,11 @@ require (

require (
github.com/davecgh/go-spew v1.1.1 // indirect
github.com/intel-go/cpuid v0.0.0-20181003105527-1a4a6f06a1c6 // indirect
Contributor

These seem interesting, what are these new dependencies for? Not opposed as they seem low-level performance-centric

Contributor

Answered my own question after looking at slow5. What other parsers or data structures would benefit from this? [no action required]

Contributor Author

I think it's just used in StreamVByte, but I don't think any other file format actually would benefit from this kind of compression very much. It is very nice in slow5 though, because of the integer data streaming out of the nanopore.

@Koeng101
Contributor Author

TODO:

  • Internalize parse parameters
  • Add back fasta format in docs
  • Remove that extra space

@Koeng101
Contributor Author

TODO:

* Internalize parse parameters

* Add back fasta format in docs

* Remove that extra space

Parse parameters are actually there for a reason: they are reset fairly often, so it is easier to keep them outside the parser itself. We could have a separate method for fully resetting them, but I think it is fairly readable as is.

@Koeng101
Contributor Author

Alright, looking over this it looks good to get a review from @TimothyStiles

@TimothyStiles
Collaborator

@Koeng101

I like the idea of utilizing generics but am not too familiar with their usage and syntax. Can you add a little "how-to" guide for writing parsers and writers in this new setup?

@Koeng101
Contributor Author

@Koeng101

I like the idea of utilizing generics but am not too familiar with their usage and syntax. Can you add a little "how-to" guide for writing parsers and writers in this new setup?

Hmmm, for both of those it doesn't really have much to do with generics (mostly just interfaces) - you only need generics once you make the top level convenience-function, which is essentially copy-pasted over each parser type, because Go generics are limited. Really what you have to be concerned about when writing parsers is the interfaces they must implement. To be clearer with the generics:

// NewFastaParser initiates a new FASTA parser from an io.Reader.
func NewFastaParser(r io.Reader) (*Parser[*fasta.Record, *fasta.Header], error) {
	return NewFastaParserWithMaxLineLength(r, DefaultMaxLengths[Fasta])
}

// NewFastaParserWithMaxLineLength initiates a new FASTA parser from an
// io.Reader and a user-given maxLineLength.
func NewFastaParserWithMaxLineLength(r io.Reader, maxLineLength int) (*Parser[*fasta.Record, *fasta.Header], error) {
	return &Parser[*fasta.Record, *fasta.Header]{parserInterface: fasta.NewParser(r, maxLineLength)}, nil
}

// NewFastqParser initiates a new FASTQ parser from an io.Reader.
func NewFastqParser(r io.Reader) (*Parser[*fastq.Read, *fastq.Header], error) {
	return NewFastqParserWithMaxLineLength(r, DefaultMaxLengths[Fastq])
}

// NewFastqParserWithMaxLineLength initiates a new FASTQ parser from an
// io.Reader and a user-given maxLineLength.
func NewFastqParserWithMaxLineLength(r io.Reader, maxLineLength int) (*Parser[*fastq.Read, *fastq.Header], error) {
	return &Parser[*fastq.Read, *fastq.Header]{parserInterface: fastq.NewParser(r, maxLineLength)}, nil
}

What you can notice from both of these is that the Fastq constructor is literally the same as the Fasta one, except that Data and Header now point to fastq types rather than fasta types. The only reason you have to copy and paste this each time, instead of passing the format into a function as a variable, is a limitation of Go's type system.

writers + parsers

"How to" make a parser is almost entirely contained within these few lines (as part of the type system):

type parserInterface[Data io.WriterTo, Header io.WriterTo] interface {
	Header() (Header, error)
	Next() (Data, error)
}

type Parser[Data io.WriterTo, Header io.WriterTo] struct {
	parserInterface parserInterface[Data, Header]
}

Let's dive into the Go type system to understand it better. A Parser is a generic struct that wraps a parserInterface. What does that mean? Well, any type that implements the methods Next() and Header() satisfies the parserInterface. Both Next() and Header() just return an io.WriterTo and an error, which are a standard library interface and a built-in interface.

To sum it up, a parser is any type that returns an io.WriterTo + error when you call Header() or Next(). It is extremely simple.

For example, the following parser would satisfy the parserInterface and could be wrapped in a bio.Parser:

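import "io"

// SimpleHeader and SimpleData are empty placeholder types.
type SimpleHeader struct{}
type SimpleData struct{}

// WriteTo makes both types satisfy io.WriterTo; nothing is written here.
func (h SimpleHeader) WriteTo(w io.Writer) (n int64, err error) { return 0, nil }
func (d SimpleData) WriteTo(w io.Writer) (n int64, err error)   { return 0, nil }

// SimpleParser satisfies parserInterface[SimpleData, SimpleHeader].
type SimpleParser struct{}

func (p SimpleParser) Header() (SimpleHeader, error) { return SimpleHeader{}, nil }
func (p SimpleParser) Next() (SimpleData, error)     { return SimpleData{}, nil }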

In this example, we have all the fundamentals needed:

  1. A Header
  2. A Data
  3. A WriteTo method implemented on the Header (satisfying io.WriterTo)
  4. A WriteTo method implemented on the Data (satisfying io.WriterTo)
  5. A Parser
  6. A Header() implemented on the parser
  7. A Next() implemented on the parser

The clever bit is realizing that pretty much every single biological data structure has something along these lines, and that Next() supports stream-reading. As a side effect of implementing the io.WriterTo interface, you also have writing built into the type definitions in a standard-library supported way.

In this way, the interfaces here are essentially just forcing parsers to expose 4 functions (WriteTo x2, Next, Header), and by doing that on their underlying data structures, you now can have a completely standardized way of working with each parser.

wrapping it up

Let's say you want to implement SimpleParser as a top level parser. Here is how you would do that:

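// SimpleParser returns SimpleData and SimpleHeader values (not pointers),
// so the type arguments must be the value types.
func NewSimpleParser() (*Parser[SimpleData, SimpleHeader], error) {
	return &Parser[SimpleData, SimpleHeader]{parserInterface: SimpleParser{}}, nil
}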

All this does is:

  1. Declares that we are returning a Parser with the two types SimpleData and SimpleHeader
  2. Supplies a SimpleParser as the parserInterface needed for a Parser (because SimpleParser has those handy Next() and Header() methods)
  3. Returns the instantiated generic Parser, wrapping that underlying SimpleParser
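
Once a parser is wrapped like this, higher-level functions can be written once on the generic Parser and reused by every format. The sketch below assumes a Next() method that delegates to the wrapped parser and a hypothetical ParseAll helper that drains it until io.EOF; neither name is confirmed as the PR's actual API, and this SimpleParser is a variant that stops after a fixed number of records so the loop terminates.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

type parserInterface[Data io.WriterTo, Header io.WriterTo] interface {
	Header() (Header, error)
	Next() (Data, error)
}

type Parser[Data io.WriterTo, Header io.WriterTo] struct {
	parserInterface parserInterface[Data, Header]
}

// Next delegates to the wrapped parser.
func (p *Parser[Data, Header]) Next() (Data, error) {
	return p.parserInterface.Next()
}

// ParseAll is a hypothetical helper, written once for every format: it
// drains the parser and returns all records, treating io.EOF as the end.
func (p *Parser[Data, Header]) ParseAll() ([]Data, error) {
	var records []Data
	for {
		record, err := p.Next()
		if errors.Is(err, io.EOF) {
			return records, nil
		}
		if err != nil {
			return records, err
		}
		records = append(records, record)
	}
}

type SimpleHeader struct{}

type SimpleData struct{ Index int }

func (h SimpleHeader) WriteTo(w io.Writer) (int64, error) { return 0, nil }

func (d SimpleData) WriteTo(w io.Writer) (int64, error) { return 0, nil }

// SimpleParser here yields a fixed number of records, then io.EOF.
type SimpleParser struct{ remaining, emitted int }

func (p *SimpleParser) Header() (SimpleHeader, error) { return SimpleHeader{}, nil }

func (p *SimpleParser) Next() (SimpleData, error) {
	if p.remaining == 0 {
		return SimpleData{}, io.EOF
	}
	p.remaining--
	p.emitted++
	return SimpleData{Index: p.emitted}, nil
}

// NewSimpleParser wraps a SimpleParser that yields n records.
func NewSimpleParser(n int) *Parser[SimpleData, SimpleHeader] {
	return &Parser[SimpleData, SimpleHeader]{parserInterface: &SimpleParser{remaining: n}}
}

func main() {
	records, err := NewSimpleParser(3).ParseAll()
	if err != nil {
		panic(err)
	}
	// The element type is SimpleData, so its fields are directly accessible.
	fmt.Println(len(records), records[2].Index)
}
```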

why we couldn't do this before generics

The hard bit is that I wanted to expose the underlying data types when you do things with a Parser. To pull from some above examples:

parser, _ := bio.NewFastaParser(file)
fastaRecord, _ := parser.Next()
// you should be able to do the below!
fmt.Println(fastaRecord.Sequence)
fmt.Println(fastaRecord.Name)

Without a generic interface, we couldn't say that it was aight for Data to contain other things, like Sequence or Name, beyond the required WriteTo. Essentially, all the generics are doing is saying it is aight for Data to contain other things without having to go through complicated type conversion magic.
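
To make the contrast concrete, here is a small sketch of the same accessor written against a non-generic interface and against a generic one. The record type and the plainNexter, genericNexter, plainSource, and typedSource names are all hypothetical stand-ins, not code from the PR.

```go
package main

import (
	"fmt"
	"io"
)

// record is a hypothetical stand-in for a format-specific record type.
type record struct{ Name, Sequence string }

func (r *record) WriteTo(w io.Writer) (int64, error) {
	n, err := fmt.Fprintf(w, ">%s\n%s\n", r.Name, r.Sequence)
	return int64(n), err
}

// plainNexter is what a shared interface looks like without generics: every
// format can only promise "something that implements io.WriterTo".
type plainNexter interface {
	Next() (io.WriterTo, error)
}

// genericNexter keeps the concrete record type in the signature.
type genericNexter[Data io.WriterTo] interface {
	Next() (Data, error)
}

type plainSource struct{}

func (plainSource) Next() (io.WriterTo, error) {
	return &record{Name: "seq1", Sequence: "ATGC"}, nil
}

type typedSource struct{}

func (typedSource) Next() (*record, error) {
	return &record{Name: "seq1", Sequence: "ATGC"}, nil
}

// sequenceFromPlain must recover the concrete type with a type assertion.
func sequenceFromPlain(p plainNexter) (string, error) {
	data, err := p.Next()
	if err != nil {
		return "", err
	}
	rec, ok := data.(*record)
	if !ok {
		return "", fmt.Errorf("unexpected record type %T", data)
	}
	return rec.Sequence, nil
}

// sequenceFromGeneric gets a *record back directly; the fields are visible
// to the compiler with no assertion.
func sequenceFromGeneric(p genericNexter[*record]) (string, error) {
	rec, err := p.Next()
	if err != nil {
		return "", err
	}
	return rec.Sequence, nil
}

func main() {
	a, _ := sequenceFromPlain(plainSource{})
	b, _ := sequenceFromGeneric(typedSource{})
	fmt.Println(a, b)
}
```

Both functions return the same string; the difference is that the non-generic version can fail at runtime if the assertion is wrong, while the generic version is checked at compile time.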

@paleale

paleale commented Nov 25, 2023

Hello! Can I ask you to see and apply my changes for #352 ?
I don't know a proper way for such changes to be suggested, so I've prepared a patch:
0001-remove-not-working-error-handling-from-GetSequence-f.patch
Changes in CONTRIBUTION.md were discussed in #352 also... Thank you!

@Koeng101
Contributor Author

Hello! Can I ask you to see and apply my changes for #352 ? I don't know a proper way for such changes to be suggested, so I've prepared a patch: 0001-remove-not-working-error-handling-from-GetSequence-f.patch Changes in CONTRIBUTION.md were discussed in #352 also... Thank you!

Sure! Could you add a pull request with your changes into this branch? That would be the proper way + our tooling for reviewing changes would work well with that. Normally, you just make the changes on your own fork, and then do a pull request into this branch.

@isaacguerreir
Contributor

It would be interesting now that we're using generics to have a factory to decide which parser to use based on the file content, so the developer doesn't need to choose what parser to use.

@Koeng101
Contributor Author

It would be interesting now that we're using generics to have a factory to decide which parser to use based on the file content, so the developer doesn't need to choose what parser to use.

Due to how generics work in Go, I'm not sure that you actually can do this. Basically, if the output signature is Parser[Data, Header], the caller fixes Data and Header at compile time, so the function can't decide at runtime to return Parser[*fasta.Record, *fasta.Header].
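
A sketch of what such a factory would end up looking like, under the assumption that it sniffs the first character of the content ('>' for FASTA, '@' for FASTQ). Since the result type cannot depend on the input, the function has to return any, and every caller needs a type switch. The names newParserFromContent, describe, header, fastaRecord, and fastqRead are hypothetical, and the Parser here is an empty stand-in.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// Parser is an empty stand-in for the generic wrapper.
type Parser[Data io.WriterTo, Header io.WriterTo] struct{}

// Hypothetical header and record types for two formats.
type header struct{}

type fastaRecord struct{}

type fastqRead struct{}

func (*header) WriteTo(io.Writer) (int64, error) { return 0, nil }

func (*fastaRecord) WriteTo(io.Writer) (int64, error) { return 0, nil }

func (*fastqRead) WriteTo(io.Writer) (int64, error) { return 0, nil }

// newParserFromContent is a hypothetical factory. The two branches return
// different instantiations of Parser, so the only shared result type is any.
func newParserFromContent(content string) (any, error) {
	switch {
	case strings.HasPrefix(content, ">"):
		return &Parser[*fastaRecord, *header]{}, nil
	case strings.HasPrefix(content, "@"):
		return &Parser[*fastqRead, *header]{}, nil
	default:
		return nil, errors.New("unrecognized format")
	}
}

// describe shows the type switch every caller would need before it could
// touch format-specific fields.
func describe(content string) string {
	parser, err := newParserFromContent(content)
	if err != nil {
		return "unknown"
	}
	switch parser.(type) {
	case *Parser[*fastaRecord, *header]:
		return "fasta"
	case *Parser[*fastqRead, *header]:
		return "fastq"
	}
	return "unknown"
}

func main() {
	fmt.Println(describe(">seq1\nATGC\n"), describe("@read1\nATGC\n+\nIIII\n"))
}
```

This compiles, but it gives up the compile-time typing that the generic Parser was introduced to provide, which is why picking the constructor explicitly is the simpler trade-off.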

@isaacguerreir
Contributor

my mistake, for some reason reading your explanation I thought it would be possible to pass generic types as outputs, but taking a better look at the documentation it seems undoable to use Golang generics like that. It would be possible if the function returns a common structure, which is not possible with the current implementation.

@Koeng101
Contributor Author

seems undoable to use Golang generics like that. It would be possible if the function returns a common structure, which

Yes, we don't have a common structure between all of the different parsers, and I think this is a good thing. Makes each data type very specific and concise.

6 participants