Currently, both the pipeline and the webapp use reads at the record level. This is fine for fine-grained access, but not ideal. We really should move to a bucketed/compressed/encrypted model, with (say) packets of 5k reads compressed and encrypted together.
If we keep the packets relatively small, there won't be a huge penalty for accessing a single read. There may even be a performance improvement, since we reduce the disk usage, I/O, and index sizes; in fact that seems likely.
This issue affects both the pipeline and the webapp, as the pipeline writes the data and the webapp reads it. So Python and Java need to agree on the data storage, compression, and encryption format. See: capsid/capsid-webapp#63
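A minimal sketch of what one packet could look like on the Python side, assuming zlib for compression and AES-GCM (via the `cryptography` package) for encryption. The cipher choice, the 5k packet size, and the newline-delimited JSON payload are all assumptions that Python and Java would have to agree on, not a settled format:

```python
import json
import os
import zlib

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

PACKET_SIZE = 5000  # reads per packet (assumption: the "5k reads" above)


def pack_reads(reads, key):
    """Serialise, compress, and encrypt one packet of reads.

    `reads` is a list of dicts (one per read); `key` is a 256-bit AES key
    shared with the webapp. Returns (nonce, ciphertext).
    """
    # Newline-delimited JSON is just a placeholder payload format.
    payload = "\n".join(json.dumps(r) for r in reads).encode("utf-8")
    compressed = zlib.compress(payload, 6)

    nonce = os.urandom(12)  # AES-GCM needs a unique nonce per packet
    ciphertext = AESGCM(key).encrypt(nonce, compressed, None)
    return nonce, ciphertext


def unpack_reads(nonce, ciphertext, key):
    """Reverse of pack_reads: decrypt, decompress, deserialise."""
    compressed = AESGCM(key).decrypt(nonce, ciphertext, None)
    payload = zlib.decompress(compressed).decode("utf-8")
    return [json.loads(line) for line in payload.splitlines()]


if __name__ == "__main__":
    key = AESGCM.generate_key(bit_length=256)
    reads = [{"readId": i, "refStart": i * 50, "sequence": "ACGT"}
             for i in range(PACKET_SIZE)]
    nonce, blob = pack_reads(reads, key)
    assert unpack_reads(nonce, blob, key) == reads
```

AES-GCM is suggested here only because it is straightforward to implement on both the Python and Java sides; any scheme both ends agree on would do.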
This is going to take a little work, because right now we just dump everything into the database. We can't really do that any more, so the process ought to change to use files more, and then load those files into blocks in the database which can be encrypted and compressed. The basic idea is the same: there'll be files associated with (a) a genome and (b) a set of owners (project, align, etc.), and (c) indexed by start position, in blocks of a decent size, say 30-50K, which can be quickly decrypted/decompressed. All the reading will happen in the webapp, but the pipeline needs to write out the storage. GridFS will handle most of it, once we know what we are writing.
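As a rough illustration of how the pipeline might hand a block to GridFS, here's a sketch using pymongo's `gridfs` module. The database name and the metadata field names (`genome`, `owners`, `start`, `end`) are assumptions, not the final schema; extra keyword arguments to `put()` simply land on the `fs.files` document, which is what lets the webapp query blocks by genome and position:

```python
import gridfs
from pymongo import MongoClient

# Assumed database name; the real pipeline config would supply this.
db = MongoClient("mongodb://localhost:27017")["capsid"]
fs = gridfs.GridFS(db)


def store_block(block_bytes, genome_id, owners, start, end):
    """Store one compressed/encrypted block of reads as a GridFS file.

    The keyword arguments become metadata on the fs.files document,
    so blocks can be looked up by genome, owners, and start position.
    """
    return fs.put(
        block_bytes,
        filename="reads-%s-%d" % (genome_id, start),
        genome=genome_id,
        owners=owners,   # e.g. ["project:foo", "align:bar"] -- hypothetical values
        start=start,     # first reference position covered by the block
        end=end,         # last reference position covered by the block
    )


def find_block(genome_id, position):
    """Fetch the block covering a given reference position for a genome."""
    doc = db.fs.files.find_one(
        {"genome": genome_id, "start": {"$lte": position}, "end": {"$gte": position}}
    )
    return fs.get(doc["_id"]).read() if doc else None
```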
The catch is that the pipeline can currently afford to just dump everything out as it goes. We can't do that any more: we need to build up blocks we can handle. Note that the block start offsets do not need to be consistent or sequentially spaced; we can use a uniform read count, or a uniform block size, if we like. That does mean, however, that the reads need to be sorted by start position by the time we get them. The owner issue is less of a problem: we're combining multiple owners into a single owners field on the DB file. Annoyingly, we do this using an update process, so it's not trivial to manage it all sequentially.
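A sketch of that blocking logic: take reads already sorted by start position, cut them into blocks of a uniform read count, and record each block's start. The `BLOCK_READS` value and the `refStart` field name are assumptions, and the `$addToSet` update at the end is just one hedged way of keeping the existing update-style owner handling:

```python
from itertools import islice

BLOCK_READS = 5000  # uniform read count per block (assumption, not a settled number)


def blocks_of(sorted_reads, n=BLOCK_READS):
    """Yield (start_position, reads) from reads already sorted by refStart.

    Cutting every n reads means block starts are monotonically increasing
    but not evenly spaced on the reference, which is fine per the above.
    """
    it = iter(sorted_reads)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            break
        yield chunk[0]["refStart"], chunk


def add_owners(db, genome_id, start, new_owners):
    """Merge extra owners into an existing block's fs.files document.

    $addToSet avoids duplicates if the same owner arrives more than once,
    mirroring the update-based owner handling mentioned above.
    """
    db.fs.files.update_one(
        {"genome": genome_id, "start": start},
        {"$addToSet": {"owners": {"$each": list(new_owners)}}},
    )


if __name__ == "__main__":
    reads = [{"readId": i, "refStart": i * 10} for i in range(12000)]
    for start, chunk in blocks_of(reads):
        print(start, len(chunk))  # 0/5000, 50000/5000, 100000/2000
```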