Currently, both the pipeline and the webapp use reads at the record level. This is fine for fine-grained access, but not ideal. We really should move to a bucketed/compressed/encrypted model, with (say) packets of 5k reads compressed and encrypted together.
If we keep the packets relatively small, there won't be a huge penalty for accessing a single read. There may even be a performance improvement, since we reduce the disk usage, I/O, and index sizes; in fact that seems likely.
This issue affects both the pipeline and the webapp, as the pipeline writes the data and the webapp reads it. So Python and Java need to agree on the data storage, compression, and encryption format. See: capsid/capsid-webapp#63
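A minimal sketch of what one packet could look like on the Python side, assuming zlib for compression and AES-GCM (via the `cryptography` package) for encryption. The cipher choice, the 5k packet size, and the newline-delimited JSON payload are all assumptions that Python and Java would have to agree on, not a settled format:

```python
import json
import os
import zlib

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

PACKET_SIZE = 5000  # reads per packet (assumption: the "5k reads" above)


def pack_reads(reads, key):
    """Serialise, compress, and encrypt one packet of reads.

    `reads` is a list of dicts (one per read); `key` is a 256-bit AES key
    shared with the webapp. Returns (nonce, ciphertext).
    """
    # Newline-delimited JSON is just a placeholder payload format.
    payload = "\n".join(json.dumps(r) for r in reads).encode("utf-8")
    compressed = zlib.compress(payload, 6)

    nonce = os.urandom(12)  # AES-GCM needs a unique nonce per packet
    ciphertext = AESGCM(key).encrypt(nonce, compressed, None)
    return nonce, ciphertext


def unpack_reads(nonce, ciphertext, key):
    """Reverse of pack_reads: decrypt, decompress, deserialise."""
    compressed = AESGCM(key).decrypt(nonce, ciphertext, None)
    payload = zlib.decompress(compressed).decode("utf-8")
    return [json.loads(line) for line in payload.splitlines()]


if __name__ == "__main__":
    key = AESGCM.generate_key(bit_length=256)
    reads = [{"readId": i, "refStart": i * 50, "sequence": "ACGT"}
             for i in range(PACKET_SIZE)]
    nonce, blob = pack_reads(reads, key)
    assert unpack_reads(nonce, blob, key) == reads
```

AES-GCM is suggested here only because it is straightforward to implement on both the Python and Java sides; any scheme both ends agree on would do.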
This is going to take a little work, because right now we just dump everything into the database. We can't really do that any more, so the process ought to change to use files more, and then load those files into blocks in the database which can be encrypted and compressed. The basic idea is the same: there'll be files associated with (a) a genome and (b) a set of owners (project, align, etc.), and (c) indexed by start position, in blocks of a decent size, say 30-50K, which can be quickly decrypted/decompressed. All the reading will happen in the webapp, but the pipeline needs to write out the storage. GridFS will handle most of it, once we know what we are writing.
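As a rough illustration of how the pipeline might hand a block to GridFS, here's a sketch using pymongo's `gridfs` module. The database name and the metadata field names (`genome`, `owners`, `start`, `end`) are assumptions, not the final schema; extra keyword arguments to `put()` simply land on the `fs.files` document, which is what lets the webapp query blocks by genome and position:

```python
import gridfs
from pymongo import MongoClient

# Assumed database name; the real pipeline config would supply this.
db = MongoClient("mongodb://localhost:27017")["capsid"]
fs = gridfs.GridFS(db)


def store_block(block_bytes, genome_id, owners, start, end):
    """Store one compressed/encrypted block of reads as a GridFS file.

    The keyword arguments become metadata on the fs.files document,
    so blocks can be looked up by genome, owners, and start position.
    """
    return fs.put(
        block_bytes,
        filename="reads-%s-%d" % (genome_id, start),
        genome=genome_id,
        owners=owners,   # e.g. ["project:foo", "align:bar"] -- hypothetical values
        start=start,     # first reference position covered by the block
        end=end,         # last reference position covered by the block
    )


def find_block(genome_id, position):
    """Fetch the block covering a given reference position for a genome."""
    doc = db.fs.files.find_one(
        {"genome": genome_id, "start": {"$lte": position}, "end": {"$gte": position}}
    )
    return fs.get(doc["_id"]).read() if doc else None
```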
The catch is that the pipeline can currently afford to just dump everything out as it goes. We can't do that any more: we need to build up blocks we can handle. Note that the block start offsets do not need to be consistent or sequentially spaced; we can use a uniform read count, or a uniform block size, if we like. That does mean, however, that the reads need to be sorted by start position by the time we get them. The owner issue is less of a problem: we're combining multiple owners into a single owners field on the DB file. Annoyingly, we do this using an update process, so it's not trivial to manage it all sequentially.
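A sketch of that blocking logic: take reads already sorted by start position, cut them into blocks of a uniform read count, and record each block's start. The `BLOCK_READS` value and the `refStart` field name are assumptions, and the `$addToSet` update at the end is just one hedged way of keeping the existing update-style owner handling:

```python
from itertools import islice

BLOCK_READS = 5000  # uniform read count per block (assumption, not a settled number)


def blocks_of(sorted_reads, n=BLOCK_READS):
    """Yield (start_position, reads) from reads already sorted by refStart.

    Cutting every n reads means block starts are monotonically increasing
    but not evenly spaced on the reference, which is fine per the above.
    """
    it = iter(sorted_reads)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            break
        yield chunk[0]["refStart"], chunk


def add_owners(db, genome_id, start, new_owners):
    """Merge extra owners into an existing block's fs.files document.

    $addToSet avoids duplicates if the same owner arrives more than once,
    mirroring the update-based owner handling mentioned above.
    """
    db.fs.files.update_one(
        {"genome": genome_id, "start": start},
        {"$addToSet": {"owners": {"$each": list(new_owners)}}},
    )


if __name__ == "__main__":
    reads = [{"readId": i, "refStart": i * 10} for i in range(12000)]
    for start, chunk in blocks_of(reads):
        print(start, len(chunk))  # 0/5000, 50000/5000, 100000/2000
```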