Background distribution handling #74

biodataganache · 2022-08-28T18:54:52Z

To avoid loading all kmer matrices into memory in the model pipeline (both the score and model rules do this), we can create background distributions from all kmers in a dataset. These could be constructed ahead of time in a dedicated 'build-dist' pipeline, or built on the fly for each individual family in its own thread. The score and model rules would then load these background distributions, which should be only slightly larger than the list of kmers itself, combine them, and use them to score and model*. *I'm not yet clear on how the model step would work.
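As a rough illustration of the idea, here is a minimal Python sketch that builds one kmer count vector per family in a thread and then sums them into a single background. The function names (`count_kmers`, `family_background`, `combine_backgrounds`, `to_frequencies`) and the toy `families` data are hypothetical and not taken from the existing pipeline; how the combined background would feed into scoring or modeling is left open.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_kmers(sequence, k):
    """Count overlapping kmers of length k in one sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def family_background(sequences, k):
    """Sum kmer counts over all sequences in one family (hypothetical helper)."""
    total = Counter()
    for seq in sequences:
        total.update(count_kmers(seq, k))
    return total


def combine_backgrounds(backgrounds):
    """Add per-family count vectors into one combined background."""
    total = Counter()
    for background in backgrounds:
        total.update(background)
    return total


def to_frequencies(counts):
    """Normalize counts to relative frequencies."""
    n = sum(counts.values())
    return {kmer: c / n for kmer, c in counts.items()} if n else {}


# Toy data, assumed for illustration only.
families = {"famA": ["ACGT", "ACGA"], "famB": ["TTAC"]}
k = 2

# Build each family's background in its own thread; only the small
# count vectors are held in memory, never the full kmer matrices.
with ThreadPoolExecutor() as pool:
    per_family = list(pool.map(lambda seqs: family_background(seqs, k),
                               families.values()))

combined = combine_backgrounds(per_family)
frequencies = to_frequencies(combined)
```

Because the per-family results are plain count vectors, combining them is just addition, so the same code would work whether the backgrounds were built ahead of time or on the fly.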

biodataganache · 2022-08-28T19:34:31Z

This would allow the creation of generalized kmer background distribution files, pre-constructed for particular k/alphabet combinations. Users would then not have to supply their own background and could train a model against a pre-built one. These files could be included in the repo.
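One possible way to store such pre-built backgrounds is sketched below, assuming one JSON file per k/alphabet combination. The file naming scheme, the JSON layout, and the helper names (`background_filename`, `save_background`, `load_background`) are all assumptions made for illustration, not an existing format in the repo.

```python
import json
import tempfile
from pathlib import Path


def background_filename(k, alphabet):
    """Hypothetical naming scheme: one file per k/alphabet combination."""
    return f"background_k{k}_{alphabet}.json"


def save_background(counts, directory, k, alphabet):
    """Write kmer counts plus the k/alphabet they were built for."""
    path = Path(directory) / background_filename(k, alphabet)
    with open(path, "w") as handle:
        json.dump({"k": k, "alphabet": alphabet, "counts": counts}, handle)
    return path


def load_background(directory, k, alphabet):
    """Load the kmer counts for a given k/alphabet combination."""
    path = Path(directory) / background_filename(k, alphabet)
    with open(path) as handle:
        payload = json.load(handle)
    return payload["counts"]


# Round trip through a temporary directory with toy counts.
counts = {"AC": 3, "CG": 2}
with tempfile.TemporaryDirectory() as tmp:
    save_background(counts, tmp, 2, "dna")
    loaded = load_background(tmp, 2, "dna")
```

Keying the file on k and alphabet means a user would only need to specify those two settings to pick up the matching background.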

biodataganache assigned biodataganache and christinehc Aug 28, 2022

biodataganache added the enhancement (New feature or request) label Aug 28, 2022
