Enable background subtraction / file unzipping #118
Open: christinehc wants to merge 35 commits into main from background
changelog:
- All files are streamed to input files, rather than just files without associated background files.
- Background filenames (stripped of extensions) are no longer part of the input stream for `rules.score`, preventing odd errors
- Updated associated files to pull all input files as desired.
changelog:
- NOTE THAT WORKFLOW IS CURRENTLY BROKEN DUE TO SNAKEMAKE I/O REASONS AND I AM COMMITTING INTERIM CHANGES
- fix: redo file glob -- file globbing now proceeds through `glob_wildcards` to more cleanly grab input files
- fix: enable unzip -- unzipping has been overhauled (these are forward changes adapted from snekmer 2.0.0 / the biotite-kmers branch).
- fix: add background -- changes have been made to collate background files and use their kmer distribution to subtract a background from protein family kmer models. These fixes work piece-by-piece locally but have not been fully tested and may not work ideally yet.
- Note: changes did NOT work, hence the "broken" tag.
changelog:
- snakemake now correctly builds DAG for background workflow, including file unzipping
- some files have been renamed for simplicity
- some instances of `skm.io.load_npz` have been replaced with `np.load` due to KeyError (perhaps due to numpy or pickle version?)
- `rules.combine_background` now uses the kmer basis set for each family to reshape each background vector. This should make files smaller and the workflow more compact
- NOTE: WORKFLOW IS BROKEN AT `rules.score_with_background` due to file load / array shape issues that will be fixed in the next commit.
- addresses #37
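The reshaping step mentioned for `rules.combine_background` can be pictured as aligning background counts to a family's kmer basis. The sketch below is only an illustration under assumptions: `align_to_basis` is a hypothetical helper, it uses plain dicts and lists rather than the arrays the workflow presumably stores, and it is not the code in this PR.

```python
def align_to_basis(kmer_counts, basis):
    """Return counts ordered like `basis`.

    Hypothetical helper, not part of snekmer. Kmers missing from
    `kmer_counts` get a count of 0; kmers outside `basis` are dropped,
    which is why the aligned vector can be smaller than the full background.
    """
    return [kmer_counts.get(kmer, 0) for kmer in basis]


# Toy background counts and a toy family basis (made-up values).
background = {"AAA": 12, "AAC": 3, "CCC": 7, "GGG": 1}
family_basis = ["AAC", "CCC", "TTT"]

vector = align_to_basis(background, family_basis)
print(vector)  # [3, 7, 0]
```

Restricting each background vector to a family's basis keeps only the columns that family's scoring step would read.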
changelog:
- kmer probability scoring using background subtraction is now the default scoring method
- `snekmer.score.feature_class_probabilities` now performs either background subtraction based scoring, family label based scoring, or a combination thereof depending on user input
- TODO: integration with `snekmer.score.KmerScorer` object
changelog:
- new config parameter `config['score']['method']` added for compatibility with additional new(!) scoring methods
- uptick version from v1.1.1 -> v1.4.0
- upticked +3 minor versions in anticipation of two pending PRs
- remove no longer needed files
changelog:
- kmers can now be scored with a probability score that subtracts the kmers observed in a supplied background set, in a family set, or in both combined
- note: some column headers have changed, which may affect downstream analysis (e.g. integration with #115, #116)
- to handle user-supplied background files, new rules have been created to count background kmers and combine background kmer counts into a background matrix. The appropriate files for the new workflow have been created.
- extensive changes have been made to `snekmer.score` to accommodate the new changes, including:
  - `snekmer.score.score` now has 3 distinct formulae to compute probability scores according to the desired scoring method
  - `snekmer.score.feature_class_probabilities` now also integrates the scoring method
- the main scoring rule itself has been significantly altered as follows:
  - all references to the old and not-working "background subtraction" (e.g. separating sequences by "sample" or "background" labels) have been removed
  - extraneous kmer probability scores for every family are no longer calculated; only the family in question's kmer profile is scored
  - scoring method now integrated
TODO:
changelog:
- fix: `snekmer.utils.get_family` now accepts `regex=None` by default so as not to erroneously truncate filenames.
- fix: small change to `snekmer.utils.get_family` to correctly identify directories.
- refactor: overhaul `snekmer.utils.split_file_ext` to split at the point of an .faa, .fa, .fna, or .fasta extension instead of assuming at most 2 potential extensions
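To illustrate the extension-splitting idea from the refactor above, here is a minimal sketch. `split_at_fasta_ext` is a hypothetical stand-in; the real `snekmer.utils.split_file_ext` may have a different signature and return value.

```python
import re

# Hypothetical pattern: a FASTA extension, optionally followed by further
# suffixes such as ".gz". Longer alternatives are listed before ".fa".
FASTA_EXT = re.compile(r"\.(fasta|faa|fna|fa)(\..+)?$", re.IGNORECASE)


def split_at_fasta_ext(filename):
    """Return (stem, extension), splitting where the FASTA extension starts.

    Illustrative only; not the snekmer implementation.
    """
    match = FASTA_EXT.search(filename)
    if match is None:
        return filename, ""  # no recognised FASTA extension
    return filename[: match.start()], filename[match.start() + 1:]


print(split_at_fasta_ext("family.v2.faa.gz"))  # ('family.v2', 'faa.gz')
print(split_at_fasta_ext("seqs.fasta"))        # ('seqs', 'fasta')
```

Splitting at the FASTA extension, rather than peeling off a fixed number of suffixes, keeps dots inside the family name (as in `family.v2`) from being mistaken for extensions.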
changelog:
- file unzipping is now handled by top-level unzip code in each snakefile; thus, `process.smk` is outdated and has been deleted as it is no longer needed.
changelog:
- file wildcard globbing previously proceeded through `glob.glob`, but had been updated in the model workflow to use snakemake's `glob_wildcards` utility. This method has the added benefit of preventing recursion errors with wildcard retrieval from gzipped files. The changes have now been applied to cluster and search workflows.
changelog:
- refactor: move `cluster_cluster.py` -> `cluster.py`
- refactor: move cluster report generation to separate script directive
- fix: change cluster mode file globbing to mirror model mode changes, i.e. uses snakemake `glob_wildcards` instead of python `glob.glob`. This should also fix unzipping issues and recursion errors related to unzipping.
changelog:
- fix: search file globbing updated to use snakemake's `glob_wildcards` rather than python's `glob.glob` in search mode. Should also resolve issues with file detection for files requiring unzipping and avoid recursion errors. Tested locally with a small subset of small families.
- style: applied snakefmt to `cluster.smk` and `search.smk`
changelog:
- feat: Snakemake `--resources` flag has been added to Snekmer CLI for all modes and tested locally.
- refactor: Wrapped all snakemake command line arguments into a dictionary, which is now passed to all snekmer subcommands. This removes the redundancy of specifying the same command line arguments every time a subcommand is called.
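The dictionary-of-arguments refactor can be sketched with `argparse` as below. This is an assumption-laden illustration, not the Snekmer CLI: the option table `SHARED_ARGS`, the flag defaults, and `build_parser` are all hypothetical. Only the mode names (cluster, model, search) and the `--resources` flag come from this PR.

```python
import argparse

# Hypothetical option table: each flag is defined once and reused by every
# subcommand. The flags and defaults in the real CLI may differ.
SHARED_ARGS = {
    "--cores": {"type": int, "default": 1, "help": "number of cores to use"},
    "--resources": {"nargs": "*", "default": [], "help": "NAME=INT resource limits"},
    "--dryrun": {"action": "store_true", "help": "plan the run without executing"},
}


def build_parser():
    """Build a parser whose subcommands all share the same option table."""
    parser = argparse.ArgumentParser(prog="snekmer")
    subparsers = parser.add_subparsers(dest="mode", required=True)
    for mode in ("cluster", "model", "search"):
        sub = subparsers.add_parser(mode)
        for flag, options in SHARED_ARGS.items():
            sub.add_argument(flag, **options)
    return parser


args = build_parser().parse_args(
    ["model", "--cores", "4", "--resources", "mem_mb=8000"]
)
print(args.mode, args.cores, args.resources)  # model 4 ['mem_mb=8000']
```

With the options held in one table, adding a flag such as `--resources` becomes a one-line change that reaches every mode.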
changelog:
- fix: resolve error with array shapes due to matrix dimensions (transpose matrix required)
- refactor: renamed variables to streamline code
changelog:
- basis harmonization now accounts for either 1D or 2D array cases
  - 1D arrays are explicitly handled to match expected shape parameters set by the assumption that input arrays are 2D
- `utils.check_n_seqs` now uses boolean input arg to handle gz files rather than inferring from filename
changelog:
- Workflow now accounts for cases where no background files are included when either "combined" or "background" mode is selected. (TODO: raise warning in this case)
- Bypass UnicodeDecodeError for `utils.check_n_seqs`
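The two `utils.check_n_seqs` changes in the commits above (an explicit boolean for gzipped input, and tolerating undecodable bytes) could look roughly like the sketch below. `count_fasta_records` is a hypothetical stand-in, and using `errors="replace"` is an assumption about one way to avoid a `UnicodeDecodeError`; the PR does not say how the bypass is implemented.

```python
import gzip


def count_fasta_records(path, gz=False):
    """Count FASTA records by counting '>' header lines.

    Hypothetical helper, not snekmer's `check_n_seqs`. The caller states
    whether the file is gzipped instead of the function guessing from the
    filename. Undecodable bytes are replaced rather than raising
    UnicodeDecodeError (an assumed strategy).
    """
    opener = gzip.open if gz else open
    with opener(path, "rt", errors="replace") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

A call would then look like `count_fasta_records("family.faa.gz", gz=True)`, where the filename is a placeholder.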
Description
This PR (1) enables automatic detection and unzipping of gzipped files as part of the main Snekmer workflow, and (2) overhauls the integration of background files into Snekmer workflows. Background files can now be supplied to Snekmer, and the kmer profile of the background sequences is used to inform the probability of kmers appearing in a given family vs. a background, thus affecting downstream models. For (2), a parallel workflow is enabled in Snekmer that processes background files and sums the kmer profiles observed across the background for integration into the scoring and modeling steps. See the full changelog for details.
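As a rough picture of how a scoring method could switch between family-based, background-based, and combined scoring, consider the sketch below. The PR does not state the three formulae, so the arithmetic here is a placeholder chosen only to show dispatch on the method name. `presence_fraction` and `score_kmer` are hypothetical and are not `snekmer.score` functions.

```python
def presence_fraction(sequences_kmers, kmer):
    """Fraction of sequences (each given as a set of kmers) containing `kmer`."""
    if not sequences_kmers:
        return 0.0  # e.g. no background sequences were supplied
    return sum(kmer in kmers for kmers in sequences_kmers) / len(sequences_kmers)


def score_kmer(kmer, family, others, background, method="background"):
    """Placeholder scores; the real formulae in this PR may differ."""
    p_family = presence_fraction(family, kmer)
    p_others = presence_fraction(others, kmer)
    p_background = presence_fraction(background, kmer)
    if method == "family":
        return p_family - p_others
    if method == "background":
        return p_family - p_background
    if method == "combined":
        return p_family - p_others - p_background
    raise ValueError(f"unknown scoring method: {method!r}")


# Toy data: each sequence is represented by the set of kmers it contains.
family = [{"AAA", "AAC"}, {"AAA"}]
others = [{"AAA"}, {"CCC"}]
background = [{"AAA"}, {"GGG"}, {"GGG"}, {"TTT"}]

for method in ("family", "background", "combined"):
    print(method, score_kmer("AAA", family, others, background, method))
```

In this toy setup a kmer common in the background is penalised under the "background" and "combined" methods, which is the qualitative effect the description attributes to background subtraction.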
Issues
Full Changelog
- Background filenames (stripped of extensions) are no longer part of the input stream for `rules.score`, preventing odd errors
- file globbing now proceeds through `glob_wildcards` to more cleanly grab input files
- some instances of `skm.io.load_npz` have been replaced with `np.load` due to KeyError (perhaps due to numpy or pickle version?)
- `rules.combine_background` now uses kmer basis set for each family to reshape each background vector. should make files smaller and workflow more compact
- `snekmer.score.feature_class_probabilities` now performs either background subtraction based scoring, family label based scoring, or a combination thereof depending on user input
- extensive changes have been made to `snekmer.score` to accommodate the new changes, including:
  - `snekmer.score.score` now has 3 distinct formulae to compute probability scores according to the desired scoring method
  - `snekmer.score.feature_class_probabilities` now also integrates the scoring method