Background file issue #37

Open
biodataganache opened this issue Dec 16, 2021 · 2 comments · May be fixed by #118
Labels
bug Something isn't working

Comments

@biodataganache
Collaborator

I'm not sure the background file option is working as intended? When I include a background.fasta file in the input/background/ folder it doesn't seem to do anything.

@biodataganache added the bug label Dec 16, 2021
@christinehc
Collaborator

This may be due to how background files are currently processed, under the assumption that they are also separated per-family with family names corresponding to input family filenames. I think a rework of the code to remove this assumption is due.

@christinehc added this to the Snekmer Paper Final Draft milestone Dec 16, 2021
christinehc added a commit that referenced this issue Jan 11, 2022
Previously, applying end models to a new set of sequences was complicated by several intermediate steps where the kmer vector is standardized to multiple cascading subsets of kmers, in a certain order. These subsets and their ordering were not preserved in the previous iteration of the workflow, meaning that applying those same transforms to new input vectors was not possible.

All intermediate transformation steps are now handled in the `snekmer.model.KmerScorer` object, which also preserves `snekmer.model.KmerBasis` and `snekmer.score.KmerScoreScaler` objects for intermediate reshaping. Code has been tested to successfully run locally on two files.

Creates all code for #27 (usage will be elaborated upon in a notebook). Also more comprehensively fixes #12. Does not yet integrate the changes suggested in #37.
@christinehc
Collaborator

Small update: background file integration is now underway (see 8c0f312)

christinehc added a commit that referenced this issue Nov 8, 2023
changelog:
- snakemake now correctly builds DAG for background workflow, including file unzipping
- some files have been renamed for simplicity
- some instances of `skm.io.load_npz` have been replaced with `np.load` due to KeyError (perhaps due to numpy or pickle version?)
- `rules.combine_background` now uses kmer basis set for each family to reshape each background vector. should make files smaller and workflow more compact
- NOTE: WORKFLOW IS BROKEN AT `rules.score_with_background` due to file load / array shape issues that will be fixed in the next commit.
- addresses #37
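
The reshaping step described for `rules.combine_background` can be pictured with a minimal sketch like the one below. It is not the Snekmer implementation: the helper `reshape_to_basis` and the example kmers are hypothetical, and it only assumes that a background vector is a list of kmers with matching counts that must be re-expressed in the order of a family's kmer basis.

```python
import numpy as np


def reshape_to_basis(kmers, counts, basis):
    """Re-express a kmer count vector in the order of a reference basis.

    Hypothetical helper for illustration only (not part of snekmer).
    Kmers missing from the input get a count of 0; kmers that are not
    in the basis are dropped.
    """
    lookup = dict(zip(kmers, counts))
    return np.array([lookup.get(kmer, 0) for kmer in basis])


# Made-up example data: a family basis and one background vector.
family_basis = ["AAA", "AAC", "ACA"]
background_kmers = ["ACA", "AAA", "GGG"]
background_counts = [5, 2, 9]

reshaped = reshape_to_basis(background_kmers, background_counts, family_basis)
print(reshaped)
```

Because each reshaped vector only keeps the columns in the family basis, the stored background matrix is no wider than the family matrix, which is consistent with the smaller files mentioned in the changelog.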
christinehc added a commit that referenced this issue Dec 12, 2023
changelog:
- kmers can now be scored by a probability score that subtracts the kmers observed in a supplied background set, a family set, or both combined
  - note: some column headers have changed, which may affect downstream analysis (e.g. integration with #115, #116)
- to handle user-supplied background files, new rules have been created to count background kmers and combine background kmer counts into a background matrix. The appropriate files for the new workflow have been created.
- extensive changes have been made to `snekmer.score` to accommodate the new changes, including:
  - `snekmer.score.score` now has 3 distinct formulae to compute probability scores according to the desired scoring method
  - `snekmer.score.feature_class_probabilities` now also integrates the scoring method
- the main scoring rule itself has been significantly altered as follows:
  - all references to the old and not-working "background subtraction" (e.g. separating sequences by "sample" or "background" labels) have been removed
  - extraneous kmer probability scores for every family are no longer calculated; only the family in question's kmer profile is scored
  - scoring method now integrated
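
The thread does not give the formulae used in `snekmer.score.score`, so the following is only a rough sketch of what background-subtracted scoring could look like. It assumes, purely for illustration, that a kmer's probability is the fraction of sequences containing it and that the score is the family probability minus the background probability. The function names and the example matrices are hypothetical.

```python
import numpy as np


def presence_probability(matrix):
    """Fraction of sequences (rows) in which each kmer (column) occurs."""
    matrix = np.asarray(matrix)
    return (matrix > 0).mean(axis=0)


def score_with_background(family_counts, background_counts):
    """Assumed scoring rule: P(kmer | family) - P(kmer | background).

    Both matrices must share the same kmer columns (the same basis).
    """
    return presence_probability(family_counts) - presence_probability(
        background_counts
    )


# Made-up count matrices: rows are sequences, columns are kmers.
family = np.array([[1, 0, 2],
                   [3, 0, 1]])
background = np.array([[1, 1, 0],
                       [0, 1, 0],
                       [0, 1, 0],
                       [1, 1, 0]])

scores = score_with_background(family, background)
print(scores)
```

Under this assumed rule, a kmer common in the family but rare in the background scores near 1, and a kmer seen mostly in the background scores near -1.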
@christinehc linked a pull request Dec 12, 2023 that will close this issue

2 participants