Paired read processing #133
Draft: Colelyman wants to merge 407 commits into pinellolab:master from edilytics:upstream-paired-processing
The most common alleles for each pooled target are output if the flag '--compile_postrun_references' is provided. This writes alleles at the frequency cutoff defined by the parameter --compile_postrun_reference_allele_cutoff. This file can be manually edited to remove noisy alleles and then used to run CRISPRessoPooled again, providing alternate alleles to each CRISPResso run via the parameter '--alternate_alleles'. This is particularly useful in cases where control experiments are available. The running pattern would be:
1) CRISPRessoPooled --compile_postrun_references {control}
2) CRISPRessoPooled --alternate_alleles {produced in step 1} {control}
   CRISPRessoPooled --alternate_alleles {produced in step 1} {experiment}
Rename postrun references file to be more standardized with other output files. Output is now "CRISPResso2Pooled_postrun_references.txt"
Related to issue #61. This happens when N_ROWS < 1, which I assume has something to do with no results -> negative control.
Fix a bug when generating compare plot
- Special bonus for y'all to keep you company during covid - axis ticks on most plots!
- added parameter --plot_histogram_outliers to plot all insertion sizes in histogram
- all insertion sizes are reported in .hist output files #64
- add HDR reference plot (may change this later to set ref1 to the longer reference of WT/HDR but for now it is always WT..)
Allow reverse complement of extension seq if PE sequence is specified.
Frameshift plots don't show 0-bp changes (these dwarf all other changes). The number of reads not shown is added to the legend. Addressed cloning quantification windows when bases are inserted in the clone-ee (previously these cloned bases would be ignored). Force HDR to clone all quantification windows from Ref 1. Fix #60 and #59
Plot window for sgRNA will be the same length after cloning even if the window is shorter or longer after comparing between ref1 and HDR.
Adds CRISPRessoAggregate
Adds start/end time to CRISPRessoBatch info
Started removing pickle dependencies from Pooled and Report
In CRISPRessoWGS, the region file contains a 'chr_id' column which is sometimes misread as ints by pandas when the chromosome notation omits 'chr' (e.g. 1,2,3 instead of chr1,chr2,chr3). This bug fix forces chr_ids to be read as strs.
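The type inference problem described above can be reproduced with a small sketch. The region file contents and the `bpstart`/`bpend` column names here are made up for illustration; only the `chr_id` column name comes from the commit message, and this is not necessarily how the fix is written in CRISPRessoWGS.

```python
import io

import pandas as pd

# Hypothetical region file; 'bpstart' and 'bpend' are illustrative column names.
region_text = "chr_id\tbpstart\tbpend\n1\t100\t200\n2\t300\t400\n"

# Without a dtype, pandas infers integers for chromosome names such as 1 and 2.
inferred = pd.read_csv(io.StringIO(region_text), sep="\t")

# Forcing the column to str keeps the chromosome names as strings.
forced = pd.read_csv(io.StringIO(region_text), sep="\t", dtype={"chr_id": str})

print(inferred["chr_id"].tolist())
print(forced["chr_id"].tolist())
```

Passing `dtype` per column leaves the coordinate columns free to be inferred as integers.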
Starting in version 2.1.0, insertion quantification has been changed to only include insertions completely contained by the quantification window. To use the legacy quantification method (i.e. include insertions directly adjacent to the quantification window) please use the parameter --use_legacy_insertion_quantification
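As a rough sketch of the difference between the two rules, assume an insertion sits between two adjacent reference positions and the quantification window is a set of reference positions. The function name and this representation are assumptions for illustration, not the CRISPResso implementation.

```python
def insertion_is_counted(left_flank, window, legacy=False):
    """Decide whether an insertion between left_flank and left_flank + 1 is counted.

    Hypothetical helper: 'window' is a set of reference positions.
    """
    right_flank = left_flank + 1
    if legacy:
        # Legacy rule: an insertion touching the window on either side counts.
        return left_flank in window or right_flank in window
    # Newer rule: both flanking positions must lie inside the window.
    return left_flank in window and right_flank in window


window = set(range(10, 20))  # reference positions 10..19

edge_new = insertion_is_counted(9, window)
edge_legacy = insertion_is_counted(9, window, legacy=True)
inside_new = insertion_is_counted(12, window)
```

Under these assumptions, an insertion directly adjacent to the window edge is counted only by the legacy rule.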
Prime editing input parameters are forced to be in the RNA 3'->5' direction. This makes sure that the scaffold incorporation happens on the correct side of the extension sequence. Errors are thrown if improper directionality is detected. Fastq_out now includes alignment scores and details for every run (it may be time to upgrade that SSD to hold these new fastq output files, but it makes debugging particular reads a lot easier!) Update to linked data for plot 3b in report
Multiple amplicon names are resolved before adding the HDR amplicon -- unnamed amplicons are named Amplicon{i} for each amplicon. Plot 4g data (nuc pct table, mod pct table for all reads aligned to the first reference) is output and linked to from the plot display. Ambiguous reads don't contribute to plot 4g data (which would otherwise lead to double counting and pct values > 1).
These changes implement separating reads to their corresponding amplicons via Python instead of through awk. This is to get around the maximum number of open files that is limited on many operating systems. Co-authored-by: Kendell Clement <[email protected]>
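One way to split reads per amplicon without holding many file handles open is to buffer reads in memory and append to one file at a time. The sketch below illustrates that idea only; the function name, the `(amplicon_name, read_text)` record shape, and the `.txt` output names are assumptions, not the code in this PR.

```python
import os
import tempfile
from collections import defaultdict


def split_reads_by_amplicon(records, out_dir, flush_every=1000):
    """Write each read to a per-amplicon file, opening one file at a time.

    Hypothetical helper: 'records' yields (amplicon_name, read_text) pairs.
    """
    buffers = defaultdict(list)
    paths = {}
    buffered = 0

    def flush():
        for name, lines in buffers.items():
            # Append, then close immediately, so only one handle is open at once.
            with open(paths[name], "a") as handle:
                handle.writelines(lines)
        buffers.clear()

    for name, read_text in records:
        if name not in paths:
            paths[name] = os.path.join(out_dir, f"{name}.txt")
        buffers[name].append(read_text)
        buffered += 1
        if buffered >= flush_every:
            flush()
            buffered = 0
    flush()
    return paths


with tempfile.TemporaryDirectory() as out_dir:
    demo_records = [("ampA", "read1\n"), ("ampB", "read2\n"), ("ampA", "read3\n")]
    demo_paths = split_reads_by_amplicon(demo_records, out_dir, flush_every=2)
    with open(demo_paths["ampA"]) as handle:
        amp_a_reads = handle.read()
```

The `flush_every` value trades memory for fewer open/close cycles; the number of simultaneously open files stays at one regardless of how many amplicons there are.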
* Error out if HDR amplicon matches existing amplicon
* Add check for amplicon sequence uniqueness
* Fix bug with bam_input not having bam_output
* Test for no returned lines in auto mode, version bump to 2.2.11
* Fix pandas deprecation of df.append
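For the `df.append` item, the usual replacement is `pd.concat`, since `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0. The data below is made up, and the exact call sites changed in this commit are not shown in the message.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({"amplicon": ["A"], "reads": [10]})
new_row = pd.DataFrame({"amplicon": ["B"], "reads": [5]})

# Old spelling (removed in pandas 2.0): df = df.append(new_row, ignore_index=True)
combined = pd.concat([df, new_row], ignore_index=True)
```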
CRISPResso checks that prime editing guides are provided in the proper orientation (e.g. pegRNA 3'->5', spacer sequence 5'->3') and checks these orientations by alignment. Sometimes, the alignment can be better in the opposite direction, and this parameter allows these checks to be overridden. Otherwise, these checks would halt the program and produce the output 'The prime editing pegRNA spacer sequence appears to be given in the 3\'->5\' order. The prime editing pegRNA spacer sequence (--prime_editing_pegRNA_spacer_seq) must be given in the RNA 5\'->3\' order.'
If the user specifies prime_editing_override_prime_edited_ref_seq, it might not contain the extension seq (if they don't provide the extension seq in the appropriate orientation), so check that here. The extension sequence should be provided as the reverse complement of the prime edited sequence.
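A check of this kind could look roughly like the following. The helper names and the example sequences are hypothetical, and the sketch handles DNA letters only; the actual check in CRISPResso may differ.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def extension_in_edited_ref(extension_seq, prime_edited_ref_seq):
    """Hypothetical check: the extension's reverse complement must occur in the edited reference."""
    return reverse_complement(extension_seq) in prime_edited_ref_seq.upper()


edited_ref = "TTGACCATGCA"  # made-up prime edited reference
good_orientation = extension_in_edited_ref("ATGGTC", edited_ref)
wrong_orientation = extension_in_edited_ref("GACCAT", edited_ref)
```

Here `ATGGTC` reverse-complements to `GACCAT`, which occurs in the made-up reference, while passing `GACCAT` directly does not match.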
* Add FLASh and Trimmomatic deprecation notice to CLI output
* Add Edilytics email address to CLI output
Previously, the bam would set the cigar string to 0 if the read was unaligned. This breaks the sam->bam conversion and causes the errors in pinellolab#235.
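For context, the SAM format uses `*` in the CIGAR field when no alignment is available, and flag `4` marks a read as unmapped. The sketch below builds such a record; the function name is hypothetical and this is an illustration of the SAM convention, not the code changed in this commit.

```python
def unaligned_sam_line(read_name, seq, qual):
    """Build a SAM record for an unaligned read (hypothetical helper).

    Fields: QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL.
    """
    fields = [read_name, "4", "*", "0", "0", "*", "*", "0", "0", seq, qual]
    return "\t".join(fields)


line = unaligned_sam_line("read1", "ACGT", "IIII")
```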
This change checks to see if a bam file was input, and if so it doesn't try to remove any intermediate files because there aren't any. Co-authored-by: Cole Lyman <[email protected]>
pinellolab#274) I have suffered enough trying to debug my installation, so hopefully this helps someone else. Co-authored-by: Cole Lyman <[email protected]>
In the most recent version of numpy (1.24) some of the types have been deprecated. This commit fixes these errors.
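The aliases `np.int`, `np.float`, `np.bool`, and `np.object` no longer exist as of NumPy 1.24. Which of them this commit touched is not stated, so the following is only a generic sketch of the replacement pattern.

```python
import numpy as np

# Old spelling (fails on NumPy >= 1.24): np.zeros(3, dtype=np.int)
counts = np.zeros(3, dtype=int)    # builtin int replaces np.int
freqs = np.zeros(3, dtype=float)   # builtin float replaces np.float
mask = np.zeros(3, dtype=bool)     # builtin bool replaces np.bool
```

Where a specific width matters, an explicit type such as `np.int64` or `np.float64` can be used instead of the builtin.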
* Fixing documentation to match pooled headers
* Header removal bug fix change documentation to guide_seq
* Update documentation and help feature for CRISPRessoPooled
* Remove extra newlines from CRISPRessoPooled -h
* Make variable names as clear as my firstborn child's name
* Update one more variable name
Co-authored-by: Samuel Nichols <[email protected]>
* Implement logging handler to overwrite the latest log status to file
* Add StatusHandler to CRISPRessoCORE log
  This will take the latest log output and write it to a file (`status.txt`), the catch being that with each log the file is overwritten so that one can easily tell where CRISPResso currently is and what the error is (if any). These changes include some slight refactoring in order to accommodate any potential parameter exceptions.
* Add StatusHandler to CRISPRessoBatch and refactor `logger.warn` to `warn`
* Add StatusHandler to CRISPRessoPooled and a little refactoring
* Implement `percent_complete` to the status log
* Add StatusHandler to CRISPRessoAggregate log
* Add StatusHandler to CRISPRessoCompare log
* Add StatusHandler to CRISPRessoPooledWGSCompare log
* Add StatusHandler to CRISPRessoWGS log
* Rename `status.txt` to `CRISPResso_status.txt`
* Modify status log names to match the tool they are generated from
* Add percent_complete stages to CRISPRessoCORE
  These also include log statements of each plot that is being generated as well as fixing some variable name collisions with `ind`.
* Format the percentage in the log to be 2 decimal places
* Change all plotting logs from `info` to `debug` and simplify progress
  This refactors how the progress of the plots is calculated, making it much simpler. Before this change we would have had to keep track of the number of times `percent_complete` was output, but now it simply updates the percent complete after each amplicon is finished processing. Hopefully this will make things easier to maintain even though it will be a little less "accurate" (not sure how accurate the original implementation was...).
* Implemented shared console log handler across all CRISPResso* calls
  This allows for easy changes to logging formatting, which was inspired by having to change the default logging level. The default logging level needs to be set at `logging.DEBUG` in order for the debug log statements to not be ignored for the running and status logs.
* Add ability to set the verbosity level to each CRISPResso* tool
  This allows users to set a verbosity level between 1 and 4 using the `-v`/`--verbosity` CLI parameter. If the `--debug` flag is present, then the level will default to 4, being the most verbose.
* Implement showing the last seen `percent_complete` when none is provided
* Keep track of and log when multiple parallel runs are completed
  These changes modify `CRISPRessoMultiProcessing.run_crispresso_cmds` such that we can now display when a run is completed. This potentially breaks how signals and interrupts are handled with multiple runs happening, but this needs to be reviewed.
* Add debug and percentage complete to CRISPRessoBatch
* Add percent complete to CRISPRessoPooled
* Add debug and percent_complete message to CRISPRessoAggregate
* Add `percent_complete` to CRISPRessoCompare
* Add `percent_complete` to CRISPRessoPooledWGSCompare
* Add status and `percent_complete` to CRISPRessoMeta
* Add `verbosity` arguments to CRISPRessoCompare and CRISPRessoPooledWGSCompare
* Fixing documentation to match pooled headers
* Header removal bug fix change documentation to guide_seq
* Update documentation and help feature for CRISPRessoPooled
* Remove extra newlines from CRISPRessoPooled -h
* Make variable names as clear as my firstborn child's name
* Update one more variable name
* Fix bug to flow CRISPRessoPooled options to sub command
* Make amplicon file args variable name clear
* Update how parameters are set and retrieved from parameter object
  The refactor in the previous commit changed the type of the arguments to a dictionary which doesn't have the parameters as attributes, and this commit fixes that error.
* Add note in output header for change in default CRISPRessoPooled
  In the next release (2.3.0) the `--demultiplex_only_at_amplicons` will be the default when running in mixed-mode. This is to allow for inexact alignments of the reads and the amplicons to the genome. For more context, see this issue pinellolab#276
* Clarify the verbosity parameter help message
* Separate out parameters to `normalize_name` in CRISPRessoCORE
* Separate out parameters to `normalize_name` in CRISPRessoWGS
* Separate out parameters to `normalize_name` in CRISPRessoPooled
* Separate out parameters to `normalize_name` in CRISPRessoCompare
* Fix bug in CRISPRessoPooled by replacing `database_id` with `normalize_name`
* Refactor `run_crispresso_cmds` to not require a `logger`
  This commit implements the functionality to make the `logger` object optional by seeing which module called the `run_crispresso_cmds` function and obtaining the correct object from that module name. The function also immediately returns when no commands are passed to it.
* Add amplicon name to plotting debug statements in CRISPRessoCORE
---------
Co-authored-by: Cole Lyman <[email protected]>
Co-authored-by: Cole Lyman <[email protected]>
Co-authored-by: Cole Lyman <[email protected]>
Co-authored-by: Samuel Nichols <[email protected]>
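The status-file idea from the commit message above (each log record overwrites the file so it always holds the latest status) can be sketched with a small `logging.Handler` subclass. The class name `StatusHandler` and the file name `CRISPResso_status.txt` come from the commit message; the body, the logger name, and the messages are assumptions for illustration and may not match the PR.

```python
import logging
import os
import tempfile


class StatusHandler(logging.Handler):
    """Sketch of a handler that keeps only the most recent log record in a file."""

    def __init__(self, filename):
        super().__init__()
        self.filename = filename

    def emit(self, record):
        # Mode "w" truncates the file, so each record replaces the previous one.
        with open(self.filename, "w") as handle:
            handle.write(self.format(record) + "\n")


status_path = os.path.join(tempfile.mkdtemp(), "CRISPResso_status.txt")
logger = logging.getLogger("status_sketch")  # hypothetical logger name
logger.setLevel(logging.DEBUG)  # DEBUG so that debug records are not dropped
logger.propagate = False
logger.addHandler(StatusHandler(status_path))

logger.info("Aligning reads")
logger.debug("Making plots")

with open(status_path) as handle:
    latest = handle.read()
```

After both calls the file contains only the second message, which is what makes it easy to see where a run currently is.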
…un if one fails. Use `conda install -c conda-forge pytest-check` to install the dependencies
Colelyman force-pushed the upstream-paired-processing branch from b074699 to 41bfcaa on September 21, 2023 22:29
The algorithm now passes the unit tests. I imagine it needs to be integrated into the core so that the newly aligned reads are actually used, right? If so, could you point me to where this would need to be done?