Paired read processing #133

Draft
wants to merge 407 commits into
base: master
from

Colelyman
Contributor

The algorithm now passes the unit tests. I imagine that it needs to be integrated into the core so that the newly aligned reads are actually used, right? If so, could you point me to where this would need to be done?

kclem and others added 30 commits September 23, 2020 23:32
Most common alleles for each pooled target are output if the flag '--compile_postrun_references' is provided. This writes alleles with frequency defined by the parameter '--compile_postrun_reference_allele_cutoff'.

This file can be manually edited to remove noisy alleles, and then used to run CRISPRessoPooled again but to provide alternate alleles to each CRISPResso run by using the parameter '--alternate_alleles'.

This is particularly useful in cases where control experiments are available. The running pattern would be:
1) CRISPRessoPooled --compile_postrun_references {control}
2) CRISPRessoPooled --alternate_alleles {produced in step 1} {control}
CRISPRessoPooled --alternate_alleles {produced in step 1} {experiment}
Rename postrun references file to be more standardized with other output files.
Output is now "CRISPResso2Pooled_postrun_references.txt"
Related to issue #61
This happens when N_ROWS < 1 which I assume has something to do with no results -> negative control
Fix a bug when generating compare plot
- Special bonus for y'all to keep you company during covid - axis ticks on most plots!
- added parameter --plot_histogram_outliers to plot all insertion sizes in histogram
- all insertion sizes are reported in .hist output files #64
- add HDR reference plot (may change this later to set ref1 to the longer reference of WT/HDR but for now it is always WT..)
Allow reverse complement of extension seq if PE sequence is specified.
Frameshift plots don't show 0-bp changes (these dwarf all other changes). The number of reads not shown is added to the legend. Addressed cloning quantification windows when bases are inserted in the clone-ee (previously these cloned bases would be ignored).
Force HDR to clone all quantification windows from Ref 1
Fix #60 and #59
Plot window for sgRNA will be the same length after cloning even if the window is shorter or longer after comparing between ref1 and HDR.
Adds CRISPRessoAggregate
Adds start/end time to CRISPRessoBatch info
Started removing pickle dependencies from Pooled and Report
In CRISPRessoWGS, the region file contains a 'chr_id' column which is sometimes misrecognized as ints when read by pandas if using the chromosome notation without 'chr' (e.g. 1,2,3 instead of chr1,chr2,chr3). This bug fix forces chr_ids to be read as strs.
Starting in version 2.1.0, insertion quantification has been changed to only include insertions completely contained by the quantification window.
To use the legacy quantification method (i.e. include insertions directly adjacent to the quantification window) please use the parameter --use_legacy_insertion_quantification
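
The difference between the two quantification modes can be sketched as below. This is only an illustration: the function name, the inclusive window convention, and the representation of an insertion by its two flanking reference positions are assumptions made for the example, not CRISPResso's internal code.

```python
def insertion_is_counted(left_flank, right_flank, window, legacy=False):
    """Decide whether an insertion counts toward the quantification window.

    An insertion sits between two adjacent reference positions,
    left_flank and right_flank (right_flank == left_flank + 1).
    window is an inclusive (start, end) pair of reference positions.
    All of these conventions are assumptions for this sketch.
    """
    start, end = window
    if legacy:
        # Legacy mode: one flanking base inside the window is enough, so an
        # insertion directly adjacent to the window is still counted.
        return start <= left_flank <= end or start <= right_flank <= end
    # Current mode: both flanking bases must lie inside the window.
    return start <= left_flank and right_flank <= end
```

With a window of `(10, 20)`, an insertion between positions 9 and 10 is counted only when `legacy=True`, while an insertion between 12 and 13 is counted in both modes.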
Prime editing input parameters are forced to be in the RNA 3'->5' direction. This makes sure that the scaffold incorporation happens on the correct side of the extension sequence. Errors are thrown if improper directionality is detected.
Fastq_out now includes alignment scores and details for every run (it may be time to upgrade that SSD to hold these new fastq output files, but it makes debugging particular reads a lot easier!)
Update to linked data for plot 3b in report
Multiple amplicon names are resolved before adding the HDR amplicon -- unnamed amplicons are named Amplicon{i} for each amplicon.
Plot 4g data (nuc pct table, mod pct table for all reads aligned to the first reference) is output and linked to from the plot display
Ambiguous reads don't contribute to plot 4g data (which would otherwise lead to double counting and pct values > 1)
These changes implement separating reads to their corresponding amplicons via
Python instead of through awk. This is to get around the maximum number of open
files that is limited on many operating systems.
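
One way to split reads by amplicon without holding many files open is to buffer reads in memory and append to one file at a time. The sketch below is a hypothetical illustration of that idea, not the code from this commit; the function name, the `(amplicon_name, read_text)` input format, and the `.fastq` naming are assumptions.

```python
import os
from collections import defaultdict


def split_reads_by_amplicon(records, out_dir, max_buffered=100_000):
    """Write reads to per-amplicon files while keeping one file open at a time.

    records: iterable of (amplicon_name, read_text) pairs (assumed format).
    Returns a dict mapping each amplicon name to the number of reads written.
    """
    os.makedirs(out_dir, exist_ok=True)
    buffers = defaultdict(list)
    counts = defaultdict(int)
    buffered = 0

    def flush():
        # Open, append to, and close each file in turn, so the number of
        # simultaneously open handles never exceeds one.
        for name, reads in buffers.items():
            with open(os.path.join(out_dir, name + ".fastq"), "a") as handle:
                handle.writelines(reads)
        buffers.clear()

    for name, read_text in records:
        buffers[name].append(read_text)
        counts[name] += 1
        buffered += 1
        if buffered >= max_buffered:
            flush()
            buffered = 0
    flush()
    return dict(counts)
```

The trade-off is memory for file handles: a larger `max_buffered` means fewer open/close cycles but more reads held in memory between flushes.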

Co-authored-by: Kendell Clement <[email protected]>
kclem and others added 12 commits October 6, 2022 16:32
* Error out if HDR amplicon matches existing amplicon

* Add check for amplicon sequence uniqueness

* Fix bug with bam_input not having bam_output

* Test for no returned lines in auto mode, version bump to 2.2.11

* Fix pandas deprecation of df.append
CRISPResso checks that prime editing guides are provided in the proper orientation (e.g. pegRNA 3'->5', spacer sequence 5'->3') and checks these orientations by alignment. Sometimes, the alignment can be better in the opposite direction, and this parameter allows these checks to be overridden. Otherwise, these checks would halt the program and produce the output 'The prime editing pegRNA spacer sequence appears to be given in the 3\'->5\' order. The prime editing pegRNA spacer sequence (--prime_editing_pegRNA_spacer_seq) must be given in the RNA 5\'->3\' order.'
If the user specifies the prime_editing_override_prime_edited_ref_seq, it might not contain the extension seq (if they don't provide the extension seq in the appropriate orientation), so check that here. The extension sequence should be provided reverse-complement to the prime edited sequence.
* Add FLASh and Trimmomatic deprecation notice to CLI output

* Add Edilytics email address to CLI output
@Colelyman Colelyman marked this pull request as draft December 8, 2022 20:49
kclem and others added 16 commits December 19, 2022 13:28
Previously, the bam would set the cigar string to 0 if the read was unaligned. This breaks the sam->bam conversion and causes the errors in pinellolab#235.
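
For context, the SAM format uses `*` for a CIGAR string that is unavailable, and a bare `0` does not match the CIGAR pattern. The sketch below shows what a record for an unaligned read could look like; the helper name is hypothetical and this is not the code changed in the commit.

```python
def unaligned_sam_fields(read_name, seq, qual):
    """Return the 11 mandatory SAM fields for an unmapped read (sketch)."""
    return [
        read_name,
        "4",   # FLAG: 0x4 marks the read as unmapped
        "*",   # RNAME: no reference sequence
        "0",   # POS: unavailable
        "0",   # MAPQ
        "*",   # CIGAR: '*' means unavailable; "0" is not a valid CIGAR
        "*",   # RNEXT
        "0",   # PNEXT
        "0",   # TLEN
        seq,
        qual,
    ]


# Example: build a tab-separated SAM line for one unaligned read.
line = "\t".join(unaligned_sam_fields("read1", "ACGT", "IIII"))
```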
This change checks to see if a bam file was input, and if so it doesn't try to
remove any intermediate files because there aren't any.

Co-authored-by: Cole Lyman <[email protected]>
pinellolab#274)

I have suffered enough trying to debug my installation, so hopefully this helps
someone else.

Co-authored-by: Cole Lyman <[email protected]>
In the most recent version of numpy (1.24) some of the types have been
deprecated. This commit fixes these errors.
* Fixing documentation to match pooled headers

* Header removal bug fix change documentation to guide_seq

* Update documentation and help feature for CRISPRessoPooled

* Remove extra newlines from CRISPRessoPooled -h

* Make variable names as clear as my firstborn child's name

* Update one more variable name

Co-authored-by: Samuel Nichols <[email protected]>
* Implement logging handler to overwrite the latest log status to file

* Add StatusHandler to CRISPRessoCORE log

This will take the latest log output and write it to a file (`status.txt`), the
catch being that with each log the file is overwritten so that one can easily
tell where CRISPResso currently is and what the error is (if any). These changes
include some slight refactoring in order to accommodate any potential parameter
exceptions.
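
A handler of this kind can be sketched with the standard `logging` module as below. The class name `StatusHandler` comes from the commit message, but the body is an assumption about how such a handler might work, not the implementation in this PR.

```python
import logging


class StatusHandler(logging.Handler):
    """Overwrite a status file with the most recent log message (sketch)."""

    def __init__(self, filename):
        super().__init__()
        self.filename = filename

    def emit(self, record):
        # Mode "w" truncates the file on every call, so only the latest
        # message is ever present in the status file.
        with open(self.filename, "w") as handle:
            handle.write(self.format(record) + "\n")
```

Attached with `logger.addHandler(StatusHandler("status.txt"))`, each new log call replaces the previous contents of the file, so reading it shows only where the run currently is.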

* Add StatusHandler to CRISPRessoBatch and refactor `logger.warn` to `warn`

* Add StatusHandler to CRISPRessoPooled and a little refactoring

* Implement `percent_complete` to the status log

* Add StatusHandler to CRISPRessoAggregate log

* Add StatusHandler to CRISPRessoCompare log

* Add StatusHandler to CRISPRessoPooledWGSCompare log

* Add StatusHandler to CRISPRessoWGS log

* Rename `status.txt` to `CRISPResso_status.txt`

* Modify status log names to match the tool they are generated from

* Add percent_complete stages to CRISPRessoCORE

These also include log statements of each plot that is being generated as well
as fixing some variable name collisions with `ind`.

* Format the percentage in the log to be 2 decimal places

* Change all plotting logs from `info` to `debug` and simplify progress

This refactors how the progress of the plots is calculated, making it much
simpler. Before this change we would have had to keep track of the number of
times `percent_complete` was output, but now it simply updates the percent
complete after each amplicon is finished processing. Hopefully this will make
things easier to maintain even though it will be a little less "accurate" (not
sure how accurate the original implementation was...).
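
The per-amplicon update described here can be sketched as follows. The function name, the message text, and the even split of progress across amplicons are assumptions for illustration; the two-decimal formatting follows the earlier commit in this list.

```python
def amplicon_progress(amplicon_names, start_pct=0.0, end_pct=100.0):
    """Yield (name, message) once per amplicon, spreading progress evenly.

    start_pct and end_pct bound the share of the overall run that the
    plotting stage is assumed to occupy.
    """
    total = len(amplicon_names)
    for done, name in enumerate(amplicon_names, start=1):
        pct = start_pct + (end_pct - start_pct) * done / total
        # Format the percentage to two decimal places.
        yield name, "{:.2f}% complete".format(pct)
```

No counter of emitted messages is needed: the percentage depends only on how many amplicons have finished.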

* Implemented shared console log handler across all CRISPResso* calls

This allows for easy changes to logging formatting, which was inspired by having
to change the default logging level. The default logging level needs to be set
at `logging.DEBUG` in order for the debug log statements to not be ignored for
the running and status logs.
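
The interaction between logger level and handler level can be sketched like this. The helper names and the format string are hypothetical; only the behavior of the standard `logging` module is relied on.

```python
import logging
import sys


def make_console_handler(level=logging.INFO):
    """Build one console handler that every tool can share (sketch)."""
    handler = logging.StreamHandler(sys.stderr)
    handler.setLevel(level)
    handler.setFormatter(
        logging.Formatter("%(levelname)s @ %(asctime)s: %(message)s")
    )
    return handler


def configure_logger(name, console_level=logging.INFO):
    logger = logging.getLogger(name)
    # The logger itself must accept DEBUG records. Otherwise they are
    # dropped before any handler, such as a file or status handler, sees them.
    logger.setLevel(logging.DEBUG)
    logger.addHandler(make_console_handler(console_level))
    return logger
```

The console handler then filters what is printed, while handlers with a lower threshold still receive the debug records.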

* Add ability to set the verbosity level to each CRISPResso* tool

This allows users to set a verbosity level between 1 and 4 using the
`-v`/`--verbosity` CLI parameter. If the `--debug` flag is present, then the
level will default to 4, being the most verbose.
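
A minimal sketch of such a parameter with `argparse` is shown below. The mapping from verbosity numbers to logging levels and the default of 3 are assumptions; the commit message only states that levels run from 1 to 4 and that `--debug` selects 4.

```python
import argparse
import logging

# Hypothetical mapping: 1 is the quietest, 4 the most verbose.
VERBOSITY_TO_LEVEL = {
    1: logging.ERROR,
    2: logging.WARNING,
    3: logging.INFO,
    4: logging.DEBUG,
}


def parse_log_level(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("-v", "--verbosity", type=int,
                        choices=[1, 2, 3, 4], default=3)
    parser.add_argument("--debug", action="store_true")
    args = parser.parse_args(argv)
    # --debug overrides any explicit verbosity and selects the most verbose level.
    verbosity = 4 if args.debug else args.verbosity
    return VERBOSITY_TO_LEVEL[verbosity]
```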

* Implement showing the last seen `percent_complete` when none is provided
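
One possible way to carry the last seen value forward is a `logging.Filter` that remembers it. This is a sketch under assumptions: the class name is hypothetical, and passing `percent_complete` through the `extra` argument is an illustrative choice, not necessarily what the PR does.

```python
import logging


class LastPercentFilter(logging.Filter):
    """Fill in percent_complete from the last record that carried one."""

    def __init__(self):
        super().__init__()
        self.last_percent = 0.0

    def filter(self, record):
        pct = getattr(record, "percent_complete", None)
        if pct is None:
            # No value given with this message: reuse the last seen one.
            record.percent_complete = self.last_percent
        else:
            self.last_percent = pct
        return True  # never drop the record, only annotate it
```

A caller would log with `logger.info("Aligning", extra={"percent_complete": 40.0})`, and later messages without `extra` would still report 40.0.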

* Keep track of and log when multiple parallel runs are completed

These changes modify `CRISPRessoMultiProcessing.run_crispresso_cmds` such that
we can now display when a run is completed. This potentially breaks how
signals and interrupts are handled with multiple runs happening, but this needs
to be reviewed.
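
Reporting each run as it finishes can be sketched with `concurrent.futures`, as below. This is not the code in `CRISPRessoMultiProcessing`: the function name, the use of threads, and the `runner` and `report` parameters are assumptions, and the sketch does not address the signal handling concern raised above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_cmds_parallel(cmds, n_workers=2, runner=None, report=print):
    """Run shell commands in parallel and report each one as it finishes."""
    if runner is None:
        def runner(cmd):
            return subprocess.call(cmd, shell=True)
    return_codes = [None] * len(cmds)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Remember which command each future belongs to.
        futures = {pool.submit(runner, cmd): idx for idx, cmd in enumerate(cmds)}
        # as_completed yields futures in the order they finish.
        for n_done, future in enumerate(as_completed(futures), start=1):
            return_codes[futures[future]] = future.result()
            report("Finished {}/{} runs".format(n_done, len(cmds)))
    return return_codes
```

Return codes are stored in the order the commands were given, even though the completion messages arrive in finishing order.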

* Add debug and percentage complete to CRISPRessoBatch

* Add percent complete to CRISPRessoPooled

* Add debug and percent_complete message to CRISPRessoAggregate

* Add `percent_complete` to CRISPRessoCompare

* Add `percent_complete` to CRISPRessoPooledWGSCompare

* Add status and `percent_complete` to CRISPRessoMeta

* Add `verbosity` arguments to CRISPRessoCompare and CRISPRessoPooledWGSCompare

* Fixing documentation to match pooled headers

* Header removal bug fix change documentation to guide_seq

* Update documentation and help feature for CRISPRessoPooled

* Remove extra newlines from CRISPRessoPooled -h

* Make variable names as clear as my firstborn child's name

* Update one more variable name

* Fix bug to flow CRISPRessoPooled options to sub command

* Make amplicon file args variable name clear

* Update how parameters are set and retrieved from parameter object

The refactor in the previous commit changed the type of the arguments to a
dictionary which doesn't have the parameters as attributes, and this commit
fixes that error.
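
The kind of error described here arises because `args.name` works on an `argparse.Namespace` but raises `AttributeError` on a plain dictionary. A small hypothetical helper that tolerates both is sketched below; it illustrates the distinction and is not the fix applied in the commit.

```python
import argparse


def get_param(params, name, default=None):
    """Fetch a parameter from either an argparse.Namespace or a dict."""
    if isinstance(params, dict):
        # Dictionaries expose parameters as keys.
        return params.get(name, default)
    # Namespace objects expose parameters as attributes.
    return getattr(params, name, default)


# vars() converts a Namespace into the equivalent dictionary.
args = argparse.Namespace(name="sample1", n_processes=4)
as_dict = vars(args)
```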

* Add note in output header for change in default CRISPRessoPooled

In the next release (2.3.0) the `--demultiplex_only_at_amplicons` will be the
default when running in mixed-mode. This is to allow for inexact alignments of
the reads and the amplicons to the genome. For more context, see this issue
pinellolab#276

* Clarify the verbosity parameter help message

* Separate out parameters to `normalize_name` in CRISPRessoCORE

* Separate out parameters to `normalize_name` in CRISPRessoWGS

* Separate out parameters to `normalize_name` in CRISPRessoPooled

* Separate out parameters to `normalize_name` in CRISPRessoCompare

* Fix bug in CRISPRessoPooled by replacing `database_id` with `normalize_name`

* Refactor `run_crispresso_cmds` to not require a `logger`

This commit implements the functionality to make the `logger` object optional by
seeing which module called the `run_crispresso_cmds` function and obtaining the
correct object from that module name.

The function also immediately returns when no commands are passed to it.
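
Looking up the caller's module to choose a logger can be sketched with `inspect`, as below. The function name and return value are hypothetical, and the real `run_crispresso_cmds` may resolve the logger differently.

```python
import inspect
import logging


def run_cmds_optional_logger(cmds, logger=None):
    """Log each command, deriving the logger from the caller if none is given."""
    if not cmds:
        # Nothing to run, so return immediately.
        return None
    if logger is None:
        # Inspect the frame one level up the stack to find the calling module.
        caller_frame = inspect.stack()[1].frame
        module_name = caller_frame.f_globals.get("__name__", "root")
        logger = logging.getLogger(module_name)
    for cmd in cmds:
        logger.debug("Running: %s", cmd)
    return logger
```

Because `logging.getLogger` returns the same object for the same name, a tool that already configured a logger under its module name gets that logger back here.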

* Add amplicon name to plotting debug statements in CRISPRessoCORE

---------

Co-authored-by: Cole Lyman <[email protected]>
Co-authored-by: Samuel Nichols <[email protected]>
…un if one

fails.

Use `conda install -c conda-forge pytest-check` to install the dependencies
9 participants