Substitutions and N's outside quantification window counted in alleles frequency table #356

Open
GreenSeaBug opened this issue Nov 29, 2023 · 8 comments

@GreenSeaBug

GreenSeaBug commented Nov 29, 2023

Hello,

I have -w 1 and -wc -3, yet substitutions and N's outside the quantification window seem to be counted as edited in the allele frequency table. For example, in the image below, none of the sequences have indels within the 2 bp quantification window, and the substitutions / N's are at least 19 bp away from the cut site...

image

What is going on here? Is there a way to exclude those reads from the analysis?

Everything else in the analysis worked as expected.

Thanks for your help.

@kclem
Member

kclem commented Nov 29, 2023

Hi @GreenSeaBug,

Thanks for using CRISPResso, and sorry about the confusion with the allele display of N's and substitutions.

The allele plot will show substitutions and N's outside of the quantification window as different alleles, but they won't make the corresponding reads count as 'modified' or 'edited'. The allele plot is only for visualization, and if we were to collapse substitutions or N's to only show a single unedited allele it would not be an accurate representation of the data.

If you open the text file associated with the allele plot (e.g. Alleles_frequency_table_around_sgRNA_GAG...txt), you will see a table whose rows should correspond to the alleles in your allele plot. This table includes a column 'Unedited', which is set to 'False' for reads that are 'Modified'. Rows that contain N's or substitutions only outside the quantification window should be set to 'True', meaning that although the allele's sequence differs from the reference sequence, the read is not classified as 'modified'.
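As a rough way to check this, the sketch below sums the read percentages by the value of the 'Unedited' column. It assumes the table is tab-separated and has columns named `Unedited` and `%Reads`; the rows in `SAMPLE` are made-up illustrative values, not data from this issue, and `summarize` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical stand-in for a few rows of the allele table (tab-separated).
# Column names are assumed; check them against the header of your own file.
SAMPLE = (
    "Aligned_Sequence\tReference_Sequence\tUnedited\t#Reads\t%Reads\n"
    "ACTGAG\tACTGAG\tTrue\t800\t80.0\n"
    "NCTGAG\tACTGAG\tTrue\t120\t12.0\n"
    "AC--AG\tACTGAG\tFalse\t80\t8.0\n"
)

def summarize(handle):
    """Sum %Reads separately for rows flagged Unedited=True and Unedited=False."""
    totals = {"True": 0.0, "False": 0.0}
    for row in csv.DictReader(handle, delimiter="\t"):
        totals[row["Unedited"]] += float(row["%Reads"])
    return totals

totals = summarize(io.StringIO(SAMPLE))
print(totals)
```

To run it on a real table, pass an open file handle instead of the `io.StringIO` wrapper.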

If you'd like, you can mark all the unmodified alleles in the plot with an annotation string using the parameter --annotate_wildtype_allele, for example --annotate_wildtype_allele '**'.

If you still think there is a problem, could you upload the allele table and provide the command you used to run CRISPResso, as well as the alleles you believe are problematic?

@GreenSeaBug
Author

GreenSeaBug commented Nov 30, 2023

Thank you for the reply. That all makes sense. However, it seems the data in the .txt file do not match what is shown in the alleles visualisation plot.

For example, the table below says that 83.68% of reads are edited with a -2 bp deletion, but the visualisation plot shows 88.86% as unedited (perfectly matching the reference in the quantification window).

image

Does this seem strange or am I missing something obvious?

Also, is there any way to exclude from the analysis reads with substitutions or N's outside the quantification window?

@kclem
Member

kclem commented Nov 30, 2023

The mismatch of numbers (88.86% vs 83.86%) arises because alleles with the same visible sequence are collapsed into a single allele for plotting. That is, 88.86% of reads have the sequence shown in the plot, but those alleles could not be collapsed in the table because they have differences (SNPs or N's) outside the plotting window.

For example, imagine a sample with the following reads in the allele frequency table:

ACTGAG - 80%
TCTGAG - 12%
AC-GAG - 8%

If the plotting window were the 2nd to 5th bases, the first two alleles would be collapsed so the alleles plotted would be:

CTGA - 92%
C-GA - 8%
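The collapsing in this toy example can be sketched in a few lines of Python. This is only an illustration of the idea, not CRISPResso's plotting code; `collapse_alleles` is a hypothetical helper, and the window is given as a 0-based, end-exclusive slice (so the 2nd to 5th bases are `1:5`).

```python
from collections import defaultdict

def collapse_alleles(alleles, start, end):
    """Trim each aligned sequence to [start, end) and sum the percentages
    of alleles that become identical after trimming."""
    collapsed = defaultdict(float)
    for seq, pct in alleles:
        collapsed[seq[start:end]] += pct
    return dict(collapsed)

# The three alleles from the example above.
table = [("ACTGAG", 80.0), ("TCTGAG", 12.0), ("AC-GAG", 8.0)]

# Plotting window = 2nd to 5th bases.
plotted = collapse_alleles(table, 1, 5)
print(plotted)  # {'CTGA': 92.0, 'C-GA': 8.0}
```

The first two alleles differ only at the first base, which lies outside the window, so they merge into a single plotted allele.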

I'm not sure what you mean by excluding the reads with substitutions or N's. Do you mean that they would be collapsed in the allele plots, so that the N or substitution would visually be replaced by the base in the reference sequence? If so, I'd be wary of doing that because it doesn't represent the underlying data.

If you want substitutions or N's not to make a read count as 'Modified', you can use the flag --ignore_substitutions.

@GreenSeaBug
Author

OK, that makes sense and explains the discrepancy in percentages. However, I don't think it explains why the table says those 83.86% are edited with a -2 bp deletion, while the plot says those 88.86% are unedited WT. What do you think?

As for excluding reads with substitutions or N's, no I am not wanting to collapse those reads in the allele plot to visually replace the substitutions and N's. I agree that would not be a good idea. Nor am I wanting to prevent reads with substitutions or N's within the quantification window from being classified as modified. Rather, I am wondering if it is possible to exclude these reads from the analysis entirely, that is, filter them out. In my case, and I would think in a lot of cases, they are just sequencing errors or reads derived from chimeric amplicons that are an artefact of PCR.

@kclem
Member

kclem commented Nov 30, 2023

I assumed the plot showing 88% unedited was away from your quantification window - is that not the case?

You can exclude reads with N by filtering them out before CRISPResso analysis, and passing CRISPResso your filtered reads. Here's a script to filter reads based on the presence of a specific sequence: filterReadsOnSequencePresence.py - try running with --exclude_seq N.
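For readers who want to see what such pre-filtering involves, here is a minimal sketch. It is not the linked filterReadsOnSequencePresence.py script; `filter_fastq` is a hypothetical helper that assumes a plain, uncompressed FASTQ with exactly four lines per record, and the two reads in `reads` are made up for illustration.

```python
import io

def filter_fastq(in_handle, out_handle, exclude_seq="N"):
    """Copy FASTQ records to out_handle, skipping any record whose sequence
    line contains exclude_seq. Returns (kept, dropped) record counts."""
    kept = dropped = 0
    while True:
        record = [in_handle.readline() for _ in range(4)]
        if not record[0]:  # end of file
            break
        if exclude_seq in record[1]:  # record[1] is the sequence line
            dropped += 1
        else:
            out_handle.writelines(record)
            kept += 1
    return kept, dropped

# Two made-up reads: the second contains an N and is dropped.
reads = io.StringIO("@r1\nACTGAG\n+\nIIIIII\n@r2\nACTNAG\n+\nIIIIII\n")
filtered = io.StringIO()
counts = filter_fastq(reads, filtered)
print(counts)
```

Note that this drops a read if an N appears anywhere in it, including inside the quantification window. The filtered file would then be passed to CRISPResso in place of the original reads.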

@GreenSeaBug
Author

GreenSeaBug commented Nov 30, 2023

Yes, the part of the plot that I showed is away from the quantification window. Here is the quantification window...

image

As you can see, nothing is modified. So I don't understand why the table says almost everything is modified (mostly -2 bp deletion).

Thank you for the script for the N's! Is there a way to also exclude reads with substitutions outside the quantification window?

@kclem
Member

kclem commented Dec 1, 2023

Is the plot above the entire quantification window? If you look at the entire quantification window you should be able to see the 2bp deletion. If you'd prefer not to post here you can email me at [email protected].

For filtering, if you run with '--write_detailed_allele_table', CRISPResso will add an "all_substitution_positions" column to the 'Alleles_frequency_table.zip' file. You can then keep only the alleles where this column is empty ("[]").
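A minimal sketch of that filter is shown below. It assumes the detailed table is tab-separated once extracted from the zip; only the "all_substitution_positions" column name comes from the comment above, while the other column names, the rows in `DETAILED`, and the helper `alleles_without_substitutions` are hypothetical.

```python
import csv
import io

# Hypothetical stand-in for the extracted detailed allele table.
DETAILED = (
    "Aligned_Sequence\tall_substitution_positions\t#Reads\n"
    "ACTGAG\t[]\t800\n"
    "TCTGAG\t[0]\t120\n"
    "AC-GAG\t[]\t80\n"
)

def alleles_without_substitutions(handle):
    """Return the rows whose all_substitution_positions column is empty ('[]')."""
    return [
        row
        for row in csv.DictReader(handle, delimiter="\t")
        if row["all_substitution_positions"].strip() == "[]"
    ]

kept = alleles_without_substitutions(io.StringIO(DETAILED))
print([row["Aligned_Sequence"] for row in kept])
```

This removes alleles with a substitution at any position, so percentages would need to be recomputed over the remaining reads.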

@GreenSeaBug
Author

The plot above includes more than the entire quantification window. I have -w set to the default, 1. So the quantification window is 2 bp. As you can see there are no edits either side of the quantification window centre. There is no -2 bp deletion. So it seems like a complete mismatch with the allele frequency table .txt file.

OK thank you for the tip on substitutions.
