Find gene connections [ORF+CRISPR both] to pursue: exploration for MorphMap paper #7
Comments
Adding to the above data, I am attaching the Excel file with an additional sheet that also includes CRISPR similarity (edited: see the file linked in #7 (comment)). Several columns have been added, primarily:
All added columns have a short explanation above the column name. First-look insights (pairs with significant correlation in both ORF and CRISPR and without strong KG evidence): edited, see the top pairs in #7 (comment). |
Awesome! For your filter for "strong similarity in both CRISPR and ORF" - did you require that they are in the same direction, or did you take the absolute value? I see all of these pairs are positive correlations for both (except one pair that is negative for both), so I wondered whether your filtering would have allowed a strong negative in one and a strong positive in the other to come through. It would be great to see a heatmap of the correlations among this set of ~15 genes for CRISPR and another heatmap for ORF, because it appears they mostly fall into a few blobs rather than forming 15 very independent relationships. |
In fact, I did allow for any direction of relationship. Specifically, I took: and afterwards filtered against knowledge graphs (assuming that the average of the [CC,MF,PT,BP] KG scores needs to be below 0.4). It seems that most of the pairs with inconsistent direction between CRISPR and ORF (~30-40% of all pairs in the table) are filtered out by the KG condition. If we were to take only the top 100 pairs with respect to absolute correlations, we would get the following results (only those with inconsistent direction; the others are similar): To be precise: I needed to edit the previous post and table and add two rows because I had omitted the filtering with respect to the MF-based KG score. So there is now one pair with inconsistent directions in the previous procedure. Heatmaps - in progress. |
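For readers who want to follow the selection logic described in this comment, here is a minimal Python sketch. It assumes each gene pair carries an ORF similarity, a CRISPR similarity, and the four KG scores (CC, MF, PT, BP). The 0.4 cutoff on the mean KG score comes from the comment above; the similarity cutoff `SIM_THRESHOLD`, the `GenePair` class, and the gene names are hypothetical placeholders, not values from the actual analysis.

```python
from dataclasses import dataclass
from statistics import mean

KG_COLUMNS = ("CC", "MF", "PT", "BP")
KG_THRESHOLD = 0.4    # from the thread: mean KG score must be below this
SIM_THRESHOLD = 0.5   # hypothetical: the real cutoff is not stated in the thread


@dataclass
class GenePair:
    gene_a: str
    gene_b: str
    orf_sim: float      # similarity of the two genes' ORF profiles
    crispr_sim: float   # similarity of the two genes' CRISPR profiles
    kg_scores: dict     # one score per entry of KG_COLUMNS


def mean_kg(pair):
    """Average of the four KG scores for one pair."""
    return mean(pair.kg_scores[c] for c in KG_COLUMNS)


def same_direction(pair):
    """True when ORF and CRISPR similarities share the same sign."""
    return (pair.orf_sim > 0) == (pair.crispr_sim > 0)


def select_pairs(pairs, sim_threshold=SIM_THRESHOLD, kg_threshold=KG_THRESHOLD):
    """Keep pairs that are strong (in absolute value) in both ORF and CRISPR
    and weakly supported by the knowledge graph."""
    return [
        p for p in pairs
        if abs(p.orf_sim) >= sim_threshold
        and abs(p.crispr_sim) >= sim_threshold
        and mean_kg(p) < kg_threshold
    ]


# Toy data with placeholder gene names.
low_kg = {"CC": 0.1, "MF": 0.2, "PT": 0.3, "BP": 0.2}
high_kg = {"CC": 0.9, "MF": 0.8, "PT": 0.7, "BP": 0.9}
pairs = [
    GenePair("G1", "G2", 0.8, 0.7, low_kg),    # kept, consistent direction
    GenePair("G1", "G3", -0.9, 0.6, low_kg),   # kept, inconsistent direction
    GenePair("G2", "G3", 0.9, 0.9, high_kg),   # dropped: well known in the KG
    GenePair("G3", "G4", 0.2, 0.9, low_kg),    # dropped: weak in ORF
]
selected = select_pairs(pairs)
for p in selected:
    print(p.gene_a, p.gene_b, "same direction:", same_direction(p))
```

Because the filter uses absolute values, a pair that is strongly negative in one dataset and strongly positive in the other passes the similarity step; the `same_direction` flag makes such pairs easy to spot afterwards.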
Heatmaps corresponding to the table: Edited: see heatmaps in #7 (comment) I see only two repeating clusters:
|
Anne will examine these two plots and choose gene pairs to experimentally followup w collaborators. |
I have updated the above plots. Unfortunately, I do not have the full KG data for all pairs - only for the top ones - as in the original excel files. So, the annotations are scarce. Alternatively I can plot the average KG score instead of letters. |
Ah, ok, I will ask Evotec if they can provide that, although maybe we only need this for our own exploration and it isn't necessary for the paper and what we have is enough for exploration. I will think about this when I dive into looking at these connections. Thanks! |
(I've asked. BTW, it would be even better to show the actual value (average of KG columns) on the heatmap so we have a sense of the strength of the scores.) |
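One way to prepare the values requested above is to build a symmetric gene-by-gene matrix of the mean KG score, leaving cells empty where no KG data is available for a pair. The sketch below is only an illustration under that assumption: the function name `kg_matrix`, the input layout, and the gene names are hypothetical, and the plotting step itself is left out.

```python
from statistics import mean

KG_COLUMNS = ("CC", "MF", "PT", "BP")


def kg_matrix(genes, kg_rows):
    """Return a symmetric list-of-lists of mean KG scores.

    genes:   ordered list of gene names (heatmap row/column order)
    kg_rows: dict mapping (gene_a, gene_b) -> dict with one score per KG column
    Cells without KG data stay None so they can be left blank on the heatmap.
    """
    index = {g: i for i, g in enumerate(genes)}
    n = len(genes)
    matrix = [[None] * n for _ in range(n)]
    for (a, b), scores in kg_rows.items():
        if a not in index or b not in index:
            continue  # pair involves a gene that is not on this heatmap
        value = round(mean(scores[c] for c in KG_COLUMNS), 2)
        matrix[index[a]][index[b]] = value
        matrix[index[b]][index[a]] = value  # KG score is treated as symmetric
    return matrix


# Toy example with placeholder gene names; only one pair has KG data.
genes = ["G1", "G2", "G3"]
kg_rows = {("G1", "G2"): {"CC": 0.25, "MF": 0.5, "PT": 0.75, "BP": 0.5}}
matrix = kg_matrix(genes, kg_rows)
for gene, row in zip(genes, matrix):
    print(gene, row)
```

The resulting matrix can be passed as cell annotations to whatever heatmap routine is in use, with `None` cells rendered as blanks.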
Sorry for asking, but is this Evotec KG different from the stringDB PPI data? If not, I have some code to get values from a list of genes, just in case it is needed. |
Thanks for offering! But indeed the Evotec KG is very different: it combines many sources of info (including PPI but also others). |
From Andrey Zinovyev of Evotec: Hi Anne, Thank you for this information, very exciting to see the progress along several lines! Here is a folder with some materials that I hope can address most of your requests. It contains:
Please note that we decided to change the functional scoring of KG relations from the L2 percentile-based to Pearson correlation, it appears to be more interpretable in the end but does not strongly affect the gene pair selection. Also of note, so far there is no confidential data used in this work, all is based on publicly available knowledge graph analysis. If any explanations will be needed, we will be happy to connect via email or a call. Andrei |
@AnneCarpenter I will take care of it and the full plots today - sorry, the last two days were crazy busy. @cyrenaique regarding stringDB - in fact, I have already merged it into the Excel file shared in comment #7 (comment) (last column). But the Evotec KG is much more comprehensive/sensitive. Edit: in progress, trying to figure out inconsistencies with the previous list together with Niranji. |
Yes - I can elaborate on the Evotec knowledge graph: They take existing annotated sources (biological processes, pathways, molecular functions) as ground truth to train the graph (which is based on lots of underlying data sources) to properly predict those connections. |
Thanks, Anne, for the clarification. |
The above connections were filtered as being strong in both ORF and CRISPR. For ORF or CRISPR connections, we move to new issues: #11 for ORFs and soon a new one for CRISPRs when he's ready. I think we should pursue the two clusters that @tjetkaARD noted above - these have strong (+/-) correlation in both ORF and CRISPR but are not (completely) strongly connected in the KG so I am making new issues for these: (this issue can be closed as soon as @tjetkaARD makes the new issue for CRISPR-only connections) |
Unfortunately, we need another iteration for this issue. There have been two relevant changes for the final output:
Fortunately, this does not impact the qualitative conclusions much (see the last section). However, to clean everything up and avoid any confusion, I will edit all the comments above to link to the confirmed and most recent results below.
Methodology
Summary
In terms of intersected ORF/CRISPR replicable genes:
Plots of CRISPR vs. ORF similarities & distribution of KG mean score (Q-value replicable ORFs vs. Q-value replicable CRISPR)
The above procedure gives:
Plots of CRISPR vs. ORF similarities, with the top unknown pairs (strongest signal in both ORF and CRISPR) annotated
Source files:
Heatmaps
The values in the squares indicate the average KG score.
Conclusions
|
Thanks for all this analysis! I think it will help to discuss the methodology and rationale when we are together. I want to summarize that I think all 3 of these are interesting:
In each case, we don't want to pay attention to genes that do not 'have a phenotype' (ie are not replicable). In each case, we will want some examples that are well-known (high KG) and some that are novel (low KG), but emphasizing the latter for now because they are harder to find and will take time to followup with biology experiments. So you think we should pause work on #15 #16 #17 until after we meet? |
Following this! I have started compiling information for the previously defined gene clusters, and from scanning the updated info it looks like some of it will still be useful, but I'll wait for confirmation before continuing. |
@AnneCarpenter - accounting for the comments I have added to each specific issue #15 #16 #17 I think it is safe to proceed. |
I'm afraid that I've gotten quite confused! I'll try and summarize what I do and don't understand.
Just want to clarify all of this before diving into databases/literature. Thanks in advance! |
blue is positive, red is negative correlation. See #20 for our basic protocol to make the heatmaps. I believe #15 #16 #17 have all come from this issue (where a signal was seen in ORF+CRISPR data, both) but it is possible that in some cases after revising our analysis one or the other result 'fell apart' (this may also happen with the chromosome arm correction currently happening for CRISPR data). |
Probably the actual task is finished in this issue (find clusters interesting in both ORF + CRISPR datasets) because we expect the clusters we already found to remain. But leaving it open and assigning @zahrahanifehlou so that when the chromosome arm corrections are done, Tomasz can make the 'final' versions of these heatmaps. |
(though please see my note on https://github.com/jump-cellpainting/morphmap/issues/162 before proceeding to use them) |
Separate issues were created for the new connections in this issue and those were included in the morphmap paper |
Here is the updated list they provided Dec 15 2023.
We are generally interested to find gene pairs where:
We already found many connections between SLC and OR gene families and will pursue those (#6) but we would like more.
Here is the email thread in Anne's email "Broad/Evotec collaboration on MorphMap & knowledge graphs" https://mail.google.com/mail/u/0/#inbox/FMfcgzGtwDFhnZvdwPWGMLLbdpgNZBSM with excel file
Here are the meeting notes: https://docs.google.com/document/d/1iIwJ1V5ig8KtTD7P0vV-GH16f-AvprqU/edit
MorphMap_gene_gene_scoring_data.xlsx