Find gene connections [ORF+CRISPR both] to pursue: exploration for MorphMap paper #7

AnneCarpenter · 2023-12-15T21:11:21Z

Here is the updated list they provided Dec 15 2023.

We are generally interested to find gene pairs where:

there is a strong Cell Painting ORF pos or neg correlation (column E: ORF_similarity_abs)
there is NOT a strong knowledge graph connection (there are 4 KG-only models, columns F-I)
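As a rough sketch, the two criteria above could be applied as follows. Only `ORF_similarity_abs` is a column name from the spreadsheet; the cutoffs, the `kg` field holding the four KG-only model scores, and the toy rows are illustrative assumptions, not values anyone in this thread has agreed on.

```python
# Toy rows standing in for the spreadsheet; all numbers are made up.
# "kg" holds the four KG-only model scores (columns F-I) for the pair.
pairs = [
    {"gene_1": "A", "gene_2": "B", "ORF_similarity_abs": 0.82, "kg": [0.10, 0.05, 0.20, 0.15]},
    {"gene_1": "A", "gene_2": "C", "ORF_similarity_abs": 0.75, "kg": [0.90, 0.70, 0.85, 0.60]},
    {"gene_1": "B", "gene_2": "D", "ORF_similarity_abs": 0.12, "kg": [0.05, 0.02, 0.10, 0.08]},
]

ORF_MIN = 0.4  # assumed cutoff for a "strong" absolute ORF correlation
KG_MAX = 0.3   # assumed cutoff below which a KG score counts as "not strong"


def is_candidate(row, orf_min=ORF_MIN, kg_max=KG_MAX):
    """Strong morphological similarity, and no KG model scoring the pair highly."""
    strong_orf = row["ORF_similarity_abs"] >= orf_min
    weak_kg = max(row["kg"]) < kg_max  # every KG model must be below the cutoff
    return strong_orf and weak_kg


candidates = [(r["gene_1"], r["gene_2"]) for r in pairs if is_candidate(r)]
print(candidates)
```

Requiring every KG model to be below the cutoff is the strictest reading of "NOT a strong knowledge graph connection"; using the average instead would let more pairs through.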

We already found many connections between SLC and OR gene families and will pursue those (#6) but we would like more.

Here is the email thread in Anne's email "Broad/Evotec collaboration on MorphMap & knowledge graphs" https://mail.google.com/mail/u/0/#inbox/FMfcgzGtwDFhnZvdwPWGMLLbdpgNZBSM with excel file

Here are the meeting notes: https://docs.google.com/document/d/1iIwJ1V5ig8KtTD7P0vV-GH16f-AvprqU/edit

MorphMap_gene_gene_scoring_data.xlsx

tjetkaARD · 2024-01-05T17:36:03Z

In addition to the data above, I am attaching the Excel file with an additional sheet that also includes CRISPR similarity: edited - see file in link #7 (comment)

There are several columns added, primarily:

CRISPR_similarity: the value of cosine similarity between CP profiles
crispr_status: specifying the status of the gene pair in the dataset (replicable / not replicable / not present)
coexpression in RNA-seq data
correlation from pooled CRISPR KOs (DepMap)
strength in STRINGdb knowledge graph
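For reference, a minimal sketch of the cosine similarity used for `CRISPR_similarity`. The function name and the toy vectors are hypothetical; real Cell Painting profiles have far more features.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length profile vectors."""
    if len(u) != len(v):
        raise ValueError("profiles must have the same number of features")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for an all-zero profile")
    return dot / (norm_u * norm_v)


# Made-up 3-feature profiles for two gene perturbations.
profile_a = [1.0, 2.0, 3.0]
profile_b = [-1.0, -2.0, -3.0]
print(cosine_similarity(profile_a, profile_b))
```

Values near +1 mean the two perturbations shift the features in the same direction, and values near -1 mean opposite directions, which is why both signs are of interest here.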

All added columns have a short explanation above the column name.

First look insights (pairs with significant correlation both in ORF and CRISPR and without strong KG evidence):

Edited: see the top pairs in comment : #7 (comment).

AnneCarpenter · 2024-01-05T21:37:01Z

Awesome! For your filter for "strong similarity in both CRISPR and ORF" - did you require that they are the same direction, or did you take absolute value? I see all of these pairs are positive correlations for both (except one pair is neg for both) so I wondered if your filtering would have allowed a strong neg in one and pos in the other to come through?

It would be great to see a heatmap of the correlations among this set of ~15 genes for CRISPR and another heatmap for ORF because it appears there are actually mostly falling into a few blobs rather than 15 very independent relationships.

tjetkaARD · 2024-01-07T04:13:54Z

> Awesome! For your filter for "strong similarity in both CRISPR and ORF" - did you require that they are the same direction, or did you take absolute value? I see all of these pairs are positive correlations for both (except one pair is neg for both) so I wondered if your filtering would have allowed a strong neg in one and pos in the other to come through?

In fact, I did allow for any direction of relationship. Specifically, I took:
edited: see methodology in #7 (comment)

and afterwards filtered against knowledge graphs (I assumed that the average of the [CC,MF,PT,BP] KG scores needs to be below 0.4). It seems that most of the pairs with inconsistent direction between CRISPR and ORF (~30-40% of all pairs in the table) are filtered out by the KG condition.
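A small sketch of the two checks discussed here: the KG filter (average of the CC, MF, PT, BP scores below 0.4) and whether the ORF and CRISPR similarities point in the same direction. The dictionary layout, function names and toy values are assumptions made for illustration.

```python
from statistics import mean

KG_KEYS = ("CC", "MF", "PT", "BP")


def same_direction(orf_sim, crispr_sim):
    """True when both similarities are positive or both are non-positive."""
    return (orf_sim > 0) == (crispr_sim > 0)


def passes_kg_filter(kg_scores, threshold=0.4):
    """Keep a pair only if its average KG score is below the threshold."""
    return mean(kg_scores[k] for k in KG_KEYS) < threshold


# Made-up pairs: (ORF similarity, CRISPR similarity, KG scores).
pairs = [
    {"pair": ("A", "B"), "orf": 0.60, "crispr": 0.50,
     "kg": {"CC": 0.1, "MF": 0.1, "PT": 0.1, "BP": 0.1}},
    {"pair": ("A", "C"), "orf": 0.70, "crispr": -0.60,
     "kg": {"CC": 0.9, "MF": 0.8, "PT": 0.7, "BP": 0.6}},
    {"pair": ("C", "D"), "orf": -0.50, "crispr": 0.55,
     "kg": {"CC": 0.2, "MF": 0.3, "PT": 0.1, "BP": 0.2}},
]

kept = [p for p in pairs if passes_kg_filter(p["kg"])]
inconsistent = [p["pair"] for p in kept if not same_direction(p["orf"], p["crispr"])]
print([p["pair"] for p in kept], inconsistent)
```

Because the direction check is kept separate from the filter, pairs with opposite signs in ORF and CRISPR are reported rather than silently dropped.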

If we were to take only the top 100 pairs with respect to absolute correlations, we would get the following results (only those with inconsistent direction are shown; the others are similar):
edited: see results in #7 (comment)

To be precise: I needed to edit the previous post and table and add two rows due to omission of filtering with respect to MF-based KG score. So there is now one pair with inconsistent directions in the previous procedure.

Heatmaps - in progress.

tjetkaARD · 2024-01-08T13:52:44Z

Heatmaps corresponding to the table:

Edited: see heatmaps in #7 (comment)

I see only two repeating clusters:

GPR176, TSC22D1, DPAGT1, CHRM4
ISOC2, ECH1, UQCRFS1, BCAT2, SARS2

AnneCarpenter · 2024-01-08T14:13:08Z

Anne will examine these two plots and choose gene pairs to follow up experimentally with collaborators.
@tjetkaARD will create exactly these plots but without the constraint that pairs be correlated in BOTH ORF and CRISPR. He will start a new thread with those and, at the least, those will be in the paper. Anne may also identify vignettes among those.

tjetkaARD · 2024-01-08T18:45:37Z

I have updated the above plots.

Unfortunately, I do not have the full KG data for all pairs - only for the top ones - as in the original excel files. So, the annotations are scarce. Alternatively I can plot the average KG score instead of letters.

AnneCarpenter · 2024-01-08T18:49:11Z

Ah, ok, I will ask Evotec if they can provide that, although maybe we only need this for our own exploration and it isn't necessary for the paper and what we have is enough for exploration. I will think about this when I dive into looking at these connections. Thanks!

AnneCarpenter · 2024-01-09T12:25:33Z

I've asked. BTW, it would be even better to show the actual value (average of the KG columns) on the heatmap so we have a sense of the strength of the scores.

cyrenaique · 2024-01-09T12:51:27Z

Sorry for asking, but is this Evotec KG different from the STRINGdb PPI data? If not, I have some code to get values for a list of genes, just in case it is needed.

AnneCarpenter · 2024-01-09T13:17:48Z

Thanks for offering! But indeed the Evotec KG is very different: it combines many sources of info (including PPI but also others).

AnneCarpenter · 2024-01-10T12:40:23Z

From Andrey Zinovyev of Evotec:

Hi Anne,

Thank you for this information, very exciting to see the progress along several lines!

Here is a folder with some materials that I hope can address most of your requests
https://drive.google.com/drive/folders/1kKqx5B9VJGq47yN03P7z8CikHOWyWMwg?usp=sharing

It contains :

All scores from KG models (orf_scores_merged.zip file) merged with the QC filtered ORF scores that Niranj sent to us on Monday. This merge does not contain the CRISPR-derived scores, but we can add them as well as other columns; however, I do not seem to have access to this GitHub issue (Find gene connections [ORF+CRISPR both] to pursue: exploration for MorphMap paper #7). Just in case, my nickname is ‘auranic’.
Top filtered links with large ORF scores and small KG scores (toplinks_unexplained.xls file), according to the unsupervised ‘Biological process’ model. For example, some of the CYP* connections I saw in the heatmaps in your PDF are indeed there.
Powerpoint presentation with some analysis results, including the analysis of SLC/OR pairs or genome-wide (but restricted to the QC filtered genes) scatterplots (ORF sim vs KG score). Also, pay attention to the network figures in the end where we highlight some of the clusters of “unexplained links” (including SLC/OR related but other with, for example, CYP* genes as hubs).

Please note that we decided to change the functional scoring of KG relations from the L2 percentile-based to Pearson correlation, it appears to be more interpretable in the end but does not strongly affect the gene pair selection. Also of note, so far there is no confidential data used in this work, all is based on publicly available knowledge graph analysis.

If any explanations will be needed, we will be happy to connect via email or a call.
Best regards,

Andrei

tjetkaARD · 2024-01-11T12:59:02Z

@AnneCarpenter I will take care of it and the full plots today - sorry, the last two days were crazy busy.

@cyrenaique regarding stringDb - in fact, I have already merged it within the excel shared in the comment #7 (comment) (last column). But the Evotec is much more comprehensive/sensitive.

Edit: in progress, trying to figure out inconsistencies with the previous list together with Niranj.

AnneCarpenter · 2024-01-11T13:02:20Z

Yes - I can elaborate on the Evotec knowledge graph: They take existing annotated sources (biological processes, pathways, molecular functions) as ground truth to train the graph (which is based on lots of underlying data sources) to properly predict those connections.

cyrenaique · 2024-01-11T13:24:05Z

Thanks, Anne, for the clarification.
https://pubmed.ncbi.nlm.nih.gov/36370105/
it seems that stringdb also updated 01/2023 their way of computing/predicting scores, interesting...

AnneCarpenter · 2024-01-18T17:27:30Z

The above connections were filtered as being strong in both ORF and CRISPR. For ORF-only or CRISPR-only connections, we move to new issues: #11 for ORFs and soon a new one for CRISPRs when @tjetkaARD is ready.

I think we should pursue the two clusters that @tjetkaARD noted above - these have strong (+/-) correlation in both ORF and CRISPR but are not (completely) strongly connected in the KG so I am making new issues for these:
GPR176, TSC22D1, DPAGT1, CHRM4: #15
ISOC2, ECH1, UQCRFS1, BCAT2, SARS2 #16

(this issue can be closed as soon as @tjetkaARD makes the new issue for CRISPR-only connections)

tjetkaARD · 2024-01-19T22:49:10Z

@AnneCarpenter

Unfortunately, we need another iteration for this issue. There have been two relevant changes affecting the final output:

The Knowledge Graph methodology changed
In the Excel file, shared here: Find gene connections [ORF+CRISPR both] to pursue: exploration for MorphMap paper #7 (comment), the ORFs are not filtered according to their replicability (only correlation strength)

Fortunately, it does not impact the qualitative conclusions much (see the last section). However, in order to clean up everything and not allow any confusion - I will edit all above comments linking to the confirmed and most recent results below.

Methodology

Replicability:

ORFs: only q-value replicable genes are included (based on https://github.com/jump-cellpainting/morphmap/blob/24839193460b9107e09bbf0480e50ee9faef4698/05.retrieve-orf-annotations/output/replicate-retrieval-mAP-transformed-inf-eff-filtered.csv.gz)
CRISPRs: I will present results separately for both q-value and p-value replicable genes (due to very small intersection between ORFs and CRISPRs)

Knowledge Graph filter: based on file shared in: Find gene connections [ORF+CRISPR both] to pursue: exploration for MorphMap paper #7 (comment), gene pairs are included if:

average of (gene_mf, gene_bp, gene_pathway) is below 0.5 AND max of (gene_mf, gene_bp, gene_pathway) is below 0.8
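The rule above can be sketched as a small predicate. The column names `gene_mf`, `gene_bp` and `gene_pathway` and both cutoffs come from the comment; the function name and toy rows are hypothetical, and pairs with missing KG scores are not handled here.

```python
from statistics import mean

KG_COLUMNS = ("gene_mf", "gene_bp", "gene_pathway")


def is_unexplained(row, mean_cutoff=0.5, max_cutoff=0.8):
    """True if the pair is weakly supported by the KG on average AND in every model."""
    scores = [row[col] for col in KG_COLUMNS]
    return mean(scores) < mean_cutoff and max(scores) < max_cutoff


# Made-up rows to show why both conditions are needed.
low_everywhere = {"gene_mf": 0.1, "gene_bp": 0.2, "gene_pathway": 0.3}
one_strong_model = {"gene_mf": 0.1, "gene_bp": 0.1, "gene_pathway": 0.85}
high_on_average = {"gene_mf": 0.6, "gene_bp": 0.6, "gene_pathway": 0.6}

for name, row in [("low_everywhere", low_everywhere),
                  ("one_strong_model", one_strong_model),
                  ("high_on_average", high_on_average)]:
    print(name, is_unexplained(row))
```

The `one_strong_model` row has a low average but is still rejected, which is the case the max condition is there to catch.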

Choice of top pairs:

Top 8 pairs according to sum of absolute orf/crispr similarities AND
Top 8 pairs according to sum of scaled absolute orf/crispr similarities AND
Top 4 pairs from each quadrant (to allow associations of different signs) according to sum of absolute orf/crispr similarities
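One possible reading of the selection above, as a sketch: the three criteria are combined as a union, and "scaled" is taken to mean min-max scaling of the absolute similarities. Both of these are assumptions, since the comment does not spell them out; the function names and toy rows are hypothetical.

```python
def top_n(rows, key, n):
    """Return the n rows with the largest key value."""
    return sorted(rows, key=key, reverse=True)[:n]


def minmax_scale(values):
    """Rescale values to [0, 1]; assumed meaning of 'scaled' in this sketch."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def select_top_pairs(rows, n_sum=8, n_scaled=8, n_quadrant=4):
    abs_orf = [abs(r["orf"]) for r in rows]
    abs_crispr = [abs(r["crispr"]) for r in rows]
    raw = {r["pair"]: a + c for r, a, c in zip(rows, abs_orf, abs_crispr)}
    scaled = {r["pair"]: a + c for r, a, c in
              zip(rows, minmax_scale(abs_orf), minmax_scale(abs_crispr))}

    selected = set()
    # 1) top pairs by sum of absolute ORF and CRISPR similarities
    selected.update(r["pair"] for r in top_n(rows, lambda r: raw[r["pair"]], n_sum))
    # 2) top pairs by sum of scaled absolute similarities
    selected.update(r["pair"] for r in top_n(rows, lambda r: scaled[r["pair"]], n_scaled))
    # 3) top pairs within each sign quadrant (++, +-, -+, --)
    quadrants = {}
    for r in rows:
        quadrants.setdefault((r["orf"] >= 0, r["crispr"] >= 0), []).append(r)
    for members in quadrants.values():
        selected.update(r["pair"] for r in top_n(members, lambda r: raw[r["pair"]], n_quadrant))
    return selected


# Made-up similarities; real input would be the KG-filtered pair table.
rows = [
    {"pair": ("A", "B"), "orf": 0.9, "crispr": 0.8},
    {"pair": ("A", "C"), "orf": 0.7, "crispr": 0.2},
    {"pair": ("B", "C"), "orf": -0.6, "crispr": -0.5},
    {"pair": ("C", "D"), "orf": 0.5, "crispr": -0.4},
    {"pair": ("D", "E"), "orf": -0.3, "crispr": 0.3},
    {"pair": ("E", "F"), "orf": 0.1, "crispr": 0.1},
]
selected = select_top_pairs(rows, n_sum=1, n_scaled=1, n_quadrant=1)
print(sorted(selected))
```

With the defaults (8, 8 and 4 per quadrant) the union can contain at most 32 pairs, fewer when the three lists overlap.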

Summary

In terms of intersected ORF/CRISPR replicable genes:

Q-value replicable ORFs vs. Q-value replicable CRISPR: 284 common genes
Q-value replicable ORFs vs. P-value replicable CRISPR: 673 common genes
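The common-gene counts above are plain set intersections. A toy sketch, with placeholder gene names rather than real replicability calls:

```python
# Placeholder gene sets; the real ones come from the replicability files.
orf_q_replicable = {"GENE1", "GENE2", "GENE3", "GENE4"}
crispr_q_replicable = {"GENE2", "GENE4"}
crispr_p_replicable = {"GENE2", "GENE3", "GENE4", "GENE5"}

common_q = orf_q_replicable & crispr_q_replicable
common_p = orf_q_replicable & crispr_p_replicable

print(len(common_q), len(common_p))
```

In this toy example the q-value replicable CRISPR set is a subset of the p-value replicable one, so relaxing the CRISPR criterion can only add common genes.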

Plots of CRISPR vs. ORF similarities & Distribution of KG mean score (Q-value replicable ORFs vs. Q-value replicable CRISPR)

The above procedure gives:

p-val replicable CRISPRs: 23 gene pairs; 30 unique genes
p-val replicable CRISPRs: 28 gene pairs; 37 unique genes

Plots of CRISPR vs. ORF similarities, with the top unknown pairs (strongest signal in both ORF and CRISPR) annotated

Source files:

Merged orf, crispr, kg file (q-value replicable ORF and q-value replicable CRISPR):
orf_crispr_pairs_q_replicable.csv

Heatmaps

The values in the square indicate the average KG score

ORFs similarities

CRISPRs similarities
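A sketch of how the inputs for such a heatmap could be assembled: one square matrix of similarities for the cell colors and one matrix of text labels holding the average KG score, with "?" where no KG score is available for the pair. The data layout, function name and values are assumptions for illustration; this is not the code used for the plots above.

```python
def build_heatmap_inputs(genes, similarity, kg_mean):
    """Return (colors, labels): similarity values and KG annotations per cell."""
    colors, labels = [], []
    for g1 in genes:
        color_row, label_row = [], []
        for g2 in genes:
            if g1 == g2:
                color_row.append(1.0)  # a profile is perfectly similar to itself
                label_row.append("")
                continue
            key = frozenset((g1, g2))  # unordered pair, so the matrix is symmetric
            color_row.append(similarity.get(key))  # None if the pair is missing
            score = kg_mean.get(key)
            label_row.append("?" if score is None else f"{score:.2f}")
        colors.append(color_row)
        labels.append(label_row)
    return colors, labels


# Made-up values for three genes; the ("A", "C") pair has no KG score.
genes = ["A", "B", "C"]
similarity = {frozenset(("A", "B")): 0.8,
              frozenset(("A", "C")): -0.4,
              frozenset(("B", "C")): 0.1}
kg_mean = {frozenset(("A", "B")): 0.12, frozenset(("B", "C")): 0.55}

colors, labels = build_heatmap_inputs(genes, similarity, kg_mean)
for row in labels:
    print(row)
```

The two matrices can then be handed to a plotting library, with `colors` driving the color scale and `labels` used as per-cell text annotations.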

Conclusions

Despite updated methodology and updated computations, similar clusters are identified, possibly with slightly different specific genes highlighted. Specifically:

Cluster "GPR176, TSC22D1, DPAGT1, CHRM4": TSC22D1 and DPAGT1 are still in the top results; CHRM4 and GPR176 are included if p-value replicable CRISPRs are considered.
Cluster "ISOC2, ECH1, UQCRFS1, BCAT2, SARS2": SARS2 and ECH1 are still in the top results; UQCRFS1 is included if p-value replicable CRISPRs are considered; ISOC2 is not replicable in ORFs; the KG scores for BCAT2 paired with the other genes have increased significantly.

AnneCarpenter · 2024-01-20T14:33:05Z

Thanks for all this analysis! I think it will help to discuss the methodology and rationale when we are together.

I want to summarize that I think all 3 of these are interesting:

clusters/anti-correlations in ORF data only
clusters/anti-correlations in CRISPR data only
clusters/anti-correlations in both

In each case, we don't want to pay attention to genes that do not 'have a phenotype' (ie are not replicable).

In each case, we will want some examples that are well-known (high KG) and some that are novel (low KG), but emphasizing the latter for now because they are harder to find and will take time to followup with biology experiments.

So do you think we should pause work on #15 #16 #17 until after we meet?

jessica-ewald · 2024-01-21T20:57:24Z

Following this! I have started compiling information for the previously defined gene clusters, and from scanning the updated info it looks like some of it will still be useful, but I'll wait for confirmation before continuing.

tjetkaARD · 2024-01-22T12:45:14Z

@AnneCarpenter - accounting for the comments I have added to each specific issue #15 #16 #17

I think it is safe to proceed.

jessica-ewald · 2024-01-22T17:10:19Z

I'm afraid that I've gotten quite confused! I'll try and summarize what I do and don't understand.

There are often two paired heat maps with the same genes, one showing pairwise correlation in the ORF data and one showing pairwise correlation in the CRISPR data. The value in each cell corresponds to the strength of the KG connections between those two genes, with a "?" if that connection is not present in the KG. I assume that the color of each heat map cell corresponds to the correlation strength and direction (+/-) based on either the ORF or CRISPR morphological data. I'm unsure of:

Which color corresponds to positive correlations?
How were the list of genes in the heat maps chosen - were they the genes with the greatest disparity between the magnitude of morphological and KG similarity in the ORF data, the CRISPR data, or some combination of the two?

I'm unclear where exactly the three lists of genes (Cluster GPR176, TSC22D1, DPAGT1, CHRM4: exploration for MorphMap paper (ORF+CRISPR) #15; Cluster ECH1, UQCRFS1, SARS2: exploration for MorphMap paper (ORF+CRISPR) #16; POLRID, SPATA25 connected to many genes: exploration for MorphMap paper (ORF, but want to check CRISPR) #17) are coming from. There are six different heat maps in the relevant issues (here, here, here, here, here, and here). I'm unsure of:

whether all the heatmaps are still valid, or if some should be deleted because they are based on the previous KG / data that was not filtered for replicability
which of the five heatmap posts each cluster comes from, and whether I should be looking at the ORF or CRISPR heatmaps (or both)
sometimes I can find a cluster of genes in a heatmap that seems to correspond to one of the lists, but then there are other genes in the same cluster that are not included in the lists. Were the clusters filtered to remove connections that were explained by the KG? For example, the only heatmap that I can find with both POLRID and SPATA25 is here, but there are three other genes in that cluster.

Just want to clarify all of this before diving into databases/literature. Thanks in advance!

AnneCarpenter · 2024-01-24T16:56:52Z

blue is positive, red is negative correlation. See #20 for our basic protocol to make the heatmaps.

I believe #15 #16 #17 have all come from this issue (where a signal was seen in ORF+CRISPR data, both) but it is possible that in some cases after revising our analysis one or the other result 'fell apart' (this may also happen with the chromosome arm correction currently happening for CRISPR data).
This will all become verifiable when we have the power to make our own heatmaps and filter genes into them as we like, Tomasz is working on that. It will also allow expanding which genes are in a cluster (to include knowledge graph-positive gene pairs, which should provide helpful context.)
Tagging @jessica-ewald and @Zitong-Chen-16

AnneCarpenter · 2024-01-24T16:58:34Z

Probably the actual task is finished in this issue (find clusters interesting in both ORF + CRISPR datasets) because we expect the clusters we already found to remain.

But leaving it open and assigning @zahrahanifehlou so that when the chromosome arm corrections are done, Tomasz can make the 'final' versions of these heatmaps.

zahrahanifehlou · 2024-01-26T17:37:25Z

The chromosome arm corrections are done (notebook). I also calculated the replicated genes and their similarities in both the original and the corrected CRISPR profiles. You can find them on this link

AnneCarpenter · 2024-01-26T17:53:07Z

(though please see my note on https://github.com/jump-cellpainting/morphmap/issues/162 before proceeding to use them)

niranjchandrasekaran · 2024-11-20T01:51:39Z

Separate issues were created for the new connections in this issue and those were included in the morphmap paper

AnneCarpenter added the Evotec Vignettes that stem from Evotec's findings label Dec 15, 2023

AnneCarpenter self-assigned this Dec 15, 2023

AnneCarpenter changed the title ~~Find gene connections to pursue: exploration for MorphMap paper~~ Find Evotec gene connections to pursue: exploration for MorphMap paper Dec 15, 2023

AnneCarpenter mentioned this issue Jan 8, 2024

YAP1 connections: exploration for MorphMap paper (ORF but need to check CRISPR) #10

Closed

AnneCarpenter mentioned this issue Jan 9, 2024

ZBTB16 & SLC39A1: exploration for MorphMap paper #12

Closed

This was referenced Jan 18, 2024

Cluster GPR176, TSC22D1, DPAGT1, CHRM4: exploration for MorphMap paper (ORF+CRISPR) #15

Closed

Cluster ECH1, UQCRFS1, SARS2: exploration for MorphMap paper (ORF+CRISPR) #16

Closed

AnneCarpenter removed the Evotec Vignettes that stem from Evotec's findings label Jan 18, 2024

AnneCarpenter removed their assignment Jan 18, 2024

tjetkaARD mentioned this issue Jan 19, 2024

HOOK2 opposite effect than PAFAH1B1, NDE1, NDEL1: exploration for MorphMap paper (ORF) #5

Closed

AnneCarpenter changed the title ~~Find Evotec gene connections to pursue: exploration for MorphMap paper~~ Find gene connections [ORF+CRISPR both] to pursue: exploration for MorphMap paper Jan 24, 2024

AnneCarpenter assigned tjetkaARD Jan 24, 2024

AnneCarpenter assigned zahrahanifehlou and unassigned tjetkaARD Jan 24, 2024

afermg added orf Uses ORF data crispr Uses crispr data labels Feb 1, 2024

shntnu added the internal Internal discussions (but publicly accessible) label Oct 17, 2024

niranjchandrasekaran closed this as completed Nov 20, 2024
