
Find gene connections [ORF+CRISPR both] to pursue: exploration for MorphMap paper #7

Closed
AnneCarpenter opened this issue Dec 15, 2023 · 25 comments
Labels: crispr (Uses crispr data) · internal (Internal discussions, but publicly accessible) · orf (Uses ORF data)

Comments

@AnneCarpenter
Contributor

AnneCarpenter commented Dec 15, 2023

Here is the updated list they provided Dec 15 2023.

We are generally interested to find gene pairs where:

  • there is a strong Cell Painting ORF pos or neg correlation (column E: ORF_similarity_abs)
  • there is NOT a strong knowledge graph connection (there are 4 KG-only models, columns F-I)
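
A minimal sketch of the filter described above, assuming the spreadsheet rows have been loaded as dictionaries. Only `ORF_similarity_abs` is a column name taken from the file; the KG column names (`kg_1` to `kg_4`) and both thresholds are hypothetical placeholders, not values from this thread.

```python
# Hypothetical names for the four KG-only model columns (columns F-I in the sheet).
KG_COLUMNS = ["kg_1", "kg_2", "kg_3", "kg_4"]


def unexplained_pairs(rows, min_orf_abs=0.5, max_kg=0.4):
    """Keep pairs with strong ORF similarity and no strong KG score.

    Both thresholds are illustrative assumptions.
    """
    kept = []
    for row in rows:
        strong_orf = row["ORF_similarity_abs"] >= min_orf_abs
        weak_kg = all(row[col] < max_kg for col in KG_COLUMNS)
        if strong_orf and weak_kg:
            kept.append(row)
    return kept


# Toy rows with placeholder gene names, not real data.
rows = [
    {"pair": ("A", "B"), "ORF_similarity_abs": 0.8,
     "kg_1": 0.1, "kg_2": 0.2, "kg_3": 0.1, "kg_4": 0.3},
    {"pair": ("A", "C"), "ORF_similarity_abs": 0.9,
     "kg_1": 0.9, "kg_2": 0.2, "kg_3": 0.1, "kg_4": 0.3},
    {"pair": ("B", "C"), "ORF_similarity_abs": 0.2,
     "kg_1": 0.1, "kg_2": 0.1, "kg_3": 0.1, "kg_4": 0.1},
]
selected = [r["pair"] for r in unexplained_pairs(rows)]
print(selected)
```

Requiring every KG model to be weak is one possible reading of "NOT a strong knowledge graph connection"; averaging the four scores would be another.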

We already found many connections between SLC and OR gene families and will pursue those (#6) but we would like more.

Here is the email thread in Anne's email "Broad/Evotec collaboration on MorphMap & knowledge graphs" https://mail.google.com/mail/u/0/#inbox/FMfcgzGtwDFhnZvdwPWGMLLbdpgNZBSM with excel file

Here are the meeting notes: https://docs.google.com/document/d/1iIwJ1V5ig8KtTD7P0vV-GH16f-AvprqU/edit

MorphMap_gene_gene_scoring_data.xlsx

@AnneCarpenter AnneCarpenter added the Evotec Vignettes that stem from Evotec's findings label Dec 15, 2023
@AnneCarpenter AnneCarpenter self-assigned this Dec 15, 2023
@AnneCarpenter AnneCarpenter changed the title Find gene connections to pursue: exploration for MorphMap paper Find Evotec gene connections to pursue: exploration for MorphMap paper Dec 15, 2023
@tjetkaARD
Collaborator

tjetkaARD commented Jan 5, 2024

In addition to the above data, I am attaching the Excel file with an additional sheet that also includes CRISPR similarity: edited - see file in link #7 (comment)

There are several columns added, primarily:

  • CRISPR_similarity: the value of cosine similarity between CP profiles
  • crispr_status: specifying the status of the gene pair in the dataset (replicable / not replicable / not present)
  • coexpression in RNA-seq data
  • correlation from pooled CRISPR KOs (DepMap)
  • strength in the STRINGdb knowledge graph

All added columns have a short explanation above the column name.
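
For reference, the cosine similarity behind a column like `CRISPR_similarity` can be computed as in the minimal sketch below. The function name and the toy profiles are illustrative assumptions, not the pipeline's actual code.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("profiles must have the same number of features")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy profiles (placeholders): b points the same way as a, c the opposite way.
profile_a = [1.0, 2.0, -1.0]
profile_b = [2.0, 4.0, -2.0]
profile_c = [-1.0, -2.0, 1.0]

print(cosine_similarity(profile_a, profile_b))
print(cosine_similarity(profile_a, profile_c))
```

Values near +1 indicate similar profiles and values near -1 indicate opposite profiles, which is why the sign matters when comparing ORF and CRISPR results.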

First look insights (pairs with significant correlation both in ORF and CRISPR and without strong KG evidence):

Edited: see the top pairs in comment : #7 (comment).

@AnneCarpenter
Contributor Author

Awesome! For your filter for "strong similarity in both CRISPR and ORF" - did you require that they are the same direction, or did you take absolute value? I see all of these pairs are positive correlations for both (except one pair is neg for both) so I wondered if your filtering would have allowed a strong neg in one and pos in the other to come through?

It would be great to see a heatmap of the correlations among this set of ~15 genes for CRISPR and another heatmap for ORF, because it appears they mostly fall into a few blobs rather than forming 15 very independent relationships.
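
As a rough sketch of the data preparation for such a heatmap, a list of scored gene pairs can be pivoted into a symmetric gene-by-gene matrix. The function name, gene names, and scores below are placeholders, not values from the spreadsheet.

```python
def similarity_matrix(genes, pair_scores):
    """Build a symmetric matrix of pairwise scores for the given genes.

    Pairs without a score stay as None so they can be drawn as blank cells.
    """
    index = {gene: i for i, gene in enumerate(genes)}
    n = len(genes)
    matrix = [[None] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 1.0  # a profile is perfectly similar to itself
    for (gene_a, gene_b), score in pair_scores.items():
        if gene_a in index and gene_b in index:
            i, j = index[gene_a], index[gene_b]
            matrix[i][j] = score
            matrix[j][i] = score
    return matrix


# Placeholder genes and scores.
genes = ["GENE_A", "GENE_B", "GENE_C"]
pair_scores = {("GENE_A", "GENE_B"): 0.7, ("GENE_B", "GENE_C"): -0.5}
matrix = similarity_matrix(genes, pair_scores)
for gene, row in zip(genes, matrix):
    print(gene, row)
```

Building one matrix from ORF scores and one from CRISPR scores with the same gene order makes the two heatmaps directly comparable.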

@tjetkaARD
Collaborator

tjetkaARD commented Jan 7, 2024

> Awesome! For your filter for "strong similarity in both CRISPR and ORF" - did you require that they are the same direction, or did you take absolute value? I see all of these pairs are positive correlations for both (except one pair is neg for both) so I wondered if your filtering would have allowed a strong neg in one and pos in the other to come through?

In fact, I did allow for any direction of relationship. Specifically, I took:
edited: see methodology in #7 (comment)

and afterwards filtered against knowledge graphs (assuming that the average of the [CC,MF,PT,BP] KG scores needs to be below 0.4). It seems that most of the pairs with inconsistent direction between CRISPR and ORF (~30-40% of all pairs in the table) are filtered out by the KG condition.
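
A small sketch of the KG condition and the direction check described above. The 0.4 cutoff and the [CC, MF, PT, BP] score names come from this comment; the `orf` and `crispr` field names and the toy pairs are assumptions for illustration.

```python
KG_KEYS = ("CC", "MF", "PT", "BP")


def passes_kg_filter(pair, max_mean_kg=0.4):
    """True when the average of the four KG scores is below the cutoff."""
    mean_kg = sum(pair[key] for key in KG_KEYS) / len(KG_KEYS)
    return mean_kg < max_mean_kg


def same_direction(pair):
    """True when the ORF and CRISPR similarities have the same sign."""
    return (pair["orf"] > 0) == (pair["crispr"] > 0)


# Toy pairs with placeholder names and scores.
pairs = [
    {"name": "p1", "orf": 0.6, "crispr": 0.5, "CC": 0.1, "MF": 0.2, "PT": 0.3, "BP": 0.2},
    {"name": "p2", "orf": 0.7, "crispr": -0.6, "CC": 0.5, "MF": 0.6, "PT": 0.4, "BP": 0.5},
    {"name": "p3", "orf": -0.5, "crispr": -0.4, "CC": 0.1, "MF": 0.1, "PT": 0.1, "BP": 0.1},
    {"name": "p4", "orf": 0.5, "crispr": -0.5, "CC": 0.3, "MF": 0.3, "PT": 0.3, "BP": 0.3},
]

kept = [p for p in pairs if passes_kg_filter(p)]
inconsistent = [p["name"] for p in kept if not same_direction(p)]
print([p["name"] for p in kept], inconsistent)
```

Because the similarity filter uses absolute values, a pair such as `p4` with opposite signs in ORF and CRISPR can still pass; it is only removed if its KG scores are high.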

If we were to take only the top 100 pairs with respect to absolute correlations, we would get the following results (only those with inconsistent direction; the others are similar):
edited: see results in #7 (comment)

To be precise: I needed to edit the previous post and table and add two rows due to omission of filtering with respect to MF-based KG score. So there is now one pair with inconsistent directions in the previous procedure.

Heatmaps - in progress.

@tjetkaARD
Collaborator

tjetkaARD commented Jan 8, 2024

Heatmaps corresponding to the table:

Edited: see heatmaps in #7 (comment)

I see only two repeating clusters:

  1. GPR176, TSC22D1, DPAGT1, CHRM4
  2. ISOC2, ECH1, UQCRFS1, BCAT2, SARS2

@AnneCarpenter
Contributor Author

AnneCarpenter commented Jan 8, 2024

Anne will examine these two plots and choose gene pairs to follow up experimentally with collaborators.
@tjetkaARD will create exactly these plots but removing the constraint that pairs are correlated in BOTH ORF and CRISPR. He will start a new thread with those and, at the least, those will be in the paper. Anne may also identify vignettes in those.

@tjetkaARD
Collaborator

I have updated the above plots.

Unfortunately, I do not have the full KG data for all pairs - only for the top ones - as in the original Excel files. So, the annotations are sparse. Alternatively, I can plot the average KG score instead of letters.

@AnneCarpenter
Contributor Author

Ah, ok, I will ask Evotec if they can provide that, although maybe we only need this for our own exploration and it isn't necessary for the paper and what we have is enough for exploration. I will think about this when I dive into looking at these connections. Thanks!

@AnneCarpenter
Contributor Author

(I've asked - and BTW it would be even better to show the actual value (average of KG columns) on the heatmap so we have a sense of the strength of the scores.)

@cyrenaique
Collaborator

Sorry for asking, but is this Evotec KG different from the STRINGdb PPI data? If not, I have some code to get values for a list of genes... just in case it is needed.

@AnneCarpenter
Contributor Author

Thanks for offering! But indeed the Evotec KG is very different: it combines many sources of info (including PPI but also others).

@AnneCarpenter
Contributor Author

From Andrey Zinovyev of Evotec:

Hi Anne,

Thank you for this information, very exciting to see the progress along several lines!

Here is a folder with some materials that I hope can address most of your requests
https://drive.google.com/drive/folders/1kKqx5B9VJGq47yN03P7z8CikHOWyWMwg?usp=sharing

It contains :

  • All scores from KG models (orf_scores_merged.zip file) merged with QC filtered ORF scores that Niranj sent to us on Monday. This merging does not contain the CRISPR-derived scores, but we can add them as well as other columns : however, I do not seem to have access to this github Find gene connections [ORF+CRISPR both] to pursue: exploration for MorphMap paper #7 . Just in case my nickname is ‘auranic’
  • Top filtered links with large ORF scores and small KG scores (toplinks_unexplained.xls file), according to the unsupervised ‘Biological process’ model. For example, some of the CYP* connections I saw in the heatmaps in your PDF are indeed there.
  • Powerpoint presentation with some analysis results, including the analysis of SLC/OR pairs and genome-wide (but restricted to the QC filtered genes) scatterplots (ORF sim vs KG score). Also, pay attention to the network figures at the end where we highlight some of the clusters of “unexplained links” (including SLC/OR related ones but also others with, for example, CYP* genes as hubs).

Please note that we decided to change the functional scoring of KG relations from the L2 percentile-based to Pearson correlation, it appears to be more interpretable in the end but does not strongly affect the gene pair selection. Also of note, so far there is no confidential data used in this work, all is based on publicly available knowledge graph analysis.

If any explanations are needed, we will be happy to connect via email or a call.
Best regards,

Andrei

@tjetkaARD
Collaborator

tjetkaARD commented Jan 11, 2024

@AnneCarpenter I will take care of it and the full plots today - sorry, the last two days were crazy busy.

@cyrenaique regarding STRINGdb - in fact, I have already merged it into the Excel file shared in the comment #7 (comment) (last column). But the Evotec KG is much more comprehensive/sensitive.

Edit: in progress, trying to figure out inconsistencies with the previous list together with Niranj.

@AnneCarpenter
Contributor Author

Yes - I can elaborate on the Evotec knowledge graph: They take existing annotated sources (biological processes, pathways, molecular functions) as ground truth to train the graph (which is based on lots of underlying data sources) to properly predict those connections.

@cyrenaique
Collaborator

Thanks Anne for the clarification.
https://pubmed.ncbi.nlm.nih.gov/36370105/
It seems that STRINGdb also updated their way of computing/predicting scores in 01/2023, interesting...

@AnneCarpenter
Contributor Author

AnneCarpenter commented Jan 18, 2024

The above connections were filtered as being strong in both ORF and CRISPR. For ORF-only or CRISPR-only connections, we move to new issues: #11 for ORFs and soon a new one for CRISPRs when @tjetkaARD is ready.

I think we should pursue the two clusters that @tjetkaARD noted above - these have strong (+/-) correlation in both ORF and CRISPR but are not (completely) strongly connected in the KG so I am making new issues for these:
GPR176, TSC22D1, DPAGT1, CHRM4: #15
ISOC2, ECH1, UQCRFS1, BCAT2, SARS2: #16

(this issue can be closed as soon as @tjetkaARD makes the new issue for CRISPR-only connections)

@AnneCarpenter AnneCarpenter removed the Evotec Vignettes that stem from Evotec's findings label Jan 18, 2024
@AnneCarpenter AnneCarpenter removed their assignment Jan 18, 2024
@tjetkaARD
Collaborator

tjetkaARD commented Jan 19, 2024

@AnneCarpenter

Unfortunately, we need another iteration for this issue. There have been two relevant changes affecting the final output:

  1. The Knowledge Graph methodology changed
  2. In the Excel file, shared here: Find gene connections [ORF+CRISPR both] to pursue: exploration for MorphMap paper #7 (comment), the ORFs are not filtered according to their replicability (only correlation strength)

Fortunately, it does not impact the qualitative conclusions much (see the last section). However, to clean everything up and avoid any confusion, I will edit all the above comments to link to the confirmed and most recent results below.

Methodology

  1. Replicability:
  2. Knowledge Graph filter: based on file shared in: Find gene connections [ORF+CRISPR both] to pursue: exploration for MorphMap paper #7 (comment), gene pairs are included if:
  • average of (gene_mf, gene_bp, gene_pathway) is below 0.5 AND max of (gene_mf, gene_bp, gene_pathway) is below 0.8
  3. Choice of top pairs:
  • Top 8 pairs according to sum of absolute orf/crispr similarities AND
  • Top 8 pairs according to sum of scaled absolute orf/crispr similarities AND
  • Top 4 pairs from each quadrant (to allow associations of different signs) according to sum of absolute orf/crispr similarities
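
The selection steps above can be sketched roughly as follows. The KG thresholds (0.5 mean, 0.8 max), the `gene_mf`/`gene_bp`/`gene_pathway` score names, and the top-8 and top-4 counts are from this comment. The `orf` and `crispr` field names are assumptions, the "AND" is read here as a union of the selected sets, and the "scaled" ranking is omitted because the scaling is not specified in the thread.

```python
def kg_unexplained(pair):
    """KG filter: mean of the three KG scores < 0.5 and their max < 0.8."""
    scores = (pair["gene_mf"], pair["gene_bp"], pair["gene_pathway"])
    return sum(scores) / len(scores) < 0.5 and max(scores) < 0.8


def strength(pair):
    """Sum of absolute ORF and CRISPR similarities."""
    return abs(pair["orf"]) + abs(pair["crispr"])


def quadrant(pair):
    """Sign combination of the ORF and CRISPR similarities."""
    return (pair["orf"] >= 0, pair["crispr"] >= 0)


def select_top_pairs(pairs, n_overall=8, n_per_quadrant=4):
    """Union of the overall top pairs and the top pairs of each quadrant."""
    ranked = sorted((p for p in pairs if kg_unexplained(p)), key=strength, reverse=True)
    selected = list(ranked[:n_overall])
    per_quadrant = {}
    for pair in ranked:
        bucket = per_quadrant.setdefault(quadrant(pair), [])
        if len(bucket) < n_per_quadrant:
            bucket.append(pair)
            if pair not in selected:
                selected.append(pair)
    return selected


# Toy pairs with placeholder names and scores.
pairs = [
    {"name": "a", "orf": 0.9, "crispr": 0.8, "gene_mf": 0.1, "gene_bp": 0.1, "gene_pathway": 0.1},
    {"name": "b", "orf": 0.7, "crispr": -0.6, "gene_mf": 0.2, "gene_bp": 0.2, "gene_pathway": 0.2},
    {"name": "c", "orf": 0.95, "crispr": 0.9, "gene_mf": 0.9, "gene_bp": 0.1, "gene_pathway": 0.1},
    {"name": "d", "orf": 0.5, "crispr": 0.4, "gene_mf": 0.1, "gene_bp": 0.1, "gene_pathway": 0.1},
]
top = [p["name"] for p in select_top_pairs(pairs, n_overall=1, n_per_quadrant=1)]
print(top)
```

In the toy data, pair `c` has the strongest signal but is dropped because one KG score exceeds 0.8, and pair `b` is kept only through its quadrant, which is the reason for the per-quadrant rule.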

Summary

In terms of intersected ORF/CRISPR replicable genes:

  • Q-value replicable ORFs vs. Q-value replicable CRISPR: 284 common genes
  • Q-value replicable ORFs vs. P-value replicable CRISPR: 673 common genes
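
The counts above are sizes of intersections between lists of replicable genes. A minimal sketch with placeholder gene identifiers (the sets below are not real results):

```python
def common_genes(orf_replicable, crispr_replicable):
    """Genes that are replicable in both the ORF and CRISPR datasets."""
    return set(orf_replicable) & set(crispr_replicable)


# Placeholder gene identifiers; the looser CRISPR criterion keeps more genes.
orf_q = {"G1", "G2", "G3"}
crispr_q = {"G2", "G3"}
crispr_p = {"G1", "G2", "G3", "G4"}

n_q_vs_q = len(common_genes(orf_q, crispr_q))
n_q_vs_p = len(common_genes(orf_q, crispr_p))
print(n_q_vs_q, n_q_vs_p)
```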

Plots of CRISPR vs. ORF similarities & Distribution of KG mean score (Q-value replicable ORFs vs. Q-value replicable CRISPR)
[Figure: image]

The above procedure gives:

  • q-val replicable CRISPRs: 23 gene pairs; 30 unique genes
  • p-val replicable CRISPRs: 28 gene pairs; 37 unique genes

Plots of CRISPR vs. ORF similarities, with annotated top unknown pairs with the strongest signal between ORF & CRISPR
[Figure: scatter_similarity_orf_crispr_q_replicable_annot]

Source files:

Heatmaps

The value in each square indicates the average KG score

ORFs similarities
[Figure: orf_heatmap_cosine_Unknown CRISPR ORF Top 4_labels]

CRISPRs similarities
[Figure: crispr_heatmap_cosine_Unknown CRISPR ORF Top 4_labels]

Conclusions

  1. Despite the updated methodology and computations, similar clusters are identified, possibly with slightly different specific genes highlighted. Specifically:
  • Cluster "GPR176, TSC22D1, DPAGT1, CHRM4": TSC22D1 and DPAGT1 are still in the top results; CHRM4 and GPR176 are included if p-value replicable CRISPRs are considered;
  • Cluster "ISOC2, ECH1, UQCRFS1, BCAT2, SARS2": SARS2 and ECH1 are still in the top results; UQCRFS1 is included if p-value replicable CRISPRs are considered; ISOC2 is not replicable in ORFs; the KG scores for BCAT2's interactions with the other genes have increased significantly.

@AnneCarpenter
Contributor Author

Thanks for all this analysis! I think it will help to discuss the methodology and rationale when we are together.

I want to summarize that I think all 3 of these are interesting:

  1. clusters/anti-correlations in ORF data only
  2. clusters/anti-correlations in CRISPR data only
  3. clusters/anti-correlations in both

In each case, we don't want to pay attention to genes that do not 'have a phenotype' (ie are not replicable).

In each case, we will want some examples that are well-known (high KG) and some that are novel (low KG), but emphasizing the latter for now because they are harder to find and will take time to follow up with biology experiments.
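
One way to sort gene pairs into these three categories while ignoring non-replicable genes is sketched below. The field names, the `min_abs` threshold, and the gene identifiers are hypothetical.

```python
def classify_pair(pair, replicable_orf, replicable_crispr, min_abs=0.4):
    """Label a pair as 'both', 'orf_only', 'crispr_only', or 'neither'.

    A dataset counts only if both genes are replicable in it and the
    absolute similarity reaches min_abs (an illustrative threshold).
    """
    gene_a, gene_b = pair["genes"]

    def strong(score, replicable):
        return (
            score is not None
            and gene_a in replicable
            and gene_b in replicable
            and abs(score) >= min_abs
        )

    in_orf = strong(pair.get("orf"), replicable_orf)
    in_crispr = strong(pair.get("crispr"), replicable_crispr)
    if in_orf and in_crispr:
        return "both"
    if in_orf:
        return "orf_only"
    if in_crispr:
        return "crispr_only"
    return "neither"


# Placeholder genes and scores.
replicable_orf = {"G1", "G2", "G3"}
replicable_crispr = {"G1", "G2"}
pairs = [
    {"genes": ("G1", "G2"), "orf": 0.6, "crispr": -0.5},
    {"genes": ("G1", "G3"), "orf": -0.7, "crispr": 0.8},
    {"genes": ("G1", "G2"), "orf": 0.1, "crispr": 0.5},
    {"genes": ("G4", "G1"), "orf": 0.9, "crispr": 0.9},
]
labels = [classify_pair(p, replicable_orf, replicable_crispr) for p in pairs]
print(labels)
```

The KG score is deliberately left out of this step so that both well-known (high KG) and novel (low KG) examples remain available in each category.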

So do you think we should pause work on #15 #16 #17 until after we meet?

@jessica-ewald
Contributor

Following this! I have started compiling information for the previously defined gene clusters, and from scanning the updated info it looks like some of it will still be useful, but I'll wait for confirmation before continuing.

@tjetkaARD
Collaborator

@AnneCarpenter - accounting for the comments I have added to each specific issue #15 #16 #17

I think it is safe to proceed.

@jessica-ewald
Contributor

jessica-ewald commented Jan 22, 2024

I'm afraid that I've gotten quite confused! I'll try and summarize what I do and don't understand.

  1. There are often two paired heat maps with the same genes, one showing pairwise correlation in the ORF data and one showing pairwise correlation in the CRISPR data. The value in each cell corresponds to the strength of the KG connections between those two genes, with a "?" if that connection is not present in the KG. I assume that the color of each heat map cell corresponds to the correlation strength and direction (+/-) based on either the ORF or CRISPR morphological data. I'm unsure of:
  • Which color corresponds to positive correlations?
  • How was the list of genes in the heat maps chosen - were they the genes with the greatest disparity between the magnitude of morphological and KG similarity in the ORF data, the CRISPR data, or some combination of the two?
  2. I'm unclear where exactly the three lists of genes (Cluster GPR176, TSC22D1, DPAGT1, CHRM4: exploration for MorphMap paper (ORF+CRISPR) #15 ; Cluster ECH1, UQCRFS1, SARS2: exploration for MorphMap paper (ORF+CRISPR) #16 ; POLRID, SPATA25 connected to many genes: exploration for MorphMap paper (ORF, but want to check CRISPR) #17) are coming from. There are six different heat maps in the relevant issues (here, here, here, here, here, and here). I'm unsure of:
  • whether all the heatmaps are still valid, or if some should be deleted because they are based on the previous KG / data that was not filtered for replicability
  • which of the five heatmap posts each cluster comes from, and whether I should be looking at the ORF or CRISPR heatmaps (or both)
  • sometimes I can find a cluster of genes in a heatmap that seems to correspond to one of the lists, but then there are other genes in the same cluster that are not included in the lists. Were the clusters filtered to remove connections that were explained by the KG? For example, the only heatmap that I can find with both POLRID and SPATA25 is here, but there are three other genes in that cluster.

Just want to clarify all of this before diving into databases/literature. Thanks in advance!

@AnneCarpenter AnneCarpenter changed the title Find Evotec gene connections to pursue: exploration for MorphMap paper Find gene connections [ORF+CRISPR both] to pursue: exploration for MorphMap paper Jan 24, 2024
@AnneCarpenter
Contributor Author

Blue is positive, red is negative correlation. See #20 for our basic protocol to make the heatmaps.

I believe #15 #16 #17 have all come from this issue (where a signal was seen in ORF+CRISPR data, both) but it is possible that in some cases after revising our analysis one or the other result 'fell apart' (this may also happen with the chromosome arm correction currently happening for CRISPR data).
This will all become verifiable when we have the power to make our own heatmaps and filter genes into them as we like; Tomasz is working on that. It will also allow expanding which genes are in a cluster (to include knowledge graph-positive gene pairs, which should provide helpful context).
Tagging @jessica-ewald and @Zitong-Chen-16

@AnneCarpenter
Contributor Author

Probably the actual task is finished in this issue (find clusters interesting in both ORF + CRISPR datasets) because we expect the clusters we already found to remain.

But leaving it open and assigning @zahrahanifehlou so that when the chromosome arm corrections are done, Tomasz can make the 'final' versions of these heatmaps.

@zahrahanifehlou
Collaborator

The chromosome arm corrections are done (notebook). I also calculated the replicated genes and their similarities in the original and corrected CRISPR profiles. You can find them at this link.

@AnneCarpenter
Contributor Author

(though please see my note on https://github.com/jump-cellpainting/morphmap/issues/162 before proceeding to use them)

@afermg afermg added orf Uses ORF data crispr Uses crispr data labels Feb 1, 2024
@shntnu shntnu added the internal Internal discussions (but publicly accessible) label Oct 17, 2024
@niranjchandrasekaran
Member

Separate issues were created for the new connections in this issue, and those were included in the MorphMap paper.

8 participants