This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

[Discussion] How should we define a recurrent mutation and how do our mutation results compare to the literature? #517

Open
jaclyn-taroni opened this issue Feb 5, 2020 · 3 comments
Labels
discussion snv Related to or requires SNV data

Comments

@jaclyn-taroni
Member

There are a number of ways to define or identify recurrent mutations. The purpose of this issue is to discuss how to define a "recurrent mutation" throughout the project, with some acknowledgment that the answer might be "it depends."

My goal is to document some of the things that I've been thinking about or reading recently (which is almost certainly not a complete look at all available literature) to get the discussion started.

Here are a few examples of analyses that use or may use the concept of a recurrent mutation:

All this to say - is a recurrent mutation a specific alteration, e.g., H3F3A K28M, or is it any mutation in a gene given some constraints (e.g., drop synonymous mutations)?

I think the interaction-plots and recurrent-VUS are good examples of why the answer may depend on the specific analysis, but it would be good to get some discussion around this going.
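To make the two candidate definitions concrete, here is a minimal sketch that counts recurrence at both levels of granularity. The records, the `count_samples` and `recurrent` helpers, and the `min_samples=2` threshold are all made up for illustration; this is not project code, and the real inputs would be MAF-style tables.

```python
from collections import defaultdict

# Made-up toy records: (sample_id, gene, protein_change, variant_class).
mutations = [
    ("S1", "H3F3A", "K28M", "Missense_Mutation"),
    ("S2", "H3F3A", "K28M", "Missense_Mutation"),
    ("S3", "H3F3A", "G35R", "Missense_Mutation"),
    ("S3", "TP53", "R175H", "Missense_Mutation"),
    ("S4", "TP53", "P72P", "Silent"),
]


def count_samples(records, key, drop_classes=("Silent",)):
    """Count distinct samples per key, skipping the dropped variant classes."""
    samples = defaultdict(set)
    for sample_id, gene, protein_change, variant_class in records:
        if variant_class in drop_classes:
            continue  # e.g., drop synonymous mutations
        samples[key(gene, protein_change)].add(sample_id)
    return {k: len(v) for k, v in samples.items()}


def recurrent(counts, min_samples=2):
    """Keep entries seen in at least min_samples samples (assumed threshold)."""
    return {k: n for k, n in counts.items() if n >= min_samples}


# Definition 1: a specific alteration, e.g., H3F3A K28M.
by_alteration = count_samples(mutations, key=lambda gene, change: (gene, change))
# Definition 2: any non-synonymous mutation in the gene.
by_gene = count_samples(mutations, key=lambda gene, change: gene)

print(recurrent(by_alteration))
print(recurrent(by_gene))
```

With these toy records the two definitions disagree on how recurrent H3F3A is (two samples for K28M alone, three for the gene as a whole), which is the kind of difference an analysis-specific definition would need to account for.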

Significantly mutated genes

Beyond recurrent mutations, there is also the question of whether or not a gene is "significantly mutated" and what method could be used to make that determination. Here, I'll link to relevant literature and software/code.

From Ma et al. Nature. 2018:

By analysing the enrichment [12, 13] of somatic alterations within each histotype or the pan-cancer cohort (see Methods), we identified 142 significantly mutated driver genes (Fig. 2a, Supplementary Table 2, Extended Data Fig. 3a).

Where the methods state:

We discovered 142 candidate driver genes by this approach (Supplementary Table 2). Of these, 133 were significant by GRIN analysis (87 genes common to both GRIN and MutSigCV) and nine were significant only by MutSigCV.

The GRIN R package is available here: https://www.stjuderesearch.org/site/depts/biostats/grin

MutSigCV v1 is available as a GenePattern module: https://www.genepattern.org/modules/docs/MutSigCV

Note: I happened upon some R code that implements the MutSig1.0 statistic: https://github.com/lixiangchun/lxctk/blob/ea74021f49393c65993b28f6a11a4c5cccbf66ae/R/mutsig.gene.R#L102

maftools also seems to have some functionality for using the output of MutSigCV, based on my skim of Mayakonda et al. Genome Research. 2018.

From Gröbner et al. Nature. 2018:

MuSiC identified 77 significantly mutated genes (SMGs), which were ranked according to their pan-cancer mutation frequency [24] (Fig. 4, Supplementary Tables 9, 10). Most SMGs were mutually exclusively mutated across cancer types, demonstrating specificity of single putative driver genes in childhood cancers as compared to more frequent co-mutation in adult cancers in the TCGA study [7] (Extended Data Fig. 4c–e).

And from the methods:

Significantly mutated genes based on somatic SNVs and indels were identified with the SMG module of the MuSiC tools suite [24] separately from all cancer types and from the pan-cancer cohort, and then merged.

This kind of significance analysis often produces false positive hits (for example, very large genes), despite normalization procedures, and thus several filters were applied to the raw output [30].

MuSiC2 is available on GitHub: https://github.com/ding-lab/MuSiC2

Some of the tests proposed by the MuSiC paper (Dees et al. Genome Research. 2012.), namely the Fisher's combined p-value test and likelihood ratio test, are implemented in the same function I linked to above: https://github.com/lixiangchun/lxctk/blob/ea74021f49393c65993b28f6a11a4c5cccbf66ae/R/mutsig.gene.R, where the method labeled PCT is from Kan et al. Nature. 2010. per the documentation.
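For reference, Fisher's combined p-value test itself is simple enough to sketch. The block below is a generic, standard-library version of Fisher's method, not a port of the linked R function or of MuSiC; the function names are hypothetical, and which p-values a given tool combines is not something this sketch tries to reproduce. It assumes the input p-values are independent.

```python
import math


def chi2_sf_even_df(x, df):
    """Chi-square survival function, closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0
    total = 1.0
    # P(X > x) = exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_combined_p(p_values):
    """Combine independent p-values using Fisher's method."""
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must be in (0, 1]")
    # Under the null, -2 * sum(ln p_i) follows a chi-square with 2k df.
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    return chi2_sf_even_df(statistic, 2 * len(p_values))


print(fisher_combined_p([0.05, 0.05]))
```

Because the degrees of freedom are always even (2 per p-value), the chi-square tail has a closed form and no statistics library is needed.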

Comparison to other literature

The Gröbner et al. Nature. 2018 cohort is enriched for CNS tumors:

This study is biased towards central nervous system tumours, and is complemented by an additional study of a non-overlapping paediatric cohort with mainly leukaemias and extracranial solid tumours [9].

A comparison to their results seems like a good thing to do as part of this project. Here's a link from that paper: http://www.pedpancan.com/ which mentions PedcBioPortal when you follow it!

@jaclyn-taroni jaclyn-taroni added snv Related to or requires SNV data discussion labels Feb 5, 2020
@jaclyn-taroni
Member Author

I was looking more into what functionality maftools has, according to its documentation. The section "Detecting cancer driver genes based on positional clustering" states:

oncodrive is a based on algorithm oncodriveCLUST which was originally implemented in Python. Concept is based on the fact that most of the variants in cancer causing genes are enriched at few specific loci (aka hot-spots). This method takes advantage of such positions to identify cancer genes.

Following that to the OncodriveCLUST website, a couple things caught my attention -

  • There's now a new version called OncodriveCLUSTL available via pip (publication, bitbucket)
  • The method does not assume that the baseline mutation probability is homogeneous across all gene positions but it creates a background model using silent mutations. Coding silent mutations are supposed to be under no positive selection and may reflect the baseline clustering of somatic mutations. Given recent evidences of non-random mutation processes along the genome, the assumption of homogenous mutation probabilities is likely an oversimplication introducing bias in the detection of meaningful events.
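As a toy illustration of the idea in the second point (contrasting how concentrated non-silent mutations are against a silent-mutation baseline), here is a small sketch. It is not the OncodriveCLUST or OncodriveCLUSTL scoring; the statistic, the function names, and the positions are all invented for illustration.

```python
from collections import Counter


def top_position_fraction(positions):
    """Fraction of mutations that fall on the single most mutated position."""
    if not positions:
        return 0.0
    counts = Counter(positions)
    return max(counts.values()) / len(positions)


def concentration_excess(nonsilent_positions, silent_positions):
    """Concentration of non-silent mutations minus the silent baseline."""
    return (top_position_fraction(nonsilent_positions)
            - top_position_fraction(silent_positions))


# Made-up amino acid positions for one hypothetical gene.
nonsilent = [28, 28, 28, 28, 35, 112]
silent = [10, 47, 88, 120]

excess = concentration_excess(nonsilent, silent)
print(round(excess, 3))
```

A value well above zero would suggest that non-silent mutations pile up at a hot-spot more than the silent background does; a real method would also need a significance model, which this sketch leaves out.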

@jharenza
Collaborator

jharenza commented Feb 17, 2020

I came across DriverPower with the PCAWG pan-cancer paper releases: https://www.nature.com/articles/s41467-019-13929-1#code-availability (code). Using it would entail a liftover from hg38 to hg19, though.

@kgaonkar6
Collaborator

kgaonkar6 commented Apr 1, 2020

I was reading through this PCAWG paper https://www.nature.com/articles/s41586-020-1965-x.pdf and found their methods for identifying driver mutations, which might be useful:

Candidate-driver-mutation identification methods and combination of results

We obtained results (P values) from 13 methods of driver discovery, including ActiveDriverWGS [54], CompositeDriver, DriverPower [55], dndscv [46], ExInAtor [56], LARVA [57], MutSig tools [3], NBR [10], ncdDetect [58], ncDriver [59], OncodriveFML [60] and regDriver [61]. We integrated the results of all these methods using a custom framework based on a previously published method [62] for combining P values. Results from individual methods that showed large deviations from the expected uniform null distribution of P values were excluded. This approach was evaluated on real and simulated data.

Code availability

P value combination from multiple driver methods is available from
https://github.com/broadinstitute/getzlab-PCAWG-pvalue_combination/
