Variable gene copy number in cancer-related pathways is associated with cancer prevalence across mammals

Repository information

This repository contains the scripts for the analyses carried out in the paper "Variable gene copy number in cancer-related pathways is associated with cancer prevalence across mammals".

Workflow

1. Data Collection
2. Identification of Orthogroups
3. PGLS (individual genes)
4. Aggregate PGLS (gene sets)
5. Randomisation test
6. GSEA
7. ORA

Conda environment: environment.yml

1. Data Collection

This study uses 105 species, of which 54 have available proteomes. These proteomes were downloaded from NCBI using ncbi-genome-download, in script download_proteome.sh.

For the remaining species, proteomes were created from annotations using gfftk (v23.11.2), in script gfftk_proteome.sh.

Only the longest transcript for each gene was kept, using the OrthoFinder script primary_transcript.py.
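As an illustration of the longest-transcript filter, here is a minimal Python sketch. It is not the OrthoFinder script itself: the helper names `parse_fasta` and `longest_per_gene` are hypothetical, and the header layout `gene|transcript` used in the usage example is an assumption made for the example only.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_per_gene(records, gene_of):
    """Keep the longest sequence for each gene.

    records: iterable of (header, sequence) pairs.
    gene_of: function mapping a header to a gene identifier
             (hypothetical; depends on how the headers are laid out).
    Ties keep the first record seen.
    """
    best = {}
    for header, seq in records:
        gene = gene_of(header)
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)
    return best


if __name__ == "__main__":
    # Assumed header layout for this example: ">gene|transcript"
    fasta = ">g1|t1\nMKV\n>g1|t2\nMKVLL\n>g2|t1\nMA\n"
    kept = longest_per_gene(parse_fasta(fasta), lambda h: h.split("|")[0])
    for gene, (header, seq) in kept.items():
        print(gene, header, len(seq))
```

Passing the header-to-gene mapping as a function keeps the sketch independent of any particular annotation source.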

BUSCO was run to assess the quality of the proteomes, in script busco_proteome.sh.

2. Identification of Orthogroups

Gene orthogroups were inferred using OrthoFinder (v2.5.5) to estimate the gene copy number of all protein-coding genes, in script orthofinder.sh.
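To show how copy number follows from orthogroup membership, the sketch below counts genes per species in a table laid out like OrthoFinder's Orthogroups.tsv (one row per orthogroup, one column per species, gene IDs separated by commas). The function name `copy_numbers` is hypothetical, and this is only an illustration; OrthoFinder also writes these counts itself in Orthogroups.GeneCount.tsv.

```python
import csv
import io


def copy_numbers(tsv_text):
    """Return {orthogroup: {species: gene_count}} from an Orthogroups-style table.

    Assumes a tab-separated table whose first column is named "Orthogroup"
    and whose remaining columns hold comma-separated gene IDs per species.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = {}
    for row in reader:
        orthogroup = row.pop("Orthogroup")
        counts[orthogroup] = {
            species: len([g for g in (cell or "").split(",") if g.strip()])
            for species, cell in row.items()
        }
    return counts


if __name__ == "__main__":
    table = (
        "Orthogroup\tsp1\tsp2\n"
        "OG0000000\tg1, g2\tg3\n"
        "OG0000001\t\tg4, g5, g6\n"
    )
    for og, per_species in copy_numbers(table).items():
        print(og, per_species)
```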

Species with low BUSCO scores were removed from analysis in orthofinder_remove_species.sh.
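A filter of this kind can be sketched as follows. The completeness value is read from the one-line BUSCO summary (the `C:…%` field); the function names and the cut-off passed to `species_to_keep` are hypothetical, since the threshold used in the study is not stated here.

```python
import re

# Matches the "complete" percentage in a BUSCO one-line summary,
# e.g. "C:95.2%[S:94.0%,D:1.2%],F:2.0%,M:2.8%,n:9226"
_COMPLETE = re.compile(r"C:(\d+(?:\.\d+)?)%")


def busco_completeness(summary_line):
    """Extract the percentage of complete BUSCOs from a summary line."""
    match = _COMPLETE.search(summary_line)
    if match is None:
        raise ValueError(f"no completeness field in: {summary_line!r}")
    return float(match.group(1))


def species_to_keep(summaries, min_complete):
    """Return the sorted species whose completeness is >= min_complete.

    summaries: {species: summary_line}
    min_complete: cut-off in percent (an assumption, chosen by the user).
    """
    return sorted(
        species
        for species, line in summaries.items()
        if busco_completeness(line) >= min_complete
    )


if __name__ == "__main__":
    summaries = {
        "species_a": "C:95.2%[S:94.0%,D:1.2%],F:2.0%,M:2.8%,n:9226",
        "species_b": "C:61.0%[S:60.0%,D:1.0%],F:10.0%,M:29.0%,n:9226",
    }
    print(species_to_keep(summaries, min_complete=80.0))
```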

The genes corresponding to each orthogroup were identified using house mouse (Mus musculus) gene annotations, in script mouse_orthogroup_genes.sh.

3. PGLS (individual genes)

Phylogenetic generalised least squares (PGLS) was used to test for associations between gene copy number and life history traits. Scripts are in the pgls_indv folder.

4. Aggregate PGLS (gene sets)

PGLS was used to test for associations between the aggregate copy number of genes within gene sets and life history traits. Scripts are in the pgls_sets folder.
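One simple reading of "aggregate copy number" is the per-species sum of copy numbers over the genes in a set. The sketch below assumes that definition; the scripts in pgls_sets may aggregate differently, and the function name is hypothetical.

```python
def aggregate_copy_number(copy_number, gene_set):
    """Sum copy numbers over a gene set, per species.

    copy_number: {gene: {species: count}}
    gene_set: iterable of gene names; genes absent from copy_number are skipped.
    Summation is an assumption about how the aggregate is defined.
    """
    totals = {}
    for gene in gene_set:
        for species, count in copy_number.get(gene, {}).items():
            totals[species] = totals.get(species, 0) + count
    return totals


if __name__ == "__main__":
    copy_number = {
        "Trp53": {"mouse": 1, "elephant": 20},
        "Brca1": {"mouse": 1, "elephant": 1},
        "Myc": {"mouse": 1, "elephant": 2},
    }
    print(aggregate_copy_number(copy_number, ["Trp53", "Brca1"]))
```

The resulting per-species totals would then take the place of a single gene's copy number as the predictor in the PGLS model.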

5. Randomisation test

Given a gene set of interest, 1000 replicate sets are produced by randomly selecting genes with a variance similar to that of the genes in the original set. PGLS is then carried out on the replicate sets and the resulting test statistics are compared with the original results, in script pgls_sets_simulation.R.
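The sampling step can be sketched in Python as below. This is not a port of pgls_sets_simulation.R: "similar variance" is interpreted here as "one of the `k` background genes closest in variance", which is an assumption, as are the function names and the two-sided empirical p-value with a +1 correction.

```python
import random
from statistics import pvariance


def variance_matched_replicates(copy_number, gene_set, n_reps=1000, k=5, seed=0):
    """Build replicate gene sets matched on copy-number variance.

    copy_number: {gene: [count per species]}
    gene_set: genes of interest (all must be keys of copy_number).
    For each gene in the set, one gene is drawn at random from the k
    background genes whose variance is closest to it (assumed matching rule).
    The background must contain at least len(gene_set) genes.
    """
    rng = random.Random(seed)
    variance = {gene: pvariance(counts) for gene, counts in copy_number.items()}
    background = [gene for gene in variance if gene not in gene_set]
    replicates = []
    for _ in range(n_reps):
        chosen = []
        for gene in gene_set:
            # Rank unused background genes by distance in variance.
            pool = sorted(
                (b for b in background if b not in chosen),
                key=lambda b: abs(variance[b] - variance[gene]),
            )[:k]
            chosen.append(rng.choice(pool))
        replicates.append(chosen)
    return replicates


def empirical_p(observed, null_stats):
    """Two-sided empirical p-value of an observed statistic against replicates."""
    hits = sum(1 for s in null_stats if abs(s) >= abs(observed))
    return (hits + 1) / (len(null_stats) + 1)


if __name__ == "__main__":
    copy_number = {f"g{i}": [i, 0, 2 * i] for i in range(20)}
    reps = variance_matched_replicates(copy_number, ["g1", "g2"], n_reps=3, k=3)
    print(reps)
    print(empirical_p(3.0, [1.0, -4.0, 2.0]))
```

Each replicate would be run through the same PGLS model as the original set, and the collected test statistics passed to `empirical_p` together with the observed one.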

6. GSEA

Input: results from the PGLS testing for association between individual gene copy number and life history traits. Genes are separated into two groups depending on whether their association with the trait is positive or negative, and are ranked by p-value.
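A sketch of preparing the two ranked lists is given below. The input layout (gene, coefficient, p-value) and the use of -log10(p) as the ranking score are assumptions for illustration; the source only states that genes are split by direction and ranked by p-value.

```python
import math


def split_and_rank(results):
    """Split PGLS results by direction of association and rank by p-value.

    results: iterable of (gene, coefficient, p_value) tuples (assumed layout).
    Returns (positive, negative), each a list of (gene, score) pairs ordered
    from smallest to largest p-value. The score -log10(p) is an assumption;
    p-values must be greater than zero.
    """
    results = list(results)

    def ranked(rows):
        ordered = sorted(rows, key=lambda r: r[2])
        return [(gene, -math.log10(p)) for gene, _, p in ordered]

    positive = ranked(r for r in results if r[1] > 0)
    negative = ranked(r for r in results if r[1] < 0)
    return positive, negative


if __name__ == "__main__":
    pgls = [
        ("GeneA", 0.5, 0.01),
        ("GeneB", -0.2, 0.001),
        ("GeneC", 1.0, 0.0001),
        ("GeneD", -0.1, 0.5),
    ]
    positive, negative = split_and_rank(pgls)
    # Two-column, tab-separated lines (gene, score), one list per direction
    for gene, score in positive:
        print(f"{gene}\t{score:.3f}")
```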

GSEA was run locally using the desktop application.

Parameters:

Gene sets database: set to the relevant mouse gene set database (e.g. mh, m2cp or m5)
Number of permutations: 1000 (default)
Ranked List: ranked gene list from PGLS p-values
Collapse/Remap to gene symbols: No_Collapse
Chip platform: NA
Enrichment statistic: Classic
Max size: 500 (default)
Min size: 5

7. ORA

Over-representation analysis (ORA) is carried out using the R package clusterProfiler, in script ORA.qmd.
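For readers unfamiliar with ORA, the underlying test is a one-sided hypergeometric test on the overlap between a gene list and a gene set. The Python sketch below shows that calculation only; it is not the clusterProfiler code, the function and parameter names are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def ora_p_value(n_universe, n_set, n_hits, n_overlap):
    """One-sided hypergeometric p-value for over-representation.

    n_universe: number of genes in the background
    n_set:      number of background genes annotated to the gene set
    n_hits:     number of genes in the list of interest
    n_overlap:  number of genes of interest that fall in the gene set
    Returns P(X >= n_overlap) when n_hits genes are drawn without replacement.
    """
    upper = min(n_set, n_hits)
    total = comb(n_universe, n_hits)
    tail = sum(
        comb(n_set, k) * comb(n_universe - n_set, n_hits - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # 10 background genes, 4 in the set, 3 genes of interest, 2 of them in the set
    print(ora_p_value(n_universe=10, n_set=4, n_hits=3, n_overlap=2))
```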

Contact information

If you have any questions about this code please reach out to me: [email protected]

sophie-03/gene_cn_lifehistory
