From a given taxon sampling and a given seed species, compute a standard sequence-based phylome and a novel structure-based phylome.
The first thing is installing Snakemake, which you can easily do with this conda command:

```bash
conda create -c conda-forge -c bioconda -n snakemake snakemake hdf5 snakefmt snakedeploy
```
From then on, I'd recommend managing further dependencies with conda via the `--sdm conda` flag in Snakemake. Otherwise, you can find in `workflow/envs/` different YAML files to recreate what is needed at each step.
The only dependency not automatically managed in the pipeline is gsutil: follow these instructions for gsutil installation.
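If the linked instructions don't suit your setup, one common alternative (an assumption on my part; check Google's current documentation) is the standalone pip package:

```bash
# standalone gsutil client; alternatively, install the full Google Cloud SDK
pip install gsutil
```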
First of all you need to download the data. This requires an input file with 3 tab-separated columns (UniProt ID, TaxID, mnemonic); see `data/input/Test_set.txt`.
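For illustration (this row is my own example, not taken from the repo), an entry for the E. coli K-12 reference proteome might look like:

```
UP000000625	83333	ECOLI
```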
You also need a config file where you can specify various parameters; see `config/test.yaml` for an example (the analyses in the paper were done using `config/Hsapopi_test.yaml`):
```yaml
# Input file
dataset: 'example'
taxids: 'data/input/Test_set.txt'
species_tree: 'data/sptrees/Test.nwk'
seed: ['UP000000625']
root: 'ecoli'
# this is the number of seed genes to run the pipeline
test_seeds: 15
n_examples: 3
```
Importantly, global parameters that are likely to be shared across different datasets live in `config/params.yaml`. Note that values in the dataset-specific YAML take priority over the ones in this file!
Once you have these you can simply run:
```bash
snakemake -s workflow/download_data.smk --configfile config/test.yaml -p -j2 --sdm conda
```
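To preview what would be executed without running anything, Snakemake's standard dry-run flag works as usual:

```bash
# dry run: print the jobs that would be executed, then exit
snakemake -s workflow/download_data.smk --configfile config/test.yaml -n
```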
This first pipeline is necessary to get all input files. From the input table, all the PDBs are downloaded from Google, and only those entries with mean model quality above the `params["low_confidence"]` value are kept; these are moved into the `high_cif` folder of each proteome. You can change the directory where all these data are stored with the `params["data_dir"]` parameter, but I would just use the default one.
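A minimal sketch of what the relevant `config/params.yaml` entries might look like (the keys come from the text above; the values are illustrative assumptions, check the file shipped with the repo):

```yaml
# global parameters shared across datasets (illustrative values only)
low_confidence: 40    # minimum mean model quality to keep a structure
data_dir: 'data/'     # where downloaded structures and metadata are stored
```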
To run the actual pipeline you just need the species tree and the UniProt table from the previous step as input. Once you have these, you can decide which target sets and methodological implementations you'd like to explore:
```yaml
combinations: ["3Di_3Di", "aa_FM_", "aa_LG", "3Di_GTR", "3Di_FT", "3Di_LLM", "3Di_AF"]
#, "3Di_FTPY"]
# Subset of the first one that requires ML
combinations_ML: ["3Di_3Di", "aa_LG", "3Di_GTR", "3Di_LLM", "3Di_AF"]
```
Another important parameter is `seed`. You can specify one seed species or a list of several (although the latter is more or less untested), and the pipeline will create one output per seed species, as in the sketch below.
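For illustration (the second proteome ID, the human reference proteome, is my own example and not from the repo):

```yaml
seed: ['UP000000625', 'UP000005640']   # one phylome output per seed species
```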
To run the pipeline, simply launch this command and monitor that everything is running fine:

```bash
snakemake --configfile config/test.yaml -p -j2 -k --sdm conda
```
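Here `-p` prints the shell commands, `-j2` allows two parallel jobs, and `-k` keeps going with independent jobs when one fails. If a run gets interrupted, a standard Snakemake restart (generic usage, not repo-specific) is:

```bash
# re-run, forcing jobs that were left half-finished to be redone
snakemake --configfile config/test.yaml -p -j2 -k --sdm conda --rerun-incomplete
```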
- `results/{dataset}/trees/{seed}_unrooted_trees.txt`: has 5 columns (gene ID, target set, alphabet, model and tree text). This file is easily parsable in R or Python for further analyses (see the Python sketch after this list); you can also find some R scripts for this in `workflow/scripts/`.
- `results/{dataset}/plots/{seed}*pdf`: here you will find various plots that should inform you about the homology search and tree reconstruction quality.
- `results/data`: here you will find very useful files for specific downstream analyses, including the FASTAs and, per protein, a CIF file, a prediction file and the PAE file. You will also find a `gffs` folder where UniProt GFFs are stored and a `cath` folder where each proteome has a file that connects protein IDs to Pfam and Gene3D entries. In the `ids` folder you can also find various useful phylogenomic links in case you would like to benchmark against homology sets inferred by OMA, EggNOG or PhylomeDB (among others).
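A minimal Python sketch for loading the trees table (I'm assuming a tab-separated file with no header row; the path shown is the one the test config would produce):

```python
import pandas as pd

# 5 columns as described above: gene ID, target set, alphabet, model, tree text
cols = ["gene_id", "target_set", "alphabet", "model", "newick"]
trees = pd.read_csv(
    "results/example/trees/UP000000625_unrooted_trees.txt",
    sep="\t", names=cols,
)

# e.g. count how many trees were reconstructed per alphabet/model combination
print(trees.groupby(["alphabet", "model"]).size())
```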
- Run OMA benchmark at eukaryotic level and compute TCS and our metrics: https://github.com/DessimozLab/fold_tree?tab=readme-ov-file#benchmarking-experiments. YOU HAVE A PROPER CONDA ENVIRONMENT NAMED foldtree TO DO THIS! (Ran on my workstation, as you need internet)
- Correlation between RF and average pLDDT of a tree
- Why DL is different in union and foldseek
- Implement a notebook instead of scripts for plotting, able to take both EggNOG-like and PhylomeDB input
- Check that everything is alright in the two test seeds
- Do a proper EggNOG and OMA benchmark
- Agree on final set of models
- Check overlap of BLAST and Foldseek sets with various orthologous sets
- Better representation of CATH and Pfam analysis
- Switch from quicktree to FastME
- 3Di + aa: finish rules for partition and add the pipeline to the schema -> combination = {comb_part}
- Use a list of seeds as input
- Replace X with - when masking characters
- File structure is a bit complicated and there is some redundancy, for example rerunning BLAST every time, or the metadata: done